5 Lessons from the Healthcare.gov Software Disaster of the Century

Healthcare.gov (Photo credit: jurno)

The most visible software disaster I’ve ever seen is Healthcare.gov, whose problems have received unprecedented media coverage and raised concerns both technical and political.

There is plenty of coverage of the saga, so I won’t repost or comment on everything that is happening.  Instead, I’ll try to extract some lessons that can be applied to software projects in general.

1. Too Many Cooks

The healthcare.gov website was built by 55 different vendors and was overseen by the federal government itself.  According to congressional testimony, no single vendor was responsible for the overall delivery and quality of the application.  None of the people building the various components and features were accountable for overall quality.

The complexity of even simple software systems grows not only with the number of lines of code, but also with the number of individual players involved.  The lesson here is that if everyone is in charge, NO ONE is in charge.  When considering outsourcing a project, make sure you know whose “throat you can choke” if there are delivery and risk problems, because otherwise you will be left holding the bag.

2. Big Bang Completion

What is referred to as “integration cost/risk” has to be paid at some point in the software project lifecycle.  As you bring components together from individual developers or teams, they won’t quite fit.  There will be special cases that aren’t handled properly, API calls that drifted from spec to implementation, and unexpected resource contention that will lead to performance problems. The trick is to pay that integration cost incrementally throughout the project.  Ideally your teams will lock down the integration interfaces early, and then complete implementation in a continuous cycle.  It’s also a great idea to prioritize a couple of end-to-end scenarios and complete them first so that you can see all the pieces working together as early as possible.

If you wait until the end, you will likely spend more time correcting code because it is built upon poor assumptions.  Think of it this way: it’s cheaper to move a wall on a blueprint than once the drywall & electrical are in place.
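To make the idea concrete, here is a minimal sketch in Python of locking an interface early and running shared contract checks against every implementation as it lands.  The `EligibilityService` interface and the stub are invented purely for illustration.

```python
# A minimal interface-contract sketch. The EligibilityService interface
# and StubEligibilityService are hypothetical, for illustration only.
from abc import ABC, abstractmethod


class EligibilityService(ABC):
    """Interface agreed on by all teams at the start of the project."""

    @abstractmethod
    def is_eligible(self, age: int, income: float) -> bool:
        ...


class StubEligibilityService(EligibilityService):
    """Throwaway stub so dependent teams can integrate on day one."""

    def is_eligible(self, age: int, income: float) -> bool:
        return age >= 18 and income < 50_000


def contract_checks(service: EligibilityService) -> None:
    """Shared checks that every implementation must pass, run in CI
    against each team's component as it lands."""
    assert service.is_eligible(30, 20_000) is True
    assert service.is_eligible(10, 20_000) is False


contract_checks(StubEligibilityService())
print("contract checks passed")
```

The point isn’t the stub itself; it’s that every team can integrate against the agreed interface from day one, and the shared checks catch spec drift continuously instead of at the end.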

3. Testing? What Testing?

It appears that on this project (as on so many others) end-to-end testing was left until the end and the results were ignored.  The industry best practice (although rarely followed) is to test first.  In fact, Test-Driven Development says that you should define and write automated tests that express what you expect first, and then code away at the product until the tests pass.  It’s easy to underestimate the cost (and the value!) of building a comprehensive automated test suite.  Don’t make the same mistake.  Think of your specification in terms of tests that pass or fail, and you will be able to measure the progress and quality of your application throughout the project.
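Here is a tiny sketch of that test-first loop in Python.  The `premium_subsidy` function and its thresholds are invented purely for illustration; in practice the tests would be written first and the function coded until they pass.

```python
# A minimal TDD sketch: the tests express the spec; the function is
# implemented until they pass. The subsidy rules here are hypothetical.
import unittest


def premium_subsidy(income: float, poverty_line: float) -> float:
    """Return a subsidy fraction: full below the poverty line,
    tapering linearly to zero at four times the poverty line."""
    ratio = income / poverty_line
    if ratio <= 1.0:
        return 1.0
    if ratio >= 4.0:
        return 0.0
    return (4.0 - ratio) / 3.0


class PremiumSubsidyTests(unittest.TestCase):
    # In the TDD loop these tests exist (and fail) before the
    # implementation above is written.
    def test_full_subsidy_below_poverty_line(self):
        self.assertEqual(premium_subsidy(10_000, 12_000), 1.0)

    def test_no_subsidy_at_four_times_poverty_line(self):
        self.assertEqual(premium_subsidy(48_000, 12_000), 0.0)

    def test_partial_subsidy_in_between(self):
        self.assertAlmostEqual(premium_subsidy(30_000, 12_000), 0.5)


suite = unittest.TestLoader().loadTestsFromTestCase(PremiumSubsidyTests)
unittest.TextTestRunner().run(suite)
```

A passing suite like this doubles as a progress report: the fraction of spec-level tests that pass is a measurable, honest status number.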

4. Lack of Trust – Raising Concerns

An interesting line of questioning from the healthcare.gov hearings related to whether any of the teams raised concerns.  It’s telling that in a multi-million-dollar government project with dozens of teams, presumably lots of red tape, and no clear leader, no one sounded the alarm.  Don’t think that your software team is immune from this!  Under pressure to ship software for a customer or a looming launch date, teams may get weary and become short-sighted in their thinking.  “We’ll fix it in the service pack” is an easy excuse.  As a software team leader, if you want to know what’s really going on, you need to make bad news not only OK, but actively seek it out.  Make sure that raising risks and concerns is rewarded.  Don’t play the blame game, but rather look at how the team can make data-driven project management decisions to adjust scope, timelines, and resources for a successful launch.

5. Handling a Crisis

Be honest.  Be clear about the technical difficulties. Tell people what you’re doing about it.  Vagueness does not build confidence. Your boss or customer may not understand all the technical details, but they will recognize that building, scaling, and running technical systems is complicated. Everyone will want an ETA for a fix.  This is the hardest question to handle.  The best way I’ve seen it handled is to provide a schedule for when you will give updates on progress, and stick to that religiously.  Providing stats on the number of issues currently being worked, and what stage each is in, also builds confidence as those numbers trend in the right direction.  There are many other nuances to handling the PR when you find yourself in this situation, but the core principles are simple.
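The stats idea can be sketched in a few lines; the issue list and stage names below are hypothetical, standing in for whatever your tracker exports.

```python
# A minimal sketch of "stats by stage": snapshot the open issues and
# report counts per stage at every scheduled update. The issue data
# and stage names are invented for illustration.
from collections import Counter

open_issues = [
    {"id": 101, "stage": "triage"},
    {"id": 102, "stage": "in_progress"},
    {"id": 103, "stage": "in_progress"},
    {"id": 104, "stage": "verifying"},
]


def stage_summary(issues):
    """Count open issues per stage; publish this with each status update
    so stakeholders can watch the trend, not just the headline number."""
    return Counter(issue["stage"] for issue in issues)


print(stage_summary(open_issues))
```

Comparing each update’s summary to the last one is what builds confidence: the same four numbers, trending the right way, on a schedule you keep.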

Bonus Lesson – Too Little, Too Late

Adding more people to a huge project is probably not the right solution.  If you set your team up with clear ownership, trust to raise issues, and iterative integration, you probably just need more time.  If you need more expertise, bring in the best talent you can, give them free rein with access to the code, people, and documentation you have, and let them work their magic.  You also want to tightly triage the problems you fix to avoid the unintended consequences of additional “churn” in the product code.  Also look at which problems can be temporarily handled with manual reports, processes, or support teams.  Depending on the scope of your project, this may be better for your business in the long run as you build confidence with customers and users.

Conclusion

Building highly scalable, easy-to-use, helpful software is never easy.  You can increase your chances of success by ensuring clear ownership, integrating & testing early, and building trust in your team. And if you find yourself in crisis mode, you have to stand up in your own “Rose Garden” and explain clearly what happened and what you’re doing about it. Don’t expect throwing more people & money at the problem to fix it, but sometimes external high-quality help is the only thing that can bring your project back on track. Most importantly, stay positive and focus on making progress on the most important problems.

Take a deep breath and remember that at least you aren’t working on the software disaster of the century!


Solve Human Error Disclosures

ABLOY keys (Photo credit: Wikipedia)

It doesn’t matter how good your technology systems are if you trust people to follow certain steps to keep data secure, as a prison in England learned the hard way.

The best part of this story is that they “were reminded how to handle personal and sensitive information of patients and employees.”  Unfortunately, reminding people simply doesn’t work if you want to make a real change.

So what should they have done?

First of all, question the need for USB sticks in the first place.  Why can’t the data be stored securely in the cloud and transferred over an encrypted channel?

And the data at rest on the USB keys could be encrypted with public/private keys.  If the USB keys are lost, they would be of no use to anyone who found them.

When you run an organization of any size that requires the protection & care of any personal data, you have to assume that people will mess up.  Empower them to do the right thing, give them the right tools, and make sure you have failsafe systems that prevent risky & costly disclosures.


Proactive Server Monitoring Pitfalls

Svet Stefanov

The following is a guest post by Svet Stefanov.  He is a freelance writer and a contributing author to WebSitePulse’s blog. Svet is currently exploring the topic of proactive monitoring, and how it can help small and big businesses steer clear of shallow waters.

The first step in fixing a problem is admitting you have one.  Fixing problems in software systems is no different. Server issues are not a thing of the past, and as the Cloud continues to grow and mesh further into our lives, these problems are not going away anytime soon. Monitoring different types of servers, SaaS, or even a single website is not only a good practice, but a long-term investment. If you have the time, I would like to share my thoughts on why it is better to be prepared for the worst rather than blissfully avoiding it.

I’ve heard people say that monitoring is the first line of defense. In my opinion, the first line of defense is a great recovery procedure and embedded redundancy. Adequate monitoring systems have two main functions – to detect a problem and to alert concerned parties. More advanced systems with extensions & complex configuration can also take action without human intervention based on a predefined set of rules.  But even a few small investments in monitoring can yield great improvements in reliability.
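Those two functions, detect and alert, can be sketched in a few lines of Python.  The probe and alert channel are injected as callables, and the retry count is an arbitrary choice, not a recommendation from any particular monitoring product.

```python
# A minimal detect-and-alert sketch. The probe (e.g. an HTTP health
# check) and the alert channel (e.g. email, pager) are passed in, so
# this stays independent of any specific monitoring tool.
import time
from typing import Callable


def run_check(probe: Callable[[], bool],
              alert: Callable[[str], None],
              attempts: int = 3,
              delay: float = 0.0) -> bool:
    """Detect: retry the probe a few times to avoid alerting on a blip.
    Alert: notify concerned parties only once all attempts fail."""
    for _ in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # a probe error counts the same as a failed check
        time.sleep(delay)
    alert(f"check failed after {attempts} attempts")
    return False


# Usage with a fake probe standing in for a real health check:
notifications = []
run_check(lambda: False, notifications.append)
print(notifications)
```

A "more advanced" rules engine, as mentioned above, would replace the `alert` callable with something that can also take corrective action, but the detect/alert split stays the same.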

Internal or external monitoring? Continue reading


Software You Hope You Never Use

Photo by miguelb

I was reminded recently of a project I worked on several years ago for a local university.  This software was a simple system to be used in emergency situations to help account for students who may have been affected and get them connected to medical services or parents as necessary.

Usually when I write software, I get excited by the idea of people using it and being able to enjoy it or at least have it solve a problem for them.  In this case, you hope that no one will ever have to use the software. 

As I think about preparedness for an emergency and about creating software to be used only in extreme and potentially catastrophic circumstances, some unique design challenges emerge.

Prediction

Continue reading


What You Wish You Knew During a Crisis…

From my guest post at ContinuityInsights.com

During a crisis, there is almost by definition a shortage of accessible information. Because of the time pressure a disaster creates, anything considered noise gets filtered out and ignored. However, if you could create a plan to track the right information and make it available during difficult times, it could mean the difference between tragedy and a close call.

Continue reading (at ContinuityInsights.com)…


Thoughts on Windows Azure Leap Day Downtime

I’d be remiss not to mention the Windows Azure downtime on Leap Day.  Because of my employment at Microsoft, I won’t speculate or say too much about the situation.  I have said before that cloud computing does not completely alleviate the risks of downtime.

I would like to reiterate that there are always inherent risks in building and running software, and failure is to be expected, not avoided.  The best-designed systems are designed for failure and can handle these cases with grace.  This particular event with Windows Azure further highlights the need to design applications that sit on top of any infrastructure (traditional, cloud, or hybrid) in such a way that they can keep working when (not if) a major portion of the infrastructure fails.

Don’t be fooled into thinking that any cloud service provides a silver bullet for resiliency.  Outsourcing your IT infrastructure to a cloud provider greatly improves your resiliency for the cost you pay; most of us cannot afford to build & maintain a fault-tolerant, world-wide infrastructure.   And when a failure does occur, don’t overlook the economies of scale that benefit the application tenants the rest of the time, when things are working properly.
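One way to build that kind of grace into an application is a fallback wrapper around each external dependency.  The sketch below uses invented fetch functions for illustration; the pattern is the point, not the names.

```python
# A minimal graceful-degradation sketch: when a dependency fails,
# serve a degraded-but-useful answer instead of failing the request.
# The fetch functions below are hypothetical.
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Try the primary dependency; on any failure, degrade to the
    fallback (a cache, a static default, a reduced feature set)."""
    try:
        return primary()
    except Exception:
        return fallback()


def fetch_live_plans():
    raise ConnectionError("pricing service is down")


def cached_plans():
    return ["bronze", "silver", "gold"]  # stale but usable


print(with_fallback(fetch_live_plans, cached_plans))
```

Real systems layer retries, timeouts, and circuit breakers on top of this, but the core idea is the same: every call to infrastructure you don’t control should have a planned answer for "what do we show when this is down?"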


Through the Storm – Interview with Arterian IT Founder Jamison West

Jamison West

"Having a comprehensive plan that's bigger than just IT is key, but often IT can be the forcing function to get you started."

I recently had a chance to interview Jamison West of Arterian. Jamison, who founded the company that is now Arterian in 1995, envisions a future where every small to mid-sized company will have an IT partner as a vital part of its core operations team, keeping it free from disaster and flourishing.

SoftwareDisastersBlog: How do you help your customers prevent and prepare for IT disasters?

Jamison West: We see with our customers that reliance on connectivity is higher than it’s ever been for businesses to execute and support their customers. People now expect email to work like instant messaging, sent and received as fast as they type it.  We try to prevent IT issues  by adding redundancy to make sure that if there are problems — natural disasters or bad weather like we had recently in Seattle  — our customers are still up and running at least for critical operations.

Continue reading
