I recently found this slideshare about how to run a post-mortem here: Post Mortems for Humans

I am personally a firm believer that if you want a reliable, problem-resistant system, you have to incrementally improve by asking questions and systematically eradicating problems as they arise. Designing a system in a vacuum to be perfect simply does not work in practice because the subtly and variety of environmental issues are unknown at design time. Really, you have to design in a facility for not having a design for every problem.

The slideshow also talks about humor and using past mistakes to diffuse the tension of a post-mortem. This resonates with me as well, as I try to use humor to break through barriers all the time (probably not apparent from the writing on this blog however…).

Labeling a root cause as “human error” is basically worthless. This is well articulated in the slideshow. In fact, using root causes at all is a fool’s errand. I recently have had this debate with co-workers as we tried to decide which of the several reasons something broke was the One True Root Cause. Who cares? The more important issue is how do we prevent the broadest range of possible future problems. Too many times I have seen people get down to a single root cause and then fix that one exact problem just to have a similar (but new!) issue pop up the next day.

What do you think?

Link | Posted on by | Tagged , , , , , , | Leave a comment

5 Lessons from Healthcare.gov Software Disaster of the Century

Healthcare.gov

Healthcare.gov (Photo credit: jurno)

The most visible software disaster I’ve ever seen is the Healthcare.gov problems that have received unprecedented media coverage and raised concerns both technical and political.

There is plenty of coverage of the saga so I won’t bother to repost or comment on all that is happening.  But instead will attempt to extract some lessons that can be applied to software projects in general.

1. Too Many Cooks

Continue reading

Posted in Business Continuity, Disaster Recovery | Tagged , , , , , , | 2 Comments

Solve Human Error Disclosures

English: ABLOY keys Русский: Ключи дискового з...

English: ABLOY keys Русский: Ключи дискового замка ABLOY (Photo credit: Wikipedia)

It doesn’t matter how good your technology systems are if you trust people to follow certain steps to keep data secure as a prison in England learned the hard way.

The best part of this story is that they “were reminded how to handle personal and sensitive information of patients and employees.”  Unfortunately reminding people simply doesn’t work if you want to really make a change.

So what should they have done?

First of all, question the need for USB sticks in the first place.  Why can’t the data be stored securely in the cloud and transferred on an encrypted channel?

And the data-at-rest on the USB keys could be encrypted with public/private keys.  If the USB keys are lost, they would be of no use to anyone who found them.

When you run an organization of any size that requires the protection & care of any personal data, you have to assume that people will mess up.  Empower them to do the right thing, give them the right tools, and make sure you have failsafe systems that prevent risky & costly disclosures.

Posted in Cloud, Security, Technology | Tagged , , , , , , , , , , | Leave a comment

Proactive Server Monitoring Pitfalls

Svet Stefanov

Svet Stefanov

The following is a guest post by Svet Stefanov.  He is a freelance writer and
a contributing author to WebSitePulse’s blog. Svet is currently exploring the topic of proactive
monitoring, and how it can help small and big business steer clear of shallow waters.

The first step in fixing a problem is admitting you have one.  Fixing problems in software systems is no different. Server issues are not a thing of the past and as the Cloud continues to grow and further meshes into our lives, these problems are not going away anytime soon. Monitoring different types of servers, SaaS, or even a single website is not only a good practice, but a long-term investment. If you have the time, I would like share my thoughts on why it is better to be prepared for the worst rather than blissfully avoiding it.

I’ve heard people say that monitoring is the first line of defense. In my opinion, the first line of defense is a great recovery procedure and embedded redundancy. Adequate monitoring systems have two main functions – to detect a problem and to alert concerned parties. More advanced systems with extensions & complex configuration can also take action without human intervention based on a predefined set of rules.  But even a few small investments in monitoring can yield great improvements in reliability.

Internal or external monitoring? Continue reading

Posted in Business Continuity, Cloud, Disaster Recovery, Downtime, Technology, Uptime | Tagged , , , , | Leave a comment

Software You Hope You Never Use

Photo by miguelb

I was reminded recently of a project I worked on several years ago for a local university.  This software was a simple system to be used in emergency situations to help account for students who may have been affected and get them connected to medical services or parents as necessary.

Usually when I write software, I get excited by the idea of people using it and being able to enjoy it or at least have it solve a problem for them.  In this case, you hope that no one will ever have to use the software.

As I think about preparedness for an emergency and creating software to be used only in extreme and potentially catastrophic circumstances, it creates some unique design challenges.

Prediction

Continue reading

Posted in Uncategorized | 1 Comment

What You Wish You Knew During a Crisis…

From my guest post at ContinuityInsights.com

Continuity InsightsDuring a crisis, there is almost by definition a shortage of accessible information. Because of the time pressure a disaster creates, anything considered noise gets filtered out and ignored. However, if you could create a plan to track the right informatoin and make it available during difficult times, it could mean the difference between tragedy and a close call.

Continue reading (at ContinuityInsights.com)…

Posted in Business Continuity, Cloud, Disaster Recovery, Downtime, Technology, Uptime | Tagged , , , , | Leave a comment

Thoughts on Windows Azure Leap Day Downtime

I’d be remiss not to mention the Windows Azure Downtime on Leap Day.  Because of my employment at Microsoft I won’t speculate or say too much on the situation.   I have said before that cloud computing does not completely alleviate the risks of downtime.

I would like to reiterate that there are always inherent risks in building and running software, and failure is to be expected not avoided.  The best designed systems are set up for failure, and can handle these cases with grace.  This particular event with Windows Azure further highlights the need to design applications that sit on top of any infrastructure (traditional, cloud, or hybrid) in such a way that they can work when (not if) a major portion of the infrastructure fails.

Don’t be fooled into thinking that any cloud service provides a silver bullet to resiliency.  Outsourcing your IT infrastructure to a cloud provider greatly improves your resiliency to for the cost you have to pay; most of us cannot afford to build & maintain a fault tolerant world-wide infrastructure.   When a failure does occurs, don’t overlook the economies of scale that benefit the application tenants most of the time when things are working properly.

Posted in Business Continuity, Cloud, Disaster Recovery, Downtime, Technology, Uptime | Tagged , , , , , | Leave a comment