Learning from the Costa Concordia Shipwreck

Costa Concordia

Photo from csmonitor.com

On Friday, January 13th, the Costa Concordia ran into rocks off Italy’s western coast and eventually rolled onto its side in the water. The toll on human life is tragic: several people are dead (and the number is still growing), more are missing, and everyone involved went through a traumatic experience.

The saddest part of the story is that it appears it could have been prevented. And if not prevented, it could at least have been handled better.

As I’ve been following the story and reflecting on it, a few things have jumped out that I think we can learn from. 

Human Error

I don’t know the details of the timeline of events that caused the accident, but I’m guessing there were a few design problems at play. For example, I’d guess people were ignoring early warning systems because they seemed like false alarms. It’s not whether the alarm goes off, but how seriously you take it that matters.

Also, I’d wager that the “designed” social structure on the ship prevented junior staff from questioning senior staff.  This was evidenced in a video I saw where the crew would not release the lifeboats until the captain ordered abandon ship.  And unfortunately it was too late for a smooth exit at that point.

Emergency Readiness

Another issue that keeps coming up is the readiness of the crew to deal with an emergency.  This video shows the descent into chaos.  It definitely raises questions about what is the right way to train a team to deal with bad situations.  How often should this be done?  How would the training or simulation be performed?  Is it really worth it?

There are two levels that I think about with emergency training.  The first is “What information will be needed and what procedures will need to be performed?” This is relatively domain specific, and needs to be thought through carefully.  What knowledge must be kept in the head vs. being readily available at the time of crisis?  And how can you reduce and simplify all these procedures down to the absolute minimum to ensure success? 

The second level of preparedness is more general – “What mindset should an emergency responder be in and how will they perform under pressure?”  If you know all the procedures but choke under pressure, you will not be successful in handling a crisis.  So any training program you have must include a “field” portion where people are asked to experience something that feels like a real crisis – confusing circumstances, time pressure, and dire consequences for non-performance. 

If you can combine the two readiness techniques into a safe, comprehensive, repeatable training program, you can be ready for a disaster.

Prevention

One assumption about how the ship operates is that the captain is in charge and knows what he’s doing. In this case, that assumption turned out to be wrong. How could the design of the system have been improved to avoid it?

Why did the ship go off course in the first place? Were there any systems on board (computerized or human) to notice and raise the issue? We don’t know yet, but hopefully we will learn when the full review is completed.

They may have had computerized mapping systems watching the route and complaining when things changed. Sensors below the ship could also have been watching for rocks and debris and alerting the crew when something was approaching. And the crew might have been aware that something was wrong.
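To make the route-watching idea concrete, here is a minimal sketch of what such a check could look like. It is purely illustrative and says nothing about the systems actually on board: the function names, the flat (x, y) coordinates in meters, and the deviation threshold are all assumptions made for the example.

```python
import math


def distance_to_segment(p, a, b):
    """Shortest distance from point p to the segment a-b; all are (x, y) in meters."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate leg: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy)


def off_route(position, route, max_deviation_m):
    """True when position is farther than max_deviation_m from every leg of the route.

    route is a list of at least two waypoints (hypothetical planned course).
    """
    nearest = min(
        distance_to_segment(position, a, b) for a, b in zip(route, route[1:])
    )
    return nearest > max_deviation_m
```

A real navigation system would work in latitude and longitude and account for currents, depth, and much more; the point is only that "complain when the ship leaves the planned route" is a small, checkable rule.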

Ultimately, whatever detection systems they had either didn’t work or were ignored. Sometimes too many errors make you numb.

Conclusion

We’ve been looking at this in the context of emergency procedures on a cruise ship, but the exact same principles apply to your software systems. If you can think about human error as flawed design, improve your signal-to-noise ratio, and make errors OK, then you can improve the resiliency of your systems.
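On the software side, one rough way to start on signal-to-noise is to look at how often each alert actually led to action. The sketch below assumes you keep a log of (rule name, was it actionable) pairs; the function name, the log format, and the thresholds are all hypothetical and would need tuning for a real system.

```python
from collections import Counter


def noisy_rules(alert_log, min_alerts=5, min_actionable_ratio=0.2):
    """Return alert rules that fire often but rarely lead to action.

    alert_log: iterable of (rule_name, was_actionable) pairs (assumed format).
    Rules with fewer than min_alerts firings are skipped as too rare to judge.
    """
    fired = Counter()
    actionable = Counter()
    for rule, was_actionable in alert_log:
        fired[rule] += 1
        if was_actionable:
            actionable[rule] += 1
    return sorted(
        rule
        for rule, count in fired.items()
        if count >= min_alerts and actionable[rule] / count < min_actionable_ratio
    )
```

Rules that show up in this list are candidates for tuning or removal, so that the alarms that remain are ones people take seriously.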

How has this tragedy made you think differently about disaster preparedness and recovery? How ready are you?

About Kit Merker

Product Manager @ Google - working on Kubernetes / Google Container Engine.
This entry was posted in Business Continuity, Cloud, Disaster Recovery, Downtime, Technology, Uptime.
