The Three Mile Island event and cybersecurity incident response

Managing the deluge of data and alerts in a SOC can be challenging for organizations of any size. The lessons learned from the Three Mile Island nuclear facility can help drive home some best practices for avoiding common pitfalls.

On March 28, 1979, at 4:00 am, reactor two of the Three Mile Island nuclear plant suffered a catastrophic failure. The problem was relatively simple: a stuck valve prevented reactor coolant from returning to the reactor core, causing the core to overheat and ultimately melt down. Around seven hours later, pressurized radioactive steam was vented into the atmosphere, raising radiation levels downwind of the plant to as much as nine times higher than normal as far as 100 miles away. Luckily, the radiation levels were not severe enough to contaminate local food supplies or cause serious health risks for the local residents or livestock.

Three Mile Island became the worst known nuclear power accident up to that time, and the name “Three Mile Island” became synonymous with catastrophic disaster. That remained true until the Chernobyl accident seven years later. So, how does this event correlate to your cybersecurity situation? Let’s look at the root causes of this catastrophe.

The Three Mile Island event

When the valve regulating reactor coolant got stuck, the operators received no effective feedback indicating that there was a problem. There was a sensor, but it reported only the “intended” state: when the valve failed to open, the control panel still showed the valve as open, regardless of its actual state. If the sensor had measured the actual state rather than the intended state, the reactor technicians might have been able to resolve the problem in a timely manner, effectively preventing the catastrophe.

Because the problem went unnoticed, the core temperature rose gradually, triggering system after system to generate alerts. The operators were deluged with hundreds of alerts, none of which pointed to the root cause of the instigating problem. After 80 minutes, the system began to accumulate dangerous collections of steam, and after 165 minutes a site alarm was triggered. By this point, radiation levels were 300 times greater than expected levels. It would be nearly three and a half hours from the beginning of the incident before a general alarm would be triggered, warning the surrounding residents of the potential danger.

Lessons learned for the CISO office

There is definitely a correlation between the Three Mile Island event and the alert deluge problem faced by all SecOps organizations.  The unfortunate truth is that the Three Mile Island incident could have been avoided, or at least the damage could have been minimized.  Like many severe incidents, Three Mile Island was a series of problems that culminated in the historic nuclear event.  Here are three fundamental take-aways from this story that may apply to many security operations teams.

1. Monitor the right thing

While this is easier said than done, the core problem at Three Mile Island was that the control panel monitored where the valve was supposed to be, rather than where it actually was. What does this mean in the cybersecurity context? In a world where “fileless” malware is becoming more commonplace, monitoring a file system for known malware, or monitoring for files being transferred to an endpoint, may miss the attack completely. Monitoring for malware resident in memory may provide different results than monitoring your file system or network for signatures.
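The distinction between intended and actual state can be made concrete with a small sketch. The following Python example is illustrative only: the `ControlReading` type, the control names, and the state strings are all hypothetical, and it assumes you have some independent way to measure each control's real state. It simply flags every control whose observed state disagrees with the state that was commanded.

```python
from dataclasses import dataclass


@dataclass
class ControlReading:
    """One control, with the state that was requested and the state actually measured."""
    name: str
    commanded: str  # the state the controller or policy asked for
    observed: str   # the state an independent check actually found


def find_mismatches(readings):
    """Return the names of controls whose observed state differs from the commanded state."""
    return [r.name for r in readings if r.commanded != r.observed]


# Hypothetical readings: the first mirrors the valve in the story,
# the others are security controls checked the same way.
readings = [
    ControlReading("coolant_valve", commanded="open", observed="closed"),
    ControlReading("endpoint_agent", commanded="running", observed="running"),
    ControlReading("host_firewall", commanded="enabled", observed="disabled"),
]

mismatches = find_mismatches(readings)
print(mismatches)
```

A dashboard that displays only the `commanded` field would show all three controls as healthy, which is the trap the Three Mile Island control panel fell into.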

2. Have solutions that help identify root cause

When Three Mile Island started alerting on the potential problem, it didn’t highlight the stuck valve. It probably started alerting on increasing temperature, pressure buildup, high radiation levels, and so on. While the nuclear technicians played “whack-a-mole” with the individual symptoms of the problem, the user interface did not help to identify the common culprit. Many SIEM solutions help organize the deluge of alerts and present the correlated root cause. This may help shave valuable minutes from your incident response.
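As a rough sketch of what root-cause correlation means, the Python example below groups symptom alerts by the upstream component they ultimately depend on. It is not how any particular SIEM product works: the `UPSTREAM` dependency map, the component names, and the sample alerts are all invented for illustration, and a real deployment would need a far richer model of dependencies and timing.

```python
from collections import Counter

# Hypothetical dependency map: each component points to the component it depends on.
UPSTREAM = {
    "core_temperature": "coolant_flow",
    "system_pressure": "core_temperature",
    "radiation_level": "core_temperature",
    "coolant_flow": "coolant_valve",
}


def root_of(component, upstream=UPSTREAM):
    """Follow the dependency chain until reaching a component with no known upstream."""
    seen = set()
    while component in upstream and component not in seen:
        seen.add(component)  # guard against cycles in the map
        component = upstream[component]
    return component


def rank_root_causes(alerts):
    """Count alerts per root component, most frequently implicated first."""
    return Counter(root_of(a) for a in alerts).most_common()


# Hypothetical alert stream: four symptoms of one fault plus one unrelated alert.
alerts = [
    "core_temperature",
    "system_pressure",
    "radiation_level",
    "system_pressure",
    "badge_reader",
]

print(rank_root_causes(alerts))
```

Instead of five separate alerts to chase, the operator sees one component implicated four times, which is the kind of summary the Three Mile Island interface never provided.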

3. Rapid and effective disclosure

One of the biggest criticisms of how the Three Mile Island incident unfolded was how long it took to raise the alarm, both internally and publicly. It took three and a half hours before the general alarm was sounded, and four and a half hours before this warning reached the public. Similarly, when Equifax was breached, public disclosure did not follow for one and a half months. While public disclosure of a breach holds risk, when the breach finally becomes public (which it ultimately will), the cover-up will become a bigger issue than the breach itself.

A common trend in any high-tech field is the tendency to relearn the same lessons over and over. Sometimes it can be both comforting and enlightening to learn how similar mistakes were handled by other industries not that long ago.

This article is published as part of the IDG Contributor Network.
