I recently finished the book Riding Rockets by Mike Mullane. The book is a memoir of the author’s time as an astronaut at NASA during the heyday of the Space Shuttle Program. With the recent 35th anniversary of the Challenger disaster, I could not help but reflect on some of the insights from the book. Mullane, who became an astronaut in 1978, flew on three shuttle missions over the course of his career. However, instead of solely focusing on trips into space, a large portion of his book focuses on the events leading up to and the aftermath of the Challenger disaster. Mullane discusses the mechanical failure that led to the tragedy in detail, but what I found more interesting was his emphasis on the organizational failures that allowed the event to occur.
The ensuing investigation of the explosion that occurred on January 28, 1986, identified the following key failures that contributed to the tragedy:
- Normalization of Deviation: This occurs when deviation from acceptable operating windows becomes standard practice. As more deviation occurs without catastrophic results, it becomes the new accepted norm. The solid rocket boosters on the shuttle experienced O-ring erosion as early as the second shuttle flight (STS-2), and later flights showed combustion gas blow-by past the seals. Instead of addressing the root cause, engineers began rating the damage seen after each launch. As long as the damage was not as extensive as the worst incident since the beginning of the program, NASA deemed the shuttle still safe to fly. This type of organizational failure also led to the Columbia disaster in 2003, when normalization of foam dislodging from the external fuel tank ended with debris damaging the thermal protection on the wing.
- “Launch Fever”: Leading up to the Challenger disaster, shuttles were launching as little as 17 days apart. This high launch rate was a result of program and commercial pressures to get payloads into space. The quick turnaround between launches stretched the engineering staff at NASA thin, and engineers were burning out under the workload. “Launch Fever” caused many in the organization to overlook or minimize issues regarding crew safety in order to get the next shuttle off the ground.
- Overreliance on Safety Systems: The shuttle's safety systems often had redundant sensors or safety interlocks, but there was no crew escape mechanism to protect the crew in the event of a failure. While this did not directly cause the disaster, the design flaw was never addressed, partially because management estimated the failure rate of the shuttle at 1 in 100,000 flights (when in actuality it was closer to 1 in 100). The mission would still have been lost, but a crew escape system as a last line of defense might have allowed the astronauts to abort safely.
While the problems and constraints of spaceflight are very different from those of the oil, gas, and chemicals industries, some of the takeaways from the Challenger incident remain the same. We should always be wary of allowing organizational failures to creep into the workplace. Questions we should keep at the forefront of our minds include:
Are there areas where deviation from acceptable windows is being normalized? For example:
- Are we operating too close to safe operating limits or design limits?
- Are equipment or piping leaks identified and addressed appropriately and in a timely manner?
- Are we assuming the carseal list is correct or are we double-checking the list regularly to ensure that valves are in their proper position and that carseals have not fallen off?
- Are we ignoring certain alarms because they are always going off and have become a nuisance?
- Have we become comfortable with doing things a certain way because it saves time when there might be a safer alternative?
Are production pressures or scheduling pressures causing potential safety issues to be missed or minimized?
- Are changes being properly managed or are they being rushed through?
- Are safety systems being bypassed without following the appropriate procedures and chain of command?
Are the risks being understated?
- Do the risk profiles in the PHA accurately reflect the process as it is today?
- Is there pressure to “game the system” or “play the numbers” so that the PHA does not show an unacceptable level of risk?
- Is too much credit being given to existing layers of protection or are conditional modifiers applied incorrectly such that the need for additional protection layers is downplayed?
At the end of the day, no task is so important that it cannot be done safely. Making sure everyone returns home to their loved ones is of the utmost importance. We can become complacent and risk falling prey to the same decision-making that led to the Challenger disaster 35 years ago, or we can choose to be proactive against the organizational failures that can lead to process safety incidents. I know what my choice is every day.