
What can we learn from the Fernie Ammonia fatalities?

The October 17th, 2017 Ammonia release in Fernie, BC resulted in three fatalities:

On October 16, 2017, the curling brine chiller at the Fernie Memorial Arena was put back into operation after a seasonal shutdown. During the shutdown and seasonal maintenance, ammonia had been detected in the curling brine system, indicating that the curling brine chiller was leaking… A total of three people were found deceased in the mechanical room: the director of leisure services, the refrigeration operator, and a refrigeration contractor mechanic.

 

Three people died in a completely avoidable incident. If you want to know the particulars of the incident, I’d recommend you go read the Incident Report itself. While we can’t go back in time and avoid this particular incident, we can extract some valuable lessons from it to prevent a similar incident in the future.

There’s a lot that went wrong, but we’re going to focus on a few key failures in Mechanical Integrity, Process Safety, and Release / Incident Response. We’ll briefly discuss each failure and provide ten opportunities for improving your current Process Safety system.

Note: While this incident occurred in Canada, which does not have robust Process Safety regulation, we’re going to provide our analysis as if it were a PSM/RMP plant. Even if this incident had occurred in the US, the total system inventory was estimated at less than 1,000 pounds, placing it in the General Duty category. Most operators of these General Duty systems do not choose to implement a PSM system – hopefully this incident will cause them to re-evaluate that choice.

 

Equipment Age and installation: In 2011, the facility received a recommendation from their mechanical contractor to replace the chiller due to its age. It had been in service for about 24 years and had a life expectancy of 20-25 years. (At the time of failure, the chiller had been in service for approximately 31 years.) The facility actually budgeted for this replacement, deferred it, and then dropped the idea altogether. The report and its appendices detail this decision-making and indicate that the people making these decisions didn’t understand the underlying safety issues or the possible repercussions of their decisions. In part this was due to management turnover – the people who received the initial recommendation no longer worked at the facility when those recommendations were due to be implemented. Additionally, post-release, it was determined that the failed coupling was not properly supported.

Possible PSM citations: 1910.119(d)(3)(ii) for not installing the coupling per the manufacturer’s recommendations. 1910.119(d)(3)(ii) for equipment operating outside manufacturer’s recommended lifespan. 1910.119(e)(1) for the PHA not analyzing the hazards associated with operating outside the manufacturer’s recommended lifespan. 1910.119(j)(5) for operating the equipment with a known (service life) deficiency without assuring safe operation. 1910.119(m)(5) for not addressing and resolving a recommendation (if the recommendation was made due to an indication of NH3 in the brine).

Opportunity #1: When a piece of equipment has a stated service life, you need to either replace the equipment per the recommendation or support your decision to keep it in service with a suitable engineering rationale.
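
One way to keep a stated service life from quietly slipping past is to make the check explicit in your MI records. The sketch below is illustrative only: the `Equipment` record, its field names, and the status strings are hypothetical, and the years are the approximate figures cited above for the Fernie chiller.

```python
from dataclasses import dataclass


@dataclass
class Equipment:
    # Hypothetical record layout, for illustration only.
    name: str
    years_in_service: float
    life_min_years: float  # low end of the stated life expectancy
    life_max_years: float  # high end of the stated life expectancy
    has_engineering_rationale: bool = False  # documented basis for continued service


def service_life_status(eq: Equipment) -> str:
    """Classify equipment against its stated service life."""
    if eq.years_in_service < eq.life_min_years:
        return "within stated life"
    if eq.has_engineering_rationale:
        return "continued service supported by documented rationale"
    if eq.years_in_service <= eq.life_max_years:
        return "in end-of-life window: plan replacement or document rationale"
    return "past stated life: replace or document rationale"


# Approximate figures from the report: 20-25 year life expectancy,
# about 24 years in service in 2011, about 31 years at failure.
status_2011 = service_life_status(Equipment("curling brine chiller", 24, 20, 25))
status_2017 = service_life_status(Equipment("curling brine chiller", 31, 20, 25))

print(status_2011)
print(status_2017)
```

The point of the `has_engineering_rationale` flag is that "keep it in service" is only an acceptable answer when someone has written down why; without it, the check keeps flagging the equipment.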

Opportunity #2: When operators & contractors make recommendations, they need to provide CLEAR and defensible reasons for those recommendations.

Opportunity #3: When recommendations are delayed, deferred, or not completed, the operators & contractors need to ensure that the decision makers understand the implications of their decisions.

Opportunity #4: A Pre-Startup Safety Review (PSSR) and ongoing MI tasks need to ensure that equipment is installed correctly and maintained in a safe manner / arrangement.

 

Signs of Failure and Deficiency Response: The facility detected NH3 in the brine (by scent) in April of 2017 and then followed it up with a lab test of the brine showing over 3,000ppm of NH3 in June. The facility decided to continue operating the chiller and “monitor” it. A second test in August showed an NH3 concentration near 2,000ppm. Again, the facility decided to keep “monitoring” the situation. The report indicated that the personnel performing the tests and receiving the results didn’t understand the safety implications of them. Even after receiving the tests showing the chiller had failed, the facility decided to keep operating it. According to the report, there was no evidence the facility understood the hazards associated with a leaking chiller.

Furthermore, due to a miscommunication, the contractor believed the facility had taken the chiller out-of-service and they were preparing a bid to replace the leaking unit. The contractor’s recommendation to “monitor” the unit was likely meant to monitor it to see if the valves were leaking by, but the facility interpreted it as a go-ahead to continue operating the defective chiller until it could be replaced as long as they “monitored” it.

The contractor had no policy or procedure in place to deal with a failed chiller outside the usual troubleshooting, repair and replace activities. The investigators concluded that none of the people involved with the decision to continue operating the chiller had training or qualifications involving condition/risk assessment.

Possible PSM citations: 1910.119(j)(5) for operating the equipment with a known (integrity) deficiency without assuring safe operation. 1910.119(m)(5) for not addressing and resolving a recommendation. 1910.119(g)(1)(i) for not training personnel on the hazards associated with a leaking chiller.

Opportunity #5: Personnel reviewing test results need to understand the meaning of the test results and the safety implication of those test results.

Opportunity #6: When test results are provided to decision-makers, these results need to provide adequate information so that the decision-makers understand them and their safety implications.

Opportunity #7: When contractors are called to deal with deficient equipment, they will almost always provide guidance / estimates on how to repair / replace the equipment, but facilities should demand a risk assessment on continued operation of the equipment if they intend to continue its operation while planning and preparing for the repair / replacement.

From Appendix V of the report: “In the majority of instances, owner/operators relied heavily on the refrigeration contractor’s assessment of the equipment and evaluation of the NH3 indication in the brine samples. The owner is accountable for the safe condition and operation of the equipment but in some instances, deferment to the refrigeration contractor’s assessment and recommendations for the equipment was observed.”

Opportunity #8: When a facility outsources maintenance work, they often erroneously think that they are outsourcing the responsibility as well. It is important for a facility to understand that this remains their process and their responsibility. Ask tough questions of your contractors to ensure that you understand the condition of your system.

 

Facility and Contractor Incident Release Response: On the day of the release at 03:53 the machine room NH3 alarm registered 300ppm. Responding facility personnel observed the brine expansion tank shaking and spilling brine. At 04:30, the facility personnel shut down the system and closed the chiller suction valve, observing that the shaking in the brine tank stopped. This should have indicated to the facility personnel that the separation between the brine and NH3 sides was completely compromised and that the brine loop was now full of ammonia. At 05:18 the facility personnel called the contractor to come in and re-configure the system to operate without the brine chiller.

At some point during the work, the personnel isolated the brine chiller, trapping the ammonia-laden brine in the chiller with no outlet available for it. As this ammonia-laden brine warmed up, the pressure inside the brine chiller rose and, at an estimated pressure of 30-150 psig, a coupling on the brine side of the brine chiller failed, releasing the contents into the machine room and onto the personnel in the room. The estimated total NH3 release was 22 pounds (9 lbs immediately vaporizing), resulting in an immediate concentration in the area of 20,000ppm which dissipated to about 5,000ppm over a period of 5 minutes.

The report uses electricity demand to conclude that the personnel did not attempt a pump-out of the brine chiller. Unlike a CSB report, the report does not go into the fatalities. We have no idea where the personnel were positioned in the room, or what – if any – PPE they were wearing at the time of the release. It can reasonably be surmised that they weren’t wearing any respiratory PPE at all.

Possible PSM citations: 1910.119(g)(1)(i) for not training personnel on the hazards associated with NH3 contaminated brine and the hazards of trapping it. 1910.119(h)(3)(ii) for the contractor not being trained in the hazards associated with NH3 contaminated brine and 1910.119(h)(2)(v) for the facility not ensuring this training occurred. 1910.119(n) for not providing “procedures to handle small releases.” 1910.119(f)(1)(i)(D) for not providing an emergency shutdown procedure. 1910.119(f)(1)(i)(E) for not providing an emergency operations procedure.

Opportunity #9: While we often train on the dangers associated with trapping NH3, the dangers of trapping NH3 contamination in a secondary loop are rarely discussed. Operator training in facilities that utilize secondary cooling loops must address contamination and its possible safety implications.

Opportunity #10: While it’s not possible to know for sure, it is extremely likely that all three of these fatalities could have been avoided if the personnel had been wearing full-face APRs at the time of release. Note: They would have to have been wearing them, not have them “near-by.” APRs aren’t magic.

 

090618 Update: Full WorkSafeBC Incident Report

Learning from Failure

“Failure is only the opportunity to begin again. Only this time, more wisely.” –Henry Ford

We often push PSM practitioners to perform Incident Investigations for fairly minor events in the hopes that the lessons learned from those minor incidents will stop the larger incidents from happening. This is, in part, due to CCPS (Center for Chemical Process Safety) guidance that, for every single catastrophic accident, there are typically nearly 9,900 minor issues / process upsets and 99 near misses.

So, if you only investigate the catastrophic incidents, then you are only acting on 0.010% of the opportunities available to you to improve your control over the process.
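
The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch using only the ratio quoted above (1 catastrophic accident, 99 near misses, roughly 9,900 minor issues); the function name and the `include_near_misses` flag are ours, not CCPS's.

```python
def share_investigated(catastrophic: int, near_misses: int, minor: int,
                       include_near_misses: bool = False) -> float:
    """Fraction of all events that get investigated, given counts per tier."""
    total = catastrophic + near_misses + minor
    investigated = catastrophic + (near_misses if include_near_misses else 0)
    return investigated / total


# Ratio quoted above: 1 catastrophic : 99 near misses : ~9,900 minor issues
only_catastrophic = share_investigated(1, 99, 9900)
with_near_misses = share_investigated(1, 99, 9900, include_near_misses=True)

print(f"{only_catastrophic:.3%}")  # 0.010%
print(f"{with_near_misses:.3%}")   # 1.000%
```

Even adding near misses to the catastrophic events only gets you to 1% of the available opportunities, which is the case for investigating the minor upsets too.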

OSHA has promoted this idea as far back as a decade ago…

OSHA and industry have found that when major incidents have occurred, most of these incidents have included precursor incidents. Additionally, OSHA and industry (See CCPS [Ref. 41], Section 5, “Reporting and Investigating Near Misses”) have concluded based on past investigations, that if employers had properly responded to precursor incidents, later major incidents might not have occurred. Consequently, anytime an employer has an “opportunity” to investigate a near-miss/precursor incident (i.e., an incident that could reasonably have resulted in a catastrophic release) it is important that the required investigation is conducted and that the findings and recommendations are resolved, communicated, and integrated into other PSM elements/systems so a later major incident at the facility is prevented. …It is RAGAGEP to investigate incidents involving system upsets or abnormal operations which result in operating parameters which exceed operating limits or when layers of protection have been activated such as relief valves. (An example RAGAGEP for investigating incidents, including near-miss incidents is CCPS [Guidelines for Investigating Chemical Process Incidents, 2nd Ed.], this document presents some common examples of near-miss incidents). (OSHA, Refinery PSM NEP, 2007)

Going a step further, it’s often true that you can learn something about managing complex operations from businesses in entirely different fields. One field that I like to follow – in part because it’s endlessly re-inventing itself – is information technology.

Google recently published an article on their Post-Mortem culture, with a farcical worked example that includes the movie “Back to the Future” and a newly discovered sonnet by Shakespeare. The practice of learning from their failures is actually part of their Site Reliability Engineering handbook and you can read the entire chapter if it appeals to you.

“Failures are an inevitable part of innovation and can provide great data to make products, services, and organizations better. Google uses ‘postmortems’ to capture and share the lessons of failure…

… For us, it’s not about pointing fingers at any given person or team, but about using what we’ve learned to build resilience and prepare for future issues that may arise along the way. By discussing our failures in public and working together to investigate their root causes, everyone gets the opportunity to learn from each incident and to be involved with any next steps. Documentation of this process provides our team and future teams with a lasting resource that they can turn to whenever necessary.

And while our team has used postmortems primarily to understand engineering problems, organizations everywhere — tech and non-tech — can benefit from postmortems as a critical analysis tool after any event, crisis, or launch. We believe a postmortem’s influence extends beyond that of any document and singular team, and into the organization’s culture itself.”

Google’s Pre-Mortem Tool – Anticipating what can go wrong.

Google’s Post-Mortem Tool – Dealing with what actually went wrong.

Pencil-Whipping can Kill

What is it? Pencil-whipping is when you complete a form, record, or document without having performed the implied work or without supporting data or evidence.

Here are some common examples in NH3 refrigeration:

  • Completing “work orders” without conducting the work
  • “Signing off” on SOP reviews or PHA revalidations without actually reviewing or revalidating the documents.
  • Certifying training – or signing training attendance forms – without the training actually occurring.

Why take it seriously? There are several reasons, but here are some obvious ones:

  • You can be prosecuted for false statements resulting in fines and/or jail time.
  • There is significant legal liability if the action leads to an incident.
  • You can be fired for false statements.
  • There can be significant safety repercussions to documenting work that wasn’t done.

I want to briefly focus on the last one – what can happen when you document that work was done when it actually wasn’t. If you are being assigned a task, we have to assume that the performance of that task is important to the system as a whole.

Imagine your job was to inspect some equipment that was prone to long-term wear – equipment that was relied upon for normal function. Now imagine that you didn’t conduct those inspections, leading the users of that equipment to believe it was in proper working order. They are relying for their safety on YOUR lie!

Here’s what that can lead to:

And here’s what can happen when people investigate the incident:

Thursday morning, the General Manager and CEO of the Board Safety Commission released a statement regarding the firings: “…I want the Board, our employees and our customers to know that this review revealed a disturbing level of indifference, lack of accountability, and flagrant misconduct in a portion of Metro’s track department which is completely intolerable. Further, it is reprehensible that any supervisor or mid-level manager would tolerate or encourage this behavior, or seek to retaliate against those who objected. It is also entirely unacceptable to me that any employee went along with this activity, rather than exercise a safety challenge, or any of the multiple avenues available to protect themselves, their coworkers, and the riding public.

Since the derailment occurred, we have either taken action or are in the process of taking disciplinary actions involving 28 individuals. This represents nearly half of the track inspection department and includes BOTH management and frontline track employees.

  • Six employees have been terminated, including 4 track inspectors and 2 supervisors
  • Six more track inspectors are pending termination or unpaid suspension; and 10 more are pending possible discipline pending the outcome of the administrative process
  • Another supervisor termination is underway; and two more supervisors are pending the outcome of the administrative process
  • One Superintendent was demoted to Supervisor
  • One Assistant General Superintendent was demoted to Superintendent
  • One assistant superintendent separated from Metro before the review concluded

In closing: Pencil-Whipping is immoral, illegal and just plain wrong. Don’t do it.
