More than three decades after a 737 ripped open over Hawaii and stunned the flying public with the dangers of human error in aircraft maintenance, the aviation industry is still challenged to contain that risk and keep crews and passengers safe in the air.
The toughest hurdle is persuading aircraft operators, manufacturers, vendors and their maintenance technicians to comply with procedures.
The industry has made great progress since April 28, 1988, when the top of an Aloha Airlines Boeing 737’s fuselage tore free at 24,000 feet. A flight attendant was swept away; the 94 others on board were terrified, and 65 of them were injured. Images of them seated in the landed jet’s front cabin, surrounded by little more than moist island air and shredded metal, shocked much of the world. It was vivid proof of the hazards of pushing aircraft beyond the ability of human eyes and brains to keep them flying safely.
The fuselage failed at a lap joint, where many small fatigue cracks around rivet heads linked up and opened like a zipper. The U.S. National Transportation Safety Board (NTSB) said Aloha Airlines’ maintenance program failed to detect the significant disbonding and fatigue damage that led to the failure. It found that “difficult and tedious” 737 inspection procedures had “physical, physiological and psychological limitations.”
The Threat Persists
Spurred by the flying public and the U.S. Congress, regulators and the industry launched an effort to identify maintenance-error hazards, research their causes, find ways to reduce risks and field effective mitigation measures. Over three decades, inspection procedures and techniques were improved and maintenance standards made stricter. Yet incidents and accidents show repeatedly that the threat persists, particularly the biggest one: failure to follow procedures.
Most maintenance tasks have written procedures that must be followed and are intended to produce the same result every time. “However, the incidence of failure-to-follow-procedures events continues to be a major issue in aviation maintenance,” three researchers reported in 2017.
Colin G. Drury and Catherine Drury Barnes of Applied Ergonomics Group and Michelle R. Bryant of the FAA’s Civil Aerospace Medical Institute had been tasked in 2015 with examining the primary and contributing factors behind failures to follow procedures and then developing mitigation strategies. Despite 30 years of research into procedural compliance, their report noted, “these challenges and recommendations have not changed a great deal in that time period.”
In 2019, FAA chief maintenance human factors advisor Bill Johnson spoke with executives of large U.S. airlines. “Without exception,” Johnson said, they told him “procedural non-compliance is the unanimous No. 1 contributing factor” for maintenance-error events. (Now retired from the FAA, Johnson is principal scientist at drbillj.com.)
“Even as we speak, there are hundreds of mechanics probably deviating from procedures,” said Robert Baron, president and chief consultant of The Aviation Consulting Group, who specializes in human factors and other safety training and has worked with hundreds of aviation organizations around the world. “Fortunately, it’s a safe system. There are backups, redundancy, cross-checking and different types of oversight. But when something slips through, that could be potentially catastrophic.”
Consequences of Rule-Breaking
Baron’s comment pointed toward an underlying reason for persistent procedural non-compliance, which Gordon Dupont explains with an everyday analogy. As a Transport Canada safety officer in the 1990s, Dupont crafted the noteworthy “Dirty Dozen” list of 12 human factors that can degrade a person’s ability to perform effectively and safely and lead to maintenance errors. He is the retired CEO of System Safety Services in Richmond, British Columbia.
“What’s the most common rule broken every day all around the world? The speed limit,” Dupont has said. “The average driver will go between 5 and 10 mph over the speed limit unless the weather is bad, there is a police officer close by” or some other condition slows the driver down. “So why do we do it? The answer is very simple. We foresee no negative consequences in breaking the speed limit and the positive consequence of getting to our destination sooner serves to justify the rule-breaking. Rule-breaking at work goes along the same lines.”
Overall, maintenance errors can appear to be a small problem. Boeing’s analysis has long identified maintenance as the primary cause of just 3 to 4 percent of hull-loss accidents and a contributing cause of about 10 percent. By comparison, flight crew actions are cited as the primary cause in more than 60 percent.
Analysis by the International Air Transport Association (IATA) found that “maintenance operations” were a latent condition in 21 percent of 2020’s airline accidents and “maintenance operations: SOPs and checking” was a latent condition in 13 percent. IATA’s safety analysts define a latent condition as one that is present in the system before an accident and is made evident by triggering factors (which often relate to deficiencies in organizational processes and procedures).
In 2020’s accidents, IATA said, maintenance was a threat in 21 percent. It defines a threat as an event or error that occurs outside the pilots’ influence but requires their attention and management to maintain safety margins.
From 2016 to 2020, IATA found, maintenance operations were a latent condition in 12 percent of accidents, and “maintenance operations: SOPs and checking” in 11 percent. Maintenance was a threat in 14 percent over that period. From 2011 to 2015, each of those two latent conditions appeared in 7 percent of accidents, and maintenance was a threat in 10 percent.
“We’re stalled,” said John Goglia, a retired airline mechanic and former U.S. National Transportation Safety Board (NTSB) member who has been a driving force in addressing maintenance human factors issues. “We need to look at our history and do something different.”
A Problem Lies in Wait
One recent incident illustrates how the failure to follow procedures can create a problem that lies in wait.
On Oct. 23, 2020, a Jetstar Airways Airbus A320-232 was taking off from Brisbane, Australia. As the IAE V2527-A5 engines spooled up, the pilots noticed a vibration and “popping” noise that rapidly increased in frequency and volume. They rejected the takeoff at 30 knots. Stall and temperature-exceedance warnings appeared for the No. 2 engine. They later learned that passengers, the tower controller and the crew of a following aircraft had reported flames coming from the right engine. Recorded data indicated it had surged.
An on-wing borescope inspection found out-of-limit damage to the right engine’s high-pressure compressor (HPC) consistent with foreign object strikes. A teardown inspection confirmed substantial HPC damage, including a broken stage 5 blade; one stage 6 vane, four stage 7 blades and one stage 8 blade were missing. A screwdriver tip, burnt, discolored and eroded by heat and mechanical damage, was found between the combustion liner and the engine case.
The aircraft had been parked for four months. The return-to-service work included lubrication of the low-pressure compressor bleed valve mechanism. Procedures “contained specific highlighted caution notes regarding the loss of any screws or other loose objects down the bleed duct,” the Australian Transport Safety Bureau said in its Aug. 16, 2021, incident report. “The notes highlighted that lost articles would progress to the HPC and break blades and vanes.”
The bleed valve was lubricated 112 flight cycles prior to the Oct. 23 engine surge, the ATSB said.
Helicopter Crew Gets Lucky
Sometimes negative consequences of procedural non-compliance are quickly apparent.
On June 1, 2020, the crew of a Northern HeliCopter AS365-N3 were alerted for a rescue mission from their St. Peter-Ording Airfield base, about 85 miles (138 kilometers) northwest of Hamburg on Germany’s North Sea coast. It would be the day’s second flight. The pilots used an approved “scramble” takeoff procedure that did not include a flight control hydraulic check.
The copilot increased thrust. The helicopter lifted to a hover, then immediately pitched up. The copilot lowered the collective. The tail struck the ground and the main landing gear touched down hard. No one was injured. The helicopter was slightly damaged.
The pilots determined that forward and backward cyclic inputs had no effect on the rotor disk. They shut down on the runway. Back in the hangar, they found that the left actuator was not connected to the swashplate, which transmits flight control inputs to the main rotor blades. Its fastener was missing. They found the bolt, two washers and one nylon stop crown nut in the gearbox compartment below. They did not find a loose cotter pin or parts of one.
Through mid-May, a contractor had performed substantial maintenance on the helicopter, including a main gearbox leak repair that required the left actuator’s removal. A post-maintenance check flight was done. The repair was assigned to experienced mechanics and was checked by an experienced inspector. “However, that check had been signed a few days after the occurrence,” said the German Federal Bureau of Aircraft Accident Investigation (BFU, for its German initials), which investigated. The mechanics, inspector and maintenance pilot told the BFU in written statements that the actuator’s screw fitting was properly installed and the cotter pin was positioned and visually checked several times before the helicopter was returned to the operator.
Those maintenance personnel “were certainly aware of the importance of the flight controls and were certainly familiar with different types of screw lockings,” the BFU said in its report.
The BFU concluded the incident was most likely caused by mechanics using a worn nylon stop crown nut on the actuator-to-swashplate bolt, applying insufficient torque to that nut and not installing a cotter pin on it.
The BFU also concluded the inspector did not sufficiently check the mechanics’ work and that two other mechanics failed to check the actuator connection as required during a 10-flight-hour/seven-day inspection performed the day before the incident.
“It was just luck that during the occurrence — total loss of control — only the tail skid of the helicopter was damaged and more severe damage or even injuries to persons did not occur,” the BFU report observed. Between its return from maintenance and the loss of control, the helicopter flew a total of 8 hours 46 minutes.
BA 787-8 Case Study
Another recent maintenance error made itself known much faster.
On June 21, 2021, a British Airways Boeing 787-8 was being loaded at London’s Heathrow Airport for a cargo flight. Three mechanics were tasked with clearing status messages about a nose landing gear solenoid valve. The procedure required cycling the landing gear with hydraulic power applied to the aircraft. To prevent the gear from retracting, the procedure required pins to be inserted in the main and nose gear downlocks.
The lead mechanic, seated in the captain’s seat preparing for the job, told the other mechanics to place pins in the downlocks and ensure the four people loading cargo were clear of the aircraft. At the nose gear, the first mechanic could not reach the locking pin hole. He pointed to the hole’s location and the second mechanic fitted the pin, which, like the others, had red and yellow flags attached. At the right main gear, the first mechanic used portable steps to fit the pin. He repeated that on the left gear.
The first mechanic returned to the cockpit to tell the lead the pins had been fitted. The two mechanics then returned to the nose gear and plugged a communications headset into the nose gear bay port. The lead requested confirmation again that the pins were in place. The first mechanic said they were.
The lead mechanic applied hydraulic power. Before moving the gear lever, he requested final confirmation from the first mechanic that the pins were in place and the cargo team was clear. This mechanic again visually checked that he could see the warning flags for each gear pin. He also checked that no feet were visible, indicating that the loading team was clear. He then confirmed this to the lead.
In the cockpit, the lead moved the gear lever to the up position. The nose gear retracted and the nose fell to the ground.
The worker on the pallet loader under the starboard forward cargo door was slightly injured as that door moved down when the fuselage dropped. The copilot, sitting in the cockpit, received a minor injury.
The nose crushed a ground power unit’s articulated cable arm. The lower forward fuselage and nose gear doors were damaged, as were both engine cowlings (which also struck the ground). When the nose fell, Door 2L struck the stairs positioned at its opening and was severely damaged.
When the recovery operation lifted the nose, the nose gear was examined and the downlock pin was found fitted not in its hole but in the apex pin bore next to it.
No Silver Bullet
Researchers talk of errors of omission (such as failing to install O-ring seals on turbine engine chip detectors) and of commission (such as using incorrect fasteners to install a cockpit windshield). There are timing errors (performing a task at the wrong time or in the wrong order) and precision errors (such as using the wrong setting on a torque wrench).
They also talk of perception errors (“I didn’t see that”) and slips (“I didn’t mean to do that”), as well as wrong assumptions (“I assumed we returned to Stand 513, where the aircraft’s integrated drive generator oil levels had to be checked, but we went to Stand 517”). There is also technical misunderstanding (“I tried to replace the landing gear hydraulic-retract actuator, but I didn’t understand what I had to do”).
One of the most common maintenance errors involves a mechanic forgetting to do a task planned for completion before a job is closed out, such as removing an engine thrust-reverser lockout pin after investigating an engine bleed-air issue.
All of the above can involve failure to follow procedures, since many procedures are aimed at heading off such errors. Dupont classifies violations in three main ways.
In a situational violation, he says, a mechanic concludes that a job can’t be completed without violating a procedure, often because of time pressure. The situation seems to justify the violation, which may not be repeated.
An exceptional violation occurs when there appears to be no other way to accomplish the task. Dupont offers the example of a manual calling for three people to be used at all times when moving an aircraft. If a mechanic is out sick, crewmates may decide to push the aircraft carefully using the only two people available.
A routine violation happens when a mechanic believes there is a better way to complete a task and sees no negative consequences to the ad hoc procedure, Dupont explains. It may start as a situational violation, but over time the informal procedure may become a norm.
If the mechanic’s organization condones or tolerates the violation, it can move into a fourth class: the organizational violation. A classic example, Dupont says, is the May 25, 1979, crash of American Airlines Flight 191, a DC-10 whose left engine and pylon separated on takeoff from Chicago O’Hare International Airport because of damage caused by an engine-removal shortcut that departed from the manufacturer’s procedure. The links leading to the accident, which killed all 271 on the plane and two on the ground, included failures by the jet’s manufacturer, the FAA and the airline’s management, engineering and maintenance departments, as well as the mechanics.
Safety proponents are refining efforts to reduce procedural non-compliance. The FAA has fielded a free, 45-minute training course, “The Buck Stops with Me,” aimed at “creating champions for rules-following,” Johnson said. Several researchers are pursuing efforts to bring the safety gains of line operations safety audits (LOSA) on the flight deck to maintenance operations. Expanding requirements for operators to set up safety management systems may aid the effort by promoting the acceptance and use of human factors analysis in maintenance.
“There is no silver bullet for any of this,” Baron said. “It’s all about awareness.”
Courtesy of James McKenna from AVM-Mag