Saturday, February 8, 2020

Process, Projects and Building Great Things

For over 20 years, I have worked in software development and engineering at companies that produce power plants and generators, drugs and, of course, software. In those 20+ years, I have seen, heard and learned quite a few things about people, code and organizations. And, in the interest of helping others in my industry learn from that experience, I have started this humble blog.

You see, over these decades, I have noticed a disturbing trend. At one point, our industry appeared to be not only averse to failure, but downright terrified of it. At the time, it wasn't uncommon to read a Slashdot article lamenting the risk aversion that seemed pervasive in the software development industry. "It stifles innovation and creativity" was a common refrain. "There's no room for process improvement" was another. And these were all valid concerns.

Over the intervening decades, this sense of risk aversion seems to have waned to a point that could be called carelessness. Behold, the case that motivated me to begin putting bytes into the cloud: 

Starliner faced “catastrophic” failure before software bug found
During its quarterly meeting on Thursday, NASA's Aerospace Safety Advisory Panel dropped some significant news about a critical commercial crew test flight. The panel revealed that Boeing's Starliner may have been lost during a December mission had a software error not been found and fixed while the vehicle was in orbit.
(https://arstechnica.com/science/2020/02/starliner-faced-catastrophic-failure-before-software-bug-found/)

Boeing patched a software code error just two hours before the vehicle reentered Earth's atmosphere. Had the error not been caught, the source said, proper thrusters would not open during the reentry process, and the vehicle would have been lost. 

Forget about hot-patching a production e-commerce server a week before Christmas - these folks patched an orbiting spacecraft's flight control system to fix a bug.

If you're reading this in early 2020, you will recall that Boeing is in the midst of another engineering project failure - that of the 737 MAX aircraft. Both the Starliner and the 737 MAX failed because of software defects. On the 737 MAX, two airframes were lost and 346 people lost their lives. After the second crash in less than five months, the FAA and other aviation authorities grounded the airliner.

Grounded 737 MAX aircraft parked in Seattle at Boeing Field (photo from Wikipedia). Each aircraft cost its owners $99-$134 million.

I bring up these incidents because they happened several months before the launch of Starliner, and it boggles the mind somewhat that Boeing could let software defects take down two major projects in one year. Alas, the previous sentence betrays an assumption - that Boeing let software defects take down two major projects. I don't think Boeing let anything happen intentionally. Rather, Boeing probably had no idea it was even vulnerable.

Most of the time, failures like this can be attributed to failures of process in the organization developing the software. Now, in Boeing's case, since I don't work there and have no visibility into the company, I can't comment on exactly what went wrong. But, with the experience I've had in this industry, I'm fairly sure it was one of the phenomena we'll be discussing in the future on this blog.

So, let's dissect it a little, shall we? Time for a user story:

According to Wikipedia,

The Maneuvering Characteristics Augmentation System (MCAS) is a flight control law (software) embedded into the Boeing 737 MAX flight control system which attempts to mimic pitching behavior similar to the Boeing 737 NG. When it detects that the aircraft is operating in manual flight, with flaps up, at an elevated angle of attack, it adjusts the horizontal stabilizer trim to add positive force feedback (a "nose heavy" feel) to the pilot, through the control column.

In other words, the software modulates the way the aircraft flies so that the handling characteristics are like those of the 737 NG. This is so that pilots already rated on the NG (the 737-800 and -900) would also be qualified on the MAX.

What happens is that the MCAS software is "vulnerable to erroneous angle of attack data" when a sensor malfunctions. In that event, if there is no second sensor to cross-check the failed one, the software will begin trimming the nose down to compensate for an increasing angle of attack that isn't actually happening.

The pilots begin fighting the aircraft to keep the nose up.

If they realize what's going on, they can navigate menu trees and disable the software. If they don't, they will crash the aircraft. If the pilots are able to disable MCAS, one of them will have to keep doing it every 5 minutes or so. If they don't, they'll crash the aircraft.

That's quite a story, isn't it? I wonder if the developers ever told it that way....
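To make that single-sensor failure mode concrete, here is a minimal sketch of the difference between trusting one angle-of-attack vane and cross-checking two. To be clear: this is illustrative Python of my own, not Boeing's flight software, and every name and threshold in it (mcas_single_sensor, mcas_cross_checked, AOA_TRIGGER, AOA_DISAGREE_THRESHOLD) is hypothetical.

    # Illustrative sketch only - not Boeing's code. All names and
    # thresholds here are hypothetical.
    AOA_TRIGGER = 15.0             # degrees; pretend "high angle of attack"
    AOA_DISAGREE_THRESHOLD = 5.5   # degrees; max disagreement we tolerate

    def mcas_single_sensor(aoa: float) -> bool:
        """Trusts a single vane. A stuck sensor reading 40 degrees looks
        exactly like a real stall to this function."""
        return aoa > AOA_TRIGGER

    def mcas_cross_checked(aoa_left: float, aoa_right: float) -> bool:
        """Cross-checks two vanes and refuses to act when they disagree."""
        if abs(aoa_left - aoa_right) > AOA_DISAGREE_THRESHOLD:
            return False   # conflicting data: inhibit nose-down trim
        return max(aoa_left, aoa_right) > AOA_TRIGGER

    # One stuck vane at 40 degrees, one healthy vane at 4 degrees:
    print(mcas_single_sensor(40.0))       # True  -> commands nose-down trim
    print(mcas_cross_checked(40.0, 4.0))  # False -> declines to act

The design point is that the cross-checked version fails safe: when its inputs conflict, it does nothing rather than act on data it can't trust.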

Again, according to Wikipedia, here are some of the findings after the two crashes:

  • Boeing presented MCAS to the FAA as existing technology, despite it being new on the MAX. The (dubious) justification for this was that similar software ran on the 767.
     
  • Just before entering certification, the functional requirements for MCAS were still changing. Boeing modified MCAS so that it intervened more strongly and at lower airspeeds than originally planned.

I selected these two points because they are the most damning. All of the engineering process controls in the universe will do your project exactly no good if the process is not adhered to and the project is not square with reality. Bending the truth ("MCAS isn't new tech") is a form of "lying to yourself" for an organization, and this form usually serves only the project management teams involved. What it did was "set an expectation" with the regulators that the testing might not need to be as rigorous.

The second bullet point is one I see all the time - paying "lip service" to the engineering process while basically doing whatever you want. If I were to hazard a guess, it's a form of "we're following a process, so it will all work out OK" psychological error.

I'm struggling not to find malfeasance here - but the problem is that I don't think I could ascribe it to a single role or individual at Boeing. And therein lies the rub, and a theme that will be repeated time and again on this blog: the bad decisions in question are rarely made by a single person. Typically, they're made by a group of people, and the priorities that drive the decision aren't only engineering factors.

Project managers and marketing people end up making engineering decisions. Schedule comes before results. The "go-live" or "launch" date becomes the only milestone that matters on the project. Launch happens, and the PR is great. Then, some time later, the defects start rolling in. And if you're extremely unlucky, people start dying. If you're lucky, your operations people are working 50- and 60-hour weeks to do damage control while developers crank out fixes at what is basically gunpoint. More mistakes occur because you're in "firefighting" mode and because people are fatigued and stressed. Customer sentiment declines.

What would adhering to processes and being honest have done for Boeing? Let's not take for granted that it might have prevented 346 deaths. But even if it hadn't, Boeing would have had the moral high ground of having "done the right thing". That counts for a lot when a customer considers whether to ever buy Boeing airframes to replace aging fleets.

Could an individual programmer have raised a red flag and put on the brakes before the 737 MAX carried passengers? Probably. However, Boeing had "offshored" the development work to India and Bangladesh for this one. My suspicion is that development shops like those don't exactly have a culture of "one programmer can make a difference" when the entire purpose of the company is to get software written to spec at the lowest price.

So, in summation,
  • The engineering process seems to have been followed only pro forma with respect to MCAS.
  • The MCAS programmers did not work for Boeing and likely had no vested interest in the success or failure of the product they were working on.
  • Boeing deliberately deceived regulators as to the novelty of MCAS.
Another person, with a little more aerospace industry experience than I have, wrote an article on this topic here. It's worth a read. He doubts the MAX can be fixed, but he also calls out many of the same process failures that I do.

Let's get our industry fixed. Let's stop wasting money, time and the hours of our lives. Let's stop hurting and killing people with laziness. 
