Saturday, February 22, 2020


The Importance of Input Validation


I first saw this one come across a social media feed, so I didn't believe it until I had actually found the original source. Lo and behold - Biohackers Encode Malware in DNA
In new research they plan to present at the USENIX Security conference on Thursday, a group of researchers from the University of Washington has shown for the first time that it’s possible to encode malicious software into physical strands of DNA, so that when a gene sequencer analyzes it the resulting data becomes a program that corrupts gene-sequencing software and takes control of the underlying computer.
If you have a little knowledge of genetics, it's not that much of a leap - an inspired leap, certainly, but since genes really are just software, it's not that far. As an attack vector it's artful and ironic. And it shows that all software should do input validation. I'm not sure of the practicality of this particular attack, but it definitely shows imagination.
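
As a purely illustrative sketch of my own (not the researchers' exploit or any real sequencing pipeline), the defense is the boring one it always is: validate length and alphabet at the boundary, before the data ever reaches a parser or a fixed-size buffer. Something like this, in Python - the alphabet and length limit are assumptions, not values from the paper or from any actual sequencing tool:

# Minimal input-validation gate for sequencer reads (illustrative only).
VALID_BASES = set("ACGTN")   # assumed allowed alphabet
MAX_READ_LENGTH = 10_000     # assumed sanity limit on read length

def validate_read(read: str) -> str:
    """Return the normalized read if well-formed, otherwise raise ValueError."""
    if not read:
        raise ValueError("empty read")
    if len(read) > MAX_READ_LENGTH:
        raise ValueError(f"read length {len(read)} exceeds {MAX_READ_LENGTH}")
    bad = set(read.upper()) - VALID_BASES
    if bad:
        raise ValueError(f"unexpected characters in read: {sorted(bad)}")
    return read.upper()

if __name__ == "__main__":
    print(validate_read("acgtACGTn"))      # passes
    try:
        validate_read("ACGT" * 100_000)    # rejected: far too long
    except ValueError as e:
        print("rejected:", e)

The actual vulnerability reportedly involved a buffer overflow in a downstream analysis utility; the point of the sketch is only that a few lines of checking at the input boundary are what keep "data" from becoming "code".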



Sunday, February 16, 2020

Antipatterns : The Death March

Antipatterns are a favorite topic of mine. If you haven't encountered them before, think tropes, but instead of a taxonomy of drama, you have a taxonomy of organizational dysfunction. According to Wikipedia, an antipattern is "a common response to a recurring problem that is usually ineffective and risks being highly counterproductive" (Link). The term was coined in 1995 (which actually was a long time ago) by a software developer named Andrew Koenig.

I intend this to be the first in a series exploring antipatterns and how they happen. And, if we're lucky, I may be inspired to spend some compute cycles thinking about how to prevent and, if desired, reverse them.

There are quite a few examples at the link, but there is one that every programmer or engineer has a good likelihood of encountering in a career, and it has got to be one of the most demoralizing and depressing antipatterns around:

The Death March

Again, our veritable reference Wikipedia defines a death march project as follows:
"a project that the participants feel is destined to fail, or that requires a stretch of unsustainable overwork. The general feel of the project reflects that of an actual death march because project members are forced by their superiors to continue the project against the members' better judgment."
I figured we'd start off with one of the most common antipatterns there is. Unless you're brand-new in whatever industry you're in, you've been involved with one of these. If you haven't already, you probably will be within 5 years. There's no ambiguity about whether you were involved in a death march project - you'll remember it and tell stories about it for a long time. And, when you see it happening again, you'll attempt to warn everyone.

And no one will listen to you. Or if they do, the general consensus will be "yes, we see the risk, but we're going to manage this project properly so that that doesn't happen". Or, you may be criticized for negativity. At the end of the day, a business decision is made and the project must proceed. Unfortunately, it is at that point that the project is doomed. It's already too late. 

If a project is deemed business critical and must be executed, and the subject matter experts assigned to the project all state that the odds of success are low, the reason for the project's failure will be the acceptance of the work to begin with.

Project failure sucks. People can lose jobs over it, and companies can go out of business. Of course, if every death march project ended with results like this, the antipattern wouldn't be very common. So, while there is some personal risk to project managers when projects go this way, there's usually enough plausible deniability and/or dubious accountability on the project that PMs can sometimes come out of them unscathed.

After sitting for a few moments and jotting down some of the contributing factors to the death march projects I've seen or been involved with, the following come to mind as major contributors to the problem:

Overly optimistic planning
This can take a lot of forms, but the most common I've seen is that the "best case" scenario is used in all planning, and subject matter experts are shut down when they bring up what happens once you leave the "golden path" - e.g. "well, that just can't happen, can it?"

Arbitrary goals/requirements that are not subject to any revision
Someone picks a date as the project end/delivery date, and there is absolutely no way this date can be moved. You'll be chided or screamed at for suggesting it. This often occurs when the only concern of the project planning phase was producing a plan that the stakeholder "could live with" - sacrificing actual achievability to do so.

Nebulous accountability for decisions 
If no one person will ever be held accountable for a decision, what is the risk in making a bad decision? A group can't be held accountable for a bad decision like a single person can. And groups absolutely suck at making decisions (it's its own antipattern!). This factor is becoming more and more prevalent in industry and business today, and I'd consider adding it as a new antipattern in and of itself.

Ignoring reality
So you're 75% of the way through a major project. Imagine you had 5 terabytes of data to move across the internet as part of it. You do the math and figure out it will take several days to move the data at the rates the networks in question can support. What would your reaction be to the client telling you that this is simply not acceptable? If your reaction isn't to book travel to the site where the servers are located so that you can personally carry a hard disk from one site to the other, it's not the reaction the customer and the project management staff are looking for. Never mind the fact that you let everyone involved know this in the first of many project meetings. It's only an issue now.
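
The "do the math" part is worth making concrete. Here's a quick back-of-the-envelope check in Python; the 100 Mbit/s effective throughput is my assumption for illustration, since the post doesn't name an actual link speed:

# Rough transfer-time estimate: decimal terabytes over a sustained link rate.
def transfer_days(terabytes: float, megabits_per_second: float) -> float:
    bits = terabytes * 1e12 * 8                      # TB -> bits
    seconds = bits / (megabits_per_second * 1e6)     # bits / (bits per second)
    return seconds / 86_400                          # seconds -> days

if __name__ == "__main__":
    for mbps in (100, 300, 1000):
        print(f"5 TB at {mbps:>4} Mbit/s is about {transfer_days(5, mbps):.1f} days")

At 100 Mbit/s sustained, 5 TB is roughly four and a half days of continuous transfer - exactly the kind of number that gets announced in the first project meeting and rediscovered, angrily, in the last one.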

I think most death march project phenomena fall into one of the above categories. It's almost as if, by doing four things differently, this antipattern might not occur at all. But, in keeping with the spirit of this post, let's take a look at what might happen in an organization that decides "failing at projects is OK, as long as we can sell new projects."

In the terminal phase of death march projects, companies tend to order project personnel to "pull out all the stops" to deliver. Usually this involves 14-hour days and 7-day weeks until people literally start to fall down on the job. And usually, it makes the problem worse (see Brooks's law).
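
One concrete reason Brooks's law bites, as a quick illustration of my own (not part of Brooks's original formulation): pairwise coordination paths grow quadratically with headcount, so every warm body added late to a project adds more communication overhead than output.

# Pairwise communication channels in a team of n people: n * (n - 1) / 2.
def comm_paths(n: int) -> int:
    return n * (n - 1) // 2

if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(f"team of {n:>2}: {comm_paths(n):>3} pairwise channels")

Doubling a team of 20 to 40 roughly quadruples the number of conversations needed for everyone to stay coordinated, on top of the ramp-up time the new people need.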

Consequences/Impact

I'm not going to look at or consider the well-enumerated consequences (e.g. failing to meet the terms of contracts), as those are well known when these projects are embarked upon. It should be no mystery to the C-suite what happens when they don't pull it off. No, let's look at what happens across the engineering team.

First and foremost, knowledge that they're working on a doomed project will damage the psyche of the participants. One of the first casualties of projects like this is the investment in the success of the project by the participants. Simply and frankly put, I'm not going to be emotionally invested in the success or failure of a project if the stakeholders don't take the work seriously. It also makes it appear as if the organization doesn't actually value your work - if they're willing to throw away your effort on pointless projects, how valuable can you actually be?

Then there's the factor of "it didn't have to be this way" - that is, if planned correctly and executed properly, almost anything is actually possible. And, some of the death march projects in question could end up being gems on a resume if they were successfully executed. 

Eventually, factors like the two above will cause a third impact your organization will feel - turnover. That is, eventually, the people who have to do the work on these projects will simply leave. They'll get tired of failing all of the time and the psychological damage that comes with it. Some may decide that their reputations are being sullied by associating with your organization, leave and actually deny working there in the future. Really. 

Mitigation... 

What can you do?

Before the death march

Make sure that, as an SME, you never pull punches or compromise your professional judgement just to make sure that a project gets approved. Candor in all estimates and honesty in your own estimation of your abilities are extremely important in this regard.

Don't commit the sin of over-optimistic planning, or be silently complicit in it. 

Insist upon accountability for decisions. 

Do not accept nebulous requirements and do not deny reality. 

During the death march

Document all decisions. Save the emails containing them outside of your email system. 

Do the same whenever you raise concerns. This is a form of "CYA", so you can't be the one thrown under the bus when the time comes for recriminations.

Keep trying. That's what your organization is paying you for. 

Don't make it worse, if it can be avoided. 

And after?

Every situation will be unique. If your organization is honest about learning from mistakes, there will likely be a series of "post-mortems" on the project. Even if there are no such official meetings, it is always beneficial for the people involved to sit down and go over the project in an attempt to learn lessons from it. A logical analysis of the scenario would note that, since the organization ended up in the death march antipattern to begin with, there are systemic issues that will most likely predispose it to falling into the antipattern again.

In this case, each individual should evaluate their career and decide whether they will continue to accept it (a form of compromise) or not. The "or not" part would usually entail leaving for greener pastures. As always, the reader should be cautioned against judging the suitability of a pasture based on the apparent shade of green on the other side of the fence. In plainer language: if you're 5 years from retirement, you might want to just ride it out; if you're 24 years old and new in the industry, why put up with it? I'm oversimplifying for illustrative purposes.

Wednesday, February 12, 2020

The Electronic Classroom


I was watching my oldest daughter do her algebra homework yesterday. In her school, all students are issued a chromebook, and class assignments are distributed via Google Classroom. Homework assignments are essentially electronic forms. She was out on the couch with me doing the work after telling me she might need some help with the math. I watched her work through a few problems until she got to one she struggled with.

As I opened my mouth to help her with the problem, she started randomly clicking answers on the form, and then submitted it before I could even offer advice. Predictably, she got more than half of the answers incorrect. She then told me "we can resubmit the form as many times as we want", scrolled up the page, took note of which answers were incorrect, and proceeded by trial-and-error elimination to select all of the correct answers. It took her about 3 minutes to do the entire 10-question homework assignment this way (systems of equations, intercept points and overlap sets).

She got 100% on the assignment.

I stared at her and asked, "But what happens on test day?", to which her answer was "Oh, I only get to submit it once". I'll chalk up her lack of concern with understanding the material to teenage idiocy, but I have some serious questions for the professional educators who came up with this framework.

The chromebook/electronic classroom initiative was originally touted as a way to teach our children in a 21st century method for the world of the 21st century, or something like that. Not that that actually encapsulates any meaning at all, but most people get the idea. Somewhere along the line, it would seem that cause and effect became conflated, as they always do in the minds of humans - "if we teach the kids on laptops, then they'll be smart", or something to that effect.

In an attempt to be as fair and balanced as I can, I went looking for some kind of paper, essay, or anything that actually laid out a justification for this program, and I found an article on the topic at https://www.goguardian.com/blog/technology/7-reasons-your-students-need-chromebooks-in-the-classroom/ that could at least provide some kind of justification.

From the article:
They .. help engage students, prepare them for careers (which is particularly important as Science, Technology, Engineering and Math fields continue to increase in demand) and close achievement gaps. 
But there's not a line indicating exactly how this happens. I suppose, just give the kid a chromebook, and poof! They're prepared for careers in STEM.

63% of students say the potential benefits of technology in the classroom outweigh the distractions. 
Are we letting middle and high school students set educational policy?
Chromebooks Help Teachers Gain Insight into Student Behavior
That's right - the school can ... monitor ... the children more effectively. The slide into Orwellian dystopia is not a topic for this blog, but I figured it was worth mentioning here. This is not a "pro" for the chromebook program, in my eyes. They certainly knew about it the next day when my daughter created a spoof profile on farmersonly.com while in math class. But I'm not sure how much insight was gained as to why she did this (ed. - it was funny, and she's been talked to about it).
Bringing Chromebooks into the classroom, schools can dramatically reduce paper needs. Teachers can manage tests, textbooks requirements, homework assignments, projects, and student reporting online.
Paraphrasing this one: "we're reducing paper and making the teacher workload more manageable, while making metrics available to parents online". It sounds good at face value, but being a corporate shill, here's what I hear: "we can put more kids in each classroom because Google Classroom picks up the workload from the teacher". It took me 30 seconds to come to that conclusion, and I imagine that a school administrator whose budget is a function of the number of kids sitting in classrooms did too. Making teachers' lives easier is not the mission of the school administration - serving the daytime education and babysitting needs of the local community is what schools do.

Let's hear it from the horse's mouth perhaps? Some words from the goog on using their classroom:
"We see students coming back time and time again to check out a Chromebook in the library. They love that Chromebooks are easy to use and lightweight to carry." 
-Jackie Radebaugh, Assistant Professor of Library Science, Columbus State University
Source: (https://edu.google.com/intl/en/products/chromebooks/?modal_active=none#casestudy-kippla)

That's it - that's the whole justification on Google's page about the chromebook: kids go to the library to borrow them, so it's a good program. Oh, and they're lightweight.

And my daughter still can't plot a system of equations on her graphing calculator.

Tuesday, February 11, 2020

On February 7th, NASA shared its initial findings from the Boeing Starliner orbital flight test investigation. The copy used by NASA in the press release is damning to Boeing's software engineering process.



Boeing, NASA, and U.S. Army personnel work around the Boeing CST-100 Starliner spacecraft shortly after it landed in White Sands, New Mexico, Sunday, Dec. 22, 2019. Photo Credit: (NASA/Bill Ingalls)

The press release lays out three major issues that occurred and states that the investigation made the following determinations:
  1. An error with the Mission Elapsed Timer (MET), which incorrectly polled time from the Atlas V booster nearly 11 hours prior to launch.
  2. A software issue within the Service Module (SM) Disposal Sequence, which incorrectly translated the SM disposal sequence into the SM Integrated Propulsion Controller (IPC).
     
  3. An Intermittent Space-to-Ground (S/G) forward link issue, which impeded the Flight Control team’s ability to command and control the vehicle.
 Regarding numbers 1 and 2 above:
  • Breakdowns in the design and code phase inserted the original defects.
     
  • Additionally, breakdowns in the test and verification phase failed to identify the defects preflight despite their detectability.
There's not really much that I can add to that. It would appear that despite having a process for software engineering, "breakdowns" in that process rendered it ineffective in this case.  If I were to hazard a guess, I imagine that at Boeing, the software engineering process is well documented and robust. And, most importantly, it may prevent software development from getting done in a manner that is compatible with project schedules.

As a milestone or deadline looms in the near distance, project teams may end up "just clicking through the step" on a code review. Or perhaps a verification/QA step gets the same treatment. At the end of the day, what you observe is the paradoxical situation of (a) there being evidence that the process was followed and (b) the process being completely ineffective.

I could be wrong in the above analysis - there could be another factor - e.g. the code review simply didn't catch the problem, or the use case that made the defect detectable wasn't in the validation steps. However, the scenario outlined in the previous paragraph is a situation that I have encountered going back to the first years of the 21st century.

In those days, the "ISO Audit" was when auditors came into the office and verified that the company followed its standards and practices. One day, during prep for an ISO audit at a major engineering firm where I was a "contractor", one of the firm's full-time employees gave me the following guidance:

"If they ask about the process, simply say 'there is a process' and direct them to me. The most important part is that we have a process". 

Indeed.




Saturday, February 8, 2020

Process, Projects and Building Great Things

For over 20 years, I have been working in the software development and engineering industry at companies that produce power plants and generators, drugs and, of course, software. And in those 20+ years of experience, I have seen, heard and learned quite a few things - things about people, code and organizations. In the interest of helping others in my industry learn from those years of experience, I have started this humble blog.

You see, over these decades, I have noticed a disturbing trend. At one point, our industry appeared to be not only averse to failure, but downright terrified of it. At the time, it wasn't uncommon to read a Slashdot article lamenting the aversion to risk that seemed pervasive in the software development industry. "It stifles innovation and creativity" was a common refrain. "There's no room for process improvement" was another. And these were all valid concerns.

Over the intervening decades, this sense of risk aversion seems to have waned to a point that could be called carelessness. Behold, the case that motivated me to begin putting bytes into the cloud: 

Starliner faced “catastrophic” failure before software bug found
During its quarterly meeting on Thursday, NASA's Aerospace Safety Advisory Panel dropped some significant news about a critical commercial crew test flight. The panel revealed that Boeing's Starliner may have been lost during a December mission had a software error not been found and fixed while the vehicle was in orbit.
(https://arstechnica.com/science/2020/02/starliner-faced-catastrophic-failure-before-software-bug-found/)

Boeing patched a software code error just two hours before the vehicle reentered Earth's atmosphere. Had the error not been caught, the source said, proper thrusters would not open during the reentry process, and the vehicle would have been lost. 

Forget about hot-patching a production e-commerce server a week before Christmas - these folks patched an orbiting spacecraft's flight control system to fix a bug. 

If you're reading this in early 2020, you will recall that Boeing is in the midst of another engineering project failure - that of the 737 MAX aircraft. Both the Starliner and the 737 MAX failed because of software defects. On the 737 MAX, two airframes were lost and 346 people lost their lives. After the second crash in less than five months, the FAA and other aviation authorities grounded the airliner. 

 
Grounded 737 MAX aircraft parked in Seattle at Boeing Field (photo from Wikipedia). Each aircraft cost its owners $99-$134 million. 

The reason I bring up these incidents is that they happened several months before the launch of Starliner, and it boggles the mind somewhat that Boeing could let software defects take down two major projects in one year. Alas, the previous sentence betrays an assumption - that Boeing let software defects take down two major projects. I don't think Boeing let anything happen by intention. Rather, Boeing probably had no idea they were even vulnerable to it.

Most of the time, failures like this can be attributed to failures in process in the organization developing the software. Now, in Boeing's case, since I don't work there or have any visibility into it, I really won't be able to comment on exactly what went wrong. But, with the experiences I've had in this industry, I'm pretty sure it was one of the phenomena that we'll be discussing in the future on this blog. 

So, let's dissect it a little, shall we? Time for a user story:

According to Wikipedia,

The Maneuvering Characteristics Augmentation System (MCAS) is a flight control law (software) embedded into the Boeing 737 MAX flight control system which attempts to mimic pitching behavior similar to the Boeing 737 NG. When it detects that the aircraft is operating in manual flight, with flaps up, at an elevated angle of attack, it adjusts the horizontal stabilizer trim to add positive force feedback (a "nose heavy" feel) to the pilot, through the control column.

In other words, the software modulates the way the aircraft flies so that the handling characteristics are like those of the 737 NG. This is so that pilots of the 737-800, -900 and other NG variants would also be qualified on the MAX.

What happens is that the MCAS software is "vulnerable to erroneous angle of attack data" when a sensor malfunctions. In that event, if there is no second sensor to cross-check the failed one, the software will begin to trim the nose down to compensate for an increasing angle of attack - which isn't actually happening.

The pilots begin fighting the aircraft to keep the nose up.

If they realize what's going on, they can navigate menu trees and disable the software. If they don't, they will crash the aircraft. If the pilots are able to disable MCAS, one of them will have to keep doing it every 5 minutes or so. If they don't, they'll crash the aircraft.

That's quite a story, isn't it? I wonder if the developers ever told it that way....
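
To make the missing cross-check concrete, here is a minimal sketch of my own - the names, thresholds and structure are invented for illustration and do not reflect the actual MCAS implementation - showing the difference between trusting a single angle-of-attack sensor and validating it against a second one:

# Illustrative only: single-sensor trust vs. a two-sensor cross-check.
AOA_DISAGREE_LIMIT_DEG = 5.5   # assumed disagreement threshold
AOA_TRIGGER_DEG = 15.0         # assumed "elevated angle of attack"

def command_single_sensor(aoa_left: float) -> str:
    """The failure mode in the story: act on one sensor unconditionally."""
    return "TRIM_NOSE_DOWN" if aoa_left > AOA_TRIGGER_DEG else "NO_ACTION"

def command_crosschecked(aoa_left: float, aoa_right: float) -> str:
    """Validate the input first: stand down if the two sensors disagree."""
    if abs(aoa_left - aoa_right) > AOA_DISAGREE_LIMIT_DEG:
        return "DISENGAGE_AND_ALERT_CREW"
    aoa = (aoa_left + aoa_right) / 2
    return "TRIM_NOSE_DOWN" if aoa > AOA_TRIGGER_DEG else "NO_ACTION"

if __name__ == "__main__":
    # A failed vane reads 40 degrees while the healthy one reads 3 degrees.
    print(command_single_sensor(40.0))        # TRIM_NOSE_DOWN (the wrong call)
    print(command_crosschecked(40.0, 3.0))    # DISENGAGE_AND_ALERT_CREW

It's the same input-validation lesson as always: don't act on data you haven't sanity-checked.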

Again, according to Wikipedia, here are some of the findings after the crashes of late 2018 and early 2019:

  • Boeing presented MCAS to the FAA as being existing technology, despite being new/novel on the MAX. The (dubious) justification for this was that similar software ran on the 767.
     
  • Just before entering certification, the functional requirements for MCAS were still changing. Boeing modified MCAS so that it intervened more strongly and at lower airspeeds than originally planned.
I selected these two points because they are the most damning. All of the engineering process controls in the universe will do your project exactly no good if the process is not adhered to and the project is not square with reality. Bending the truth ("MCAS isn't new tech") is a form of "lying to yourself" for an organization, and this form usually serves only the project management teams involved. What it did here was set an expectation with the regulators that the testing may not have needed to be as rigorous. 

The second bullet point is one I see all the time - paying lip service to the engineering process while basically doing whatever you want. If I were to hazard a guess, it's a form of the "we're following a process, so it will all work out OK" psychological error. 

I'm struggling to not find malfeasance here - but the problem is that I don't think I could ascribe it to a single role or individual at Boeing. And therein lies the fly in the ointment, and a theme that will be repeated time and again on this blog - that the bad decisions in question are rarely made by a single person. Typically, they're decisions made by a group of people and the priorities that make up the decision process aren't only engineering factors.

Project managers and marketing people end up making engineering decisions. Schedule comes before results. The "go-live" or "launch" date becomes the only milestone that matters on the project. Launch happens, and the PR is great. Then, some time later, the defects start rolling in. And if you're extremely unlucky, people start dying. If you're lucky, your operations people are working 50- and 60-hour weeks to do damage control while developers crank out fixes at what is basically gunpoint. More mistakes occur because you're in "firefighting" mode and because people are fatigued and stressed. Customer sentiment declines. 

What would adhering to processes and being honest have done for Boeing? Let's set aside the possibility that it might have prevented 346 deaths. Even if it hadn't, Boeing would have had the moral high ground of having "done the right thing". That counts for a lot when a customer is deciding whether or not to ever buy Boeing airframes to replace aging fleets. 

Could an individual programmer have raised the red flag on this and put on the brakes before the 737 MAX carried passengers? Probably. However, Boeing had "offshored" all of the development work to India/Bangladesh for this one. My suspicion is that software development shops whose entire purpose is to get software written to spec at the lowest price don't exactly have a culture of "one programmer can make a difference". 

So, in summation,
  • The engineering process seems to have been followed only pro forma with respect to MCAS
  • The MCAS programmers did not work for Boeing and likely had no vested interest in the success or failure of the product they were working on.
  • Boeing deliberately deceived regulators as to the novelty of MCAS 
Another person, with a little more aerospace industry experience, wrote up an article on this topic here. It's worth a read. He doubts the MAX can be fixed, but he also calls out many of the same process failures that I do. 

Let's get our industry fixed. Let's stop wasting money, time and the hours of our lives. Let's stop hurting and killing people with laziness.