Software bugs – a few actual cases
Using software to control electro-mechanical systems is not just common, it's pretty much all-pervasive in today's world. The unfortunate reality, though, is that software engineering is a relatively young field and that we humans are naturally prone to making mistakes. These factors can combine to cause unfortunate accidents when software doesn't execute in conformance with its design (which, of course, is when we call it buggy).
Several real-world examples of this exist; we highlight a few of them in the following sub-sections. The synopsis given here is really just that – (too) brief: to truly understand the complex issues behind failures like these, you need to take the trouble to study the technical crash (or failure) investigation reports in detail (do see the links in the Further reading section of this chapter). Here, I briefly summarize these cases to, one, underline the fact that software failures, even in large, heavily tested systems, can and do occur, and, two, motivate all of us involved in any part of the software life cycle to pay closer attention, to stop making assumptions, and to do a better job of designing, implementing, and testing the software we work on.
Patriot missile failure
During the Gulf War, the US deployed a Patriot missile battery in Dhahran, Saudi Arabia. Its job was to track, intercept, and destroy incoming Iraqi Scud missiles. But, on February 25, 1991, a Patriot system failed to do so, causing the death of 28 soldiers and injury to about 100 others. An investigation revealed that the root of the problem lay in the software tracking system. Briefly, the system uptime was tracked as a monotonically increasing integer value, in tenths of a second. It was converted to a real – floating-point – value in seconds by multiplying the integer by 1/10, which has a non-terminating (recurring) binary expansion: 0.00011001100110011001100110011001100... (a quick online fraction-to-binary calculator is available here: http://www.easysurf.cc/fracton2.htm). The trouble is, the computer used a 24-bit (integer) register for this conversion, so the value was truncated (chopped) at 24 bits. This caused a tiny loss of precision on every conversion, which only became significant once the time quantity grew sufficiently large.
This was exactly the case that day. The Patriot system had been up for about 100 hours; the loss of precision during the conversion thus translated to an error of approximately 0.34 seconds. That doesn't sound like much, except that a Scud missile travels at about 1,676 meters per second, resulting in a tracking error of about 570 meters. This was large enough for the incoming Scud to fall outside the Patriot tracking system's range gate, and so it went undetected – again, a case of loss of precision when converting an integer value to a real (floating-point) one.
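To get a feel for the arithmetic, here's a tiny, purely illustrative C program (my own sketch – certainly not the original Patriot code, which ran on very different hardware). It models the stored constant as 1/10 chopped after 23 fractional bits, per the widely cited analysis of the 24-bit register, and accumulates the per-count error over roughly 100 hours of uptime:

```c
/* patriot_err.c - a small, illustrative sketch (NOT the actual Patriot code).
 * It models the commonly cited analysis: 1/10 stored chopped to a fixed
 * number of fractional bits, yielding a per-count error of ~0.000000095,
 * which accumulates over ~100 hours of uptime.
 */
#include <stdio.h>

int main(void)
{
    /* Model: 1/10 chopped to 23 fractional bits (as per the widely
     * cited analysis of the 24-bit register); it's ~9.5e-08 too small.
     */
    const double stored_tenth = (double)(long)(0.1 * (1L << 23)) / (1L << 23);
    const double per_count_err = 0.1 - stored_tenth;

    const long counts = 100L * 3600 * 10;            /* ~100 hours, in tenths of a second */
    const double time_err = counts * per_count_err;  /* ~0.34 s */
    const double scud_speed = 1676.0;                /* meters per second */

    printf("per-count error : %.9f s\n", per_count_err);
    printf("time error      : %.2f s (after ~100 hours)\n", time_err);
    printf("tracking error  : ~%.0f m\n", time_err * scud_speed);
    return 0;
}
```

Running it reproduces the figures quoted above: an accumulated error of about 0.34 seconds, translating to a tracking error of well over 500 meters.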
The ESA's unmanned Ariane 5 rocket
On the morning of June 4, 1996, the European Space Agency's (ESA's) Ariane 5 unmanned rocket launcher took off from the Guiana Space Centre in French Guiana, on the northeastern coast of South America. A mere 40 seconds into its flight, the rocket went out of control and exploded. The final investigation report revealed that the primary cause ultimately came down to a software overflow error.
It's more complex than that, of course; a brief summary of the chain of events leading to the loss of the rocket follows. (In most cases like this, it's not one single event that causes an accident; rather, it's a chain of several events.) The overflow occurred during the execution of code converting a 64-bit floating-point value to a 16-bit signed integer; the unprotected conversion raised an exception (an Operand Error; the programming language was Ada). This, in turn, occurred because an internal variable (BH – Horizontal Bias) took on a much higher value than expected. The exception caused the shutdown of both Inertial Reference Systems (SRIs). This, in turn, caused the primary onboard computer (OBC) to send erroneous commands to the nozzle deflectors, resulting in full nozzle deflection of the boosters and the main Vulcain engine, which made the rocket veer dramatically off its flight path.
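The flight software was written in Ada; purely as an illustration, here's a small C sketch contrasting a protected narrowing conversion with an unprotected one (the function and variable names are mine, and the sample value is hypothetical):

```c
/* ariane_conv.c - an illustrative sketch in C (the original flight code
 * was in Ada). Converting a 64-bit floating-point value to a 16-bit
 * signed integer is only safe if the value fits; an unprotected
 * conversion is undefined behavior in C (in Ada, it raises an exception
 * - the 'Operand Error' seen on Ariane 5).
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* A protected conversion: reports failure instead of overflowing */
static bool f64_to_i16(double val, int16_t *out)
{
    if (val < INT16_MIN || val > INT16_MAX)
        return false;        /* out of range; the caller must handle it */
    *out = (int16_t)val;
    return true;
}

int main(void)
{
    /* Hypothetical: a 'horizontal bias'-like value far larger than the
     * designers expected (as happened on the Ariane 5 trajectory).
     */
    double horizontal_bias = 65000.0;
    int16_t converted;

    if (!f64_to_i16(horizontal_bias, &converted))
        printf("conversion would overflow; handled gracefully\n");
    else
        printf("converted value: %d\n", converted);

    /* An unprotected cast here - (int16_t)horizontal_bias - would be
     * undefined behavior in C, since the value doesn't fit.
     */
    return 0;
}
```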
The irony is that the SRI function that overflowed – the alignment function – serves no purpose after lift-off; but, to cater to a possible hold (delay) in the launch window, the design (carried over from Ariane 4) specified that it remain active for 50 seconds after launch! An interesting analysis of why this software exception wasn't caught during development and testing (https://archive.eiffel.com/doc/manuals/technology/contract/ariane/) concludes that the fault was, at its heart, a reuse error:
"The SRI horizontal bias module was reused from a 10-year-old software, the software from Ariane 4."
Mars Pathfinder reset issue
On July 4, 1997, NASA's Pathfinder lander touched down on the surface of Mars and proceeded to deploy its smaller robot cousin – the Sojourner rover, the very first wheeled vehicle to roam another planet! The lander, however, suffered from periodic reboots; the problem was ultimately diagnosed as a classic case of priority inversion – a situation where a high-priority task is kept waiting while lower-priority (and, crucially, medium-priority) tasks get to run. By itself, this needn't cause a serious issue; the trouble was that the high-priority task was kept off the CPU long enough for a watchdog timer to expire, causing the system to reboot.
The irony here is that a well-known solution exists – enabling the priority inheritance feature of the semaphore object (the task holding the semaphore lock has its priority temporarily raised to that of the highest-priority task waiting on the lock, enabling it to complete its critical section and release the lock quickly, thus preventing unbounded priority inversion). The VxWorks RTOS employed here defaulted to having the priority inheritance attribute turned off, and the Jet Propulsion Laboratory (JPL) team left it that way. Because they had (very deliberately) designed the lander to continuously stream telemetry and debug data to Earth, they were able to determine the root cause and fix it remotely – by enabling the semaphore's priority inheritance attribute (a small sketch of the equivalent mechanism on Linux appears at the end of this sub-section). An important lesson here is this one, as the team lead Glenn Reeves put it:
"We test what we fly and we fly what we test."
I'd venture that these articles (see the Further reading section) are a must-read for any system software developer!
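As promised, here's a minimal sketch of the equivalent knob on Linux: with POSIX threads, a mutex can be created with the priority inheritance protocol enabled. This is, of course, not the original VxWorks code – merely an illustration of the same concept on a Linux system.

```c
/* prio_inherit.c - a minimal sketch showing how, on Linux (with POSIX
 * threads), a mutex can be created with the priority inheritance
 * protocol enabled - conceptually the same knob the JPL team turned on
 * (on VxWorks, the roughly analogous option is the SEM_INVERSION_SAFE
 * semaphore flag).
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;
    int ret;

    pthread_mutexattr_init(&attr);
    /* Enable the priority inheritance protocol: a thread holding this
     * mutex temporarily inherits the priority of the highest-priority
     * thread blocked on it, preventing unbounded priority inversion.
     */
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (ret) {
        fprintf(stderr, "setprotocol failed: %s\n", strerror(ret));
        return 1;
    }
    pthread_mutex_init(&lock, &attr);

    /* ... real-time threads would now share 'lock' safely ... */

    pthread_mutex_destroy(&lock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}
```

Build it with gcc -pthread; the real-time threads that would actually contend for the lock are deliberately elided here.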
The Boeing 737 MAX aircraft – the MCAS and the lack of flight crew training
Two unfortunate accidents, taking 346 lives in all, put the Boeing 737 MAX under the spotlight: the crash of Lion Air Flight 610 from Jakarta into the Java Sea (October 29, 2018) and the crash of Ethiopian Airlines Flight 302, en route from Addis Ababa to Nairobi (March 10, 2019). These incidents occurred just 13 and 6 minutes after take-off, respectively.
Of course, the situation is complex. At one level, this is what likely caused these accidents: once Boeing determined that the aerodynamic characteristics of the 737 MAX left something to be desired, they worked on fixing it via a hardware approach. When that did not suffice, engineers came up with what seemed to be an elegant and relatively simple software fix, christened the Maneuvering Characteristics Augmentation System (MCAS). Two sensors on the aircraft's nose continually measure its angle of attack (AoA). When the AoA is determined to be too high – which typically indicates an impending stall (dangerous!) – the MCAS kicks in, (aggressively) moving the horizontal stabilizer on the tail, pushing the nose down and stabilizing the aircraft. But, for whatever reasons, the MCAS was designed to read only one of the two sensors! If that single sensor failed or produced a faulty reading, the MCAS could activate on its own, pushing the nose down and causing the aircraft to rapidly lose altitude; this is what seems to have actually occurred in both crashes.
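Purely to illustrate the design point (this is in no way the actual avionics logic, and the thresholds are made-up), here's a tiny C sketch of why cross-checking redundant sensors matters: if the two readings disagree, the automation should disengage rather than trust a single, possibly faulty, input:

```c
/* aoa_check.c - a purely illustrative sketch (NOT actual avionics code).
 * Relying on a single sensor is a single point of failure; cross-checking
 * two sensors and disengaging on disagreement is one common mitigation.
 */
#include <stdio.h>
#include <math.h>
#include <stdbool.h>

#define AOA_DISAGREE_THRESHOLD 5.0   /* degrees; hypothetical value */
#define AOA_STALL_THRESHOLD    14.0  /* degrees; hypothetical value */

/* Decide whether an automatic nose-down command may be issued */
static bool nose_down_permitted(double aoa_left, double aoa_right)
{
    /* If the two AoA sensors disagree significantly, trust neither:
     * disengage the automation and alert the crew.
     */
    if (fabs(aoa_left - aoa_right) > AOA_DISAGREE_THRESHOLD) {
        printf("AoA disagree: automation disengaged, crew alerted\n");
        return false;
    }
    /* Both sensors agree; act only if the AoA really is too high */
    return ((aoa_left + aoa_right) / 2.0 > AOA_STALL_THRESHOLD);
}

int main(void)
{
    /* A faulty left-sensor reading versus a sane right-sensor reading */
    printf("permitted? %s\n", nose_down_permitted(74.5, 15.3) ? "yes" : "no");
    /* Both sensors agree and the AoA really is too high */
    printf("permitted? %s\n", nose_down_permitted(16.0, 15.8) ? "yes" : "no");
    return 0;
}
```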
Further, many pilot crews weren't explicitly trained in managing the MCAS (some claimed they weren't even aware of it!). The pilots of the ill-fated flights apparently could not override the MCAS, even though no actual stall had occurred.
Other cases
A few other examples of such cases are as follows:
- June 2002, Fort Drum: a US Army report maintained that a software issue contributed to the death of two soldiers. The incident occurred while they were training to fire artillery shells. Apparently, unless the target altitude is explicitly entered into the system, the software assumes a default of zero – and Fort Drum is about 679 feet above sea level.
- In November 2001, a British engineer, John Locker, noticed that he could easily intercept American military satellite feeds – live imagery from US spy planes over the Balkans. The almost unbelievable reason: the stream was being transmitted unencrypted, enabling pretty much anyone in Europe with a regular satellite TV receiver to view it! In today's context, many IoT devices have similar issues...
- Jack Ganssle, a veteran and widely known embedded systems developer and author, brings out the excellent TEM – The Embedded Muse – newsletter bi-monthly. Every issue has a section entitled Failure of the Week, typically highlighting a hardware and/or software failure. Do check it out!
- Read the web page on Software Horror Stories here (http://www.cs.tau.ac.il/~nachumd/horror.html); though old, it provides many examples of software gone wrong with, at times, tragic consequences.
- A quick Google search on Linux kernel bug stories yields interesting results: https://www.google.com/search?q=linux+kernel+bug+story.
Again, if interested in digging deeper, I urge you to read the detailed official reports on these accidents and faults; the Further reading section has several relevant links.
By now, you should be itching to begin debugging on Linux! Let's do just that – begin – by first setting up the workspace.