Have you ever watched your application or operating system update itself on a PC or tablet? Stupid question, of course you have. Have you noticed how many updates there are these days? Code quality is declining rapidly, and we have ourselves to blame for it. Software developers have managed to convince us that beta testing of already published software is something normal, even with full-price apps and paying customers. Haven't we learned anything from a dramatic and tragic history? Here are some examples, and a direct link between yesterday's negligence and today's laziness. Read why you should refactor your code.
When we say 'a bug', we often think of the infamous 'blue screen of death' in Windows, or of Y2K, the long-forgotten madness on the brink of a new century. There were, however, software bugs far more profound and costly, with costs counted in billions of dollars as well as human lives.
The first stop is the Therac-25 disaster, in which a coding error led to the deaths of at least six patients. The Therac-25 was a radiation therapy machine used in cancer treatment. It has been proven that poorly written software and an insufficient system development process caused radiation overdoses between June 1985 and January 1987. It was the fault of inexperienced coders.
A similar but less deadly accident happened at St. Mary's Mercy Medical Center in Grand Rapids, Michigan. In 2003, a software glitch accidentally 'killed' 8,500 people, all because of faulty patient management software. This coding error was rather innocent compared to the previous one, but it still caused plenty of complications. Nobody died, but the false death reports disrupted work at Social Security offices and led to chaos in Medicare. Perhaps the most important aspect of the whole situation is that it's unclear whether the coding error was ever corrected. There were no further mentions of the false death reports; St. Mary's hospital simply decided to purchase new software.
The Mariner 1, a space probe launched back in 1962 to perform a Venus flyby, failed not only because of human error in operating it, but also because of the 'most expensive hyphen in history'. Improper operation of the Atlas airborne beacon equipment caused a loss of the rate signal from the vehicle, leaving the beacon inoperative. Additionally, it was discovered that a missing hyphen in the computer instructions (in the data-editing program, to be exact) allowed transmission of incorrect guidance signals to the spacecraft. The computer wrongly accepted the frequency of the ground receiver and added this data to the tracking data sent to the remaining guidance systems. As a result, the computer steered the spacecraft into a series of unnecessary course corrections. This threw the spacecraft off course, forcing NASA to give the 'autodestruct' order.
How does CSHARK approach bugs and software development challenges? Kamil Kwećka, a board member:
A few months ago, as a software product development company, we received an inquiry from our partner. The task turned out to be simple: take over and rescue the project. Preferably yesterday. It was a case of implementation and a large data migration, plus integration with an external platform. That was the scope of the product development services. An additional challenge was the implementation of a system for automatic data updates, based on entry-level data and information already in the system itself. The problem lay in the management: the partner's previous vendor did not want to give up the code (!), and when we finally obtained it, it turned out to have been sent... without technical documentation. We had to understand where the project stood at that point, identify the 'gaps' and estimate the costs. On top of that, our second team was given a brand new functionality to write and implement.
We organized a team that took over the knowledge from the partner, clarified the requirements, estimated the scale of the problem and turned it into a working product, delivering the implementation in just a few iterations. The final phase was the handover of the project. Congratulations go to the development teams and their leaders: without excellent time-management skills, none of that would have been possible.
10 years and 7 billion dollars. That's the time and the cost of the Ariane-5, a rocket that burst into flames seconds after launch. Developed by the European Space Agency (ESA), it was supposed to be a pinnacle of engineering. Unfortunately, a portion of code reused from the Ariane-4 sealed its fate. During the flight, one of the Ariane-4 functions, totally redundant in its successor, reported an error. Its interpretation led to another error in the rocket's inertial navigation system, causing the main computer to order a sudden shift of 20 degrees. Even the backup system failed in the same way, sending a multi-billion-dollar project down in flames. Literally.
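The error in question is well documented in the accident report: a 64-bit floating-point value was converted to a 16-bit signed integer without a range check, and the resulting overflow exception went unhandled, shutting down both navigation computers. A minimal Python sketch of the failure mode and one defensive alternative (illustrative only; the actual flight code was written in Ada):

```python
# Sketch of the Ariane-5 failure mode: converting a wide float into a
# 16-bit signed integer. The reused Ariane-4 code assumed the value
# would always fit, so the overflow exception was never handled.
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16_unchecked(x: float) -> int:
    """What the reused code effectively did: assume the value fits."""
    if not (INT16_MIN <= x <= INT16_MAX):
        # In flight, this exception went unhandled and shut the system down.
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return int(x)

def to_int16_saturating(x: float) -> int:
    """One defensive alternative: clamp out-of-range values instead."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))

print(to_int16_saturating(65000.0))   # clamps to 32767 instead of crashing
```

Clamping is not automatically the right answer either; the real lesson is that the reuse of Ariane-4 code silently carried over a range assumption that no longer held.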
Do you want to keep both feet on the ground? Let's stay in the clouds a little longer, because the next example is very tasty. In 1999, a probe worth 700 million dollars, the Mars Climate Orbiter, arrived at Mars. After some time the probe went dark, making its way to the other side of the planet and losing contact with NASA. Both events were expected, so there was no reason to panic. Until NASA found out that the connection couldn't be re-established, because the probe had burned up in Mars' atmosphere. The investigation showed multiple errors along the way, but the main reason for losing the taxpayers' money was a miscommunication over the dedicated software. The part of the software delivered by Lockheed Martin produced its results in imperial units, while the part made by NASA expected metric units. Incomprehensibly, the inconsistency was ignored even after it was found. The error was severe enough to corrupt the calculated trajectory. The consequence was a far too low approach to the planet and the annihilation of the probe in Mars' atmosphere.
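The mismatch is easy to reproduce: one module reports thruster impulse in pound-force seconds, the consumer assumes newton-seconds, and every value is silently off by a factor of about 4.45. A minimal Python sketch (the function name and values are hypothetical illustrations, not the actual ground software):

```python
# Illustrative sketch of the Mars Climate Orbiter unit mismatch: a
# producer emits impulse in pound-force seconds (lbf*s), the consumer
# assumes newton-seconds (N*s). Names and values are hypothetical.
LBF_TO_NEWTON = 4.44822  # 1 pound-force expressed in newtons

def impulse_from_vendor_lbf_s() -> float:
    """Hypothetical vendor output: thruster impulse in lbf*s."""
    return 100.0

# Bug: the value is consumed as if it were already in N*s.
assumed_newton_seconds = impulse_from_vendor_lbf_s()

# Fix: convert explicitly at the interface boundary.
actual_newton_seconds = impulse_from_vendor_lbf_s() * LBF_TO_NEWTON

print(assumed_newton_seconds)  # 100.0 -- silently off by a factor of ~4.45
print(actual_newton_seconds)   # 444.822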
OK, now we can stay on Earth. In 2009 and 2010, reports were flying around linking Lexus and Toyota vehicles to road accidents and collisions. The cars would simply... not stop. The result? 89 deaths and 52 injuries. The problem is known under the name 'sudden unintended acceleration' (SUA) and goes back to 1986, when a study titled 'Sudden Acceleration' was published by the American National Highway Traffic Safety Administration (NHTSA). Although the problem wasn't as severe in the 80s due to the lack of electronics on board ('drive by wire', to be exact), the symptom was already present: a lack of quality control. Michael Barr, an expert asked to give an opinion in Toyota's case, stated that a number of bugs in the source code of the electronic throttle control system (ETCS) were to blame. That's not all. Barr stated that Toyota's source code was full of bugs, and its complexity metrics alone suggested the presence of additional, undiscovered ones. That's still not all. Barr also found the fail-safes so defective and weak that the 'house of cards' expression was not only justified but mandatory.
Another example: in 2004, the company EDS introduced an overly complex IT system to aid the U.K.'s Child Support Agency (CSA). At the exact same time, the Department for Work and Pensions (DWP) was restructuring the entire agency. The incompatibility of the systems and the irreversible nature of the errors resulted in overpayments to 1.9 million people, underpayments to 700,000 people, 7 billion dollars in uncollected child support payments and a backlog of 36,000 cases stuck in the system, unsolved. The cost of this nightmare? Over 1 billion dollars, as of May 2018.
In 1991, a system error cost the lives of 28 American soldiers. A U.S. Patriot missile defence system stationed in Saudi Arabia failed to detect an attack on army barracks. A government report stated that a software development problem led to an 'inaccurate tracking calculation that became worse the longer the system operated.' On the day of the tragedy, the system had been running for over 100 hours. The inaccuracy was so large that the system overlooked the incoming missile, looking at the wrong (supposedly safe) part of the sky.
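The mechanism behind that worsening calculation is well documented: the system counted time in tenths of a second, but 0.1 has an infinite binary expansion, so the stored value was chopped and the tiny error accumulated with every tick. A short Python sketch of the effect, following the widely cited analysis (an illustrative reconstruction, not the actual system code):

```python
# Illustrative reconstruction of the Patriot clock drift: the interval
# 0.1 s was stored as a chopped binary fraction in a fixed-point
# register too small to hold it exactly.
FRACTION_BITS = 23  # effective fraction bits, per the published analysis

# 0.1 has an infinite binary expansion; chopping loses ~9.5e-8 per tick.
stored_tenth = int(0.1 * 2**FRACTION_BITS) / 2**FRACTION_BITS

ticks = 100 * 3600 * 10  # number of 0.1 s ticks in 100 hours of uptime
drift = ticks * 0.1 - ticks * stored_tenth
print(f"clock drift after 100 hours: {drift:.4f} s")  # ~0.3433 s
```

At the speed of an incoming Scud, a third of a second corresponds to hundreds of meters, enough for the tracking window to miss the missile entirely.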
There are of course many other examples of horrific bugs, mistakes and bad decisions. Honourable mentions go to:
All these examples prove that product development services management, communication, proper code refactoring and quality assurance go a long way, especially where human lives are at stake. The Internet has made developers and publishers lazy. Everything can be updated, refined, and published under the umbrella of a 'definitive edition' or early access (video games). It can also be patched indefinitely and explained away as an 'adjustment' to a new operating system, which is standard practice on mobile. By the way, have you noticed how many 'under the hood' updates have appeared in recent years? Patch notes don't come with detailed listings anymore; who would want to show off their incompetence?
People tend not to communicate with each other enough, and this is a crucial mistake. Every member of the team should have the same understanding of the project, the requirements, the current situation, and so on. Every risk should be reported immediately, and QA should work closely with developers and product managers.
Documentation, both functional and technical, plays an enormous role. This is a serious problem in long-term projects: documentation is often not updated! It's also worth performing a handover, a session in which team A shares its knowledge with team B. It's a good idea, too, to appoint someone from team A as a contact for the other team in the first stage of the project. One-hour daily consultations would be beneficial in that situation.
If two groups work on the same project, let them use the same board in TFS (or any other platform, for that matter), split into user stories per team and then into tasks per person. A daily call is also a handy invention: it will tell you who did what and what they are struggling with.
We all work as one team, and every member must take responsibility for it. Awareness of the impact every member has on the project results in the quality and cleanliness of the code. It also affects the quality engineers; everyone depends on each other.
The Project Manager should grasp it all and make an effort to keep the communication pipeline clean. He or she should also define the risks, preferably at the very beginning of the software product development process, alongside the contingency plans.
People, by definition, don't like to admit a lack of knowledge. They will work on something all day when they should ask someone more experienced instead; asking saves time and limits the risk of a mistake. The Project Manager should also motivate the team: motivated employees work more efficiently, which translates into the quality of the project. That's how we can ensure the quality of the code and the safety of the people.