The Most Dangerous Bug


There was a thread on Twitter recently from people who had caused major problems in software systems, and it reminded me of the worst bug I've ever seen. Long ago I published this story with a major tech media vendor, but that article seems to have succumbed to bit rot, so I am going to tell the story here. It happened more than twenty years ago, when I had only been in QA for a couple of years, but it is seared into my mind...

In the late 1990s I worked for a company that handled 911 landline location software for major telcos. So when you choke on that chicken bone and call 911 and can't talk, they send the ambulance to where you live anyway. We handled about 75% of all the telephone numbers in the USA.

From time to time in those days there would be an area code split: a new area code was added to a populous area, and phone numbers that had carried the old area code were assigned the new one. In the 911 world, we would maintain records for both old and new area codes for a certain amount of time, but eventually we would delete the numbers under the old area codes to save space on the system and use only the new area codes. This was a completely routine operation.

Except for this one time. We arrived at the point where we intended to delete the obsolete numbers from the system for a large Midwestern US state. There was a particular bit of code that identified the obsolete numbers. That code was run by analysts, not by programmers or sysadmins. This code was put in the hands of a poor newbie analyst. They configured it incorrectly, and it identified for deletion all of the old numbers AND ALL OF THE NEW NUMBERS. This list was delivered to our sysadmin to execute.
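In hindsight, a list like that could have been checked mechanically before anyone acted on it. What follows is only a minimal sketch of the idea, not the tool we actually had: the function name, the ten-digit number format, and the area codes are all made up for illustration.

```python
def find_unexpected_numbers(deletion_list, old_area_codes):
    """Return every number whose area code is NOT one of the retired codes.

    Assumes each number is a 10-digit string whose first three digits are
    the area code (a hypothetical format chosen for this sketch).
    """
    retired = set(old_area_codes)
    return [number for number in deletion_list if number[:3] not in retired]


# Hypothetical data: 555 is the retired area code, 556 is the new one.
deletion_list = ["5552010001", "5552010002", "5562010001"]
unexpected = find_unexpected_numbers(deletion_list, ["555"])
if unexpected:
    print(f"Refusing to proceed: {len(unexpected)} number(s) are outside the retired area codes")
```

A check like this only catches numbers under the wrong area code. It says nothing about whether the right subset of old numbers was selected.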

The sysadmin was named Kevin. No one liked Kevin. He was not a nice person. Kevin took one look at this file and said, "Whoa, this file is WAY too big. Something is wrong." The managers told Kevin to delete the numbers anyway. Kevin resisted. Kevin was threatened with being fired, so he started running the numbers through the delete process.

The delete process was a script that had been created by sysadmins (not developers) in the earliest days of the company, and it had never gone through formal QA, which was where I worked. I had never seen this script. Because this is 911, we always make backups and copies, and this is how the script worked:

READ THE NUMBER
WRITE THE NUMBER TO A FILE FOR BACKUP
DELETE THE NUMBER PERMANENTLY FROM THE NETWORK
REPEAT

This is where it gets interesting: this software ran on the Tandem computer, an old mainframe-class system. It still exists today; after many acquisitions it is now known as HPE NonStop, and it is still in use in certain industries. The thing about Tandem is that when you create a file, you have to declare the size of the file at the time you create it, and you can't exceed the size that you declare.

I'm sure you can see where this is going.

Kevin ran the numbers through the script. The script wrote the backup records to the file. The file filled up. Because the script had never been subjected to our rigorous development process, it had never occurred to its creators to catch the error when the file filled up. So the script did this:

READ THE NUMBER
WRITE THE NUMBER TO A FILE FOR BACKUP
GET AN ERROR THAT THE FILE WAS FULL
FAIL TO CATCH THE ERROR
DELETE THE NUMBER PERMANENTLY FROM THE NETWORK
REPEAT
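
The missing step is small: stop the moment the backup write fails, before the next delete. Here is a minimal sketch of that rule in Python. The original script ran on Tandem and I never saw its source, so every name here (`FixedSizeBackup`, `BackupFullError`, `delete_with_backup`) is a hypothetical stand-in, and the "network" is just an in-memory set.

```python
class BackupFullError(Exception):
    """Raised when the backup file cannot accept another record."""


class FixedSizeBackup:
    """Stand-in for a backup file whose capacity is fixed at creation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []

    def write(self, number):
        if len(self.records) >= self.capacity:
            raise BackupFullError(f"backup full at {self.capacity} records")
        self.records.append(number)


def delete_with_backup(numbers, backup, live_records):
    """Delete each number only after its backup write has succeeded.

    Returns the count of deleted numbers. A BackupFullError propagates to
    the caller, which stops the loop before the next delete happens.
    """
    deleted = 0
    for number in numbers:
        backup.write(number)          # raises if the backup is full
        live_records.discard(number)  # reached only if the write succeeded
        deleted += 1
    return deleted


# Hypothetical run: three live records, but room for only two backups.
live = {"5552010001", "5552010002", "5552010003"}
backup = FixedSizeBackup(capacity=2)
try:
    delete_with_backup(sorted(live), backup, live)
except BackupFullError as err:
    print(f"Stopped: {err}; {len(live)} record(s) left untouched")
```

The point of the design is ordering: the delete sits after the write, and nothing swallows the write's failure, so a record can never be deleted without a backup copy existing first.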

We deleted most of the 911 location records for a major Midwestern US state. Being 911, we had backups of the data, but we had deleted so many records that our original idea was that it would be faster to give someone a physical tape and put them on an airplane from Colorado to the Midwest in order to restore the records locally from the tape. Ultimately one of our more brilliant programmers devised a compression scheme on the spot that let us update the records over the network.

We were so very thankful that no major disasters happened in the 36 hours or so that the 911 location information was missing. A big fire or a chemical spill or something like that would have been a problem of epic, historical proportions.