Study the style of doing science from successes and engineering from failures

Failure avoidance is the essence of engineering

Dec 30, 2020

In “The Art of Doing Science and Engineering”, Hamming repeatedly says that it is better to learn from success stories:

There are so many ways of being wrong and so few of being right, studying successes is more efficient.
Look at your successes, and pay less attention to failures than you are usually advised to do in the expression, "Learn from your mistakes". While playing chess, Shannon would often advance his queen boldly into the fray and say, "I ain't scaird of nothing". I learned to repeat it to myself when stuck, and at times it has enabled me to go on to a success.

I think this is probably the right approach for research, but not for engineering. I think that failure avoidance is the essence of engineering (à la Taleb's Via Negativa). Implementing functions is often easy. The harder part is ensuring that the functions don't have bugs, don't fail under resource constraints or at scale, degrade gracefully, and reliably report their failures so that we can notice and fix problems quickly.
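To make the contrast concrete, here is a minimal Python sketch. Everything in it is hypothetical and only for illustration: `get_price`, `fetch_live`, and the `cached` dictionary are names I made up, not part of any real system. The "function" itself is a single call; most of the code is about what happens when that call fails.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("pricing")  # hypothetical logger name


def get_price(
    item_id: str,
    fetch_live: Callable[[str], float],
    cached: Dict[str, float],
) -> Optional[float]:
    """Return the live price, or degrade to the last known price on failure."""
    try:
        # The "easy" part: the actual functionality is one call.
        price = fetch_live(item_id)
    except Exception:
        # Report the failure (with traceback) so that someone can notice it.
        logger.exception("live price lookup failed for %s", item_id)
        # Degrade gracefully: serve the last known value, or None if there is none.
        return cached.get(item_id)
    # Remember the fresh value so that a later failure has something to fall back on.
    cached[item_id] = price
    return price
```

Even in this toy version, the failure handling is longer than the happy path, and it still ignores timeouts, stale cache entries, and retries.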

Apart from pure technical risks, engineers also have to manage project risks and product risks.

Hamming also writes about this duality between science and engineering:

In science, if you know what you are doing you should not be doing it. In engineering, if you do not know what you are doing you should not be doing it.

But then he notes that there is rarely such a thing as pure science or pure engineering:

Of course, you seldom, if ever, see either pure state. All of engineering involves some creativity to cover the parts not known, and almost all of science includes some practical engineering to translate the abstractions into practice.

This does not mean that engineers should study only failure stories and never success stories; that would be absurd. But I still think that engineers should lean towards learning from stories about engineering mistakes rather than from reports about successful engineering projects.

Furthermore, engineers often write success stories at the "Golden Age" moment, soon after the system has been deployed to production. All the engineers who designed the system still work on it, so the maintainability risks haven't had a chance to manifest yet. The production load hasn't changed much since the system was designed and deployed, so the scalability risks are also invisible. And so on.

System failure is not a question of "if" but of "when". Of course, this doesn't mean that we should give up and not build anything at all. But I think that the measure of engineering success is the total lifespan of the system, not whether it ever worked at all. Hence I think there is little to learn from stories reported immediately after the system went live.

In the software engineering community, people often share their experiences in blogs. On the one hand, this is great and something that I wish other engineering communities embraced more. On the other hand, the majority of the stories in the engineering blogs of large technology companies are success reports of the kind that I described above. (By the way, many of the "research" papers coming from these companies are exactly the same kind of report, starting with the famous MapReduce paper.)

These stories sometimes mention the previous system that the new "successful" one replaced and why the replacement was needed (i.e., the ways in which the previous system started to fail), but this is never the focus of the story. In fact, the future may reveal that the old system was more successful than the new one because it stayed in production for longer. But the factors that contributed to a system's longevity are rarely investigated: to fully appreciate them, one has to wait until the system finally fails, at which point it's more fashionable to write about the shiny new stuff than about the legacy system.

I don't mean that success reports are not useful at all: they can be helpful as blueprints for solving problems very similar to the ones the authors had. But I doubt that there is much to transfer-learn from them for engineers who go through lists of papers in the hope of becoming good system designers by reading them. In other words, the MapReduce paper was useful for the engineers who built Apache Hadoop and, perhaps, later Apache Spark, but not for most other programmers who have read it.

I think engineers tend to write success stories for psychological reasons, and also because companies want to create an attractive engineering brand. However, as learners, we should pay much closer attention to the rare stories of failure and long-term reflection, even though they may describe technologies that are long obsolete.

For example, we know that MapReduce and the Google File System have been successful (long-lived) systems, but it is not from the respective papers themselves that we can learn why. In 2009 (8 years since GFS and 5 years since MapReduce), Barroso et al. wrote in "The Datacenter as a Computer": "In theory, a distributed system with a single master limits the resulting system availability to the availability of the master. Centralized control is nevertheless much simpler to implement and generally yields more responsive control actions. At Google, we tended towards centralized control models for much of our software infrastructure (like MapReduce and GFS). Master availability is addressed by designing master failover protocols."

A fantastic story has been shared recently about how the decision to rewrite a mobile app triggered a chain of unexpected contingencies that nearly brought down the company's operations.

I think we can also learn a lot from debugging stories.

Post-mortems are also interesting. However, they should be taken with a grain of salt: SaaS and cloud providers will never reveal underlying system conditions that are hard to fix (or that they are unwilling to fix), because doing so would tell customers that the provider is likely to fail in a similar way in the future.

Engineering Ideas
