Engineering Ideas #1
TLA+ in prod, caching, tooling not eliminating our jobs, positive vs. negative thinking about incidents, result & opportunity orientation, building trust, and via negativa
This year, I’m starting to record ideas about engineering that I encounter in my reading (listening, watching) and find novel, powerful, or just interesting. The scope is software, systems, reliability, and data engineering, as well as software operations and relationships in engineering organizations (peopleware).
Use of Formal Methods at Amazon Web Services
Amazon engineers Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, and Michael Deardeuff describe how TLA+ can be useful in the development of industrial systems. This idea may have sounded more unusual 6 years ago; now there is a whole book on the subject: Practical TLA+ by Hillel Wayne (see review).
A precise, testable, well commented description of a design is an excellent form of documentation. Documentation is very important as our systems have unbounded lifetime. Over time, teams grow as the business grows, so we regularly have to bring new people up to speed on systems. We need this education to be effective. To avoid creating subtle bugs, we need all of the engineers to have the same mental model of the system, and for that shared model to be accurate, precise and complete. Engineers form mental models in a variety of ways; talking to each other, reading design documents, reading code, and implementing bug fixes or small features. But talk and design documents can be ambiguous or incomplete, and the executable code is far too large to absorb quickly and might not precisely reflect the intended design. In contrast, a formal specification is precise, short, and can be explored and experimented upon with tools.
Caching challenges and strategies
This article from the Amazon Builders’ Library by Matt Brinkley and Jas Chhabra is an excellent overview of considerations, tradeoff points, and pitfalls to look out for when introducing a cache somewhere in your system. I plan to go through this article each time I set up a cache, to check that I haven’t forgotten anything.
Be skeptical of the value a cache will bring, and carefully evaluate that the benefits will outweigh the added risks that the cache introduces.
Ensure that your service is resilient in the face of cache non-availability, which includes a variety of circumstances that lead to the inability to serve requests using cached data. These include cold starts, caching fleet outages, changes in traffic patterns, or extended downstream outages. In many cases, this could mean trading some of your availability to ensure that your servers and your dependent services don’t brown out (for example by shedding load, capping requests to dependent services, or serving stale data).
Design the storage format for cached objects to evolve over time (for example, use a version number) and write serialization code capable of reading older versions. Beware of poison pills in your cache serialization logic.
Evaluate how the cache will handle downstream errors, and consider maintaining a negative cache with a distinct TTL. Don’t cause or amplify an outage by repeatedly asking for the same downstream resource and discarding the error responses.
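To make the last two points more concrete, here is a minimal sketch of a read-through cache that stores a version number with each entry, treats unreadable entries as misses, and remembers downstream errors for a short, separate TTL. It is an illustration under my own assumptions, not code from the article: the names ReadThroughCache, DownstreamError, encode, decode, and MISS, the JSON layout, and the TTL values are all hypothetical.

```python
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple

FORMAT_VERSION = 2   # hypothetical current layout of cached objects
MISS = object()      # sentinel: "nothing usable in the cache"


class DownstreamError(Exception):
    """Raised when the downstream call failed (freshly or from the negative cache)."""


def encode(value: Any) -> str:
    # Store the format version next to the payload so the layout can evolve.
    return json.dumps({"v": FORMAT_VERSION, "value": value})


def decode(raw: str) -> Any:
    # Read the current layout and one (hypothetical) older layout.
    # Anything unreadable is reported as MISS instead of raising, so a
    # single bad entry (a "poison pill") cannot keep failing requests.
    try:
        doc = json.loads(raw)
        version = doc.get("v", 1)
        if version == 1:
            return doc["data"]    # assumed older layout: payload under "data"
        if version == 2:
            return doc["value"]
    except (ValueError, KeyError, AttributeError, TypeError):
        pass
    return MISS


class ReadThroughCache:
    def __init__(self, load: Callable[[str], Any], ttl: float = 300.0,
                 negative_ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl = ttl
        self._negative_ttl = negative_ttl  # distinct, shorter TTL for errors
        self._clock = clock
        # key -> (expires_at, serialized value or None, error message or None)
        self._entries: Dict[str, Tuple[float, Optional[str], Optional[str]]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            _, raw, error = entry
            if error is not None:
                # Negative hit: fail fast without calling downstream again.
                raise DownstreamError(error)
            value = decode(raw)
            if value is not MISS:
                return value
            # Unreadable entry: fall through and reload it.
        try:
            value = self._load(key)
        except Exception as exc:
            # Remember the failure briefly instead of discarding it.
            self._entries[key] = (now + self._negative_ttl, None, str(exc))
            raise DownstreamError(str(exc)) from exc
        self._entries[key] = (now + self._ttl, encode(value), None)
        return value
```

With this shape, repeated requests for a failing key reach the downstream service at most once per negative_ttl, and a deployment that changes the stored layout can still read entries written by the previous version. A real cache would also need eviction, size limits, and request coalescing, which this sketch leaves out.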
Everyone is Not Ops
Cindy Sridharan explains why more advanced tooling in software engineering and operations may not threaten our jobs, and may even do the opposite:
While, in general, better tooling is definitely a net positive, automation can be best leveraged when the person using the automation understands the underlying abstractions…
…paradoxically, even as we have higher and higher level programming tools with better and better abstractions, becoming a proficient programmer is getting harder and harder.
Studying an Incident
Subbu Allamaraju shows the power of positive thinking vs. negative thinking in the context of software resilience engineering:
Asking for what went well and how things worked, instead of just asking about what went wrong, opens possibilities for improvements that you would otherwise miss.
…probing for both “what went wrong” and “what went well and how” are essential. Each contributes to improving your understanding of the complexity of the system, how things work and identify potential corrective actions.
Lessons from 2019 by Subbu Allamaraju
Subbu Allamaraju is on fire lately. In this very insightful post, he brings up several important ideas echoing The Effective Executive by Peter Drucker. About result orientation:
Instead of selling what you want to build and how you want to develop it, focus on outcomes, and find ways to test your hypothesis by working backward from those outcomes incrementally. Romanticize about the results and not ideas.
About focusing on opportunity rather than problems:
On any day, when making decisions, choose opportunity over fear. It is not uncommon to come across arguments based on fear and uncertainty of competitors, third parties, and other entities. Concerns of vendor-lock-in is an excellent example of a fear-based approach. When presented with arguments, ask questions to turn attention towards opportunity.
Subbu also links to his earlier post on Status Management, where he excerpts Daniel Coyle’s The Culture Code:
The business school students appear to be collaborating, but in fact they are engaged in a process psychologists call status management. They are figuring out where they fit into the larger picture: Who is in charge? Is it okay to criticize someone’s idea? What are the rules here? Their interactions appear smooth, but their underlying behavior is riddled with inefficiency, hesitation, and subtle competition. Instead of focusing on the task, they are navigating their uncertainty about one another. They spend so much time managing status that they fail to grasp the essence of the problem (the marshmallow is relatively heavy, and the spaghetti is hard to secure). As a result, their first efforts often collapse and they run out of time.
Build Trust in Moments, Not in Years
By Edmond Lau
These powerful conversations that accelerate trust aren’t just limited to tense relationships. When you’re working with someone on a new project, explicitly designing your relationship with that person as you would an alliance accelerates trust. When you’re in a high-stakes situation, discovering and sharing what’s important and listening to what’s not being said accelerates trust. Even when you have a healthy relationship, sharing the unintended impact that people’s behaviors have on you accelerates trust.
The Knowledge Project: Episode 18 with Naval Ravikant
This remarkable interview is not about engineering at all; I put it here just because I loved it so much and want to share it with as many people as possible.
In this podcast, I noticed just one idea that can be used in engineering more or less directly. Naval reminds us of the “less is more” idea (aka via negativa, explored by Nassim Taleb in Antifragile):
I don’t believe that I have the ability to say what is going to work. Rather, what I try to do is I try to eliminate what’s not going to work. I think being successful is just about not making mistakes. It’s not about having correct judgment. It’s about avoiding the incorrect judgments.
I think there should be many ways to apply this idea in engineering, starting with “the best code is no code” and the observation that excellent operations, in terms of reliability, means that nothing exciting happens. Do you have other interesting applications in mind? Please share them in the comments.
You can subscribe to new Engineering Ideas via RSS (e.g. using Feedly) or e-mail.