Engineering Ideas #6
Concurrency models, fallback in distributed systems, talk before engineering, engineering productivity, platform team's goals, design docs
Another great post by Adam Warski discussing the tradeoffs of different models of concurrent programming: futures, green threads, coroutines, IO monad, actors.
Warski draws an analogy with RPC to show that the boilerplate around async calls (which Project Loom aims to eliminate) is not necessarily a bad thing:
RPCs, and any network calls in general, have a significantly different characteristic than a normal function call. They are unpredictable: can arbitrarily fail, regardless of the value of input parameters. They can take an arbitrary amount of time to complete, while a normal function completes, fails or loops. And even if a network call seems to have failed, it might have still succeeded from the viewpoint of the other system. […] The fact that a normal function call is also syntactically distinct from an RPC call, might be an advantage to readability.
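One way to picture the "syntactically distinct" point is a wrapper that forces the caller to handle the three outcomes the quote lists. This is only a sketch under assumptions: `call_remote`, `RpcResult` and `Outcome` are hypothetical names, and a thread with a timeout stands in for a real network client.

```python
import concurrent.futures
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class Outcome(Enum):
    OK = auto()
    FAILED = auto()
    UNKNOWN = auto()  # timed out: the other system may still have applied the call


@dataclass(frozen=True)
class RpcResult(Generic[T]):
    outcome: Outcome
    value: Optional[T] = None
    error: Optional[BaseException] = None


def call_remote(fn: Callable[..., T], *args: Any, timeout_s: float) -> RpcResult[T]:
    """Run fn as if it were a remote call (hypothetical helper).

    The return type differs from a plain function call on purpose, so the
    call site cannot forget about failures and timeouts.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return RpcResult(Outcome.OK, value=future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        # We stopped waiting, but the work may still complete on the other side.
        return RpcResult(Outcome.UNKNOWN)
    except Exception as exc:
        return RpcResult(Outcome.FAILED, error=exc)
    finally:
        # Do not block the caller on a call that is still in flight.
        pool.shutdown(wait=False)
```

A caller then has to branch on `result.outcome` before touching `result.value`, which is exactly the kind of boilerplate the quote argues can help readability.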
Martin Thompson brings insight into why Loom Fibers won’t magically solve all performance problems with our concurrent applications:
Warski distills a persuasive strategy for picking a concurrency model for the task at hand:
When doing a simple service, small application or a quick script, I don’t want to deal with any kind of wrappers, be it
IO. Fibers and their “codes like sync, works like async” model will make my life much easier.
When writing a business application, I might want to use the synchronous-like API that Loom
Fibers enable to express the business logic, using well-known constructs to express the control flow within a business process. However, to orchestrate concurrently running computations, handle errors and allocate resources, I’ll use an
Finally, for a high-performance asynchronous system, I’ll probably take the fully-asynchronous approach, working with state machines, callbacks or
I want to contrast the above discourse with a quote from this blog post by Brad Fitzpatrick:
The Go runtime is relatively complex internally but it permits simple APIs and programming models for users who then don't need to worry about memory management, thread management, blocking, the color of their functions, etc.
Indeed, by not having Futures, Go avoids much of the complexity that comes with concurrency in Java.
Jacob Gabrielson shares many non-intuitive observations and actionable tactics for improving reliability.
At Amazon, we avoid fallback in our systems because it’s difficult to prove and its effectiveness is hard to test. Fallback strategies introduce an operational mode that a system enters only in the most chaotic moments where things begin to break, and switching to this mode only increases the chaos.
For critical single-machine applications that must work in case of memory allocation failures, one solution is to pre-allocate all heap memory on startup and never rely on malloc again, even under error conditions.
In “Failure Modes and Continuous Resilience”, Adrian Cockcroft mentions a similar idea in the context of failover to a different availability zone: at a critical moment, the control plane service may fail, too (or appear not to work because of problems in code or configuration that went unnoticed earlier, since that path is not usually exercised), so instances/databases/networks could instead be pre-allocated in the secondary region in a cold standby fashion.
Replace pull with push:
The IAM service needs to provide signed, rotated credentials to code running on EC2 instances. To avoid ever needing to fall back, the credentials are proactively pushed to every instance and remain valid for many hours. This means that IAM role-related requests keep working in the unlikely event of a disruption in the push mechanism.
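A minimal sketch of the push idea, not a description of how IAM is actually implemented: `PushedCredentialStore` and `Credential` are hypothetical names, and time is passed in explicitly to keep the example deterministic. The reader never calls out to the credential service; it only reads what was last pushed, and the long validity window covers gaps in the push mechanism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credential:
    token: str
    expires_at: float  # seconds, on the same clock the caller passes as `now`


class PushedCredentialStore:
    """Holds the most recent credential pushed to this instance (hypothetical)."""

    def __init__(self) -> None:
        self._current: Optional[Credential] = None

    def on_push(self, credential: Credential) -> None:
        # Keep whichever credential expires later, so a delayed or
        # out-of-order push cannot shorten the validity window.
        if self._current is None or credential.expires_at > self._current.expires_at:
            self._current = credential

    def get(self, now: float) -> Credential:
        # Purely local read: there is no pull path to fall back to.
        cred = self._current
        if cred is None or now >= cred.expires_at:
            raise LookupError("no valid credential has been pushed")
        return cred
```

With a credential valid for six hours, a push outage of five hours is invisible to readers; only an outage longer than the validity window surfaces as an error.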
Convert fallback into failover, which fits the pattern of constant work:
A service must run both the fallback and the non-fallback logic continuously. It must not merely run the fallback case but also treat it as an equally valid source of data. For example, a service might randomly choose between the fallback and non-fallback responses (when it gets both back) to make sure they're both working.
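The random-choice example from the quote could be sketched roughly as follows. The function name `fetch_with_failover` and the two source callables are assumptions for illustration. Both sources are called on every request, so the "fallback" path is exercised constantly rather than only during an outage.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_with_failover(
    primary: Callable[[], T],
    secondary: Callable[[], T],
    rng: random.Random,
) -> T:
    """Call both sources every time and treat them as equally valid."""
    results: List[T] = []
    for source in (primary, secondary):
        try:
            results.append(source())
        except Exception:
            # A failed source simply drops out; the other one carries the request.
            pass
    if not results:
        raise RuntimeError("both sources failed")
    # When both answered, pick one at random so that neither path goes unused.
    return rng.choice(results)
```

The cost is that every request does double the work; in exchange, a broken secondary path shows up in normal traffic instead of at the worst possible moment.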
Do you monitor retries in your system already?
We maintain metrics that monitor overall retry rates and alarms that alert our teams if retries are happening frequently.
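As a rough illustration of what such a metric might look like, here is a sketch with hypothetical names (`RetryMetrics`, `call_with_retries`, `should_alarm`) and an arbitrary alarm threshold. Backoff and jitter are deliberately left out to keep the counting logic visible.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryMetrics:
    """Counts logical calls and the retries they needed."""

    def __init__(self) -> None:
        self.calls = 0
        self.retries = 0

    def retry_rate(self) -> float:
        # Average number of retries per logical call.
        return self.retries / self.calls if self.calls else 0.0


def call_with_retries(fn: Callable[[], T], metrics: RetryMetrics, max_attempts: int = 3) -> T:
    metrics.calls += 1
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            metrics.retries += 1  # another attempt will follow
    raise ValueError("max_attempts must be at least 1")


def should_alarm(metrics: RetryMetrics, threshold: float = 0.5) -> bool:
    # The threshold is an assumption; a real alarm would be tuned per service.
    return metrics.retry_rate() > threshold
```

In a real service the counters would be exported to the monitoring system instead of being held in memory.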
Keavy McMinn on how to start a new project: talk with people. Customers, colleagues, industry peers.
A great question I learnt more recently, while sitting in on a colleague interviewing a customer, was “What question do you wish I’d asked you?”.
My experiences have taught me that if you want to produce the right thing, that has rich and lasting impact, this starting point of finding people, talking to and learning from them is fundamental. For me it’s the precursor before the technical research and experiments can truly begin.
From an HN comment by Jonathan Tang:
Jeff & Sanjay have a reputation as rockstars because they can apply this productivity to things that really matter; they're able to pick out the really important parts of the problem and then focus their efforts there so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn't really have happened anyway, or migrating from one system to another that isn't really a large improvement, or going down an architectural dead-end that'll just have to be rewritten later.
Jeff or Sanjay will spend their time running a proposed API by clients to ensure it meets their needs, or measuring the performance of subsystems so they fully understand their building blocks, or mentally simulating the operation of the system before building it so they rapidly test out alternatives. They don't actually write more code than a junior developer (oftentimes, they write less), but the code they do write gives them more information, which makes them ensure that they write the right code.
… these developers rapidly become 1x developers (or worse) if you don't let them make their own architectural choices - the reason they're excellent in the first place is because they know how to determine if certain work is going to be useless and avoid doing it in the first place.
Galo Navarro highlights the most important question every platform infrastructure team should ask themselves:
A Platform team should really avoid competing against AWS, Google, or any commercial company. It doesn’t matter if their homegrown CI system is superior, today, to $commercial_ci_system. The market will catch up, faster than expected, and make that team redundant. Every Platform team should be asking themselves: what is our differentiator? What do we offer that makes it worthwhile for our company to invest in our team, rather than throwing those engineers at the product?
The main value we provide is in the joints, the articulation, the glue. In how we integrate together all these systems. […] We focus on what is specific for our company, tailoring off-the-shelf solutions to our needs.
Dan Luu on the benefits of an up-front design phase for projects whose invention and implementation phases are more challenging than the discovery phase (see Grady Booch’s phases):
Working through a design collaboratively teaches everyone on the team everyone else's tricks. It's a lot like the kind of skill transfer you get with pair programming, but applied to design.
The iteration speed is much faster in the design phase, where throwing away a design just means erasing a whiteboard. Once you start coding, iterating on the design can mean throwing away code; for infrastructure projects, that can easily be person-years or even tens of person-years of work. […] I've seen teams on projects of similar scope insist on getting "working" code as soon as possible. In every single case, that resulted in massive delays as huge chunks of code had to be re-written, and in a few cases the project was fundamentally flawed in a way that required the team to start over from scratch.
I get that on product-y projects, where you can't tell how much traction you're going to get from something, you might want to get an MVP out the door and iterate, but for pure infrastructure, it's often possible to predict how useful something will be in the design phase.
He also confirms the importance of design docs, which lines up with the practices at Uber shared by Gergely Orosz:
I noticed a curiously strong correlation between the quality of initial design docs and the success of projects.