Thursday, December 19, 2019

Avoiding fallback in distributed systems

As previously mentioned, I was recently able to contribute to the Amazon Builders' Library. I'd also like to share another post that I wrote for the series, titled "Avoiding fallback in distributed systems." I'm especially excited to be able to publish it, as it's based on an internal Amazon document I wrote in 2006 called "Modes Considered Harmful." It provoked a lot of interesting discussions over the years, so I'm happy to be able to share a (hopefully slightly better written) version of it publicly. Here's a quick abstract of the article:

This article covers fallback strategies and why we almost never use them at Amazon. In the world of distributed systems, fallback strategies are among the most difficult challenges to handle, especially for time-sensitive services. Compounding this difficulty is that bad fallback strategies can take a long time (even years) to leave repercussions, and the difference between a good strategy and a bad strategy is subtle. In this article, the focus will be on how fallback strategies can cause more problems than they fix. We’ll include examples of where fallback strategies have caused problems at Amazon. Finally, we’ll discuss alternatives to fallback that we use at Amazon.

Challenges with distributed systems

I was recently invited to contribute to the Amazon Builders' Library. One article I'd been wanting to publish publicly is about how bizarre distributed systems are, and what, in my experience, has been the biggest challenge in building them.

Please check out the article. Here's a quick abstract:


Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s impossible always to know whether something failed. This article reviews the concepts that contribute to why distributed computing is so, well, weird.
