As previously mentioned, I recently contributed to the Amazon Builders' Library. I'd also like to share another post that I wrote for the series, "Avoiding fallback in distributed systems". I'm especially excited to publish it because it's based on an internal Amazon document I wrote in 2006, "Modes Considered Harmful". That document provoked a lot of interesting discussion over the years, so I'm happy to share a (hopefully slightly better written) version of it publicly. Here's a quick abstract of the article:
This article covers fallback strategies and why we almost never use them at Amazon. In the world of distributed systems, fallback strategies are among the most difficult challenges to handle, especially for time-sensitive services. Compounding this difficulty is that bad fallback strategies can take a long time (even years) to leave repercussions, and the difference between a good strategy and a bad strategy is subtle. In this article, the focus will be on how fallback strategies can cause more problems than they fix. We’ll include examples of where fallback strategies have caused problems at Amazon. Finally, we’ll discuss alternatives to fallback that we use at Amazon.