Amazon's EC2 Outage: What did you expect?

These last few days have seen one of the worst failures of Amazon's Elastic Compute Cloud (EC2) service. As I write this the service is still not back to 100%, and Amazon has yet to provide a detailed explanation of what happened.

But I am going to speculate.... Over the years I have been responsible for large, complex systems. In particular, I was the operations manager for Project Athena for a number of years, and it was quite a complex system in its day.

One of the things that I have learned is that complex systems fail in complex ways. I suspect that is what has happened here. From what has been published to date I surmise:

  • There was an initial triggering event. Amazon has stated that there was a network problem of some kind. I wonder if someone tripped over a power cord (sorry, couldn't help it).
  • The failure is related to EC2's "Elastic Block Store" (EBS) service which provides virtual disks for virtual EC2 "instances" (computers).
  • This initial failure resulted in some significant traffic in one of their data centers (they have said as much) and probably caused all EBS services in that data center to fail.
  • Now here is where it gets interesting... The failure likely caused a cascading failure which leaked over to their "control plane" (the systems that manage the whole affair). The control plane is common to the entire region. This resulted in some services in other data centers (what Amazon calls "availability zones") being impacted. In particular it appears that new EBS disks could not be created or manipulated. However, most running instances outside of the original data center were likely unaffected.
  • As I write this, only the originally affected data center is still not operational, but the other availability zones appear to be working.

I operate the MIT Web Survey service using Amazon's resources. Because my operation isn't very large, it can easily fit on one virtual server. However, for redundancy I use two servers, each in a separate availability zone. I use a load balancer to direct incoming connections between the two servers. For my database I use CouchDB.
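For concreteness, the load-balancer side of that setup looks roughly like this through Amazon's API. This is a sketch only, using the boto3 library with made-up names and instance IDs, not my actual provisioning code:

```python
# Sketch: a classic Elastic Load Balancer in front of two instances,
# one per availability zone. Names and instance IDs are made up, and
# AWS credentials are assumed to be configured in the environment.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Create the balancer in both availability zones that host a server.
elb.create_load_balancer(
    LoadBalancerName="survey-lb",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

# Register one server in each zone behind the balancer.
elb.register_instances_with_load_balancer(
    LoadBalancerName="survey-lb",
    Instances=[{"InstanceId": "i-aaaaaaaa"}, {"InstanceId": "i-bbbbbbbb"}],
)
```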

One of the really cool features of CouchDB is its multi-master replication. So I do replication in real time between the two instances. This replication is done by software on my instances and doesn't depend on Amazon's control plane (but it does depend on connectivity between the two servers and of course to the Internet at large).
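Setting that up amounts to one request to each server's _replicate endpoint. A minimal sketch, with placeholder hostnames and database name (authentication omitted):

```python
# Sketch: continuous, bidirectional replication between two CouchDB
# instances. Hostnames and the database name are placeholders.
import requests

SERVER_A = "http://server-a.example.com:5984"
SERVER_B = "http://server-b.example.com:5984"
DB = "surveys"

def replicate(source, target):
    """Ask `source` to continuously push its changes to `target`."""
    r = requests.post(
        f"{source}/_replicate",
        json={"source": f"{source}/{DB}",
              "target": f"{target}/{DB}",
              "continuous": True},
    )
    r.raise_for_status()

# Multi-master: each server replicates to the other.
replicate(SERVER_A, SERVER_B)
replicate(SERVER_B, SERVER_A)
```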

One of my servers was in the failed data center. However, I was easily able to remove it from the load balancer. My one remaining server was then handling all of the load (which wasn't much at the time), and all data had already been replicated to the good server. Because of the cascading failure, I could not start another instance, even in another availability zone. However, I set up replication between the one remaining server and a host on the MIT Campus. So if this last server failed, I would have all of my data up until the moment of failure.
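Those two emergency steps, pulling the dead server out of the balancer and replicating to a host outside of EC2, look roughly like this. Again a sketch with made-up names; the campus hostname is a placeholder:

```python
# Sketch: remove the failed instance from the load balancer, then start
# an emergency replication to a host outside of EC2. All names are made up.
import boto3
import requests

elb = boto3.client("elb", region_name="us-east-1")
elb.deregister_instances_from_load_balancer(
    LoadBalancerName="survey-lb",
    Instances=[{"InstanceId": "i-aaaaaaaa"}],  # the server in the failed zone
)

# Push everything on the surviving server to a host back on campus, so a
# second failure can't take the data with it.
requests.post(
    "http://server-b.example.com:5984/_replicate",
    json={"source": "surveys",
          "target": "http://campus-host.example.edu:5984/surveys",
          "continuous": True},
).raise_for_status()
```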

Let me tell you, knowing that your data is safe gives you a really warm fuzzy feeling!

Just in case, I brought up my software on a new instance in an unaffected Amazon region. If the remaining server failed, I could completely move my operation to Amazon's west coast region (which was, and is, completely unaffected) in about 30 minutes' time. I also have a copy of my software at another cloud provider, so moving there is also an option.

Eventually I was able to start a new instance in Amazon's east coast region (the affected region), and by yesterday afternoon I was back to having two servers, each in a different availability zone, neither in the busted zone! I am also maintaining replication to the host on the MIT Campus (CouchDB doesn't limit replication to only two hosts; you can build an arbitrary replication topology involving many hosts). So if Amazon completely fails, I still have an alternative option AND I don't lose any data. Keep in mind that when you are running a survey service, each survey response is precious!
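Newer CouchDB releases (if I recall correctly) also let you make such replications persistent by writing documents into the _replicator database, one per source/target pair, which makes a many-host topology easy to describe. A sketch, with placeholder URLs and credentials omitted:

```python
# Sketch: persistent replications via CouchDB's _replicator database,
# fanning out from one server to several peers. URLs are placeholders.
import requests

LOCAL = "http://server-b.example.com:5984"
PEERS = [
    "http://server-c.example.com:5984",     # replacement EC2 instance
    "http://campus-host.example.edu:5984",  # host on campus
]
DB = "surveys"

for peer in PEERS:
    doc = {
        "source": f"{LOCAL}/{DB}",
        "target": f"{peer}/{DB}",
        "continuous": True,
    }
    # Each document in _replicator describes one replication job that
    # CouchDB keeps running (and restarts) on its own.
    requests.post(f"{LOCAL}/_replicator", json=doc).raise_for_status()
```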

Lessons Learned

Do not assume that any provider of services will never have a failure. Redundancy and data replication are your friends! I didn't attempt to bring the server in the failing zone back up. Instead I am going to wait for an "all clear" from Amazon. That way I am not contributing to the problem!

I saw a bit of whining on Amazon's developer forums. Apparently there are quite a few folks out there who run only one instance and had all of their eggs in that one basket. I have a hard time taking their whining seriously. I don't care how small a startup company they are: if they can afford one instance, they can afford a second. They are not expensive!

Hopefully this is not the end of Cloud Computing. If anything I expect Amazon to come out of this stronger after having learned some important lessons. Hopefully those lessons will include not only the technical ones but also the social ones. They probably should have dedicated at least one clueful employee to the task of providing information to the developer community. The updates on their status dashboard were very terse. Folks with my background can read them well enough to get a sense of what was going on, but lots of other people could not.

Also, as I write this there has been no statement from Amazon about what refunds, if any, will come from this event. In my case I am running an "on demand" instance which costs me four times as much as the "reserved" instance located in the failed data center. I don't really care about the cost; the total will likely be under $10 when all is said and done. However, they would get a large public relations bonus if they:

  • Made a simple statement: "Yes, this is bad. We will learn from this situation and do our damnedest to ensure that something like this does not happen again"
  • Made a statement: "All billing for the East Coast Region (the affected region) will be waived from the time of the initial failure until X days after complete service is restored." (X can be small.)

We'll see what happens...

Comments:

From: Daniel

I am trying to engineer a similar small-system service with one redundant server (a hot spare being more important than load balancing in my case), and your system makes a lot of sense; it is very similar to what I was envisioning. However, examining all possible failure points, I am still befuddled by the load balancer. Where is it hosted, and what if it goes down? Obviously it's difficult to get all clients onto a backup load balancer (as fast as possible, of course) without a lot of control over routes (which a small operation usually doesn't have) and the collective client DNS cache (which nobody has).

From: Jeff

Sorry for the slow response.

This is one of the cool features of Amazon's elastic load balancer. The load balancer exists in all availability zones that host an instance. So if I have two instances, one in zone "A" and a second in zone "B", the load balancer exists in both zones. I don't know if they are both active; I suspect not. But when you create a load balancer, Amazon gives you a DNS name to use. If you do an "A" record lookup (not to be confused with Zone "A") on the name of your balancer, you will see it is given out with a very small TTL (Time-To-Live). So if the balancer in actual use is in an availability zone that goes down, Amazon can change the "A" record to point to the balancer in the still-working zone. There will be an outage, but it is limited to a few minutes at worst.
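You can see this for yourself with a quick lookup. A sketch using the dnspython library (the balancer hostname here is made up):

```python
# Sketch: look up the "A" record behind a load balancer's DNS name and
# print its TTL. The hostname is made up; requires the dnspython package.
import dns.resolver

answer = dns.resolver.resolve(
    "survey-lb-1234567890.us-east-1.elb.amazonaws.com", "A")
print("TTL:", answer.rrset.ttl, "seconds")
for record in answer:
    print("A:", record.address)
```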
 
So the DNS is the key. The DNS is designed to be redundant; you normally advertise multiple servers. So you leverage the DNS to point clients to the working load balancer. A key piece of "secret sauce" that you need is a general health overseer that keeps track of what is running and what isn't, and updates DNS as needed.
 
I know that the various pieces exist to build such a system. I don't know if anyone has actually built a complete open-source system. I'm going to check Google...
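Just to make the idea concrete, such an overseer doesn't have to be fancy. Here is a sketch only, with a made-up zone ID, record name, and addresses, and using Amazon's Route 53 for the DNS updates (that part is my assumption, not something I actually run):

```python
# Sketch of a "health overseer": poll two candidate servers and point a
# DNS record (here via Route 53) at the first healthy one. The zone ID,
# record name, and addresses are made up; a real overseer needs more care
# (hysteresis, alerting, and being redundant itself).
import time
import boto3
import requests

ZONE_ID = "Z0000000000000"        # made-up hosted zone
RECORD = "survey.example.edu."
CANDIDATES = ["203.0.113.10", "203.0.113.20"]  # primary, backup

route53 = boto3.client("route53")

def healthy(ip):
    """Return True if the server at `ip` answers an HTTP request."""
    try:
        return requests.get(f"http://{ip}/", timeout=5).ok
    except requests.RequestException:
        return False

def point_record_at(ip):
    """UPSERT the A record so clients are sent to `ip`."""
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD, "Type": "A", "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )

while True:
    for ip in CANDIDATES:
        if healthy(ip):
            point_record_at(ip)   # keep the record on the first healthy host
            break
    time.sleep(30)
```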

Copyright © 2009-2023 Jeffrey I. Schiller