The last few days have seen one of the worst failures of Amazon's Elastic Compute Cloud service. As I write this, the service is still not at 100%, and Amazon has yet to provide a detailed explanation of what happened.
But I am going to speculate.... Over the years I have been responsible for large complex systems. In particular I was operations manager for Project Athena for quite a number of years. It was quite a complex system in its day.
One of the things that I have learned is that complex systems fail in complex ways. I suspect that is what has happened here. From what has been published to date I surmise:
I operate the MIT Web Survey service using Amazon's resources. Because my operation isn't very large, it can easily fit on one virtual server. However, for redundancy I use two servers, each in a separate availability zone. I use a load balancer to direct incoming connections between the two servers. For my database I use CouchDB.
One of the really cool features of CouchDB is its multi-master replication. So I do replication in real time between the two instances. This replication is done by software on my instances and doesn't depend on Amazon's control plane (but it does depend on connectivity between the two servers and of course to the Internet at large).
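As a rough illustration of the setup described above, here is a minimal Python sketch of two-way continuous replication between two CouchDB databases. It builds the JSON bodies that CouchDB's POST /_replicate endpoint accepts ("source", "target", "continuous"). The function names and any host or database URLs you pass in are hypothetical, and this is not necessarily how the replication on my servers is actually configured.

```python
import json
import urllib.request


def replication_body(source, target):
    """Build the JSON body for a continuous CouchDB replication request."""
    return {"source": source, "target": target, "continuous": True}


def two_way_replication(db_a, db_b):
    """Multi-master between two databases is just one replication each way."""
    return [replication_body(db_a, db_b), replication_body(db_b, db_a)]


def start_replication(server, body):
    """POST one replication request to a CouchDB server's /_replicate endpoint.

    `server` is the base URL of the CouchDB instance that should run the
    replication (an assumption: no authentication is shown here).
    """
    request = urllib.request.Request(
        server.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling two_way_replication() with the two database URLs and passing each resulting body to start_replication() would set up replication in both directions. Because the instances talk to each other directly, nothing here touches the cloud provider's control plane.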
One of my servers was in the failed data center. However, I was easily able to remove it from the load balancer. My one remaining server was then handling all of the load (which wasn't much at the time), and all data had been replicated to the good server. Because of the cascading failure, I could not start another instance, even in another availability zone. However, I set up replication between the one remaining server and a host on the MIT Campus. So if this last server failed, I would have all of my data up until the moment of failure.
Let me tell you, knowing that your data is safe gives you a really warm fuzzy feeling!
Just in case, I brought up my software on a new instance in an unaffected Amazon region. If the remaining server failed, I could completely move my operation to Amazon's west coast region (which was and is completely unaffected) in about 30 minutes. I also have a copy of my software with another cloud provider, so moving there is also an option.
Eventually I was able to start a new instance in Amazon's east coast region (the affected region), and by yesterday afternoon I was back to having two servers, each in a different availability zone, neither in the busted zone! I am also maintaining replication to the host on the MIT Campus (CouchDB doesn't limit replication to only two hosts; you can have an arbitrary replication system involving many hosts). So if Amazon completely fails, I still have an alternative option AND I don't lose any data. Keep in mind that when you are running a survey service, each survey response is precious!
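To make the "many hosts" point concrete, here is a small Python sketch that lists the replication requests for one possible layout: full two-way replication among the active servers, plus one-way replication from each active server to an off-site backup. The layout, the function name, and the placeholder URLs are assumptions for illustration; the request fields ("source", "target", "continuous") are the ones CouchDB's replication API uses.

```python
from itertools import permutations


def replication_plan(masters, backups):
    """List continuous replication requests for a masters-plus-backups layout.

    Every master replicates to every other master (multi-master), and every
    master also pushes one-way to every backup. Backups never act as a source.
    """
    plan = [
        {"source": src, "target": dst, "continuous": True}
        for src, dst in permutations(masters, 2)
    ]
    plan += [
        {"source": src, "target": backup, "continuous": True}
        for src in masters
        for backup in backups
    ]
    return plan


if __name__ == "__main__":
    # Placeholder URLs, not real hosts.
    masters = ["http://zone-a.example:5984/surveys", "http://zone-b.example:5984/surveys"]
    backups = ["http://campus.example:5984/surveys"]
    for request in replication_plan(masters, backups):
        print(request["source"], "->", request["target"])
```

With two active servers and one backup this yields four replications: two between the servers and two feeding the backup. Adding a third active server or a second backup only changes the input lists.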
Lessons Learned
Do not assume that any service provider will never have a failure. Redundancy and data replication are your friends! I didn't attempt to bring up the server in the failing zone. Instead I am going to wait for an "all clear" from Amazon. That way I am not contributing to the problem!
I saw a bit of whining on Amazon's developer forums. Apparently there are quite a few folks out there who run only one instance and had all of their eggs in it. I have a hard time taking their whining seriously. I don't care how small a startup company they are: if they can afford one instance, they can afford a second. They are not expensive!
Hopefully this is not the end of Cloud Computing. If anything, I expect Amazon to come out of this stronger after having learned some important lessons. Hopefully those lessons will include not only the technical ones, but also social ones. They probably should have dedicated at least one clueful employee to the task of providing information to the developer community. The updates on their status dashboard were very terse. Folks with my background can read them well enough to get a sense of what was going on, but lots of other people could not.
Also, as I write this there has been no statement from Amazon about what refunds, if any, will come from this event. In my case I am running an "on demand" instance which costs me 4 times as much as the "reserved" instance located in the failed data center. I don't really care about the cost; the total will likely be under $10 when all is said and done. However they would get a large public relations bonus if they:
We'll see what happens...
Comments:
From: Daniel
I am trying to engineer a similar small-system service with one redundant server (hot spare more important than load-balancing in my case) and your system makes a lot of sense, very similar to what I was envisioning. However: Examining all possible failure points, I am still befuddled by the load balancer. Where is it hosted, and what if it goes down? Obviously it's difficult to get all clients on board a backup load balancer (as fast as possible of course) without a lot of control over routes (which a small operation usually doesn't have) and the collective client DNS cache (which nobody has).
From: Jeff
Sorry for the slow response.
Copyright © 2009-2023 Jeffrey I. Schiller