Amazon EC2 Outage: As I expected

In my last post on this topic I explained that I believed the Amazon EC2 failure was the result of a triggering event followed by a cascade failure. This turns out to have been the case, according to Amazon's postmortem report.

I wonder if they read my post. Not only did they provide a technically detailed report, much more so than their very terse reports during the outage, but they are also offering a very reasonable refund for the outage. Good for them.

I have been thinking about what this outage means... and why it was perceived as badly as it was.

As I stated earlier, I am running a service within EC2 and my service was pretty much unaffected because I planned for failure and my backup strategy worked. I even have "deeper" layers of backup that I didn't have to invoke.

However, many people were caught unawares by the failure. I suspect that is because they naively assumed that Amazon's infrastructure was internally redundant and would never fail, even though Amazon explicitly disclaims this in their documentation.

One of the steps Amazon will take after this outage is an educational campaign to help people better understand the nature of EC2 (and of all other Infrastructure as a Service offerings) and design their applications accordingly.

This started me thinking about other cloud services. Infrastructure as a Service (or IaaS) is probably the easiest to plan contingencies for, because the service offered is generic and available from multiple providers. Other approaches, such as Software as a Service (SaaS) and Platform as a Service (PaaS), are much more proprietary, and therefore you are more dependent on the service provider.

But you are not without alternatives. This blog, for example, is hosted on Google's "App Engine" PaaS service. Yet it is written in such a fashion that I could easily move it to an IaaS provider. Yes, I would need to make changes to the code, but they are minor, *and* I have to make sure I back up the data that I store in Google's Data Store.
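The way to keep that kind of move cheap is to keep the application code from talking to the Data Store directly. The sketch below is a minimal illustration of the idea, not the actual code behind this blog: a tiny storage interface with a portable, file-backed implementation that would run on any virtual server, and a note on where a Data Store implementation would plug in. All class and function names here are made up for illustration.

    # Sketch only: keeping storage behind a small interface so the blog
    # does not care whether it runs on App Engine or a plain IaaS server.
    # All names are illustrative, not the code actually used by this blog.
    import json
    import os

    class PostStore(object):
        """The only storage interface the rest of the blog code may use."""
        def save(self, slug, post):
            raise NotImplementedError
        def load(self, slug):
            raise NotImplementedError

    class FilePostStore(PostStore):
        """Portable backend: one JSON file per post on local disk,
        suitable for a virtual server at any IaaS provider."""
        def __init__(self, directory):
            self.directory = directory
        def _path(self, slug):
            return os.path.join(self.directory, slug + ".json")
        def save(self, slug, post):
            with open(self._path(slug), "w") as f:
                json.dump(post, f)
        def load(self, slug):
            with open(self._path(slug)) as f:
                return json.load(f)

    # A DatastorePostStore with the same two methods would wrap the
    # App Engine datastore calls; moving providers then means swapping
    # one class and exporting/importing the stored posts.

    if __name__ == "__main__":
        store = FilePostStore(".")
        store.save("ec2-outage", {"title": "Amazon EC2 Outage", "body": "..."})
        print(store.load("ec2-outage")["title"])

The design choice is simply that the interface, not any particular provider's API, is what the rest of the code depends on; the remaining work in a migration is exporting the data itself, which is why the backups matter.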

Even SaaS services such as e-mail are not completely locked to a vendor. As long as you control your domain name (AND THIS IS VERY IMPORTANT), you can move your mailbox location to another provider. However, if you advertise your e-mail address using the provider's domain name, then you are at the mercy of the provider.

For example, I have three e-mail addresses. One is at gmail.com. I send mailing list mail and other less important mail there, and I rarely give people that address as a contact address. If there is a problem with Google's infrastructure, I am out of luck until they fix it. I have a separate e-mail address provided by MIT, where I work. That mailbox is MIT's responsibility.

However, my "real" personal e-mail address is at a domain I control. At the moment it is routed to a virtual server at a cloud provider (IaaS). If there is a problem at that provider, I can always route it to a different virtual server at a different provider, or I can even route it to Google (or Yahoo or whatever). I may lose access to stored mail if I haven't backed it up (but I do back it up!). But no matter what, I will be able to continue receiving new mail at the new location in the event of a failure.
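What makes that possible is that mail delivery for a domain is steered by the domain's MX records, which the domain owner controls. The fragment below is an illustrative zone-file excerpt (the domain and host names are invented, and your DNS provider's interface may differ); repointing the MX records is all that is needed to direct new mail to a different provider, once the change propagates.

    ; Illustrative MX records for a domain you control (names invented).
    ; Normally mail goes to a virtual server at an IaaS provider, with a
    ; secondary at another provider.
    $ORIGIN example.org.
    @    IN  MX  10  mail.my-iaas-server.example.net.
    @    IN  MX  20  backup-mx.other-provider.example.com.
    ; During an outage, change the MX targets (or just the priorities) to
    ; point at a mailbox elsewhere; new mail follows the records once the
    ; DNS TTLs expire, typically within minutes to hours.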

So the point of this post is that all cloud services have risks associated with them, and you can almost always mitigate those risks, but you have to make an effort. If you blindly depend on your provider, well, you get what you plan for!

All of this, though, does put an emphasis on where you procure domain name service, as ultimately your services (whether they be a mailbox or a website) are located by you and others via your domain name provider. If that fails, you may have a problem! I'll muse on this in another post.

Copyright © 2009-2023 Jeffrey I. Schiller