Thursday, December 24, 2009

Use ELB to Serve Multiple SSL Domains on One EC2 Instance

This is one of the coolest uses of Amazon's ELB I've seen yet. Check out James Elwood's article.

You may know that you can't serve more than one SSL-enabled domain on a single EC2 instance. Okay, you can, but only via a wildcard certificate (limited) or a multi-domain certificate (hard to maintain). So you really can't do it properly. Serving multiple SSL domains is one of the main use cases behind the popular request to support multiple IP addresses per instance.

Why can't you do it "normally"?

The reason why it doesn't work is this: HTTPS encrypts the entire HTTP request, including the Host: header that identifies which domain is being requested - and therefore which SSL certificate should be used to authenticate the request. But the server must present a certificate during the SSL handshake, before it can decrypt the request and read that header. With no way to know which domain is being requested, it has no way to choose the correct certificate, so a web server can present only one SSL certificate per IP address.

If you have multiple IP addresses then you can serve different SSL domains from different IP addresses. The VirtualHost directive in Apache (or similar mechanisms in other web servers) can look at the target IP address in the TCP packets - not in the HTTP Host: header - to figure out which IP address is being requested, and therefore which domain's SSL certificate to use.

But without multiple IP addresses on an EC2 instance, you're stuck serving only a single SSL-enabled domain from each EC2 instance.

How can you?

Really, read James' article. He explains it very nicely.

How much does it cost?

Much less than two EC2 instances, that's for sure. According to the EC2 pricing charts, ELB costs:
  • $0.025 per Elastic Load Balancer-hour (or partial hour) ($0.028 in us-west-1 and eu-west-1)
  • $0.008 per GB of data processed by an Elastic Load Balancer
The smallest per-hour cost you can get in EC2 is for the m1.small instance, at $0.085 ($0.095 in us-west-1 and eu-west-1).

Using the ELB-for-multiple-SSL-sites trick saves you roughly 70% of the cost of running a separate instance for each additional SSL domain: $0.025 per hour for an ELB instead of $0.085 per hour for another m1.small.

Thanks, James!

Thursday, December 10, 2009

Read-After-Write Consistency in Amazon S3

S3 has an "eventual consistency" model, which presents certain limitations on how S3 can be used. Today, Amazon released an improvement called "read-after-write consistency" in the EU and US-west regions (it's there, hidden at the bottom of the blog post). Here's an explanation of what this is, and why it's cool.


What is Eventual Consistency?

Consistency is a key concept in data storage: it describes when changes committed to a system are visible to all participants. Classic transactional databases employ various levels of consistency, but the gold standard is that after a transaction commits the changes are guaranteed to be visible to all participants. A change committed at millisecond 1 is guaranteed to be available to all views of the system - all queries - immediately thereafter.

Eventual consistency relaxes the rules a bit, allowing a time lag between the point the data is committed to storage and the point where it is visible to all others. A change committed at millisecond 1 might be visible to all immediately. It might not be visible to all until millisecond 500. It might not even be visible to all until millisecond 1000. But, eventually it will be visible to all clients. Eventual consistency is a key engineering tradeoff employed in building distributed systems.

One issue with eventual consistency is that there's no theoretical limit to how long you need to wait until all clients see the committed data. A delay must be employed (either explicitly or implicitly) to ensure the changes will be visible to all clients.

Practically speaking, I've observed that changes committed to S3 become visible to all within less than 2 seconds. If your distributed system reads data shortly after it was written to eventually consistent storage (such as S3) you'll experience higher latency as a result of the compensating delays.
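If you want to get a feel for that delay yourself, you can poll a freshly written object until it becomes visible. Here's a minimal shell sketch; the bucket and key names are placeholders, and it assumes the object was just uploaded with a public-read ACL so that plain curl can fetch it (an authenticated client would observe the same effect):

# Poll a just-uploaded S3 object until it becomes visible, then report the delay.
URL="http://mybucket.s3.amazonaws.com/just-written-key"   # placeholder bucket and key
START=$(date +%s)
until curl -s -f -o /dev/null "$URL"; do   # -f makes curl exit non-zero on a 404
  sleep 1
done
echo "Object became visible after $(( $(date +%s) - START )) seconds"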


What is Read-After-Write Consistency?

Read-after-write consistency tightens things up a bit, guaranteeing immediate visibility of new data to all clients. With read-after-write consistency, a newly created object or file or table row will immediately be visible, without any delays.

Note that read-after-write is not complete consistency: there's also read-after-update and read-after-delete. Read-after-update consistency would allow edits to an existing file or changes to an already-existing object or updates of an existing table row to be immediately visible to all clients. That's not the same thing as read-after-write, which is only for new data. Read-after-delete would guarantee that reading a deleted object or file or table row will fail for all clients, immediately. That, too, is different from read-after-write, which only relates to the creation of data.


Why is Read-After-Write Consistency Useful?

Read-after-write consistency allows you to build distributed systems with less latency. As touched on above, without read-after-write consistency you'll need to incorporate some kind of delay to ensure that the data you just wrote will be visible to the other parts of your system.

But no longer. If you use S3 in the US-west or EU regions, your systems need not wait for the data to become available.


Why Only in the AWS US-west and EU Regions?

Read-after-write consistency for AWS S3 is only available in the US-west and EU regions, not the US-Standard region. I asked Jeff Barr of AWS blogging fame why, and his answer makes a lot of sense:
This is a feature for EU and US-West. US Standard is bi-coastal and doesn't have read-after-write consistency.
Aha! I had forgotten about the way Amazon defines its S3 regions. US-Standard has servers on both the east and west coasts (remember, this is S3, not EC2) in the same logical "region". The engineering challenge of providing read-after-write consistency is greatly magnified as the geographical area grows. The fundamental physical limitation is the speed of light, which takes at least 16 milliseconds to cross the US coast to coast (and that's in a vacuum - over the internet it takes at least four times as long, due to the latency introduced by routers and switches along the way).

If you use S3 and want to take advantage of the read-after-write consistency, make sure you understand the cost implications: the US-west and EU regions have higher storage and bandwidth costs than the US-Standard region.


Next Up: SQS Improvements?

Some vague theorizing:

It's been suggested that AWS Simple Queue Service leverages S3 under the hood. The improved S3 consistency model can be used to provide better consistency for SQS as well. Is this in the works? Jeff Barr, any comment? :-)

Friday, December 4, 2009

The Open Cloud Computing Interface at IGT2009

Today I participated in the Cloud Standards & Interoperability panel at the IGT2009 conference, together with Shahar Evron of Zend Technologies, and moderated by Reuven Cohen. Reuven gave an overview of his involvement with various governments on the efforts to define and standardize "cloud", and Shahar presented an overview of the Zend Simple Cloud API (for PHP). I presented an overview of the Open Grid Forum's Open Cloud Computing Interface (OCCI).

The slides include a 20,000-foot view of the specification, a 5,000-foot view of the specification, and an eye-level view in which I illustrated the metadata travelling over the wire using the HTTP Header rendering.

Here's my presentation.

Tuesday, November 17, 2009

How to Work with Contractors on AWS EC2 Projects

Recently I answered a question on the EC2 forums about how to give third parties access to EC2 instances. I noticed there's not a lot of info out there about how to work with contractors, consultants, or even internal groups to whom you want to grant access to your AWS account. Here's how.


First, a Caveat


Please be very selective when you choose a contractor. You want to make sure you choose a candidate who can actually do the work you need - and unfortunately, not everyone who advertises as such can really deliver the goods. Reuven Cohen's post about choosing a contractor/consultant for cloud projects examines six key factors to consider:
  1. Experience: experience solving real world problems is probably more important than anything else.
  2. Code: someone who can produce running code is often more useful than someone who just makes recommendations for others to follow.
  3. Community Engagement: discussion boards are a great way to gauge experience, and provide insight into the capabilities of the candidate.
  4. Blogs & Whitepapers: another good way to determine a candidate's insight and capabilities.
  5. Interview: ask the candidate questions to gauge their qualifications.
  6. References: do your homework and make sure the candidate really did what s/he claims to have done.
Reuven's post goes into more detail. It's highly recommended for anyone considering using a third-party for cloud projects.


What's Your Skill Level?


The best way to allow a contractor access to your resources depends on your level of familiarity with the EC2 environment and with systems administration in general.

If you know your way around the EC2 toolset and you're comfortable managing SSH keypairs, then you probably already know how to allow third-party access safely. This article is not meant for you. (Sorry!)

If you don't know your way around the EC2 toolset - specifically the command-line API tools, the AWS Management Console, or the ElasticFox Firefox Extension - then you will be better off allowing the contractor to launch and configure the EC2 resources for you. The next section is for you.


Giving EC2 Access to a Third Party


[An aside: It sounds strange, doesn't it? "Third party". Did I miss two parties already? Was there beer? Really, though, it makes sense. A third party is someone who is not you (you're the first party) and not Amazon (they're the counterparty, or the second party). An outside contractor is a third party.]

Let's say you want a contractor to launch some EC2 instances for you and to set them up with specific software running on them. You also want them to set up automated EBS snapshots and other processes that will use the EC2 API.


What you should give the contractor


Give the contractor your Access Key ID and your Secret Access Key, which you should get from the Security Credentials page:



The Access Key ID is not a secret - but the Secret Access Key is, so make sure you transfer it securely. Don't send it over email! Use a private DropBox or other secure method.

Don't give out the email address and password that allows you to log into the AWS Management Console. You don't want anyone but you to be able to change the billing information or to sign you up for new services. Or to order merchandise from Amazon.com using your account (!).


What the contractor will do


Using ElasticFox and your Access Key ID and Secret Access Key the contractor will be able to launch EC2 instances and make all the necessary configuration changes on your account. Plus they'll be able to put these credentials in place for automated scripts to make EC2 API calls on your behalf - like to take an EBS snapshot. [There are some rare exceptions which will require your X.509 Certificates and the use of the command-line API tools.]

For example, here's what the contractor will do to set up a Linux instance (a rough command-line equivalent is sketched just after the list):
  1. Install ElasticFox and put in your access credentials, allowing him access to your account.
  2. Set up a security group allowing him to access the instance.
  3. Create a keypair, saving the private key to his machine (and to give to you later).
  4. Choose an appropriate AMI from among the many available. (I recommend the Alestic Ubuntu AMIs).
  5. Launch an instance of the chosen AMI, in the security group, using the keypair.
  6. Once the instance is launched he'll SSH into it and set it up, using the instance's public IP address, the private half of the keypair (from step 3), and the user name (most likely "root").
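For reference, here is a rough command-line equivalent of those steps using the EC2 API tools; the group name, keypair name, AMI ID, IP address, and host name below are placeholders, and you should check each tool's --help for the exact options in your version:

# 2. Create a security group and open SSH to the contractor's own IP address
ec2-add-group webserver -d "Web server group"
ec2-authorize webserver -p 22 -s 203.0.113.10/32
# 3. Create a keypair; copy the private-key portion of the output into contractor-key.pem
ec2-add-keypair contractor-key
chmod 600 contractor-key.pem
# 4-5. Launch an instance of the chosen AMI in that group, using that keypair
ec2-run-instances ami-12345678 -t m1.small -k contractor-key -g webserver
# 6. Find the instance's public DNS name, then SSH in to set it up
ec2-describe-instances
ssh -i contractor-key.pem root@ec2-203-0-113-25.compute-1.amazonaws.com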
The contractor can also set up some code to take EBS snapshots - and the code will require your credentials.


What deliverables to expect from the contractor


When he's done, the contractor will give you a few things. These should include:
  • the instance IDs of the instances, their IP addresses, and a description of their roles.
  • the names of any load balancers, auto scaling groups, etc. created.
  • the private key he created in step 3 and the login name (usually "root"). Make sure you get this via a secure communications method - it allows privileged access to the instances.
Make sure you also get a thorough explanation of how to change the credentials used by any code requiring them. In fact, you should insist that this must be easy for you to do.

Plus, ask your contractor to set up the Security Groups so you will have the authorization you need to access your EC2 deployment from your location.

And, of course, before you release the contractor you should verify that everything works as expected.


What to do when the contractor's engagement is over


When your contractor no longer needs access to your EC2 account you should create new access key credentials (see the "Create a new Access Key" link on the Security Credentials page mentioned above).

But don't disable the old credentials just yet. First, update any code the contractor installed to use the new credentials and test it.

Once you're sure the new credentials are working, disable the credentials given to the contractor (the "Make Inactive" link).


The above guidelines also apply to working with internal groups within your organization. You might not need to revoke their credentials, depending on their role - but you should follow the suggestions above so you can if you need to.

Tuesday, October 27, 2009

What Language Does the Cloud Speak, Now and In the Future?

You're a developer writing applications that use the cloud. Your code manipulates cloud resources, creating and destroying VMs, defining storage and networking, and gluing these resources together to create the infrastructure upon which your application runs. You use an API to perform these cloud operations - and this API is specific to the programming language and to the cloud provider you're using: for example, for Java EC2 applications you'd use typica, for Python EC2 applications you'd use boto, etc. But what's happening under the hood, when you call these APIs? How do these libraries communicate with the cloud? What language does the cloud speak?

I'll explore this question for today's cloud, and touch upon what the future holds for cloud APIs.


Java? Python? Perl? PHP? Ruby? .NET?

It's tempting to say that the cloud speaks the same programming language whose API you're using. Don't be fooled: it doesn't.

"Wait," you say. "All these languages have Remote Procedure Call (RPC) mechanisms. Doesn't the cloud use them?"

No. The reason why RPCs are not provided for every language is simple: would you want to support a product that needed to understand the RPC mechanism of many languages? Would you want to add support for another RPC mechanism as a new language becomes popular?

No? Neither do cloud providers.

So they use HTTP.


HTTP: It's a Protocol

The cloud speaks HTTP. HTTP is a protocol: it prescribes a specific on-the-wire representation for the traffic. Commands are sent to the cloud and results returned using the internet's most ubiquitous protocol, spoken by every browser and web server, routable by all routers, bridgeable by all bridges, and securable by any number of different methods (HTTP + SSL/TLS being the most popular, a.k.a. HTTPS). RPC mechanisms cannot provide all these benefits.

Cloud APIs all use HTTP under the hood. EC2 actually has two different ways of using HTTP: the SOAP API and the Query API. SOAP uses XML wrappers in the body of the HTTP request and response. The Query API places all the parameters into the URL itself and returns the raw XML in the response.

So, the lingua franca of the cloud is HTTP.

But EC2's use of HTTP to transport the SOAP API and the Query API is not the only way to use HTTP.


HTTP: It's an API

HTTP itself can be used as a rudimentary API. HTTP has methods (GET, PUT, POST, DELETE) and return codes and conventions for passing arguments to the invoked method. While SOAP wraps method calls in XML, and Query APIs wrap method calls in the URL (e.g. http://ec2.amazonaws.com/?Action=DescribeRegions), HTTP itself can be used to encode those same operations. For example:
GET /regions HTTP/1.1
Host: cloud.example.com
Accept: */*
That's a (theoretical) way to use raw HTTP to request the regions available from a cloud located at cloud.example.com. It's about as simple as you can get for an on-the-wire representation of the API call.

Using raw HTTP methods we can model a simple API as follows:
  • HTTP GET is used as a "getter" method.
  • HTTP PUT and POST are used as "setter" or "constructor" methods.
  • HTTP DELETE is used to delete resources.
All CRUD operations can be modeled in this manner. This technique of using HTTP to model a higher-level API is called Representational State Transfer, or REST. RESTful APIs are mapped to the HTTP verbs and are very lightweight. They can be used directly by any language (OK, any language that supports HTTP - which is every useful language) and also by browsers directly.
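For instance, a hypothetical RESTful cloud at cloud.example.com could be driven directly with curl, no client library required (the resource paths and parameters here are invented for illustration):

# List regions (a "getter")
curl http://cloud.example.com/regions
# Create a new instance (a "constructor")
curl -X POST -d "image=ami-12345678&type=small" http://cloud.example.com/instances
# Delete an instance
curl -X DELETE http://cloud.example.com/instances/i-1a2b3c4d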

RESTful APIs are "close to the metal" - they do not require a higher-level object model in order to be usable by servers or clients, because bare HTTP constructs are used.

Unfortunately, EC2's APIs are not RESTful. Amazon was the undisputed leader in bringing cloud to the masses, and its cloud API was built before RESTful principles were popular and well understood.


Why Should the Cloud Speak RESTful HTTP?

Many benefits can be gained by having the cloud speak RESTful HTTP. For example:
  • The cloud can be operated directly from the command-line, using curl, without any language libraries needed.
  • Operations require less parsing and higher-level modeling because they are represented close to the "native" HTTP layer.
  • Cache control, hashing and conditional retrieval, alternate representations of the same resource, etc., can be easily provided via the usual HTTP headers. No special coding is required.
  • Anything that can run a web server can be a cloud. Your embedded device can easily advertise itself as a cloud and make its processing power available for use via a lightweight HTTP server.
All these benefits are important enough to be provided by any cloud API standard.


Where are Cloud API Standards Headed?

There are many cloud API standardization efforts. Some groups are creating open standards, involving all industry stakeholders and allowing you (the developer) to use them or implement them without fear of infringing on any IP. Some of them are not open, where those guarantees cannot be made. Some are language-specific APIs, and others are HTTP-based APIs (RESTful or not).

The following are some popular cloud APIs:

jClouds
libcloud
Cloud::Infrastructure
Zend Simple Cloud API
Dasein Cloud API
Open Cloud Computing Interface (OCCI)
Microsoft Azure
Amazon EC2
VMware vCloud
deltacloud

Here's how the above products (APIs) compare, based on these criteria:

Open: The specification is available for anyone to implement without licensing IP, and the API was designed in a process open to the public.
Proprietary: The specification is either IP encumbered or the specification was developed without the free involvement of all ecosystem participants (providers, ISVs, SIs, developers, end-users).
API: The standard defines an API requiring a programming language to operate.
Protocol: The standard defines a protocol - HTTP.



This chart shows the following:
  • There are many language-specific APIs, most open-source.
  • Proprietary standards are the dominant players in the marketplace today.
  • OCCI is the only completely open standard defining a protocol.
  • Deltacloud was begun by Red Hat and is currently open, but its initial development was closed and did not involve players from across the ecosystem (hence its location on the border between Open and Proprietary).

What Does This Mean for the Cloud Developer?

The future of the cloud will have a single protocol that can be used to operate multiple providers. Libraries will still exist for every language, and they will be able to control any standards-compliant cloud. In this world, a RESTful API based on HTTP is a highly attractive option.

I highly recommend taking a look at the work being done in OCCI, an open standard that reflects the needs of the entire ecosystem. It'll be in your future.

Update 27 October 2009:
Further Reading
No mention of cloud APIs would be complete without reference to William Vambenepe's articles on the subject:

Saturday, October 17, 2009

Avoiding EC2 InsufficientInstanceCapacity: Insufficient Capacity Errors

Here's a quick tip from this thread on the AWS EC2 Developer Forums.

If you experience the InsufficientInstanceCapacity: Insufficient Capacity error, you'll be glad to know there are some strategies for working around it. Justin@AWS offers this advice:
There can be short periods of time when we are unable to accommodate instance requests that are targeted to a specific Availability Zone. When a particular instance type experiences unexpected demand in an Availability Zone, our system must react by shifting capacity from one instance type to another. This can result in short periods of insufficient capacity. We incorporate this data into our capacity planning and try to manage all zones to have adequate capacity at all times. The following steps will ensure that you will have the best experience launching Amazon EC2 instances when an initial insufficient capacity message is received:

1. Don't specify an Availability Zone in your request unless necessary. By targeting a specific Availability Zone you eliminate our ability to satisfy that request by using our other available Availability Zones. Please note that a single RunInstances call will allocate all instances within a single Availability Zone.

2. If you require a large number of instances for a particular job, please request them in batches. The best practice to follow here is to request 25% of your total cluster size at a time. For example, if you want to launch 200 instances, launching 50 instances at a time will result in a better experience.

3. Try using a different instance type. As capacity varies across instance types, attempting to launch different instance types provides spill over capacity should your primary instance type be temporarily unavailable.
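Tips #1 and #3 lend themselves to a simple launch wrapper. Here's a hedged sketch using the EC2 API tools - the AMI ID, keypair, group, and instance-type fallback order are placeholders, and it assumes the tool exits non-zero when the capacity error occurs. Note that it deliberately omits the -z (Availability Zone) option:

# Try to launch without pinning an Availability Zone, falling back
# through alternate instance types if capacity is unavailable.
AMI=ami-12345678
for TYPE in m1.small c1.medium m1.large; do
  if ec2-run-instances $AMI -t $TYPE -k my-keypair -g my-group; then
    echo "Launched as $TYPE"
    break
  fi
  echo "No capacity for $TYPE, trying the next type..."
done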

Unfortunately, these techniques require that you be willing to accept higher bandwidth costs for cross-availability-zone traffic.

And, none of these tips help if you're using Auto Scaling. A single Auto Scaling Group must be in a specific availability zone, so #1 won't help. You can try using smaller numbers of instances when a trigger is reached by choosing a smaller LowerBreachScaleIncrement or UpperBreachScaleIncrement (which control by how many instances or by what percent to scale in each direction), as per #2, but this is only helpful if you've planned in advance. And #3 is only possible if you've already noticed an auto scaling activity failure and changed the Launch Configuration - which defeats the purpose of Auto Scaling.

Auto Scaling's error reporting and recovery is very limited currently. Are you listening, AWS?

Update 18 October 2009: AWS is listening. The following post by John@AWS appears in this thread:
AutoScaling currently reports [...] InsufficientInstanceCapacity [...] as a generic Internal Error. This is unintentional, and will be remedied in our next release.
Cool!

Update 19 October 2009: Auto Scaling Groups can now be configured to support more than one Availability Zone. Here is the salient quote from the updated documentation:
Instance Distribution and Balance across Multiple Zones

Amazon Auto Scaling attempts to distribute instances evenly between the Availability Zones that are enabled for your AutoScalingGroup. Auto Scaling uses the Availability Zone with the least number of instances when launching new instances. However, if an Availability Zone has insufficient capacity or if Amazon EC2 is unable to launch new instances in it, then Auto Scaling launches instances in another Availability Zone to satisfy the required capacity for your group.

Certain operations and conditions can cause your AutoScalingGroup to become unbalanced. Auto Scaling compensates by creating a rebalancing activity under any of the following conditions:

  1. You issue a request to change the Availability Zones for your group.

  2. You call TerminateInstanceInAutoScalingGroup, which causes the group to become unbalanced.

  3. An Availability Zone that previously had insufficient capacity recovers and has additional capacity available.

Auto Scaling always launches new instances before attempting to terminate old ones, so a rebalancing activity will not compromise the performance or availability of your application.

Multi-Zone Instance Counts when Approaching Capacity

Because Auto Scaling always attempts to launch new instances before terminating old ones, being at or near the specified maximum capacity could impede or completely halt rebalancing activities. To avoid this problem, the system can temporarily exceed the specified maximum capacity of a group by a 10% margin during a rebalancing activity. The margin is only extended if the group is at or near maximum capacity and needs rebalancing (either as a result of user-requested rezoning or to compensate for zone availability issues). The extension only lasts as long as needed to re-balance the group (typically a few minutes).

Sunday, September 27, 2009

Alternatives to Elastic IPs for EC2 Name Resolution

How can you handle DNS lookups in EC2 without going crazy each time a resource's IP address changes? One solution is to use an Elastic IP, a stable IP address that can be remapped to different instances, but Elastic IPs are not appropriate for all situations. This article explores the various methods of managing name resolution with EC2 instances.

Features of Different Name Resolution Methods

Before diving into the methods themselves let's take a look at the factors to consider when evaluating methods of managing name resolution. Here are the factors:
  • Updatable in code. You will want to write code to make changes to the name resolution settings automatically, in response to infrastructure events (e.g. launching a new server).
  • Propagation delay. It can take some time for changes to name resolution settings to propagate (especially with DNS). A solution should offer some degree of assurance that changes will propagate within a known and reasonable period of time. [Note that some clients (e.g. the IE browser or the Java runtime) by default ignore the DNS TTL, artificially increasing the propagation delay for DNS-based methods.]
  • Compatible with DNS. If your service will be accessed by a web browser or other client that you do not control, your name resolution method will need to be compatible with DNS. Otherwise clients will not be able to resolve your hostnames properly.
  • Ease of implementation. Some solutions, while technically sufficient, are difficult to implement.
  • Public / Private IP addresses. Whether the solution can serve public and/or private IP addresses. If your clients are inside the same EC2 region then you want their lookups to resolve to the private IP address. Clients outside the same EC2 region should be served the public IP address.
  • Supply. Is there any practical limitation on the number of name resolution entries?
  • Cost. How much it costs to implement, including costs for idle resources and updating settings.
Methods of Name Resolution

As mentioned above, there are a number of different methods to manage name resolution. These are:
  • Traditional DNS.
  • Dynamically update the /etc/hosts file on the various application hosts. The /etc/hosts file on linux (like the C:\Windows\System32\Drivers\etc\hosts file on Windows) contains host-name-to-IP-address mappings that are checked before DNS is consulted, allowing it to override DNS. The file can be updated via pull (initiated by the host) or push (initiated by an external agent).
  • Store the mappings in S3 or SimpleDB. Clients must use the S3 or SimpleDB APIs for name resolution.
  • Use a dynamic DNS provider.
  • Run your own traditional DNS servers for your domain. Clients must be able to see these DNS servers.
  • Run your own dynamic DNS servers for your domain. Clients must be able to see these DNS servers.
  • Elastic IPs. The AWS pricing model discourages (though not strongly enough, I believe) Elastic IPs from being left unused, so you should use them for instances hosting services that are always on, such as your web server or your Facebook application. You should set up a DNS entry pointing the host names to the Elastic IPs, and then any remapping of the Elastic IP to a different instance happens via the EC2 API without requiring any change to DNS.
Here is a table (click on it to see it full size) showing how each of these name resolution methods stacks up against the others:



Notes:
  • Dynamically updating /etc/hosts can be used to store either the public IP or the private IP but not both for the same client. You can use one /etc/hosts file for your clients inside the same EC2 region which contains the private IPs, and a different but corresponding /etc/hosts file for your clients outside the EC2 region (or outside EC2 completely) which contains the public IPs. The propagation delay is governed by the frequency with which you update the /etc/hosts file on each client. You can minimize this delay by increasing the frequency of updates. This technique is described in detail in an article by Tim Dysinger. A minimal update sketch appears right after these notes.
  • Similarly, the two "run your own DNS" methods (Your Own DNS for your Domain, Your Own Dynamic DNS for your Domain) can be used to resolve to either the public IP address or the private IP address, but not both for the same client. You should set up your clients inside EC2 to utilize the DNS service inside EC2, and the domain should be configured to point to the DNS service running outside EC2 so that clients outside EC2 will see the public IPs. Note that clients running inside EC2 whose DNS resolution you do not control (for example, another EC2 user's client) will be referred to the public IPs. Jeff Roberts offers some great practical suggestions for running your own DNS inside EC2.
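To make the first note concrete, here is a minimal sketch of a pull-style update that could run from cron on each client. The central mappings URL and the marker comments are invented for this example, and this is a generic illustration rather than Tim Dysinger's exact technique:

# Fetch the current name-to-IP mappings and splice them into /etc/hosts
# between marker comments, leaving the rest of the file untouched.
curl -s -f -o /tmp/cloud-hosts http://config.example.com/cloud-hosts || exit 1
sed -i '/# BEGIN CLOUD HOSTS/,/# END CLOUD HOSTS/d' /etc/hosts
{
  echo "# BEGIN CLOUD HOSTS"
  cat /tmp/cloud-hosts
  echo "# END CLOUD HOSTS"
} >> /etc/hosts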
This table demonstrates the following:
  • Elastic IPs are the best choice when you need only a limited number of resolvable names and you will use them constantly. If you use their corresponding DNS name then they intelligently resolve to the public IP when looked up from the internet and to the private IP when looked up from within EC2.
  • If you need an unlimited number of resolvable names within EC2 then you should run your own dynamic DNS within EC2.
  • Methods that are incompatible with DNS should only be used with clients you control.
As we can see, Dynamic DNS (especially running your own) has one distinct advantage over using Elastic IPs: unlimited supply at no cost when unused.

When Running Your Own Dynamic DNS is Better than Elastic IPs

One application for running your own Dynamic DNS is a testing environment that includes large clusters of EC2 instances, for example database cluster or application nodes, connected to web layer instance(s). These cluster instances will only be visible to the front-end web tier, so they do not need a publicly resolvable IP address. And your testing environment is not likely to be running all the time. Elastic IPs would work here (presuming you needed only 5 or you could convince AWS to increase your Elastic IP limit to meet your needs), but would cost money when unused. A more economical solution might be to use your own Dynamic DNS within EC2 for these instances. If you have spare capacity on an existing instance then you can put the Dynamic DNS service there - otherwise you will need another instance, making the cost less attractive. In any case you'll need the instance hosting the Dynamic DNS to have an Elastic IP to allow failover without affecting the clients. And you'll need a script to dynamically configure the /etc/resolv.conf on your EC2 clients to point to the private IP address of the Dynamic DNS instance by looking up its Elastic IP's DNS name.

Let's compare the monthly costs of using Elastic IPs with the costs of running your own dynamic DNS for a testing environment such as the above. The cost reflects the following ingredients and assumptions:
  • The number of hours over the month that allocated addresses (DNS entries) are not associated with a live instance, in total for all allocated addresses. If you have 10 DNS entries / addresses and leave them unmapped for 10 hours each then you have 100 unmapped hours.
  • The number of changes to the DNS mappings made that month.
  • The fractional cost of running an instance just to serve the dynamic DNS. If you have spare capacity on an existing instance then this is the instance cost multiplied by the fraction of the capacity that the dynamic DNS service uses. If you need to spin up a dedicated instance for the dynamic DNS service then this is the entire cost of that instance.
  • Pricing for Elastic IPs: free when in use. 1 cent per hour unused. First 100 remaps per month free, 10 cents per remap afterward.
It should be obvious that using dynamic DNS for this testing environment will be economical when

FractionalDNSInstanceCost < NumUnmappedHours * 0.01 + MAX(NumMappingChanges - 100, 0) * 0.1

For simplicity's sake this can be rewritten in clearer terms:

FractionalDNSInstanceCost < NumInstances * ( NumHoursClusterUnused * 0.01 + MAX(NumTimesClusterIsLaunched - 100, 0) * 0.1)

Right about now I'm wishing Excel had better 3-D graphing capabilities. Here's something helpful to visualize this:



The chart shows the monthly cost of running clusters of different sizes according to how many times the cluster is launched. The color "bands" show the areas in which the monthly cost lies, depending on how many hours the cluster remains unused. For a given number of times launched (i.e. for a given vertical line), the "bottom" point of each band is the cost when the cluster is unused zero hours (i.e. always on), and the "top" point is the cost when the cluster is unused for 500 hours (about 20 days).

The dominant factors are, first, the number of instances in the cluster and, second, the number of times the cluster will be launched. A cluster of 100 instances costs $10 each time it is launched beyond the first 100 launches (plus $1 for each hour it sits unused). For large cluster sizes, the more times you launch, the higher the cost of using Elastic IPs will be and the more attractive the run-your-own dynamic DNS option becomes.

Thursday, September 24, 2009

Cool Things You Can Do with Shared EBS Snapshots

I've been awaiting this feature for a long time: Shared EBS Snapshots. Here's a brief intro to using the feature, and some cool things you can do with shared snapshots. I also offer predictions about things that will appear as this feature gains adoption among developers.

How to Share an EBS Snapshot

Really, it's easy. The first thing you'll need to know is the Account Number of the user with whom you want to share the snapshot. If you want to make the snapshot public then you don't need this. The account number can be found in the Your Account > Account Activity page. It's in small numbers in the top-right of the page (so small you may need to click on the image below to see it in full size):


The person with whom you want to share the snapshot (you are the sharer, they are the "sharee"?) should tell you this 12-digit number. Don't worry, sharee, it's not a secret.

Once you have the sharee's account number you, the sharer, go into the AWS Management Console and choose the Snapshots item. Find the snapshot you want to share and right-click on it, choosing "Snapshot Permissions". You'll get the following dialog:



Fill in the sharee's account number, without the separating dashes, into the dialog, and hit "Save". It should only take a few seconds and... presto! The snapshot should be visible in the sharee's AWS Management Console Snapshots page.
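If you prefer the command line to the console, the same permission change can be made through the EC2 API's ModifySnapshotAttribute action. A hedged sketch with the EC2 API tools follows; the snapshot ID and account number are placeholders, and you should confirm the exact flags with ec2-modify-snapshot-attribute --help:

# Grant the sharee's account permission to create volumes from the snapshot
ec2-modify-snapshot-attribute snap-12345678 -c -a 123456789012
# Verify that the permission is now listed
ec2-describe-snapshot-attribute snap-12345678 -c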

Cool Things You Can Do with Shared Snapshots

Update 27 September 2009: Before you share snapshots publicly, read Eric Hammond's warning about the dangers of doing so.

Easily move data between development, testing, and production

You've been keeping separate AWS accounts for your production environment, your testing environment, and your development environment, right? Right? Well, in case you haven't, you no longer have any excuse not to do so. You can now share your database, your HDFS volumes (if you use Cloudera's Hadoop distribution with EBS support), and anything else of significant size between these separate accounts. No more "tar, gzip, split into < 5GB chunks, upload to S3" and "download from S3, concatenate, untar-gzip". Your data is ready to go with the newly-created volume.
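On the receiving account, turning the shared snapshot into usable data is just the usual create-attach-mount sequence. Here's a rough sketch with the EC2 API tools; the IDs, zone, device, host name, and mount point are placeholders, and the volume must be created in the same Availability Zone as the instance:

# Create a volume from the shared snapshot
ec2-create-volume --snapshot snap-12345678 -z us-east-1a
# Attach it to the target instance, then mount it from within the instance
ec2-attach-volume vol-87654321 -i i-1a2b3c4d -d /dev/sdf
ssh root@ec2-203-0-113-25.compute-1.amazonaws.com "mkdir -p /data && mount /dev/sdf /data"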

Share entire setups for troubleshooting and support


If you support a product that is deployed in EC2 you no longer need to jump through hoops to get access to your customer's files when there's a problem. Simply have them put the relevant files into an EBS volume, snapshot it, and share the snapshot with you.

Deliver your application in a more granular manner

Until today you delivered your application as an AMI - perhaps even a DevPay AMI - and you may not have given your customers root access. But, if your application used less than 100% of an instance's CPU, the customer was stuck paying for the entire instance anyway. Now, you can distribute your application as a shared snapshot instead, and your customers will be free to use the rest of the instance's CPU. You'll just need to build a way to manage access, only allowing authorized customers to see the snapshot.

Deliver your customer's results in a more usable format

If you run a service that provides large amounts of data, you no longer need to use S3 to share the results. Until today you had to store the results in S3, and your customer needed to retrieve the results from S3 in order to use them. No longer: now you can provide a shared snapshot of the results, and the customer can access them via their filesystem more simply. "The shared snapshot is the new bucket."

Mount a volume created from a shared snapshot at startup

In a previous article I explained how to automatically mount an EBS volume created from a snapshot during the instance's startup sequence. I provided a script that gets the snapshot ID via the user-data and does all the rest automatically. Now you can also use snapshots that have been shared.

Update 25 September 2009: Share entire machines

Reader Robert Staveley (Tom) comments below about his use for shared snapshots: sharing entire machines - boot code and everything - between development, testing, and production accounts. Using the technique to boot an instance from an EBS volume, he points out that the entire bootable hard drive and all applications (even beyond 10GB) can be shared between these accounts.

Things to Expect in the Future

Shared snapshots are still a very new feature, but here are some things I expect to happen now that this is possible.
  • The AWS Management Console is currently the only UI that allows you to share a snapshot. ElasticFox will be adding this capability Real Soon Now, and I am sure other tools will as well.
  • Alternatives to AMIs. AMIs have many limitations, such as the 10GB maximum size, that can be circumvented using a technique I described to boot from an EBS volume. I expect to see OS distributions packaged as a shared EBS snapshot. These distributions could all share a common AMI containing just enough code to create a volume from the shared distribution snapshot, mount it, and boot from it. No more headaches bundling an AMI - just share a new bootable EBS snapshot.
    Update 3 December 2009: This prediction has come true, with AWS's release of EBS-backed AMIs.
  • Payment gateway services for managing access to shared snapshots. Now that you're distributing software as a shared snapshot you'll need to manage access to the snapshot, limiting it to authorized customers. You might build that system yourself today, but soon we'll see third-party services that do this for you.
Do you have other cool uses or predictions for shared snapshots? Please comment!

Monday, September 7, 2009

Solving Common ELB Problems with a Sanity Test

Help! My ELB isn't serving files!
Whoa! My back-end instances work but not the ELB!
Hey! I can't get the ELB to work!


These are among the most common Elastic Load Balancer problems raised on the Amazon EC2 Discussion Forums. Inspired by Eric Hammond's indispensable article Solving "I can't connect to my server on Amazon EC2", here is a helpful guide to debugging these common ELB issues, as well as a utility to perform sanity tests on your own ELBs.

Questions to Answer

You're trying to figure out what's wrong and you need to know where to start looking. Or, you're posting your problem on the AWS forums and you want help as quickly as possible. The best way to help yourself or to get help quickly is to examine the basic facts of your situation. Here are some questions to answer for yourself and in your forum post:
  1. What is the output of elb-describe-lbs elbName --show-xml ? This gives the basic details of the ELB, which are critical to diagnosing any problem. If you are posting to the forums and want to keep the DNS name of the ELB private then obscure it in the output. One reason to obscure the DNS name is to prevent readers from accessing your ELB-based service. However, this precaution does not add any security because the DNS information is public, and - presumably - you are using a DNS CNAME entry to integrate the ELB into your domain's DNS.
  2. What is the output of elb-describe-instance-health elbName ? This provides crucial information about the health of the instances.
  3. What resource are you trying to access via the ELB and what tool are you using to access it and from what location? The resource will likely be a URL of the form http://ELB-DNS-Name/index.html or maybe https://ELB-DNS-Name/index.html, or it might be "I'm running a POP server on port 1234". The tool you're using to access it is most likely a browser or HTTP client (Firefox, or wget), or possibly "Microsoft Outlook version 5.4". The location is either "my local machine" or "an EC2 instance". Also, can you access the same resource when you connect directly to a back-end instance via its public IP address or host name from a client outside EC2? A public-facing URL pointing directly to a back-end instance looks like this: http://ec2-123-213-123-31.compute-1.amazonaws.com/index.html . And, can you access the same resource when you connect directly to a back-end instance via its private IP address or host name from another instance within EC2? Such a URL looks like this: http://domU-12-31-34-00-69-B9.compute-1.internal/index.html .
  4. Can you access the health check resource directly via the ELB DNS name, and via the back-end instance's public IP address, and via the back-end instance's private IP address? If your health check is configured with target=HTTP:8080/check.html then try to access http://ELB-DNS-Name:8080/check.html (which is via the ELB) and http://ec2-123-213-123-31.compute-1.amazonaws.com:8080/check.html (which is via the instance's public IP address) and http://domU-12-31-34-00-69-B9.compute-1.internal:8080/check.html (which is via the instance's private IP address, and only accessible from within EC2).
  5. What are the security groups and availability zones for each instance in the ELB? This is visible in the output of ec2-describe-instances i-11111111 i-22222222 ... As above, you might want to obscure the public and private DNS names of these instances in the output.
  6. Can all the back-end instances receive traffic on the instance ports of the ELB listeners and the health check? This can be checked from the output of ec2-describe-group groupName1 groupName2 ... for all the groups shown in question 5's ec2-describe-instances command.
  7. Do logs on your back-end instances show any connections from ELB?

Common ELB Problems

Okay, now that you know what information is important to diagnosing the problem, here is a look at some of the common gotchas, how to detect them, and how to fix them. These descriptions refer to the above questions by number.

Common problems and solutions include:
  • Security groups on back-end instances don't allow access to the instance ports and health check port. Back-end instances must have all ports on which they receive traffic from the ELB (#1) open to CIDR 0.0.0.0/0 in one of their associated security groups (#6). Fix this by changing the permissions on the security groups associated with the instances (see the sketch after this list). Note: this fix takes effect within a few seconds and does not require launching new instances or rebooting existing ones.
  • Back-end instances are not healthy (InService). When an instance fails the health check (#1) it is marked as OutOfService (#2) and the ELB does not route traffic to it anymore. To fix this you need to determine why the ELB cannot access the health check resource. Note: there is currently a bug in ELB where instances initially are marked as InService when added to the ELB, until they fail the health check. So you'll want to make sure you've given ELB enough time to detect a failed health check.
  • An availability zone is enabled on the ELB but has no healthy back-end instances. If you have an availability zone enabled for your ELB (#1) but no healthy instances in that availability zone (#5 and #2), you'll get 503 Service Unavailable or other errors. Fix this by adding an instance in that availability zone to the ELB or disabling that availability zone for the ELB.
  • You cannot see a requested resource (#3) or the health check URL (#4) using the ELB DNS name. In this case, check that the URL exists on the back-end instances and look at the back-end instance's logs (#7) to see if the ELB forwarded your connection or not. If you can see the requested resource using the public address of a back-end instance then check the instance's security groups (#6) to see that they grant access to the instance's port.
  • The health check port is not the same as listener target port (#1). While this does not necessarily indicate a problem, for most ELBs the health check should use the same port as one of the listeners. Setting up your ELB to have a health check performed on a different port than the load-balanced traffic is perfectly valid, but you likely want the health check to use the same path that the load-balanced traffic takes to reach your app (and also to exercise a representative set of features used by your app).
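For the first problem above, the command-line fix is a one-liner per port; the group name and port are placeholders:

# Open the ELB's instance port (and health check port, if different) to the world
ec2-authorize my-backend-group -p 8080 -s 0.0.0.0/0
ec2-describe-group my-backend-group   # verify the new permission is listed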
I will update this article with new common issues as they appear.

An ELB Sanity Test Utility

If you have your thinking cap on you'll notice that detecting the first three of the common ELB problems can be automated. Here is an ELB sanity test utility for Linux which automates these tests. Download it as follows:

curl -o elb-sanity-test.tar.gz -L https://sites.google.com/site/shlomosfiles/clouddevelopertips/elb-sanity-test.tar.gz?attredirects=0

Next, unpack it:

tar xzf elb-sanity-test.tar.gz
cd elb-sanity-test


Next, set up the utility with your credentials. Edit the elb-sanity-test script file, setting AWS_CREDENTIAL_FILE to point to a file containing your AWS credentials in the following format:

AWSAccessKeyId=your-Access-Key-Id
AWSSecretKey=your-Secret-Key


The above is the same format that can be used to specify your AWS credentials for the ELB API Tools (see the README.TXT and credential-file-path.template file in the ELB API Tools bundle).

To run the ELB sanity test:

cd elb-sanity-test
./elb-sanity-test


Here is sample output showing an ELB that passes the sanity test:
$ ./elb-sanity-test
JUnit version 4.5
Test: all instances have their Security Groups defined to allow access to the ELB listener port
Load Balancer: someLB
ELB someLB has a listener that uses instance-port 8080 and instance i-360ef05e has that TCP port open to the world.
ELB someLB has a listener that uses instance-port 8081 and instance i-360ef05e has that TCP port open to the world.
Test: all ELBs have a HealthCheck on a port that the listener directs traffic to
Load Balancer: someLB
ELB someLB has a configured HealthCheck on listener port 8080
Test: all ELBs have InService instances in each configured availability zone
Load Balancer: someLB
ELB someLB has InService instances in each configured availability zone
Time: 5.22
Tests run: 3, Failures: 0
The elb-sanity-test utility performs the following sanity tests on every ELB defined in your account:
  • All instances have their security groups defined to allow access to the ELB listener port.
  • All ELBs have a health check on a port that the listener directs traffic to.
  • All ELBs have healthy instances in each configured availability zone.
If a sanity test fails the utility shows a very verbose error message explaining what is wrong.

Some notes about the elb-sanity-test bundle:
  • The utility is written in Java, which is also required for the ELB tools. If you can run the ELB API Tools, you already have all the prerequisites to run this sanity test.
  • The bundle includes source code and is licensed under the Apache License, Version 2.0.
  • The bundle includes all dependency jars necessary to run the script. It uses the JUnit framework and the Typica library.
I would be happy to re-bundle the utility to include a .bat or .cmd file to make it easy to run the script on Windows. If you write one, please add it in the comments and I'll include it.

Getting Further Help

If you still have an ELB issue after trying the above advice and the elb-sanity-test utility, please post in the AWS EC2 forum. Questions about the elb-sanity-test utility specifically or about this article are welcome in the comments below.

Update 15 September 2009: Ylastic integrated my elb-sanity-test script into their EC2 management dashboard.

Update 11 October 2009: elb-sanity-test has been released as part of the open-source ec2-elb-tests project hosted on Google Code. And, if you use this utility, please subscribe to the ec2-elb-tests Google Group.

Monday, August 31, 2009

How to Keep Your AWS Credentials on an EC2 Instance Securely

If you've been using EC2 for anything serious then you have some code on your instances that requires your AWS credentials. I'm talking about code that does things like this:
  • Attach an EBS volume (requires your X.509 certificate and private key)
  • Download your application from a non-public location in S3 (requires your secret access key)
  • Send and receive SQS messages (requires your secret access key)
  • Query or update SimpleDB (requires your secret access key)
How do you get the credentials onto the instance in the first place? How can you store them securely once they're there? First let's examine the issues involved in securing your keys, and then we'll explore the available options for doing so.

Potential Vulnerabilities in Transferring and Storing Your Credentials

There are a number of vulnerabilities that should be considered when trying to protect a secret. I'm going to ignore the ones that result from obviously foolish practice, such as transferring secrets unencrypted.
  1. Root: root can get at any file on an instance and can see into any process's memory. If an attacker gains root access to your instance, and your instance can somehow know the secret, your secret is as good as compromised.
  2. Privilege escalation: User accounts can exploit vulnerabilities in installed applications or in the kernel (whose latest privilege escalation vulnerability was patched in new Amazon Kernel Images on 28 August 2009) to gain root access.
  3. User-data: Any user account able to open a socket on an EC2 instance can see the user-data by getting the URL http://169.254.169.254/latest/user-data . This is exploitable if a web application running in EC2 does not validate input before visiting a user-supplied URL. Accessing the user-data URL is particularly problematic if you use the user-data to pass in the secret unencrypted into the instance - one quick wget (or curl) command by any user and your secret is compromised. And, there is no way to clear the user-data - once it is set at launch time, it is visible for the entire life of the instance.
  4. Repeatability: HTTPS URLs transport their content securely, but anyone who has the URL can get the content. In other words, there is no authentication on HTTPS URLs. If you specify an HTTPS URL pointing to your secret it is safe in transit but not safe from anyone who discovers the URL.
Benefits Offered by Transfer and Storage Methods

Each transfer and storage method offers a different set of benefits. Here are the benefits against which I evaluate the various methods presented below:
  1. Easy to do. It's easy to create a file in an AMI, or in S3. It's slightly more complicated to encrypt it. But, you should have a script to automate the provision of new credentials, so all of the methods are graded as "easy to do".
  2. Possible to change (now). Once an instance has launched, can the credentials it uses be changed?
  3. Possible to change (future). Is it possible to change the credentials that will be used by instances launched in the future? All methods provide this benefit but some make it more difficult to achieve than others, for example instances launched via Auto Scaling may require the Launch Configuration to be updated.

How to Put AWS Credentials on an EC2 Instance


With the above vulnerabilities and benefits in mind let's look at different ways of getting your credentials onto the instance and the consequences of each approach.

Mitch Garnaat has a great set of articles about the AWS credentials. Part 1 explores what each credential is used for, and part 2 presents some methods of getting them onto an instance, the risks involved in leaving them there, and a strategy to mitigate the risk of them being compromised. A summary of part 1: keep all your credentials secret, like you keep your bank account info secret, because they are - literally - the keys to your AWS kingdom.

As discussed in part 2 of Mitch's article, there are a number of methods to get the credentials (or indeed, any secret) onto an instance:

1. Burn the secret into the AMI

Pros:
  • Easy to do.
Cons:
  • Not possible to change (now) easily. Requires SSHing into the instance, updating the secret, and forcing all applications to re-read it.
  • Not possible to change (future) easily. Requires bundling a new AMI.
  • The secret can be mistakenly bundled into the image when making derived AMIs.
Vulnerabilities:
  • root, privilege escalation.
2. Pass the secret in the user-data

Pros:
  • Easy to do. Putting the secret into the user-data must be integrated into the launch procedure.
  • Possible to change (future). Simply launch new instances with updated user-data. With Auto Scaling, create a new Launch Configuration with the updated user-data.
Cons:
  • Not possible to change (now). User-data cannot be changed once an instance is launched.
Vulnerabilities:
  • user-data, root, privilege escalation.

Here are some additional methods to transfer a secret to an instance, not mentioned in the article:

3. Put the secret in a public URL
The URL can be on a website you control or in S3. It's insecure and foolish to keep secrets in a publicly accessible URL. Please don't do this; I mention it only to be comprehensive.

Pros:
  • Easy to do.
  • Possible to change (now). Simply update the content at that URL. Any processes on the instance that read the secret each time will see the new value once it is updated.
  • Possible to change (future).
Cons:
  • Completely insecure. Any attacker between the endpoint and the EC2 boundary can see the packets and discover the URL, revealing the secret.
Vulnerabilities:
  • repeatability, root, privilege escalation.
4. Put the secret in a private S3 object and provide the object's path
To get content from a private S3 object you need the secret access key in order to authenticate with S3. The question then becomes "how to put the secret access key on the instance", which you need to do via one of the other methods.

Pros:
  • Easy to do.
  • Possible to change (now). Simply update the S3 object at that path. Any process on the instance that re-reads the secret will see the new value once it is updated.
  • Possible to change (future).
Cons:
  • Inherits the cons of the method used to transfer the secret access key.
Vulnerabilities:
  • root, privilege escalation.

5. Put the secret in a private S3 object and provide a signed HTTPS S3 URL
The signed URL must be created before launching the instance and specified somewhere that the instance can access - typically in the user-data. The signed URL expires after some time, limiting the window of opportunity for an attacker to access the URL. The URL should be HTTPS so that the secret cannot be sniffed in transit.

Pros:
  • Easy to do. The S3 URL signing must be integrated into the launch procedure.
  • Possible to change (now). Simply update the content at that URL. Any processes on the instance that read the secret each time will see the new value once it is updated.
  • Possible to change (future). In order to integrate with Auto Scaling you would need to (automatically) update the Auto Scaling Group's Launch Configuration to provide an updated signed URL for the user-data before the previously specified signed URL expires.
Cons:
  • The secret must be cached on the instance. Once the signed URL expires the secret cannot be fetched from S3 anymore, so it must be stored on the instance somewhere. This may make the secret liable to be burned into derived AMIs.
Vulnerabilities:
  • repeatability (until the signed URL expires), root, privilege escalation.
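
To give a feel for what the signing involves, here is a rough sketch of generating an expiring S3 HTTPS URL using S3's query string authentication (an HMAC-SHA1 signature over a fixed string-to-sign; consult the S3 documentation for the authoritative rules). The bucket, key, and expiry below are placeholders.

import base64, hashlib, hmac, time, urllib.parse

def sign_s3_url(access_key_id, secret_access_key, bucket, key, expires_in=240):
    # The URL stops working once the Expires timestamp passes.
    expires = int(time.time()) + expires_in
    # Canonical string-to-sign for a simple GET: verb, MD5, content type, expiry, resource.
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    signature = base64.b64encode(
        hmac.new(secret_access_key.encode("utf-8"),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha1).digest()).decode("ascii")
    query = urllib.parse.urlencode({
        "AWSAccessKeyId": access_key_id,
        "Expires": str(expires),
        "Signature": signature,
    })
    # The path-style host avoids the SSL certificate problem with dotted bucket names.
    return "https://s3.amazonaws.com/%s/%s?%s" % (bucket, key, query)
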
6. Put the secret on the instance from an outside source, via SCP or SSH
This method involves an outside client - perhaps your local computer, or a management node - whose job it is to put the secret onto the newly-launched instance. The management node must have the private key with which the instance was launched, and must know the secret in order to transfer it. This approach can also be automated, by having a process on the management node poll every minute or so for newly-launched instances.

Pros:
  • Easy to do. OK, not "easy" because it requires an outside management node, but it's doable.
  • Possible to change (now). Have the management node put the updated secret onto the instance.
  • Possible to change (future). Simply put a new secret onto the management node.
Cons:
  • The secret must be cached somewhere on the instance because it cannot be "pulled" from the management node when needed. This may make the secret liable to be burned into derived AMIs.
Vulnerabilities:
  • root, privilege escalation.
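
A rough sketch of the push from the management node, assuming the standard OpenSSH scp client and that you discover new instance addresses some other way (for example, by polling the EC2 API); the user name and paths are placeholders.

import subprocess

def push_secret(instance_address,
                secret_file="/secure/credentials",        # illustrative path on the management node
                ssh_key="/secure/launch-keypair.pem"):     # key pair used to launch the instance
    subprocess.check_call([
        "scp", "-i", ssh_key,
        "-o", "StrictHostKeyChecking=no",   # first contact with a brand-new host
        secret_file,
        "root@%s:/etc/myapp/credentials" % instance_address,
    ])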

The above methods can be used to transfer the credentials - or any secret - to an EC2 instance.

Instead of transferring the secret directly, you can transfer an encrypted secret. In that case you also need to provide a decryption key, using one of the above methods. The overall security of the secret is then determined by the combination of methods used to transfer the encrypted secret and the decryption key. For example, if you encrypt the secret and pass it in the user-data, with the decryption key in a file burned into the AMI, the secret is vulnerable to anyone with access to both the user-data and the file containing the decryption key. Also, if you encrypt your credentials then changing the encryption key means changing two items: the key itself and the encrypted credentials. So changing the encryption key can never be easier than changing the credentials themselves.
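
For example, the encryption step could be as simple as wrapping the openssl command-line tool. This sketch assumes openssl is installed and that you generate and manage the key file yourself; the file names are placeholders.

import subprocess

def encrypt_credentials(plain_path, encrypted_path, key_path):
    # Symmetric encryption with a key file; the same command with -d reverses it on the instance.
    subprocess.check_call([
        "openssl", "enc", "-aes-256-cbc", "-salt",
        "-in", plain_path, "-out", encrypted_path,
        "-pass", "file:%s" % key_path,
    ])

def decrypt_credentials(encrypted_path, plain_path, key_path):
    subprocess.check_call([
        "openssl", "enc", "-d", "-aes-256-cbc",
        "-in", encrypted_path, "-out", plain_path,
        "-pass", "file:%s" % key_path,
    ])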

How to Keep AWS Credentials on an EC2 Instance

Once your credentials are on the instance, how do you keep them there securely?

First off, let's remember that in an environment out of your control, such as EC2, you have no guarantees of security. Anything processed by the CPU or put into memory is vulnerable to bugs in the hypervisor (the virtualization provider), to malicious AWS personnel (though the AWS Security White Paper goes to great lengths to explain the internal procedures and controls they have implemented to mitigate that possibility), or to legal search and seizure. What this means is that you should only run applications in EC2 for which the risk of secrets being exposed via these vulnerabilities is acceptable. This is true of all applications and data that you allow to leave your premises.

But this article is about the security of the AWS credentials, which control the access to your AWS resources. It is perfectly acceptable to ignore the risk of AWS personnel exposing your credentials, because AWS folks can manipulate your account resources without needing your credentials! In short, if you are willing to use AWS then you trust Amazon with your credentials.

There are three ways to store information on a running machine: on disk, in memory, and not at all.

1. Keeping a secret on disk
The secret is stored in a file on disk, with appropriate permissions set on the file. The secret survives a reboot intact, which can be a pro or a con: it's a good thing if you want the instance to remain in service through a reboot, but it's a bad thing if you're trying to hide the location of the secret from an attacker, because the boot process contains the script that retrieves and caches the secret, revealing its cached location. You can work around this by having the retrieval script remove the traces of the secret's location after it has done its work. But applications still need to access the secret somehow, so it remains vulnerable.

Pros:
  • Easily accessible by applications on the instance.
Cons:
  • Visible to any process with the proper permissions.
  • Easy to forget when bundling an AMI of the instance.
Vulnerabilities:
  • root, privilege escalation.
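
For reference, "appropriate permissions" just means creating the file readable only by the account that needs it. A minimal sketch, in which the path and ownership arguments are placeholders:

import os

def write_secret_file(path, secret, uid, gid):
    # Create the file with mode 0400 from the start, rather than chmod-ing it afterwards.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, "w") as f:
        f.write(secret)
    os.chown(path, uid, gid)   # hand the file to the application's user
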
2. Keeping the secret in memory
The secret is stored as a file on a ramdisk. (There are other memory-based methods, too.) The main difference between storing the secret in memory and on the filesystem is that memory does not survive a reboot. If, after the startup scripts run during the first boot, you remove from them the traces of retrieving and storing the secret, the secret exists only in memory. This can make the secret harder for an attacker to discover, but it is obscurity rather than additional security.

Pros:
  • Easily accessible by applications on the instance.
Cons:
  • Visible to any process with the proper permissions.
Vulnerabilities:
  • root, privilege escalation.
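
A minimal sketch of the memory-backed variant, assuming a Linux instance where /dev/shm is a tmpfs mount (a dedicated ramdisk mount works the same way); the path is a placeholder:

import os

RAM_SECRET_PATH = "/dev/shm/myapp-credentials"   # tmpfs-backed, gone after a reboot

def write_secret_to_ramdisk(secret):
    fd = os.open(RAM_SECRET_PATH, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, "w") as f:
        f.write(secret)
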
3. Do not store the secret; retrieve it each time it is needed
This method requires your applications to support the chosen transfer method.

Pros:
  • Secret is never stored on the instance.
Cons:
  • Requires more time because the secret must be fetched each time it is needed.
  • Cannot be used with signed S3 URLs. These URLs expire after some time and the secret will no longer be accessible. If the URL does not expire in a reasonable amount of time then it is as insecure as a public URL.
  • Cannot be used with externally-transferred (via SSH or SCP) secrets because the secret cannot be pulled from the management node. Any protocol that allows the instance to pull the secret from the management node could also be used by an attacker to request the secret.
Vulnerabilities:
  • root, privilege escalation.
Choosing a Method to Transfer and Store Your Credentials

The above two sections explore some options for transferring and storing a secret on an EC2 instance. If the secret is guarded by another key - such as an encryption key or an S3 secret access key - then this key must also be kept secret and transferred and stored using one of those same methods. Let's put all this together into some tables presenting the viable options.

Unencrypted Credentials

Here is a summary table evaluating the transfer and storage of unencrypted credentials using different combinations of methods:

Transferring and Keeping Unencrypted Credentials

Some notes on the above table:
  • Methods making it "hard" to change credentials are highlighted in yellow because, through scripting, the difficulty can be minimized. Similarly, the risk of forgetting credentials in an AMI can be minimized by scripting the AMI creation process and choosing a location for the credential file that is excluded from the AMI by the script.
  • While you can transfer credentials using a private S3 URL, you still need to provide the secret access key in order to access that private S3 URL. This secret access key must also be transferred and stored on the instance, so the private S3 URL is not usable by itself; therefore the Private S3 URL entries are marked N/A. See below for an analysis of using a private S3 URL to transfer credentials.
  • You can burn credentials into an AMI and store them in memory. The startup process can remove them from the filesystem and place them in memory. The startup process should then remove all traces from the startup scripts mentioning the key's location in memory, in order to make discovery more difficult for an attacker with access to the startup scripts.
  • Credentials burned into the AMI cannot be "not stored". They can be erased from the filesystem, but must be stored somewhere in order to be usable by applications. Therefore these entries are marked as N/A.
  • Credentials transferred via a signed S3 URL cannot be "not stored" because the URL expires and, once that happens, is no longer able to provide the credentials. Thus, these entries are marked N/A.
  • Credentials "pushed" onto the instance from an outside source, such as SSH, cannot be "not stored" because they must be accessible to applications on the instance. These entries are marked N/A.
A glance at the above table shows that it is, overall, not difficult to manage unencrypted credentials via any of the methods. Remember: don't use the Public URL method; it's completely insecure.

Bottom line: If you don't care about keeping your credentials encrypted then pass a signed S3 HTTPS URL in the user-data. The startup scripts of the instance should retrieve the credentials from this URL and store them in a file with appropriate permissions (or in a ramdisk if you don't want them to survive a reboot), and then remove their own commands for getting and storing the credentials. Applications should read the credentials from the file (or directly from the signed URL, if you don't care that it will stop working after it expires).
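
Here is a minimal sketch of that startup sequence, assuming the user-data contains nothing but the signed URL, the fetch script is invoked from /etc/rc.local, and the credentials land in a root-owned file; all of the paths are placeholders, so adapt them to your own setup.

import os
import sys
import urllib.request

USER_DATA_URL = "http://169.254.169.254/latest/user-data"
CREDENTIALS_PATH = "/etc/myapp/credentials"          # illustrative location

def main():
    # 1. The user-data contains nothing but the signed S3 HTTPS URL.
    with urllib.request.urlopen(USER_DATA_URL) as r:
        signed_url = r.read().decode("utf-8").strip()
    # 2. Fetch the credentials over https before the signed URL expires.
    with urllib.request.urlopen(signed_url) as r:
        credentials = r.read()
    # 3. Store them with restrictive permissions.
    fd = os.open(CREDENTIALS_PATH, os.O_WRONLY | os.O_CREAT, 0o400)
    with os.fdopen(fd, "wb") as f:
        f.write(credentials)
    # 4. Remove the traces: drop our own invocation from rc.local and delete this script.
    with open("/etc/rc.local") as f:
        remaining = [line for line in f if os.path.basename(sys.argv[0]) not in line]
    with open("/etc/rc.local", "w") as f:
        f.writelines(remaining)
    os.remove(sys.argv[0])

if __name__ == "__main__":
    main()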

Encrypted Credentials

We discussed 6 different ways of transferring credentials and 3 different ways of storing them. A transfer method and a storage method must be chosen for the encrypted credentials and for the decryption key. That gives us 6 × 6 = 36 combinations of transfer methods and 3 × 3 = 9 combinations of storage methods, for a grand total of 36 × 9 = 324 choices.

Here are the first 54, summarizing the options when you choose to burn the encrypted credentials into the AMI:



As (I hope!) you can see, all combinations that involve burning encrypted credentials into the AMI make it hard (or impossible) to change the credentials or the encryption key, both on running instances and for future ones.

Here are the next set, summarizing the options when you choose to pass encrypted credentials via the user-data:



Passing encrypted credentials in the user-data requires the decryption key to be transferred also. It's pointless from a security perspective to pass the decryption key together with the encrypted credentials in the user-data. The most flexible option in the above table is to pass the decryption key via a signed S3 HTTPS URL (specified in the user-data, or specified at a public URL burned into the AMI) with a relatively short expiry time (say, 4 minutes) allowing enough time for the instance to boot and retrieve it.

Here is a summary of the combinations when the encrypted credentials are passed via a public URL:



It might be surprising, but passing encrypted credentials via a public URL is actually a viable option. You just need to make sure you send and store the decryption key securely, so send that key via a signed S3 HTTPS URL (specified in the user-data or at a public URL burned into the AMI) for maximum flexibility.

The combinations with passing the encrypted credentials via a private S3 URL are summarized in this table:



As explained earlier, the private S3 URL is not usable by itself because it requires the AWS secret access key. (The access key ID is not a secret.) The secret access key can be transferred and stored using the combinations of methods shown in the above table.

The most flexible of the options shown in the above table is to pass in the secret access key inside a signed S3 HTTPS URL (which is itself provided in the user-data or at a public URL burned into the AMI).

Almost there.... This next table summarizes the combinations with encrypted credentials passed via a signed S3 HTTPS URL:



The signed S3 HTTPS URL containing the encrypted credentials can be specified in the user-data or specified behind a public URL which is burned into the AMI. The best options for providing the decryption key are via another signed URL or from an external management node via SSH or SCP.

And, the final section of the table summarizing the combinations of using encrypted credentials passed in via SSH or SCP from an outside management node:



The above table, summarizing the use of an external management node to place encrypted credentials on the instance, shows exactly the same results as the previous table (for a signed S3 HTTPS URL). The same flexibility is achieved with either method.

The Bottom Line

Here's a practical recommendation: if you have code that generates signed S3 HTTPS URLs then pass two signed URLs in the user-data, one pointing to the encrypted credentials and the other pointing to the decryption key. The startup sequence of the AMI should read these two items from their URLs, decrypt the credentials, and store them in a ramdisk file with the minimum permissions necessary for the applications to run. The startup scripts should then remove from themselves all traces of this procedure - everything from "read the URLs from the user-data" through the cleanup commands themselves.
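
A minimal sketch of that startup sequence, under assumptions similar to the earlier snippets: the user-data holds the two signed URLs (encrypted credentials first, decryption key second), openssl is available for decryption, and /dev/shm serves as the ramdisk.

import os
import subprocess
import urllib.request

USER_DATA_URL = "http://169.254.169.254/latest/user-data"
CREDS_PATH = "/dev/shm/credentials"     # memory-backed; gone after a reboot

def fetch(url, path, mode=0o400):
    # Download a signed URL and store the result with restrictive permissions.
    with urllib.request.urlopen(url) as r:
        data = r.read()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, mode)
    with os.fdopen(fd, "wb") as f:
        f.write(data)

def main():
    # Line 1: signed URL of the encrypted credentials; line 2: signed URL of the decryption key.
    with urllib.request.urlopen(USER_DATA_URL) as r:
        creds_url, key_url = r.read().decode("utf-8").split()
    fetch(creds_url, "/dev/shm/credentials.enc")
    fetch(key_url, "/dev/shm/credentials.key")
    # Decrypt into the ramdisk, then remove the intermediate files.
    subprocess.check_call([
        "openssl", "enc", "-d", "-aes-256-cbc",
        "-in", "/dev/shm/credentials.enc", "-out", CREDS_PATH,
        "-pass", "file:/dev/shm/credentials.key",
    ])
    os.chmod(CREDS_PATH, 0o400)
    os.remove("/dev/shm/credentials.enc")
    os.remove("/dev/shm/credentials.key")
    # Finally, scrub this script and its invocation as in the earlier sketch (omitted here).

if __name__ == "__main__":
    main()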

If you don't have code to generate signed S3 URLs then burn the encrypted credentials into the AMI and pass the decryption key via the user-data. As above, the startup sequence should decrypt the credentials, store them in a ramdisk, and destroy all traces of the raw ingredients and the process itself.

This article is an informal review of the benefits and vulnerabilities offered by different methods of transferring credentials to and storing credentials on an EC2 instance. In a future article I will present scripts to automate the procedures described. In the meantime, please leave your feedback in the comments.

Tuesday, August 18, 2009

Amazon S3 Gotcha: Using Virtual Host URLs with HTTPS

Amazon S3 is a great place to store static content for your web site. If the content is sensitive you'll want to prevent the content from being visible while in transit from the S3 servers to the client. The standard way to secure the content during transfer is by https - simply request the content via an https URL. However, this approach has a problem: it does not work for content in S3 buckets that are accessed via a virtual host URL. Here is an examination of the problem and a workaround.

Accessing S3 Buckets via Virtual Host URLs


S3 provides two ways to access your content. One way uses s3.amazonaws.com host name URLs, such as this:

http://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey

The other way to access your S3 content uses a virtual host name in the URL:

http://mybucket.mydomain.com.s3.amazonaws.com/myObjectKey

Both of these URLs map to the same object in S3.

You can make the virtual host name URL shorter by setting up a DNS CNAME that maps mybucket.mydomain.com to mybucket.mydomain.com.s3.amazonaws.com. With this DNS CNAME alias in place, the above URL can also be written as follows:

http://mybucket.mydomain.com/myObjectKey

This shorter virtual host name URL works only if you set up the DNS CNAME alias for the bucket.

Virtual host names in S3 are a convenient feature because they allow you to hide the actual location of the content from the end-user: you can provide the URL http://mybucket.mydomain.com/myObjectKey and then freely change the DNS entry for mybucket.mydomain.com (to point to an actual server, perhaps) without changing the application. With the CNAME alias pointing to mybucket.mydomain.com.s3.amazonaws.com, end-users do not know that the content is actually being served from S3. Without the DNS CNAME alias you'll need to explicitly use one of the URLs that contain s3.amazonaws.com in the host name.

The Problem with Accessing S3 via https URLs

https encrypts the transferred data and prevents it from being recovered by anyone other than the client and the server. Thus, it is the natural choice for applications where protecting the content in transit is important. However, https relies on internet host names for verifying the identity certificate of the server, and so it is very sensitive to the host name specified in the URL.

To illustrate this more clearly, consider the servers at s3.amazonaws.com. They all have a certificate issued to *.s3.amazonaws.com. ["Huh?" you say. Yes, the SSL certificate for a site specifies the host name that the certificate represents. Part of the handshaking that sets up the secure connection ensures that the host name of the certificate matches the host name in the request. The * indicates a wildcard certificate, and means that the certificate is also valid for any single-level subdomain.] If you request the https URL https://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey, then the certificate's host name matches the requested URL's host name component, and the secure connection can be established.

If you request an object in a bucket without any periods in its name via a virtual host https URL, things also work fine. The requested URL can be https://aSimpleBucketName.s3.amazonaws.com/myObjectKey. This request will arrive at an S3 server (whose certificate was issued to *.s3.amazonaws.com), which will notice that the URL's host name is indeed a subdomain of s3.amazonaws.com, and the secure connection will succeed.

However, if you request the virtual host URL https://mybucket.mydomain.com.s3.amazonaws.com/myObjectKey, what happens? The host name component of the URL is mybucket.mydomain.com.s3.amazonaws.com, but the actual server that gets the request is an S3 server whose certificate was issued to *.s3.amazonaws.com. Is mybucket.mydomain.com.s3.amazonaws.com a subdomain of s3.amazonaws.com? It depends whom you ask, but most up-to-date browsers and SSL implementations will say "no": the wildcard matches only a single name component, so a multi-level subdomain - one whose relative part itself contains periods, like mybucket.mydomain.com - is not accepted by recent Firefox, Internet Explorer, Java, and wget clients. The client will therefore report that the server's SSL certificate, issued to *.s3.amazonaws.com, does not match the host name of the request, mybucket.mydomain.com.s3.amazonaws.com, and refuse to establish a secure connection.

The same problem occurs when you request the virtual host https URL https://mybucket.mydomain.com/myObjectKey. The request arrives - after the client discovers that mybucket.mydomain.com is a DNS CNAME alias for mybucket.mydomain.com.s3.amazonaws.com - at an S3 server with an SSL certificate issued to *.s3.amazonaws.com. In this case the host name mybucket.mydomain.com clearly does not match the host name on the certificate, so the secure connection again fails.

Here is what a failed certificate check looks like in Firefox 3.5, when requesting https://images.mydrifts.com.s3.amazonaws.com/someContent.txt:


Here is what happens in Java:
javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative DNS name matching media.mydrifts.com.s3.amazonaws.com found.
at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1591)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:187)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:181)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:975)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:123)
at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:516)
at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:454)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:884)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1096)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1123)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1107)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:405)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:166)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:373)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching media.mydrifts.com.s3.amazonaws.com found.
at sun.security.util.HostnameChecker.matchDNS(HostnameChecker.java:193)
at sun.security.util.HostnameChecker.match(HostnameChecker.java:77)
at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:264)
at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:250)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:954)

And here is what happens in wget:
$ wget -nv https://media.mydrifts.com.s3.amazonaws.com/someContent.txt
ERROR: Certificate verification error for media.mydrifts.com.s3.amazonaws.com: unable to get local issuer certificate
ERROR: certificate common name `*.s3.amazonaws.com' doesn't match requested host name `media.mydrifts.com.s3.amazonaws.com'.
To connect to media.mydrifts.com.s3.amazonaws.com insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.

Requesting the https URL using the DNS CNAME images.mydrifts.com results in the same errors, with the messages saying that the certificate *.s3.amazonaws.com does not match the requested host name images.mydrifts.com.

Notice that the browser and wget clients offer a way to circumvent the mismatched SSL certificate. You could, theoretically, ask your users to add an exception to the browser's security settings. However, most web users are scared off by a "This Connection is Untrusted" message and will turn away when confronted with that screen.

How to Access S3 via https URLs

As pointed out above, there are two forms of S3 URLs that work with https:

https://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey

and this:

https://simplebucketname.s3.amazonaws.com/myObjectKey

So, in order to get https to work seamlessly with your S3 buckets, you need to either:
  • choose a bucket whose name contains no periods and use the virtual host URL, such as https://simplebucketname.s3.amazonaws.com/myObjectKey or
  • use the URL form that specifies the bucket name separately, after the host name, like this: https://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey.
Update 25 Aug 2009: For buckets created via the CreateBucketConfiguration API call, the only option is to use the virtual host URL. This is documented in the S3 docs here.
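
If you build these URLs in code, a small helper can keep the choice in one place. This is just a sketch of the rule above, not an official API, and it does not cover the location-constrained buckets mentioned in the update:

def https_s3_url(bucket, key):
    # Bucket names containing periods hit the wildcard-certificate problem over https,
    # so fall back to the path-style URL for them. Note the update above: buckets created
    # via CreateBucketConfiguration can only be addressed by the virtual host form.
    if "." in bucket:
        return "https://s3.amazonaws.com/%s/%s" % (bucket, key)
    return "https://%s.s3.amazonaws.com/%s" % (bucket, key)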