Sunday, September 27, 2009

Alternatives to Elastic IPs for EC2 Name Resolution

How can you handle DNS lookups in EC2 without going crazy each time a resource's IP address changes? One solution is to use an Elastic IP, a stable IP address that can be remapped to different instances, but Elastic IPs are not appropriate for all situations. This article explores the various methods of managing name resolution with EC2 instances.

Features of Different Name Resolution Methods

Before diving into the methods themselves let's take a look at the factors to consider when evaluating methods of managing name resolution. Here are the factors:
  • Updatable in code. You will want to write code to make changes to the name resolution settings automatically, in response to infrastructure events (e.g. launching a new server).
  • Propagation delay. It can take some time for changes to name resolution settings to propagate (especially with DNS). A solution should offer some degree of assurance that changes will propagate within a known and reasonable period of time. [Note that some clients (e.g. the IE browser or the Java rutime) by default ignore the DNS TTL, artificially increasing the propagation delay for DNS-based methods.]
  • Compatible with DNS. If your service will be accessed by a web browser or other client that you do not control, your name resolution method will need to be compatible with DNS. Otherwise clients will not be able to resolve your hostnames properly.
  • Ease of implementation. Some solutions, while technically sufficient, are difficult to implement.
  • Public / Private IP addresses. Whether the solution can serve public and/or private IP addresses. If your clients are inside the same EC2 region then you want their lookups to resolve to the private IP address. Clients outside the same EC2 region should be served the public IP address.
  • Supply. Is there any practical limitation on the number of name resolution entries?
  • Cost. How much it costs to implement, including costs for idle resources and updating settings.
Methods of Name Resolution

As mentioned above, there are a number of different methods to manage name resolution. These are:
  • Traditional DNS.
  • Dynamically update the /etc/hosts file on the various application hosts. The /etc/hosts file on linux (like the C:\Windows\System32\Drivers\etc\hosts file on Windows) contains host-name-to-IP-address mappings that are checked before DNS is consulted, allowing it to override DNS. The file can be updated via pull (initiated by the host) or push (initiated by an external agent).
  • Store the mappings in S3 or SimpleDB. Clients must use the S3 or SimpleDB APIs for name resolution.
  • Use a dynamic DNS provider.
  • Run your own traditional DNS servers for your domain. Clients must be able to see these DNS servers.
  • Run your own dynamic DNS servers for your domain. Clients must be able to see these DNS servers.
  • Elastic IPs. The AWS pricing model discourages (though not strongly enough, I believe) Elastic IPs from being left unused, so you should use them for instances hosting services that are always on, such as your web server or your Facebook application. You should set up a DNS entry pointing the host names to the Elastic IPs, and then any remapping of the Elastic IP to a different instance happens via the EC2 API without requiring any change to DNS.
Here is a table (click on it to see it full size) showing how each of these name resolution methods stack up against each other:



Notes:
  • Dynamically updating /etc/hosts can be used to store either the public IP or the private IP but not both for the same client. You can use one /etc/hosts file for your clients inside the same EC2 region which contains the private IPs, and a different but corresponding /etc/hosts file for your clients outside the EC2 region (or outside EC2 completely) which contains the public IPs. The propagation delay is governed by the frequency with which you update the /etc/hosts file on each client. You can minimize this delay by increasing the frequency of updates. This technique is described in detail in an article by Tim Dysinger.
  • Similarly, the two "run your own DNS" methods (Your Own DNS for your Domain, Your Own Dynamic DNS for your Domain) can be used to resolve to either the public IP address or the private IP address, but not both for the same client. You should set up your clients inside EC2 to utilize the DNS service inside EC2, and the domain should be configured to point to the DNS service running outside EC2 so that clients outside EC2 will see the public IPs. Note that clients running inside EC2 whose DNS resolution you do not control (for example, another EC2 user's client) will be referred to the public IPs. Jeff Roberts offers some great practical suggestions for running your own DNS inside EC2.
This table demonstrates the following:
  • Elastic IPs are the best choice when you need only a limited number of resolvable names and you will use them constantly. If you use their corresponding DNS name then they intelligently resolve to the public IP when looked up from the internet and to the private IP when looked up from within EC2.
  • If you need an unlimited number of resolvable names within EC2 then you should run your own dynamic DNS within EC2.
  • Methods that are incompatible with DNS should only be used with clients you control.
As we can see, Dynamic DNS (especially running your own) has one distinct advantage over using Elastic IPs: unlimited supply at no cost when unused.

When Running Your Own Dynamic DNS is Better than Elastic IPs

One application for running your own Dynamic DNS is a testing environment that includes large clusters of EC2 instances, for example database cluster or application nodes, connected to web layer instance(s). These cluster instances will only be visible to the front-end web tier, so they do not need a publicly resolvable IP address. And your testing environment is not likely to be running all the time. Elastic IPs would work here (presuming you needed only 5 or you could convince AWS to increase your Elastic IP limit to meet your needs), but would cost money when unused. A more economical solution might be to use your own Dynamic DNS within EC2 for these instances. If you have spare capacity on an existing instance then you can put the Dynamic DNS service there - otherwise you will need another instance, making the cost less attractive. In any case you'll need the instance hosting the Dynamic DNS to have an Elastic IP to allow failover without affecting the clients. And you'll need a script to dynamically configure the /etc/resolv.conf on your EC2 clients to point to the private IP address of the Dynamic DNS instance by looking up its Elastic IP's DNS name.

Let's compare the monthly costs of using Elastic IPs with the costs of running your own dynamic DNS for a testing environment such as the above. The cost reflects the following ingredients and assumptions:
  • The number of hours over the month that allocated addresses (DNS entries) are not associated with a live instance, in total for all allocated addresses. If you have DNS entries / addresses and leave them unmapped for 10 hours each then you have 100 unmapped hours.
  • The number of changes to the DNS mappings made that month.
  • The fractional cost of running an instance just to serve the dynamic DNS. If you have spare capacity on an existing instance then this is the instance cost multiplied by the fraction of the capacity that the dynamic DNS service uses. If you need to spin up a dedicated instance for the dynamic DNS service then this is the entire cost of that instance.
  • Pricing for Elastic IPs: free when in use. 1 cent per hour unused. First 100 remaps per month free, 10 cents per remap afterward.
It should be obvious that using dynamic DNS for this testing environment will be economical when

FractionalDNSInstanceCost < NumUnmappedHours * 0.01 + MAX(NumMappingChanges - 100, 0) * 0.1

For simplicity's sake this can be rewritten in clearer terms:

FractionalDNSInstanceCost < NumInstances * ( NumHoursClusterUnused * 0.01 + MAX(NumTimesClusterIsLaunched - 100, 0) * 0.1)

Right about now I'm wishing Excel had better 3-D graphing capabilities. Here's something helpful to visualize this:



The chart shows the monthly cost of running clusters of different sizes according to how many times the cluster is launched. The color "bands" show the areas in which the monthly cost lies, depending on how many hours the cluster remains unused. For a given number of times launched (i.e. for a given vertical line), the "bottom" point of each band is the cost when the cluster is unused zero hours (i.e. always on), and the "top" point is the cost when the cluster is unused for 500 hours (about 20 days).

The dominant factors are, first, the number of instances in the cluster and, second, the number of times the cluster will be launched. A cluster of 100 instances costs $10 each time it is launched beyond the first 100, (plus $1 for each hour unused). For large cluster sizes, the more times you launch, the higher the cost of using Elastic IPs will be and the more attractive the run-your-own dynamic DNS option becomes.

Thursday, September 24, 2009

Cool Things You Can Do with Shared EBS Snapshots

I've been awaiting this feature for a long time: Shared EBS Snapshots. Here's a brief intro to using the feature, and some cool things you can do with shared snapshots. I also offer predictions about things that will appear as this feature gains adoption among developers.

How to Share an EBS Snapshot

Really, it's easy. The first thing you'll need to know is the Account Number of the user with whom you want to share the snapshot. If you want to make the snapshot public then you don't need this. The account number can be found in the Your Account > Account Activity page. It's in small numbers in the top-right of the page (so small you may need to click on the image below to see it in full size):


The person with whom you want to share the snapshot (you are the sharer, they are the "sharee"?) should tell you this 12-digit number. Don't worry, sharee, it's not a secret.

Once you have the sharee's account number you, the sharer, go into the AWS Management Console and choose the Snapshots item. Find the snapshot you want to share and right-click on it, choosing "Snapshot Permissions". You'll get the following dialog:



Fill in the sharee's account number, without the separating dashes, into the dialog, and hit "Save". It should only take a few seconds and... presto! The snapshot should be visible in the sharee's AWS Management Console Snapshots page.

Cool Things You Can Do with Shared Snapshots

Update 27 September 2009: Before you share snapshots publicly, read Eric Hammond's warning about the dangers of doing so.

Easily move data between development, testing, and production

You've been keeping separate AWS accounts for your production environment, your testing environment, and your development environment, right? Right? Well, in case you haven't, you no longer have any excuse not to do so. You can now share your database, your HDFS volumes (if you use Cloudera's Hadoop distribution with EBS support), and anything else of significant size between these separate accounts. No more "tar, gzip, split into < 5GB chunks, upload to S3" and "download from S3, concatenate, untar-gzip". Your data is ready to go with the newly-created volume.

Share entire setups for troubleshooting and support


If you support a product that is deployed in EC2 you no longer need to jump through hoops to get access to your customer's files when there's a problem. Simply have them put the relevant files into an EBS volume, snapshot it, and share the snapshot with you.

Deliver your application in a more granular manner

Until today you delivered your application as an AMI - perhaps even a DevPay AMI - and you may not have given your customers root access. But, if your application used less than 100% of an instance's CPU, the customer was stuck paying for an entire CPU. Now, you can distribute your applications as a shared snapshot instead, and your customers will be free to use the rest of the instance's CPU. You'll just need to build a way to manage access, only allowing authorized customers to see the snapshot.

Deliver you customer's results in a more usable format

If you run a service that provides large amounts of data, you no longer need to use S3 to share the results. Until today you had to store the results in S3, and your customer needed to retrieve the results from S3 in order to use them. No longer: now you can provide a shared snapshot of the results, and the customer can access them via their filesystem more simply. "The shared snapshot is the new bucket."

Mount a volume created from a shared snapshot at startup

In a previous article I explained how to automatically mount an EBS volume created from a snapshot during the instance's startup sequence. I provided a script that gets the snapshot ID via the user-data and does all the rest automatically. Now you can also use snapshots that have been shared.

Update 25 September 2009: Share entire machines

Reader Robert Staveley (Tom) comments below about his use for shared snapshots: Sharing entire machines - boot code and everything - between development, testing, and production accounts. Using the technique to boot an instance from an EBS volume he points out that the entire bootable hard drive and all applications (even beyond 10GB) can be shared between these accounts.

Things to Expect in the Future

Shared snapshots are still a very new feature, but here are some things I expect to happen now that this is possible.
  • The AWS Management Console is the only UI that allows you to share a snapshot. ElasticFox will be adding this capability Real Soon Now, and I am sure others will as well.
  • Alternatives to AMIs. AMIs have many limitations, such as the 10GB maximum size, that can be circumvented using a technique I described to boot from an EBS volume. I expect to see OS distributions packaged as a shared EBS snapshot. These distributions could all share a common AMI containing just enough code to create a volume from the shared distribution snapshot, mount it, and boot from it. No more headaches bundling an AMI - just share a new bootable EBS snapshot.
    Update 3 December 2009: This prediction has come true, with AWS's release of EBS-backed AMIs.
  • Payment gateway services for managing access to shared snapshots. Now that you're distributing software as a shared snapshot you'll need to manage access to the snapshot, limiting it to authorized customers. You might build that system yourself today, but soon we'll see third-party services that do this for you.
Do you have other cool uses or predictions for shared snapshots? Please comment!

Monday, September 7, 2009

Solving Common ELB Problems with a Sanity Test

Help! My ELB isn't serving files!
Whoa! My back-end instances work but not the ELB!
Hey! I can't get the ELB to work!


These are among the most common Elastic Load Balancer problems raised on the Amazon EC2 Discussion Forums. Inspired by Eric Hammond's indispensible article Solving "I can't connect to my server on Amazon EC2", here is a helpful guide to debugging these common ELB issues, as well as a utility to perform sanity tests on your own ELBs.

Questions to Answer

You're trying to figure out what's wrong and you need to know where to start looking. Or, you're posting your problem on the AWS forums and you want help as quickly as possible. The best way to help yourself or to get help quickly is to examine the basic facts of your situation. Here are some questions to answer for yourself and in your forum post:
  1. What is the output of elb-describe-lbs elbName --show-xml ? This gives the basic details of the ELB, which are critical to diagnosing any problem. If you are posting to the forums and want to keep the DNS name of the ELB private then obscure it in the output. One reason to obscure the DNS name is to prevent readers from accessing your ELB-based service. However, this precaution does not add any security because the DNS information is public, and - presumably - you are using a DNS CNAME entry to integrate the ELB into your domain's DNS.
  2. What is the output of elb-describe-instance-health elbName ? This provides crucial information about the health of the instances.
  3. What resource are you trying to access via the ELB and what tool are you using to access it and from what location? The resource will likely be a URL of the form http://ELB-DNS-Name/index.html or maybe https://ELB-DNS-Name/index.html, or it might be "I'm running a POP server on port 1234". The tool you're using to access it is most likely a browser or HTTP client (Firefox, or wget), or possibly "Microsoft Outlook version 5.4". The location is either "my local machine" or "an EC2 instance". Also, can you access the same resource when you connect directly to a back-end instance via its public IP address or host name from a client outside EC2? A public-facing URL pointing directly to a back-end instance looks like this: http://ec2-123-213-123-31.compute-1.amazonaws.com/index.html . And, can you access the same resource when you connect directly to a back-end instance via its private IP address or host name from another instance within EC2? Such a URL looks like this: http://domU-12-31-34-00-69-B9.compute-1.internal/index.html .
  4. Can you access the health check resource directly via the ELB DNS name, and via the back-end instance's public IP address, and via the back-end instance's private IP address? If your health check is configured with target=HTTP:8080/check.html then try to access http://ELB-DNS-Name:8080/check.html (which is via the ELB) and http://ec2-123-213-123-31.compute-1.amazonaws.com:8080/check.html (which is via the instance's public IP address) and http://domU-12-31-34-00-69-B9.compute-1.internal:8080/check.html (which is via the instance's private IP address, and only accessible from within EC2).
  5. What are the security groups and availability zones for each instance in the ELB? This is visible in the output of ec2-describe-instances i-11111111 i-22222222 ... As above, you might want to obscure the public and private DNS names of these instances in the output.
  6. Can all the back-end instances receive traffic on the instance ports of the ELB listeners and the health check? This can be checked from the output of ec2-describe-group groupName1 groupName2 ... for all the groups shown in question 5's ec2-describe-instances command.
  7. Do logs on your back-end instances show any connections from ELB?

Common ELB Problems

Okay, now that you know what information is important to diagnosing the problem, here is a look at some of the common gotchas, how to detect them, and how to fix them. These descriptions refer to the above questions by number.

Common problems and solutions include:
  • Security groups on back-end instances don't allow access to the instance ports and health check port. Back-end instances must have all ports on which they receive traffic from the ELB (#1) open to CIDR 0.0.0.0/0 in one of their associated security groups (#6). Fix this by changing the permissions on the security groups associated with the instances. Note: this fix takes effect within a few seconds and does not require launching new instance or rebooting existing instances.
  • Back-end instances are not healthy (InService). When an instance fails the health check (#1) it is marked as OutOfService (#2) and the ELB does not route traffic to it anymore. To fix this you need to determine why the ELB cannot access the health check resource. Note: there is currently a bug in ELB where instances initially are marked as InService when added to the ELB, until they fail the health check. So you'll want to make sure you've given ELB enough time to detect a failed health check.
  • An availability zone is enabled on the ELB but has no healthy back-end instances. If you have an availability zone enabled for your ELB (#1) but no healthy instances in that availability zone (#5 and #2), you'll get 503 Gateway Timeout or other errors. Fix this by adding an instance in that availability zone to the ELB or disabling that availability zone for the ELB.
  • You cannot see a requested resource (#3) or the health check URL (#4) using the ELB DNS name. In this case, check that the URL exists on the back-end instances and look at the back-end instance's logs (#7) to see if the ELB forwarded your connection or not. If you can see the requested resource using the public address of a back-end instance then check the instance's security groups (#6) to see that they grant access to the instance's port.
  • The health check port is not the same as listener target port (#1). While this does not necessarily indicate a problem, for most ELBs the health check should use the same port as one of the listeners. Setting up your ELB to have a health check performed on a different port than the load-balanced traffic is perfectly valid, but you likely want the health check to use the same path that the load-balanced traffic takes to reach your app (and also to exercise a representative set of features used by your app).
I will update this article with new common issues as they appear.

An ELB Sanity Test Utility

If you have your thinking cap on you'll notice that detecting the first three of the common ELB problems can be automated. Here is an ELB sanity test utility for linux which automates these tests. Save it or download it as follows:

curl -o elb-sanity-test.tar.gz -L https://sites.google.com/site/shlomosfiles/clouddevelopertips/elb-sanity-test.tar.gz?attredirects=0

Next, unpack it:

tar xzf elb-sanity-test.tar.gz
cd elb-sanity-test


Next, set up the utility with your credentials. Edit the elb-sanity-test script file, setting AWS_CREDENTIAL_FILE to point to a file containing your AWS credentials in the following format:

AWSAccessKeyId=your-Access-Key-Id
AWSSecretKey=your-Secret-Key


The above is the same format that can be used to specify your AWS credentials for the ELB API Tools (see the README.TXT and credential-file-path.template file in the ELB API Tools bundle).

To run the ELB sanity test:

cd elb-sanity-test
./elb-sanity-test


Here is sample output showing an ELB that passes the sanity test:
$ ./elb-sanity-test
JUnit version 4.5
Test: all instances have their Security Groups defined to allow access to the ELB listener port
Load Balancer: someLB
ELB someLB has a listener that uses instance-port 8080 and instance i-360ef05e has that TCP port open to the world.
ELB someLB has a listener that uses instance-port 8081 and instance i-360ef05e has that TCP port open to the world.
Test: all ELBs have a HealthCheck on a port that the listener directs traffic to
Load Balancer: someLB
ELB someLB has a configured HealthCheck on listener port 8080
Test: all ELBs have InService instances in each configured availability zone
Load Balancer: somLB
ELB someLB has InService instances in each configured availability zone
Time: 5.22
Tests run: 3, Failures: 0
The elb-sanity-test utility performs the following sanity tests on every ELB defined in your account:
  • All instances have their security groups defined to allow access to the ELB listener port.
  • All ELBs have a health check on a port that the listener directs traffic to.
  • All ELBs have healthy instances in each configured availability zone.
If a sanity test fails the utility shows a very verbose error message explaining what is wrong.

Some notes about the elb-sanity-test bundle:
  • The utility is written in Java, which is also required for the ELB tools. If you can run the ELB API Tools, you already have all the prerequisites to run this sanity test.
  • The bundle includes source code and is licensed under the Apache License, Version 2.0.
  • The bundle includes all dependency jars necessary to run the script. It uses the JUnit framework and the Typica library.
I would be happy to re-bundle the utility to include a .bat or .cmd file to make it easy to run the script on Windows. If you write one, please add it it in the comments and I'll include it.

Getting Further Help

If you still have an ELB issue after trying the above advice and the elb-sanity-test utility, please post in the AWS EC2 forum. Questions about the elb-sanity-test utility specifically or about this article are welcome in the comments below.

Update 15 September 2009: Ylastic integrated my elb-sanity-test script into their EC2 management dashboard.

Update 11 October 2009: elb-sanity-test has been released as part of the open-source ec2-elb-tests project hosted on Google Code. And, if you use this utility, please subscribe to the ec2-elb-tests Google Group.