Monday, September 7, 2009

Solving Common ELB Problems with a Sanity Test

Help! My ELB isn't serving files!
Whoa! My back-end instances work but not the ELB!
Hey! I can't get the ELB to work!


These are among the most common Elastic Load Balancer problems raised on the Amazon EC2 Discussion Forums. Inspired by Eric Hammond's indispensible article Solving "I can't connect to my server on Amazon EC2", here is a helpful guide to debugging these common ELB issues, as well as a utility to perform sanity tests on your own ELBs.

Questions to Answer

You're trying to figure out what's wrong and you need to know where to start looking. Or, you're posting your problem on the AWS forums and you want help as quickly as possible. The best way to help yourself or to get help quickly is to examine the basic facts of your situation. Here are some questions to answer for yourself and in your forum post:
  1. What is the output of elb-describe-lbs elbName --show-xml ? This gives the basic details of the ELB, which are critical to diagnosing any problem. If you are posting to the forums and want to keep the DNS name of the ELB private then obscure it in the output. One reason to obscure the DNS name is to prevent readers from accessing your ELB-based service. However, this precaution does not add any security because the DNS information is public, and - presumably - you are using a DNS CNAME entry to integrate the ELB into your domain's DNS.
  2. What is the output of elb-describe-instance-health elbName ? This provides crucial information about the health of the instances.
  3. What resource are you trying to access via the ELB and what tool are you using to access it and from what location? The resource will likely be a URL of the form http://ELB-DNS-Name/index.html or maybe https://ELB-DNS-Name/index.html, or it might be "I'm running a POP server on port 1234". The tool you're using to access it is most likely a browser or HTTP client (Firefox, or wget), or possibly "Microsoft Outlook version 5.4". The location is either "my local machine" or "an EC2 instance". Also, can you access the same resource when you connect directly to a back-end instance via its public IP address or host name from a client outside EC2? A public-facing URL pointing directly to a back-end instance looks like this: http://ec2-123-213-123-31.compute-1.amazonaws.com/index.html . And, can you access the same resource when you connect directly to a back-end instance via its private IP address or host name from another instance within EC2? Such a URL looks like this: http://domU-12-31-34-00-69-B9.compute-1.internal/index.html .
  4. Can you access the health check resource directly via the ELB DNS name, and via the back-end instance's public IP address, and via the back-end instance's private IP address? If your health check is configured with target=HTTP:8080/check.html then try to access http://ELB-DNS-Name:8080/check.html (which is via the ELB) and http://ec2-123-213-123-31.compute-1.amazonaws.com:8080/check.html (which is via the instance's public IP address) and http://domU-12-31-34-00-69-B9.compute-1.internal:8080/check.html (which is via the instance's private IP address, and only accessible from within EC2).
  5. What are the security groups and availability zones for each instance in the ELB? This is visible in the output of ec2-describe-instances i-11111111 i-22222222 ... As above, you might want to obscure the public and private DNS names of these instances in the output.
  6. Can all the back-end instances receive traffic on the instance ports of the ELB listeners and the health check? This can be checked from the output of ec2-describe-group groupName1 groupName2 ... for all the groups shown in question 5's ec2-describe-instances command.
  7. Do logs on your back-end instances show any connections from ELB?

Common ELB Problems

Okay, now that you know what information is important to diagnosing the problem, here is a look at some of the common gotchas, how to detect them, and how to fix them. These descriptions refer to the above questions by number.

Common problems and solutions include:
  • Security groups on back-end instances don't allow access to the instance ports and health check port. Back-end instances must have all ports on which they receive traffic from the ELB (#1) open to CIDR 0.0.0.0/0 in one of their associated security groups (#6). Fix this by changing the permissions on the security groups associated with the instances. Note: this fix takes effect within a few seconds and does not require launching new instance or rebooting existing instances.
  • Back-end instances are not healthy (InService). When an instance fails the health check (#1) it is marked as OutOfService (#2) and the ELB does not route traffic to it anymore. To fix this you need to determine why the ELB cannot access the health check resource. Note: there is currently a bug in ELB where instances initially are marked as InService when added to the ELB, until they fail the health check. So you'll want to make sure you've given ELB enough time to detect a failed health check.
  • An availability zone is enabled on the ELB but has no healthy back-end instances. If you have an availability zone enabled for your ELB (#1) but no healthy instances in that availability zone (#5 and #2), you'll get 503 Gateway Timeout or other errors. Fix this by adding an instance in that availability zone to the ELB or disabling that availability zone for the ELB.
  • You cannot see a requested resource (#3) or the health check URL (#4) using the ELB DNS name. In this case, check that the URL exists on the back-end instances and look at the back-end instance's logs (#7) to see if the ELB forwarded your connection or not. If you can see the requested resource using the public address of a back-end instance then check the instance's security groups (#6) to see that they grant access to the instance's port.
  • The health check port is not the same as listener target port (#1). While this does not necessarily indicate a problem, for most ELBs the health check should use the same port as one of the listeners. Setting up your ELB to have a health check performed on a different port than the load-balanced traffic is perfectly valid, but you likely want the health check to use the same path that the load-balanced traffic takes to reach your app (and also to exercise a representative set of features used by your app).
I will update this article with new common issues as they appear.

An ELB Sanity Test Utility

If you have your thinking cap on you'll notice that detecting the first three of the common ELB problems can be automated. Here is an ELB sanity test utility for linux which automates these tests. Save it or download it as follows:

curl -o elb-sanity-test.tar.gz -L https://sites.google.com/site/shlomosfiles/clouddevelopertips/elb-sanity-test.tar.gz?attredirects=0

Next, unpack it:

tar xzf elb-sanity-test.tar.gz
cd elb-sanity-test


Next, set up the utility with your credentials. Edit the elb-sanity-test script file, setting AWS_CREDENTIAL_FILE to point to a file containing your AWS credentials in the following format:

AWSAccessKeyId=your-Access-Key-Id
AWSSecretKey=your-Secret-Key


The above is the same format that can be used to specify your AWS credentials for the ELB API Tools (see the README.TXT and credential-file-path.template file in the ELB API Tools bundle).

To run the ELB sanity test:

cd elb-sanity-test
./elb-sanity-test


Here is sample output showing an ELB that passes the sanity test:
$ ./elb-sanity-test
JUnit version 4.5
Test: all instances have their Security Groups defined to allow access to the ELB listener port
Load Balancer: someLB
ELB someLB has a listener that uses instance-port 8080 and instance i-360ef05e has that TCP port open to the world.
ELB someLB has a listener that uses instance-port 8081 and instance i-360ef05e has that TCP port open to the world.
Test: all ELBs have a HealthCheck on a port that the listener directs traffic to
Load Balancer: someLB
ELB someLB has a configured HealthCheck on listener port 8080
Test: all ELBs have InService instances in each configured availability zone
Load Balancer: somLB
ELB someLB has InService instances in each configured availability zone
Time: 5.22
Tests run: 3, Failures: 0
The elb-sanity-test utility performs the following sanity tests on every ELB defined in your account:
  • All instances have their security groups defined to allow access to the ELB listener port.
  • All ELBs have a health check on a port that the listener directs traffic to.
  • All ELBs have healthy instances in each configured availability zone.
If a sanity test fails the utility shows a very verbose error message explaining what is wrong.

Some notes about the elb-sanity-test bundle:
  • The utility is written in Java, which is also required for the ELB tools. If you can run the ELB API Tools, you already have all the prerequisites to run this sanity test.
  • The bundle includes source code and is licensed under the Apache License, Version 2.0.
  • The bundle includes all dependency jars necessary to run the script. It uses the JUnit framework and the Typica library.
I would be happy to re-bundle the utility to include a .bat or .cmd file to make it easy to run the script on Windows. If you write one, please add it it in the comments and I'll include it.

Getting Further Help

If you still have an ELB issue after trying the above advice and the elb-sanity-test utility, please post in the AWS EC2 forum. Questions about the elb-sanity-test utility specifically or about this article are welcome in the comments below.

Update 15 September 2009: Ylastic integrated my elb-sanity-test script into their EC2 management dashboard.

Update 11 October 2009: elb-sanity-test has been released as part of the open-source ec2-elb-tests project hosted on Google Code. And, if you use this utility, please subscribe to the ec2-elb-tests Google Group.

4 comments:

  1. Very useful post! I just saw that Ylastic integrated your sanity check approach into their dashboard. Nice!

    I wanted to share a problem I had recently. Because of some current ELB issues, it is possible to cause an ELB to get into an inconsistent state by terminating one of its instances without first deregistering it from the ELB. See the thread Mysterious "server=0" content returned by ELB on the AWS forums.

    Also, if you are trying to read client IP addresses on a server behind an ELB, see my blog post: Amazon ELB - Capturing Client IP Address

    ReplyDelete
  2. Excellent post! Very useful tips on what to look for while debugging ELBs. We just integrated this approach into our management UI - http://vimeo.com/6592989

    thanks!

    ReplyDelete
  3. Is it possible to edit a load balancer to forward additional ports? For example we launched our aws ELB forwarding port 80, and now we want to add https support and forward 443 as well. Do I need to create a new ELB or can I edit my existing one?

    ReplyDelete
  4. @Logan Green,

    There's no API to modify the listeners of an ELB. So you'll need to create a new ELB with both HTTP:80 and TCP:443 listeners specified at creation time.

    ReplyDelete