Friday, August 14, 2009

Elastic Load Balancer: An Elasticity Gotcha

If you use AWS's Elastic Load Balancer to allow your EC2 application to scale, like I do, then you'll want to know about this gotcha recently reported in the AWS forums. By all appearances, it looks like something that should be fixed by Amazon. Until it is, you can reduce (but not eliminate) your exposure to this problem by keeping a small TTL for your ELB's DNS CNAME entry. Read on for details.

The Gotcha

As your ELB-balanced application experiences an increasing load, some of the traffic received by your back-end instances may be traffic that does not belong to your application. And, after your application experiences a sustained heavy load and then traffic subsides, some of your application's traffic may be lost or misdirected to other EC2 instances that are not yours.

Why it Happens

In my article about how ELB works, I describe how ELB resolves its DNS name to a pool of IP addresses, and that this pool increases and decreases in size according to the load placed on the service. Each of the IP addresses in the pool is a "virtual appliance", acting as a load balancer to distribute the connections among your back-end instances. This gives ELB two levels of elasticity: the pool of virtual appliance IP addresses, and your pool of back-end instances.

Before they are assigned to a specific ELB, the virtual appliance IP addresses are available for use by any ELB, waiting in a global pool. When an ELB needs to increase its pool of virtual appliances due to load, it gets a new IP address from the global pool and begins resolving the ELB DNS name to that IP address in addition to the ones it already uses. And when an ELB notices decreasing load, it releases one of its virtual appliance IP addresses back to the global pool, and no longer returns that IP address when resolving the ELB DNS name. According to testing performed by AWS forum user wizardofcrowds, ELB scales up under sustained load by increasing its pool of IP addresses at the rate of one additional address every 5 minutes. And, ELB scales down by relinquishing an IP address at the rate of one every 2 hours. Thus it is possible that a single ELB virtual appliance IP address can be in service to a number of different ELBs over the course of a few hours.

The problem is that DNS resolution is cached at many layers across the internet. When the ELB scales up and gets a new virtual appliance IP address from the global pool, some client somewhere might still be using that IP address as the resolution of a different ELB's DNS name. This other ELB might not even belong to you. A few hours ago, another ELB with a different DNS name returned that IP address from a DNS lookup. Now, that IP address is serving your ELB. But some client somewhere may still be using that IP address to attempt to reach an application that is not yours.

The flip side occurs when the ELB scales down and releases a virtual appliance IP address back to the global pool. Some client somewhere might continue resolving your ELB's DNS name to the now-relinquished IP address. When the address is returned to the pool, that client's attempts to connect to your service will fail. If that same virtual appliance IP is then put into service for another ELB, then the client working with the cached but no-longer-current DNS resolution for your ELB DNS name will be directed to the other ELB virtual appliance, and then onward to back-end instances that are not yours.

So your application served by ELB may receive traffic destined for other ELBs during increasing load, and may experience lost traffic during decreasing load.

What is the Solution?

Fundamentally, this issue is caused by badly-configured DNS implementations. Some DNS servers (including those of some major ISPs) ignore the TTL ("time to live") setting of the original DNS record, and thus end up resolving DNS names to an expired IP address. Some DNS clients (browsers such as IE7, and Java programs by default) also ignore DNS TTLs, causing the same problem. Short of fixing all the misconfigured DNS servers and patching all the IE and Java VMs, however, the issue cannot be solved. So a workaround is the best we can hope for.

You, the EC2 user, can minimize the risk that a well-behaved client will experience this issue. Set up your DNS CNAME entries for the ELB to have a small TTL - 120 seconds is good. This will help for clients whose DNS honors the TTL, but not for clients that ignore TTLs or for clients using DNS servers that ignore TTLs.

Amazon can work around the problem on their end. When an ELB needs to scale up and use a new virtual appliance IP address, that address could remain "reserved" for its use for a longer time. When the ELB scales down and releases the virtual appliance IP address, that address would not be reused by another ELB until the reservation period has expired. This would prevent "recent" ELB virtual appliance IP address from being reused by other ELBs, and reduce the risk of misdirecting traffic.

It should be noted that DNS caching and TTLs influence all load balancing solutions that rely on DNS (such as round-robin DNS), so this issue is not unique to ELB. Caching DNS entries is a good thing for the internet in general, but not all implementations honor the TTL of the cached DNS records. Services relying on DNS for scalability need to be designed with this in mind.


  1. Using DNS as a load balancer is a bad idea. Thats not what it was designed for. Not only do you have to deal with stale caches out there, DNS is also inherently unaware of whether an instance it's routing traffic to still has the server software up & running. Not does it know anything about the load metrics, blindly splitting the traffic equally amongst all servers.

  2. We've been experiencing this very issue recently and I'm glad to find an article that corroborates my theory. Thanks for posting, Shlomo.