Server farms achieve high scalability and high availability through server load balancing, a technique that makes the server farm appear to clients as a single server. In this two-part article, Gregor Roth explores server load balancing architectures, with a focus on open source solutions. Part 1 covers server load balancing basics and discusses the pros and cons of transport-level server load balancing. Part 2 covers application-level server load balancing architectures, which address some of the limitations of the architectures discussed in Part 1.
The barrier to entry for many Internet companies is low. Anyone with a good idea can develop a small application, purchase a domain name, and set up a few PC-based servers to handle incoming traffic. The initial investment is small, so the start-up risk is minimal. But a successful low-cost infrastructure can become a serious problem quickly. A single server that handles all the incoming requests may not have the capacity to handle high traffic volumes once the business becomes popular. In such a situations companies often start to scale up: they upgrade the existing infrastructure by buying a larger box with more processors or add more memory to run the applications.
Scaling up, though, is only a short-term solution. And it's a limited approach because the cost of upgrading is disproportionately high relative to the gains in server capability. For these reasons most successful Internet companies follow a scale out approach. Application components are processed as multiple instances on server farms, which are based on low-cost hardware and operating systems. As traffic increases, servers are added.
The server-farm approach has its own unique demands. On the software side, you must design applications so that they can run as multiple instances on different servers. You do this by splitting the application into smaller components that can be deployed independently. This is trivial if the application components are stateless. Because the components don't retain any transactional state, any of them can handle the same requests equally. If more processing power is required, you just add more servers and install the application components.
A more challenging problem arises when the application components are stateful. For instance, if the application component holds shopping-cart data, an incoming request must be routed to an application component instance that holds that requester's shopping-cart data. Later in this article, I'll discuss how to handle such application-session data in a distributed environment. However, to reduce complexity, most successful Internet-based application systems try to avoid stateful application components whenever possible.
On the infrastructure side, the processing load must be distributed among the group of servers. This is known as server load balancing. Load balancing technologies also pertain to other domains, for instance spreading work among components such as network links, CPUs, or hard drives. This article focuses on server load balancing.
Availability and scalability
Server load balancing distributes service requests across a group of real servers and makes those servers look like a single big server to the clients. Often dozens of real servers are behind a URL that implements a single virtual service.
How does this work? In a widely used server load balancing architecture, the incoming request is directed to a dedicated server load balancer that is transparent to the client. Based on parameters such as availability or current server load, the load balancer decides which server should handle the request and forwards it to the selected server. To provide the load balancing algorithm with the required input data, the load balancer also retrieves information about the servers' health and load to verify that they can respond to traffic. Figure 1 illustrates this classic load balancer architecture.
The load-dispatcher architecture illustrated in Figure 1 is just one of several approaches. To decide which load balancing solution is the best for your infrastructure, you need to consider availability and scalability.
Availability is defined by uptime -- the time between failures. (Downtime is the time to detect the failure, repair it, perform required recovery, and restart tasks.) During uptime the system must respond to each request within a predetermined, well-defined time. If this time is exceeded, the client sees this as a server malfunction. High availability, basically, is redundancy in the system: if one server fails, the others take over the failed server's load transparently. The failure of an individual server is invisible to the client.
Scalability means that the system can serve a single client, as well as thousands of simultaneous clients, by meeting quality-of-service requirements such as response time. Under an increased load, a high scalable system can increase the throughput almost linearly in proportion to the power of added hardware resources.
In the scenario in Figure 1, high scalability is reached by distributing the incoming request over the servers. If the load increases, additional servers can be added, as long as the load balancer does not become the bottleneck. To reach high availability, the load balancer must monitor the servers to avoid forwarding requests to overloaded or dead servers. Furthermore, the load balancer itself must be redundant too. I'll discuss this point later in this article.
Server load balancing techniques
In general, server load balancing solutions are of two main types:
- Transport-level load balancing -- such as the DNS-based approach or TCP/IP-level load balancing -- acts independently of the application payload.
- Application-level load balancing uses the application payload to make load balancing decisions.
Load balancing solutions can be further classified into software-based load balancers and hardware-based load balancers. Hardware-based load balancers are specialized hardware boxes that include application-specific integrated circuits (ASICs) customized for a particular use. ASICs enable high-speed forwarding of network traffic without the overhead of a general-purpose operating system. Hardware-based load balancers are often used for transport-level load balancing. In general, hardware-based load balancers are faster than software-based solutions. Their drawback is their cost.
In contrast to hardware load balancers, software-based load balancers run on standard operating systems and standard hardware components such as PCs. Software-based solutions runs either within a dedicated load balancer hardware node as in Figure 1, or directly in the application.
DNS-based load balancing
DNS-based load balancing represents one of the early server load balancing approaches. The Internet's domain name system (DNS) associates IP addresses with a host name. If you type a host name (as part of the URL) into your browser, the browser requests that the DNS server resolve the host name to an IP address.
The DNS-based approach is based on the fact that DNS allows multiple IP addresses (real servers) to be assigned to one host name, as shown in the DNS lookup example in Listing 1.
Listing 1. Example DNS lookup
>nslookup amazon.com Server: ns.box Address: 192.168.1.1 Name: amazon.com Addresses: 220.127.116.11, 18.104.22.168, 22.214.171.124
If the DNS server implements a round-robin approach, the order of the IP addresses for a given host changes after each DNS response. Usually clients such as browsers try to connect to the first address returned from a DNS query. The result is that responses to multiple clients are distributed among the servers. In contrast to the server load balancing architecture in Figure 1, no intermediate load balancer hardware node is required.
DNS is an efficient solution for global server load balancing, where load must be distributed between data centers at different locations. Often the DNS-based global server load balancing is combined with other server load balancing solutions to distribute the load within a dedicated data center.
Although easy to implement, the DNS approach has serious drawbacks. To reduce DNS queries, client tend to cache the DNS queries. If a server becomes unavailable, the client cache as well as the DNS server will continue to contain a dead server address. For this reason, the DNS approach does little to implement high availability.
TCP/IP server load balancing
TCP/IP server load balancers operate on low-level layer switching. A popular software-based low-level server load balancer is the Linux Virtual Server (LVS). The real servers appear to the outside world as a single "virtual" server. The incoming requests on a TCP connection are forwarded to the real servers by the load balancer, which runs a Linux kernel patched to include IP Virtual Server (IPVS) code.
To ensure high availability, in most cases a pair of load balancer nodes are set up, with one load balancer node in passive mode. If a load balancer fails, the heartbeat program that runs on both load balancers activates the passive load balancer node and initiates the takeover of the Virtual IP address (VIP). While the heartbeat is responsible for managing the failover between the load balancers, simple send/expect scripts are used to monitor the health of the real servers.
Transparency to the client is achieved by using a VIP that is assigned to the load balancer. If the client issues a request, first the requested host name is translated into the VIP. When it receives the request packet, the load balancer decides which real server should handle the request packet. The target IP address of the request packet is rewritten into the Real IP (RIP) of the real server. LVS supports several scheduling algorithms for distributing requests to the real servers. It is often is set up to use round-robin scheduling, similar to DNS-based load balancing. With LVS, the load balancing decision is made on the TCP level (Layer 4 of the OSI Reference Model).
After receiving the request packet, the real server handles it and returns the response packet. To force the response packet to be returned through the load balancer, the real server uses the VIP as its default response route. If the load balancer receives the response packet, the source IP of the response packet is rewritten with the VIP (OSI Model Layer 3). This LVS routing mode is called Network Address Translation (NAT) routing. Figure 2 shows an LVS implementation that uses NAT routing.
LVS also supports other routing modes such as Direct Server Return. In this case the response packet is sent directly to the client by the real server. To do this, the VIP must be assigned to all real servers, too. It is important to make the server's VIP unresolvable to the network; otherwise, the load balancer becomes unreachable. If the load balancer receives a request packet, the MAC address (OSI Model Layer 2) of the request is rewritten instead of the IP address. The real server receives the request packet and processes it. Based on the source IP address, the response packet is sent to the client directly, bypassing the load balancer. For Web traffic this approach can reduce the balancer workload dramatically. Typically, many more response packets are transferred than request packets. For instance, if you request a Web page, often only one IP packet is sent. If a larger Web page is requested, several response IP packets are required to transfer the requested page.
Low-level server load balancer solutions such as LVS reach their limit if application-level caching or application-session support is required. Caching is an important scalability principle for avoiding expensive operations that fetch the same data repeatedly. A cache is a temporary store that holds redundant data resulting from a previous data-fetch operation. The value of a cache depends on the cost to retrieve the data versus the hit rate and required cache size.
Based on the load balancer scheduling algorithm, the requests of a user session are handled by different servers. If a cache is used on the server side, straying requests will become a problem. One approach to handle this is to place the cache in a global space. memcached is a popular distributed cache solution that provides a large cache across multiple machines. It is a partitioned, distributed cache that uses consistent hashing to determine the cache server (daemon) for a given cache entry. Based on the cache key's hash code, the client library always maps the same hash code to the same cache server address. This address is then used to store the cache entry. Figure 3 illustrates this caching approach.
Listing 2 uses
memcached client written in Java, to cache
HttpResponse messages across multiple machines. The
spymemcached library implements the required client logic I just described.