Server load balancing architectures, Part 1: Transport-level load balancing

High scalability and availability for server farms

Server farms achieve high scalability and high availability through server load balancing, a technique that makes the server farm appear to clients as a single server. In this two-part article, Gregor Roth explores server load balancing architectures, with a focus on open source solutions. Part 1 covers server load balancing basics and discusses the pros and cons of transport-level server load balancing. Part 2 covers application-level server load balancing architectures, which address some of the limitations of the architectures discussed in Part 1.

The barrier to entry for many Internet companies is low. Anyone with a good idea can develop a small application, purchase a domain name, and set up a few PC-based servers to handle incoming traffic. The initial investment is small, so the start-up risk is minimal. But a successful low-cost infrastructure can become a serious problem quickly. A single server that handles all the incoming requests may not have the capacity to handle high traffic volumes once the business becomes popular. In such situations companies often start to scale up: they upgrade the existing infrastructure by buying a larger box with more processors or adding more memory to run the applications.

Scaling up, though, is only a short-term solution. And it's a limited approach because the cost of upgrading is disproportionately high relative to the gains in server capability. For these reasons most successful Internet companies follow a scale out approach. Application components are processed as multiple instances on server farms, which are based on low-cost hardware and operating systems. As traffic increases, servers are added.

The server-farm approach has its own unique demands. On the software side, you must design applications so that they can run as multiple instances on different servers. You do this by splitting the application into smaller components that can be deployed independently. This is trivial if the application components are stateless. Because the components don't retain any transactional state, any of them can handle the same requests equally. If more processing power is required, you just add more servers and install the application components.

A more challenging problem arises when the application components are stateful. For instance, if the application component holds shopping-cart data, an incoming request must be routed to an application component instance that holds that requester's shopping-cart data. Later in this article, I'll discuss how to handle such application-session data in a distributed environment. However, to reduce complexity, most successful Internet-based application systems try to avoid stateful application components whenever possible.

On the infrastructure side, the processing load must be distributed among the group of servers. This is known as server load balancing. Load balancing technologies also pertain to other domains, for instance spreading work among components such as network links, CPUs, or hard drives. This article focuses on server load balancing.

Availability and scalability

Server load balancing distributes service requests across a group of real servers and makes those servers look like a single big server to the clients. Often dozens of real servers are behind a URL that implements a single virtual service.

How does this work? In a widely used server load balancing architecture, the incoming request is directed to a dedicated server load balancer that is transparent to the client. Based on parameters such as availability or current server load, the load balancer decides which server should handle the request and forwards it to the selected server. To provide the load balancing algorithm with the required input data, the load balancer also retrieves information about the servers' health and load to verify that they can respond to traffic. Figure 1 illustrates this classic load balancer architecture.

Figure 1. Classic load balancer architecture (load dispatcher)

The load-dispatcher architecture illustrated in Figure 1 is just one of several approaches. To decide which load balancing solution is the best for your infrastructure, you need to consider availability and scalability.

Availability is defined by uptime -- the time between failures. (Downtime is the time to detect the failure, repair it, perform required recovery, and restart tasks.) During uptime the system must respond to each request within a predetermined, well-defined time. If this time is exceeded, the client sees this as a server malfunction. High availability, basically, is redundancy in the system: if one server fails, the others take over the failed server's load transparently. The failure of an individual server is invisible to the client.

Scalability means that the system can serve a single client, as well as thousands of simultaneous clients, by meeting quality-of-service requirements such as response time. Under an increased load, a highly scalable system can increase the throughput almost linearly in proportion to the power of added hardware resources.

In the scenario in Figure 1, high scalability is reached by distributing the incoming requests across the servers. If the load increases, additional servers can be added, as long as the load balancer does not become the bottleneck. To reach high availability, the load balancer must monitor the servers to avoid forwarding requests to overloaded or dead servers. Furthermore, the load balancer itself must be redundant. I'll discuss this point later in this article.

Server load balancing techniques

In general, server load balancing solutions are of two main types:

  • Transport-level load balancing -- such as the DNS-based approach or TCP/IP-level load balancing -- acts independently of the application payload.
  • Application-level load balancing uses the application payload to make load balancing decisions.

Load balancing solutions can be further classified into software-based load balancers and hardware-based load balancers. Hardware-based load balancers are specialized hardware boxes that include application-specific integrated circuits (ASICs) customized for a particular use. ASICs enable high-speed forwarding of network traffic without the overhead of a general-purpose operating system. Hardware-based load balancers are often used for transport-level load balancing. In general, hardware-based load balancers are faster than software-based solutions. Their drawback is their cost.

In contrast to hardware load balancers, software-based load balancers run on standard operating systems and standard hardware components such as PCs. Software-based solutions run either within a dedicated load balancer hardware node as in Figure 1, or directly in the application.

DNS-based load balancing

DNS-based load balancing represents one of the early server load balancing approaches. The Internet's domain name system (DNS) associates IP addresses with a host name. If you type a host name (as part of the URL) into your browser, the browser requests that the DNS server resolve the host name to an IP address.

The DNS-based approach is based on the fact that DNS allows multiple IP addresses (real servers) to be assigned to one host name, as shown in the DNS lookup example in Listing 1.

Listing 1. Example DNS lookup

   $ nslookup www.myserver.com
   Name:      www.myserver.com
   Addresses: 203.0.113.10, 203.0.113.11, 203.0.113.12

(The host name and addresses above are illustrative; the point is that a single name resolves to several real server addresses.)
If the DNS server implements a round-robin approach, the order of the IP addresses for a given host changes after each DNS response. Usually clients such as browsers try to connect to the first address returned from a DNS query. The result is that responses to multiple clients are distributed among the servers. In contrast to the server load balancing architecture in Figure 1, no intermediate load balancer hardware node is required.
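The rotation behavior just described can be sketched in a few lines of Java. This is a simplified model of what a round-robin DNS server does per query, not real DNS code, and the addresses are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified model of a round-robin DNS server (hypothetical addresses):
// each lookup returns the address list rotated by one position, so clients
// that connect to the first returned address are spread across the servers.
class RoundRobinDns {

   private final Deque<String> addresses = new ArrayDeque<String>();

   RoundRobinDns(List<String> ips) {
      addresses.addAll(ips);
   }

   synchronized List<String> lookup() {
      List<String> result = new ArrayList<String>(addresses);
      addresses.addLast(addresses.removeFirst());  // rotate for the next query
      return result;
   }
}
```

Three successive lookups against a three-address pool return each address in the first position once, which is exactly how successive clients end up on different servers.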

DNS is an efficient solution for global server load balancing, where load must be distributed between data centers at different locations. Often DNS-based global server load balancing is combined with other server load balancing solutions to distribute the load within a dedicated data center.

Although easy to implement, the DNS approach has serious drawbacks. To reduce the number of DNS queries, clients tend to cache DNS responses. If a server becomes unavailable, the client cache, as well as the DNS server itself, will continue to hand out the dead server's address. For this reason, the DNS approach does little to provide high availability.

TCP/IP server load balancing

TCP/IP server load balancers operate on low-level layer switching. A popular software-based low-level server load balancer is the Linux Virtual Server (LVS). The real servers appear to the outside world as a single "virtual" server. The incoming requests on a TCP connection are forwarded to the real servers by the load balancer, which runs a Linux kernel patched to include IP Virtual Server (IPVS) code.

To ensure high availability, in most cases a pair of load balancer nodes are set up, with one load balancer node in passive mode. If a load balancer fails, the heartbeat program that runs on both load balancers activates the passive load balancer node and initiates the takeover of the Virtual IP address (VIP). While the heartbeat is responsible for managing the failover between the load balancers, simple send/expect scripts are used to monitor the health of the real servers.

Transparency to the client is achieved by using a VIP that is assigned to the load balancer. If the client issues a request, first the requested host name is translated into the VIP. When it receives the request packet, the load balancer decides which real server should handle the request packet. The target IP address of the request packet is rewritten into the Real IP (RIP) of the real server. LVS supports several scheduling algorithms for distributing requests to the real servers. It is often set up to use round-robin scheduling, similar to DNS-based load balancing. With LVS, the load balancing decision is made on the TCP level (Layer 4 of the OSI Reference Model).
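Besides round-robin, LVS also ships schedulers such as least-connections, which directs each new connection to the real server with the fewest active connections. The idea can be sketched in Java; this is an illustration of the scheduling principle only (LVS implements it inside the kernel), and the server IPs are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of least-connections scheduling: a new connection goes to the
// real server with the fewest active connections. (LVS does this inside
// the kernel; the RIPs below are hypothetical.)
class LeastConnectionsScheduler {

   private final Map<String, Integer> activeConnections = new LinkedHashMap<String, Integer>();

   LeastConnectionsScheduler(List<String> realServerIps) {
      for (String rip : realServerIps) {
         activeConnections.put(rip, 0);
      }
   }

   synchronized String onNewConnection() {
      String target = activeConnections.entrySet().stream()
                                       .min(Map.Entry.comparingByValue())
                                       .get()
                                       .getKey();
      activeConnections.merge(target, 1, Integer::sum);   // track the new connection
      return target;
   }

   synchronized void onConnectionClosed(String rip) {
      activeConnections.merge(rip, -1, Integer::sum);
   }
}
```

Unlike plain round-robin, this scheduler automatically favors servers that finish their work quickly, because they shed connections faster.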

After receiving the request packet, the real server handles it and returns the response packet. To force the response packet to be returned through the load balancer, the real server uses the VIP as its default response route. If the load balancer receives the response packet, the source IP of the response packet is rewritten with the VIP (OSI Model Layer 3). This LVS routing mode is called Network Address Translation (NAT) routing. Figure 2 shows an LVS implementation that uses NAT routing.

Figure 2. LVS implemented with NAT routing

LVS also supports other routing modes such as Direct Server Return. In this case the response packet is sent directly to the client by the real server. To do this, the VIP must be assigned to all real servers, too. It is important to make the server's VIP unresolvable to the network; otherwise, the load balancer becomes unreachable. If the load balancer receives a request packet, the MAC address (OSI Model Layer 2) of the request is rewritten instead of the IP address. The real server receives the request packet and processes it. Based on the source IP address, the response packet is sent to the client directly, bypassing the load balancer. For Web traffic this approach can reduce the balancer workload dramatically, because typically many more response packets are transferred than request packets: a request for a Web page often fits in a single IP packet, whereas several response IP packets are required to transfer the requested page.


Low-level server load balancer solutions such as LVS reach their limit if application-level caching or application-session support is required. Caching is an important scalability principle for avoiding expensive operations that fetch the same data repeatedly. A cache is a temporary store that holds redundant data resulting from a previous data-fetch operation. The value of a cache depends on the cost to retrieve the data versus the hit rate and required cache size.

Depending on the load balancer's scheduling algorithm, the requests of a user session are handled by different servers. If a cache is used on the server side, these straying requests become a problem: the same data ends up cached on many servers, and the hit rate drops. One approach to handle this is to place the cache in a global space. memcached is a popular distributed cache solution that provides a large cache across multiple machines. It is a partitioned, distributed cache that uses consistent hashing to determine the cache server (daemon) for a given cache entry. Based on the cache key's hash code, the client library always maps the same hash code to the same cache server address. This address is then used to store the cache entry. Figure 3 illustrates this caching approach.

Figure 3. Load balancer architecture enhanced by a partitioned, distributed cache
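The mapping from cache key to cache server can be sketched with a simple hash ring. This illustrates the consistent-hashing principle only, not spymemcached's actual implementation, and the server addresses are hypothetical:

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of consistent hashing for picking a cache daemon (an illustration
// of the principle, not spymemcached's implementation; addresses are
// hypothetical). Servers are placed on a hash ring via virtual nodes; a
// key belongs to the first server at or after the key's hash position.
class HashRing {

   private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

   HashRing(List<String> cacheServers) {
      for (String server : cacheServers) {
         for (int replica = 0; replica < 100; replica++) {   // virtual nodes smooth the distribution
            ring.put((server + "#" + replica).hashCode(), server);
         }
      }
   }

   String serverFor(String cacheKey) {
      SortedMap<Integer, String> tail = ring.tailMap(cacheKey.hashCode());
      Integer position = tail.isEmpty() ? ring.firstKey() : tail.firstKey();   // wrap around the ring
      return ring.get(position);
   }
}
```

The value of this scheme is stability: if a server is added or removed, only the keys in that server's ring segments move; all other keys keep mapping to the same server.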

Listing 2 uses spymemcached, a memcached client written in Java, to cache HttpResponse messages across multiple machines. The spymemcached library implements the required client logic I just described.

Listing 2. memcached-based HttpResponse cache

interface IHttpResponseCache {

   IHttpResponse put(String key, IHttpResponse response) throws IOException;

   void remove(String key) throws IOException;

   IHttpResponse get(String key) throws IOException;
}


class RemoteHttpResponseCache implements IHttpResponseCache {

   private MemcachedClient memCachedClient;

   public RemoteHttpResponseCache(InetSocketAddress... cacheServers) throws IOException {
      memCachedClient = new MemcachedClient(Arrays.asList(cacheServers));
   }

   public IHttpResponse put(String key, IHttpResponse response) throws IOException {
      byte[] bodyData = response.getBlockingBody().readBytes();

      memCachedClient.set(key, 3600, bodyData);   // expire the entry after one hour
      return null;
   }

   public IHttpResponse get(String key) throws IOException {
      byte[] bodyData = (byte[]) memCachedClient.get(key);
      if (bodyData != null) {
         return new HttpResponse(200, "text/plain", bodyData);
      } else {
         return null;
      }
   }

   public void remove(String key) throws IOException {
      memCachedClient.delete(key);
   }
}
Listing 2 and the rest of this article's example code also use the xLightweb HTTP library. Listing 3 shows an example business service implementation. The onRequest(...) method -- similar to the Servlet API's doGet(...) or doPost(...) method -- is called each time a request header is received. The exchange.send() method sends the response.

Listing 3. Example business service implementation

class MyRequestHandler implements IHttpRequestHandler {

   public void onRequest(IHttpExchange exchange) throws IOException {

      IHttpRequest request = exchange.getRequest();

      int customerId = request.getRequiredIntParameter("id");
      long amount = request.getRequiredLongParameter("amount");

      // perform some operations
      String response = ...

      // and return the response
      exchange.send(new HttpResponse(200, "text/plain", response));
   }
}


class Server {

   public static void main(String[] args) throws Exception {
      HttpServer httpServer = new HttpServer(8180, new MyRequestHandler());
      httpServer.run();
   }
}
Based on the HttpResponse cache, you can implement a simple solution that caches the HTTP response for a given HTTP request. If the same request is received again, the corresponding response can be taken from the cache without calling the business service. This requires intercepting the request-handling flow, which can be done with the interceptor shown in Listing 4.

Listing 4. Cache-supported business service example

class CacheInterceptor implements IHttpRequestHandler {

   private IHttpResponseCache cache;

   public CacheInterceptor(IHttpResponseCache cache) {
      this.cache = cache;
   }

   public void onRequest(final IHttpExchange exchange) throws IOException {

      IHttpRequest request = exchange.getRequest();

      // check if request is cacheable (Cache-Control header, ...)
      // ...
      boolean isCacheable = ...

      // if request is not cacheable forward it to the next handler of the chain
      if (!isCacheable) {
         exchange.forward(request);
         return;
      }

      // create the cache key
      StringBuilder sb = new StringBuilder(request.getRequestURI());
      TreeSet<String> sortedParamNames = new TreeSet<String>(request.getParameterNameSet());
      for (String paramName : sortedParamNames) {
         sb.append(URLEncoder.encode(paramName) + "=");

         List<String> paramValues = Arrays.asList(request.getParameterValues(paramName));
         for (String paramValue : paramValues) {
            sb.append(URLEncoder.encode(paramValue) + ", ");
         }
      }
      final String cacheKey = URLEncoder.encode(sb.toString());

      // is request in cache?
      IHttpResponse cachedResponse = cache.get(cacheKey);
      if (cachedResponse != null) {
         IHttpResponse response = HttpUtils.copy(cachedResponse);
         response.setHeader("X-Cached", "true");
         exchange.send(response);

      // .. no -> forward it to the next handler of the chain
      } else {

         // define an intermediate response handler to intercept and copy the response
         IHttpResponseHandler respHdl = new IHttpResponseHandler() {

            public void onResponse(IHttpResponse response) throws IOException {
               cache.put(cacheKey, HttpUtils.copy(response));
               exchange.send(response);  // forward the response to the client
            }

            public void onException(IOException ioe) throws IOException {
               exchange.sendError(ioe);  // forward the error to the client
            }
         };

         // forward the request to the next handler of the chain
         exchange.forward(request, respHdl);
      }
   }
}


class Server {

   public static void main(String[] args) throws Exception {
      RequestHandlerChain handlerChain = new RequestHandlerChain();
      handlerChain.addLast(new CacheInterceptor(new RemoteHttpResponseCache(new InetSocketAddress(cachSrv1, 11211), new InetSocketAddress(cachSrv2, 11211))));
      handlerChain.addLast(new MyRequestHandler());

      HttpServer httpServer = new HttpServer(8180, handlerChain);
      httpServer.run();
   }
}

The CacheInterceptor in Listing 4 uses the memcached-based implementation to cache responses, keyed by the request URI and its sorted parameters. If the cache contains a response for this key, the request is not forwarded to the business-service handler; instead, the response is returned from the cache. If the cache does not contain a response, the request is forwarded, and a response handler is added to intercept the response flow. When a response is received from the business-service handler, the response is added to the cache. (Note that Listing 4 does not show cache invalidation. Often dedicated business operations require the cache entry to be invalidated.)

The consistent-hashing approach leads to high scalability. Based on consistent hashing, the memcached client implements a failover strategy to support high availability. But if a daemon crashes, the cache data is lost. This is a minor problem, because cache data is redundant by definition.

A simple approach to make the memcached architecture fail-safe is to store the cache entry on a primary and a secondary cache server. If the primary cache server goes down, the secondary server probably contains the entry. If not, the required (cached) data must be recovered from the underlying data source.
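The primary/secondary idea can be sketched with two plain maps standing in for the two memcached daemons. This is a deliberate simplification: a real client would talk to remote daemons and handle timeouts, and the class name is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of primary/secondary cache replication, using two in-process maps
// in place of two memcached daemons (a simplification): every entry is
// written to both servers, and reads fall back to the secondary.
class FailSafeCache {

   private final Map<String, byte[]> primary = new HashMap<String, byte[]>();
   private final Map<String, byte[]> secondary = new HashMap<String, byte[]>();
   private boolean primaryUp = true;

   void put(String key, byte[] value) {
      if (primaryUp) {
         primary.put(key, value);
      }
      secondary.put(key, value);       // write-through to the secondary server
   }

   byte[] get(String key) {
      if (primaryUp && primary.containsKey(key)) {
         return primary.get(key);
      }
      return secondary.get(key);       // a miss here means recovering from the data source
   }

   void simulatePrimaryCrash() {       // stand-in for a daemon failure
      primaryUp = false;
      primary.clear();
   }
}
```

The cost of this scheme is doubled write traffic and cache memory; whether that is worth it depends on how expensive a full cache rebuild from the data source would be.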

Application session data support

Supporting application session data in a fail-safe way is more problematic. Application session data represents the state of a user-specific application session. Examples include the ID of a selected folder or the articles in a user's shopping cart. The application session data must be maintained across requests. In classic ("WEB 1.0") Web applications, such session data must be held on the server side. Storing it in the client by using cookies or hidden fields has two major weaknesses. It exposes internal session data, such as the price fields in shopping cart data, to attack on the client side, so you must address this security risk. And this approach works only for a small amount of data that's limited by the maximum size of the HTTP cookie header and the overhead of transferring the application session data to and from the client.

Similarly to the memcached architecture, session servers can be used to store the application session data on the server side. However, in contrast to cached data, application session data is not redundant by definition. For this reason application session data is not removed to make room for new data if the maximum memory size is reached. Caches, by contrast, are free to remove cache entries for memory-management reasons at any time: caching algorithms such as least recently used (LRU) remove cache entries when the maximum cache size is reached.

If the session server crashes, the application session data is lost. In contrast to cached data, application session data is not recoverable in most cases. For this reason it is important that failover solutions support application session data in a fail-safe way.

Client affinity

The disadvantage of the cache and session server approach is that each request leads to an additional network call from the server to the cache or session server. In most cases call latency is not a problem because the cache or session server and the business servers are placed in the same, fast network segment. But latency can become problematic if the size of the data entries increases. To avoid moving large sets of data between the business server and cache/session servers again and again, requests of a dedicated client must always be forwarded to the same server. This means all of a user session's requests are handled by the same server instance.

In the case of caching, a local cache can be used instead of the distributed memcached server infrastructure. This approach, known as client affinity, does not require cache servers. Client affinity always directs the client to "its" particular server.

The example in Listing 5 implements a local cache and requires client affinity.

Listing 5. Local cached-based example requiring client affinity

class LocalHttpResponseCache extends LinkedHashMap<String, IHttpResponse> implements IHttpResponseCache {

   public synchronized IHttpResponse put(String key, IHttpResponse value) {
      return super.put(key, value);
   }

   public synchronized void remove(String key) {
      super.remove(key);
   }

   public synchronized IHttpResponse get(String key) {
      return super.get(key);
   }

   protected boolean removeEldestEntry(Entry<String, IHttpResponse> eldest) {
      return size() > 1000;   // cache up to 1000 entries
   }
}


class Server {

   public static void main(String[] args) throws Exception {
      RequestHandlerChain handlerChain = new RequestHandlerChain();
      handlerChain.addLast(new CacheInterceptor(new LocalHttpResponseCache()));
      handlerChain.addLast(new MyRequestHandler());

      HttpServer httpServer = new HttpServer(8080, handlerChain);
      httpServer.run();
   }
}

LVS supports affinity by enabling persistence -- remembering the last connection for a predefined period of time. Persistence makes a particular client connect to the same real server across different TCP connections. But persistence doesn't really help when clients connect through provider proxies: requests belonging to the same session can then arrive over different TCP connections with different source IP addresses.

Conclusion to Part 1

Infrastructures based on pure transport-level server load balancers are common. They are simple, flexible, and highly efficient, and they present no restrictions on the client side. Often such architectures are combined with distributed cache or session servers to handle application-level caching and session data issues. However, if the overhead caused by moving data to and from the cache or session servers grows, such architectures become increasingly inefficient. By implementing client affinity based on an application-level server load balancer, you can avoid copying large datasets between servers. Read Server load balancing architectures, Part 2 for a discussion of application-level load balancing.

Gregor Roth, creator of the xLightweb HTTP library, works as a software architect at United Internet group, a leading European Internet service provider to which GMX and 1&1 belong. His areas of interest include software and system architecture, enterprise architecture management, object-oriented design, distributed computing, and development methodologies.
