Clustering and Load balancing Tomcat servers

Tomcat is by far the most popular open source Java servlet container that can easily be scaled to ensure high availability. In this post I will show how easy it is to distribute and scale java web applications using tomcat clusters. In this example we will also use Apache web server to load balance the cluster.

Below is a simple setup of a cluster with two Tomcat 7 nodes with Apache web server 2.2 as a load balancer.



Load balanced clustering on tomcat is done in four simple steps:

1- Uniquely naming each node
2- Configure clustering in server.xml
3- Configure the session persistence in context.xml (Optional)
4- Enable distribution in the Java web application

The following configuration is done on each tomcat instance either running on the same machine or on separate virtual or physical servers. If multiple instances are running on the same machine then it should be ensured that each instance is listening on different ports.

Naming the tomcat nodes

The server node can be named by adding the jvmRoute element to the tomcat Engine as shown below. This node name is appended to the session id.

Configuring the cluster

Clustering can easily be enabled in tomcat by adding the following to the server.xml file

[turbo_widget widget-prefix=text&obj-class=WP_Widget_Text&widget-text–title=&widget-text–text=%3C!–+Post+Ad+–%3E%0A%3Cins+class%3D%22adsbygoogle%22%0A+++++style%3D%22display%3Ablock%22%0A+++++data-ad-client%3D%22ca-pub-4433761869744390%22%0A+++++data-ad-slot%3D%229020480262%22%0A+++++data-ad-format%3D%22auto%22%3E%3C%2Fins%3E%0A%3Cscript%3E%0A(adsbygoogle+%3D+window.adsbygoogle+%7C%7C+%5B%5D).push(%7B%7D)%3B%0A%3C%2Fscript%3E&widget-text–filter=false]

This will also enable session replication on all tomcat nodes. Below is a detailed example that shows more options, including multicast and deployer configuration. See the tomcat documentation for more info.

Configuring session persistence (Optional)

Session persistence in configured in the context.xml file by defining a persistence manager. By default the session is persisted in-memory but optionally a disk or database storage can be also be configured. The example below shows how the session is persisted on the local disk using a session store.

Making the application distributable

The java web applications can be configured to run in a cluster by adding the following to application’s web.xml file.

Once all this configuration is done on each node, the servers will be running in a cluster with session replication enabled.

[turbo_widget widget-prefix=text&obj-class=WP_Widget_Text&widget-text–title=&widget-text–text=%3C!–+Post+Ad+–%3E%0A%3Cins+class%3D%22adsbygoogle%22%0A+++++style%3D%22display%3Ablock%22%0A+++++data-ad-client%3D%22ca-pub-4433761869744390%22%0A+++++data-ad-slot%3D%229020480262%22%0A+++++data-ad-format%3D%22auto%22%3E%3C%2Fins%3E%0A%3Cscript%3E%0A(adsbygoogle+%3D+window.adsbygoogle+%7C%7C+%5B%5D).push(%7B%7D)%3B%0A%3C%2Fscript%3E&widget-text–filter=false]

Load balancing the cluster

Now the next step is to configure load balancing to effectively use the scalable tomcat cluster and to ensure high availability. Meaning if one of the nodes, in the cluster, goes down, the application will still be available on the other node. And with session replication enabled the service and user experience will not be impacted.

In this example I am using Apache’s mod_proxy_ajp to load balance the two tomcat instances. The configuration is fairly simple and is added to the httpd.conf file. The example also shows how session sticky-ness is enabled. The loadfactor defines the weighted load that is applied on each node. The lbmethod configures the load balancing strategy, which can either be byrequests or bytraffic.

Note: mod_proxy_ajp is not the only way apache is used to load balance tomcat instances, the other commonly used methods are mod_jk and mod_proxy.

Crawl, index and search

Sometimes you need to search for files or pages on content-rich websites or browser-based information software/encyclopedias that don’t really have a search functionality. And it could be a pain to find what you are looking for. I wrote a little crawler once in python that works well to search for stuff, on websites, on the fly. But sometimes a real “index” is needed for searching. There are a few libraries available and among them is the open source Apache Lucene, an excellent high-performance text search engine library that one can use for free. Lucene coupled with a multi-threaded web crawler and you have a pretty good index and search functionality; though not as good as google, but close.

Below is an example of how you can use Lucene to build searchable indexes for websites.

Now once the index is created we can start searching it for content. Lucene provides IndexSearcher class that is used to search the index using a Query. Below is an example that searches for results in the above created index, and prints the website URL where the required content is found.

The full source of this example, including the web crawler, can be found here and is available under GPL.