Building a self maintaining Grid Environment

I have been primarily involved in setting up the Grid environment for Selenium automation at my company. One of the challenges I faced (I had around 40 physical machines running nodes) was maintaining that environment.

Most often, like many long-running Java processes, the nodes would turn stale after some time and sometimes even get killed automatically. The biggest reason I found behind this was that the Java processes were running out of memory.

Now there are a variety of reasons why the nodes ended up running out of memory. The two predominant ones that I came across were:

  • People opening File Open dialog boxes as part of their automation tests.
  • Print dialog boxes being invoked as part of automation tests.

In both these cases, the browser would remain stuck because Selenium/WebDriver cannot deal with operating-system-level dialog boxes. This would leave the browser in an orphaned state. To make matters worse, the Grid has logic built into it which causes the nodes to try to recycle browser windows that have been opened but left idle for some time. Here's where the trouble starts. The node continuously tries to recycle the browser (which it considers idle, since it is not processing any UI operations), but since an OS-level dialog box is sitting between the browser and the node, the browser window cannot be recycled. Over a period of time this ends up causing the Java process to run out of heap memory, which results in an OutOfMemoryError. When this happens, I noticed that either the node goes stale (a horrible condition that is difficult to detect, because many threads run on the node and I noticed just one thread that conks out with the OutOfMemoryError) or the Java process gets terminated.

So I decided to explore whether I could tweak the Grid environment so that it heals itself, i.e., the nodes have enough intelligence built into them to automatically recycle themselves when a particular condition occurs.

I learnt that the Grid [ I am truly awed by how many things were thought through beforehand when Grid2 was re-designed ] has provisions wherein I could inject:

  • A custom capability matcher [ This means that I could decide how the Grid should narrow down on a node so that a given test can be routed to it ]
  • A custom proxy in place of the default one [ A proxy is essentially what represents a given node. As such, a proxy can be altered to do almost anything that one wishes ]
  • One or more servlets, into either the Grid hub or the node [ Servlets are what get invoked when one requests a page on the hub. http://localhost:4444/grid/console, which shows the Grid console, is one such servlet. It's called the ConsoleServlet ]
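
To make the first of these injection points a little more concrete, here is a minimal, self-contained sketch of what capability matching boils down to: compare what a test asked for against what a node offers. This is not the Grid's own matcher. The class name `SimpleCapabilityMatcher` and the choice of keys are my assumptions for illustration; a real custom matcher would implement the Grid's matcher interface and be plugged into the hub.

```java
import java.util.Map;

/**
 * Illustrative stand-in for a capability matcher (hypothetical class, not part of Selenium).
 * A node matches a request when every key the request specifies agrees with the node's value.
 */
final class SimpleCapabilityMatcher {

    // Keys considered during matching (an assumption for this sketch); other keys are ignored.
    private static final String[] KEYS = {"browserName", "version", "platform"};

    static boolean matches(Map<String, Object> nodeCapability,
                           Map<String, Object> requestedCapability) {
        for (String key : KEYS) {
            Object wanted = requestedCapability.get(key);
            if (wanted == null || wanted.toString().isEmpty()) {
                continue; // the test did not ask for this key, so any node value is acceptable
            }
            Object offered = nodeCapability.get(key);
            if (offered == null || !wanted.toString().equalsIgnoreCase(offered.toString())) {
                return false; // the node cannot satisfy this part of the request
            }
        }
        return true;
    }
}
```

With a node advertising `browserName=firefox`, a request for `Firefox` would match (the comparison here ignores case) and a request for `chrome` would not.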

So after some research I figured out the ingredients I needed for the automatic Grid maintenance mechanism that I wanted to build. They were:

  1. A way to spawn a jar and continuously monitor it.
  2. A customized Grid node which can count the number of tests that it services and, after "n" tests, gracefully shut itself down.

To build item #1, I did some googling and quickly found that Apache Commons had just the library I was looking for. There's a popular saying that I read somewhere: if you have thought about it, there is a good chance that Apache already has a library ready for you 🙂

It was true in my case. Apache Commons has a library called commons-exec which does exactly this. You can read more about this library here : http://commons.apache.org/exec/tutorial.html

Since I have been working with Maven as a build tool, here’s what the maven dependency looks like :


The draft version of this standalone application would look like this : https://gist.github.com/4649471
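
For readers who just want the shape of the idea, here is a rough sketch of the respawn loop. Note that it is not the gist above and it does not use commons-exec: I have swapped in the JDK's own `ProcessBuilder` so that the example has no dependencies. The class name `JarRespawner`, the `maxSpawns` cap and the `javaJarLauncher` helper are all hypothetical names made up for this illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntSupplier;

/** Minimal respawn loop: run a child to completion, and when it exits, start it again. */
final class JarRespawner {

    private final IntSupplier launcher; // runs one child until it exits and returns its exit code
    private final int maxSpawns;        // hypothetical cap so that the sketch can terminate

    JarRespawner(IntSupplier launcher, int maxSpawns) {
        this.launcher = launcher;
        this.maxSpawns = maxSpawns;
    }

    /** Keeps relaunching the child until the cap is reached; returns the number of spawns. */
    int superviseLoop() {
        int spawns = 0;
        while (spawns < maxSpawns) {
            spawns++;
            int exitCode = launcher.getAsInt(); // blocks while the child is alive
            System.out.println("Child exited with code " + exitCode + " (spawn #" + spawns + ")");
        }
        return spawns;
    }

    /** Launcher that starts "java -jar <jarAndArgs...>" and waits for it to exit. */
    static IntSupplier javaJarLauncher(List<String> jarAndArgs) {
        return () -> {
            List<String> command = new ArrayList<>(Arrays.asList("java", "-jar"));
            command.addAll(jarAndArgs);
            try {
                Process child = new ProcessBuilder(command).inheritIO().start();
                return child.waitFor();
            } catch (IOException e) {
                return -1; // the child could not be started at all
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        };
    }
}
```

A supervisor for a node could then be started with something like `new JarRespawner(JarRespawner.javaJarLauncher(Arrays.asList(args)), Integer.MAX_VALUE).superviseLoop()`. A real one would probably also pause briefly between restarts, so that a jar which fails immediately does not get respawned in a tight loop.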

If you are working with Maven, you can convert this into a standalone jar by using the maven-assembly-plugin.

With this, the first part of the puzzle is solved. Now we need to build the custom proxy which is going to count the number of unique tests for us.

The draft version of the code can be found here : https://gist.github.com/4649607
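
The counting part on its own is small. Below is a minimal sketch of a thread-safe "test budget" that such a proxy could consult before accepting a session. `TestBudget` and `tryClaim` are hypothetical names of mine, not Grid classes, and the sketch deliberately leaves out the Grid wiring: in the real proxy this logic would sit inside a class extending the Grid's proxy and be called from one of its session hooks.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Counts down the number of tests a node may still service before it should be recycled. */
final class TestBudget {

    private final AtomicInteger remaining;

    TestBudget(int maxTests) {
        if (maxTests < 0) {
            throw new IllegalArgumentException("maxTests must not be negative");
        }
        this.remaining = new AtomicInteger(maxTests);
    }

    /** Claims one test slot. Returns false once the budget has been used up. */
    boolean tryClaim() {
        while (true) {
            int current = remaining.get();
            if (current == 0) {
                return false; // nothing left: the node is due for a restart
            }
            if (remaining.compareAndSet(current, current - 1)) {
                return true;  // this caller won the slot
            }
            // another thread changed the counter in between; retry with the fresh value
        }
    }

    /** True when no more tests should be routed to the node. */
    boolean isExhausted() {
        return remaining.get() == 0;
    }
}
```

The compare-and-set loop is there because a hub can be asked for several sessions at the same time. A plain `int` with `--` could hand out more slots than the budget allows.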

But we still haven't tackled one big problem. How do you make a node gracefully shut itself down? If you thought that merely invoking System.exit() would do the trick, then you are wrong. Wondering why?

Well, remember that this proxy is going to be part of the one big standalone jar that you would end up building, and that the proxy runs inside the hub's JVM, not the node's. So if you called System.exit() from the proxy, you wouldn't be killing the node; you would actually be killing the Grid hub.

This is where we are going to make use of a servlet. We create a servlet and inject it into the node. Once the proxy is done with "n" tests, it tells the servlet injected into the node that the node can be shut down, and the servlet issues a System.exit() inside the node's own JVM.

The draft version of the code can be found here : https://gist.github.com/4649702
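
To show the mechanics without pulling in the servlet API, here is a small sketch that uses the JDK's built-in `com.sun.net.httpserver.HttpServer` as a stand-in for the servlet. Everything in it is an assumption made for illustration: the class name `NodeShutdownEndpoint`, the URL path, the use of a plain GET, and binding to 127.0.0.1 only. On a real node, the `exitAction` would be something like `() -> System.exit(0)`, and the hub-side proxy would be the one calling `requestShutdown`.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Stand-in for a node-side shutdown servlet: an HTTP endpoint that triggers an exit action. */
final class NodeShutdownEndpoint {

    static final String PATH = "/extra/NodeShutDownServlet"; // assumed path, for illustration only

    private final HttpServer server;
    private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);

    NodeShutdownEndpoint(int port, Runnable exitAction) {
        try {
            // Port 0 lets the operating system pick a free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext(PATH, exchange -> {
            shutdownRequested.set(true);
            byte[] body = "Shutting down node".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body); // answer the caller first, then exit
            }
            exitAction.run();
        });
        server.start();
    }

    int port() { return server.getAddress().getPort(); }

    boolean isShutdownRequested() { return shutdownRequested.get(); }

    void stop() { server.stop(0); }

    /** What the hub-side proxy would do once its test budget is used up. Returns the HTTP status. */
    static int requestShutdown(int port) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI
                    .create("http://127.0.0.1:" + port + PATH).toURL()
                    .openConnection(Proxy.NO_PROXY);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the design is that the exit call runs inside the node's JVM, because that is where the endpoint lives. The proxy on the hub only sends an HTTP request.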

So once we have built this as another standalone jar, how exactly do we put all of this together?

Here's how :

Let's assume that our standalone app for spawning jars is called jarspawner.jar and the standalone app that contains our custom proxy and node servlet is called mygrid.jar

The command to start the Grid would be :

java -jar mygrid.jar -role hub

The command to start the node would be :

java -jar jarspawner.jar mygrid.jar -role node -hub http://localhost:4444/grid/register -nodeConfig config.txt -servlets com.test.node.servlets.NodeShutDownServlet

The contents of the config.txt would be as below :

    {
        "capabilities": [{
            "browserName": "firefox",
            "maxInstances": 5
        }],
        "configuration": {
            "proxy": "com.test.proxy.MyProxy",
            "maxSession": 5,
            "port": 5555,
            "register": true,
            "registerCycle": 5000,
            "hubPort": 4444
        }
    }

Notice that the configuration parameter proxy is being overridden with the name of this custom proxy, viz., com.test.proxy.MyProxy

This is how we let the Grid know that it should not use the default proxy but should instead use our proxy to talk to each of the nodes. That way each node is represented by our custom proxy, which counts the number of tests serviced by that node and then gracefully shuts the node down by issuing a shutdown call to the servlet we injected into the node.

Well, I know this all sounds like a lot of gibberish, and I could have taken the easy way out by just creating a GitHub project with all the source code. But that is not my intention here. My intention is just to decipher the critical parts and let you figure out the easy ones.. just as I did too 🙂

Since I did get some offline queries about how to package this as a standalone Java application (especially the proxy part), I am adding that information here as well.

I have so far used maven-assembly-plugin for all my packaging needs.

So if you are also using Maven as your build tool, here's what the assembly plugin contents would look like :


And the assembly.xml would look like below

    xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">

For those of you who would like to use other mechanisms of creating a standalone jar, the main method resides in this class : 



16 thoughts on “Building a self maintaining Grid Environment”

  1. Krishnan – Thanks for sharing info. Few things are still not clear i.e. nodes ending up running out of memory
    You mentioned about your problem of OS level dialogues whether it is print or upload window.
    Even if you have setup that you mentioned above how you are going to handle those things on nodes? To make it more clear, on very first test my browser got stuck with OS window and now what will happen? Your node will get stuck on very first test and will not be able to handle more test.
    1. How will node heal now?
    2. As it has not reached X number of test case how will it shutdown and re-register with hub
    3. As you told you have physical machine, do you manually restart the machine in this case? Even if this is case of VM then do we have to manually kill the processes and restart the machine and node?


    Posted by Gulshan Saini | January 28, 2013, 2:30 pm
    • The node wouldn’t heal itself by processing any of the OS level dialogs. What would happen here is, that after a given point, the JVM will run out of Memory and crash. All that the JarSpawner standalone app does is, it keeps checking if the selenium standalone jar is running. If it is not, it re-spawns it.

      A node running out of memory can happen for a variety of reasons, including the ones that I cited. The other reasons include a JVM that has been running for a prolonged time. If a Java process runs for a long time, it gradually starts hogging memory and eventually eats up all of the heap allotted to it. This is what is called a memory leak in the programming world. I don't think the Grid has been pruned enough to plug memory leaks. I believe this was not done because the Grid and the node are not “mission critical” applications whose downtime can impact business.

      The physical machine is NOT restarted here (nor is there any need to do this). All that needs to be done is to kill the JVM and re-spawn it. So in this case, either the OS throws the JVM out because of the OOM error or the JVM exits on its own. Once that happens, the standalone JarSpawner app re-spawns the node (as if you were doing it yourself).

      If it's a VM, then the management is more elegant. You can actually enhance your Grid proxy to manage the restarting of the VM itself. I believe there are APIs for this (I have never tried it).

      Posted by Confusions Personified | January 28, 2013, 2:47 pm
      • Thanks for clarification, Is the API mentioned in last line part of Selenium project or being maintained under some other project. If possible share the link, I want to look at the API

        Posted by Gulshan Saini | January 28, 2013, 7:45 pm
      • All APIs that are being used are part of Grid. Try exploring the selenium code base for more information

        Posted by Confusions Personified | January 28, 2013, 8:07 pm
  2. I’ve tried to create the mygrid jar, adding the proper dependencies for javax.servlet and selenium. But when trying to execute java -jar mygrid.jar -role hub I’m getting the error message “Failed to load Main-Class manifest attribute from mygrid-0.1.jar”. Please let me know what I am missing.

    Posted by opensourcelover | March 14, 2013, 4:29 pm
  3. I think consider ..

    Number Of GB RAM on Node = 1 for every browser instance. (3 Gigs for a 3 browser instance)

    Posted by Christian Clarke | September 9, 2013, 12:27 am
  4. We were able to automate the process to acquire a node “windows machine” ( through web services and a blueprint which already has selenium jars) and apply all the changes (registry) and register to a hub, this helped us to destroy and recreate a new node every 3 or 2 days automatically and nodes were clean

    Posted by Dheeraj | February 20, 2014, 11:40 pm
  5. I believe there is a bug in your code / flaw in the logic here. I’ve implemented the custom grid proxy (thanks for the help!) however, if I set the number of tests before cleanup to 2, but request 100 WebDrivers the message “Cannot forward any more tests to this proxy 10.0.0.xxx” shows up after 2 tests hit the grid node, however the tests continue to pile onto the node and execute until all 100 have executed, then the grid will be shutdown. Is this working as expected? I would think that once the max number of tests have been serviced on a node, no more tests should hit the node until after it is recycled.

    Posted by Adam | August 12, 2014, 1:45 am
  6. In regards to my previous comment. To fix the problem, replace the beforeSession override with the following:

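    public TestSession getNewSession(Map<String, Object> requestedCapability) {
        // The condition was missing from the comment as posted; decrementCounter() is assumed here.
        if (decrementCounter()) {
            return super.getNewSession(requestedCapability);
        }
        System.out.println("Cannot forward any more sessions to this node until it has been recycled.");
        return null;
    }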

    Posted by Adam | August 12, 2014, 1:59 am
  7. Sorry to spam the comments section, but my last code snippet has a flaw in it which I’ve corrected (I’m new to the grid code base). I didn’t realize that the getNewSession method is spammed by the capability matcher when searching for a node to run a test. In other words, when the grid is searching for an available node, it iterates over ALL nodes and hits getNewSession with a request. As a result the test run counters were being incremented way too often and my nodes were going offline and then back online too frequently. The code below ran perfectly last night:

    /**
     * overwrites the session allocation to discard the proxy that are down.
     */
    public TestSession getNewSession(Map<String, Object> requestedCapability) {
        // The guard for this early return was missing from the comment as posted;
        // noMoreTestsAllowed() is a hypothetical placeholder for it.
        if (noMoreTestsAllowed()) {
            return null;
        }

        TestSession session = super.getNewSession(requestedCapability);

        // Count a test only when a session was actually allocated on this node.
        if (session != null) {
            boolean moarTestsOK = decrementCounter();
            // This condition was also missing as posted; it is assumed.
            if (!moarTestsOK) {
                System.out.println("Cannot forward any more sessions to this node until it has been recycled.");
            }
        }

        return session;
    }

    Posted by Adam | August 13, 2014, 7:39 pm
  8. Hi, I am new to hacking selenium, where would you put something like NodeShutDownServlet.java in the source code of selenium? am I right in assume you want to alter whatever code that builds build/java/server/test/org/openqa/selenium/server-with-tests-standalone.jar?

    Posted by Yangda | October 8, 2014, 4:10 am
  9. Hi Author,

    Wondering if this is the right place to ask . I was trying to follow the steps described

    The command to start the Grid would be :
    java -jar mygrid.jar -role hub

    How would that run the hub by running mygrid.jar if it has only a custom proxy and a custom servlet? Am I missing something here?


    Posted by Tanmay Ghorai | March 2, 2015, 5:21 am
