When setting up Jenkins on AWS, it's a common pattern to jump straight to the EC2 plugin, unless you're using spot instances or Kubernetes. The plugin has its appeal when starting out, since it's relatively easy to configure. However, as you start to scale your Jenkins setup, you may notice some downsides.

With the EC2 plugin you're limited to configuring it through the Jenkins management interface. Again, when starting out this is an easy solution: you can quickly throw in some configuration and have an agent magically appear! Later on, when you're trying to automate the process or change settings dynamically, this quickly becomes a problem.

While this can technically be automated with Groovy scripting, it's not a great option since the entire Jenkins configuration has to be saved at once. It would be nicer if just the portion relating to the EC2 plugin could be saved on its own.

The EC2 plugin uses Jenkins' node provisioner under the hood, which by default scales instances up based on queue length. So on a smaller Jenkins you kick off a job, wait for the queue metrics to propagate and trigger a scale-up, and then wait again for the underlying EC2 instance to boot. This can add minutes to a job and, in my experience, produce angry developers!

Another issue with the Jenkins provisioner is that while it can be tuned, it can only be tuned via JVM arguments. That didn't work for us: changing those parameters means restarting Jenkins, and we didn't have the luxury of taking Jenkins down repeatedly just to experiment with settings.

Once nodes are provisioned they get used as needed, and they stick around until they aren't needed anymore; agents are only deprovisioned after they've been fully idle for quite some time. I've found that nodes tend to get into a bad state after being up for a while, so it's usually better to run as few builds on each one as possible. After some arbitrary number of builds, Jenkins agents may start failing builds purely because of the state they've accumulated. By cycling nodes constantly, it's far more likely they'll always be in a clean state.

To do better than the EC2 plugin, you'll need to do the management on your own. Luckily, AWS makes this fairly easy. We relied heavily on AWS EC2 Auto Scaling Groups: we built an AMI containing the Jenkins remoting components so each instance could attach to Jenkins as soon as it came up, and we defined CloudFormation for any infrastructure it required.
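To give a sense of the shape of that setup, here is a rough boto3 equivalent of the Auto Scaling Group behind the agent pool. Ours was declared in a CloudFormation stack; the group name, launch template name, and subnets below are placeholders, not our real values.

```python
# Rough boto3 sketch of the Auto Scaling Group behind the agent pool. In our case
# this was declared in CloudFormation; the names here are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="jenkins-agents",            # hypothetical name
    MinSize=2,
    MaxSize=100,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",  # placeholder subnets
    LaunchTemplate={
        # The launch template points at the custom AMI that ships the Jenkins
        # remoting components, so each instance attaches to the controller on boot.
        "LaunchTemplateName": "jenkins-agent",         # hypothetical name
        "Version": "$Latest",
    },
)
```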

Target tracking scaling policies for Amazon EC2 Auto Scaling made it really simple to define how to auto scale. We set up a Lambda function, triggered on a schedule by CloudWatch Events, that connected to Jenkins, read the metrics Jenkins produced, and published them to CloudWatch. In our case we published a "utilization percent", computed as (1.0 - idle_executors / total_executors) * 100.0. We then set the target value for utilization percent in our target tracking policy to somewhere between 50% and 75%, and let Amazon fully manage the scaling. Since the target was under 100%, we always had some idle workers; that did create some overhead, but the time to a build starting was significantly decreased.
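A minimal sketch of that metric-publishing Lambda (not our original code): it assumes Jenkins user and API-token credentials in environment variables, reads executor counts from the stock /computer/api/json endpoint, and publishes to a made-up "Jenkins" namespace with a made-up "UtilizationPercent" metric name.

```python
# Sketch of the scheduled Lambda that publishes Jenkins utilization to CloudWatch.
# JENKINS_URL / JENKINS_USER / JENKINS_TOKEN and the metric names are assumptions.
import base64
import json
import os
import urllib.request

import boto3

JENKINS_URL = os.environ["JENKINS_URL"]
AUTH = base64.b64encode(
    f"{os.environ['JENKINS_USER']}:{os.environ['JENKINS_TOKEN']}".encode()
).decode()

cloudwatch = boto3.client("cloudwatch")


def handler(event, context):
    # Ask Jenkins how many executors exist and how many are busy right now.
    request = urllib.request.Request(f"{JENKINS_URL}/computer/api/json")
    request.add_header("Authorization", f"Basic {AUTH}")
    with urllib.request.urlopen(request) as response:
        stats = json.load(response)

    total = stats["totalExecutors"]
    idle = total - stats["busyExecutors"]

    # utilization percent = (1.0 - idle_executors / total_executors) * 100.0
    utilization = (1.0 - idle / total) * 100.0 if total else 0.0

    # Publish the custom metric that the target tracking policy scales on.
    cloudwatch.put_metric_data(
        Namespace="Jenkins",                      # hypothetical namespace
        MetricData=[{
            "MetricName": "UtilizationPercent",   # hypothetical metric name
            "Value": utilization,
            "Unit": "Percent",
        }],
    )
```

The target tracking policy then just points at that custom metric. Sketched with boto3 here, though for us this too lived in the CloudFormation stack:

```python
# One-time setup: scale the agent group on the custom utilization metric.
import boto3

boto3.client("autoscaling").put_scaling_policy(
    AutoScalingGroupName="jenkins-agents",        # hypothetical name from earlier
    PolicyName="jenkins-utilization-target",      # hypothetical name
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "Namespace": "Jenkins",
            "MetricName": "UtilizationPercent",
            "Statistic": "Average",
        },
        "TargetValue": 60.0,   # somewhere in the 50-75 range mentioned above
    },
)
```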

AWS EC2 Auto Scaling lets you define a lifecycle hook that pauses termination and hands control to something you own, in our case a Lambda function, before any instance is terminated. We needed this so that no builds would fail due to the underlying EC2 instance being terminated out from under them. The idea behind the Lambda is fairly simple and can be implemented with the Jenkins REST API (a rough sketch follows the list below). When an instance is marked for termination, it does the following:

  1. Mark the instance in Jenkins as offline, so that no new builds will be run on the instance.

  2. Periodically check to see that the instance has no builds running.

  3. When the instance is confirmed to have no builds running, complete the lifecycle action and allow AWS to terminate the node.
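Here is a rough sketch of that Lambda, using boto3 and the Jenkins REST API. It assumes the lifecycle notification arrives as an EventBridge event, that the Jenkins node name matches the EC2 instance id, and that credentials sit in environment variables; it shows the shape of the logic rather than our exact implementation.

```python
# Sketch of the termination lifecycle-hook Lambda. Assumes node name == instance id
# and Jenkins API-token auth; endpoint paths are standard Jenkins REST paths.
import base64
import json
import os
import time
import urllib.request

import boto3

JENKINS_URL = os.environ["JENKINS_URL"]
AUTH = base64.b64encode(
    f"{os.environ['JENKINS_USER']}:{os.environ['JENKINS_TOKEN']}".encode()
).decode()

autoscaling = boto3.client("autoscaling")


def jenkins_json(path):
    request = urllib.request.Request(f"{JENKINS_URL}{path}")
    request.add_header("Authorization", f"Basic {AUTH}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def jenkins_post(path):
    request = urllib.request.Request(f"{JENKINS_URL}{path}", data=b"", method="POST")
    request.add_header("Authorization", f"Basic {AUTH}")
    urllib.request.urlopen(request).close()


def handler(event, context):
    detail = event["detail"]
    node = detail["EC2InstanceId"]  # assumes the Jenkins node name is the instance id

    # 1. Mark the node temporarily offline so no new builds land on it.
    if not jenkins_json(f"/computer/{node}/api/json")["temporarilyOffline"]:
        jenkins_post(f"/computer/{node}/toggleOffline")

    # 2. Wait until the node reports itself idle (no builds running).
    while not jenkins_json(f"/computer/{node}/api/json")["idle"]:
        time.sleep(30)

    # 3. Tell the Auto Scaling Group it may finish terminating the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        InstanceId=node,
        LifecycleActionResult="CONTINUE",
    )
```

Note that a single Lambda invocation can't poll for hours on its own, which leads directly to the caveat below.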

There is one important caveat here: lifecycle actions to terminate nodes have a maximum timeout, so the wait can't run indefinitely. Because of this, we imposed a limit of 8 hours after which the node was removed regardless. For our instance, which ran 10,000 builds a day, this never became an issue.
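If draining takes longer than the hook's heartbeat timeout, the action can be kept paused (up to AWS's overall cap on lifecycle actions, which is what bounds the window) by recording heartbeats. A minimal sketch, assuming the same event fields as the previous example:

```python
# Reset the lifecycle hook's heartbeat timer so termination stays paused while the
# node is still draining. "detail" is the lifecycle event payload from the sketch
# above; this is illustrative, not our exact implementation.
import boto3

autoscaling = boto3.client("autoscaling")


def keep_termination_paused(detail, instance_id):
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        InstanceId=instance_id,
    )
```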

This solved most of the problems for us. We had better control over node auto scaling and no longer needed Jenkins restarts to tune it. All of our configuration lived as code in a CloudFormation stack that we could manage outside of the Jenkins instance. Provisioning was faster because there was always a pool of ready nodes, the pool grew proactively, and its size could be tuned easily in the AWS console or via the API. Stale nodes never became an issue since nodes were cycled often, and it was easy to set up rules to terminate them safely. With our custom termination handling in place, we could terminate any node without fear of breaking a build.

My hope is that someone else finds this useful and that it solves the same issues for them that we faced. At the time of writing, our deployment has been running for about 3 years and has scaled to tens of thousands of builds per day, with a peak concurrency of 600 nodes at a time.