Building an Elixir cluster on ECS
Last month I needed to create a cluster of Elixir nodes. There is extensive literature on this topic. Unfortunately, since the right solution depends heavily on your infrastructure needs and cloud provider, none of the approaches I found worked for me. Here are the constraints I had:
- runs on AWS ECS: our entire infrastructure was using AWS ECS. I didn’t care to invest in switching to Kubernetes (though that’s more popular and arguably something I should learn — I’ll learn on my own another time).
- dynamic cluster size: I wanted to be able to spin nodes up and down as needed. Many solutions suggested allocating specific servers and IP addresses and having each node learn about the others from static files committed to git.
- minimize changes: ideally, the changes to create a cluster would be mostly configuration. For this reason, articles that had lots of Elixir code or involved creating new infrastructure like a Redis cache didn’t appeal to me.
Before we talk about how to set up a cluster, let’s talk about why we would need to set up a cluster.
Disclaimer: this article assumes knowledge of AWS ECS, AWS networking, and EPMD. If you’re not familiar with these topics, it’s worth reading a primer on each (AWS networking, AWS ECS, and distributed Elixir with EPMD) before continuing.
Why do we need a cluster?
Let’s imagine that we want to push live updates to our website users. One of the approaches is to use a websocket so that the site visitor stays connected to the server and the server can “push” updates as needed.
This solution is pretty straightforward. Once a user connects to the server, the server can send updates to the user through the live connection. Here, we show some other service that’s telling the websocket server to send data to the user.
Eventually (hopefully), we’ll get enough users to need more than one server. When you need more than one server, you’ll likely use a load balancer to route requests equally to the different servers you have. Here is an example of what can go wrong in that situation:
While the load balancer is helpful in routing requests, here it has actually hurt us. It routed the “other service” to a server that the user isn’t connected to, so the update never reached the user. The solution here is not to remove or change the load balancer. If we want to send new data to users who subscribe to a feed of that data, it shouldn’t matter which server they’re connected to. The “other service” should be able to tell a single websocket server about the update and all users who care should get the new data.
For this reason, we need all the websocket servers to know about each other. When multiple servers communicate like this, it’s called a cluster. Each server in the cluster is called a node. In a cluster, each node communicates changes to the other nodes in the cluster. In this way, the “other service” can push data to a user no matter which server either one is connected to.
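To make the idea concrete, here is a toy, in-memory sketch of two nodes forwarding a message to each other. It is only an illustration: the `Node` class, `join`, and the user name are made up for this example, and real Elixir nodes would do this over distributed Erlang rather than Python objects.

```python
class Node:
    """A toy stand-in for one websocket server in the cluster."""

    def __init__(self, name):
        self.name = name
        self.peers = []    # the other nodes this node knows about
        self.inboxes = {}  # user id -> messages delivered to that user

    def connect_user(self, user_id):
        self.inboxes[user_id] = []

    def deliver_local(self, message):
        # Hand the message to every user connected to this node.
        for inbox in self.inboxes.values():
            inbox.append(message)

    def publish(self, message):
        # Deliver to local users, then forward to every peer node.
        self.deliver_local(message)
        for peer in self.peers:
            peer.deliver_local(message)


def join(nodes):
    """Make every node aware of every other node."""
    for node in nodes:
        node.peers = [other for other in nodes if other is not node]


a, b = Node("a"), Node("b")
join([a, b])
b.connect_user("alice")  # the load balancer put the user on node b
a.publish("new data")    # the load balancer sent the other service to node a
print(b.inboxes["alice"])  # ['new data']
```

Even though the publisher and the user landed on different nodes, the user still receives the update because node `a` forwards it to its peers.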
So the question is now:
How do we create a cluster?
Sure, we understand why we need one. But how do we actually make one? It can seem daunting, but I went through the trial and error of many different attempts so you don’t have to. Here are the different pieces and how they fit together.
Since we’re using ECS, we’ll first need to create a task definition:
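As a rough sketch of what such a task definition might contain, here it is as the Python dict you could pass to boto3’s `register_task_definition`. The family name, image, CPU and memory sizes, HTTP port 4000, and distribution port 9000 are all assumptions for illustration. The distribution port in particular assumes the release pins it with the kernel options `inet_dist_listen_min` and `inet_dist_listen_max`.

```python
def build_task_definition(image, family="elixir-cluster", dist_port=9000):
    """Build a Fargate task definition dict (all values are illustrative)."""
    return {
        "family": family,
        "networkMode": "awsvpc",  # gives each task its own ENI and IP
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "256",
        "memory": "512",
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "essential": True,
                "portMappings": [
                    {"containerPort": 4000, "protocol": "tcp"},       # assumed HTTP port
                    {"containerPort": 4369, "protocol": "tcp"},       # EPMD default port
                    {"containerPort": dist_port, "protocol": "tcp"},  # assumed pinned distribution port
                ],
            }
        ],
    }


task_definition = build_task_definition("example/elixir-app:latest")

# With AWS credentials configured, this could be registered with:
#   import boto3
#   boto3.client("ecs").register_task_definition(**task_definition)
```

The security group attached to the tasks would also need to allow traffic between tasks on the EPMD and distribution ports.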
Then create a new service in your ECS cluster. Make it a FARGATE service and give it a few subnets in a VPC. Make sure that a new ENI is created for each task and that service discovery is enabled for the service based on IP (since each ENI will get its own IP address).
Next, we need to have a way to start EPMD in a running server, give the server a node name, and connect it to another node (or nodes). I did this directly in Elixir code with a Phoenix controller:
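If you would rather script this step than click through the console, the request might look roughly like the following, built as keyword arguments for boto3’s `create_service`. The cluster name, service name, subnet and security group IDs, and the Cloud Map registry ARN are placeholders, not values from my setup.

```python
def build_service_request(cluster, task_definition, subnets,
                          security_groups, registry_arn, desired_count=2):
    """Build kwargs for ecs.create_service (all values are illustrative)."""
    return {
        "cluster": cluster,
        "serviceName": "elixir-cluster",
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "networkConfiguration": {
            # awsvpc networking is what gives each task its own ENI.
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        },
        # Registers each task's IP with a Cloud Map service discovery service.
        "serviceRegistries": [{"registryArn": registry_arn}],
    }


service_request = build_service_request(
    cluster="example-cluster",
    task_definition="elixir-cluster",
    subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
    security_groups=["sg-cccc3333"],
    registry_arn="arn:aws:servicediscovery:us-east-1:123456789012:service/srv-example",
)

# With AWS credentials configured, this could be sent with:
#   import boto3
#   boto3.client("ecs").create_service(**service_request)
```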
When a new task launches, we can now:
- start EPMD
- start the node
- connect the node to the cluster
Finally, we need something to do those three things (as they don’t happen automatically when a task is launched). I accomplished this with a small Python script that runs on AWS Lambda every few minutes. Here is the script:
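A minimal sketch of what such a script could look like follows. Treat it as an illustration under assumptions rather than a drop-in implementation: the namespace and service names, the `app@<ip>` node naming scheme, port 4000, and the `/cluster/connect` route with its JSON body are all hypothetical and would need to match your own controller. The one AWS call it relies on is Cloud Map’s `discover_instances`, which returns each registered task’s IP in the `AWS_INSTANCE_IPV4` attribute.

```python
import json
import urllib.request

NAMESPACE = "example.local"        # hypothetical Cloud Map namespace
SERVICE = "elixir-cluster"         # hypothetical service discovery name
APP_PORT = 4000                    # assumed HTTP port of the Phoenix app
CONNECT_PATH = "/cluster/connect"  # hypothetical controller route


def node_name(ip, basename="app"):
    """Node name in the name@host form used by distributed Erlang."""
    return f"{basename}@{ip}"


def build_requests(ips):
    """For each task IP, return the URL to call and a body listing its peers."""
    requests = []
    for ip in ips:
        peers = [node_name(other) for other in ips if other != ip]
        url = f"http://{ip}:{APP_PORT}{CONNECT_PATH}"
        requests.append((url, {"node": node_name(ip), "peers": peers}))
    return requests


def discover_ips():
    """Look up the IPs of running tasks through service discovery."""
    import boto3  # imported here so the helpers above run without AWS access

    client = boto3.client("servicediscovery")
    response = client.discover_instances(
        NamespaceName=NAMESPACE, ServiceName=SERVICE
    )
    return [i["Attributes"]["AWS_INSTANCE_IPV4"] for i in response["Instances"]]


def handler(event, context):
    """Lambda entry point: ask every task to start its node and join its peers."""
    for url, body in build_requests(discover_ips()):
        request = urllib.request.Request(
            url,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(request, timeout=5)
```

For this to work, the Lambda has to run inside the same VPC as the tasks so it can reach their private IPs, and its role needs permission to call `servicediscovery:DiscoverInstances`. The controller endpoint should also be safe to call repeatedly, since the script runs on a schedule.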
And that’s it!
You now have an Elixir cluster that will dynamically register and deregister nodes as you create or de-provision them.