Erlang in The Land of Lisp

Details & Slides

This talk is dedicated to lessons we’ve learned while designing, developing, and deploying to production our very first Erlang project. The audience of this talk will learn about differences between Clojure and Erlang, both at the linguistic level as well as, even more importantly, at the level of underlying virtual machines. I’m going to discuss how Erlang challenges our methods of building systems in Clojure. I’ll use our new Erlang-based project as a source of concrete differences.

Slides

Machine transcript

Good morning, ladies and gentlemen. Very happy that so many of you decided to wake up early today. My name is Jan Stępień. I work as a software developer on the back end side of things, mostly with Clojure.

I work at stylefruits. We are a mobile fashion hub for Europe. Our main product is our mobile application, which is powered by a whole network of services, which are mostly running on top of Clojure. One problem which we encountered as we were growing our mobile application is that it took more and more requests to our REST API to deliver all the data. Some of those requests were dependent on each other. Some of them required multiple round trips. We've heard about the problem of multiple round trips during the keynote yesterday. We wanted to solve this problem.

One idea we had in mind was to have coarse-grained endpoints, which would deliver all the data necessary to render an entire view, or deliver entire functionality, in a single go. We didn't like the idea because we valued our small, narrow, focused endpoints. We were looking for a way to combine them together somehow. This led us to another idea.

Instead, we wanted to build some sort of proxy in the middle, which would accept a batch of requests, all expressed in a single JSON document, and then execute them one by one against the underlying API. If you think about the problem, it's not that complex. On one hand, you have some HTTP server, which accepts those batches. Then you have some JSON processing in the middle. And finally, an HTTP client, which executes those requests as efficiently as possible against the backend API.

And this simplicity encouraged us to experiment. We were running all our backend on Clojure. And we decided to try out something completely different. We heard a lot of good things about Erlang. We were encouraged by success stories of various large companies. We were intrigued by impressive stories about fault tolerance, concurrency, and so on. We wanted to see how it works in practice.

Also, Zef recommended picking boring technology, and you can't get much more boring than Erlang. So we chose the language and jumped head first into the project, which we codenamed Rib: Requests and Batches.

Allow me to start with a very brief introduction to the language itself. Erlang looks a bit different. It's not bad. It's just kind of nothing like you'd expect to see. But still, there are similarities.

You have files which have some functions defined at the main top level. Those functions are private by default and have to be explicitly made public by being exported. You have some pattern matching, list comprehensions. Pretty much the same thing as anywhere else. Punctuation can get tricky, but this is nothing you wouldn't learn after the first four hours of the language.
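To make that concrete, here is a minimal sketch of an Erlang module; this is an illustration of mine, not a slide from the talk:

```erlang
-module(example).
-export([squares/1]).

%% Exported, therefore public.
squares(Numbers) ->
    [square(N) || N <- Numbers].

%% Not exported, therefore private to this module.
square(N) when is_number(N) ->
    N * N.
```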

What is far more interesting is the virtual machine. The language is just a small, superficial layer on top. Everything interesting is happening beneath. The Erlang VM is more of an operating system. Your code is executed in processes, separate concurrent entities. They cannot share anything with each other. The only thing they can do is send asynchronous messages, which are immutable, and that's all.

One interesting thing you can do with those processes is to link them. Creating a link creates this symmetric relationship between them. When one of the linked processes dies, the other one is killed as well. So if P3 crashed, P4, which was linked, crashes too, which causes P1 to die. P2, which was independent and not connected to this graph, continues to operate as if nothing happened. This might sound crazy at the beginning, but believe me, this is outstanding in practice, as I will show later during the talk.
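A tiny sketch of what that looks like (my example, not from the talk): a process that links to a crashing child is killed by the exit signal before it can finish its own work.

```erlang
%% P1 spawns and links to a child that exits abnormally; the exit signal
%% kills P1 as well, so the final line is never printed.
demo() ->
    spawn(fun() ->
        spawn_link(fun() -> exit(crash) end),
        timer:sleep(1000),
        io:format("never reached~n")
    end).
```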

Another interesting difference is how concurrency works. On the JVM, for instance, every thread is backed by a single kernel operating system thread. In Erlang, that's not the case. The Erlang VM will start as many schedulers as you have CPUs available, and then it will schedule each of your processes that are ready to be executed on those schedulers.

A process will be executed as long as it has actual work to do. The moment it waits for any I/O operation—read from a file, write to a socket, wait for a message, anything else—it's being de-scheduled. Another process kicks in, and your process will just wait there hanging until its I/O is completed successfully or not.

You might recall Leandro's talk about his web server, where he was talking about the complexity of asynchronous I/O. Notice that this system has built asynchronous I/O behind a completely synchronous, simple facade. It's really outstanding.

Additionally, each process is only allowed to execute a certain number of instructions, also called reductions. The moment it hits this limit, it is de-scheduled and another process kicks in, preventing one process from starving the rest.

Another interesting difference is how memory is organized. Each process has its separate heap. As a result, if a process is memory hungry and allocates a lot of memory, and it needs to be garbage collected, the garbage collector can focus on a single, small heap. And all the rest of the virtual machine can just work and operate as if nothing happened, which allows you to build pauseless systems in Erlang. It's just incredible.

OK, having a rough understanding of Erlang and seeing the differences between those virtual machines, we can move on to our project. And well, we moved on to setting up our environment, getting some dependencies, and trying to implement a minimal, viable prototype of Rib.

We quickly ran into Rebar, which is a build and dependency management tool, allowing you to get all your dependencies together. Look, it's very interesting that you can just point to a Git repository and get your dependencies directly from there. And it's possible because, in the Erlang ecosystem, people follow a strict way of organizing their projects.
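For illustration, a dependency entry in rebar.config might look like the following sketch (rebar 2 style); the repository URL and tag are placeholders, not something from the talk:

```erlang
%% rebar.config (sketch): fetch a dependency straight from a Git repository.
{deps, [
    {cowboy, ".*",
     {git, "https://github.com/ninenines/cowboy.git", {tag, "1.0.0"}}}
]}.
```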

As long as you have your directories following a certain system, it all works together very nicely. Erlang comes with a shell as well, so you can experiment in the REPL, reload your code dynamically on the fly, and it's a very pleasant overall development experience.
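Roughly, a conventional project of that era follows the standard OTP layout; this is a generic sketch, not Rib's actual tree:

```
myapp/
├── src/      .erl modules and the .app.src resource file
├── include/  .hrl header files
├── ebin/     compiled .beam files
├── priv/     non-code resources
└── deps/     dependencies fetched by Rebar
```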

Once we got the minimal, viable prototype done, we moved on to the most interesting thing, that is trying to make it as concurrent as possible. And by that, I mean once we have a batch of requests, finding all the ones which are independent from each other and executing them against the back-end API as efficiently, as concurrently as possible.

Coming from a Clojure background, the standard way to do it would be, of course, a pmap function. We failed to find one in the standard library, so we decided to write our own. First mistake. We learned a lot during the process. Let me guide you through. This was our first approach, our first attempt. Let's go step by step.

Our pmap takes a function and a list. We start by taking a reference to the current process, and then we define the workers using the following list comprehension. For every X in the list, we spawn a new process. And in this process, we run the following closure. We create a two-element tuple, a pair. This pair has the reference to the current worker process and the result of applying the function to X. And this pair is then sent to the parent process.

In the last expression of the function, you see another list comprehension where we selectively receive messages coming in to the parent process. Note that inside the receive block, Worker is not a fresh variable, so to speak; it's already bound to a specific worker, allowing us to receive messages in a given order. As a result, the output list has the same order as the input list.
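A minimal sketch of this first version, reconstructed from the description above; the names are mine, not necessarily those of the original code:

```erlang
%% First attempt: spawn one worker per element and selectively receive
%% the results in the order of the input list.
pmap(F, Xs) ->
    Parent = self(),
    Workers = [spawn(fun() -> Parent ! {self(), F(X)} end) || X <- Xs],
    [receive {Worker, Result} -> Result end || Worker <- Workers].
```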

There is at least one serious problem with this function. If anything in the worker process goes wrong and it crashes, we will never learn; we will wait forever in the receive block. To solve it, we change spawn into spawn_link. I told you about links before. A link creates this interesting relation. Now when a worker process crashes, the parent will be killed as well. And all the other workers will be killed too, because they are linked as well, resulting in the termination of the entire concurrent computation.

If I added one extra line of configuration at the beginning, instead of killing the parent, the death of a worker would just send a message to it upon which you can react. Now let's add one more thing. We can add a timeout in the receive block, allowing us to wait for one second for each worker. And if the worker doesn't send us a message within one second, we crash with timeout being our explanation of the problem.
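Here is a sketch of the final version as described: spawn_link propagates worker crashes, and the one-second timeout crashes the parent if a worker never replies. Again, this is reconstructed from the description, not the verbatim Rib code:

```erlang
%% Final version: linked workers plus a per-worker timeout.
pmap(F, Xs) ->
    Parent = self(),
    Workers = [spawn_link(fun() -> Parent ! {self(), F(X)} end) || X <- Xs],
    [receive
         {Worker, Result} -> Result
     after 1000 ->
         exit(timeout)   %% crash with timeout as the explanation
     end || Worker <- Workers].
```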

This function does what we want it to. It behaves as we want it to. And if you see any problems with it, let me know, because we have it in production. And yes, it works. But there is one problem, and it's mostly stylistic. The send-message operator, the bang. The bang is a low-level operator. It's as if you were emitting your own JVM bytecode while writing Java, or using inline assembly while writing C. This is a valid solution for certain problems, but it's typically too low level.

Normally, you need higher abstraction. And this brings me to the standard library. OTP, also known as Open Telecoms Platform—it goes also by another name, but it doesn't matter now. It's a very interesting set of tools. It comes with all the things you'd expect in a standard library, that is HTTP client, SSL, various other things.

But it also comes with, so to speak, generic behaviors. Those are years of experience of Erlang programmers solving distributed problems at scale, available as your standard library. It's just incredibly powerful. And this is where a lot of primitives, such as sending and receiving messages and spawning processes, are nicely encapsulated. Allow me to share two examples of how we applied parts of OTP in practice. The first use case is our connection killer. Our HTTP client in Rib maintained open connections to our backend API for too long, indefinitely in fact. We wanted to periodically recycle those connections. We failed to configure the HTTP client properly. So instead, we went with brute force.

We wrote this small, simple module whose only responsibility, the only thing you can do with it, is to start it. It will spin up a new process in the background. And in this process, every 30 seconds, we will send a request to the backend API, using the same connection pool as all the other Rib workers, with a Connection: close header. This header will force the server to close the connection to us after delivering the response. As a result, we achieve exactly the behavior we wanted: periodically, some connection in the pool, chosen at random, gets closed.
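To give a feel for it, here is a hypothetical sketch of such a connection killer. It uses OTP's built-in httpc client and placeholder names; Rib's actual module and HTTP client may differ:

```erlang
-module(connection_killer).
-export([start_link/0]).

%% The only thing you can do with this module is start it.
start_link() ->
    {ok, spawn_link(fun loop/0)}.

loop() ->
    timer:sleep(30000),
    %% "Connection: close" makes the server close this pooled connection
    %% after delivering the response. The URL is a placeholder, and the
    %% inets application is assumed to be started.
    httpc:request(get,
                  {"http://backend.example/ping", [{"connection", "close"}]},
                  [], []),
    loop().
```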

Notice one interesting thing. This is the happiest code you've ever seen. Everything works. There are no errors, no problems, no catching of any exceptions. Just everything is OK. How does it work? I'm sure you can recall Oscar's slide from yesterday when he was showing a piece of Go with `err != nil`, `err != nil`, `err !=`—every second line, right? Compare and contrast, right?

Error handling is elsewhere. Error handling is taken care of by an OTP supervisor. Another module, which we have in this particular use case, is a supervisor, which is responsible just for spinning up the connection killer, waiting until it dies, logging that it died, restarting it. That's it. And this is all the code. Give or take those two dots in the middle, right? There's a bit more configuration. It's outstanding.
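For reference, a one-child supervisor along these lines might look like the following sketch; the module names and restart parameters are assumptions, not the actual Rib configuration:

```erlang
-module(connection_killer_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Restart the connection killer whenever it dies, allowing up to
    %% 5 restarts within 10 seconds.
    {ok, {{one_for_one, 5, 10},
          [{connection_killer,
            {connection_killer, start_link, []},
            permanent, 5000, worker, [connection_killer]}]}}.
```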

Our error handling is completely decoupled from the actual logic. They are not only in separate modules; they are in separate processes, right? This is decoupling. It's fascinating. And this comes as part of the standard library. Supervisors can also monitor more than one process. And they can monitor other supervisors, resulting in deeply nested supervision trees, if your application justifies such a complex architecture.

Let's move on to the next use case, a request limiter. Rib cannot tell upfront how many requests it will have to execute upon receiving your batch. It has to dynamically count the requests it executes against the backend. And when it hits a limit, it has to terminate the whole operation and send back a message saying, sorry, you exceeded your limit of requests.

Think about how you would solve this problem in your favorite programming language. The solution we found in Erlang surprised us, to say the least. This is all there is. The Rib limiter is a gen_server, a generic server. A generic server has nothing to do with things which speak HTTP or even TCP. It is just a separate process which has its internal state and can react to asynchronous and synchronous messages.

When we get a new request from our client, we start an instance of this limiter using the start_link function, passing the maximum number of requests which are allowed to be executed to satisfy this request. It will call the init function internally to extract the N from our config param and set this as the initial state of the server: N, just a number.

Then the handle to the server is given to all the workers which are participating in satisfying the request. Whenever a child worker executes a request, or rather is about to send N extra requests, it calls this subtract function, which takes, as its arguments, a handle to the limiter and the number of requests to be executed.

There, we call gen_server:cast, which sends an asynchronous message to the server. We handle this message in handle_cast, which destructures the message and extracts the number of requests to be executed. It also gets the current state as its second argument. It calculates the new state as the difference between how many requests are still allowed to be executed and how many are about to run.

And then, if this difference is non-negative, everything's fine, move on. We return {noreply, NewState}, which instructs the gen_server that we're not sending a reply, and that the new state is NewState. Things get really interesting in the second part of this if. Namely, if the new state is negative, we return a different tuple.

{stop, limit_exceeded, NewState}. stop instructs the gen_server to terminate our server and log which state the server was in, which message it received, and what the resulting new state was. And then it crashes. What happens next? It's linked to the central server, which is managing the computation of the entire request, so that crashes too, and it in turn was linked to all the other processes executing the concurrent work.
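Putting the pieces together, a hypothetical sketch of such a limiter might look like this; the module, function, and message names are assumptions rather than the actual Rib code:

```erlang
-module(rib_limiter).
-behaviour(gen_server).
-export([start_link/1, subtract/2]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Start a limiter allowing at most Max requests for this batch.
start_link(Max) ->
    gen_server:start_link(?MODULE, {max, Max}, []).

%% Called by a worker that is about to execute N more requests.
subtract(Limiter, N) ->
    gen_server:cast(Limiter, {subtract, N}).

%% The remaining budget is the whole server state: just a number.
init({max, N}) ->
    {ok, N}.

handle_cast({subtract, N}, State) ->
    NewState = State - N,
    if
        NewState >= 0 -> {noreply, NewState};
        true          -> {stop, limit_exceeded, NewState}
    end.

%% Synchronous calls are not used by this server.
handle_call(_Request, _From, State) ->
    {reply, ok, State}.
```

Because limit_exceeded is not a normal exit reason, the gen_server logs the state, the message, and the resulting new state, and the exit signal then propagates along the links, exactly as described above.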

They all crash, right? Like, everything's on fire. It's like a domino effect. Once this domino reaches the outermost handler on the HTTP server side, we translate it to a clean error message saying, sorry, you hit your limit. Try again with a smaller number of requests.

I can't overstate how awesome this is, right? Notice that we are completely relying on the Erlang VM and on the standard library to take care of terminating distributed computation by introducing minimal dependency on the rib limiter itself. All the child processes, all the workers just have to have a handle to the limiter and be able to count how many requests they have to execute and just let it know. That's it. They don't need to react to the condition. They will be killed when they hit the condition, if any of them does.

Additionally, notice that all the low-level details, such as sending messages, receiving messages, creating processes, are all handled internally by gen_server. We don't need to go that low level. This is plumbing. Those are details. We operate here at a higher level of the standard library.

And finally, notice that we just outsourced the control flow to the Erlang standard library. We don't need to maintain our own state, the state of our server. The state is given to us as an argument, as an immutable argument on every invocation. And we're expected to return the new state of the server on every invocation. The entire control flow is elsewhere. We have just pure logic.

Erlang is something completely different. Coming from a Clojure background, this was a shock, because my code is suddenly the happiest code in the world. Everything works. Everything returns OK, at least I expect it to. And if it doesn't, it will crash. That's fine. And the handling of this error happens completely elsewhere, in a dedicated error handling module, let's call it. It just changes the way you build systems, changes the way you think about your problems, when you can decouple your actual logic from the exceptional cases so much.

This wouldn't be possible without those rock-solid foundations, without the virtual machine giving you very strong guarantees, allowing you to create links between life cycles of different processes, and OTP, the standard library, which builds on this amazing foundation and allows you to create incredible things. Notice that the language itself, this Erlang layer, is just a small thing. Everything interesting is happening below.

So if, for some reason, the syntax scares you a bit, which I can understand, no worries. There is Elixir. There are many other things which will allow you to use all those outstanding things happening below to your advantage.

I'm very happy to say that Rib is open source. It's Apache-licensed. So if you're interested, just clone it. I would love to get your feedback. And I would love to see you use it and let me know why it doesn't work for you. Patches are even more welcome. A patch is like a love letter in the open source community. So get it, try it, and let me know.

As a closing thought, this just changes the way you think, changes the way you approach your problems. Organizing your system not in classes, not in functions, but in processes with different responsibilities was a very interesting experience for us. And I really encourage you to give it a try, check out a language which runs on the Erlang VM, and just see how it makes you a better programmer.

This is all I've got. You've been a great audience. Thank you all so much.