Last month, at the ElixirConf EU 2019, John Mertens, Principal Engineer at Change.org conducted a session - Lessons From Our First Trillion Messages with Flow for developers interested in using Elixir for building data pipelines in the real-world system.
For many Elixir converts, the attraction of Elixir is rooted in the promise of the BEAM concurrency model. The Flow library has made it easy to build concurrent data pipelines utilizing the BEAM (Originally BEAM was short for Bogdan's Erlang Abstract Machine, named after Bogumil "Bogdan" Hausman, who wrote the original version, but the name may also be referred to as Björn's Erlang Abstract Machine, after Björn Gustavsson, who wrote and maintains the current version). The problem is, that while the docs are great, there are not many resources on running Flow-based systems in production. In his talk, John shares some lessons his team learned from processing their first trillion messages through Flow.
Change.org is a platform for social change where people from all over the world come to start movements on all topics and of all sizes. Technologically, change.org is primarily built in Ruby and JavaScript but they started using Elixir in early 2018 to
build a high volume mission-critical data processing pipeline. They used Elixir for building this new system because of its library, Flow. Flow is a library for computational parallel flows in Elixir. It is built on top of GenStage. GenStage is “specification and computational flow for Elixir”, meaning it provides a way for developers to define a pipeline of work to be carried out by independent steps (or stages) in separate processes. Flow allows developers to express computations on collections, similar to the Enum and Stream modules, although computations will be executed in parallel using multiple GenStages.
At Change.org, the developers built some proofs of concept and a few different languages and put them against each other with the two main criteria being performance and developer happiness. Elixir came out as the clear winner. Whenever on change.org an event gets added to a queue, their elixir system pulls these messages off the queue, then prep and transforms them to some business
logic and generate some side effects. Next, depending on a few parameters, the messages are either passed on to another system, discarded or retried. So far things have gone smoothly for them, which brought John to discuss lessons from processing their first trillion messages with Flow.
Flow and GenStage both are great libraries which provide a few game-changing features by default. The first being parallelism. Parallelism is beneficial for large-scale data processing pipelines and Elixir flows abstractions make utilizing Parallelism easier. It is as easy as writing code that looks essentially like standard Elixir pipeline but that utilizes all of your CPU cores.
The second feature of Flow is Backpressure. GenStage specifies how Elixir processes should communicate with back-pressure. Simply put, Backpressure is when your system asks for more data to process instead of data being pushed on to it. With Flow your data processing is in charge of requesting more events. This means that if your process is dependent on some other service and that service becomes slow your whole flow just slows down accordingly and no service gets overloaded with requests, making the whole system stay up.
The next lesson is on how to set up your code to take advantage of Flow. These organizational tactics help Change.org keep their Flow system manageable in practice.
The golden rule according to John, is to keep your flow simple. Start simple and then increase the complexity depending on the constraints of your system. He discusses a quote from the Flow Docs, which states that:
[box type="shadow" align="" class="" width=""]If you can solve a problem without using partition at all, that is preferred. Those are typically called embarrassingly parallel problems.[/box]
If you can shape your problem into an embarrassingly parallel problem, he says, flow can really shine.
He also advises that developers should know their code and understand their systems. He then proceeds to give an example of how SQS is used in Flow. Amazon SQS (Simple Queue System) is a message-queuing system (also used at Change. org) that allows you to write distributed applications by exposing a message pipeline that can be processed in the background by workers. It’s two main features are the visibility window and acknowledgments. In acknowledgments, when you pull a message off a queue you have a set amount of time to acknowledge that you've received and processed that message and that amount of time is called the visibility window and that's configurable. If you don't acknowledge the message within the visibility window, it goes back into the queue. If a message is pulled and not acknowledged, a configured number of times then it is either discarded or sent to a dead letter queue. He then proceeds to use an example of a Flow they use in production.
You should also use a consistent data structure or a token throughout the data pipeline. The data structure most essential to their flow at Change.org is message struct - %Message{}. When a message comes in from SQS, they create a message struct based on it. The consistency of having the same data structure at every step and the flow is how they can keep their system simple. He then explains an example code on how they can handle different types of data while keeping the flow simple.
The next organizational tactic that helps Change.org employ to keep their Flow system manageable in practice is to isolate the side effects. Side effects are mutations; if something goes wrong in a mutation you need to be able to roll it back. In the spirit of keeping the flow simple, at Change.org, they batch all the side-effects together and put them at the ends so that a nothing gets lost if they need to roll it back. However, there are certain cases where you can’t put all side effects together and need a different strategy. These cases can be handled using Flow Sagas. Sagas pattern is a way to handle long live transactions providing rollback instructions for each step along the way so in case it goes bad it can just run that function. There is also an elixir implementation of sagas called Sage.
How you optimize your Flow is dependent upon the shape of your problem. This means tailoring the Flow to your own use case to squeeze all the throughput.
However, there are three things which you can do to shape your Flow. This includes measuring flow performance, what are the things that we can actually do to tune it and then how can we help outside of the Flow.
Apart from the three main lessons on data processing through Flow, John also mentions a few others, namely
Finally, John gave a glimpse of Broadway from change.org’s codebase. Broadway allows developers to build concurrent and multi-stage data ingestion and data processing pipelines with Elixir. It takes the burden of defining concurrent GenStage topologies and provides a simple configuration API that automatically defines concurrent producers, concurrent processing, batch handling, and more, leading to both time and cost efficient ingestion and processing of data. Some of its features include back-pressure
automatic acknowledgments at the end of the pipeline, batching, automatic restarts in case of failures, graceful shutdown, built-in testing, and partitioning. José Valim’s keynote at the ElixirConf2019 also talked about streamlining data processing pipelines using Broadway.
You can watch the full video of John Mertens’ talk here. John is the principal scientist at Change.org using Elixir to empower social action in his organization.
Why Ruby developers like Elixir
Introducing Mint, a new HTTP client for Elixir
Developer community mourns the loss of Joe Armstrong, co-creator of Erlang