Concurrent Patterns and Best Practices: Build scalable apps in Java with multithreading, synchronization and functional programming patterns


Concurrency – An Introduction

What are concurrency and parallelism? Why should we study them? In this chapter, we will look at the many aspects of the concurrent programming world. The chapter starts with a brief overview of parallel programming and why we need it. We cover ground pretty fast here, touching upon the basic concepts.

Two major forces drive concurrent programming: huge data sizes and the need for fault tolerance. As we go through this chapter, some of our examples will touch upon clustered computing patterns, such as MapReduce. Application scaling is an extremely important concept for today's developer. We will look at how concurrency helps applications scale. Horizontal scaling (https://stackoverflow.com/questions/11707879/difference-between-scaling-horizontally-and-vertically-for-databases) is the magic behind today's massively parallel software systems.

Concurrency entails communication between the concurrent entities. We will look at two primary concurrency models: message passing and shared memory. We will describe the message passing model using a UNIX shell pipeline. We will then describe the shared memory model and show how explicit synchronization creates so many problems.   

A design pattern is a solution to a design problem in context. Knowledge of the catalog of patterns helps us to come up with a good design for specific problems. This book explains the common concurrency design patterns.

We will wrap up the chapter by looking at some alternative ways of achieving concurrency, namely the actor paradigm and software transactional memory.  

In this chapter, we will cover the following topics:

  • Concurrency
  • Message passing model  
  • Shared memory and shared state model
  • Of patterns and paradigms

Concurrency in a breeze

We start with a simple definition. When things happen at the same time, we say that things are happening concurrently. As far as this book is concerned, whenever parts of an executable program run at the same time, we are dealing with concurrent programming. We use the term parallel programming as a synonym for concurrent programming.  

The world is full of concurrent occurrences. Let's look at a real-life example. Say that there are a certain number of cars driving on a multilane highway. In the same lane, though, cars need to follow the cars that are already ahead of them. A road lane, in this case, is a resource to be shared.

When a toll plaza is built, however, things change. Each car stops in its lane for a minute or two to pay the toll and collect a receipt. While the toll attendant is engaged with the car, other cars behind it need to queue up and wait. However, a toll plaza has more than one payment gate. There are attendants at each gate, attending to different cars at the same time. If there are three attendants, each serving one gate, then three cars could pay the toll at the same point in time; that is, they can get serviced in parallel, as shown in the following diagram:

Note that the cars queuing up at the same booth get serviced in sequence. At any given time, a toll attendant can service only one car, so others in the queue need to wait for their turn. 

It would be really odd to see a toll booth with just one gate! People wouldn't be served in parallel. Strictly sequential processing of toll charges would make life unbearable for the frequent traveler.

Even when there are multiple gates, an abnormally large influx of cars (say, during the holidays) makes each gate a bottleneck; there are far fewer resources for servicing the workload.

The push for concurrency

Let's come back to the software world. You want to listen to some music at the same time that you are writing an article. Isn't that a basic need? Oh yes, and your mail program should also be running so that you get important emails in time. It is difficult to imagine life if these programs don't run in parallel.

As time goes by, software is becoming bigger and demand for faster CPUs is ever increasing; for example, database transactions/second are increasing in number. The data processing demand is beyond the capabilities of any single machine. So a divide and conquer strategy is applied: many machines work concurrently on different data partitions. 

Another problem is that chip manufacturers are hitting limits on how fast they can make chips go! Improving the chip to make the CPU faster has inherent limitations. See http://www.gotw.ca/publications/concurrency-ddj.htm for a lucid explanation of this problem. 

Today's big data systems are processing trillions of messages per second, and all using commodity hardware (that is, the hardware you and I are using for our day-to-day development)—nothing fancy, like a supercomputer.

The rise of the cloud has put provisioning power in the hands of almost everyone. You don't need to spend too much upfront to try out new ideas—just rent the processing infrastructure on the cloud to try out the implementation of the idea. The following diagram shows both scaling approaches:

The central infrastructure design themes are horizontal versus vertical scaling. Horizontal scaling essentially implies the use of a distributed concurrency pattern; it's cost effective, and a prominent idea in the big data world. For example, NoSQL databases (such as Cassandra), analytics processing systems (such as Apache Spark), and message brokers (such as Apache Kafka) all use horizontal scaling, and that means distributed and concurrent processing.

On the other hand, installing more memory or processing power in a single computer is a good example of vertical scaling. See https://www.g2techgroup.com/horizontal-vs-vertical-scaling-which-is-right-for-your-app/ for a comparison between the two scaling approaches.

We will look at two common concurrency themes for horizontally scaled systems: MapReduce and fault tolerance.   

The MapReduce pattern

The MapReduce pattern is an example of a common case where concurrency is needed. The following diagram shows a word frequency counter; given a huge text stream of trillions of words, we need to see how many times every word occurs in the text. The algorithm is super simple: we keep the running count of each word in a hash table, with the word as the key and the counter as the value. The hash table allows us to quickly look up the next word and increment the associated value (counter). 

Given the size of the input text, one single node does not have the memory for the entire hash table! Concurrency provides a solution, using the MapReduce pattern, as shown in the following diagram:

The solution is divide and conquer: we maintain a distributed hash table and run the same algorithm, suitably adapted for a cluster.

The master node reads—parses—the text and pushes it to a set of slave processing nodes. The idea is to distribute the text in such a way that one word is processed by one slave node only. For example, given three processing nodes, as shown in the preceding diagram, we would divide the words range-wise: words starting with the characters {a..j} go to node 1, {k..r} to node 2, and the rest—{s..z}—to node 3. This is the map part (distribution of work).

Once the stream is exhausted, each slave node sends its frequency result back to the master, which prints the result. 

The slave nodes are all doing the same processing concurrently. Note that the algorithm would run faster if we add more slave nodes; that is, if we scale it horizontally.
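
To make this concrete, here is a minimal, single-machine sketch of the idea in Java: the words are partitioned by their first character, each worker keeps its own hash table, and the master merges the partial counts at the end. The class and variable names are illustrative only; a real cluster would distribute the partitions over the network.

import java.util.*;
import java.util.concurrent.*;

public class WordCountSketch {
    public static void main(String[] args) throws Exception {
        List<String> words = Arrays.asList("apple", "kiwi", "zebra", "apple", "storm", "kiwi");

        // Three "slave" workers, partitioned by first character: {a..j}, {k..r}, {s..z}.
        int workers = 3;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < workers; i++) partitions.add(new ArrayList<>());
        for (String w : words) {
            char c = Character.toLowerCase(w.charAt(0));
            int idx = (c <= 'j') ? 0 : (c <= 'r') ? 1 : 2;
            partitions.get(idx).add(w);                        // map: route each word to one worker only
        }

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Map<String, Integer>>> results = new ArrayList<>();
        for (List<String> part : partitions) {
            results.add(pool.submit(() -> {
                Map<String, Integer> counts = new HashMap<>(); // each worker keeps its own hash table
                for (String w : part) counts.merge(w, 1, Integer::sum);
                return counts;
            }));
        }

        Map<String, Integer> merged = new HashMap<>();         // reduce: the master merges partial results
        for (Future<Map<String, Integer>> f : results) merged.putAll(f.get());
        pool.shutdown();
        System.out.println(merged);                            // {apple=2, kiwi=2, storm=1, zebra=1}
    }
}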

Fault tolerance

Another common approach is to build in intentional redundancy to provide fault tolerance; for example, big data processing systems, such as Cassandra, Kafka, and ZooKeeper, can't afford to go down completely. The following diagram shows how concurrently replicating the input stream protects against any one slave node going down. This pattern is commonly used by Apache Kafka, Apache Cassandra, and many other systems:

The right side of the diagram shows redundant machines on which a data stream is replicated.

In case any one node goes down (hardware failure), other redundant nodes take its place, thus ensuring that the system as a whole is never down.

Time sharing 

In the real world, we also perform a number of tasks concurrently. We attend to a task and then, if another task also needs our attention, we switch to it, attend to it for a while, and then go back to the first task. Let's look at a real-world example of how an office receptionist deals with their tasks.

When you visit any office, there is usually a receptionist who receives you and asks for your business. Say that, just as they are asking about who you need to meet, the office buzzer rings. They take the call, say "hello," speak for a while, and then ask you to wait for a second. Once the call is finished, they resume talking to you. These actions are shown in the following diagram:

The receptionist is sharing their time among all the parties interested in talking to them. They work in a way that ensures everyone gets a slice of their time.

Now, keeping the toll plaza and the receptionist in mind, replace the toll operators with CPU cores and the cars with tasks, and you get a fairly good mental model of today's concurrent hardware. If we increase the number of toll operators from three to six, we will increase the number of cars getting serviced in parallel (at the exact same time) to six. A pleasant side effect is that the queued cars will also spread out, and every car will get serviced faster. The same holds true when we execute a concurrent program. Hence, things are faster overall.

Just as the receptionist does multiple things at the same time, such as sharing their time between visitors, a CPU shares time with processes (running programs). This is how concurrency is supported on a single CPU.

Two models for concurrent programming

Concurrency implies that multiple tasks are happening in parallel to achieve a common goal. Just as with a group of people working together, the concurrently executing entities need to communicate and coordinate with each other.

For example, let us say that we present the previously mentioned word frequency counter via a UI functionality. A user uploads a huge file and clicks the start button, resulting in a long-running MapReduce job. We need to distribute the work among the slaves. To send the workload, we need a way to communicate with them. The following diagram shows the various streams of communications that are required in this system:

If the user changes their mind and aborts the job, we need to communicate the stop message to each concurrent entity, as any further processing is futile.

There are two prominent models for concurrent communications: message passing and shared memory. The preceding diagram shows a message passing model.   

We will first discuss the message passing model, using the celebrated UNIX shell pipeline as an example. Next, we will see the shared memory approach in depth and the problems that are associated with it.            

The message passing model

Before we dive into the details of the message passing model, we will look at a bit of basic terminology. 

When an executable program runs, it is a process. The shell looks up the executable, talks to the operating system (OS) using system calls, and thereby creates a child process. The OS also allocates memory and resources, such as file descriptors. So, for example, when you run the find command (the executable lives at /usr/bin/find), it becomes a child process whose parent process is the shell, as shown in the following diagram:

In case you don't have the pstree command, you could try the ptree command instead. The ps --forest command will also work to show you the tree of processes. 

Here is a UNIX shell command recursively searching a directory tree for HTML files containing a word:

 % find . -type f -name '*.html' | xargs egrep -w Mon /dev/null
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017

We see a shell pipeline in action here. The find command searches the directory tree rooted at the current directory. It searches for all files with the .html extension and outputs the filenames to standard output. The shell creates a process from the find command and another process for the xargs command. An active (running) program is called a process. The shell also arranges the output of the find command to go to the input of the xargs command via a pipe.

The find process is a producer here. The list of files it produces is consumed by the xargs command.  xargs collects a bunch of filenames and invokes egrep on them. Lastly, the output appears in the console. It is important to note that both the processes are running concurrently, as shown in the following diagram: 

Both these processes are collaborating with each other, so our goal of recursively searching the directory tree is achieved. One process is producing the filenames. The other is searching these files. As these are running in parallel, we start getting the results as soon as there are some qualifying filenames. We start getting results faster, which means the system is responsive.

Quiz: What would happen if both these processes ran one after another? How would the system arrange that the result of the find command is communicated to the xargs command?      

Just as in real life, collaboration needs communication. The pipe is the mechanism that enables the find process to communicate with the xargs process. The pipe acts both as a coordinator and as a communication mechanism. 
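
For the curious, roughly the same wiring can be done from Java using ProcessBuilder.startPipeline (available since Java 9). This is only a sketch; it assumes the find and xargs binaries are available on the PATH:

import java.util.Arrays;
import java.util.List;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        ProcessBuilder find = new ProcessBuilder("find", ".", "-type", "f", "-name", "*.html");
        ProcessBuilder xargs = new ProcessBuilder("xargs", "egrep", "-w", "Mon", "/dev/null");
        xargs.redirectOutput(ProcessBuilder.Redirect.INHERIT); // the last stage prints to our console

        // The parent wires find's stdout to xargs's stdin via a pipe, just as the shell does.
        List<Process> pipeline = ProcessBuilder.startPipeline(Arrays.asList(find, xargs));
        for (Process p : pipeline) p.waitFor();                // wait for the whole pipeline to finish
    }
}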

Coordination and communication

We need to make sure that when the find process has nothing more to report, which means that it has found all the qualifying filenames, egrep should also stop!

Similarly, if any of the processes in the pipeline quits for any reason, then the entire pipeline should stop.  

For example, here is a pipeline that computes the factorial of 1,000:

% seq 1000 | paste -s -d '*' | bc
40238726007709377354370243392300398571937486421071463254379991042993\
85123986290205920442084869694048004799886101971960586316668729948085\
.... rest of the output truncated

The pipeline has three filters: seq, paste, and bc. The seq command just prints the numbers from 1 to 1,000 to the console. The shell arranges things so that this output gets fed into the pipe that is consumed by the paste filter.

The paste filter now joins all the lines with the * delimiter. It just does that little bit, and outputs the line to standard output, as shown in the following screenshot:

The paste command writes to the console; the shell has arranged for the output to go into a pipe again. At the other end, the consumer is bc. The bc command or filter is capable of arbitrary-precision arithmetic; in simpler terms, it can perform very large computations.
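
As an aside, the same arbitrary-precision computation can be expressed in Java with java.math.BigInteger; a minimal sketch:

import java.math.BigInteger;
import java.util.stream.LongStream;

public class Factorial {
    public static void main(String[] args) {
        // Multiply 1 * 2 * ... * 1000 using arbitrary-precision integers, just like bc does.
        BigInteger result = LongStream.rangeClosed(1, 1000)
                .mapToObj(BigInteger::valueOf)
                .reduce(BigInteger.ONE, BigInteger::multiply);
        System.out.println(result); // 40238726007709377354...
    }
}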

When the seq command exits normally, this triggers an EOF (end of file) on the pipe. This tells paste that the input stream has nothing more to read, so it does the joining, writes the output to the console (which is really going to a pipe), and quits in turn.

This quitting results in an EOF for the bc process, so it computes the multiplication, prints the result to the standard output, which is really a console, and finally quits. This is an ordered shutdown; no more work needs to be done, so exit and relinquish the computing resources for other concurrent processes, if there are any. The melodramatic term for this marker is poison pill. See https://dzone.com/articles/producers-and-consumers-part-3 for more information.

At this point, the pipeline processing is done and we get back to the shell prompt again, as shown in the following diagram:             

Unbeknownst to all the filters participating in the pipeline, the parent shell has arranged for this coordination. This ability of the framework to be composed of smaller parts without the parts themselves being aware of the composition is a great design pattern, called pipes and filters. We will see how composition is one central theme, yielding robust concurrent programs.

What happens when the seq process produces numbers way too fast? Would the consumer (paste in this case) get overwhelmed? Aha, no! The pipeline also has an implicit flow control built into it. This is yet another central theme, called back-pressure, where the faster producer (or consumer) is forced to wait so the slower filter catches up. 

Let's next look at this flow control mechanism.    

Flow control    

The wonderful idea behind the previously mentioned pipeline is that the find producer and the xargs consumer don't know each other. That is, you could compose any filters using pipes. This is the celebrated pipes and filters design pattern in action. The shell command line gives you a framework that enables you to compose any filters together into a pipeline. 

What does it give us? You can reuse the same filter in unexpected and creative ways to get your work done. Each filter just needs to follow a simple protocol of accepting input on file descriptor 0, writing output to file descriptor 1, and writing errors to descriptor 2.

You can refer to a UNIX shell programming guide for more information on descriptors and related ideas. My personal favorite is UNIX Power Tools, 3rd Ed. by Jerry Peek et al.           

Flow control means we are trying to regulate the flow of something. When you tell someone to talk slowly so that you can follow their meaning, you are trying to control the flow of words.

Flow control is essential in ensuring that the producer (such as a fast speaker) does not overwhelm the consumer (such as a listener). In the example we have been working on, the find process could produce filenames faster; the egrep process might need more time to process each file. The find producer works at its own pace, and does not care about a slow consumer.

If the pipe gets full because of the slower consumption by xargs, the output call by find is blocked; that is, the process is waiting, and so it can't run. This pauses find until the consumer has finally found the time to consume some filenames and the pipe has some free space. It works the other way around as well: a fast consumer blocks on an empty pipe. Blocking is a process-level mechanism, and find (or any other filter) does not know it is blocking or unblocking.
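
The same blocking behavior can be reproduced inside a single JVM with piped streams: a PipedOutputStream connected to a PipedInputStream has a small internal buffer, and the writer blocks when that buffer fills up. Here is a minimal sketch; the sleep makes the consumer deliberately slow:

import java.io.*;

public class BackPressureSketch {
    public static void main(String[] args) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out, 1024); // small buffer, like a UNIX pipe

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 1_000; i++) {
                    out.write("some-file-name.html\n".getBytes()); // blocks when the buffer is full
                }
                out.close();                                       // signals EOF to the reader
            } catch (IOException e) { e.printStackTrace(); }
        });

        Thread consumer = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = r.readLine()) != null) {
                    Thread.sleep(1);                               // a slow consumer; the producer waits
                }
            } catch (IOException | InterruptedException e) { e.printStackTrace(); }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}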

The moment a process starts running, it performs its computation; the find filter, for example, ferrets out some filenames and writes them to its output. Here is a simplified state diagram, showing a process's life cycle:

What is this scheduled state? As mentioned, a running process could get blocked waiting for some I/O to happen, and thus it cannot use the CPU. So it is put on the back burner for a while, and other processes, waiting their turn, are given a chance to run. Drawing a parallel with the previously mentioned receptionist scenario, the receptionist can ask us to be seated and wait a while, and then move on to the next guest in the queue.

The other idea is that the process has run its allocated slice of time, so other processes should now get a chance, too. In this case, even though the process can run and utilize the CPU, it is moved back to the scheduled state, and can run again once other processes have used their run slices. This is preemptive multitasking we have here, which makes it a fair world to live in! Processes need to run so that useful work can happen. Preemptive scheduling is an idea to help each process get a slice of CPU time.

However, there is another notion that could throw a spanner into this scheme of things. A process with a higher priority is given preference over lower priority processes.

A real-world example should help make this clear. While driving on roads, when we see an ambulance or a police car with a screaming siren, we are required to make way for them. Similarly, a process executing a piece of business logic may need more priority than the data backup process.

Divide and conquer

GNU parallel (https://www.gnu.org/software/parallel/) is a tool for executing commands in parallel on one or more nodes. The following diagram shows a simple run where we generate 10 text files and zip them (using the gzip command) in parallel. All the available cores are used to run gzip, thereby reducing the overall processing time:

The core principle at work is divide and conquer. We see the same principle again and again: a parallelizable job is split into pieces, each of which is processed in parallel (thereby overlapping processing and reducing the time). The parallel command also allows you to distribute long-running jobs on different nodes (machines), thereby allowing you to harness the idle (possibly unused) cores to process jobs quickly.

The concept of state

The communication depicted in the preceding section could be looked at as message passing; find is passing on the filename as a message to the xargs process, or seq is passing messages (numbers) to the paste process. Generally speaking, a producer is sending messages to the consumer for consumption, as shown in the following diagram:

As shown in the preceding diagram, each process has its own state by design, and this state is hidden from other processes. The processes communicate with explicit messaging channels, in the same way that a pipe directs the flow of water. 

This notion of state is very important for understanding the various upcoming concurrency patterns. We could look at state as data in a certain stage of processing. For example, the seq process could be using a counter variable to generate the numbers, and writing those numbers to its standard output (file descriptor 1; by default, the console). At the same time, the paste process is processing its input and writing data to its standard output. Neither process cares about the other's state; in fact, they don't even know anything about the other process.

The real world is full of encapsulated states. The following diagram shows an example:

 

It defeats common sense to share the state (the need to buy milk) with the postal department employee. It is useless for him to know it, and it could create confusion.

Likewise, the employee will be going about his daily tasks and has a state of his own. Why do we, as consumers, need to know the internal details (state) of how he is going to manage his work (dispatch this big stack of letters)? The world is concurrent, and the various entities in it also hide unnecessary details from each other to avoid confusion. If we don't hide the internal details (that is, the state), it would create havoc.

We could ask whether there is a global shared memory. If there is, then we could use it as a message channel. Using a shared data structure of our choice, the producer could put the data in it for subsequent consumption; that is, the memory is used as a channel of communication.

The shared memory and shared state model

What if we write a multithreaded program to achieve the same result? A thread of execution is a sequence of programming instructions, scheduled and managed by the operating system. A process could contain multiple threads; in other words, a process is a container for concurrently executing threads, as shown in the following diagram:

As shown in the preceding diagram, multiple threads share the process memory. Two concurrently running processes do not share memory or any other resources, such as file descriptors. In other words, different concurrent processes have their own address space, while multiple threads within the same process share their address space. Each thread also has a stack of its own. This stack is used for returning after a function call. Locally scoped variables are also created on the stack. The relationships between these elements are shown in the following diagram:

As shown in the preceding diagram, both the threads communicate via the process's global memory. There is a FIFO (first in first out) queue in which the producer thread t1 enters the filenames. The consumer thread, t2, picks up the queue entries as and when it can.

What does this data structure do? It works on a similar principle as the aforementioned pipe. The producer can produce items as fast or slow as it can. Likewise, the consumer thread picks the entries from the queue as needed. Both work at their own pace, without caring or knowing of each other. 
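
Java's java.util.concurrent package gives us such a data structure in the form of a bounded BlockingQueue, which handles the necessary synchronization internally. Here is a minimal sketch with a producer thread, t1, and a consumer thread, t2; the filenames are made up, and a poison pill marker tells the consumer when to stop:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SharedQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10); // bounded FIFO shared by both threads
        String POISON_PILL = "DONE";                                // marker telling the consumer to stop

        Thread t1 = new Thread(() -> {                              // producer
            try {
                for (int i = 1; i <= 100; i++) {
                    queue.put("file-" + i + ".html");               // blocks if the queue is full
                }
                queue.put(POISON_PILL);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread t2 = new Thread(() -> {                              // consumer
            try {
                while (true) {
                    String name = queue.take();                     // blocks if the queue is empty
                    if (name.equals(POISON_PILL)) break;
                    System.out.println("processing " + name);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}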

Exchanging information this way looks simpler. However, it brings with it a host of problems. Access to the shared data structure needs to be synchronized correctly. This is surprisingly hard to achieve. The next few sections will deal with the various issues that crop up. We will also see the various paradigms that shy away from the shared state model and tilt towards message passing.

Threads interleaving – the need for synchronization

The way threads get scheduled to run is within the realm of the operating system. Factors such as the system load and the number of processes running on the machine at any given time make thread scheduling unpredictable. A good example is a seat in a movie hall.

Let's say that the movie that is playing is a big crowd-puller. We need to follow a protocol; we book a ticket by reserving a seat. The following diagram shows the rules that are based around the ticket booking system: 

What if the booking is erroneously given to two people at the same time? The result would be chaotic, as both would rightfully try to go and occupy the seat at the same time. 

There is a certain framework at work to make sure this situation does not happen in practice. The seat is shared among interested people and one person needs to book it in advance.

Likewise, threads need to lock a resource (that is, a shared, mutable data structure). The problem is with explicit locking. If the onus of correctly synchronizing is with the application, then someone, someday may forget to do it right and all hell will break loose.   

To illustrate the need for correct synchronization, the following diagram shows an integer variable shared between two threads:  

As shown in the preceding diagram, if the interleaving happens to be right, things may work as expected. Otherwise, we will have a lost update to deal with.

Instead, just like a movie ticket acting as a lock, every Java object has a lock. A thread acquires it, performs the state mutations, and unlocks. This entire sequence is a critical section. If your critical section needs to mutate a set of objects, you need to lock each one separately. 
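
The lost update is easy to demonstrate with a shared counter. In the following sketch, two threads increment the same field; the synchronized keyword makes the increment a critical section, and removing it would typically make the final count come out below 200,000:

public class LostUpdateSketch {
    private int counter = 0;

    // The lock here is the LostUpdateSketch instance itself; remove 'synchronized'
    // and the two increments can interleave, losing updates.
    private synchronized void increment() {
        counter++;   // read, add one, write back - a critical section
    }

    public static void main(String[] args) throws InterruptedException {
        LostUpdateSketch shared = new LostUpdateSketch();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                shared.increment();
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(shared.counter); // always 200000 with synchronized
    }
}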

Generally, the advice is to keep the critical section as small as possible. We will discuss the reason in the next section.

Race conditions and heisenbugs

The lost update is an example of a race condition. A race condition means that the correctness of the program (the way it is expected to work) depends on the relative timing of the threads getting scheduled. So sometimes it works right, and sometimes it does not!

This is a situation that is very hard to debug. We need to reproduce a problem to investigate it, possibly running it in a debugger. What makes it hard is that the race condition cannot be reliably reproduced! The sequence of interleaved instructions depends on the relative timing of events that are strongly influenced by the environment. Delays are caused by other running programs, other network traffic, OS scheduling decisions, variations in the processor's clock speed, and so on. A program containing a race condition may exhibit different behavior at different times.

Heisenbugs and race conditions are illustrated in the following diagram:

These are heisenbugs: essentially nondeterministic and hard to reproduce. If we try debugging a heisenbug by attaching a debugger, the bug may disappear!

There is no easy way to debug and fix these. There is some tooling support, such as the tha tool (https://docs.oracle.com/cd/E37069_01/html/E54439/tha-1.html) and helgrind (http://valgrind.org/docs/manual/drd-manual.html); however, these are language or platform specific, and don't necessarily prove the absence of races.

Clearly, we need to avoid race conditions by design, hence the need to study concurrency patterns and this book.   

Correct memory visibility and happens-before

There is yet another problem that could come up with incorrect synchronization: incorrect memory visibility. The synchronized keyword prevents the execution of critical sections by more than one thread. The synchronized keyword also makes sure the thread's local memory syncs up correctly with the shared memory, as shown in the following diagram: 

What is this local memory? Note that on a multicore CPU, each CPU has a cache for performance reasons. This cache needs to be synced with the main shared memory. The cache coherence needs to be ensured so that each thread running on a CPU has the right view of the shared data.

As shown in the preceding diagram, when a thread exits a synchronized block, it issues a write barrier, thereby syncing the changes in its cache to the shared memory. On the other hand, when a thread enters a synchronized block, it issues a read barrier, so its local cache is updated with the latest changes in the shared memory.

Note that this is again not easy. In fact, very seasoned programmers were tripped up when they proposed the double-checked locking pattern. This seemingly brilliant optimization was found to be flawed in light of the preceding memory synchronization rules.

For more information on this botched optimization attempt, take a look at https://www.javaworld.com/article/2074979/java-concurrency/double-checked-locking--clever--but-broken.html. 

However, Java's volatile keyword guarantees correct memory visibility. You don't need to synchronize just to ensure correct visibility. This keyword also guarantees ordering, which is a happens-before relationship. A happens-before relationship ensures that any memory writes done by a statement are visible to another statement, as shown in the following code: 

private int i = 0;
private int j = 0;
private volatile boolean k = false;
// first thread sets values
i = 1;
j = 2;
k = true;

All the earlier writes have a happens-before relationship with the write to the volatile variable k. This means that after the variable k is set, all the previous changes are guaranteed to have happened! So the values of the i and j variables are guaranteed to be set, as shown in the following snippet:

// second thread prints them
System.out.println("i = " + i + ", j = " + j + ", k = " + k); // the i and j values will have been flushed to memory

The volatile keyword, however, does not guarantee atomicity. See http://tutorials.jenkov.com/java-concurrency/volatile.html for more information.

Sharing, blocking, and fairness

Just like the process life cycle, threads also have a life cycle. The following figure shows the various thread states. It shows three threads, t1, t2, and t3, in the Runnable, Running, and Timed Wait states. Here is a brief explanation of each state:

  • New: When a Thread object is just created. The thread is not alive, as yet.  
  • Runnable: When the start() function is called on the thread object, its state is changed to runnable. As shown in the following diagram, a thread scheduler kicks in to decide when to schedule this thread to be run. 
  • Running: Eventually, the thread scheduler picks one of the threads from the runnable thread pool and changes its state to Running. This is when the thread starts executing; the CPU runs this thread's instructions.
  • Blocked: The thread is waiting for a monitor lock. As noted previously, for a shared resource such as a mutable memory data structure, only one thread at a time can access/read/mutate it. While one thread holds the lock, other threads will be blocked.
  • Waiting: The thread is waiting for another thread to perform an action. Threads commonly block while doing I/O.
  • Timed Wait: The thread waits for an event for a finite amount of time. 
  • Terminated: The thread is dead and cannot go back to any other state. 

A thread goes back to the Runnable state once the event it waited for happens:

As shown in the preceding diagram, a blocked thread is expensive and wasteful. Why is this so? Remember, a thread is a resource in itself. Instead of a thread that is just blocked and doing nothing, it would be far better to employ it for processing something else. Wouldn't it be good to allocate the thread to do something useful?

Keeping the critical sections small is one way to be fair to all threads. No thread holds the lock for a long time (although this can be altered).

Could we avoid blocking the thread and instead use it for something else? Well, that brings us to the theme of asynchronous versus synchronous executions.

Asynchronous versus synchronous executions

Blocking operations are bad, as they waste resources. By blocking, we mean operations that take a long time to complete. Synchronous execution allows tasks to execute in a sequence, waiting for the current operation to complete before starting the next. For example, making a phone call is synchronous. We dial the number, wait for the person on the other side to say "hello," and then proceed with the conversation.

On the other hand, posting a letter is done asynchronously. One does not post a letter and block for its response. We post it and then go our own way, doing other stuff. Some time in the future, we can expect a response (or an error if the letter could not be delivered). 

As another example, some restaurants give you a lunch token. You pay for lunch and get a token, which is a promise that you will get to eat in the near future. If there is a big queue at the counter, you may occupy yourself with something else in the meantime and then try again later.

This is an asynchronous model.

Imagine what would happen if there were no token system in place. You would pay and then just wait for your turn to be served, blocked while the people at the front of the queue are served.

Coming back to the software world, file and network I/O are blocking. So are database calls using blocking drivers, as shown in the following diagram:

Instead of blocking and wasting the thread doing nothing, we could look at the workflow as a mix of non-blocking and blocking tasks. We would then handle a blocking task using a future: an abstraction that will complete eventually and call us back with the result or an error.

This is a change in paradigm: we start designing our tasks differently and representing them using higher-level abstractions, such as a future (which we discussed previously), rather than dealing with threads directly. Actors are another abstraction over threads, that is, another paradigm.

Futures offer composability. They are monads. You can create a pipeline of future operations to perform higher-level computations, as we will soon see in an upcoming chapter. 
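
Java's CompletableFuture is one such abstraction. The following is a minimal sketch of composing a pipeline of asynchronous steps and registering a callback, instead of blocking for the result:

import java.util.concurrent.CompletableFuture;

public class FutureSketch {
    public static void main(String[] args) {
        CompletableFuture<Integer> pipeline =
                CompletableFuture.supplyAsync(() -> "42")        // runs on a pool thread
                        .thenApply(Integer::parseInt)            // compose: transform the result
                        .thenApply(n -> n * 2);

        CompletableFuture<Integer> done = pipeline.whenComplete((value, error) -> {
            if (error != null) {
                error.printStackTrace();                         // the error path
            } else {
                System.out.println("result = " + value);         // result = 84
            }
        });

        done.join();                                             // only for the demo, so the JVM waits
    }
}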

Java's nonblocking I/O

Java NIO (New IO) is a nonblocking I/O API for Java. This NIO is an alternative to the standard Java I/O API. It provides abstractions such as channels, buffers, and selectors. The idea is to provide an implementation that can use the most efficient operations provided by the operating system, as shown in the following screenshot:

A channel is just a bidirectional I/O stream. A single thread can monitor all the channels an application has opened. Data arriving at any channel is an event, and the listening thread is notified of its arrival.

The selector uses event notification: a thread can then check whether the I/O is complete without any need for blocking. A single thread can handle multiple concurrent connections.
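
Here is a minimal sketch of this event-notification style, using a Selector to watch a server socket and its client connections from a single thread; the port number is arbitrary and error handling is omitted for brevity:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class SelectorSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9090));
        server.configureBlocking(false);                       // nonblocking mode is required for select()
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                                 // one thread waits for events on all channels
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {                      // a new client connected
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {                 // data arrived on an existing connection
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = ByteBuffer.allocate(256);
                    if (client.read(buffer) == -1) {
                        client.close();                        // the peer closed the connection
                    }
                }
            }
        }
    }
}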

This translates into two primary benefits:

  • Overall, you would need fewer threads. As a thread has a memory footprint, the memory management would have less overhead.
  • Threads could do something useful when there is no I/O. This opens up the possibility of optimization, as threads are a valuable resource.

The Netty framework (https://netty.io/) is an NIO-based client-server framework. The Play framework is a high-performance, reactive web framework based on Netty.   

Of patterns and paradigms

Moving away from explicit state management is a very prominent theme in programming. We always need a higher level of abstraction over the shared state model. As explained earlier, explicit locking does not cut it. 

The various concurrency patterns that we will study in this book try to stay away from explicit locking. For example, immutability is a major theme, giving us persistent data structures. A persistent data structure performs a smart copy on a write, thereby avoiding mutation altogether, as shown in the following diagram:

As shown in the preceding diagram, the original linked list has three elements, {1, 2, 3}. The head element of the list has the value 1. Thread T1 starts counting the number of elements in the list.

At any point in time, thread T2 can prepend an element to the original list. This should not disturb the world of thread T1; it should still see the original list as it is. In other words, T1's version of the list, as it sees it, is preserved. Any change in the list creates a new version of the data structure. As all the versions live as long as they are needed (that is, they are persistent), we don't need any locking.

Similarly, thread T2 removes the first two elements. This is achieved by just setting its head to the third element; again, this doesn't disturb the state as seen by T1 and T2.

This is essentially copy-on-write. Immutability is a cornerstone of functional programming languages.        
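
A tiny persistent (immutable) singly linked list makes the idea concrete: prepending creates a new node that shares the old nodes, so existing readers keep seeing their own version. This is only an illustrative sketch, not a production-grade data structure:

public final class PList<T> {
    final T head;              // null fields mark the empty list in this sketch
    final PList<T> tail;

    private PList(T head, PList<T> tail) { this.head = head; this.tail = tail; }

    public static <T> PList<T> empty() { return new PList<>(null, null); }

    public boolean isEmpty() { return head == null && tail == null; }

    // Prepend never mutates 'this'; it returns a new version sharing the old nodes.
    public PList<T> prepend(T value) { return new PList<>(value, this); }

    public int size() {
        int n = 0;
        for (PList<T> p = this; !p.isEmpty(); p = p.tail) n++;
        return n;
    }

    public static void main(String[] args) {
        PList<Integer> original = PList.<Integer>empty().prepend(3).prepend(2).prepend(1); // {1, 2, 3}
        PList<Integer> extended = original.prepend(0);   // another thread prepends; original is untouched
        PList<Integer> shortened = original.tail.tail;   // "remove" the first two elements: just a new head
        System.out.println(original.size() + " " + extended.size() + " " + shortened.size()); // 3 4 1
    }
}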

A typical concurrency pattern is an active object. For example, how would you consume a legacy code base from multiple threads? The code base was written without any parallelism in mind, the state is strewn around, and it is almost impossible to figure out.

A brute-force approach could be to just wrap up the code in a big God object. Each thread could lock this object, use it, and relinquish the lock. However, this design would hurt concurrency, as it means that other threads would simply have to wait! Instead, we could use an active object, as shown in the following diagram:

To use this active object, a proxy sits in between the caller threads and the actual code base. It converts each invocation of the API into a runnable and puts it in a blocking queue (a thread-safe FIFO queue). 

There is just one thread running in the God object. It executes the runnables on the queue one by one, in contrast to how a typical Java object method is invoked (passively). Here, the object itself executes the work placed on the queue, hence the term active object. 
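
Here is a minimal sketch of an active object wrapping a hypothetical legacy object: the proxy method turns each call into a Runnable and puts it on a blocking queue, and a single worker thread drains the queue, so the legacy code is only ever touched by one thread. All the names are made up for illustration:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ActiveObjectSketch {
    // Stand-in for the legacy, non-thread-safe code base (the "God object").
    static class LegacyGodObject {
        private long total = 0;
        void addToTotal(long amount) { total += amount; }
        long total() { return total; }
    }

    private final LegacyGodObject legacy = new LegacyGodObject();
    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(() -> {
        try {
            while (true) {
                mailbox.take().run();      // the single thread executes the queued requests one by one
            }
        } catch (InterruptedException e) {
            // shutting down
        }
    });

    public ActiveObjectSketch() {
        worker.setDaemon(true);
        worker.start();
    }

    // The proxy method: callers on any thread just enqueue the work and return immediately.
    public void addToTotal(long amount) {
        mailbox.add(() -> legacy.addToTotal(amount));
    }
}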

The rest of this chapter describes the many patterns and paradigms that have evolved over the years and that are used to avoid the explicit locking of shared state.

Event-driven architecture 

Event-driven programming is a programming style in which code executes in response to an event, such as a keypress or a mouse click. In short, the flow of a program is driven by events.

GUI programming is an example of event-driven programming. For example, X Windows (driving most of your Linux GUI) processes a series of XEvents. Every keypress, mouse button press or release, and mouse movement generates a series of events. If you are on Linux, there is a command called xev. Running it from a terminal spawns a window. When you move the mouse over the window or press some keys, you can see the events that are generated.

Here is a capture of the xev program on my Linux laptop: 

You can plug in a callback, which gets triggered upon the reception of such an event. For example, an editor program could use keypress events to update its state (resulting in its documents being edited). Traditional event-driven programming could create a complex callback flow, thereby making it hard to figure out the control flows in the code.

Event-driven architecture (EDA) helps in decoupling a system's modules. Components communicate using events, which are encapsulated in messages. A component that emits an event does not know anything about the consumers. This makes EDA extremely loosely coupled. The architecture is inherently asynchronous. The producer is oblivious of the consumers of the event messages. This process is shown in the following diagram:

Given one thread and an event loop, with the callbacks executing quickly, we have a nice architecture. How does all this relate to concurrency? There could be multiple event loops running on a pool of threads. Thread pooling is an essential concept, as we will see in the upcoming chapters. 

As we have seen, an event loop manages events. The events are passed on to an installed handler, where they are processed. The handler can react to an event in two ways: either it succeeds or it fails. A failure is passed to the event loop again as another event, and the handler installed for that failure event decides how to react.
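
A toy event loop shows the shape of this: handlers are registered per event type, the loop dispatches events one at a time, and a handler failure is simply posted back as another event. The names here are illustrative, not from any particular framework:

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class EventLoopSketch {
    static final class Event {
        final String type;
        final Object payload;
        Event(String type, Object payload) { this.type = type; this.payload = payload; }
    }

    private final BlockingQueue<Event> events = new LinkedBlockingQueue<>();
    private final Map<String, Consumer<Event>> handlers = new ConcurrentHashMap<>();

    public void on(String type, Consumer<Event> handler) { handlers.put(type, handler); }

    public void post(Event event) { events.add(event); }

    public void run() throws InterruptedException {
        while (true) {
            Event event = events.take();                  // the loop handles one event at a time
            Consumer<Event> handler = handlers.get(event.type);
            if (handler == null) continue;
            try {
                handler.accept(event);                    // the success path
            } catch (RuntimeException failure) {
                post(new Event("error", failure));        // a failure becomes just another event
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        EventLoopSketch loop = new EventLoopSketch();
        loop.on("keypress", e -> System.out.println("key: " + e.payload));
        loop.on("error", e -> System.out.println("handled failure: " + e.payload));
        loop.post(new Event("keypress", "x"));
        loop.run();                                       // runs forever in this demo
    }
}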

 

Reactive programming

Reactive programming is a related programming paradigm. A spreadsheet is an excellent example of a reactive application. If we set a formula and change any column value, the spreadsheet program reacts and computes the new result columns.

A message-driven architecture is the foundation of Reactive applications. A message-driven application may be event-driven, actor-based, or a combination of the two.

The following is a diagram of observable composition:

Composable event streams make event handling easier to understand. Reactive Extensions (Rx) is a framework that provides composable observables. At the heart of this framework is the observer pattern, with a functional flavor. The framework allows us to compose multiple observables. The observers are given the resulting event stream in an asynchronous fashion. For more information, see http://reactivex.io/intro.html 

Function composition is shown in the following code:

This Scala code shows five standalone methods. Each method is converted to a function and then collected in a variable, list. The reduceRight call iterates over this list and composes all the functions into a bigger one, f.

The f("hello") call shows that the composition has worked!

The actor paradigm

All this concurrent programming is tricky. What is the correct synchronization and visibility? What if we could go back to our simpler sequential programming model and let the platform handle concurrency for us? 

Look at the following diagram:

Actors are an abstraction over threads. We write our code using the message passing model only. The only way of talking to an actor is by sending it a message.

Looking back at our UNIX shell model, the concurrency is there, but we don't deal with it directly. Using actors, we write code as if it were for a sequential message processor.

We need to be aware of the underlying threading model, though. For example, we should always use the tell and not the ask pattern, as shown in the picture. The tell pattern is where we send a message to an actor and then forget about it, that is, we don't block for an answer. This is essentially the asynchronous way of doing things:

An actor is a lightweight entity (threads are heavyweight). The cost of creating and destroying actors is comparable to the cost of creating and destroying Java objects. Just as we don't think of the cost while designing a UNIX pipeline (we focus largely on getting our job done), actors give us the same freedom.

Actors also allow us to add supervision and restart capabilities, thereby allowing us to write robust, resilient systems. This is the let it crash philosophy. 

Actors are pretty old as a design; the paradigm was tried and tested in the Telecom domain using the Erlang language.

We will be looking at the actor model and the Akka library in detail in an upcoming chapter.   

Message brokers

A message broker is an architectural pattern for enabling application integrations via a message-driven paradigm. You can, for example, make a Python application and integrate it with another that is written in C (or Java). Integrations are vital to an enterprise where different applications are made to cooperate with each other.

Concurrent processing is obviously implied here. As the producers and consumers are completely decoupled (they don't even know if the others exist), the producer and consumer applications could even run on different machines, thereby overlapping the processing and increasing the overall throughput: 

Decoupling is really a central concept when you start thinking about concurrent systems. Designing systems consisting of loosely coupled component systems gives us many benefits. For example, we could reuse the components, which allows us to cut down on development and maintenance costs. It also paves the way for enabling greater concurrency. 

What happens when a producer produces messages too fast? The messages will be buffered in the broker. This essentially means there is an inherent flow control mechanism at work here. A slow consumer can consume at its own pace. Likewise, the producer can produce messages at its own (faster) pace. As both are oblivious of each other, the overall system works smoothly.

Software transactional memory

The idea of database transactions is also based around concurrent reads and writes. A transaction embodies an atomic operation, which means that either all or none of the steps in the operation are completed. If all the operations are completed, the transaction succeeds; otherwise, the transaction aborts. Software transactional memory (STM) is a concurrency control mechanism along similar lines. It, again, is a different paradigm, an alternative to lock-based synchronization.

Just like a database transaction, a thread makes modifications and then tries to commit the changes. Of course, if some other transaction wins, we roll back and retry. If there is an error, the transaction aborts and we retry.

This scheme of things is called optimistic locking, wherein we don't care about other possible concurrent transactions. We just make changes and hope the commit succeeds. If it fails, we keep trying until it eventually succeeds.      

What are the benefits? We get increased concurrency, as there is no explicit locking, and all threads keep progressing; only in the case of a conflict will they retry.      
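
Java has no built-in STM (libraries such as ScalaSTM provide one for the JVM), but the optimistic retry idea can be illustrated with a compare-and-set loop on an AtomicReference: read the current value, compute a new one off to the side, and commit only if nobody else has changed it in the meantime; otherwise, retry. A minimal sketch:

import java.math.BigDecimal;
import java.util.concurrent.atomic.AtomicReference;

public class OptimisticSketch {
    private final AtomicReference<BigDecimal> balance = new AtomicReference<>(BigDecimal.ZERO);

    public void deposit(BigDecimal amount) {
        while (true) {
            BigDecimal current = balance.get();            // read the current state
            BigDecimal updated = current.add(amount);      // compute the new state off to the side
            if (balance.compareAndSet(current, updated)) { // the commit succeeds only if nothing changed
                return;
            }
            // Someone else committed first - retry with the fresh value.
        }
    }
}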

STM simplifies our understanding of multithreaded programs. This, in turn, helps make programs more maintainable. Each transaction can be expressed as a single-threaded computation, as shown in the following diagram. We don't have to worry about locking at all:

Composability is a big theme: lock-based programs do not compose. You cannot take two atomic operations and create one more atomic operation out of them. You need to specifically program a critical section around these. STM, on the other hand, can wrap these two operations inside a transaction block, as shown in the preceding diagram. 

Parallel collections 

Say that I am describing some new and exciting algorithm to you. I start telling you about how the algorithm exploits hash tables. We typically think of such data structures as all residing in memory, locked (if required), and worked upon by one thread.

For example, take a list of numbers. Say that we want to sum all these numbers. This operation could be parallelized on multiple cores by using threads.

Now, we need to stay away from explicit locking. An abstraction that works concurrently on our list would be nice. It would split the list, run the function on each sublist, and collate the result in the end, as shown in the following diagram. This is the typical MapReduce paradigm in action: 

The preceding diagram shows a Scala collection that has been parallelized in order to use concurrency internally.  
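
Java offers a similar facility in the form of parallel streams; a minimal sketch that splits the work across cores without any explicit locking:

import java.util.stream.LongStream;

public class ParallelSumSketch {
    public static void main(String[] args) {
        // The stream framework splits the range, sums the chunks on a pool of
        // worker threads, and combines the partial sums - MapReduce in miniature.
        long sum = LongStream.rangeClosed(1, 10_000_000).parallel().sum();
        System.out.println(sum); // 50000005000000
    }
}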

What if the data structure is so large that it cannot all fit in the memory of a single machine? We could split the collection across a cluster of machines instead.

The Apache Spark framework does this for us. Spark's Resilient Distributed Dataset (RDD) is a partitioned collection that spreads the data structure across cluster machines, and thus can work on huge collections, typically to perform analytical processing.   

Summary

So, this was a whirlwind tour of the world of concurrency, dear reader. It served more as a memory refresher for many of the things you probably knew already.

We saw that concurrency is very common in the real world, as well as in the software world. We looked at the message passing and shared memory models, and saw how many common themes drive these two models.

If the shared memory model uses explicit locking, a host of problems emerge. We discussed race conditions, deadlocks, critical sections, and heisenbugs.

We wrapped up with a discussion of asynchronicity, the actor paradigm, and the software transactional memory. Now that we have all this background knowledge, in the next chapter, we will look at some core concurrency patterns. Stay tuned!


Key benefits

  • Learn functional programming patterns to write clear and high quality code
  • Explore parallel programming and multithreading techniques to make your code run faster
  • Efficiently use thread pools, actors patterns, and threading patterns to build reliable applications

Description

Selecting the correct concurrency architecture has a significant impact on the design and performance of your applications. Concurrent design patterns help you understand the different characteristics of parallel architecture to make your code faster and more efficient. This book will help Java developers take a hands-on approach to building scalable and distributed apps by following step-by-step explanations of essential concepts and practical examples. You’ll begin with basic concurrency concepts and delve into the patterns used for explicit locking, lock-free programming, futures, and actors. You’ll explore coding with multithreading design patterns, including master, slave, leader, follower, and map-reduce, and then move on to solve problems using synchronizer patterns. You'll even discover the rationale for these patterns in distributed and parallel applications, and understand how future composition, immutability, and the monadic flow help you create more robust code. By the end of the book, you’ll be able to use concurrent design patterns to build high performance applications confidently.

Who is this book for?

If you’re a developer with foundational programming knowledge and want to learn patterns to build scalable and high performing apps, this guide is a must-have.

What you will learn

  • Explore parallel architecture
  • Get to grips with concurrency models to optimize your code's performance
  • Explore important concurrency patterns in the Java multithreading library
  • Gain insights into concurrent design patterns
  • Discover design principles behind a variety of thread abstractions in Java
  • Apply functional concurrency patterns to address issues with threads and locks

Product Details

Publication date : Sep 27, 2018
Length: 264 pages
Edition : 1st
Language : English
ISBN-13 : 9781788629546





Table of Contents

8 Chapters
Concurrency – An Introduction
A Taste of Some Concurrency Patterns
More Threading Patterns
Thread Pools
Increasing the Concurrency
Functional Concurrency Patterns
Actors Patterns
Other Books You May Enjoy

Customer reviews

Rating distribution
5 out of 5 stars (1 rating)
5 star: 100%
4 star: 0%
3 star: 0%
2 star: 0%
1 star: 0%

Review from Sep 17, 2021 (5 stars):
1) Good high level view of things in the world of concurrent design/implementation.
2) Very nice, real life examples; easy to relate to.
3) Informal approach to the subject, makes the subject less intimidating.
4) General things of importance in the world of software development, those that are not directly related to concurrency, are also mentioned and briefly talked about.
I wish the author writes more books on software development in general. Many thanks and all the best.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing: When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it, we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect our rights as publishers and the rights of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal)
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.