For our proposed architecture, we need to decouple acquisition from processing, improving the scalability and the independence of the layers. To achieve this goal, we can use a queue. We could either use Java Message Service (JMS) or Advanced Message Queuing Protocol (AMQP), but in this case we are going to use Apache Kafka. This is supported by most common analytics platforms, it has a very high performance and scalability, and it also has a good analytics framework.
In Kafka, each topic is divided into a set of logs called partitions. The producers write to the tail of Kafka's logs and consumers read the logs. Apache Kafka scales topic consumption by distributing partitions among a consumer group. A consumer group is a set of consumers which share a common group identifier. The following diagram shows a topic with three partitions and two...