Batch processing
Batch processing is the most common form of data processing, and for most companies, it is their bread-and-butter approach to data. In batch processing, work is done only when it is triggered, and it operates on whatever data has accumulated by that point. The trigger may be manual or based on a schedule. Streaming, on the other hand, involves triggering processing very frequently, ideally as soon as new data arrives. When this is implemented as a rapid series of small batches, it is known as micro-batch processing. Streaming is implemented in different ways on different systems. In Spark, streaming is designed to look and work like batch processing but without the need to constantly trigger the job yourself.
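To make the distinction concrete, here is a minimal sketch in plain Python rather than Spark. It contrasts a single triggered batch run with a micro-batch loop over the same records. Every name in it (process_batch, run_batch_job, micro_batches, run_micro_batch_job) is hypothetical and chosen only for illustration, and summing integers stands in for whatever real work a job would do.

```python
from typing import Iterable, Iterator, List


def process_batch(records: List[int]) -> int:
    """Stand-in for real work: sum every record in the batch."""
    return sum(records)


def run_batch_job(source: List[int]) -> int:
    """Batch: one trigger (manual or scheduled) processes all accumulated data."""
    return process_batch(source)


def micro_batches(source: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group an incoming sequence of records into small fixed-size batches."""
    chunk: List[int] = []
    for record in source:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly smaller, batch.
        yield chunk


def run_micro_batch_job(source: Iterable[int], size: int) -> List[int]:
    """Micro-batch: the same processing, triggered once per small chunk."""
    return [process_batch(chunk) for chunk in micro_batches(source, size)]


records = list(range(1, 11))  # ten hypothetical records: 1..10
batch_total = run_batch_job(records)
micro_totals = run_micro_batch_job(records, size=3)

print(batch_total)   # 55
print(micro_totals)  # [6, 15, 24, 10]
```

Both approaches reach the same overall total. The difference is only in how often the processing is triggered and how much data each trigger sees.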
In this section, we will set up some fake data for our examples using the Faker Python library. Faker is used here only to generate example data, since having realistic data to work with is very important to the learning process. If you prefer another way to generate data, please feel free to use that instead:
from faker import Faker
import pandas as pd
import random

fake = Faker()

def generate_data...