Chapter 5: Data Engineering
Data engineering, in general, refers to the management and organization of data and data flows across an organization. It involves data gathering, processing, versioning, governance, and analytics. It is a broad topic that revolves around the development and maintenance of data processing platforms, data lakes, data marts, data warehouses, and data streams, and it is an important practice that contributes to the success of big data and machine learning (ML) projects. In this chapter, you will learn about the ML-specific topics of data engineering.
Many ML tutorials and books start with a clean dataset in a CSV file to build your model against. The real world is different. Data comes in many shapes and sizes, and it is important that you have a well-defined strategy to harvest, process, and prepare data at scale. This chapter will discuss open source tools that can provide the foundations for data engineering in ML projects. You will learn...