What this book covers
Chapter 1, Data Ingestion Techniques, provides a comprehensive overview of the data ingestion process, emphasizing its role in collecting and importing data from various sources into storage systems for analysis. You will explore different ingestion methods such as batch and streaming modes, compare real-time and semi-real-time ingestion, and understand the technologies behind data sources. The chapter highlights the advantages, disadvantages, and practical applications of these methods.
Chapter 2, Importance of Data Quality, emphasizes the critical role data quality plays in business decision-making. It highlights the risks of using inaccurate, inconsistent, or outdated data, which can lead to poor decisions, damaged reputations, and missed opportunities. You will explore why data quality is essential, how to measure it across different dimensions, and the impact of data silos on maintaining data quality.
Chapter 3, Data Profiling – Understanding Data Structure, Quality, and Distribution, explores data profiling, the practice of scrutinizing and validating datasets to understand their structure, patterns, and quality. You will learn how to perform data profiling using tools such as the pandas Profiler and Great Expectations and understand when to use each tool. Additionally, the chapter covers tactics for handling large data volumes and compares profiling methods to improve data validation.
Chapter 4, Cleaning Messy Data and Data Manipulation, focuses on the key strategies for cleaning and manipulating data, enabling efficient and accurate analysis. It covers techniques for renaming columns, removing irrelevant or redundant data, fixing inconsistent data types, and handling date and time formats. By mastering these methods, you will learn how to enhance the quality and reliability of your datasets.
Chapter 5, Data Transformation – Merging and Concatenating, explores techniques for transforming and manipulating data through merging, joining, and concatenating datasets. It covers methods to combine multiple datasets from various sources, handle duplicates effectively, and improve merging performance. The chapter also provides practical tricks to streamline the merging process, ensuring efficient data integration for insightful analysis.
Chapter 6, Data Grouping, Aggregation, Filtering, and Applying Functions, covers the essential techniques of data grouping and aggregation, which are vital for summarizing large datasets and generating meaningful insights. It discusses how aggregating values can help handle missing or noisy data, reduce data volume, and improve processing efficiency. The chapter also focuses on grouping data by various keys, applying aggregate and custom functions, and filtering data to create valuable features for deeper analysis or machine learning (ML).
Chapter 7, Data Sinks, focuses on the critical decisions involved in data processing, particularly the selection of appropriate data sinks for storage and processing needs. It delves into four essential pillars: choosing the right data sink, selecting the correct file type, optimizing partitioning strategies, and understanding how to design a scalable online retail data platform. The chapter equips you with the tools to enhance efficiency, scalability, and performance in data processing pipelines.
Chapter 8, Detecting and Handling Missing Values and Outliers, delves into techniques for identifying and managing missing values and outliers. It covers a range of methods, from statistical approaches to advanced ML models, to address these issues effectively. The key areas of focus include detecting and handling missing data, identifying univariate and multivariate outliers, and managing outliers in various datasets.
Chapter 9, Normalization and Standardization, covers essential preprocessing techniques such as feature scaling, normalization, and standardization, which ensure that ML models can effectively learn from data. You will explore different techniques, including scaling features to a range, Z-score scaling, and using a robust scaler, to address various data challenges in ML tasks.
Chapter 10, Handling Categorical Features, addresses the importance of managing categorical features, which represent non-numerical information in datasets. You will learn various encoding techniques, including label encoding, one-hot encoding, target encoding, frequency encoding, and binary encoding, to transform categorical data for ML models.
Chapter 11, Consuming Time Series Data, delves into the fundamentals of time series analysis, covering key concepts, methodologies, and applications across various industries. It explains the components and types of time series data, shows how to identify and handle missing values, and presents techniques for analyzing trends and patterns over time. The chapter also addresses dealing with outliers and feature engineering to enhance predictive modeling with time series data.
Chapter 12, Text Preprocessing in the Era of LLMs, focuses on mastering text preprocessing techniques that are essential for optimizing the performance of large language models (LLMs). It covers methods for cleaning text, handling rare words and spelling variations, chunking, and tokenization strategies. Additionally, it addresses the transformation of tokens into embeddings, highlighting the importance of adapting preprocessing approaches to maximize the potential of LLMs.
Chapter 13, Image and Audio Preprocessing with LLMs, examines preprocessing techniques for unstructured data, particularly images and audio, to extract meaningful information. It includes methods for image preprocessing, such as optical character recognition (OCR) and image caption generation with the BLIP model. The chapter also explores audio data handling, including converting audio to text using the Whisper model, providing a comprehensive overview of working with multimedia data in the context of LLMs.