Machine learning use case and dataset
Throughout this book, we will be using examples to demonstrate the best practices that apply across the ML life cycle. For this, we'll focus on a single ML use case and use an open dataset with data relating to the ML use case.
The primary use case we'll explore in this book is predicting air quality readings. Given a location (weather station) and date, we'll try to predict a value for a particular type of air quality measurement (for example, pm25 or o3). We'll treat this as a regression problem and explore XGBoost and neural network-based model approaches.
For this, we'll use a dataset from OpenAQ (https://registry.opendata.aws/openaq/) that includes air quality data from public data sources. The dataset that we will use is the realtime
dataset (https://openaq-fetches.s3.amazonaws.com/index.html) and the realtime-parquet-gzipped
dataset (https://openaq-fetches.s3.amazonaws.com/index.html), which includes daily...