Practicing ML in Databricks
In this section, we will run a simple ML experiment in Databricks using Spark ML. We will begin with exploratory data analysis (EDA) to understand the dataset, and then train a model with the decision tree algorithm. Decision trees can be used for both classification and regression; here we will use one for a classification problem.
We will be using data from the DataSF project, which was launched in 2009 and hosts hundreds of datasets from the city of San Francisco. The dataset we will use comes from San Francisco's fire department and records all the calls made to the department and the responses to them.
We will divide this section into three phases, as follows:
- Environment setup: Setting up a Spark cluster and getting the data
- EDA: Analyzing the dataset by answering questions to enhance data understanding
- ML: Using the decision tree algorithm to address a classification problem
Environment setup
Let&apos...