Performing data analysis on a tabular dataset
If you haven't followed the steps in Chapter 4, Ingesting Data and Managing Datasets, to download the snapshot of the Melbourne Housing dataset from Kaggle (https://www.kaggle.com/dansbecker/melbourne-housing-snapshot), please do so before continuing with this section. When you are done, you should have the raw dataset file, melb_data.csv, in the mlfiles container in your storage account, and that container should be connected to a datastore called mldemoblob in your Azure Machine Learning workspace.
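Before relying on the copy in the datastore, it can help to sanity-check a local copy of melb_data.csv. The following is a minimal sketch that uses only the Python standard library. It assumes the file sits in the current working directory, and summarize_csv is a hypothetical helper name chosen for illustration; it is not part of the book's notebook or of any Azure SDK.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_csv(path):
    """Return (row_count, column_names, empty_cells_per_column) for a CSV file."""
    empty_cells = Counter()
    row_count = 0
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        for row in reader:
            row_count += 1
            for name in columns:
                value = row.get(name)
                # Count cells that are missing or contain only whitespace.
                if value is None or value.strip() == "":
                    empty_cells[name] += 1
    return row_count, columns, dict(empty_cells)


if __name__ == "__main__":
    # Assumption: melb_data.csv is in the current working directory.
    local_copy = Path("melb_data.csv")
    if local_copy.exists():
        rows, columns, empty = summarize_csv(local_copy)
        print(f"{rows} rows, {len(columns)} columns")
        print("columns with empty cells:", empty)
    else:
        print(f"{local_copy} not found; download it from Kaggle first")
```

If the row count is zero or the column list looks wrong, the download is probably incomplete and is worth repeating before you upload the file to the mlfiles container.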
In the following sections, we will explore the dataset, compute some basic statistics, find missing values and outliers, find correlations between features, and take an initial measurement of feature importance using a random forest model, as we saw in the Visualizing feature and label dependency for classification section of this chapter. You can either create a new Jupyter notebook and follow along with this book or open the 06_ dataprep_melbhousing...