Downloading Pima Diabetes data for supervised classification
In this recipe, we download and inspect the Pima Diabetes dataset from the UCI Machine Learning Repository. We will use the dataset later with Spark's streaming logistic regression algorithm.
How to do it...
You will need one of the following command-line tools, curl or wget, to retrieve the specified data:
- You can start by downloading the dataset using either of the following two commands. The first command is as follows:
curl http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data -o pima-indians-diabetes.data
This is an alternative that you can use:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data -O pima-indians-diabetes.data
- Now we take our first step in data exploration by looking at how the data in pima-indians-diabetes.data is formatted (from a Mac or Linux terminal):
head -5 pima-indians-diabetes.data
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31...
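Beyond looking at the first few rows, it can help to confirm that every row has the same number of comma-separated fields before handing the file to Spark. The following is a minimal sketch of such a check, not part of the original recipe. The helper names count_fields and label_counts are hypothetical, and the sketch assumes the last column holds the class label, as is usual for this dataset:

```shell
#!/bin/sh
# Hypothetical helper: for each distinct field count found in the file,
# print "<fields> <rows>". A clean CSV file yields a single line.
count_fields() {
    awk -F',' 'NF > 0 { counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1"
}

# Hypothetical helper: tally the values in the last column, which we
# assume to be the class label. Prints "<label> <rows>" per label.
label_counts() {
    awk -F',' 'NF > 0 { labels[$NF]++ } END { for (l in labels) print l, labels[l] }' "$1"
}

# Run the checks only if the downloaded file is present in this directory.
if [ -f pima-indians-diabetes.data ]; then
    count_fields pima-indians-diabetes.data
    label_counts pima-indians-diabetes.data
fi
```

If count_fields prints more than one line, some rows are malformed and should be examined before training.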