In this article, we take a look at the common Iris dataset using simple statistical methods. Then we create a simple Julia project to load and save data from the Iris dataset.
To start, we'll load the Iris flowers dataset from the RDatasets package and manipulate it using standard data analysis functions. Then we'll look more closely at the data by employing common visualization techniques. Finally, we'll see how to persist and (re)load our data.
Before we can do that, we need to take a look at some of the language's most important building blocks.
Here are the external packages used in this tutorial and their specific versions:
CSV@v0.4.3
DataFrames@v0.15.2
Feather@v0.5.1
Gadfly@v1.0.1
IJulia@v1.14.1
JSON@v0.20.0
RDatasets@v0.6.1
In order to install a specific version of a package you need to run:
pkg> add PackageName@vX.Y.Z
For example:
pkg> add IJulia@v1.14.1
Alternatively, you can install all the packages used here by downloading the Project.toml file and then running pkg> instantiate, as follows:
julia> download("https://raw.githubusercontent.com/PacktPublishing/Julia-Programming-Projects/master/Chapter02/Project.toml", "Project.toml")
pkg> activate .
pkg> instantiate
Now that it's clear how the data is structured and what is contained in the collection, we can get a better understanding by looking at some basic stats.
To get us started, let's invoke the describe function:
julia> describe(iris)
This function summarizes the columns of the iris DataFrame. If the columns contain numerical data (such as SepalLength), it will compute the minimum, median, mean, and maximum. The number of missing and unique values is also included. The last column of the summary reports the type of data stored in each DataFrame column.
A few other stats are available, including the 25th and the 75th percentile, and the first and the last values. We can ask for them by passing an extra stats argument, in the form of an array of symbols:
julia> describe(iris, stats=[:q25, :q75, :first, :last])
Any combination of stats labels is accepted. The full list of options is :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing.
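To make these labels more concrete, here is a minimal sketch of roughly what the numeric ones correspond to, computed by hand with the Statistics standard library. The sample vector is hypothetical and stands in for a single numeric column; this is an illustration of the quantities, not the code that describe runs internally.

```julia
using Statistics

# Hypothetical sample values, standing in for one numeric column.
sample = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0]

# Each field mirrors one of the numeric labels accepted by `describe`.
stats = (
    mean   = mean(sample),
    std    = std(sample),
    min    = minimum(sample),
    q25    = quantile(sample, 0.25),
    median = median(sample),
    q75    = quantile(sample, 0.75),
    max    = maximum(sample),
    first  = first(sample),
    last   = last(sample),
)

# Print one label per line, padded so the values line up.
for (label, value) in pairs(stats)
    println(rpad(string(label), 8), value)
end
```

The quartiles always bracket the median, which is a quick sanity check when reading a describe summary.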
In order to get all the stats, the special :all value is accepted:
julia> describe(iris, stats=:all)
We can also compute these individually by using Julia's Statistics package. For example, to calculate the mean of the SepalLength column, we'll execute the following:
julia> using Statistics
julia> mean(iris[:SepalLength])
5.843333333333334
In this example, we use iris[:SepalLength] to select the whole column. The result, not at all surprisingly, is the same as that returned by the corresponding describe() value.
In a similar way we can compute the median():
julia> median(iris[:SepalLength])
5.8
There is a lot more available, such as the standard deviation, std():
julia> std(iris[:SepalLength])
0.828066127977863
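If you are curious what these functions compute, the following sketch reproduces the mean and the standard deviation by hand and compares them with the library results. The input vector is hypothetical. Note that std() uses the sample formula (dividing by n - 1) by default.

```julia
using Statistics

# Hypothetical measurements; any numeric vector works.
x = [5.1, 4.9, 4.7, 4.6, 5.0]

n = length(x)
m = sum(x) / n                           # arithmetic mean
s = sqrt(sum((x .- m) .^ 2) / (n - 1))   # sample standard deviation

# The hand-rolled values should agree with the library functions.
println(m ≈ mean(x))
println(s ≈ std(x))
```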
Or, we can use another function from the Statistics package, cor(), in a simple script to help us understand how the values are correlated:
julia> for x in names(iris)[1:end-1]
           for y in names(iris)[1:end-1]
               println("$x \t $y \t $(cor(iris[x], iris[y]))")
           end
           println("-------------------------------------------")
       end
Executing this snippet will produce the following output:
SepalLength     SepalLength     1.0
SepalLength     SepalWidth      -0.11756978413300191
SepalLength     PetalLength     0.8717537758865831
SepalLength     PetalWidth      0.8179411262715759
------------------------------------------------------------
SepalWidth      SepalLength     -0.11756978413300191
SepalWidth      SepalWidth      1.0
SepalWidth      PetalLength     -0.42844010433053953
SepalWidth      PetalWidth      -0.3661259325364388
------------------------------------------------------------
PetalLength     SepalLength     0.8717537758865831
PetalLength     SepalWidth      -0.42844010433053953
PetalLength     PetalLength     1.0
PetalLength     PetalWidth      0.9628654314027963
------------------------------------------------------------
PetalWidth      SepalLength     0.8179411262715759
PetalWidth      SepalWidth      -0.3661259325364388
PetalWidth      PetalLength     0.9628654314027963
PetalWidth      PetalWidth      1.0
------------------------------------------------------------
The script iterates over each column of the dataset with the exception of Species (the last column, which is not numeric), and generates a basic correlation table. The table shows strong positive correlations between SepalLength and PetalLength (a coefficient of about 0.87), SepalLength and PetalWidth (about 0.82), and PetalLength and PetalWidth (about 0.96). There is no strong correlation between SepalLength and SepalWidth (about -0.12).
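As an aside, cor() also accepts a matrix, in which case it treats every column as a variable and returns the whole table of pairwise coefficients in one call. The sketch below uses a small hypothetical matrix rather than the Iris data, so the numbers it prints are not those from the table above.

```julia
using Statistics

# Hypothetical data: rows are observations, columns are variables.
X = [5.1 3.5 1.4;
     4.9 3.0 1.4;
     6.3 3.3 6.0;
     5.8 2.7 5.1;
     7.0 3.2 4.7]

# With a matrix argument, `cor` returns the pairwise Pearson
# coefficients between all columns as a square matrix.
R = cor(X)

println(size(R))
# Each entry matches the two-vector form used in the loop.
println(R[1, 3] ≈ cor(X[:, 1], X[:, 3]))
```

The resulting matrix is symmetric with ones on the diagonal, which is why the looped output repeats every value twice.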
We can use the same script, but this time employ the cov() function to compute the covariance of the values in the dataset:
julia> for x in names(iris)[1:end-1]
           for y in names(iris)[1:end-1]
               println("$x \t $y \t $(cov(iris[x], iris[y]))")
           end
           println("--------------------------------------------")
       end
This code will generate the following output:
SepalLength     SepalLength     0.6856935123042507
SepalLength     SepalWidth      -0.04243400447427293
SepalLength     PetalLength     1.2743154362416105
SepalLength     PetalWidth      0.5162706935123043
-------------------------------------------------------
SepalWidth      SepalLength     -0.04243400447427293
SepalWidth      SepalWidth      0.189979418344519
SepalWidth      PetalLength     -0.3296563758389262
SepalWidth      PetalWidth      -0.12163937360178968
-------------------------------------------------------
PetalLength     SepalLength     1.2743154362416105
PetalLength     SepalWidth      -0.3296563758389262
PetalLength     PetalLength     3.1162778523489933
PetalLength     PetalWidth      1.2956093959731543
-------------------------------------------------------
PetalWidth      SepalLength     0.5162706935123043
PetalWidth      SepalWidth      -0.12163937360178968
PetalWidth      PetalLength     1.2956093959731543
PetalWidth      PetalWidth      0.5810062639821031
-------------------------------------------------------
The output illustrates that SepalLength is positively related to PetalLength and PetalWidth, while being negatively related to SepalWidth. SepalWidth is negatively related to all the other values.
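Covariance and correlation are closely linked: dividing a covariance by the two standard deviations gives the correlation coefficient. The sketch below checks this using only numbers copied from the covariance output above (the variable names are ours), and the result should land very close to the 0.8717 reported earlier for SepalLength and PetalLength.

```julia
# Values copied from the covariance output above.
cov_sl_pl = 1.2743154362416105   # cov(SepalLength, PetalLength)
var_sl    = 0.6856935123042507   # cov(SepalLength, SepalLength), the variance
var_pl    = 3.1162778523489933   # cov(PetalLength, PetalLength), the variance

# Dividing by both standard deviations rescales the covariance to [-1, 1].
r = cov_sl_pl / (sqrt(var_sl) * sqrt(var_pl))

println(r)
```

This is also why covariances are harder to compare with one another than correlations: their magnitude depends on the scale of the columns involved.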
Moving on, if we want a random data sample, we can ask for it like this:
julia> rand(iris[:SepalLength])
7.4
Optionally, we can pass in the number of values to be sampled:
julia> rand(iris[:SepalLength], 5)
5-element Array{Float64,1}:
 6.9
 5.8
 6.7
 5.0
 5.6
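One detail worth knowing is that rand(collection, n) draws each value independently, so the same element can be picked more than once. The following sketch, using a hypothetical vector in place of the SepalLength column, contrasts that with a draw without repetition based on shuffle from the Random standard library. The seed value is arbitrary and only there to make reruns repeatable.

```julia
using Random

# Hypothetical values standing in for the SepalLength column.
pool = [5.1, 4.9, 4.7, 4.6, 5.0]

Random.seed!(42)   # arbitrary seed, for repeatability only

# Independent draws: the same element may appear more than once.
with_replacement = rand(pool, 3)

# Shuffle a copy and keep the first three: no element is picked twice.
without_replacement = shuffle(pool)[1:3]

println(with_replacement)
println(without_replacement)
```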
We can convert one of the columns to an array using the following:
julia> sepallength = Array(iris[:SepalLength])
150-element Array{Float64,1}:
 5.1
 4.9
 4.7
 4.6
 5.0
# ... output truncated ...
Or we can convert the whole DataFrame to a matrix:
julia> irisarr = convert(Array, iris[:,:])
150×5 Array{Any,2}:
 5.1  3.5  1.4  0.2  CategoricalString{UInt8} "setosa"
 4.9  3.0  1.4  0.2  CategoricalString{UInt8} "setosa"
 4.7  3.2  1.3  0.2  CategoricalString{UInt8} "setosa"
 4.6  3.1  1.5  0.2  CategoricalString{UInt8} "setosa"
 5.0  3.6  1.4  0.2  CategoricalString{UInt8} "setosa"
# ... output truncated ...
Julia comes with excellent facilities for reading and storing data out of the box. Given its focus on data science and scientific computing, support for tabular-file formats (CSV, TSV) is first class.
Let's extract some data from our initial dataset and use it to practice persistence and retrieval from various backends.
We can reference a section of a DataFrame by defining its bounds through the corresponding columns and rows. For example, we can define a new DataFrame composed only of the PetalLength and PetalWidth columns and the first three rows:
julia> iris[1:3, [:PetalLength, :PetalWidth]]
3×2 DataFrames.DataFrame
│ Row │ PetalLength │ PetalWidth │
├─────┼─────────────┼────────────┤
│ 1   │ 1.4         │ 0.2        │
│ 2   │ 1.4         │ 0.2        │
│ 3   │ 1.3         │ 0.2        │
The generic indexing notation is dataframe[rows, cols], where rows can be a number, a range, or an Array of boolean values where true indicates that the row should be included:
julia> iris[trues(150), [:PetalLength, :PetalWidth]]
This snippet will select all the 150 rows since trues(150) constructs an array of 150 elements that are all initialized as true. The same logic applies to cols, with the added benefit that they can also be accessed by name.
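The same Boolean-mask idea works on plain Julia arrays, which makes it easy to experiment with outside of a DataFrame. Here is a minimal sketch on a small hypothetical matrix. The threshold of 1.35 is arbitrary and chosen only so that one row is dropped.

```julia
# Hypothetical 4×2 table: rows are observations, columns are measurements.
data = [1.4 0.2;
        1.4 0.2;
        1.3 0.2;
        4.7 1.4]

# A Boolean mask with one entry per row; `true` keeps the row.
mask = data[:, 1] .> 1.35

selected = data[mask, :]
println(selected)

# `trues(n)` builds an all-true mask, so every row is kept.
println(data[trues(4), :] == data)
```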
Armed with this knowledge, let's take a sample from our original dataset. It will include some 10% of the initial data and only the PetalLength, PetalWidth, and Species columns:
julia> test_data = iris[rand(150) .<= 0.1, [:PetalLength, :PetalWidth, :Species]]
10×3 DataFrames.DataFrame
│ Row │ PetalLength │ PetalWidth │ Species      │
├─────┼─────────────┼────────────┼──────────────┤
│ 1   │ 1.1         │ 0.1        │ "setosa"     │
│ 2   │ 1.9         │ 0.4        │ "setosa"     │
│ 3   │ 4.6         │ 1.3        │ "versicolor" │
│ 4   │ 5.0         │ 1.7        │ "versicolor" │
│ 5   │ 3.7         │ 1.0        │ "versicolor" │
│ 6   │ 4.7         │ 1.5        │ "versicolor" │
│ 7   │ 4.6         │ 1.4        │ "versicolor" │
│ 8   │ 6.1         │ 2.5        │ "virginica"  │
│ 9   │ 6.9         │ 2.3        │ "virginica"  │
│ 10  │ 6.7         │ 2.0        │ "virginica"  │
What just happened here? The key to this piece of code is rand(150) .<= 0.1, which does a lot. First, it generates an array of 150 random Float values between 0 and 1. Then, it compares the array, element-wise, against 0.1, so each element is true with a probability of about 10%. Finally, the resulting Boolean array is used to select the corresponding rows from the dataset. It's really impressive how powerful and succinct Julia can be!
In my case, the result is a DataFrame with the preceding 10 rows, but your data will be different since we're picking random rows (and it's quite possible you won't have exactly 10 rows either).
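If you need a sample of an exact size instead, one option is to shuffle the row indices and keep the first few. The sketch below shows both approaches on bare row indices, so it does not need the dataset to be loaded. The seed and the sample size of 15 (10% of 150) are our own choices for illustration.

```julia
using Random

n_rows = 150      # number of rows in the Iris dataset
Random.seed!(1)   # arbitrary seed, for repeatability only

# Random mask: each row is kept independently with probability 0.1,
# so the number of selected rows varies from run to run.
mask = rand(n_rows) .<= 0.1
println(count(mask))

# Exact-size alternative: shuffle the row indices and keep the first 15.
sample_rows = sort(randperm(n_rows)[1:15])
println(length(sample_rows))
```

Either mask or sample_rows can then be used in the rows position of the dataframe[rows, cols] notation.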
We can easily save this data to a file in a tabular file format (one of CSV, TSV, and others) using the CSV package. We'll have to add it first and then call the write method:
pkg> add CSV
julia> using CSV
julia> CSV.write("test_data.csv", test_data)
And, just as easily, we can read back the data from tabular file formats, with the corresponding CSV.read function:
julia> td = CSV.read("test_data.csv")
10×3 DataFrames.DataFrame
│ Row │ PetalLength │ PetalWidth │ Species      │
├─────┼─────────────┼────────────┼──────────────┤
│ 1   │ 1.1         │ 0.1        │ "setosa"     │
│ 2   │ 1.9         │ 0.4        │ "setosa"     │
│ 3   │ 4.6         │ 1.3        │ "versicolor" │
│ 4   │ 5.0         │ 1.7        │ "versicolor" │
│ 5   │ 3.7         │ 1.0        │ "versicolor" │
│ 6   │ 4.7         │ 1.5        │ "versicolor" │
│ 7   │ 4.6         │ 1.4        │ "versicolor" │
│ 8   │ 6.1         │ 2.5        │ "virginica"  │
│ 9   │ 6.9         │ 2.3        │ "virginica"  │
│ 10  │ 6.7         │ 2.0        │ "virginica"  │
CSV.write uses a comma as the delimiter by default. For other tabular formats such as TSV, pass the delimiter explicitly through the delim keyword argument (for example, delim='\t'), both when writing and when reading.
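To get a feel for what such a round trip involves, here is a hand-rolled sketch that uses only Base Julia. It is not how the CSV package works internally, and it ignores quoting and escaping, which the package handles for us. The rows are hypothetical and mirror the three columns of test_data.

```julia
# Hypothetical rows, mirroring the three columns of test_data.
header = ["PetalLength", "PetalWidth", "Species"]
rows = [(1.1, 0.1, "setosa"), (4.6, 1.3, "versicolor"), (6.1, 2.5, "virginica")]

# Write into a fresh temporary directory so nothing is overwritten.
path = joinpath(mktempdir(), "test_data.csv")

# Write: one comma-separated line per row, header first.
open(path, "w") do io
    println(io, join(header, ','))
    for row in rows
        println(io, join(row, ','))
    end
end

# Read: split each line and parse the numeric fields back into Float64.
lines = readlines(path)
restored = map(lines[2:end]) do line
    length_str, width_str, species = split(line, ',')
    (parse(Float64, length_str), parse(Float64, width_str), String(species))
end

println(restored == rows)
```

Notice that the file itself stores only text, so the reader has to decide which fields are numbers. Inferring those column types is one of the things CSV.read does automatically.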
Feather is a binary file format that was specially designed for storing data frames. It is fast, lightweight, and language-agnostic. The project was initially started in order to make it possible to exchange data frames between R and Python. Soon, other languages added support for it, including Julia.
Support for Feather files does not come out of the box, but is made available through the package of the same name. Let's go ahead and add it and then bring it into scope:
pkg> add Feather
julia> using Feather
Now, saving our DataFrame is just a matter of calling Feather.write:
julia> Feather.write("test_data.feather", test_data)
Next, let's try the reverse operation and load back our Feather file. We'll use the counterpart read function:
julia> Feather.read("test_data.feather")
10×3 DataFrames.DataFrame
│ Row │ PetalLength │ PetalWidth │ Species      │
├─────┼─────────────┼────────────┼──────────────┤
│ 1   │ 1.1         │ 0.1        │ "setosa"     │
│ 2   │ 1.9         │ 0.4        │ "setosa"     │
│ 3   │ 4.6         │ 1.3        │ "versicolor" │
│ 4   │ 5.0         │ 1.7        │ "versicolor" │
│ 5   │ 3.7         │ 1.0        │ "versicolor" │
│ 6   │ 4.7         │ 1.5        │ "versicolor" │
│ 7   │ 4.6         │ 1.4        │ "versicolor" │
│ 8   │ 6.1         │ 2.5        │ "virginica"  │
│ 9   │ 6.9         │ 2.3        │ "virginica"  │
│ 10  │ 6.7         │ 2.0        │ "virginica"  │
Yeah, that's our sample data all right!
Let's also take a look at using a NoSQL backend for persisting and retrieving our data.
In order to follow through this part, you'll need a working MongoDB installation. You can download and install the correct version for your operating system from the official website, at https://www.mongodb.com/download-center?jmp=nav#community. I will use a Docker image which I installed and started up through Docker's Kitematic (available for download at https://github.com/docker/kitematic/releases).
Next, we need to make sure to add the Mongo package. The package also has a dependency on LibBSON, which is automatically added. LibBSON is used for handling BSON, which stands for Binary JSON, a binary-encoded serialization of JSON-like documents. While we're at it, let's add the JSON package as well; we will need it. I'm sure you know how to do that by now, but if not, here is a reminder:
pkg> add Mongo, JSON
Easy! Let's let Julia know that we'll be using all these packages:
julia> using Mongo, LibBSON, JSON
We're now ready to connect to MongoDB:
julia> client = MongoClient()
Once successfully connected, we can reference a dataframes collection in the db database:
julia> storage = MongoCollection(client, "db", "dataframes")
Julia's MongoDB interface uses dictionaries (a data structure called Dict in Julia) to communicate with the server. For now, all we need to do is to convert our DataFrame to such a Dict. The simplest way to do it is to sequentially serialize and then deserialize the DataFrame by using the JSON package. It generates a nice structure that we can later use to rebuild our DataFrame:
julia> datadict = JSON.parse(JSON.json(test_data))
Thinking ahead, to make any future data retrieval simpler, let's add an identifier to our dictionary:
julia> datadict["id"] = "iris_test_data"
Now we can insert it into Mongo:
julia> insert(storage, datadict)
In order to retrieve it, all we have to do is query the Mongo database using the "id" field we've previously configured:
julia> data_from_mongo = first(find(storage, query("id" => "iris_test_data")))
We get a BSONObject, which we need to convert back to a DataFrame. Don't worry, it's straightforward. First, we create an empty DataFrame:
julia> df_from_mongo = DataFrame()
0×0 DataFrames.DataFrame
Then we populate it using the data we retrieved from Mongo:
julia> for i in 1:length(data_from_mongo["columns"])
           df_from_mongo[Symbol(data_from_mongo["colindex"]["names"][i])] = Array(data_from_mongo["columns"][i])
       end
julia> df_from_mongo
10×3 DataFrames.DataFrame
│ Row │ PetalLength │ PetalWidth │ Species      │
├─────┼─────────────┼────────────┼──────────────┤
│ 1   │ 1.1         │ 0.1        │ "setosa"     │
│ 2   │ 1.9         │ 0.4        │ "setosa"     │
│ 3   │ 4.6         │ 1.3        │ "versicolor" │
│ 4   │ 5.0         │ 1.7        │ "versicolor" │
│ 5   │ 3.7         │ 1.0        │ "versicolor" │
│ 6   │ 4.7         │ 1.5        │ "versicolor" │
│ 7   │ 4.6         │ 1.4        │ "versicolor" │
│ 8   │ 6.1         │ 2.5        │ "virginica"  │
│ 9   │ 6.9         │ 2.3        │ "virginica"  │
│ 10  │ 6.7         │ 2.0        │ "virginica"  │
And that's it! Our data has been loaded back into a DataFrame.
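If you don't have a MongoDB instance at hand, the rebuilding step can still be tried out on a plain dictionary. The sketch below assumes a document shaped like the one the loop above reads, with a "columns" array and the column names stored under "colindex" and "names". The values are hypothetical, and we collect the columns into a Dict keyed by Symbol instead of a DataFrame so that no packages are required.

```julia
# Hypothetical document, shaped like the one the loop above reads:
# a "columns" array plus the column names under "colindex" => "names".
doc = Dict(
    "id" => "iris_test_data",
    "columns" => [[1.1, 1.9, 4.6], [0.1, 0.4, 1.3], ["setosa", "setosa", "versicolor"]],
    "colindex" => Dict("names" => ["PetalLength", "PetalWidth", "Species"]),
)

# Pair each column name with its values, keyed by Symbol as in a DataFrame.
rebuilt = Dict{Symbol,Vector}()
for (i, name) in enumerate(doc["colindex"]["names"])
    rebuilt[Symbol(name)] = doc["columns"][i]
end

println(rebuilt[:Species])
```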
In this tutorial, we looked at the Iris dataset and worked on loading and saving the data in a simple Julia project. To learn more about building machine learning recommendation systems in Julia and testing the resulting models, check out the book Julia Programming Projects.