In this article by Ivo Balbaert, author of the book Getting Started with Julia Programming, we will explore how Julia interacts with the outside world, reading from standard input and writing to standard output, files, networks, and databases. Julia provides asynchronous networking I/O using the libuv library. We will see how to handle data in Julia. We will also discover the parallel processing model of Julia.
This article covers working with files, reading and writing CSV files, and using DataFrames.
To work with files, we need the IOStream type. IOStream is a type with the supertype IO and has the following characteristics:
Its field names form a 4-element Array{Symbol,1}: :handle, :ios, :name, and :mark.
The corresponding field types are (Ptr{None}, Array{Uint8,1}, String, Int64).
The file handle is a pointer of the type Ptr, which is a reference to the file object.
Opening and reading a line-oriented file with the name example.dat is very easy:
    # code in Chapter 8\io.jl
    fname = "example.dat"
    f1 = open(fname)
fname is a string that contains the path to the file, escaping special characters with \ when necessary; for example, on Windows, when the file is in the test folder on the D: drive, this would become d:\\test\\example.dat. The f1 variable is now an IOStream(<file example.dat>) object.
To read all lines one after the other in an array, use data = readlines(f1), which returns 3-element Array{Union(ASCIIString,UTF8String),1}:
    "this is line 1.\r\n"
    "this is line 2.\r\n"
    "this is line 3."
For processing line by line, now only a simple loop is needed:
    for line in data
        println(line) # or process line
    end
    close(f1)
Always close the IOStream object to clean and save resources. If you want to read the file into one string, use readall. Use this only for relatively small files because of the memory consumption; this can also be a potential problem when using readlines.
There is a convenient shorthand with the do syntax for opening a file, applying a function process, and closing it automatically. This goes as follows (file is the IOStream object in this code):
    open(fname) do file
        process(file)
    end
The do command creates an anonymous function, and passes it to open. Thus, the previous code example would have been equivalent to open(process, fname). Use the same syntax for processing a file fname line by line without the memory overhead of the previous methods, for example:
    open(fname) do file
        for line in eachline(file)
            print(line) # or process line
        end
    end
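As a self-contained sketch of this pattern, the following snippet creates a small throwaway file and counts its non-empty lines with eachline. The file contents and the count_nonempty helper are illustrative assumptions rather than part of the book's code, and the snippet assumes a recent Julia version (1.x).

```julia
# Hypothetical helper: count the non-empty lines of a file without
# loading the whole file into memory.
function count_nonempty(path)
    n = 0
    open(path) do file              # the file is closed automatically
        for line in eachline(file)  # lines are read one at a time
            if !isempty(strip(line))
                n += 1
            end
        end
    end
    return n
end

# Create a small temporary file to try it on (made-up contents).
path, io = mktemp()
write(io, "this is line 1.\n\nthis is line 3.\n")
close(io)

nonempty = count_nonempty(path)
println(nonempty)
rm(path)
```

Because the do block owns the file handle, the handle is released even if the processing code throws an error.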
Writing a file requires first opening it with a "w" flag, then writing strings to it with write, print, or println, and then closing the file handle, which flushes the IOStream object to the disk:
    fname = "example2.dat"
    f2 = open(fname, "w")
    write(f2, "I write myself to a file\n") # returns 25 (bytes written)
    println(f2, "even with println!")
    close(f2)
Opening a file with the "w" option will clear the file if it exists. To append to an existing file, use "a".
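The difference between the two modes can be checked with a small experiment. This is a minimal sketch that assumes a recent Julia version (1.x); the temporary file and the strings written to it are made up for illustration.

```julia
path = tempname()          # throwaway file name, not from the article

open(path, "w") do f       # "w" creates the file (or clears it)
    println(f, "first")
end
open(path, "w") do f       # opening with "w" again discards "first"
    println(f, "second")
end
open(path, "a") do f       # "a" keeps the existing content and appends
    println(f, "third")
end

lines = readlines(path)
println(lines)
rm(path)
```

Only the text written after the last "w" open survives, followed by whatever was appended afterwards.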
To process all the files in the current folder (or a given folder as an argument to readdir()), use this for loop:
    for file in readdir()
        # process file
    end
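Often only some of the files in a folder are of interest. As a hedged sketch (assuming Julia 1.x; the folder and file names below are invented for the example), readdir can be combined with filter to pick files by extension:

```julia
dir = mktempdir()                          # temporary folder for the demo
for name in ["a.csv", "b.dat", "c.csv"]    # made-up file names
    touch(joinpath(dir, name))             # create empty files
end

# readdir returns names relative to dir, so joinpath is needed
# before opening any of them.
csv_files = filter(name -> endswith(name, ".csv"), readdir(dir))
for name in csv_files
    println(joinpath(dir, name))
end

rm(dir, recursive=true)
```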
A CSV file is a comma-separated file. The data fields in each line are separated by commas "," or another delimiter such as semicolons ";". These files are the de facto standard for exchanging small and medium amounts of tabular data. Such files are structured so that one line contains data about one data object, so we need a way to read and process the file line by line. As an example, we will use the data file Chapter 8\winequality.csv, which contains 1,599 sample measurements with 12 data columns per sample, such as pH and alcohol, separated by semicolons. In the following screenshot, you can see the top 20 rows:
In general, the readdlm function is used to read in the data from the CSV files:
    # code in Chapter 8\csv_files.jl:
    fname = "winequality.csv"
    data = readdlm(fname, ';')
The second argument is the delimiter character (here, it is ;). The resulting data is a 1600x12 Array{Any,2} array of the type Any because no common type could be found:
"fixed acidity" "volatile acidity" "alcohol" "quality"
7.4 0.7 9.4 5.0
7.8 0.88 9.8 5.0
7.8 0.76 9.8 5.0
…
If the data file is comma separated, reading it is even simpler with the following command:
data2 = readcsv(fname)
The problem with what we have done until now is that the headers (the column titles) were read as part of the data. Fortunately, we can pass the argument header=true to let Julia put the first line in a separate array. It then naturally gets the correct datatype, Float64, for the data array. We can also specify the type explicitly, such as this:
    data3 = readdlm(fname, ';', Float64, '\n', header=true)
The third argument here is the type of data, which is a numeric type, String or Any. The next argument is the line separator character, and the fifth indicates whether or not there is a header line with the field (column) names. If so, then data3 is a tuple with the data as the first element and the header as the second, in our case, (1599x12 Array{Float64,2}, 1x12 Array{String,2}). There are other optional arguments to readdlm; see the help option. In this case, the actual data is given by data3[1] and the header by data3[2].
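The header handling can be tried on a tiny file. The sketch below assumes Julia 1.x, where readdlm lives in the DelimitedFiles standard library rather than in Base; the three-column file is made up and only mimics the layout of the wine data.

```julia
using DelimitedFiles   # home of readdlm in Julia 1.x

path = tempname()
open(path, "w") do f
    println(f, "pH;alcohol;quality")   # made-up header line
    println(f, "3.51;9.4;5")
    println(f, "3.20;9.8;5")
end

# With header=true the result is a (data, header) tuple.
nums, header = readdlm(path, ';', Float64, '\n', header=true)
println(size(nums))     # dimensions of the numeric part
println(header)         # 1-row matrix with the column names
rm(path)
```

Destructuring the tuple directly avoids having to index with [1] and [2] later on.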
Let's continue working with the variable data. The data forms a matrix, and we can get the rows and columns of data using the normal array-matrix syntax. For example, the third row is given by row3 = data[3, :] with data: 7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.2 0.68 9.8 5.0, representing the measurements for all the characteristics of a certain wine.
The measurements of a certain characteristic for all wines are given by a data column, for example, col3 = data[ :, 3] represents the measurements of citric acid and returns a column vector 1600-element Array{Any,1}: "citric acid" 0.0 0.0 0.04 0.56 0.0 0.0 … 0.08 0.08 0.1 0.13 0.12 0.47.
If we need columns 2-4 (volatile acidity to residual sugar) for all wines, extract the data with x = data[:, 2:4]. If we need these measurements only for the wines on rows 70-75, get these with y = data[70:75, 2:4], returning a 6 x 3 Array{Any,2} output as follows:
0.32 0.57 2.0
0.705 0.05 1.9
…
0.675 0.26 2.1
To get a matrix with the data from columns 3, 6, and 11, execute the following command:
z = [data[:,3] data[:,6] data[:,11]]
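The indexing forms used in this section can be tried on a small matrix. This is an illustrative sketch with made-up numbers, assuming Julia 1.x (where a row slice such as m[2, :] comes back as a one-dimensional vector):

```julia
m = [1.0  2.0  3.0  4.0;
     5.0  6.0  7.0  8.0;
     9.0 10.0 11.0 12.0]

row2   = m[2, :]             # all columns of row 2
col3   = m[:, 3]             # all rows of column 3
block  = m[2:3, 2:4]         # rows 2-3, columns 2-4
picked = [m[:, 1] m[:, 4]]   # columns 1 and 4 placed side by side

println(row2)
println(col3)
println(block)
println(picked)
```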
It would be useful to create a type Wine in the code.
For example, if the data is to be passed around functions, it will improve the code quality to encapsulate all the data in a single data type, like this:
    type Wine
        fixed_acidity::Array{Float64}
        volatile_acidity::Array{Float64}
        citric_acid::Array{Float64}
        # other fields
        quality::Array{Float64}
    end
Then, we can create objects of this type to work with them, like in any other object-oriented language, for example, wine1 = Wine(data[1, :]...), where the elements of the row are splatted with the ... operator into the Wine constructor.
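The splatting step can be sketched in isolation. Current Julia versions spell the type keyword as struct (or mutable struct), so the example below uses that form; the WineSample name, its three scalar fields, and the sample row are simplifying assumptions for illustration, not the Wine type defined above.

```julia
# Hypothetical, cut-down record with one scalar field per column.
struct WineSample
    fixed_acidity::Float64
    volatile_acidity::Float64
    quality::Float64
end

row = [7.4, 0.7, 5.0]        # made-up measurements for one wine

# row... expands the vector into three positional arguments,
# so this is the same as WineSample(7.4, 0.7, 5.0).
wine1 = WineSample(row...)
println(wine1.quality)
```

The number of elements being splatted has to match the number of fields, otherwise the constructor call raises a MethodError.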
To write to a CSV file, the simplest way is to use the writecsv function for a comma separator, or the writedlm function if you want to specify another separator. For example, to write an array data to a file partial.dat, you need to execute the following command:
writedlm("partial.dat", data, ';')
If more control is necessary, you can easily combine the more basic functions from the previous section. For example, the following code snippet writes 10 tuples of three numbers each to a file:
    # code in Chapter 8\tuple_csv.jl
    fname = "savetuple.csv"
    csvfile = open(fname,"w")
    # writing headers:
    write(csvfile, "ColName A, ColName B, ColName C\n")
    for i = 1:10
        tup(i) = tuple(rand(Float64,3)...)
        write(csvfile, join(tup(i),","), "\n")
    end
    close(csvfile)
If you measure n variables (each of a different type) of a single object of observation, then you get a table with n columns for each object row. If there are m observations, then we have m rows of data. For example, given the student grades as data, you might want to compute the average grade for each socioeconomic group, where grade and socioeconomic group are both columns in the table, and there is one row per student.
The DataFrame is the most natural representation for working with such an (m x n) table of data. DataFrames are similar to pandas DataFrames in Python or data.frame in R. A DataFrame is a more specialized tool than a normal array for working with tabular and statistical data, and it is defined in the DataFrames package, a popular Julia library for statistical work. Install it in your environment by typing in Pkg.add("DataFrames") in the REPL. Then, import it into your current workspace with using DataFrames. Do the same for the packages DataArrays and RDatasets (which contains a collection of example datasets mostly used in the R literature).
A common case in statistical data is that data values can be missing (the information is not known). The DataArrays package provides us with the unique value NA, which represents a missing value, and has the type NAtype. The result of the computations that contain the NA values mostly cannot be determined, for example, 42 + NA returns NA. (Julia v0.4 also has a new Nullable{T} type, which allows you to specify the type of a missing value). A DataArray{T} array is a data structure that can be n-dimensional, behaves like a standard Julia array, and can contain values of the type T, but it can also contain the missing (Not Available) values NA and can work efficiently with them. To construct them, use the @data macro:
    # code in Chapter 8\dataarrays.jl
    using DataArrays
    using DataFrames
    dv = @data([7, 3, NA, 5, 42])
This returns 5-element DataArray{Int64,1}: 7 3 NA 5 42.
The sum of these numbers is given by sum(dv) and returns NA. One can also assign the NA values to the array with dv[5] = NA; then, dv becomes [7, 3, NA, 5, NA]. Converting this data structure to a normal array fails: convert(Array, dv) returns ERROR: NAException.
How to get rid of these NA values, supposing we can do so safely? We can use the dropna function, for example, sum(dropna(dv)) returns 15. If you know that you can replace them with a value v, use the array function:
    repl = -1
    sum(array(dv, repl)) # returns 13
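For readers on a current Julia version (1.x), the same three steps can be sketched without the DataArrays package, using the missing value that is built into Base. Note that this is a different mechanism from the NA value described above: skipmissing and coalesce stand in for dropna and array here, and the numbers simply mirror the dv example.

```julia
dv = [7, 3, missing, 5, 42]        # a Vector{Union{Missing, Int64}}

total = sum(dv)                    # missing propagates through the sum
println(ismissing(total))

dv[5] = missing                    # dv is now [7, 3, missing, 5, missing]

# Skip the missing entries (the role dropna plays for NA).
clean_sum = sum(skipmissing(dv))

# Replace missing entries by a chosen value before summing.
repl = -1
replaced_sum = sum(coalesce.(dv, repl))

println(clean_sum)
println(replaced_sum)
```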
A DataFrame is a kind of in-memory database, versatile in the ways you can work with the data. It consists of columns with names such as Col1, Col2, Col3, and so on. Each of these columns is a DataArray that has its own type, and the data they contain can be referred to by the column names as well, so we have substantially more forms of indexing. Unlike two-dimensional arrays, columns in a DataFrame can be of different types. One column might, for instance, contain the names of students and should therefore be a string. Another column could contain their age and should be an integer.
We construct a DataFrame from the program data as follows:
    # code in Chapter 8\dataframes.jl
    using DataFrames
    # constructing a DataFrame:
    df = DataFrame()
    df[:Col1] = 1:4
    df[:Col2] = [e, pi, sqrt(2), 42]
    df[:Col3] = [true, false, true, false]
    show(df)
Notice that the column headers are used as symbols. This returns the following 4 x 3 DataFrame object:
We could also have used the full constructor as follows:
df = DataFrame(Col1 = 1:4, Col2 = [e, pi, sqrt(2), 42], Col3 = [true, false, true, false])
You can refer to the columns either by an index (the column number) or by a name; both of the following expressions return the same output:
    show(df[2])
    show(df[:Col2])
This gives the following output:
[2.718281828459045, 3.141592653589793, 1.4142135623730951,42.0]
To show the rows or subsets of rows and columns, use the familiar splice (:) syntax, for example:
| Row | Col1 | Col2 | Col3 |
|-----|------|---------|------|
| 1 | 1 | 2.71828 | true |
2x2 DataFrame
| Row | Col2 | Col3 |
|-----|---------|-------|
| 1 | 3.14159 | false |
| 2 | 1.41421 | true |
The following functions are very useful when working with DataFrames:
Col2
Min 1.4142135623730951
1st Qu. 2.392264761937558
Median 2.929937241024419
Mean 12.318522011105483
3rd Qu. 12.856194490192344
Max 42.0
NAs 0
NA% 0.0%
To load in data from a local CSV file, use the method readtable. The returned object is of type DataFrame:
    # code in Chapter 8\dataframes.jl
    using DataFrames
    fname = "winequality.csv"
    data = readtable(fname, separator = ';')
    typeof(data) # DataFrame
    size(data) # (1599,12)
Here is a fraction of the output:
The readtable method also supports reading in gzipped CSV files.
Writing a DataFrame to a file can be done with the writetable function, which takes the filename and the DataFrame as arguments, for example, writetable("dataframe1.csv", df). By default, writetable will use the delimiter specified by the filename extension and write the column names as headers.
Both readtable and writetable support numerous options for special cases. Refer to the docs at http://dataframesjl.readthedocs.org/en/latest/ for more information. To demonstrate some of the power of DataFrames, here are some queries you can do:
Here, we use the .== operator, which does element-wise comparison. data[:alcohol] .== 9.5 returns an array of Boolean values (true for datapoints, where :alcohol is 9.5, and false otherwise). data[boolean_array, : ] selects those rows where boolean_array is true.
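The same masking idea works on plain arrays, which makes it easy to experiment with. The following sketch assumes Julia 1.x and uses two short made-up vectors in place of the :alcohol and :quality columns:

```julia
alcohol = [9.4, 9.5, 9.8, 9.5]   # made-up values
quality = [5, 6, 5, 7]

mask = alcohol .== 9.5           # element-wise comparison, one Bool per entry
selected = quality[mask]         # keeps the entries where mask is true

println(mask)
println(selected)
```

The mask and the indexed array must have the same length, just as boolean_array needs one entry per row of the DataFrame.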
6x2 DataFrame
| Row | quality | x1 |
|-----|---------|-----|
| 1 | 3 | 10 |
| 2 | 4 | 53 |
| 3 | 5 | 681 |
| 4 | 6 | 638 |
| 5 | 7 | 199 |
| 6 | 8 | 18 |
The DataFrames package contains the by function, which takes in three arguments:
Another easy way to get the distribution among quality is to call the hist function: hist(data[:quality]) gives the counts over the range of quality as (2.0:1.0:8.0,[10,53,681,638,199,18]). More precisely, this is a tuple with the first element corresponding to the edges of the histogram bins, and the second denoting the number of items in each bin. So there are, for example, 10 wines with quality between 2 and 3, and so on.
To extract the counts as a variable count of type Vector, we can execute _, count = hist(data[:quality]); the _ means that we neglect the first element of the tuple. To obtain the quality classes as a DataArray class, we will execute the following:
class = sort(unique(data[:quality]))
We can now construct a df_quality DataFrame with the class and count columns as df_quality = DataFrame(qual=class, no=count). This gives the following output:
6x2 DataFrame
| Row | qual | no |
|-----|------|-----|
| 1 | 3 | 10 |
| 2 | 4 | 53 |
| 3 | 5 | 681 |
| 4 | 6 | 638 |
| 5 | 7 | 199 |
| 6 | 8 | 18 |
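The class and count vectors can also be derived by hand, which may help on Julia versions where a hist function is not available in Base. This is a minimal sketch assuming Julia 1.x; the short quality vector is made up and the variable names are illustrative.

```julia
quality = [5, 5, 6, 3, 6, 5, 7]        # made-up quality scores

# Tally how often each score occurs.
counts = Dict{Int,Int}()
for q in quality
    counts[q] = get(counts, q, 0) + 1
end

classes = sort(collect(keys(counts)))  # the distinct scores, ascending
no = [counts[c] for c in classes]      # matching counts, same order

println(classes)
println(no)
```

The two resulting vectors line up element by element, which is the shape needed for the qual and no columns above.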
To deepen your understanding and learn about the other features of Julia DataFrames (such as joining, reshaping, and sorting), refer to the documentation available at http://dataframesjl.readthedocs.org/en/latest/.
Julia can work with other human-readable file formats through specialized packages:
In this article we discussed the basics of network programming in Julia.