Data arrays and data frames
Users of R will be aware of the success of data frames in analyzing datasets, a success that has been mirrored in Python by the pandas package.
Julia too adds data frame support through the DataFrames package.
Working with the package relies on three basic types, as follows:
- Missing, with its single instance missing: An indicator that a data value is missing (part of Base Julia as of version 1.0)
- DataArray: An extension to the Array type that can contain missing values (superseded in current releases by ordinary arrays with element type Union{Missing, T})
- DataFrame: A data structure for representing tabular datasets
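As a minimal sketch of how missing values behave in current Julia, the following uses only Base, with no package loaded. The variable name costs is illustrative and simply mirrors the Cost column used in the data frame example.

```julia
# Base Julia only: no packages are needed for `missing` itself.
costs = [10.1, 7.9, missing, 4.5]

# The element type is a Union, which is how an array holds missing values.
println(eltype(costs))

# `missing` is the single instance of the type `Missing`.
println(typeof(missing))
println(ismissing(costs[3]))

# Arithmetic propagates missingness rather than raising an error,
# so a single missing entry makes the whole sum missing.
println(costs[3] + 1.0)
println(sum(costs))
```

Because missingness propagates, any reduction over a column with a missing entry needs the missing values to be handled explicitly first.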
It is such a large topic that we will be looking at data frames in some depth when we consider statistical computing.
However, here’s some code to get a flavor of processing data with these packages:
julia> using DataFrames

julia> df1 = DataFrame(ID = 1:4, Cost = [10.1,7.9,missing,4.5])
4×2 DataFrame
│ Row │ ID │ Cost    │
├─────┼────┼─────────┤
│ 1   │ 1  │ 10.1    │
│ 2   │ 2  │ 7.9     │
│ 3   │ 3  │ missing │
│ 4   │ 4  │ 4.5     │
Common operations, such as computing the mean or variance of the Cost column, return missing because of the missing value in row 3:
julia> using Statistics

julia> mean(df1[!,:Cost])
missing
We can create a new data frame by dropping ALL rows with missing values, and now statistical functions can be applied as normal:
julia> df2 = dropmissing(df1)
3×2 DataFrames.DataFrame
│ Row │ ID │ Cost │
├─────┼────┼──────┤
│ 1   │ 1  │ 10.1 │
│ 2   │ 2  │ 7.9  │
│ 3   │ 4  │ 4.5  │

julia> (μ,σ) = (mean(df2[!,:Cost]),std(df2[!,:Cost]))
(7.5, 2.8213471959331766)
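When a new data frame is not needed, the same statistics can be obtained by skipping the missing entries at the point of computation. The following is a sketch under the assumption that a plain vector (the hypothetical costs) stands in for the Cost column; it uses only Base and the Statistics standard library.

```julia
using Statistics  # standard library; provides mean and std

# A plain vector standing in for the Cost column (illustrative name).
costs = [10.1, 7.9, missing, 4.5]

# skipmissing returns a lazy iterator over the non-missing values;
# the original vector is neither copied nor modified.
μ = mean(skipmissing(costs))
σ = std(skipmissing(costs))

# collect materializes the iterator when a plain Vector{Float64} is needed.
clean = collect(skipmissing(costs))

println((μ, σ))
println(clean)
```

The trade-off is that skipmissing works on one column at a time, whereas dropmissing removes whole rows and so keeps the remaining columns aligned.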
We will cover data frames in much greater detail when we consider data I/O in Chapter 6.
At this time, we will look at the Tables API, implemented in the Tables.jl package, which is used by a large number of packages.
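To give a first impression of that interface, here is a small sketch that assumes the Tables.jl package is installed. It relies on the fact that a named tuple of equal-length vectors is accepted as a column-oriented table; the names coltable, rowtable, and roundtrip are illustrative only.

```julia
using Tables  # assumes the Tables.jl package has been added

# A NamedTuple of equal-length vectors acts as a column-oriented table.
coltable = (ID = [1, 2, 4], Cost = [10.1, 7.9, 4.5])

# Check that the object satisfies the Tables interface and list its columns.
println(Tables.istable(coltable))
println(Tables.columnnames(Tables.columns(coltable)))

# Tables.rows gives a row iterator regardless of how the source is stored.
for row in Tables.rows(coltable)
    println(Tables.getcolumn(row, :ID), " => ", Tables.getcolumn(row, :Cost))
end

# Tables.rowtable converts to a vector of NamedTuples (one per row);
# Tables.columntable converts back to a NamedTuple of vectors.
rowtable = Tables.rowtable(coltable)
roundtrip = Tables.columntable(rowtable)
println(roundtrip == coltable)
```

Code written against these accessor functions does not need to know the concrete type of its input, which is what allows many packages to exchange tabular data.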