You're reading from Machine Learning With Go Implement Regression, Classification, Clustering, Time-series Models, Neural Networks, and More using the Go Programming Language

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781785882104

Length 304 pages

Edition 1st Edition

Concepts

Machine Learning

Author (1):

Joseph Langstaff Whitenack

CSV files

CSV files might not be a go-to format for big data, but as a data scientist or developer working in machine learning, you are sure to encounter this format. You might need a mapping of zip codes to latitude/longitude and find this as a CSV file on the internet, or you may be given sales figures from your sales team in a CSV format. In any event, we need to understand how to parse these files.

The main package that we will use to parse CSV files is encoding/csv from Go's standard library. However, we will also discuss a couple of packages that allow us to quickly manipulate or transform CSV data: github.com/kniren/gota/dataframe and go-hep.org/x/hep/csvutil.

Reading in CSV data from a file

Let's consider a simple CSV file, which we will return to later, named iris.csv (available here: https://archive.ics.uci.edu/ml/datasets/iris). This CSV file includes four float columns of flower measurements and a string column with the corresponding flower species:

$ head iris.csv 
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

With encoding/csv imported, we first open the CSV file and create a CSV reader value:

// Open the iris dataset file.
f, err := os.Open("../data/iris.csv")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

// Create a new CSV reader reading from the opened file.
reader := csv.NewReader(f)

Then we can read in all of the records (corresponding to rows) of the CSV file. These records are imported as [][]string:

// Assume we don't know the number of fields per line. By setting
// FieldsPerRecord negative, each row may have a variable
// number of fields.
reader.FieldsPerRecord = -1

// Read in all of the CSV records.
rawCSVData, err := reader.ReadAll()
if err != nil {
    log.Fatal(err)
}
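
To see the shape of the [][]string value that ReadAll returns, the following minimal sketch runs the same steps against an in-memory string instead of a file. The sampleCSV constant and the readAllRecords helper are hypothetical names introduced only for this illustration:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// sampleCSV is a hypothetical in-memory stand-in for the first rows of iris.csv.
const sampleCSV = `5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
`

// readAllRecords parses every record from the CSV text in s.
// It is an illustrative helper, not part of encoding/csv.
func readAllRecords(s string) ([][]string, error) {
	reader := csv.NewReader(strings.NewReader(s))

	// Allow a variable number of fields per row, as in the listing above.
	reader.FieldsPerRecord = -1

	return reader.ReadAll()
}

func main() {
	records, err := readAllRecords(sampleCSV)
	if err != nil {
		log.Fatal(err)
	}

	// Each record is a []string holding the fields of one row.
	for i, record := range records {
		fmt.Printf("row %d: %d fields, last field %q\n", i, len(record), record[len(record)-1])
	}
}
```

Every field arrives as a string, including the numeric measurements, so any conversion to float64 has to happen in a later step.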

We can also read in records one at a time in an infinite loop. Just make sure that you check for the end of the file (io.EOF) so that the loop ends after reading in all of your data:

// Create a new CSV reader reading from the opened file.
reader := csv.NewReader(f)
reader.FieldsPerRecord = -1

// rawCSVData will hold our successfully parsed rows.
var rawCSVData [][]string

// Read in the records one by one.
for {

    // Read in a row. Check if we are at the end of the file.
    record, err := reader.Read()
    if err == io.EOF {
        break
    }

    // Stop on any other read error rather than appending a bad row.
    if err != nil {
        log.Fatal(err)
    }

    // Append the record to our dataset.
    rawCSVData = append(rawCSVData, record)
}

If your CSV file is not delimited by commas and/or if your CSV file contains commented rows, you can utilize the csv.Reader.Comma and csv.Reader.Comment fields to properly handle uniquely formatted CSV files. In cases where the fields in your CSV file are single-quoted, you may need to add in a helper function to trim the single quotes and parse the values.
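
As a rough sketch of those options, the following example reads semicolon-delimited data that contains a commented line and single-quoted species names. The sample data, the trimSingleQuotes helper, and the readSemicolonCSV function are assumptions made up for this illustration; only the Comma and Comment fields come from encoding/csv:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// semicolonCSV is hypothetical data: semicolon-delimited, with a comment
// line and single-quoted species names.
const semicolonCSV = `# sepal_length;sepal_width;petal_length;petal_width;species
5.1;3.5;1.4;0.2;'Iris-setosa'
4.9;3.0;1.4;0.2;'Iris-setosa'
`

// trimSingleQuotes removes one pair of surrounding single quotes, if present.
func trimSingleQuotes(s string) string {
	if len(s) >= 2 && s[0] == '\'' && s[len(s)-1] == '\'' {
		return s[1 : len(s)-1]
	}
	return s
}

// readSemicolonCSV parses semicolon-delimited text, skips lines that start
// with '#', and strips single quotes from every field.
func readSemicolonCSV(s string) ([][]string, error) {
	reader := csv.NewReader(strings.NewReader(s))
	reader.Comma = ';'   // field delimiter
	reader.Comment = '#' // lines starting with this rune are ignored

	records, err := reader.ReadAll()
	if err != nil {
		return nil, err
	}

	// encoding/csv only treats double quotes specially, so single quotes
	// are still part of the field values at this point.
	for _, record := range records {
		for i, field := range record {
			record[i] = trimSingleQuotes(field)
		}
	}
	return records, nil
}

func main() {
	records, err := readSemicolonCSV(semicolonCSV)
	if err != nil {
		log.Fatal(err)
	}
	for _, record := range records {
		fmt.Println(record)
	}
}
```

Trimming after the read keeps the reader configuration simple; the trade-off is that a delimiter inside a single-quoted value would still split the field.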

Handling unexpected fields

The preceding methods work fine with clean CSV data, but in practice we rarely encounter clean data and have to parse messy data instead. For example, you might find unexpected fields, or an unexpected number of fields, in your CSV records. This is why reader.FieldsPerRecord exists. This field of the reader value lets us handle such data easily. Consider the following rows:

4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,blah,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa

This version of the iris.csv file has an extra field in one of the rows. We know that each record should have five fields, so let's set our reader.FieldsPerRecord value to 5:

// We should have 5 fields per line. By setting
// FieldsPerRecord to 5, we can validate that each of the
// rows in our CSV has the correct number of fields.
reader.FieldsPerRecord = 5

Then as we are reading in records from the CSV file, we can check for unexpected fields and maintain the integrity of our data:

// rawCSVData will hold our successfully parsed rows.
var rawCSVData [][]string

// Read in the records looking for unexpected numbers of fields.
for {

    // Read in a row. Check if we are at the end of the file.
    record, err := reader.Read()
    if err == io.EOF {
        break
    }

    // If we had a parsing error, log the error and move on.
    if err != nil {
        log.Println(err)
        continue
    }

    // Append the record to our dataset, if it has the expected
    // number of fields.
    rawCSVData = append(rawCSVData, record)
}

Here, we have chosen to handle the error by logging the error, and we only collect successfully parsed records into rawCSVData. The reader will note that this error could be handled in many different ways. The important thing is that we are forcing ourselves to check for an expected property of the data and increasing the integrity of our application.
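
One of those other ways is sketched below: rather than only logging, we can distinguish a field-count problem from other failures and remember which input lines were rejected. This assumes Go 1.13 or later for errors.Is and errors.As; the messyCSV constant and the readValidRows function are hypothetical names for this example:

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
)

// messyCSV is hypothetical data in which the second row has an extra field.
const messyCSV = `4.3,3.0,1.1,0.1,Iris-setosa
5.4,3.9,1.3,0.4,blah,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
`

// readValidRows returns the rows that have exactly wantFields fields, plus
// the line numbers reported for rows rejected because of their field count.
// Any other read error aborts the read.
func readValidRows(s string, wantFields int) ([][]string, []int, error) {
	reader := csv.NewReader(strings.NewReader(s))
	reader.FieldsPerRecord = wantFields

	var rows [][]string
	var badLines []int

	for {
		record, err := reader.Read()
		if err == io.EOF {
			return rows, badLines, nil
		}

		// A wrong field count is reported as csv.ErrFieldCount, wrapped
		// in a *csv.ParseError that carries the line number.
		if errors.Is(err, csv.ErrFieldCount) {
			var parseErr *csv.ParseError
			if errors.As(err, &parseErr) {
				badLines = append(badLines, parseErr.Line)
			}
			continue
		}

		// Treat every other error as fatal for this read.
		if err != nil {
			return nil, nil, err
		}

		rows = append(rows, record)
	}
}

func main() {
	rows, badLines, err := readValidRows(messyCSV, 5)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("kept %d rows, rejected lines: %v\n", len(rows), badLines)
}
```

Collecting the rejected line numbers makes it possible to report how much data was dropped, which is often as important as the rows that were kept.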

Handling unexpected types

We just saw that CSV data is read into Go as [][]string. However, Go is statically typed, which allows us to enforce strict checks for each of the CSV fields. We can do this as we parse each field for further processing. Consider some messy data that has random fields that don't match the type of the other values in a column:

4.6,3.1,1.5,0.2,Iris-setosa
5.0,string,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,string,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa

To check the types of the fields in our CSV records, let's create a struct type to hold successfully parsed values:

// CSVRecord contains a successfully parsed row of the CSV file.
type CSVRecord struct {
    SepalLength  float64
    SepalWidth   float64
    PetalLength  float64
    PetalWidth   float64
    Species      string
    ParseError   error
}

Then, before we loop over the records, let's initialize a slice of these values:

// Create a slice value that will hold all of the successfully parsed
// records from the CSV.
var csvData []CSVRecord

Now as we loop over the records, we can parse into the relevant type for that record, catch any errors, and log as needed:


// Read in the records looking for unexpected types.
for {

    // Read in a row. Check if we are at the end of the file.
    record, err := reader.Read()
    if err == io.EOF {
        break
    }

    // Stop on any other read error before trying to parse the row.
    if err != nil {
        log.Fatal(err)
    }

    // Create a CSVRecord value for the row.
    var csvRecord CSVRecord

    // Parse each of the values in the record based on an expected type.
    for idx, value := range record {

        // Parse the value in the record as a string for the string column.
        if idx == 4 {

            // Validate that the value is not an empty string. If the
            // value is an empty string break the parsing loop.
            if value == "" {
                log.Printf("Unexpected type in column %d\n", idx)
                csvRecord.ParseError = fmt.Errorf("empty string value in column %d", idx)
                break
            }

            // Add the string value to the CSVRecord.
            csvRecord.Species = value
            continue
        }

        // Otherwise, parse the value in the record as a float64.
        var floatValue float64

        // If the value can not be parsed as a float, log and break the
        // parsing loop.
        if floatValue, err = strconv.ParseFloat(value, 64); err != nil {
            log.Printf("Unexpected type in column %d\n", idx)
            csvRecord.ParseError = fmt.Errorf("could not parse %q as a float", value)
            break
        }

        // Add the float value to the respective field in the CSVRecord.
        switch idx {
        case 0:
            csvRecord.SepalLength = floatValue
        case 1:
            csvRecord.SepalWidth = floatValue
        case 2:
            csvRecord.PetalLength = floatValue
        case 3:
            csvRecord.PetalWidth = floatValue
        }
    }

    // Append successfully parsed records to the slice defined above.
    if csvRecord.ParseError == nil {
        csvData = append(csvData, csvRecord)
    }
}
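
The same checks can also be packaged into a small function that returns an error instead of storing it in the struct, which tends to be easier to test in isolation. The following is only a sketch of that alternative; the IrisRecord type and the parseIrisRecord function are hypothetical names, and the field-count check is an extra assumption not present in the loop above:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// IrisRecord mirrors the CSVRecord struct above without the ParseError
// field. The name is hypothetical.
type IrisRecord struct {
	SepalLength float64
	SepalWidth  float64
	PetalLength float64
	PetalWidth  float64
	Species     string
}

// parseIrisRecord converts one raw CSV row into an IrisRecord. It returns
// an error describing the first field that fails validation.
func parseIrisRecord(record []string) (IrisRecord, error) {
	var rec IrisRecord

	// Assumption: a valid row has exactly five fields.
	if len(record) != 5 {
		return IrisRecord{}, fmt.Errorf("expected 5 fields, got %d", len(record))
	}

	// Parse the four measurement columns as float64 values.
	targets := []*float64{&rec.SepalLength, &rec.SepalWidth, &rec.PetalLength, &rec.PetalWidth}
	for idx, target := range targets {
		value, err := strconv.ParseFloat(record[idx], 64)
		if err != nil {
			return IrisRecord{}, fmt.Errorf("column %d: %w", idx, err)
		}
		*target = value
	}

	// The species column must be a non-empty string.
	if record[4] == "" {
		return IrisRecord{}, errors.New("column 4: empty species")
	}
	rec.Species = record[4]

	return rec, nil
}

func main() {
	rows := [][]string{
		{"4.6", "3.1", "1.5", "0.2", "Iris-setosa"},
		{"5.0", "string", "1.4", "0.2", "Iris-setosa"},
		{"6.4", "3.2", "4.5", "1.5", ""},
	}

	for _, row := range rows {
		rec, err := parseIrisRecord(row)
		if err != nil {
			fmt.Println("skipping row:", err)
			continue
		}
		fmt.Printf("%+v\n", rec)
	}
}
```

Inside the read loop, the call to parseIrisRecord would replace the inner loop over the fields, and only rows that return a nil error would be appended to the slice.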

Manipulating CSV data with data frames

As you can see, manually parsing many different fields and performing row-by-row operations can be rather verbose and tedious. That alone is not an excuse to increase complexity and import a lot of non-standard functionality; you should still default to encoding/csv in most cases.

However, manipulation of data frames has proven to be a successful and somewhat standardized way (in the data science community) of dealing with tabular data. Thus, in some cases, it is worth employing some third-party functionality to manipulate tabular data, such as CSV data. For example, data frames and the corresponding functionality can be very useful when you are trying to filter, subset, and select portions of tabular datasets. In this section, we will introduce github.com/kniren/gota/dataframe, a wonderful dataframe package for Go:

import "github.com/kniren/gota/dataframe"

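To create a data frame from a CSV file, we open the file with os.Open() and then pass the returned *os.File to the dataframe.ReadCSV() function. By default, ReadCSV treats the first row as a header, so this example assumes a copy of iris.csv whose first line names the columns (sepal_length, sepal_width, petal_length, petal_width, species):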

// Open the CSV file.
irisFile, err := os.Open("iris.csv")
if err != nil {
    log.Fatal(err)
}
defer irisFile.Close()

// Create a dataframe from the CSV file.
// The types of the columns will be inferred.
irisDF := dataframe.ReadCSV(irisFile)

// As a sanity check, display the records to stdout.
// Gota will format the dataframe for pretty printing.
fmt.Println(irisDF)

If we compile and run this Go program, we will see a nice, pretty-printed version of our data with the types that were inferred during parsing:

$ go build
$ ./myprogram
[150x5] DataFrame

    sepal_length sepal_width petal_length petal_width species
 0: 5.100000     3.500000    1.400000     0.200000    Iris-setosa
 1: 4.900000     3.000000    1.400000     0.200000    Iris-setosa
 2: 4.700000     3.200000    1.300000     0.200000    Iris-setosa
 3: 4.600000     3.100000    1.500000     0.200000    Iris-setosa
 4: 5.000000     3.600000    1.400000     0.200000    Iris-setosa
 5: 5.400000     3.900000    1.700000     0.400000    Iris-setosa
 6: 4.600000     3.400000    1.400000     0.300000    Iris-setosa
 7: 5.000000     3.400000    1.500000     0.200000    Iris-setosa
 8: 4.400000     2.900000    1.400000     0.200000    Iris-setosa
 9: 4.900000     3.100000    1.500000     0.100000    Iris-setosa
    ...          ...         ...          ...         ...
    <float>      <float>     <float>      <float>     <string>

Once we have the data parsed into a dataframe, we can filter, subset, and select our data easily:

// Create a filter for the dataframe.
filter := dataframe.F{
    Colname: "species",
    Comparator: "==",
    Comparando: "Iris-versicolor",
}

// Filter the dataframe to see only the rows where
// the iris species is "Iris-versicolor".
versicolorDF := irisDF.Filter(filter)
if versicolorDF.Err != nil {
    log.Fatal(versicolorDF.Err)
}

// Filter the dataframe again, but only select out the
// sepal_width and species columns.
versicolorDF = irisDF.Filter(filter).Select([]string{"sepal_width", "species"})

// Filter and select the dataframe again, but only display
// the first three results.
versicolorDF = irisDF.Filter(filter).Select([]string{"sepal_width", "species"}).Subset([]int{0, 1, 2})
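
To make it concrete what these three operations do to the data, here is a plain-Go sketch of the same filter, select, and subset steps on string rows. It is not how the dataframe package is implemented; the table type and its methods are hypothetical and exist only for this illustration:

```go
package main

import "fmt"

// table is a hypothetical minimal stand-in for a data frame: column names
// plus rows of string values.
type table struct {
	cols []string
	rows [][]string
}

// colIndex returns the position of the named column, or -1 if it is absent.
func (t table) colIndex(name string) int {
	for i, c := range t.cols {
		if c == name {
			return i
		}
	}
	return -1
}

// filterEq keeps only the rows whose value in column col equals want.
func (t table) filterEq(col, want string) table {
	out := table{cols: t.cols}
	idx := t.colIndex(col)
	if idx < 0 {
		return out
	}
	for _, row := range t.rows {
		if row[idx] == want {
			out.rows = append(out.rows, row)
		}
	}
	return out
}

// selectCols keeps only the named columns, in the order given.
func (t table) selectCols(names ...string) table {
	var out table
	var idxs []int
	for _, name := range names {
		if i := t.colIndex(name); i >= 0 {
			idxs = append(idxs, i)
			out.cols = append(out.cols, name)
		}
	}
	for _, row := range t.rows {
		picked := make([]string, len(idxs))
		for j, i := range idxs {
			picked[j] = row[i]
		}
		out.rows = append(out.rows, picked)
	}
	return out
}

// subset keeps only the rows at the given positions, skipping positions
// that are out of range.
func (t table) subset(positions ...int) table {
	out := table{cols: t.cols}
	for _, p := range positions {
		if p >= 0 && p < len(t.rows) {
			out.rows = append(out.rows, t.rows[p])
		}
	}
	return out
}

// sampleIris returns a few rows in the iris.csv layout.
func sampleIris() table {
	return table{
		cols: []string{"sepal_length", "sepal_width", "petal_length", "petal_width", "species"},
		rows: [][]string{
			{"5.1", "3.5", "1.4", "0.2", "Iris-setosa"},
			{"7.0", "3.2", "4.7", "1.4", "Iris-versicolor"},
			{"6.4", "3.2", "4.5", "1.5", "Iris-versicolor"},
			{"6.9", "3.1", "4.9", "1.5", "Iris-versicolor"},
			{"5.5", "2.3", "4.0", "1.3", "Iris-versicolor"},
		},
	}
}

func main() {
	result := sampleIris().
		filterEq("species", "Iris-versicolor").
		selectCols("sepal_width", "species").
		subset(0, 1, 2)

	fmt.Println(result.cols)
	for _, row := range result.rows {
		fmt.Println(row)
	}
}
```

Even this stripped-down version needs a fair amount of bookkeeping, and it ignores column types entirely, which is the kind of work a dataframe package takes off our hands.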

This only scratches the surface of the github.com/kniren/gota/dataframe package. You can merge datasets, output to other formats, and even process JSON data. For more information about this package, visit the auto-generated GoDoc at https://godoc.org/github.com/kniren/gota/dataframe. Consulting the GoDoc is good practice, in general, for any of the packages we discuss in the book.