F# for Machine Learning Essentials

Get up and running with machine learning with F# in a fun and functional way

Product type: Paperback
Published: Feb 2016
ISBN-13: 9781783989348
Length: 194 pages
Edition: 1st Edition
Author: Sudipta Mukherjee

Recognizing handwritten digits – your "Hello World" ML program

Handwritten digits can be recognized with the k-nearest neighbor (k-NN) algorithm.

Each handwritten digit is drawn on a 28 by 28 grid, which gives 28 × 28 = 784 pixels. Each pixel is stored in its own column of the dataset, so the dataset has 785 columns: the first column holds the label (the digit), and the remaining 784 columns hold the pixel values.

Here is a small example. If we imagine the digit 2 drawn on an 8 by 8 matrix instead, it would look something like the following figure:

[Figure: the digit 2 drawn on an 8 by 8 grid]

A matrix can be represented as a 2-D array in which each cell holds one pixel. Any 2-D array can also be unwrapped into a 1-D array whose length is the product of the number of rows and the number of columns. For the 8 by 8 matrix, for example, the 1-D array has 64 elements. If we store several images in this unwrapped form, one image per row, we get something like the following spreadsheet:

[Figure: spreadsheet with a Label column followed by one column per pixel value]

The Label column holds the digit, and the remaining columns hold the pixel values. The lower the pixel value, the darker the cell in the pictorial representation of the number 2 shown previously.
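The unwrapping described above can be sketched in a few lines of F#. This is only an illustration: the `flatten` helper and the pixel values are made up for the example and do not come from the book's code or from the Kaggle data.

```fsharp
// Hypothetical helper: unwrap a 2-D grid into a 1-D array, row by row.
let flatten (grid: int[,]) : int[] =
    [| for r in 0 .. Array2D.length1 grid - 1 do
         for c in 0 .. Array2D.length2 grid - 1 do
           yield grid.[r, c] |]

// An 8 by 8 grid of made-up pixel values (not taken from the dataset).
let grid = Array2D.init 8 8 (fun r c -> r * 8 + c)
let pixels = flatten grid

// A dataset row is the label followed by the unwrapped pixels.
let label = 2
let row = Array.append [| label |] pixels

printfn "%d pixels, %d columns" pixels.Length row.Length
// prints "64 pixels, 65 columns"
```

For the real 28 by 28 images the same layout gives 784 pixels and 785 columns.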

In this program, you will write code to solve the Digit Recognizer challenge from Kaggle, available at:

https://www.kaggle.com/c/digit-recognizer

Once you get there, download the data and save it in a folder. We will use the train.csv file (available from www.kaggle.com/c/digit-recognizer/data) to train our classifier. In this example, you will implement the k-nearest neighbor algorithm from scratch and then use it to recognize digits.

For your convenience, I have pasted the code at https://gist.github.com/sudipto80/72e6e56d07110baf4d4d.

The following steps create the classifier:

  1. Open Visual Studio 2013.
  2. Create a new project:
    [Screenshot: creating a new project]
  3. Select F# and give the console app a name:
    [Screenshot: selecting F# and naming the console app]
  4. Once you create the project by clicking "OK", your Program.fs file will look like the following image:
    [Screenshot: the generated Program.fs file]
  5. Add the following functions and types to your file:
    [Code listings, shown as images in the original; the code is in the gist linked above]
  6. Finally, in the main method, add the following code:
    [Code listing, shown as an image in the original; the code is in the gist linked above]

When this program runs, it will produce the following output:

[Figure: program output]

How does this work?

The distance function is based on the Euclidean distance mentioned earlier in the chapter, and it is written as a general-purpose function. You might have noticed a small difference between the formula and the implementation: the implementation computes the squared Euclidean distance, given by the following formula:

$$d^2(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2$$

Here $x$ and $y$ denote the two vectors. In this case, $x$ might denote one example from the training set and $y$ might denote the test example, that is, the new uncategorized data depicted by newEntry in the preceding code.
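A squared Euclidean distance can be sketched in F# as follows. The tupled argument mirrors the way `distance` is called in the chapter, but the body and the `int list` element type are assumptions, not the author's listing.

```fsharp
// Sketch of a squared Euclidean distance over two equal-length lists.
// Assumption: pixel values are ints; the real listing may use another type.
let distance (values1: int list, values2: int list) : float =
    List.zip values1 values2                      // pair up matching components
    |> List.sumBy (fun (a, b) -> float (a - b) ** 2.0)

printfn "%.1f" (distance ([0; 3; 4], [0; 0; 0]))
// prints "25.0" (0 + 9 + 16), not the Euclidean distance 5.0
```

Skipping the square root is safe for k-NN: the square root is monotonic, so ranking neighbors by squared distance gives the same order as ranking them by distance.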

The loadValues function reads the CSV file, loading the pixel values and the category of each training or test example, and creates a list of Entry values.
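A loader of this kind might look like the sketch below. The chapter only shows that an Entry exposes `Label` and `Values`, so the field types here are assumptions, and `parseEntry` and `parseEntries` are hypothetical names rather than the book's loadValues. The sketch also assumes the file starts with a header row, and it uses made-up three-pixel rows in place of the 784-pixel rows of train.csv.

```fsharp
// Assumed shape of the Entry type (field types are a guess).
type Entry = { Label: string; Values: int list }

// Hypothetical parser for one CSV data row: label first, then pixels.
let parseEntry (line: string) : Entry =
    let cells = line.Split(',')
    { Label = cells.[0]
      Values = cells |> Array.skip 1 |> Array.map int |> Array.toList }

// Hypothetical loader: skip the header row, then parse every data row.
let parseEntries (lines: seq<string>) : Entry list =
    lines |> Seq.skip 1 |> Seq.map parseEntry |> Seq.toList

// Made-up rows standing in for the contents of train.csv.
let sample = [ "label,pixel0,pixel1,pixel2"; "2,0,255,17"; "9,4,0,128" ]
let entries = parseEntries sample

printfn "%d entries, first label %s" entries.Length entries.[0].Label
```

To read the real file, the same function could be fed `System.IO.File.ReadLines "train.csv"`, with the path adjusted to wherever you saved the download.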

The k-NN algorithm is implemented in the kNN function. Refer to the following line of code:

|> List.map( fun x -> ( x.Label, distance  (x.Values, snd (newEntry) |>Array.toList )))

This code creates a list of tuples in which the first element is the category of a training entry and the second is the squared distance between the test data and that training entry. So it might look as follows:

[Figure: a list of (category, squared distance) tuples]

Now consider the following line:

|> List.sortBy ( fun x -> snd x)

It sorts the list of tuples by increasing distance from the test data. The preceding list then becomes the one shown in the following image:

[Figure: the same list sorted by increasing squared distance]

Notice that there are four 9s and three 4s in this list. The following line transforms this list into a histogram:

|> Seq.countBy (fun x -> fst x)

So if k is chosen to be 5, we will have four 9s and one 4. Thus, k-nearest neighbor will conclude that the digit is probably a "9", since most of the nearest neighbors are "9".

The drawDigit function draws the digit pixel by pixel and writes out the guessed label for the digit. Each pixel is drawn as a tile of size 20.
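The sort, count, and vote steps can be put together in a small sketch. The `vote` helper and the distances below are made up for illustration; only the counts (four 9s and three 4s) follow the example in the text. `Seq.truncate` is used here as an assumption, so that a k larger than the list does not raise an error.

```fsharp
// Made-up (label, squared distance) pairs, standing in for the output
// of the List.map step.
let labelled =
    [ ("9", 40.0); ("4", 10.0); ("9", 5.0); ("9", 20.0);
      ("4", 50.0); ("9", 15.0); ("4", 60.0) ]

// Hypothetical helper: majority label among the k closest pairs.
let vote (k: int) (pairs: (string * float) list) : string =
    pairs
    |> List.sortBy snd      // closest first
    |> Seq.truncate k       // keep at most k neighbors
    |> Seq.countBy fst      // histogram: (label, count)
    |> Seq.maxBy snd        // most frequent label wins
    |> fst

printfn "%s" (vote 5 labelled)
// prints "9": the five closest pairs are four 9s and one 4
```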
