NLP workflow template

Some of us would love to work on Natural Language Processing for its sheer intellectual challenges – across research and engineering. To measure our progress, having a workflow with rough time estimates is really valuable. In this short section, we will briefly outline what a usual NLP or even most applied machine learning processes look like.

Most people I've learned from like to use a (roughly) five-step process:

Understanding the problem
Understanding and preparing data
Quick wins: proof of concepts
Iterating and improving the results
Evaluation and deployment

This is just a process template. It has a lot of room for customization regarding the engineering culture in your company. Any of these steps can be broken down further. For instance, data preparation and understanding can be split further into analysis and cleaning. Similarly, the proof of concept step may involve multiple experiments, and a demo or a report submission of best results from those.

Although this appears to be a strictly linear process, it is not so. More often than not, you will want to revisit a previous step and change a parameter or a particular data transform to see the effect on later performance.

In order to do so, it is important to factor in the cyclic nature of this process in your code. Write code with well-designed abstractions with each component being independently reusable.

If you are interested in how to write better NLP code, especially for research or experimentation, consider looking up the slide deck titled Writing Code for NLP Research, by Joel Grus of AllenAI.

Let's expand a little bit into each of these sections.

Understanding the problem

We will begin by understanding the requirements and constraints from a practical business view point. This tends to answer the following the questions:

What is the main problem? We will try to understand – formally and informally – the assumptions and expectations from our project.
How will I solve this problem? List some ideas that you might have seen earlier or in this book. This is the list that you will use to plan your work ahead.

Understanding and preparing the data

Text and language is inherently unstructured. We might want to clean it in certain ways, such as expanding abbreviations and acronyms, removing punctuation, and so on. We also want to select a few samples that are the best representatives of the data we might see in the wild.

The other common practice is to prepare a gold dataset. A gold dataset is the best available data under reasonable conditions. This is not the best available data under ideal conditions. Creating the gold dataset often involves manual tagging and cleaning processes.

The next few sections are dedicated to text cleaning and text representations at this stage of the NLP workflow.

Quick wins – proof of concept

We want to quickly spot the types of algorithms and dataset combinations that sort of work for us. We can then focus on them and study them in greater detail.

The results from here will help you estimate the amount of work ahead of you. For instance, if you are going to develop a search system for documents based exclusively on keywords, your main effort will probably be deploying an open source solution such as ElasticSearch.

Let's say that you now want to add a similar documents feature. Depending on the expected quality of results, you will want to look into techniques such as doc2vec and word2vec, or even some convolutional neural network solution using Keras/Tensorflow or PyTorch.

This step is essential to get a greater buy-in from others around you, such as your boss, to invest more energy and resources into this. In an engineering role, this demo should highlight parts of your work that the shelf systems usually can't do. These are your unique strengths. These are usually insights, customization, and control that other systems can't provide.

Iterating and improving

At this point, we have a selected list of algorithms, data, and methods that have encouraging results for us.

Algorithms

If your algorithms are machine learning or statistical in nature, you will quite often have a lot of juice left.

There are quite often parameters for which you simply pick a good enough default during the earlier stage. Here, you might want to double down and check for the best value of those parameters. This idea is sometimes referred to as parameter search, or hyperparameter tuning in machine learning parlance.

You might want to combine the results of one technique with the other in particular ways. For instance, some statistical methods might be very good for finding noun phrases in your text and using them to classify it, while a deep learning method (let's call it DL-LSTM) might be the best suited for text classification of the entire document. In that case, you might want to pass the extra information from both your noun phrase extraction and DL-LSTM to another model. This will allow it to the use the best of both worlds. This idea is sometimes referred to as stacking in machine learning parlance. This was quite successful on the machine learning contest platform Kaggle until very recently.

Pre-processing

Simple changes in data pre-processing or the data cleaning stage can quite often give you dramatically better results. For instance, making sure that your entire corpus is in lowercase can help you reduce the number of unique words (your vocabulary size) by a significant fraction.

If your numeric representation of words is skewed by the word frequency, sometimes it helps to normalize and/or scale the same. The laziest hack is to simply divide by the frequency.

Evaluation and deployment

Evaluation and deployment are critical components in making your work widely available. The quality of your evaluation determines how trustworthy your work is by other people. Deployment varies widely, but quite often is abstracted out in single function calls or REST API calls.

Evaluation

Let's say you have a model with 99% accuracy in classifying brain tumors. Can you trust this model? No.

If your model had said that no-one has a brain tumor, it would still have 99%+ accuracy. Why?

Because luckily 99% or more of the population does not have a brain tumor!

To use our models for practical use, we need to look beyond accuracy. We need to understand what the model gets right or wrong in order to improve it. A minute spent understanding the confusion matrix will stop us from going ahead with such dangerous models.

Additionally, we will want to develop an intuition of what the model is doing underneath the black box optimization algorithms. Data visualization techniques such as t-SNE can assist us with this.

For continuously running NLP applications such as email spam classifiers or chatbots, we would want the evaluation of the model quality to happen continuously as well. This will help us ensure that the model's performance does not degrade with time.

Deployment

This book is written with a programmer-first mindset. We will learn how to deploy any machine learning or NLP application as a REST API which can then be used for the web and mobile. This architecture is quite prevalent in the industry. For instance, we know that this is how data science teams such as those at Amazon and LinkedIn deploy their work to the web.