What is NLP?
Natural language processing (NLP) is a set of techniques that helps computers work with human language. However, it can be used for more than words and sentences. It also works on application log files, source code, or anything else that contains human-written text, and even on imaginary languages, so long as the text consistently follows a language’s rules. Natural language is language that humans speak or write. Processing is the act of a computer using data. So, NLP is the act of a computer using spoken or written human language. It’s that simple.
Many of us software developers have been doing NLP for years, maybe even without realizing it. I will give my own example. I started my career as a web developer, entirely self-educated. Early on, I built a website that became very popular and had a nice community, so I took inspiration from Yahoo Chats (popular at the time), reverse-engineered it, and built my own internet message board. It grew rapidly, providing years of entertainment and making me some close friends. However, as with any good social application, trolls, bots, and generally nasty people eventually became a problem, so I needed a way to flag and quarantine abusive content automatically.
Back then, I created lists of abusive words and strings that could help catch abuse. I was not interested in stopping all obscenities, as I do not believe in completely controlling how people post text online; however, I was looking to identify toxic behavior, violence, and other nasty things. Anyone with a comment section on their website is very likely doing something similar to moderate it, or they should be. The point is that I have been doing NLP since the beginning of my career without even noticing; it was simply rule-based.
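A rule-based approach like the word lists described above can be sketched in a few lines. This is only a minimal illustration, not the code from the original message board: the term list, `flag_post`, and `should_quarantine` are hypothetical names chosen for this example, and a real moderation list would be curated by hand.

```python
import re

# Hypothetical placeholder terms; a real list would be curated and much longer.
ABUSIVE_TERMS = ["idiot", "go die", "kill you"]

# One case-insensitive pattern; \b keeps "idiot" from matching inside "idiotic".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in ABUSIVE_TERMS) + r")\b",
    re.IGNORECASE,
)


def flag_post(text):
    """Return the listed terms found in text, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]


def should_quarantine(text):
    """Quarantine a post if it contains at least one listed term."""
    return bool(flag_post(text))
```

Calling `should_quarantine` on each new post would be enough to hold suspicious content for review. The obvious weakness is that the list only catches what its author thought to include.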
These days, machine learning dominates the NLP landscape, as we are able to train models to detect abuse, violence, or pretty much anything we can imagine. That is one of the things I love most about NLP: I feel that I am limited only by the extent of my own creativity. I have created classifiers to detect discussions that contained or were about extreme political sentiment, violence, music, art, data science, natural sciences, and disinformation, and at any given moment, I typically have several NLP models in mind that I want to build but haven’t found the time for. I have even used NLP to detect malware. As my malware classifier shows, NLP doesn’t have to be applied to written or spoken words. If you keep that in mind, your potential uses for NLP expand massively. My rule of thumb is that if data contains sequences that can be extracted and treated as words, even if they are not actually words, they can potentially be used with NLP techniques.
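To make the rule of thumb concrete, here is a small sketch that treats application log lines as if they were sentences and pulls word-like tokens out of them. The log lines are invented for illustration, and `tokenize` is a hypothetical helper, not a function from any particular library.

```python
import re
from collections import Counter

# Hypothetical log lines, invented for illustration only.
LOG_LINES = [
    "2024-01-01 12:00:01 ERROR db timeout user=42",
    "2024-01-01 12:00:05 INFO login ok user=42",
    "2024-01-01 12:00:09 ERROR db timeout user=7",
]


def tokenize(line):
    """Split a line into lowercase word-like tokens (letters, digits, underscore)."""
    return re.findall(r"[a-z0-9_]+", line.lower())


# Count how often each token appears across all lines.
token_counts = Counter(tok for line in LOG_LINES for tok in tokenize(line))
```

Once the lines are reduced to tokens and counts, the same techniques used on ordinary text can be tried on them, even though nobody would call a log file a language.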
In the past, and probably still now, analysts would drop columns containing text or apply only very basic transformations or computations, such as one-hot encoding, counts, or presence/absence flags (true/false). However, there is so much more that you can do, and I hope this chapter and book will spark some inspiration and curiosity in you.
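For reference, the basic transformations mentioned above look roughly like this in plain Python. The sample data and variable names are hypothetical, made up purely to show what counts, presence/absence flags, and one-hot encoding produce.

```python
# Hypothetical text column and label column, invented for illustration.
comments = ["great product", "terrible support, terrible app", ""]
labels = ["praise", "bug", "praise"]

# Count: number of whitespace-separated tokens in each entry.
word_counts = [len(c.split()) for c in comments]

# Presence/absence: does the entry contain the token "terrible"?
has_terrible = ["terrible" in c.lower().split() for c in comments]

# One-hot encoding: one 0/1 column per distinct label, in sorted order.
categories = sorted(set(labels))
one_hot = [[1 if label == cat else 0 for cat in categories] for label in labels]
```

Each of these reduces a piece of text to a number or a flag, which is easy to feed into a model but throws away most of what the text actually says.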