Chapter 1. Introduction to Natural Language Processing
I will start with the introduction to Natural Language Processing (NLP). Language is a central part of our day to day life, and it's so interesting to work on any problem related to languages. I hope this book will give you a flavor of NLP, will motivate you to learn some amazing concepts of NLP, and will inspire you to work on some of the challenging NLP applications.
In my own language, the study of language processing is called NLP. People who are deeply involved in the study of language are linguists, while the term 'computational linguist' applies to the study of processing languages with the application of computation. Essentially, a computational linguist will be a computer scientist who has enough understanding of languages, and can apply his computational skills to model different aspects of the language. While computational linguists address the theoretical aspect of language, NLP is nothing but the application of computational linguistics.
NLP is more about the application of computers on different language nuances, and building real-world applications using NLP techniques. In a practical context, NLP is analogous to teaching a language to a child. Some of the most common tasks like understanding words, sentences, and forming grammatically and structurally correct sentences, are very natural to humans. In NLP, some of these tasks translate to tokenization, chunking, part of speech tagging, parsing, machine translation, speech recognition, and most of them are still the toughest challenges for computers. I will be talking more on the practical side of NLP, assuming that we all have some background in NLP. The expectation for the reader is to have minimal understanding of any programming language and an interest in NLP and Language.
By end of the chapter we want readers
- A brief introduction to NLP and related concepts.
- Install Python, NLTK and other libraries.
- Write some very basic Python and NLTK code snippets.
If you have never heard the term NLP, then please take some time to read any of the books mentioned here—just for an initial few chapters. A quick reading of at least the Wikipedia page relating to NLP is a must:
- Speech and Language Processing by Daniel Jurafsky and James H. Martin
- Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze
Why learn NLP?
I start my discussion with the Gartner's new hype cycle and you can clearly see NLP on top of the cycle. Currently, NLP is one of the rarest skill sets that is required in the industry. After the advent of big data, the major challenge is that we need more people who are good with not just structured, but also with semi or unstructured data. We are generating petabytes of Weblogs, tweets, Facebook feeds, chats, e-mails, and reviews. Companies are collecting all these different kind of data for better customer targeting and meaningful insights. To process all these unstructured data source we need people who understand NLP.
We are in the age of information; we can't even imagine our life without Google. We use Siri for the most of basic stuff. We use spam filters for filtering spam emails. We need spell checker on our Word document. There are many examples of real world NLP applications around us.
Let me also give you some examples of the amazing NLP applications that you can use, but are not aware that they are built on NLP:
- Spell correction (MS Word/ any other editor)
- Search engines (Google, Bing, Yahoo, wolframalpha)
- Speech engines (Siri, Google Voice)
- Spam classifiers (All e-mail services)
- News feeds (Google, Yahoo!, and so on)
- Machine translation (Google Translate, and so on)
- IBM Watson
Building these applications requires a very specific skill set with a great understanding of language and tools to process the language efficiently. So it's not just hype that makes NLP one of the most niche areas, but it's the kind of application that can be created using NLP that makes it one of the most unique skills to have.
To achieve some of the above applications and other basic NLP preprocessing, there are many open source tools available. Some of them are developed by organizations to build their own NLP applications, while some of them are open-sourced. Here is a small list of available NLP tools:
- GATE
- Mallet
- Open NLP
- UIMA
- Stanford toolkit
- Genism
- Natural Language Tool Kit (NLTK)
Most of the tools are written in Java and have similar functionalities. Some of them are robust and have a different variety of NLP tools available. However, when it comes to the ease of use and explanation of the concepts, NLTK scores really high. NLTK is also a very good learning kit because the learning curve of Python (on which NLTK is written) is very fast. NLTK has incorporated most of the NLP tasks, it's very elegant and easy to work with. For all these reasons, NLTK has become one of the most popular libraries in the NLP community:
I am assuming all you guys know Python. If not, I urge you to learn Python. There are many basic tutorials on Python available online. There are lots of books also available that give you a quick overview of the language. We will also look into some of the features of Python, while going through the different topics. But for now, even if you only know the basics of Python, such as lists, strings, regular expressions, and basic I/O, you should be good to go.
Note
Python can be installed from the following website:
https://www.python.org/downloads/
http://continuum.io/downloads
I would recommend using Anaconda or Canopy Python distributions. The reason being that these distributions come with bundled libraries, such as scipy
, numpy
, scikit
, and so on, which are used for data analysis and other applications related to NLP and related fields. Even NLTK is part of this distribution.
Let's test everything.
Open the terminal on your respective operating systems. Then run:
$ python
This should open the Python interpreter:
Python 2.6.6 (r266:84292, Oct 15 2013, 07:32:41) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>>
I hope you got a similar looking output here. There is a chance that you will have received a different looking output, but ideally you will get the latest version of Python (I recommend that to be 2.7), the compiler GCC, and the operating system details. I know the latest version of Python will be in 3.0+ range, but as with any other open source systems, we should tries to hold back to a more stable version as opposed to jumping on to the latest version. If you have moved to Python 3.0+, please have a look at the link below to gain an understanding about what new features have been added:
https://docs.python.org/3/whatsnew/3.4.html.
UNIX based systems will have Python as a default program. Windows users can set the path to get Python working. Let's check whether we have installed NLTK correctly:
>>>import nltk >>>print "Python and NLTK installed successfully" Python and NLTK installed successfully
Hey, we are good to go!
Let's start playing with Python!
We'll not be diving too deep into Python; however, we'll give you a quick tour of Python essentials. Still, I think for the benefit of the audience, we should have a quick five minute tour. We'll talk about the basics of data structures, some frequently used functions, and the general construct of Python in the next few sections.
Note
I highly recommend the two hour Google Python class. https://developers.google.com/edu/python should be good enough to start. Please go through the Python website https://www.python.org/ for more tutorials and other resources.
Lists
Lists are one of the most commonly used data structures in Python. They are pretty much comparable to arrays in other programming languages. Let's start with some of the most important functions that a Python list provide.
Try the following in the Python console:
>>> lst=[1,2,3,4] >>> # mostly like arrays in typical languages >>>print lst [1, 2, 3, 4]
Python lists can be accessed using much more flexible indexing. Here are some examples:
>>>print 'First element' +lst[0]
You will get an error message like this:
TypeError: cannot concatenate 'str' and 'int' objects
The reason being that Python is an interpreted language, and checks for the type of the variables at the time it evaluates the expression. We need not initialize and declare the type of variable at the time of declaration. Our list has integer object and cannot be concatenated as a print function. It will only accept a string object. For this reason, we need to convert list elements to string. The process is also known as type casting.
>>>print 'First element :' +str(lst[0]) >>>print 'last element :' +str(lst[-1]) >>>print 'first three elements :' +str(lst[0:2]) >>>print 'last three elements :'+str(lst[-3:]) First element :1 last element :4 first three elements :[1, 2,3] last three elements :[2, 3, 4]
Helping yourself
The best way to learn more about different data types and functions is to use help functions like help()
and dir(lst)
.
The dir(python object)
command is used to list all the given attributes of the given Python object. Like if you pass a list object, it will list all the cool things you can do with lists:
>>>dir(lst) >>>' , '.join(dir(lst)) '__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __delslice__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __getslice__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __setslice__ , __sizeof__ , __str__ , __subclasshook__ , append , count , extend , index , insert , pop , remove , reverse , sort'
With the help(python object)
command, we can get detailed documentation for the given Python object, and also give a few examples of how to use the Python object:
>>>help(lst.index) Help on built-in function index: index(...) L.index(value, [start, [stop]]) -> integer -- return first index of value. This function raises a ValueError if the value is not present.
So help
and dir
can be used on any Python data type, and are a very nice way to learn about the function and other details of that object. It also provides you with some basic examples to work with, which I found useful in most cases.
Strings in Python are very similar to other languages, but the manipulation of strings is one of the main features of Python. It's immensely easy to work with strings in Python. Even something very simple, like splitting a string, takes effort in Java / C, while you will see how easy it is in Python.
Using the help function that we used previously, you can get help for any Python object and any function. Let's have some more examples with the other most commonly used data type strings:
- Split: This is a method to split the string based on some delimiters. If no argument is provided it assumes whitespace as delimiter.
>>> mystring="Monty Python ! And the holy Grail ! \n" >>> print mystring.split() ['Monty', 'Python', '!', 'and', 'the', 'holy', 'Grail', '!']
- Strip: This is a method that can remove trailing whitespace, like '\n', '\n\r' from the string:
>>> print mystring.strip() >>>Monty Python ! and the holy Grail !
If you notice the '\n' character is stripped off. There are also methods like
rstrip()
andlstrip()
to strip trailing whitespaces to the right and left of the string. - Upper/Lower: We can change the case of the string using these methods:
>>> print (mystring.upper() >>>MONTY PYTHON !AND THE HOLY GRAIL !
- Replace: This will help you substitute a substring from the string:
>>> print mystring.replace('!','''''') >>> Monty Python and the holy Grail
There are tons of string functions. I have just talked about some of the most frequently used.
Note
Please look the following link for more functions and examples:
Regular expressions
One other important skill for an NLP enthusiast is working with regular expression. Regular expression is effectively pattern matching on strings. We heavily use pattern extrication to get meaningful information from large amounts of messy text data. The following are all the regular expressions you need. I haven't used any regular expressions beyond these in my entire life:
(a period)
: This expression matches any single character except newline\n
.\w
: This expression will match a character or a digit equivalent to [a-z A-Z 0-9]- \W (upper case W) matches any non-word character.
\s
: This expression (lowercase s) matches a single whitespace character - space, newline, return, tab, form [\n\r\t\f
].\S
: This expression matches any non-whitespace character.\t
: This expression performs a tab operation.\n
: This expression is used for a newline character.\r
: This expression is used for a return character.\d
: Decimal digit [0-9].^
: This expression is used at the start of the string.$
: This expression is used at the end of the string.\
: This expression is used to nullify the specialness of the special character. For example, you want to match the$
symbol, then add\
in front of it.
Let's search for something in the running example, where mystring
is the same string object, and we will try to look for some patterns in that. A substring search is one of the common use-cases of the re
module. Let's implement this:
>>># We have to import re module to use regular expression >>>import re >>>if re.search('Python',mystring): >>> print "We found python " >>>else: >>> print "NO "
Once this is executed, we get the message as follows:
We found python
We can do more pattern finding using regular expressions. One of the common functions that is used in finding all the patterns in a string is findall
. It will look for the given patterns in the string, and will give you a list of all the matched objects:
>>>import re >>>print re.findall('!',mystring) ['!', '!']
As we can see there were two instances of the "!
" in the mystring
and findall
return both object as a list.
Dictionaries
The other most commonly used data structure is dictionaries, also known as associative arrays/memories in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; such as strings and numbers can always be keys.
Dictionaries are handy data structure that used widely across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's so easy to work around dictionary, and the great thing is that with few nuggets of code you can build a very complex data structure, while the same task can take so much time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than the data structure itself.
I am using one of the very common use cases of dictionaries to get the frequency distribution of words in a given text. With just few lines of the following code, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:
>>># declare a dictionary >>>word_freq={} >>>for tok in string.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>>print word_freq {'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
Writing functions
As any other programming langauge Python also has its way of writing functions. Function in Python start with keyword def
followed by the function name and parentheses ()
. Similar to any other programming language any arguments and the type of the argument should be placed within these parentheses. The actual code starts with (:) colon symbol
. The initial lines of the code are typically doc string (comments), then we have code body and function ends with a return statement. For example in the given example the function wordfreq
start with def
keyword, there is no argument to this function and the function ends with a return statement.
>>>import sys >>>def wordfreq (mystring): >>> ''' >>> Function to generated the frequency distribution of the given text >>> ''' >>> print mystring >>> word_freq={}
 >>> for tok in mystring.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>> print word_freq >>>def main(): >>> str="This is my fist python program" >>> wordfreq(str) >>>if __name__ == '__main__': >>> main()
This was the same code that we wrote in the previous section the idea of writing in a form of function is to make the code re-usable and readable. The interpreter style of writing Python is also very common but for writing big programes it will be a good practice to use function/classes and one of the programming paradigm. We also wanted the user to write and run first Python program. You need to follow these steps to achive this.
- Open an empty python file
mywordfreq.py
in your prefered text editor. - Write/Copy the code above in the code snippet to the file.
- Open the command prompt in your Operating system.
- Run following command prompt:
$ python mywordfreq,py "This is my fist python program !!"
- Output should be:
{'This': 1, 'is': 1, 'python': 1, 'fist': 1, 'program': 1, 'my': 1}
Now you have a very basic understanding about some common data-structures that python provides. You can write a full Python program and able to run that. I think this is good enough I think with this much of an introduction to Python you can manage for the initial chapters.
Note
Please have a look at some Python tutorials on the following website to learn more commands on Python:
Lists
Lists are one of the most commonly used data structures in Python. They are pretty much comparable to arrays in other programming languages. Let's start with some of the most important functions that a Python list provide.
Try the following in the Python console:
>>> lst=[1,2,3,4] >>> # mostly like arrays in typical languages >>>print lst [1, 2, 3, 4]
Python lists can be accessed using much more flexible indexing. Here are some examples:
>>>print 'First element' +lst[0]
You will get an error message like this:
TypeError: cannot concatenate 'str' and 'int' objects
The reason being that Python is an interpreted language, and checks for the type of the variables at the time it evaluates the expression. We need not initialize and declare the type of variable at the time of declaration. Our list has integer object and cannot be concatenated as a print function. It will only accept a string object. For this reason, we need to convert list elements to string. The process is also known as type casting.
>>>print 'First element :' +str(lst[0]) >>>print 'last element :' +str(lst[-1]) >>>print 'first three elements :' +str(lst[0:2]) >>>print 'last three elements :'+str(lst[-3:]) First element :1 last element :4 first three elements :[1, 2,3] last three elements :[2, 3, 4]
Helping yourself
The best way to learn more about different data types and functions is to use help functions like help()
and dir(lst)
.
The dir(python object)
command is used to list all the given attributes of the given Python object. Like if you pass a list object, it will list all the cool things you can do with lists:
>>>dir(lst) >>>' , '.join(dir(lst)) '__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __delslice__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __getslice__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __setslice__ , __sizeof__ , __str__ , __subclasshook__ , append , count , extend , index , insert , pop , remove , reverse , sort'
With the help(python object)
command, we can get detailed documentation for the given Python object, and also give a few examples of how to use the Python object:
>>>help(lst.index) Help on built-in function index: index(...) L.index(value, [start, [stop]]) -> integer -- return first index of value. This function raises a ValueError if the value is not present.
So help
and dir
can be used on any Python data type, and are a very nice way to learn about the function and other details of that object. It also provides you with some basic examples to work with, which I found useful in most cases.
Strings in Python are very similar to other languages, but the manipulation of strings is one of the main features of Python. It's immensely easy to work with strings in Python. Even something very simple, like splitting a string, takes effort in Java / C, while you will see how easy it is in Python.
Using the help function that we used previously, you can get help for any Python object and any function. Let's have some more examples with the other most commonly used data type strings:
- Split: This is a method to split the string based on some delimiters. If no argument is provided it assumes whitespace as delimiter.
>>> mystring="Monty Python ! And the holy Grail ! \n" >>> print mystring.split() ['Monty', 'Python', '!', 'and', 'the', 'holy', 'Grail', '!']
- Strip: This is a method that can remove trailing whitespace, like '\n', '\n\r' from the string:
>>> print mystring.strip() >>>Monty Python ! and the holy Grail !
If you notice the '\n' character is stripped off. There are also methods like
rstrip()
andlstrip()
to strip trailing whitespaces to the right and left of the string. - Upper/Lower: We can change the case of the string using these methods:
>>> print (mystring.upper() >>>MONTY PYTHON !AND THE HOLY GRAIL !
- Replace: This will help you substitute a substring from the string:
>>> print mystring.replace('!','''''') >>> Monty Python and the holy Grail
There are tons of string functions. I have just talked about some of the most frequently used.
Note
Please look the following link for more functions and examples:
Regular expressions
One other important skill for an NLP enthusiast is working with regular expression. Regular expression is effectively pattern matching on strings. We heavily use pattern extrication to get meaningful information from large amounts of messy text data. The following are all the regular expressions you need. I haven't used any regular expressions beyond these in my entire life:
(a period)
: This expression matches any single character except newline\n
.\w
: This expression will match a character or a digit equivalent to [a-z A-Z 0-9]- \W (upper case W) matches any non-word character.
\s
: This expression (lowercase s) matches a single whitespace character - space, newline, return, tab, form [\n\r\t\f
].\S
: This expression matches any non-whitespace character.\t
: This expression performs a tab operation.\n
: This expression is used for a newline character.\r
: This expression is used for a return character.\d
: Decimal digit [0-9].^
: This expression is used at the start of the string.$
: This expression is used at the end of the string.\
: This expression is used to nullify the specialness of the special character. For example, you want to match the$
symbol, then add\
in front of it.
Let's search for something in the running example, where mystring
is the same string object, and we will try to look for some patterns in that. A substring search is one of the common use-cases of the re
module. Let's implement this:
>>># We have to import re module to use regular expression >>>import re >>>if re.search('Python',mystring): >>> print "We found python " >>>else: >>> print "NO "
Once this is executed, we get the message as follows:
We found python
We can do more pattern finding using regular expressions. One of the common functions that is used in finding all the patterns in a string is findall
. It will look for the given patterns in the string, and will give you a list of all the matched objects:
>>>import re >>>print re.findall('!',mystring) ['!', '!']
As we can see there were two instances of the "!
" in the mystring
and findall
return both object as a list.
Dictionaries
The other most commonly used data structure is dictionaries, also known as associative arrays/memories in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; such as strings and numbers can always be keys.
Dictionaries are handy data structure that used widely across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's so easy to work around dictionary, and the great thing is that with few nuggets of code you can build a very complex data structure, while the same task can take so much time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than the data structure itself.
I am using one of the very common use cases of dictionaries to get the frequency distribution of words in a given text. With just few lines of the following code, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:
>>># declare a dictionary >>>word_freq={} >>>for tok in string.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>>print word_freq {'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
Writing functions
As any other programming langauge Python also has its way of writing functions. Function in Python start with keyword def
followed by the function name and parentheses ()
. Similar to any other programming language any arguments and the type of the argument should be placed within these parentheses. The actual code starts with (:) colon symbol
. The initial lines of the code are typically doc string (comments), then we have code body and function ends with a return statement. For example in the given example the function wordfreq
start with def
keyword, there is no argument to this function and the function ends with a return statement.
>>>import sys >>>def wordfreq (mystring): >>> ''' >>> Function to generated the frequency distribution of the given text >>> ''' >>> print mystring >>> word_freq={}
 >>> for tok in mystring.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>> print word_freq >>>def main(): >>> str="This is my fist python program" >>> wordfreq(str) >>>if __name__ == '__main__': >>> main()
This was the same code that we wrote in the previous section the idea of writing in a form of function is to make the code re-usable and readable. The interpreter style of writing Python is also very common but for writing big programes it will be a good practice to use function/classes and one of the programming paradigm. We also wanted the user to write and run first Python program. You need to follow these steps to achive this.
- Open an empty python file
mywordfreq.py
in your prefered text editor. - Write/Copy the code above in the code snippet to the file.
- Open the command prompt in your Operating system.
- Run following command prompt:
$ python mywordfreq,py "This is my fist python program !!"
- Output should be:
{'This': 1, 'is': 1, 'python': 1, 'fist': 1, 'program': 1, 'my': 1}
Now you have a very basic understanding about some common data-structures that python provides. You can write a full Python program and able to run that. I think this is good enough I think with this much of an introduction to Python you can manage for the initial chapters.
Note
Please have a look at some Python tutorials on the following website to learn more commands on Python:
Helping yourself
The best way to learn more about different data types and functions is to use help functions like help()
and dir(lst)
.
The dir(python object)
command is used to list all the given attributes of the given Python object. Like if you pass a list object, it will list all the cool things you can do with lists:
>>>dir(lst) >>>' , '.join(dir(lst)) '__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __delslice__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __getslice__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __setslice__ , __sizeof__ , __str__ , __subclasshook__ , append , count , extend , index , insert , pop , remove , reverse , sort'
With the help(python object)
command, we can get detailed documentation for the given Python object, and also give a few examples of how to use the Python object:
>>>help(lst.index) Help on built-in function index: index(...) L.index(value, [start, [stop]]) -> integer -- return first index of value. This function raises a ValueError if the value is not present.
So help
and dir
can be used on any Python data type, and are a very nice way to learn about the function and other details of that object. It also provides you with some basic examples to work with, which I found useful in most cases.
Strings in Python are very similar to other languages, but the manipulation of strings is one of the main features of Python. It's immensely easy to work with strings in Python. Even something very simple, like splitting a string, takes effort in Java / C, while you will see how easy it is in Python.
Using the help function that we used previously, you can get help for any Python object and any function. Let's have some more examples with the other most commonly used data type strings:
- Split: This is a method to split the string based on some delimiters. If no argument is provided it assumes whitespace as delimiter.
>>> mystring="Monty Python ! And the holy Grail ! \n" >>> print mystring.split() ['Monty', 'Python', '!', 'and', 'the', 'holy', 'Grail', '!']
- Strip: This is a method that can remove trailing whitespace, like '\n', '\n\r' from the string:
>>> print mystring.strip() >>>Monty Python ! and the holy Grail !
If you notice the '\n' character is stripped off. There are also methods like
rstrip()
andlstrip()
to strip trailing whitespaces to the right and left of the string. - Upper/Lower: We can change the case of the string using these methods:
>>> print (mystring.upper() >>>MONTY PYTHON !AND THE HOLY GRAIL !
- Replace: This will help you substitute a substring from the string:
>>> print mystring.replace('!','''''') >>> Monty Python and the holy Grail
There are tons of string functions. I have just talked about some of the most frequently used.
Note
Please look the following link for more functions and examples:
Regular expressions
One other important skill for an NLP enthusiast is working with regular expression. Regular expression is effectively pattern matching on strings. We heavily use pattern extrication to get meaningful information from large amounts of messy text data. The following are all the regular expressions you need. I haven't used any regular expressions beyond these in my entire life:
(a period)
: This expression matches any single character except newline\n
.\w
: This expression will match a character or a digit equivalent to [a-z A-Z 0-9]- \W (upper case W) matches any non-word character.
\s
: This expression (lowercase s) matches a single whitespace character - space, newline, return, tab, form [\n\r\t\f
].\S
: This expression matches any non-whitespace character.\t
: This expression performs a tab operation.\n
: This expression is used for a newline character.\r
: This expression is used for a return character.\d
: Decimal digit [0-9].^
: This expression is used at the start of the string.$
: This expression is used at the end of the string.\
: This expression is used to nullify the specialness of the special character. For example, you want to match the$
symbol, then add\
in front of it.
Let's search for something in the running example, where mystring
is the same string object, and we will try to look for some patterns in that. A substring search is one of the common use-cases of the re
module. Let's implement this:
>>># We have to import re module to use regular expression >>>import re >>>if re.search('Python',mystring): >>> print "We found python " >>>else: >>> print "NO "
Once this is executed, we get the message as follows:
We found python
We can do more pattern finding using regular expressions. One of the common functions that is used in finding all the patterns in a string is findall
. It will look for the given patterns in the string, and will give you a list of all the matched objects:
>>>import re >>>print re.findall('!',mystring) ['!', '!']
As we can see there were two instances of the "!
" in the mystring
and findall
return both object as a list.
Dictionaries
The other most commonly used data structure is dictionaries, also known as associative arrays/memories in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; such as strings and numbers can always be keys.
Dictionaries are handy data structure that used widely across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's so easy to work around dictionary, and the great thing is that with few nuggets of code you can build a very complex data structure, while the same task can take so much time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than the data structure itself.
I am using one of the very common use cases of dictionaries to get the frequency distribution of words in a given text. With just few lines of the following code, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:
>>># declare a dictionary >>>word_freq={} >>>for tok in string.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>>print word_freq {'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
Writing functions
As any other programming langauge Python also has its way of writing functions. Function in Python start with keyword def
followed by the function name and parentheses ()
. Similar to any other programming language any arguments and the type of the argument should be placed within these parentheses. The actual code starts with (:) colon symbol
. The initial lines of the code are typically doc string (comments), then we have code body and function ends with a return statement. For example in the given example the function wordfreq
start with def
keyword, there is no argument to this function and the function ends with a return statement.
>>>import sys >>>def wordfreq (mystring): >>> ''' >>> Function to generated the frequency distribution of the given text >>> ''' >>> print mystring >>> word_freq={}
 >>> for tok in mystring.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>> print word_freq >>>def main(): >>> str="This is my fist python program" >>> wordfreq(str) >>>if __name__ == '__main__': >>> main()
This was the same code that we wrote in the previous section the idea of writing in a form of function is to make the code re-usable and readable. The interpreter style of writing Python is also very common but for writing big programes it will be a good practice to use function/classes and one of the programming paradigm. We also wanted the user to write and run first Python program. You need to follow these steps to achive this.
- Open an empty python file
mywordfreq.py
in your prefered text editor. - Write/Copy the code above in the code snippet to the file.
- Open the command prompt in your Operating system.
- Run following command prompt:
$ python mywordfreq,py "This is my fist python program !!"
- Output should be:
{'This': 1, 'is': 1, 'python': 1, 'fist': 1, 'program': 1, 'my': 1}
Now you have a very basic understanding about some common data-structures that python provides. You can write a full Python program and able to run that. I think this is good enough I think with this much of an introduction to Python you can manage for the initial chapters.
Note
Please have a look at some Python tutorials on the following website to learn more commands on Python:
Regular expressions
One other important skill for an NLP enthusiast is working with regular expression. Regular expression is effectively pattern matching on strings. We heavily use pattern extrication to get meaningful information from large amounts of messy text data. The following are all the regular expressions you need. I haven't used any regular expressions beyond these in my entire life:
(a period)
: This expression matches any single character except newline\n
.\w
: This expression will match a character or a digit equivalent to [a-z A-Z 0-9]- \W (upper case W) matches any non-word character.
\s
: This expression (lowercase s) matches a single whitespace character - space, newline, return, tab, form [\n\r\t\f
].\S
: This expression matches any non-whitespace character.\t
: This expression performs a tab operation.\n
: This expression is used for a newline character.\r
: This expression is used for a return character.\d
: Decimal digit [0-9].^
: This expression is used at the start of the string.$
: This expression is used at the end of the string.\
: This expression is used to nullify the specialness of the special character. For example, you want to match the$
symbol, then add\
in front of it.
Let's search for something in the running example, where mystring
is the same string object, and we will try to look for some patterns in that. A substring search is one of the common use-cases of the re
module. Let's implement this:
>>># We have to import re module to use regular expression >>>import re >>>if re.search('Python',mystring): >>> print "We found python " >>>else: >>> print "NO "
Once this is executed, we get the message as follows:
We found python
We can do more pattern finding using regular expressions. One of the common functions that is used in finding all the patterns in a string is findall
. It will look for the given patterns in the string, and will give you a list of all the matched objects:
>>>import re >>>print re.findall('!',mystring) ['!', '!']
As we can see there were two instances of the "!
" in the mystring
and findall
return both object as a list.
Dictionaries
The other most commonly used data structure is dictionaries, also known as associative arrays/memories in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; such as strings and numbers can always be keys.
Dictionaries are handy data structure that used widely across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's so easy to work around dictionary, and the great thing is that with few nuggets of code you can build a very complex data structure, while the same task can take so much time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than the data structure itself.
I am using one of the very common use cases of dictionaries to get the frequency distribution of words in a given text. With just few lines of the following code, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:
>>># declare a dictionary >>>word_freq={} >>>for tok in string.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>>print word_freq {'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
Writing functions
As any other programming langauge Python also has its way of writing functions. Function in Python start with keyword def
followed by the function name and parentheses ()
. Similar to any other programming language any arguments and the type of the argument should be placed within these parentheses. The actual code starts with (:) colon symbol
. The initial lines of the code are typically doc string (comments), then we have code body and function ends with a return statement. For example in the given example the function wordfreq
start with def
keyword, there is no argument to this function and the function ends with a return statement.
>>>import sys >>>def wordfreq (mystring): >>> ''' >>> Function to generated the frequency distribution of the given text >>> ''' >>> print mystring >>> word_freq={}
 >>> for tok in mystring.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>> print word_freq >>>def main(): >>> str="This is my fist python program" >>> wordfreq(str) >>>if __name__ == '__main__': >>> main()
This was the same code that we wrote in the previous section the idea of writing in a form of function is to make the code re-usable and readable. The interpreter style of writing Python is also very common but for writing big programes it will be a good practice to use function/classes and one of the programming paradigm. We also wanted the user to write and run first Python program. You need to follow these steps to achive this.
- Open an empty python file
mywordfreq.py
in your prefered text editor. - Write/Copy the code above in the code snippet to the file.
- Open the command prompt in your Operating system.
- Run following command prompt:
$ python mywordfreq,py "This is my fist python program !!"
- Output should be:
{'This': 1, 'is': 1, 'python': 1, 'fist': 1, 'program': 1, 'my': 1}
Now you have a very basic understanding about some common data-structures that python provides. You can write a full Python program and able to run that. I think this is good enough I think with this much of an introduction to Python you can manage for the initial chapters.
Note
Please have a look at some Python tutorials on the following website to learn more commands on Python:
Dictionaries
The other most commonly used data structure is dictionaries, also known as associative arrays/memories in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; such as strings and numbers can always be keys.
Dictionaries are handy data structure that used widely across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's so easy to work around dictionary, and the great thing is that with few nuggets of code you can build a very complex data structure, while the same task can take so much time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than the data structure itself.
I am using one of the very common use cases of dictionaries to get the frequency distribution of words in a given text. With just few lines of the following code, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:
>>># declare a dictionary >>>word_freq={} >>>for tok in string.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>>print word_freq {'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}
Writing functions
As any other programming langauge Python also has its way of writing functions. Function in Python start with keyword def
followed by the function name and parentheses ()
. Similar to any other programming language any arguments and the type of the argument should be placed within these parentheses. The actual code starts with (:) colon symbol
. The initial lines of the code are typically doc string (comments), then we have code body and function ends with a return statement. For example in the given example the function wordfreq
start with def
keyword, there is no argument to this function and the function ends with a return statement.
>>>import sys >>>def wordfreq (mystring): >>> ''' >>> Function to generated the frequency distribution of the given text >>> ''' >>> print mystring >>> word_freq={}
 >>> for tok in mystring.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>> print word_freq >>>def main(): >>> str="This is my fist python program" >>> wordfreq(str) >>>if __name__ == '__main__': >>> main()
This was the same code that we wrote in the previous section the idea of writing in a form of function is to make the code re-usable and readable. The interpreter style of writing Python is also very common but for writing big programes it will be a good practice to use function/classes and one of the programming paradigm. We also wanted the user to write and run first Python program. You need to follow these steps to achive this.
- Open an empty python file
mywordfreq.py
in your prefered text editor. - Write/Copy the code above in the code snippet to the file.
- Open the command prompt in your Operating system.
- Run following command prompt:
$ python mywordfreq,py "This is my fist python program !!"
- Output should be:
{'This': 1, 'is': 1, 'python': 1, 'fist': 1, 'program': 1, 'my': 1}
Now you have a very basic understanding about some common data-structures that python provides. You can write a full Python program and able to run that. I think this is good enough I think with this much of an introduction to Python you can manage for the initial chapters.
Note
Please have a look at some Python tutorials on the following website to learn more commands on Python:
Writing functions
As any other programming langauge Python also has its way of writing functions. Function in Python start with keyword def
followed by the function name and parentheses ()
. Similar to any other programming language any arguments and the type of the argument should be placed within these parentheses. The actual code starts with (:) colon symbol
. The initial lines of the code are typically doc string (comments), then we have code body and function ends with a return statement. For example in the given example the function wordfreq
start with def
keyword, there is no argument to this function and the function ends with a return statement.
>>>import sys >>>def wordfreq (mystring): >>> ''' >>> Function to generated the frequency distribution of the given text >>> ''' >>> print mystring >>> word_freq={}
 >>> for tok in mystring.split(): >>> if tok in word_freq: >>> word_freq [tok]+=1 >>> else: >>> word_freq [tok]=1 >>> print word_freq >>>def main(): >>> str="This is my fist python program" >>> wordfreq(str) >>>if __name__ == '__main__': >>> main()
This was the same code that we wrote in the previous section the idea of writing in a form of function is to make the code re-usable and readable. The interpreter style of writing Python is also very common but for writing big programes it will be a good practice to use function/classes and one of the programming paradigm. We also wanted the user to write and run first Python program. You need to follow these steps to achive this.
- Open an empty python file
mywordfreq.py
in your prefered text editor. - Write/Copy the code above in the code snippet to the file.
- Open the command prompt in your Operating system.
- Run following command prompt:
$ python mywordfreq,py "This is my fist python program !!"
- Output should be:
{'This': 1, 'is': 1, 'python': 1, 'fist': 1, 'program': 1, 'my': 1}
Now you have a very basic understanding about some common data-structures that python provides. You can write a full Python program and able to run that. I think this is good enough I think with this much of an introduction to Python you can manage for the initial chapters.
Note
Please have a look at some Python tutorials on the following website to learn more commands on Python:
Diving into NLTK
Instead of going further into the theoretical aspects of natural language processing, let's start with a quick dive into NLTK. I am going to start with some basic example use cases of NLTK. There is a good chance that you have already done something similar. First, I will give a typical Python programmer approach, and then move on to NLTK for a much more efficient, robust, and clean solution.
We will start analyzing with some example text content. For the current example, I have taken the content from Python's home page.
>>>import urllib2 >>># urllib2 is use to download the html content of the web link >>>response = urllib2.urlopen('http://python.org/') >>># You can read the entire content of a file using read() method >>>html = response.read() >>>print len(html) 47020
We don't have any clue about the kind of topics that are discussed in this URL, so let's start with an exploratory data analysis (EDA). Typically in a text domain, EDA can have many meanings, but will go with a simple case of what kinds of terms dominate the document. What are the topics? How frequent they are? The process will involve some level of preprocessing steps. We will try to do this first in a pure Python way, and then we will do it using NLTK.
Let's start with cleaning the html tags. One ways to do this is to select just the tokens
, including numbers and character. Anybody who has worked with regular expression should be able to convert html string into list of tokens
:
>>># Regular expression based split the string >>>tokens = [tok for tok in html.split()] >>>print "Total no of tokens :"+ str(len(tokens)) >>># First 100 tokens >>>print tokens[0:100] Total no of tokens :2860 ['<!doctype', 'html>', '<!--[if', 'lt', 'IE', '7]>', '<html', 'class="no-js', 'ie6', 'lt-ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '7]>', '<html', 'class="no-js', 'ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', ''type="text/css"', 'media="not', 'print,', 'braille,' ...]
As you can see, there is an excess of html tags and other unwanted characters when we use the preceding method. A cleaner version of the same task will look something like this:
>>>import re >>># using the split function >>>#https://docs.python.org/2/library/re.html >>>tokens = re.split('\W+',html) >>>print len(tokens) >>>print tokens[0:100] 5787 ['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language', 'meta', 'name', 'apple' ...]
This looks much cleaner now. But still you can do more; I leave it to you to try to remove as much noise as you can. You can clean some HTML tags that are still popping up, You probably also want to look for word length as a criteria and remove words that have a length one—it will remove elements like 7
, 8
, and so on, which are just noise in this case. Now instead writing some of these preprocessing steps from scratch let's move to NLTK for the same task. There is a function called clean_html()
that can do all the cleaning that we were looking for:
>>>import nltk >>># http://www.nltk.org/api/nltk.html#nltk.util.clean_html >>>clean = nltk.clean_html(html) >>># clean will have entire string removing all the html noise >>>tokens = [tok for tok in clean.split()] >>>print tokens[:100] ['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '▲', 'The', 'Python', 'Network', '≡', 'Menu', 'Arts', 'Business' ...]
Cool, right? This definitely is much cleaner and easier to do.
Let's try to get the frequency distribution of these terms. First, let's do it the Pure Python way, then I will tell you the NLTK recipe.
>>>import operator >>>freq_dis={} >>>for tok in tokens: >>> if tok in freq_dis: >>> freq_dis[tok]+=1 >>> else: >>> freq_dis[tok]=1 >>># We want to sort this dictionary on values ( freq in this case ) >>>sorted_freq_dist= sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True) >>> print sorted_freq_dist[:25] [('Python', 55), ('>>>', 23), ('and', 21), ('to', 18), (',', 18), ('the', 14), ('of', 13), ('for', 12), ('a', 11), ('Events', 11), ('News', 11), ('is', 10), ('2014-', 10), ('More', 9), ('#', 9), ('3', 9), ('=', 8), ('in', 8), ('with', 8), ('Community', 7), ('The', 7), ('Docs', 6), ('Software', 6), (':', 6), ('3:', 5), ('that', 5), ('sum', 5)]
Naturally, as this is Python's home page, Python and the (>>>)
interpreter symbol are the most common terms, also giving a sense of the website.
A better and more efficient approach is to use NLTK's FreqDist()
function. For this, we will take a look at the same code we developed before:
>>>import nltk >>>Freq_dist_nltk=nltk.FreqDist(tokens) >>>print Freq_dist_nltk >>>for k,v in Freq_dist_nltk.items(): >>> print str(k)+':'+str(v) <FreqDist: 'Python': 55, '>>>': 23, 'and': 21, ',': 18, 'to': 18, 'the': 14, 'of': 13, 'for': 12, 'Events': 11, 'News': 11, ...> Python:55 >>>:23 and:21 ,:18 to:18 the:14 of:13 for:12 Events:11 News:11
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Let's now do some more funky things. Let's plot this:
>>>Freq_dist_nltk.plot(50, cumulative=False) >>># below is the plot for the frequency distributions
We can see that the cumulative frequency is growing, and at some point the curve is going into long tail. Still, there is some noise, there are words like the
, of
, for
, and =
. These are useless words, and there is a terminology for them. These words are stop words; words like the
, a
, an
, and so on. Article pronouns are generally present in most of the documents, hence they are not discriminative enough to be informative. In most of the NLP and information retrieval tasks, people generally remove stop words. Let's go back again to our running example:
>>>stopwords=[word.strip().lower() for word in open("PATH/english.stop.txt")] >>>clean_tokens=[tok for tok in tokens if len(tok.lower())>1 and (tok.lower() not in stopwords)] >>>Freq_dist_nltk=nltk.FreqDist(clean_tokens) >>>Freq_dist_nltk.plot(50, cumulative=False)
Note
Please go to http://www.wordle.net/advanced for more word clouds.
Looks much cleaner now! After finishing this much, you can go to wordle and put the distribution in a form of a CSV and you should be able to get something like this word cloud:
Your turn
- Please try the same exercise for different URLs.
- Try to reach the word cloud.
Summary
To summarize, this chapter was intended to give you a brief introduction to Natural Language Processing. The book does assume some background in NLP and programming in Python, but we have tried to give a very quick head start to Python and NLP. We have installed all the related packages that are require for us to work with NLTK. We wanted to give you, with a few simple lines of code, an idea of how to use NLTK. We were able to deliver an amazing word cloud, which is a great way of visualizing the topics in a large amount of unstructured text, and is quite popular in the industry for text analytics. I think the goal was to set up everything around NLTK, and to get Python working smoothly on your system. You should also be able to write and run basic Python programs. I wanted the reader to feel the power of the NLTK library, and build a small running example that will involve a basic application around word cloud. If the reader is able to generate the word cloud, I think we were successful.
In the next few chapters, we will learn more about Python as a language, and its features related to process natural language. We will explore some of the basic NLP preprocessing steps and learn about some of basic concepts related to NLP.