The description in the job listing is still in HTML. To extract the valuable content from it, we need to parse the HTML and then perform tokenization, stop word removal, common word removal, and some tech 2-gram processing. Let's look at how to do these.
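As a rough illustration of the steps just listed, here is a minimal sketch that uses only the Python standard library. It is not the code from this recipe: the stop word list, the common word list, the 2-gram list, and all function names (html_to_text, tokenize, merge_2grams, clean_description) are hypothetical stand-ins, and a real pipeline would more likely rely on dedicated HTML parsing and NLP libraries.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny word lists used only for illustration.
STOP_WORDS = {"a", "an", "and", "are", "for", "the", "we", "with"}
COMMON_WORDS = {"looking", "skills"}
# Hypothetical tech 2-grams that should be kept together as one token.
TECH_2GRAMS = {("machine", "learning"), ("sql", "server")}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip the markup and return only the text content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase the text and split it into word-like tokens (keeps c++ and c#)."""
    return re.findall(r"[a-z0-9]+(?:[+#]+)?", text.lower())


def merge_2grams(tokens):
    """Join adjacent token pairs that appear in TECH_2GRAMS into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in TECH_2GRAMS:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def clean_description(html):
    """Run the steps in the order given above: parse, tokenize, filter, merge."""
    tokens = tokenize(html_to_text(html))
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in COMMON_WORDS]
    return merge_2grams(tokens)


if __name__ == "__main__":
    sample = ("<p>We are looking for a <b>SQL Server</b> and Python developer "
              "with machine learning skills.</p>")
    print(clean_description(sample))
```

Note that filtering before merging follows the order in which the steps are listed, but it can make words adjacent that were not adjacent in the original text. Merging the 2-grams before removing stop words is a reasonable alternative.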
Reading and cleaning the description in the job listing
Getting ready
I have consolidated the code for determining tech-based 2-grams into the 07/tech2grams.py file. We will use the tech_2grams function from that file.
How to do it...
The code...