Understanding the datasets
Here, we are using the following two datasets:
The scraped dataset
The job recommendation challenge dataset
Let's start with the scraped dataset.
Scraped dataset
For this dataset, we have scraped dummy resumes from indeed.com (we are using this data only for learning and research purposes). We download the users' resumes in PDF format, and these become our dataset. The code for this is available at this GitHub link: https://github.com/jalajthanaki/Basic_job_recommendation_engine/blob/master/indeed_scrap.py.
Take a look at the code given in the following screenshot:
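As a rough, hypothetical illustration of this kind of download step (this is not the code from the repository linked above), the sketch below saves a list of PDF URLs to a local folder. It assumes Python 3 and uses only the standard-library urllib.request module. The names fetch_bytes and download_resumes, the resume_<n>.pdf file naming, and the check on the leading "%PDF" bytes are all assumptions made for the example.

```python
import os
import urllib.request


def fetch_bytes(url, timeout=30):
    """Return the raw body of `url` (hypothetical helper built on urllib)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_resumes(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL's content as resume_<n>.pdf in dest_dir; return saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        data = fetch(url)
        # PDF files normally start with the bytes "%PDF"; skip anything else
        # (for example, an HTML error page returned instead of the resume).
        if not data.startswith(b"%PDF"):
            continue
        path = os.path.join(dest_dir, "resume_{}.pdf".format(index))
        with open(path, "wb") as handle:
            handle.write(data)
        saved.append(path)
    return saved
```

A call such as download_resumes(["https://example.com/resume.pdf"], "resumes") would write the file into a resumes folder; the URL here is only a placeholder. The fetch parameter makes it possible to swap in a different downloader (for instance, one based on the requests library) without changing the rest of the function.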
Using the preceding code, we can download the resumes. We have used the requests library and urllib to scrape the data. All these downloaded resumes are in PDF form, so we need to parse them. To parse the PDF documents, we will use a Python library called PDFminer. We need to extract the following data attributes from the PDF documents...