Explaining the preprocessing pipeline
We will now walk through the complete preprocessing pipeline that we have provided for you.
As shown in the following code, the input is formatted text with encoded tags, similar to what we might extract from HTML web pages:
"<SUBJECT LINE> Employees details<END><BODY TEXT>Attached are 2 files,\n1st one is pairoll, 2nd is healtcare!<END>"
Let’s take a look at the effect of applying each step to the text:
- Decode/remove encoding:
Employees details. Attached are 2 files, 1st one is pairoll, 2nd is healtcare!
- Lowercasing:
employees details. attached are 2 files, 1st one is pairoll, 2nd is healtcare!
- Digits to words:
employees details. attached are two files, first one is pairoll, second is healtcare!
- Remove punctuation and other special characters:
employees details attached are two files first one is pairoll second is healtcare
- Spelling corrections:
employees details...
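To make the first four steps concrete, here is a minimal Python sketch that uses only the standard `re` module. It is not the pipeline provided with this chapter: the function names (`strip_tags`, `digits_to_words`, `remove_punctuation`) and the two small lookup tables are assumptions made for illustration, and they cover only the numbers that appear in this example. Spelling correction is left out because it needs a dictionary or a dedicated library.

```python
import re

RAW = ("<SUBJECT LINE> Employees details<END>"
       "<BODY TEXT>Attached are 2 files,\n1st one is pairoll, 2nd is healtcare!<END>")

# Hypothetical lookup tables, only large enough for this example.
CARDINALS = {"1": "one", "2": "two", "3": "three"}
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third"}


def strip_tags(text):
    """Decode/remove encoding: drop the tags and join the fields."""
    # An <END> that is directly followed by another tag separates two
    # fields, so it is turned into a sentence break.
    text = re.sub(r"<END>\s*(?=<)", ". ", text)
    # Every remaining tag is replaced by a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Newlines and repeated spaces collapse into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def digits_to_words(text):
    """Digits to words: ordinals first, then plain numbers."""
    text = re.sub(r"\b\d+(?:st|nd|rd|th)\b",
                  lambda m: ORDINALS.get(m.group(), m.group()), text)
    return re.sub(r"\b\d+\b",
                  lambda m: CARDINALS.get(m.group(), m.group()), text)


def remove_punctuation(text):
    """Remove punctuation and other special characters."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


# Apply the steps in the same order as in the walkthrough above.
text = RAW
for step in (strip_tags, str.lower, digits_to_words, remove_punctuation):
    text = step(text)
    print(f"{step.__name__}: {text}")
```

Run on the sample input, each printed line should match the corresponding intermediate result listed above, up to and including the punctuation-removal step. Numbers missing from the lookup tables are left unchanged, so a real pipeline would need a more general number-to-words conversion.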