Training Naive Bayes
Now that we have extracted the blog posts, we can train our Naive Bayes model on them. The intuition is that we record the probability of each word being written by a particular gender and store these values in our model. To classify a new sample, we multiply together the probabilities of its words under each gender and choose the most likely gender.
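To make that intuition concrete, the following is a minimal sketch of the classification step, not the implementation used later in this chapter. It assumes a model dictionary shaped like the output file described next (each word maps to a dictionary of per-gender probabilities), and it sums log-probabilities, with a small smoothing fallback for unseen words, rather than multiplying raw probabilities, purely to avoid floating-point underflow.

from math import log
from operator import itemgetter


def naive_bayes_predict(model, document_words, genders=("male", "female")):
    """Choose the gender whose per-word probabilities give the highest product.

    `model` is assumed to map each word to a dict of per-gender probabilities,
    e.g. {"'an": {"male": 0.003, "female": 0.004}}. Summing logs is equivalent
    to multiplying the probabilities, but avoids numeric underflow; the small
    smoothing value stands in for unseen (word, gender) pairs so a single
    missing word cannot zero out the whole product.
    """
    smoothing = 1e-6
    scores = {}
    for gender in genders:
        scores[gender] = sum(log(model.get(word, {}).get(gender, smoothing))
                             for word in document_words)
    # Return the gender with the highest (log) score
    return max(scores.items(), key=itemgetter(1))[0]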
The aim of this code is to output a file that lists each word in the corpus, along with the frequencies of that word for each gender. The output file will look something like this:
"'ailleurs" {"female": 0.003205128205128205} "'air" {"female": 0.003205128205128205} "'an" {"male": 0.0030581039755351682, "female": 0.004273504273504274} "'angoisse" {"female": 0.003205128205128205} "'apprendra" {"male": 0.0013047113868622459, "female": 0.0014172668603481887} "'attendent" {"female": 0.00641025641025641} "'autistic" {"male": 0.002150537634408602} "'auto" {"female": 0.003205128205128205} "'avais" {"female": 0.00641025641025641} "'avait" {"female": 0.004273504273504274...