Further cleanup
There are still some small, distracting glitches in the wordlist. Maybe we do not really want to keep numbers in the package descriptions at all (or we might want to replace all numbers with a placeholder text, such as NUM), and there are some frequent technical words that can be ignored as well, for example, package. Showing the plural version of nouns is also redundant. Let's improve our corpus with some further tweaks, step by step!
Removing the numbers from the package descriptions is fairly straightforward, following the pattern of the previous examples:
> v <- tm_map(v, removeNumbers)
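The placeholder alternative mentioned earlier could be sketched along the following lines. This is only an illustration on a small toy corpus, not the package descriptions themselves: the docs vector and the toNUM helper are hypothetical names introduced here, and the sketch assumes that wrapping gsub in content_transformer is an acceptable way to define a custom transformation for tm_map:

```r
library(tm)

# Toy corpus standing in for the package descriptions (hypothetical data)
docs <- c("Version 2 of the parser", "Fits 10 models in 1 call")
toy <- VCorpus(VectorSource(docs))

# Option 1: drop the digits entirely, as in the main text
no_num <- tm_map(toy, removeNumbers)

# Option 2: replace each run of digits with the NUM placeholder
# (toNUM is a hypothetical helper, not part of tm)
toNUM <- content_transformer(function(x) gsub("[0-9]+", "NUM", x))
with_num <- tm_map(toy, toNUM)

# Compare the first document under both approaches
content(no_num[[1]])
content(with_num[[1]])
```

Keeping a placeholder preserves the information that a number was present in the description, which might matter if numeric mentions are themselves of interest; simply removing the digits keeps the wordlist smaller.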
To remove some frequent domain-specific words that carry little meaning, let's look at the most common words in the documents. To this end, we first have to compute a TermDocumentMatrix object, which can later be passed to the findFreqTerms function to identify the most popular terms in the corpus, based on frequency:
> tdm <- TermDocumentMatrix(v)
This object is basically a matrix which...