Pointwise Mutual Information
PMI between two words is calculated using the following formula:
PMI(word1, word2) = log2( P(word1 & word2) / (P(word1) * P(word2)) )
Here, P(word) represents the number of occurrences of the word word in the entire document collection. The original article that proposed this idea used the number of articles returned for the search term word from the AltaVista search engine, but you can safely use a probability instead (the number of documents in which word appears divided by the total number of documents). The & operator in P(word1 & word2) refers to the number of documents containing both word1 and word2, divided by the total number of documents. The logarithm is conventionally taken in base 2; a different base only rescales the score.
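To make the count-based and probability-based views concrete, here is a small sketch that computes PMI directly from document counts. The function name `pmi_from_counts`, its parameters, and the base-2 logarithm are assumptions made for illustration, and the counts in the usage line are invented numbers, not search-engine data.

```python
import math

def pmi_from_counts(n_w1, n_w2, n_both, n_docs):
    """PMI (base 2) from document counts. All names here are hypothetical.

    n_w1, n_w2: number of documents containing each word
    n_both:     number of documents containing both words
    n_docs:     total number of documents
    """
    if n_docs <= 0 or min(n_w1, n_w2, n_both) <= 0:
        raise ValueError("all counts must be positive")
    # Turn the raw counts into probabilities.
    p_w1 = n_w1 / n_docs
    p_w2 = n_w2 / n_docs
    p_both = n_both / n_docs
    return math.log2(p_both / (p_w1 * p_w2))

# Invented counts: 1000 documents, 100 with word1, 50 with word2, 20 with both.
score = pmi_from_counts(100, 50, 20, 1000)  # approximately 2
```

A positive score means the two words appear together more often than they would if they were independent; a negative score means less often.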
The following function finds the probability of the word in a document collection represented by list:
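A minimal sketch of such a function, assuming the collection is a Python list in which each document is itself a list of tokens. The name `word_probability` is hypothetical, and the parameter is called `docs` rather than `list` to avoid shadowing the built-in.

```python
def word_probability(word, docs):
    """Fraction of documents in `docs` that contain `word`."""
    if not docs:
        return 0.0  # empty collection: no document can contain the word
    containing = sum(1 for doc in docs if word in doc)
    return containing / len(docs)

# Toy collection of four tokenized documents (invented for illustration).
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot"],
    ["good", "movie", "cast"],
]
p_good = word_probability("good", docs)  # "good" appears in 3 of 4 documents
```

Storing each document as a set instead of a list would make the membership test faster on large collections without changing the result.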
The following function finds the joint probability of the words w1 and w2, that is, the fraction of documents containing both, in a document collection represented by list:
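One way this could look, again assuming each document is a list of tokens. The name `joint_probability` and the parameter `docs` are hypothetical.

```python
def joint_probability(w1, w2, docs):
    """Fraction of documents in `docs` that contain both `w1` and `w2`."""
    if not docs:
        return 0.0  # empty collection
    both = sum(1 for doc in docs if w1 in doc and w2 in doc)
    return both / len(docs)

# Toy collection of four tokenized documents (invented for illustration).
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot"],
    ["good", "movie", "cast"],
]
p_good_movie = joint_probability("good", "movie", docs)  # 2 of 4 documents
```

Note that co-occurrence here means "anywhere in the same document"; a stricter window (for example, within a few tokens) would need a different test.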
The following function calculates the PMI between w1 and w2:
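A self-contained sketch of the PMI computation under the same assumptions (documents as lists of tokens, base-2 logarithm). The helper `doc_fraction`, the name `pmi`, and the choice to return negative infinity for words that never co-occur are illustrative decisions, not taken from the original listing.

```python
import math

def doc_fraction(docs, *words):
    """Fraction of documents in `docs` containing every word in `words`."""
    return sum(all(w in doc for w in words) for doc in docs) / len(docs)

def pmi(w1, w2, docs):
    """Pointwise mutual information (base 2) of w1 and w2 over `docs`."""
    if not docs:
        raise ValueError("document collection is empty")
    p_joint = doc_fraction(docs, w1, w2)
    if p_joint == 0:
        # The words never co-occur; log(0) is undefined, so use -inf
        # (a convention chosen for this sketch).
        return float("-inf")
    # p_joint > 0 guarantees both individual probabilities are non-zero.
    return math.log2(p_joint / (doc_fraction(docs, w1) * doc_fraction(docs, w2)))

# Toy collection of four tokenized documents (invented for illustration).
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot"],
    ["good", "movie", "cast"],
]
score_movie_cast = pmi("movie", "cast", docs)  # positive: they co-occur often
score_good_movie = pmi("good", "movie", docs)  # slightly negative
```

With very small collections the estimates are noisy, so in practice a smoothing constant is often added to the counts before taking the logarithm.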