Implementing the information gain model
The problem with the information gain model is that, for each term in the index, we have to evaluate its co-occurrence with every other term. For n terms in the index, the complexity of the algorithm is of the order of the square of the number of terms, O(n²). Computing this on a single machine is not feasible. What is recommended instead is that we create a MapReduce job and use a distributed Hadoop cluster to compute the information gain for each term in the index.
Our distributed Hadoop cluster would do the following:
- Count all occurrences of each term in the index
- Count all occurrences of each co-occurring term in the index
- Construct a hash table or a map of co-occurring terms
- Calculate the information gain for each term and store it in a file in the Hadoop cluster
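The steps above can be sketched in a small, single-process simulation. This is not the Hadoop job itself, just an illustration of what the map and reduce phases would compute: term counts, a hash map of co-occurring pairs, and a score per pair. The toy documents are invented, and pointwise mutual information is used here as a stand-in for the information gain measure, since the exact formula is defined elsewhere:

```python
from collections import Counter
from itertools import combinations
from math import log2

# Toy "index": each entry is the set of terms in one document.
docs = [
    {"search", "lucene", "index"},
    {"search", "ranking", "index"},
    {"lucene", "ranking"},
]
n_docs = len(docs)

# Step 1: count all occurrences of each term
# (the map and reduce phases are collapsed into one Counter here).
term_counts = Counter(t for d in docs for t in d)

# Steps 2-3: count co-occurring term pairs and keep them in a hash map.
pair_counts = Counter(
    pair for d in docs for pair in combinations(sorted(d), 2)
)

# Step 4: score each co-occurring pair. Pointwise mutual information
# is used as an illustrative stand-in for the information gain measure.
def pmi(x, y):
    p_xy = pair_counts[tuple(sorted((x, y)))] / n_docs
    p_x = term_counts[x] / n_docs
    p_y = term_counts[y] / n_docs
    return log2(p_xy / (p_x * p_y)) if p_xy > 0 else 0.0

gain = {pair: pmi(*pair) for pair in pair_counts}
```

On the real cluster, each of these steps would be a map-reduce pass over the index, with the final scores written to a file on HDFS rather than held in memory.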
In order to implement this in our scoring algorithm, we will need to build a custom scorer in which the IDF calculation is overridden by a lookup that retrieves the information gain for the term from the output of the Hadoop cluster...
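The shape of that override can be sketched as follows. In practice this would be a custom Lucene similarity in Java; the Python below only illustrates the idea, and the gain table values and function names are hypothetical:

```python
from math import log

# Hypothetical table of precomputed information-gain scores, as it
# would be loaded from the Hadoop job's output file (term -> gain).
info_gain = {"lucene": 1.8, "ranking": 0.9}

def classic_idf(doc_freq, n_docs):
    # Classic Lucene-style IDF, kept as a fallback for terms that
    # have no precomputed gain.
    return 1.0 + log(n_docs / (doc_freq + 1))

def term_weight(term, doc_freq, n_docs):
    # Override the IDF component with the precomputed information
    # gain when one is available for this term.
    return info_gain.get(term, classic_idf(doc_freq, n_docs))
```

The key design choice is the fallback: terms missing from the precomputed table (new terms, or terms pruned by the Hadoop job) still receive a sensible classic IDF weight instead of scoring zero.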