The curse of dimensionality
Topic modeling and document clustering are common text mining tasks, but text data can be very high-dimensional, which gives rise to a phenomenon called the curse of dimensionality. Some literature also calls it the concentration of measure. Its main effects are:
- Distance measures aggregate over all the dimensions and assume that each of them has the same effect on the distance. The more dimensions there are, the more similar all items appear to each other.
- The similarity measures do not take into account the association of attributes, which may result in inaccurate distance estimation.
- The number of samples required per attribute increases exponentially with the increase in dimensions.
- Many dimensions may be highly correlated with each other, which causes multicollinearity.
- Extra dimensions cause a rapid increase in volume, which can result in high sparsity, a major issue for any method that requires statistical significance. It also causes high variance in estimates, near duplicates...
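
The claim that items look more alike as dimensions grow can be checked with a small experiment. The sketch below is only an illustration under stated assumptions: it uses uniformly random points in the unit hypercube rather than real document vectors, Euclidean distance, and a helper name, relative_contrast, that is made up here. It measures how far the farthest point is from a query relative to the nearest one; a value near zero means all points are almost equally distant.

```python
import numpy as np


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in the unit hypercube.

    Hypothetical helper for illustration; the data is synthetic."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in [0, 1)^n_dims
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


if __name__ == "__main__":
    # The contrast shrinks as the number of dimensions grows.
    for n_dims in (2, 10, 100, 1000):
        print(n_dims, relative_contrast(1000, n_dims))
```

With a fixed number of points, the printed contrast should fall sharply as the dimension increases, which is why nearest-neighbor style similarity becomes less informative for very wide text representations.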