Random forest
As with our motivation for using the Gower metric to handle mixed (indeed, messy) data, we can apply a random forest in an unsupervised fashion. Selecting this method offers several advantages:
- Robust against outliers and highly skewed variables
- No need to transform or scale the data
- Handles mixed data (numeric and factors)
- Can accommodate missing data
- Can be used on data with a large number of variables; in fact, it can help eliminate useless features through examination of variable importance
- The dissimilarity matrix produced serves as input to the other techniques discussed earlier (hierarchical, k-means, and PAM)
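The last point can be sketched end to end. This is a minimal illustration, not the text's own code: it assumes scikit-learn and SciPy as stand-ins for R's randomForest, and it builds the unsupervised setup in the usual way, creating a synthetic copy of the data by permuting each column independently, training a forest to separate real from synthetic rows, and converting the leaf-sharing proximities into a dissimilarity matrix that hierarchical clustering can consume. The toy two-group data set is invented for demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Toy numeric data: two loose groups (stand-in for your own data).
X_real = np.vstack([rng.normal(0, 1, (30, 4)),
                    rng.normal(4, 1, (30, 4))])

# "Unsupervised" trick: synthetic rows from independently permuted columns,
# which preserves marginal distributions but destroys joint structure.
X_synth = np.column_stack([rng.permutation(col) for col in X_real.T])
X = np.vstack([X_real, X_synth])
y = np.r_[np.ones(len(X_real)), np.zeros(len(X_synth))]

rf = RandomForestClassifier(n_estimators=2000, random_state=0).fit(X, y)

# Proximity: fraction of trees in which two real observations land
# in the same terminal node; dissimilarity is its complement.
leaves = rf.apply(X_real)                       # (n_obs, n_trees) leaf ids
prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)
dissim = 1.0 - prox
np.fill_diagonal(dissim, 0.0)

# The dissimilarity matrix feeds hierarchical clustering (PAM works too).
Z = linkage(squareform(dissim, checks=False), method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")
```

The same `dissim` matrix could be passed to a PAM implementation in place of the hierarchical linkage step.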
A couple of words of caution. It may take some trial and error to properly tune the random forest with respect to the number of variables sampled at each tree split (mtry in the function) and the number of trees grown. Studies show that growing more trees, up to a point, provides better results, and a good starting point is to grow 2,000 trees (Shi, T. &...
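The tuning loop described above can be sketched as follows. This is an assumed scikit-learn translation, not the text's R code: `max_features` plays the role of R's `mtry`, and the out-of-bag error (via `oob_score=True`) serves as the tuning criterion; the data set here is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Fabricated example data: 200 rows, 8 features, a simple binary label.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Try several mtry values (max_features in scikit-learn), growing the
# 2,000 trees suggested as a starting point, and compare OOB error.
errors = {}
for mtry in (1, 2, 4, 8):
    rf = RandomForestClassifier(n_estimators=2000, max_features=mtry,
                                oob_score=True, random_state=0).fit(X, y)
    errors[mtry] = 1.0 - rf.oob_score_   # OOB error rate per mtry
print(errors)
```

Picking the `max_features` value with the lowest out-of-bag error mirrors the trial-and-error tuning of `mtry` the text recommends.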