Preprocessing and cleaning in R
Preprocessing and cleaning are the first and most basic steps in any data-mining problem. A learning algorithm applied to a unified, cleaned dataset can not only run faster but also produce more accurate results. The first steps involve annotating the target data in the case of classification problems, and understanding the feature vector space so that an appropriate distance measure can be applied in the case of clustering problems. Identifying noise samples and cleaning them up is a tricky task, but the better it is done, the more accurate the results can be expected to be. As mentioned previously, you need to be careful in cleaning tasks, because overly aggressive cleaning can lead to the rejection of good samples. Furthermore, the preprocessing steps need to be reversible, because at the end of the exercise the results must be mapped back to the original sample space for them to make sense.
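The reversibility requirement can be illustrated with a small sketch in base R. It assumes that the preprocessing step is column-wise standardization with scale(); the matrix x and its values are hypothetical and serve only as an illustration. Other transformations would need their own inverse, but the idea is the same: store the parameters of the transformation so that it can be undone later.

```r
# Toy feature matrix (hypothetical values, for illustration only)
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40),
            ncol = 2,
            dimnames = list(NULL, c("f1", "f2")))

# Standardize each column to zero mean and unit standard deviation
x_scaled <- scale(x)

# scale() records the parameters it used as attributes; keep them
centers <- attr(x_scaled, "scaled:center")
scales  <- attr(x_scaled, "scaled:scale")

# Reverse the transformation: multiply each column by its scale,
# then add its center back
x_restored <- sweep(sweep(x_scaled, 2, scales, `*`), 2, centers, `+`)

# Drop the bookkeeping attributes so the result is a plain matrix again
attr(x_restored, "scaled:center") <- NULL
attr(x_restored, "scaled:scale")  <- NULL

# Check that the round trip recovers the original data
# (up to floating-point tolerance)
stopifnot(isTRUE(all.equal(x_restored, x)))
```

The same stored centers and scales could be used to map model outputs, such as cluster centers computed on the standardized data, back to the original units.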