Introducing scikit-learn with a PCA example
Principal component analysis (PCA) is a statistical procedure that reduces a set of possibly correlated variables to a smaller set of linearly uncorrelated variables, the principal components. In Chapter 6, we saw a PCA implementation that relied on an external application. In this recipe, we will implement the same PCA for population genetics but will use the scikit-learn library. Scikit-learn is one of the fundamental Python libraries for machine learning, and this recipe serves as an introduction to it. PCA is a form of unsupervised machine learning: we don’t provide any information about the class of each sample. We will discuss supervised techniques in the other recipes of this chapter.
As a reminder, we will compute PCA for 11 human populations from the HapMap project.
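Before turning to the real data, a minimal sketch may help show what an unsupervised PCA looks like in scikit-learn. It does not use the HapMap file: the genotype matrix, the two populations, and their allele frequencies are all synthetic assumptions, invented here only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for genotype data (NOT the HapMap data):
# rows are samples, columns are markers, values count copies of one allele.
rng = np.random.default_rng(42)
n_samples, n_markers = 60, 200

# Two hypothetical populations with somewhat different allele frequencies.
freqs_a = rng.uniform(0.1, 0.9, n_markers)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.2, n_markers), 0.05, 0.95)
geno_a = rng.binomial(2, freqs_a, size=(n_samples // 2, n_markers))
geno_b = rng.binomial(2, freqs_b, size=(n_samples // 2, n_markers))
genotypes = np.vstack([geno_a, geno_b])

# Unsupervised: the model only sees the genotype matrix, never the labels.
pca = PCA(n_components=2)
components = pca.fit_transform(genotypes)

print(components.shape)  # (60, 2)
print(pca.explained_variance_ratio_)
```

Each row of `components` holds the coordinates of one sample on the first two principal components, and `explained_variance_ratio_` reports the fraction of the total variance that each component captures.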
Getting ready
You will need to run the first recipe from Chapter 6 in order to generate the hapmap10_auto_noofs_ld_12 PLINK file (with alleles recorded as 1 and 2). From a population...