Clustering with k-means and hierarchical clustering
It is common in bioinformatics to want to classify things into groups without knowing in advance what the groups are or how many there may be. This process is usually known as clustering and is a type of unsupervised machine learning (ML). Clustering is widely used in genomics experiments, particularly RNA-seq and related count-based technologies. In this recipe, we’ll start with a large gene expression dataset of around 150 samples. We’ll learn how to estimate how many groups of samples there are and then cluster the samples by reducing dimensionality with PCA and applying k-means clustering.
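The overall approach (PCA, then an estimate of the number of groups, then k-means) can be sketched with base R alone. This is a minimal illustration, not the recipe's own code: the data is a synthetic matrix of 150 samples in three artificial groups, and the names expr, true_group, scores, wss, and km are hypothetical.

```r
set.seed(42)  # make the synthetic data and the k-means starts reproducible

# Synthetic stand-in for an expression matrix (an assumption, not real data):
# 150 samples x 50 features, built from three groups with shifted means.
n_per_group <- 50
n_features <- 50
shifts <- c(0, 4, 8)
expr <- do.call(rbind, lapply(shifts, function(s) {
  matrix(rnorm(n_per_group * n_features, mean = s), nrow = n_per_group)
}))
true_group <- rep(seq_along(shifts), each = n_per_group)

# Reduce dimensionality with PCA; rows are samples, columns are features.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]  # keep the first two principal components

# Estimate the number of groups: total within-cluster sum of squares
# for k = 1..6. The "elbow" in this curve suggests a sensible k.
wss <- vapply(1:6, function(k) {
  kmeans(scores, centers = k, nstart = 10)$tot.withinss
}, numeric(1))

# Cluster the samples on the PCA scores with the chosen k.
km <- kmeans(scores, centers = 3, nstart = 10)

# Cross-tabulate the simulated groups against the assigned clusters.
table(true_group, km$cluster)
```

Plotting wss against k (for example, with plot(1:6, wss, type = "b")) gives the usual elbow plot. Cluster labels from kmeans() are arbitrary, so compare groupings through the cross-tabulation rather than by matching label numbers.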
Getting ready
We’ll need the factoextra, RColorBrewer, and Bioconductor Biobase libraries. We’ll also use the modencodefly_eset object from the rbioinfcookbook package.
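Before starting, it can help to confirm that the packages are available. The following is a small sketch of one way to check, using only base R; the variable names needed, installed, and missing_pkgs are hypothetical.

```r
# Packages named in this recipe.
needed <- c("factoextra", "RColorBrewer", "Biobase", "rbioinfcookbook")

# requireNamespace() reports TRUE/FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs installing.
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```

factoextra and RColorBrewer are distributed on CRAN and can be installed with install.packages(), while Biobase comes from Bioconductor and is typically installed with BiocManager::install("Biobase"). For rbioinfcookbook, follow the installation instructions that accompany the package.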
How to do it…
We can cluster the samples with the following code:
- Load the data and run a PCA:
library(factoextra)
library(Biobase)
library(rbioinfcookbook...