Factors
If you recall from Chapter 1, Introducing Machine Learning, features that represent a characteristic with categories of values are known as
nominal. Although it is possible to use a character vector to store nominal data, R provides a data structure known as a factor specifically for this purpose. A factor is a special case of vector that is solely used for representing nominal variables. In the medical dataset we are building, we might use a factor to represent gender
, because it uses two categories: MALE
and FEMALE
.
Why not use character
vectors? An advantage of using factors is that they are generally more efficient than character
vectors because the category labels are stored only once. Rather than storing MALE
, MALE
, FEMALE
, the computer may store 1
, 1
, 2
. This can save memory. Additionally, certain machine learning algorithms use special routines to handle categorical variables. Coding categorical variables as factors ensures that the model will treat this data appropriately...