Constructing a gender identifier
Identifying gender from a first name is an interesting problem, and it is far from an exact science. It is easy to think of names that are used for both males and females:
- Dana
- Angel
- Lindsey
- Morgan
- Jessie
- Chris
- Payton
- Tracy
- Stacy
- Jordan
- Robin
- Sydney
In addition, in a heterogeneous society such as the United States, many names come from other languages and cultures and do not follow English naming patterns. Still, we can make an educated guess for a wide range of names.

In this simple example, we will use a heuristic to construct a feature vector and then use it to train a classifier. The heuristic is the last N letters of a given name. For example, if a name ends with ia, it is most likely a female name, such as Amelia or Genelia. On the other hand, if a name ends with rk, it is likely a male name, such as Mark or Clark. Since we are not sure of the exact number of letters to use...
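The last-N-letters heuristic described above can be sketched in a few lines of Python. This is only a minimal illustration, not the classifier built in this recipe: the function names (extract_features, train_suffix_counts, guess_label), the "suffix" feature key, and the simple count-based lookup that stands in for a trained classifier are all assumptions made for this sketch. The toy training data consists solely of the four example names mentioned in the text.

```python
from collections import Counter, defaultdict


def extract_features(name, n=2):
    """Return a feature dict holding the last n letters of the name, lowercased.

    Assumes n >= 1. The key "suffix" is a hypothetical name chosen for this sketch.
    """
    return {"suffix": name.lower()[-n:]}


def train_suffix_counts(labeled_names, n=2):
    """Count how often each suffix occurs with each label.

    This lookup table is a stand-in for a real classifier.
    """
    counts = defaultdict(Counter)
    for name, label in labeled_names:
        suffix = extract_features(name, n)["suffix"]
        counts[suffix][label] += 1
    return counts


def guess_label(counts, name, n=2):
    """Return the most frequent label seen for the name's suffix, or None if unseen."""
    suffix = extract_features(name, n)["suffix"]
    if suffix not in counts:
        return None
    return counts[suffix].most_common(1)[0][0]


# Toy training data: only the four example names from the text.
toy_names = [
    ("Amelia", "female"),
    ("Genelia", "female"),
    ("Mark", "male"),
    ("Clark", "male"),
]

counts = train_suffix_counts(toy_names, n=2)

print(extract_features("Amelia", n=2))  # feature dict for the suffix "ia"
print(guess_label(counts, "Sofia"))     # shares the "ia" suffix with the female examples
print(guess_label(counts, "Morgan"))    # suffix "an" was never seen in the toy data
```

Keeping N as a parameter makes it easy to rebuild the features with a different suffix length and compare the results. A suffix that never appeared in training yields None here, which shows how little such a heuristic can say about names outside the patterns it has seen.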