Feature Engineering for Natural Language Data
In the previous chapter, we explored how to extract features from numerical data and images, along with a few algorithms used for that purpose. In this chapter, we’ll continue with algorithms that extract features from natural language data.
Natural language is a special kind of data source in software engineering. With the introduction of GitHub Copilot and ChatGPT, it became evident that machine learning and artificial intelligence tools for software engineering tasks are no longer science fiction. In this chapter, we’ll therefore explore the first step that makes these technologies so powerful: feature extraction from natural language data.
In this chapter, we’ll cover the following topics:
- Tokenizers and their role in feature extraction
- Bag-of-words as a simple technique for processing natural language data
- Word embeddings as more advanced methods that can capture contexts...