Taking your data to AI
Now that we have taken our data on a journey through a sample ETL pipeline, let's take it through one last step: performing ML on the output of the previous step, namely the tokenized words and their counts.
In this section, we will build a model that identifies the context of a given list of words using word2vec and cosine similarity. We will use the 1,000 most frequently occurring words from the output of the pipeline we created in the previous section to predict the context of the tokenized words.
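Before we get to the notebook, it helps to see what cosine similarity actually computes. Here is a minimal sketch with toy vectors; the vector values are invented purely for illustration and are not taken from our pipeline:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means they point
    # in the same direction, 0.0 means they are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "word vectors" -- made-up numbers for illustration only.
vec_king = np.array([0.9, 0.4, 0.1])
vec_queen = np.array([0.8, 0.5, 0.2])
print(cosine_similarity(vec_king, vec_queen))  # ~0.98, i.e. very similar
```

Two words whose embeddings score close to 1.0 tend to appear in similar contexts, which is exactly the property we will exploit to predict context.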
In this exercise, the data we generated through the pipeline serves as the input to a context prediction application we will build in Python. Don't worry: I have kept the code minimal and simple to understand, so we won't spend hours explaining the steps. Open a new Colab notebook from https://colab.research.google.com/ and enter the code snippets that follow into the notebook cells.
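As an orientation, here is a minimal sketch of the shape the notebook code will take. It assumes the previous step saved its output to a CSV file named word_counts.csv with word and count columns; the file name, column names, and the query word are illustrative placeholders rather than the book's actual code:

```python
# In Colab, install the dependencies first if needed:
# !pip install gensim pandas
import pandas as pd
from gensim.models import Word2Vec

# Assumption: the pipeline wrote its output to word_counts.csv with
# "word" and "count" columns. Adjust to match your actual output.
counts = pd.read_csv("word_counts.csv")

# Keep the 1,000 most frequently occurring words.
top_words = (
    counts.sort_values("count", ascending=False)
    .head(1000)["word"]
    .tolist()
)

# Train a small word2vec model. Feeding the top words in as a single
# "sentence" is only a placeholder; training on the original token
# sequences from the pipeline gives far more meaningful vectors.
model = Word2Vec(sentences=[top_words], vector_size=100, window=5,
                 min_count=1, workers=4)

# most_similar() ranks words by cosine similarity to the query vector.
# "data" is just an illustrative query; use any word in your vocabulary.
query = "data"
if query in model.wv:
    for word, score in model.wv.most_similar(query, topn=5):
        print(f"{word}: {score:.3f}")
```

Note that gensim's most_similar() already applies cosine similarity to the learned vectors, so the two techniques named above work together in a single call.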