How can a beginner get started with NLP?
This book will be of little use if we do not eventually jump into how to use these tools and technologies. The common and advanced uses I described here are only a sample of what is possible. As you become comfortable with NLP, I want you to keep looking for uses that are not yet being met. In text classification alone, you can go very deep: you could attempt to classify even more difficult concepts, such as sarcasm or empathy. But let's not get ahead of ourselves. This is what I want you to do.
Start with a simple idea
Think simply, and only add complexity as needed. Think of something that interests you that you would like to know more about, and then find people who talk about it. If you are interested in photography, find a few Twitter accounts that talk about it. If you are looking to analyze political extremism, find a few Twitter accounts that proudly show their unifying hashtags. If you are interested in peanut allergy research, find a few Twitter accounts of researchers who post their results and articles in their quest to save lives. I mention Twitter over and over because it is a goldmine for investigating how groups of people talk about issues, and people often post links, which can lead to even more scraping. But you could use any social media platform, as long as you can scrape it.
However, start with a very simple idea. What would you like to know about a piece of text (or a lot of tweets)? What would you like to know about a community of people? Brainstorm. Get a notebook and start writing down every question that comes to mind. Prioritize them. Then you will have a list of questions to seek answers to.
For instance, my research question could be, “What are people saying about Black Lives Matter protests?” Or, we could research something less serious and ask, “What are people saying about the latest Marvel movie?” Personally, I prefer to at least attempt to use data science for good, to make the world a bit safer, so I am not very interested in movie reviews, but others are. We all have our preferences. Study what interests you.
For this demonstration, I will use my scraped data science feed. I have a few starter questions:
- Which accounts post the most frequently every week?
- Which accounts are mentioned the most?
- Which are the primary hashtags used by this community of people?
- What follow-up questions can we think of after answering these questions?
We will only use NLP and simple string operations to answer these questions, as I have not yet begun to explain social network analysis. I am also going to assume that you know your way around Python programming and are familiar with the pandas library. I will cover pandas in more detail in a later chapter, but I will not be giving in-depth training. There are a few great books that cover pandas in depth.
Here is what the raw data for my scraped data science feed looks like:
Figure 1.11 – Scraped and enriched data science Twitter feed
To save time, I have set up the regex steps in the scraper to create columns for users, tags, and URLs. All of this is scraped or generated as a step during automated scraping. This will make it much easier and faster to answer the four questions I posed. So, let’s get to it.
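The scraper code itself is not shown here, but as a rough sketch of that enrichment step, a few regular expressions are enough to pull mentions, hashtags, and URLs out of raw tweet text. The column names and patterns below are my own assumptions, not the actual scraper's:

import pandas as pd

# A sketch of the enrichment step: pull @mentions, #hashtags, and URLs
# out of the raw tweet text into their own columns. The patterns and
# column names are assumptions; the real scraper may differ.
def extract_entities(df):
    df['users'] = df['tweet'].str.findall(r'@\w+')
    df['tags'] = df['tweet'].str.findall(r'#\w+')
    df['urls'] = df['tweet'].str.findall(r'https?://\S+')
    return df

# Tiny made-up example tweet to show the idea
sample_df = pd.DataFrame({'tweet': ['Great #datascience thread by @dataidols https://example.com']})
sample_df = extract_entities(sample_df)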
Accounts that post most frequently
The first thing I want to do is see which accounts post the most in total. I will also take a glimpse at which accounts post the least to see whether any of the accounts have dried up since adding them to my scrapers. For this, I will simply take the publisher (the account that posted the tweet) and tweet columns, do a groupby operation on the publisher, and then take the count:
check_df = df[['publisher', 'tweet']]
check_df = check_df.groupby('publisher').count()
check_df.sort_values('tweet', ascending=False, inplace=True)
check_df.columns = ['count']
check_df.head(10)
This will display a DataFrame of publishers by tweet count, showing us the most active publishers:
Figure 1.12 – User tweet counts from the data science Twitter feed
That’s awesome. So, if you want to break into data science and you use Twitter, then you should probably follow these accounts.
However, to me, this is of limited use. I really want to see each account's posting behavior. For this, I will use a pivot table, with publisher as the index and created_week as the columns, and run a count aggregation. Here is what the top ten looks like, sorted by the current week:
check_df = df[['publisher', 'created_week', 'tweet']].copy()
pvt_df = pd.pivot_table(check_df, index='publisher', columns='created_week', aggfunc='count').fillna(0)
pvt_df = pvt_df['tweet']
pvt_df.sort_values(202129, ascending=False, inplace=True)
keep_weeks = pvt_df.columns[-13:-1]  # keep the last twelve weeks, excluding the current week
pvt_df = pvt_df[keep_weeks]
pvt_df.head(10)
This creates the following DataFrame:
Figure 1.13 – Pivot table of user tweet counts by week
This looks much more useful, and it is sensitive to the week. This should also be interesting to see as a visualization, to get a feel for the scale:
_ = pvt_df.plot.bar(figsize=(13, 6), title='Twitter Data Science Accounts – Posts Per Week', legend=False)
We get the following plot:
Figure 1.14 – Bar chart of user tweet counts by week
It’s a bit difficult to see individual weeks when visualized like this. The weekly chart is interesting and cool to look at, but it’s not very useful. With any visualization, you will want to think about how you can most easily tell the story that you want to tell. As I am mostly interested in visualizing which accounts post the most in total, I will use the results from the first aggregation instead:
_ = check_df.plot.bar(figsize=(13, 6), title='Twitter Data Science Accounts – Posts in Total', legend=False)
This code gives us the following graph:
Figure 1.15 – A bar chart of user tweet counts in total
That is much easier to understand.
Accounts mentioned most frequently
Now, I want to see which accounts are mentioned by publishers (the accounts making the tweets) most often. This can show people who collaborate, and it can also show other interesting accounts that are worth scraping. For this, I'm just going to use value_counts and take the top 20 accounts. I want a fast answer:
check_df = df[['users']].copy().dropna()
check_df['users'] = check_df['users'].str.lower()
check_df.value_counts()[0:20]

users
@dataidols         623
@royalmail         475
@air_lab_muk       231
@makcocis          212
@carbon3it         181
@dictsmakerere     171
@lubomilaj         167
@brittanymsalas    164
@makererenews      158
@vij_scene         151
@nm_aist           145
@makerereu         135
@packtpub          135
@miic_ug           131
@arm               127
@pubpub            124
@deliprao          122
@ucberkeley        115
@mitpress          114
@roydanroy         112
dtype: int64
This looks great. I bet there are some interesting data scientists in this bunch of accounts. I should look into that and consider scraping them and adding them to my data science feed.
Top 10 data science hashtags
Next, I want to see which hashtags are used most often. The code is going to be very similar, other than I need to run explode() against the tags field in order to create one row for every element of each tweet's list of hashtags. For this, we can create the DataFrame, drop nulls, explode the tags, lowercase them for uniformity, and then use value_counts() to get what we want:
check_df = df[['tags']].copy().dropna()
check_df = check_df.explode('tags')  # one row per hashtag
check_df['tags'] = check_df['tags'].str.lower()
check_df.value_counts()[0:10]

tags
#datascience           2421
#dsc_article           1597
#machinelearning        929
#ai                     761
#wids2021               646
#python                 448
#dsfthegreatindoors     404
#covid19                395
#dsc_techtarget         340
#datsciafrica           308
dtype: int64
This looks great. I’m going to visualize the top ten results. However, value_counts() was somehow causing the hashtags to get butchered a bit, so I did a groupby operation against the DataFrame instead.
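A minimal sketch of that groupby approach, assuming the exploded and lowercased check_df from above (the exact code behind the figure may differ), looks like this:

# Count hashtags with a groupby instead of value_counts(); assumes
# check_df holds one lowercased hashtag per row in its 'tags' column.
tag_counts = check_df.groupby('tags').size().sort_values(ascending=False)

# Visualize the ten most frequent hashtags
_ = tag_counts.head(10).plot.bar(figsize=(10, 5), title='Top 10 Data Science Hashtags')

The result is the following chart: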
Figure 1.16 – Hashtag counts from the data science Twitter feed
Let’s finish up this section with a few more related ideas.
Additional questions or action items from simple analysis
In total, this analysis would have taken me about 10 minutes if I were not writing a book. The code might look unusual, as Python lets you chain commands together; I prefer to put significant operations on their own lines so that the next person who has to maintain my code will not miss something important tacked on to the end of a line. However, notebooks are pretty personal, and notebook code is not typically written to be perfectly clean. When investigating data or doing rough visualizations, focus on what you are trying to do. You do not need to write perfect code until you are ready to write the production version. That said, do not throw notebook-quality code into production.
Now that we have done the quick analysis, I have some follow-up questions that I should look into answering:
- How many of these accounts are actually related to data science and not already being scraped?
- Do any of these accounts give me ideas for new feeds? For instance, I have feeds for data science, disinformation research, art, natural sciences, news, political news, politicians, and more. Maybe I should have a photography feed, for instance.
- Would it be worth scraping by keyword for any of the top keywords to harvest more interesting content and accounts?
- Have any of the accounts dried up (no new posts)? Which ones? When did they dry up? Why did they dry up? A quick way to check is sketched after this list.
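As a rough sketch of how you might check that last question, you could look at each publisher's most recent created_week (assuming the same columns as before):

# Find the most recent week each publisher posted in; assumes
# 'created_week' is a sortable year-week integer such as 202129.
last_seen = df.groupby('publisher')['created_week'].max().sort_values()

# Publishers whose latest post is older than the newest week in the
# dataset may have gone quiet.
quiet = last_seen[last_seen < df['created_week'].max()]
print(quiet.head(10))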
You try. Do you have any questions you can think of, given this dataset?
Next, let’s try something similar but slightly different, using NLP tools against the book Alice’s Adventures in Wonderland. Specifically, I want to see whether I can take the tf-idf vectors and plot out character appearance by chapter. If you are unfamiliar with it, term frequency-inverse document frequency (TF-IDF) is an appropriate name because that is exactly the math: a term’s score rises with how often it appears in a document and falls with how common it is across all documents. I won’t go into the code, but this is what the results look like:
Figure 1.17 – TF-IDF character visualization of Alice’s Adventures in Wonderland by book chapter
By using a stacked bar chart, I can see which characters appear together in the same chapters, as well as their relative importance based on the frequency with which they were named. This is completely automated, and I think it would allow for some very interesting applications, such as a more interactive way of researching various books. In the next chapter, I will introduce social network analysis; if you were to add that in as well, you could even build the social network of Alice in Wonderland, or any other piece of literature, allowing you to see which characters interact.
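I won’t reproduce the actual code, but as a rough sketch of the idea, scikit-learn’s TfidfVectorizer can score each chapter as its own document. The file name, chapter-splitting logic, and character tokens below are illustrative assumptions, not the original implementation:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed file name and chapter split; adjust to however you load the text
chapters = open('alice.txt', encoding='utf-8').read().split('CHAPTER ')[1:]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(chapters)

# One row per chapter, one column per token, values are tf-idf scores
scores = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())

# Illustrative character tokens; pick whichever names you care about
characters = ['alice', 'rabbit', 'queen', 'hatter', 'turtle']
character_scores = scores[[c for c in characters if c in scores.columns]]

# Stacked bar chart of character importance by chapter
_ = character_scores.plot.bar(figsize=(13, 6), stacked=True, title='Character TF-IDF by Chapter')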
In order to perform a tf-idf vectorization, you need to split sentences apart into tokens. Tokenization is NLP 101 stuff, with a token being a word or piece of punctuation. So, for instance, if we were to tokenize the sentence "Today was a wonderful day.", I would end up with a list of the following tokens:

['Today', 'was', 'a', 'wonderful', 'day', '.']
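One small sketch of how to get exactly that list is NLTK’s word_tokenize, though any tokenizer would do:

from nltk.tokenize import word_tokenize

# Requires the punkt models: import nltk; nltk.download('punkt')
tokens = word_tokenize('Today was a wonderful day.')
print(tokens)  # ['Today', 'was', 'a', 'wonderful', 'day', '.']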
If you have a collection of several sentences, you can then feed it to tf-idf to return the relative importance of each token in a corpus of text. This is often very useful for text classification using simpler models, and it can also be used as input for topic modeling or clustering. However, I have never seen anyone else use it to determine character importance by book chapter, so that’s a creative approach.
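As a quick illustration of that classification use (not the book example itself), a scikit-learn pipeline can pair TF-IDF features with a simple linear model. The tiny labeled corpus here is made up purely for demonstration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy corpus and labels, purely for illustration
texts = ['I love this movie', 'What a wonderful day', 'This was terrible', 'I hated every minute']
labels = ['positive', 'positive', 'negative', 'negative']

# TF-IDF features feeding a simple linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(['a truly wonderful movie']))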
This example only scratches the surface of what we can do with NLP and investigates only a few of the questions we could come up with. As you do your own research, I encourage you to keep a paper notebook handy, so that you can write down questions to investigate whenever they come to mind.