How can a beginner get started with NLP?
This book will be of little use if we do not eventually jump into how to use these tools and technologies. The common and advanced uses I described here are only a sample of what is possible. As you become comfortable with NLP, I want you to keep looking for uses that are not yet being met. In text classification alone, you can go very deep: you could attempt to classify even more difficult concepts, such as sarcasm or empathy. But let's not get ahead of ourselves. This is what I want you to do.
Start with a simple idea
Think simply, and only add complexity as needed. Think of something that interests you that you would like to know more about, and then find people who talk about it. If you are interested in photography, find a few Twitter accounts that talk about it. If you are looking to analyze political extremism, find a few Twitter accounts that proudly show their unifying hashtags. If you are interested in peanut allergy research, find a few Twitter accounts of researchers who post their results and articles in their quest to save lives. I mention Twitter over and over because it is a goldmine for investigating how groups of people talk about issues, and people often post links, which can lead to even more scraping. But you could use any social media platform, as long as you can scrape it.
However, start with a very simple idea. What would you like to know about a piece of text (or a lot of tweets)? What would you like to know about a community of people? Brainstorm. Get a notebook and start writing down every question that comes to mind. Prioritize them. Then you will have a list of questions to seek answers to.
For instance, my research question could be, “What are people saying about Black Lives Matter protests?” Or, we could research something less serious and ask, “What are people saying about the latest Marvel movie?” Personally, I prefer to at least attempt to use data science for good, to make the world a bit safer, so I am not very interested in movie reviews, but others are. We all have our preferences. Study what interests you.
For this demonstration, I will use my scraped data science feed. I have a few starter questions:
- Which accounts post the most frequently every week?
- Which accounts are mentioned the most?
- Which are the primary hashtags used by this community of people?
- What follow-up questions can we think of after answering these questions?
We will only use NLP and simple string operations to answer these questions, as I have not yet begun to explain social network analysis. I am also going to assume that you know your way around Python programming and are familiar with the pandas library. I will cover pandas in more detail in a later chapter, but I will not be giving in-depth training. There are a few great books that cover pandas in depth.
Here is what the raw data for my scraped data science feed looks like:
Figure 1.11 – Scraped and enriched data science Twitter feed
To save time, I have set up the regex steps in the scraper to create columns for users, tags, and URLs. All of this is scraped or generated as a step during automated scraping. This will make it much easier and faster to answer the four questions I posed. So, let’s get to it.
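The scraper code itself is not shown here, but as a rough sketch of that enrichment step, a few regular expressions are enough to pull mentions, hashtags, and URLs out of raw tweet text. The column names and patterns below are my own assumptions, not the actual scraper's:

import pandas as pd

# A sketch of the enrichment step: pull @mentions, #hashtags, and URLs
# out of the raw tweet text into their own columns. The patterns and
# column names are assumptions; the real scraper may differ.
def extract_entities(df):
    df['users'] = df['tweet'].str.findall(r'@\w+')
    df['tags'] = df['tweet'].str.findall(r'#\w+')
    df['urls'] = df['tweet'].str.findall(r'https?://\S+')
    return df

# Tiny made-up example tweet to show the idea
sample_df = pd.DataFrame({'tweet': ['Great #datascience thread by @dataidols https://example.com']})
sample_df = extract_entities(sample_df)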
Accounts that post most frequently
The first thing I want to do is see which accounts post the most in total. I will also take a glimpse at which accounts post the least to see whether any of the accounts have dried up since adding them to my scrapers. For this, I will simply take the publisher (the account that posted the tweet) and tweet columns, do a groupby operation on the publisher, and then take the count:
check_df = df[['publisher', 'tweet']]
check_df = check_df.groupby('publisher').count()
check_df.sort_values('tweet', ascending=False, inplace=True)
check_df.columns = ['count']
check_df.head(10)
This will display a DataFrame of publishers by tweet count, showing us the most active publishers:
Figure 1.12 – User tweet counts from the data science Twitter feed
That’s awesome. So, if you want to break into data science and you use Twitter, then you should probably follow these accounts.
However, to me, this is of limited use. I really want to see each account's posting behavior. For this, I will use a pivot table, with publisher as the index and created_week as the columns, and run a count aggregation. Here is what the top ten looks like, sorted by the current week:
check_df = df[['publisher', 'created_week', 'tweet']].copy()
pvt_df = pd.pivot_table(check_df, index='publisher', columns='created_week', aggfunc='count').fillna(0)
pvt_df = pvt_df['tweet']
pvt_df.sort_values(202129, ascending=False, inplace=True)
keep_weeks = pvt_df.columns[-13:-1]  # keep the last twelve weeks, excluding the current week
pvt_df = pvt_df[keep_weeks]
pvt_df.head(10)
This creates the following DataFrame:
Figure 1.13 – Pivot table of user tweet counts by week
This looks much more useful, and it is sensitive to the week. This should also be interesting to see as a visualization, to get a feel for the scale:
_ = pvt_df.plot.bar(figsize=(13, 6), title='Twitter Data Science Accounts – Posts Per Week', legend=False)
We get the following plot:
Figure 1.14 – Bar chart of user tweet counts by week
It’s a bit difficult to see individual weeks when visualized like this. The weekly chart is interesting and cool to look at, but it’s not very useful. With any visualization, you will want to think about how you can most easily tell the story that you want to tell. As I am mostly interested in visualizing which accounts post the most in total, I will use the results from the first aggregation instead:
_ = check_df.plot.bar(figsize=(13, 6), title='Twitter Data Science Accounts – Posts in Total', legend=False)
This code gives us the following graph:
Figure 1.15 – A bar chart of user tweet counts in total
That is much easier to understand.
Accounts mentioned most frequently
Now, I want to see which accounts are mentioned by publishers (the accounts making the tweets) most often. This can show people who collaborate, and it can also show other interesting accounts that are worth scraping. For this, I'm just going to use value_counts and take the top 20 accounts. I want a fast answer:
check_df = df[['users']].copy().dropna()
check_df['users'] = check_df['users'].str.lower()
check_df.value_counts()[0:20]

users
@dataidols         623
@royalmail         475
@air_lab_muk       231
@makcocis          212
@carbon3it         181
@dictsmakerere     171
@lubomilaj         167
@brittanymsalas    164
@makererenews      158
@vij_scene         151
@nm_aist           145
@makerereu         135
@packtpub          135
@miic_ug           131
@arm               127
@pubpub            124
@deliprao          122
@ucberkeley        115
@mitpress          114
@roydanroy         112
dtype: int64
This looks great. I bet there are some interesting data scientists in this bunch of accounts. I should look into that and consider scraping them and adding them to my data science feed.
Top 10 data science hashtags
Next, I want to see which hashtags are used most often. The code is going to be very similar, other than I need to run explode() against the tags field in order to create one row for every element of each tweet's list of hashtags. For this, we can create the DataFrame, drop nulls, explode the tags, lowercase them for uniformity, and then use value_counts() to get what we want:
check_df = df[['tags']].copy().dropna()
check_df = check_df.explode('tags')  # one row per hashtag
check_df['tags'] = check_df['tags'].str.lower()
check_df.value_counts()[0:10]

tags
#datascience           2421
#dsc_article           1597
#machinelearning        929
#ai                     761
#wids2021               646
#python                 448
#dsfthegreatindoors     404
#covid19                395
#dsc_techtarget         340
#datsciafrica           308
dtype: int64
This looks great. I’m going to visualize the top ten results. However, value_counts() was somehow causing the hashtags to get butchered a bit, so I did a groupby operation against the DataFrame instead.
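A minimal sketch of that groupby approach, assuming the exploded and lowercased check_df from above (the exact code behind the figure may differ), looks like this:

# Count hashtags with a groupby instead of value_counts(); assumes
# check_df holds one lowercased hashtag per row in its 'tags' column.
tag_counts = check_df.groupby('tags').size().sort_values(ascending=False)

# Visualize the ten most frequent hashtags
_ = tag_counts.head(10).plot.bar(figsize=(10, 5), title='Top 10 Data Science Hashtags')

The result is the following chart: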
Figure 1.16 – Hashtag counts from the data science Twitter feed
Let’s finish up this section with a few more related ideas.
Additional questions or action items from simple analysis
In total, this analysis would have taken me about 10 minutes if I were not writing a book. The code might look unusual, as Python lets you chain commands together; I prefer to put significant operations on their own lines so that the next person who has to maintain my code will not miss something important tacked on to the end of a line. However, notebooks are pretty personal, and notebook code is not typically written to be perfectly clean. When investigating data or doing rough visualizations, focus on what you are trying to do. You do not need to write perfect code until you are ready to write the production version. That said, do not throw notebook-quality code into production.
Now that we have done the quick analysis, I have some follow-up questions that I should look into answering:
- How many of these accounts are actually related to data science and not already being scraped?
- Do any of these accounts give me ideas for new feeds? For instance, I have feeds for data science, disinformation research, art, natural sciences, news, political news, politicians, and more. Maybe I should have a photography feed, for instance.
- Would it be worth scraping by keyword for any of the top keywords to harvest more interesting content and accounts?
- Have any of the accounts dried up (no new posts)? Which ones? When did they dry up? Why did they dry up? A quick way to check is sketched after this list.
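As a rough sketch of how you might check that last question, you could look at each publisher's most recent created_week (assuming the same columns as before):

# Find the most recent week each publisher posted in; assumes
# 'created_week' is a sortable year-week integer such as 202129.
last_seen = df.groupby('publisher')['created_week'].max().sort_values()

# Publishers whose latest post is older than the newest week in the
# dataset may have gone quiet.
quiet = last_seen[last_seen < df['created_week'].max()]
print(quiet.head(10))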
You try. Do you have any questions you can think of, given this dataset?
Next, let’s try something similar but slightly different, using NLP tools against the book Alice’s Adventures in Wonderland. Specifically, I want to see whether I can take the tf-idf vectors and plot out character appearance by chapter. If you are unfamiliar with it, term frequency-inverse document frequency (TF-IDF) is an appropriate name because that is exactly the math: a term’s score rises with how often it appears in a document and falls with how common it is across all documents. I won’t go into the code, but this is what the results look like:
Figure 1.17 – TF-IDF character visualization of Alice’s Adventures in Wonderland by book chapter
By using a stacked bar chart, I can see which characters appear together in the same chapters, as well as their relative importance based on the frequency with which they were named. This is completely automated, and I think it would allow for some very interesting applications, such as a more interactive way of researching various books. In the next chapter, I will introduce social network analysis; if you were to add that in as well, you could even build the social network of Alice in Wonderland, or any other piece of literature, allowing you to see which characters interact.
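I won’t reproduce the actual code, but as a rough sketch of the idea, scikit-learn’s TfidfVectorizer can score each chapter as its own document. The file name, chapter-splitting logic, and character tokens below are illustrative assumptions, not the original implementation:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed file name and chapter split; adjust to however you load the text
chapters = open('alice.txt', encoding='utf-8').read().split('CHAPTER ')[1:]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(chapters)

# One row per chapter, one column per token, values are tf-idf scores
scores = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())

# Illustrative character tokens; pick whichever names you care about
characters = ['alice', 'rabbit', 'queen', 'hatter', 'turtle']
character_scores = scores[[c for c in characters if c in scores.columns]]

# Stacked bar chart of character importance by chapter
_ = character_scores.plot.bar(figsize=(13, 6), stacked=True, title='Character TF-IDF by Chapter')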
In order to perform a tf-idf vectorization, you need to split sentences apart into tokens. Tokenization is NLP 101 stuff, with a token being a word or piece of punctuation. So, for instance, if we were to tokenize the sentence "Today was a wonderful day.", I would end up with a list of the following tokens:

['Today', 'was', 'a', 'wonderful', 'day', '.']
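One small sketch of how to get exactly that list is NLTK’s word_tokenize, though any tokenizer would do:

from nltk.tokenize import word_tokenize

# Requires the punkt models: import nltk; nltk.download('punkt')
tokens = word_tokenize('Today was a wonderful day.')
print(tokens)  # ['Today', 'was', 'a', 'wonderful', 'day', '.']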
If you have a collection of several sentences, you can then feed it to tf-idf to return the relative importance of each token in a corpus of text. This is often very useful for text classification using simpler models, and it can also be used as input for topic modeling or clustering. However, I have never seen anyone else use it to determine character importance by book chapter, so that’s a creative approach.
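As a quick illustration of that classification use (not the book example itself), a scikit-learn pipeline can pair TF-IDF features with a simple linear model. The tiny labeled corpus here is made up purely for demonstration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy corpus and labels, purely for illustration
texts = ['I love this movie', 'What a wonderful day', 'This was terrible', 'I hated every minute']
labels = ['positive', 'positive', 'negative', 'negative']

# TF-IDF features feeding a simple linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(['a truly wonderful movie']))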
This example only scratches the surface of what we can do with NLP and investigates only a few of the questions we could come up with. As you do your own research, I encourage you to keep a paper notebook handy, so that you can write down questions to investigate whenever they come to mind.