EDA is among the first few tasks we perform when we get started on any ML project. As discussed in the section on CRISP-DM, data understanding is an important step to uncover various insights about the data and better understand the business requirements and context.
In this section, we will take up an actual dataset and perform EDA using pandas as our data manipulation library, coupled with seaborn for visualization. Complete code snippets and details for this analysis are available in the Python Notebook game_of_thrones_eda.ipynb.
We first begin by importing the required libraries and setting up the configurations as shown in the following snippet:
In [1]: import numpy as np
   ...: import pandas as pd
   ...: from collections import Counter
   ...:
   ...: # plotting
   ...: import seaborn as sns
   ...: import matplotlib.pyplot as plt
   ...:
   ...: # setting params
   ...: params = {'legend.fontsize': 'x-large',
   ...:           'figure.figsize': (30, 10),
   ...:           'axes.labelsize': 'x-large',
   ...:           'axes.titlesize': 'x-large',
   ...:           'xtick.labelsize': 'x-large',
   ...:           'ytick.labelsize': 'x-large'}
   ...:
   ...: sns.set_style('whitegrid')
   ...: sns.set_context('talk')
   ...:
   ...: plt.rcParams.update(params)
Once the settings and requirements are in place, we can concentrate on the data. The dataset under consideration for exploratory analysis is the battles.csv file, which contains all major battles from the world of Game of Thrones (up to season 5).
One of the most popular television series of all time, Game of Thrones is a fantasy drama set in the fictional continents of Westeros and Essos, filled with multiple plots and a huge number of characters all battling for the Iron Throne! It is an adaptation of the A Song of Ice and Fire novel series by George R. R. Martin. Being a popular series, it has caught the attention of many, and data scientists aren't to be excluded. This notebook presents EDA on the Kaggle dataset enhanced by Myles O'Neill (more details: https://www.kaggle.com/mylesoneill/game-of-thrones). This dataset is based on a combination of multiple datasets collected and contributed to by multiple people. We utilize the battles.csv in this analysis. The original battles data was presented by Chris Albon; more details are available at https://github.com/chrisalbon/war_of_the_five_kings_dataset.
The following snippet loads the battles.csv file using pandas:
In [2]: battles_df = pd.read_csv('battles.csv')
The dataset is as shown in the following screenshot:
We can view the total number of rows, data types of each of the attributes, and general statistics of numerical attributes using the pandas utilities shape, dtypes, and describe() respectively. We have data about 38 battles, with 25 attributes describing each one of them.
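As a minimal sketch of these three utilities, the following block applies them to a tiny stand-in dataframe. The name toy_df and its rows are hypothetical and made up for illustration; they are not taken from battles.csv. The same calls can be applied to battles_df.

```python
import pandas as pd

# Toy stand-in for battles_df; these rows are made up for illustration
# and are NOT taken from battles.csv.
toy_df = pd.DataFrame({
    'name': ['Battle A', 'Battle B', 'Battle C'],
    'year': [298, 299, 299],
    'attacker_size': [15000.0, None, 6000.0],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# dtypes lists the data type of each attribute
print(toy_df.dtypes)

# describe() summarizes numerical attributes only by default
# (count, mean, std, min, quartiles, max)
summary = toy_df.describe()
print(summary)
```

Note that describe() counts only non-missing values, so the count row also hints at which numerical attributes have gaps.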
Let's understand the distribution of battles across years from the fantasy land. The following snippet plots a bar graph of this distribution:
In [3]: sns.countplot(y='year', data=battles_df)
   ...: plt.title('Battle Distribution over Years')
   ...: plt.show()
The following plot shows that the highest number of battles were fought in the year 299, followed by 300 and 298 respectively:
There are different regions in this fantasy land, with battles taking place at every place imaginable. Yet, it would be interesting to see if there were any preferred regions. The following snippet helps us answer this question precisely:
In [4]: sns.countplot(x='region',data=battles_df)
...: plt.title('Battles by Regions')
...: plt.show()
The following plot helps us identify that The Riverlands have seen the most battles, followed by The North and The Westerlands:
Another interesting thing to notice is that there has been only one battle Beyond the Wall (spoiler alert: stay tuned for later seasons).
We can perform similar analysis using different group-by variations to understand, for instance, the number of major deaths, or captures per region, and so on.
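One such group-by variation might look like the following sketch. It assumes the dataset exposes region, major_death, and major_capture columns, with the latter two stored as 0/1 flags; verify this against your copy of battles.csv. The toy rows are hypothetical and only keep the example self-contained.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from battles.csv.
# Assumption: major_death and major_capture are 0/1 flags per battle.
toy_df = pd.DataFrame({
    'region': ['Region A', 'Region A', 'Region B', 'Region B', 'Region B'],
    'major_death': [1, 0, 1, 1, 0],
    'major_capture': [0, 0, 1, 0, 1],
})

# Summing 0/1 flags per group gives the number of battles
# with a major death or a major capture in each region
per_region = (toy_df
              .groupby('region')[['major_death', 'major_capture']]
              .sum()
              .sort_values('major_death', ascending=False))
print(per_region)
```

Replacing toy_df with battles_df should give the per-region counts for the real data, provided the column names match.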
We move on to see which king attacked the most. We visualize this using a pie chart to understand the percentage share of battles fought by each of the kings involved. Please note that we perform this analysis based on attacking kings. Similar analysis can be performed using defending kings as well. The following snippet prepares a pie chart to display each attacking king's share of battles:
In [5]: attacker_king = battles_df.attacker_king.value_counts()
   ...: attacker_king.name = ''  # turn off annoying y-axis-label
   ...: attacker_king.plot.pie(figsize=(6, 6), autopct='%.2f')
Each attacking king's share of battles is displayed in the following pie chart:
The lands of Westeros and Essos are dangerous, with enemies and threats all across. Let's analyze the data a bit to understand on how many occasions each of the kings was a winner. Since a king can be either defending his land or attacking for power, it would be interesting to see the defending and attacking wins as well. The following snippet helps us prepare a grouped bar chart to analyze each king's attacking and defending wins:
In [6]: # wins while attacking: battles where the attacker won
   ...: attack_winners = battles_df[battles_df.attacker_outcome == 'win']
   ...: # rename_axis/reset_index(name=...) give the columns 'king' and
   ...: # 'wins' irrespective of how value_counts() labels its output
   ...: attack_winners = (attack_winners['attacker_king']
   ...:                   .value_counts()
   ...:                   .rename_axis('king')
   ...:                   .reset_index(name='wins'))
   ...: attack_winners['win_type'] = 'attack'
   ...:
   ...: # wins while defending: battles where the attacker lost
   ...: defend_winners = battles_df[battles_df.attacker_outcome == 'loss']
   ...: defend_winners = (defend_winners['defender_king']
   ...:                   .value_counts()
   ...:                   .rename_axis('king')
   ...:                   .reset_index(name='wins'))
   ...: defend_winners['win_type'] = 'defend'
   ...:
   ...: sns.barplot(x="king",
   ...:             y="wins",
   ...:             hue="win_type",
   ...:             data=pd.concat([attack_winners,
   ...:                             defend_winners]))
   ...: plt.title('Kings and Their Wins')
   ...: plt.ylabel('wins')
   ...: plt.xlabel('king')
   ...: plt.show()
The preceding snippet calculates the number of wins per king while attacking and then calculates the number of wins per king while defending. We then concatenate the two results and plot them as a grouped bar plot, with one bar per win type for each king. The results are shown in the following graph:
The preceding graph clearly shows that the Baratheon boys have the most wins, both while attacking and while defending. Seems like they have luck on their side so far. Robb Stark was the second most successful king, until of course the Red Wedding happened.
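If a single table of attacking, defending, and total wins per king is preferred over two separate result sets, the counts can be aligned side by side. The following is a minimal sketch on hypothetical rows (the kings and outcomes in toy_df are made up, not taken from battles.csv); the names wins_df, attack_wins, and defend_wins are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from battles.csv.
toy_df = pd.DataFrame({
    'attacker_king': ['King A', 'King A', 'King B', 'King B'],
    'defender_king': ['King B', 'King B', 'King A', 'King A'],
    'attacker_outcome': ['win', 'loss', 'win', 'win'],
})

# An attacker wins when attacker_outcome is 'win';
# a defender wins when attacker_outcome is 'loss'
attack_wins = (toy_df.loc[toy_df.attacker_outcome == 'win', 'attacker_king']
               .value_counts())
defend_wins = (toy_df.loc[toy_df.attacker_outcome == 'loss', 'defender_king']
               .value_counts())

# Building a dataframe from the two series aligns them on the king's name;
# a king missing from one series gets NaN there, which we treat as zero wins
wins_df = (pd.DataFrame({'attack': attack_wins, 'defend': defend_wins})
           .fillna(0)
           .astype(int))
wins_df['total'] = wins_df['attack'] + wins_df['defend']
print(wins_df)
```

Calling wins_df[['attack', 'defend']].plot(kind='bar', stacked=True) on such a table would draw the two win types stacked on top of each other for each king.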
The dataset also contains attributes describing the number of houses involved, battle commanders, and army sizes. We can perform similar and more in-depth analysis to better understand the battles. We encourage the readers to try out a few of these as exercises and check the Python Notebook for more pointers.
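As one possible starting point for such an exercise, the sketch below checks how often the attacker wins when outnumbered. It assumes army sizes are stored in columns named attacker_size and defender_size and that some values are missing; check the column names in your copy of battles.csv. The rows in toy_df are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from battles.csv.
toy_df = pd.DataFrame({
    'attacker_size': [1000.0, 5000.0, None, 2000.0],
    'defender_size': [4000.0, 1000.0, 3000.0, 2000.0],
    'attacker_outcome': ['win', 'win', 'loss', 'loss'],
})

# Keep only battles where both army sizes are known
sized_df = toy_df.dropna(subset=['attacker_size', 'defender_size']).copy()

# Flag battles where the attacking army was the smaller one
sized_df['attacker_outnumbered'] = (sized_df.attacker_size
                                    < sized_df.defender_size)
sized_df['attacker_won'] = sized_df.attacker_outcome == 'win'

# The mean of a boolean column is the fraction of True values,
# that is, the attacker's win rate within each group
win_rate = sized_df.groupby('attacker_outnumbered')['attacker_won'].mean()
print(win_rate)
```

With only a few dozen battles in the real dataset, any such win rate should be read as a descriptive summary rather than as evidence of a general pattern.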
Before we close the section, let's try to identify archenemies in the fight for the Iron Throne. Though the fans will already have a gut feeling about this, let's see what the data has to say about it. The following snippet helps us answer this question:
In [7]: temp_df = battles_df.dropna(
   ...:     subset=["attacker_king", "defender_king"]
   ...: )[["attacker_king", "defender_king"]]
   ...:
   ...: # sort each pair so that (A, B) and (B, A) count as the same rivalry
   ...: archenemy_df = pd.DataFrame(
   ...:     list(Counter(
   ...:         [tuple(sorted(king_pair))
   ...:          for king_pair in temp_df.values
   ...:          if len(set(king_pair)) > 1]).items()),
   ...:     columns=['king_pair', 'battle_count'])
   ...:
   ...: archenemy_df['versus_text'] = archenemy_df.apply(
   ...:     lambda row: '{} Vs {}'.format(row['king_pair'][0],
   ...:                                   row['king_pair'][1]),
   ...:     axis=1)
   ...: archenemy_df.sort_values('battle_count',
   ...:                          inplace=True,
   ...:                          ascending=False)
   ...:
   ...: sns.barplot(data=archenemy_df,
   ...:             x='versus_text',
   ...:             y='battle_count')
   ...: plt.xticks(rotation=45)
   ...: plt.xlabel('Archenemies')
   ...: plt.ylabel('Number of Battles')
   ...: plt.title('Archenemies')
   ...: plt.show()
We first prepare a temporary dataframe and remove any battles that do not have either the attacking or defending king's name listed. Once we have a clean dataframe, we iterate over each of the rows and count the number of battles every pair has fought. We ignore cases where the battle was among the king's own army (if len(set(king_pair))>1). We then simply plot the results in a bar graph, shown as follows:
We see that the dataset confirms the gut feelings. Robb Stark and Joffrey Baratheon have fought a total of 19 battles already, with other pairs having fought five or fewer battles.
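The pair-counting step at the heart of this analysis can be sketched in isolation on a few hypothetical rows (the kings in toy_df are made up, not taken from battles.csv). Sorting each pair before counting is one way to make sure that a battle with King A attacking King B and a battle with King B attacking King A are counted as the same rivalry.

```python
from collections import Counter

import pandas as pd

# Hypothetical rows for illustration only; not taken from battles.csv.
toy_df = pd.DataFrame({
    'attacker_king': ['King A', 'King B', 'King A', 'King C', None],
    'defender_king': ['King B', 'King A', 'King A', 'King A', 'King B'],
})

# Drop battles where either king is unknown
pairs_df = toy_df.dropna(subset=['attacker_king', 'defender_king'])

# Count unordered pairs: sorting makes (A, B) and (B, A) the same key;
# battles where both sides share the same king are skipped
pair_counts = Counter(
    tuple(sorted(pair))
    for pair in pairs_df[['attacker_king', 'defender_king']].values
    if pair[0] != pair[1]
)

# most_common() lists the pairs from most to fewest battles
print(pair_counts.most_common())
```

A Counter keyed by sorted tuples keeps the logic in plain Python; a group-by on two normalized columns would be an alternative for larger datasets.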
The analysis and visualizations shared in this section are only a glimpse of what can be done with a dataset. Many more patterns and insights can be extracted from this dataset alone.
EDA is a very powerful mechanism for understanding the dataset in detail before jumping into other stages of ML. In the coming chapters, we will regularly perform EDA to assist us in understanding the business problem along with the dataset before we go into modeling, tuning, evaluation, and deployment stages.