Cross-sectional data, or a cross-section of a population, is obtained by taking observations from multiple individuals at the same point in time. Cross-sectional data can also consist of observations taken at different points in time; in such cases, however, time itself does not play any significant role in the analysis. SAT scores of high school students in a particular year are an example of cross-sectional data. The gross domestic product of countries in a given year is another example, as is data for customer churn analysis. Note that, in the case of SAT scores of students and GDP of countries, all the observations have been taken in a single year, and this makes the two datasets cross-sectional. In essence, the cross-sectional data represents a snapshot at a given instant of time in both cases. Customer data for churn analysis, however, can be collected over a span of time such as months or years. But for the purpose of analysis, time might not play an important role; therefore, although customer churn data might be sourced from multiple points in time, it may still be considered a cross-sectional dataset.
Often, analysis of cross-sectional data starts with a plot of the variables to visualize their statistical properties such as central tendency, dispersion, skewness, and kurtosis. The following figure illustrates this with the univariate example of military expenditure as a percentage of Gross Domestic Product of 85 countries in the year 2010. By taking the data from a single year we ensure its cross-sectional nature. The figure combines a normalized histogram and a kernel density plot in order to highlight different statistical properties of the military expense data.
As evident from the plot, military expenditure is right skewed (the long tail extends toward higher values), with a major peak at around 1.0%. A couple of minor peaks can also be observed near 6.0% and 8.0%.
Figure 1.1: Example of univariate cross-sectional data
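The statistical properties read off the plot can also be computed numerically. The following is a minimal sketch on a handful of made-up values (they are not taken from WDIData.csv); it assumes only pandas and NumPy. A positive sample skewness indicates a longer tail toward higher values, and a normalized histogram is one whose bar areas sum to one.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for a cross-section (not real WDI data).
values = pd.Series([0.5, 0.8, 1.0, 1.0, 1.1, 1.2, 1.5, 2.0, 3.5, 6.0, 8.0])

# Central tendency and dispersion.
mean, median, std = values.mean(), values.median(), values.std()

# Sample skewness: positive means a longer right tail.
skewness = values.skew()

# Normalized histogram: with density=True the bar areas sum to one.
density, edges = np.histogram(values, bins=5, density=True)
area = float(np.sum(density * np.diff(edges)))

print('mean:', mean, 'median:', median, 'std:', std)
print('skewness:', skewness)
print('total area under histogram:', area)
```

For a right-skewed sample such as this one, the mean lies above the median because the few large values pull it upward.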
Exploratory data analysis such as the one in the preceding figure can be done for multiple variables as well in order to understand their joint distribution. Let us illustrate a bivariate analysis by considering the total debt of the countries' central governments along with their military expenditure in 2010. The following figure shows the joint distribution of these variables as a kernel density plot. The bivariate joint distribution shows no clear correlation between the two, except perhaps for lower values of military expenditure and debt of central government.
Figure 1.2: Example of bivariate cross-sectional data
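A visual impression of correlation can be complemented by a correlation coefficient. The sketch below uses a small made-up DataFrame with hypothetical country codes and values, so its numbers say nothing about the real data; it only shows the pandas calls involved. The rank-based variant is computed as the Pearson correlation of the ranks, which is the definition of Spearman's coefficient.

```python
import pandas as pd

# Hypothetical figures for five made-up country codes (not real WDI values).
toy = pd.DataFrame(
    {'central_govt_debt': [30.0, 80.0, 6.0, 55.0, 20.0],
     'military_exp': [2.0, 0.8, 2.8, 1.1, 1.8]},
    index=['AAA', 'BBB', 'CCC', 'DDD', 'EEE'])

# Pearson correlation measures linear association and lies in [-1, 1].
pearson = toy['central_govt_debt'].corr(toy['military_exp'])

# Full correlation matrix of all numeric columns.
corr_matrix = toy.corr()

# Rank-based (Spearman) correlation: Pearson correlation of the ranks.
rank_corr = toy.rank().corr().loc['central_govt_debt', 'military_exp']

print('Pearson:', pearson)
print('Rank-based:', rank_corr)
```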
It is noteworthy that analysis of cross-sectional data extends beyond exploratory data analysis and visualization as shown in the preceding example. Advanced methods such as cross-sectional regression fit a linear regression model between several explanatory variables and a dependent variable. For example, in the case of customer churn analysis, the objective could be to fit a logistic regression model between customer attributes and customer behavior described as churned or not churned. The logistic regression model is a special case of the generalized linear model for binary outcomes. It explains the factors that make customers churn and can predict the outcome for a new customer. Since time is not a crucial element in this type of cross-sectional data, predictions can be obtained for a new customer at a future point in time. In this book, we discuss techniques for modeling time series data in which time and the sequential nature of observations are crucial factors for analysis.
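To make the churn example concrete, here is a minimal sketch of a logistic regression fitted by gradient descent using only NumPy. The two customer attributes, their values, the learning rate, and the number of iterations are all assumptions made for illustration; in practice a library implementation would normally be used instead.

```python
import numpy as np

# Hypothetical, already-scaled customer attributes:
# column 0 = monthly charges, column 1 = tenure. All values are made up.
X = np.array([[0.90, 0.10], [0.80, 0.20], [0.70, 0.30], [0.85, 0.15],
              [0.20, 0.90], [0.30, 0.80], [0.10, 0.70], [0.25, 0.85]])
# Labels: 1 = churned, 0 = not churned.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])


def sigmoid(z):
    """Map real values to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def add_intercept(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit the weights by batch gradient descent on the mean log-loss."""
    Xb = add_intercept(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb.dot(w))
        grad = Xb.T.dot(p - y) / len(y)  # gradient of the mean log-loss
        w -= lr * grad
    return w


def predict_proba(w, X):
    """Predicted probability of churn for each row of X."""
    return sigmoid(add_intercept(X).dot(w))


w = fit_logistic(X, y)
train_pred = (predict_proba(w, X) >= 0.5).astype(int)

# Prediction for a new (hypothetical) customer with high charges, short tenure.
new_customer = np.array([[0.75, 0.20]])
churn_probability = predict_proba(w, new_customer)[0]
print('Weights (intercept first):', w)
print('Churn probability for new customer:', churn_probability)
```

The signs of the fitted weights indicate which attributes push a customer toward churning, which is the explanatory use of the model described above.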
The dataset for the example on military expenditure and national debt of countries has been downloaded from the Open Data Catalog of the World Bank. You can find the data in the WDIData.csv file under the datasets folder of this book's GitHub repository.
All examples in this book are accompanied by an implementation in Python. So let us now discuss the Python program written to generate the preceding figures. Before we are able to plot the figures, we must read the dataset into Python and familiarize ourselves with the basic structure of the data in terms of the columns and rows found in the dataset. The datasets used for the examples and figures in this book are in Excel or CSV format. We will use the pandas package to read and manipulate the data. For visualization, matplotlib and seaborn are used. Let us start by importing all the packages to run this example:
from __future__ import print_function
import os
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
The print_function has been imported from the __future__ module to enable using print as a function for readers who might be using a 2.x version of Python. In Python 3.x, print is a function by default. As this code is written and executed from an IPython notebook, %matplotlib inline makes matplotlib render figures inline, directly in the HTML environment of the notebook. The os package is used to set the working directory as follows:
os.chdir(r'D:\Practical Time Series')
Now, we read the data from the CSV file and display basic information about it:
data = pd.read_csv('datasets/WDIData.csv')
print('Column names:', data.columns)
This gives us the following output showing the column names of the dataset:
Column names: Index([u'Country Name', u'Country Code', u'Indicator Name',
u'Indicator Code', u'1960', u'1961', u'1962', u'1963', u'1964', u'1965',
u'1966', u'1967', u'1968', u'1969', u'1970', u'1971', u'1972', u'1973',
u'1974', u'1975', u'1976', u'1977', u'1978', u'1979', u'1980', u'1981',
u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989',
u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997',
u'1998', u'1999', u'2000', u'2001', u'2002', u'2003', u'2004', u'2005',
u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012', u'2013',
u'2014', u'2015', u'2016'],
dtype='object')
Let us also get a sense of the size of the data in terms of number of rows and columns by running the following line:
print('No. of rows, columns:', data.shape)
This returns the following output:
No. of rows, columns: (397056, 62)
This dataset has nearly 400k rows because it captures 1504 world development indicators for 264 different countries. This information about the unique number of indicators and countries can be obtained by running the following four lines:
nb_countries = data['Country Code'].unique().shape[0]
print('Unique number of countries:', nb_countries)
nb_indicators = data['Indicator Name'].unique().shape[0]
print('Unique number of indicators:', nb_indicators)
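The row count reported earlier is consistent with one row per country and indicator pair, since 1504 x 264 = 397056. The same check can be sketched on a toy DataFrame; the country codes, indicator names, and values below are hypothetical, and nunique() is used as a shorter equivalent of unique().shape[0].

```python
import pandas as pd

# Toy frame with one row per (country, indicator) pair; all names are made up.
toy = pd.DataFrame({
    'Country Code': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC'],
    'Indicator Name': ['Indicator X', 'Indicator Y'] * 3,
    '2010': [1.0, 2.0, 3.0, None, 5.0, 6.0],
})

nb_countries = toy['Country Code'].nunique()
nb_indicators = toy['Indicator Name'].nunique()

# With one row per pair, the row count equals countries times indicators.
rows_match = toy.shape[0] == nb_countries * nb_indicators
print('Countries:', nb_countries, 'Indicators:', nb_indicators)
print('Rows match product:', rows_match)
```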
As is apparent from the structure of the data, every row holds the observations of one indicator, identified by the columns Indicator Name and Indicator Code, for one country, identified by the columns Country Name and Country Code. The columns 1960 through 2016 hold the value of the indicator in each of those years. With this understanding of how the data is laid out in the DataFrame, we are now set to extract the rows and columns that are relevant for our visualization.
Let us start by preparing two other DataFrames that get the rows corresponding to the indicators Total Central Government Debt (as % of GDP) and Military expenditure (% of GDP) for all the countries. This is done by slicing the original DataFrame as follows:
central_govt_debt = data.loc[data['Indicator Name']=='Central government debt, total (% of GDP)']
military_exp = data.loc[data['Indicator Name']=='Military expenditure (% of GDP)']
The preceding two lines create two new DataFrames, namely central_govt_debt and military_exp. A quick check about the shapes of these DataFrames can be done by running the following two lines:
print('Shape of central_govt_debt:', central_govt_debt.shape)
print('Shape of military_exp:', military_exp.shape)
These lines return the following output:
Shape of central_govt_debt: (264, 62)
Shape of military_exp: (264, 62)
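The slicing step relies on boolean indexing: comparing a column with a value yields a boolean Series, and passing that Series to the label-based indexer .loc keeps only the rows where it is True. A small sketch with hypothetical indicator names and values:

```python
import pandas as pd

# Toy frame in the same layout; names and values are made up.
toy = pd.DataFrame({
    'Country Code': ['AAA', 'AAA', 'BBB', 'BBB'],
    'Indicator Name': ['Indicator X', 'Indicator Y',
                       'Indicator X', 'Indicator Y'],
    '2010': [1.5, 40.0, 2.5, 60.0],
})

# Comparing a column with a value gives a boolean Series, one entry per row.
mask = toy['Indicator Name'] == 'Indicator X'

# .loc keeps the rows where the mask is True; all columns are retained.
indicator_x = toy.loc[mask]
print(indicator_x.shape)
```

Because every country appears once per indicator, the slice has one row per country, which is why both DataFrames above have 264 rows.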
These DataFrames have all the information we need. In order to plot the univariate and bivariate cross-sectional data in the preceding figures, we need the column 2010. Before we actually run the code for plotting, let us quickly check whether the column 2010 has missing values. This is done by the following two lines:
central_govt_debt['2010'].describe()
military_exp['2010'].describe()
Which generate the following outputs respectively:
count 93.000000
mean 52.894412
std 30.866372
min 0.519274
25% NaN
50% NaN
75% NaN
max 168.474953
Name: 2010, dtype: float64
count 194.000000
mean 1.958123
std 1.370594
min 0.000000
25% NaN
50% NaN
75% NaN
max 8.588373
Name: 2010, dtype: float64
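This tells us that the describe function could not compute the 25th, 50th, and 75th percentiles for either column. In addition, the counts (93 and 194) are smaller than the 264 rows in each DataFrame. Hence there are missing values that must be excluded.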
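Whether describe reports NaN percentiles can depend on the pandas version, so it may be safer to count missing values directly. A minimal sketch on a made-up Series standing in for the 2010 column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a '2010' column with two missing entries.
column_2010 = pd.Series([1.2, np.nan, 3.4, np.nan, 0.7], name='2010')

n_missing = int(column_2010.isnull().sum())  # number of NaN entries
n_present = int(column_2010.count())         # count() skips missing values
print('Missing:', n_missing, 'Present:', n_present)
```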
Additionally, we would like the Country Code column to be the row indices. So the following couple of lines are executed:
central_govt_debt.index = central_govt_debt['Country Code']
military_exp.index = military_exp['Country Code']
Next, we create two pandas.Series by taking the non-empty values of the 2010 columns from central_govt_debt and military_exp. The newly created Series objects are then merged to form a single DataFrame:
central_govt_debt_2010 = central_govt_debt['2010'].loc[~pd.isnull(central_govt_debt['2010'])]
military_exp_2010 = military_exp['2010'].loc[~pd.isnull(military_exp['2010'])]
data_to_plot = pd.concat((central_govt_debt_2010, military_exp_2010), axis=1)
data_to_plot.columns = ['central_govt_debt', 'military_exp']
data_to_plot.head(10)
The preceding lines return the following table that shows that not all countries have information on both Central Government Debt and Military Expense for the year 2010:
| Country Code | central_govt_debt | military_exp |
| --- | --- | --- |
| AFG | NaN | 1.897473 |
| AGO | NaN | 4.244884 |
| ALB | NaN | 1.558592 |
| ARB | NaN | 5.122879 |
| ARE | NaN | 6.119468 |
| ARG | NaN | 0.814878 |
| ARM | NaN | 4.265646 |
| ATG | 75.289093 | NaN |
| AUS | 29.356946 | 1.951809 |
| AUT | 79.408304 | 0.824770 |
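The NaN entries arise because pd.concat with axis=1 aligns the Series on their index labels and fills in NaN wherever a label is present in one Series but not the other. A small sketch with hypothetical country codes and values:

```python
import pandas as pd

# Two made-up Series that share only the label 'BBB'.
debt = pd.Series({'AAA': 30.0, 'BBB': 75.0}, name='central_govt_debt')
military = pd.Series({'BBB': 1.9, 'CCC': 4.2}, name='military_exp')

# Column-wise concatenation aligns on the union of the index labels.
combined = pd.concat((debt, military), axis=1)
print(combined)
```

Only the label present in both Series ends up with values in both columns.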
To plot, we have to take only those countries that have both central government debt and military expense. Run the following line, to filter out rows with missing values:
data_to_plot = data_to_plot.loc[(~pd.isnull(data_to_plot.central_govt_debt)) & (~pd.isnull(data_to_plot.military_exp)), :]
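An equivalent and arguably more readable way to express this kind of filter is DataFrame.dropna, which drops rows that have a missing value in any of the listed columns. The sketch below compares both approaches on a toy DataFrame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame: only 'CCC' has values in both columns. All values are made up.
toy = pd.DataFrame(
    {'central_govt_debt': [np.nan, 75.3, 29.4],
     'military_exp': [1.9, np.nan, 2.0]},
    index=['AAA', 'BBB', 'CCC'])

# Explicit boolean mask, as in the line above.
mask = toy['central_govt_debt'].notnull() & toy['military_exp'].notnull()
by_mask = toy.loc[mask, :]

# Same filter expressed with dropna.
by_dropna = toy.dropna(subset=['central_govt_debt', 'military_exp'])

print(by_mask.equals(by_dropna))
```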
The first five rows of the filtered DataFrame are displayed by running the following line:
data_to_plot.head()
| Country Code | central_govt_debt | military_exp |
| --- | --- | --- |
| AUS | 29.356946 | 1.951809 |
| AUT | 79.408304 | 0.824770 |
| AZE | 6.385576 | 2.791004 |
| BEL | 7.022605 | 1.084631 |
| BGR | 21.286254 | 1.765384 |
The preceding table has only non-empty values and we are now ready to generate the plots for the cross-sectional data. The following lines of code generate the plot on the univariate cross-sectional data on military expense:
plt.figure(figsize=(5.5, 5.5))
g = sns.distplot(np.array(data_to_plot.military_exp), norm_hist=True)
g.set_title('Military expenditure (% of GDP) of 85 countries in 2010')
The plot is saved as a png file under the plots/ch1 folder of this book's GitHub repository. We will also generate the bivariate plot between military expense and central government debt by running the following code:
plt.figure(figsize=(5.5, 5.5))
g = sns.kdeplot(x=data_to_plot.military_exp, y=data_to_plot.central_govt_debt)
g.set_title('Military expenditures & Debt of central governments in 2010')