Introduction to EDA
EDA is the process of procuring, understanding, and deriving meaningful statistical insights from structured/unstructured data of interest. It is the first step before a more complex analysis, such as predicting future expectations from the data. In the case of financial data, EDA helps obtain insights used later for building profitable trading signals and strategies.
EDA guides later decisions about which features/signals and which predictive models to use or avoid, and it helps invalidate incorrect hypotheses while validating and suggesting correct ones about the nature of the variables and the relationships between them.
EDA is also important in understanding how sample statistics (from a smaller dataset representative of the complete dataset) differ from population statistics (from the complete dataset, the ultimate truth) and keeping that distinction in mind when drawing conclusions about the population based on observations of samples. Thus, EDA helps narrow the search space we explore later; otherwise, we would waste much more time building incorrect or insignificant models and strategies.
EDA must be approached with a scientific mindset. Sometimes, we might reach inadequately validated conclusions based on anecdotal evidence rather than statistical evidence.
Hypotheses based on anecdotal evidence suffer from issues stemming from the following:
- Not being statistically significant, owing to too few observations.
- Selection bias, where the hypothesis is only created because it was first observed.
- Confirmation bias, where our inherent belief in the hypothesis biases our results.
- Inaccuracies in observations.
Let's explore the different steps and techniques involved in EDA, using real datasets.
Steps in EDA
Here is a list of steps involved in EDA (we'll be going through each of them in the subsections that follow):
- Loading the necessary libraries and setting them up
- Data collection
- Data wrangling/munging
- Data cleaning
- Obtaining descriptive statistics
- Visual inspection of the data
- Data cleaning (removing outliers)
- Advanced visualization techniques
Loading the necessary libraries and setting them up
We will be using numpy, pandas, and matplotlib, and these libraries can be loaded with the help of the following code:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sn
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 2)
We use the mpld3 library for enabling zooming within Jupyter's matplotlib charts. The last line of the preceding code block specifies that a maximum of two rows of a pandas DataFrame should be displayed.
Data collection
Data collection is usually the first step of EDA. Data may come from many different sources (comma-separated values (CSV) files, Excel files, web scrapes, binary files, and so on) and will often need to be standardized and formatted consistently before the sources can be combined.
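As a minimal, hypothetical sketch of what this standardization step can look like (the filenames prices_from_vendor_1.csv and prices_from_vendor_2.xlsx are illustrative assumptions, not files used in this exercise), we might read one CSV source and one Excel source into pandas and normalize their column names and index before combining them:
import pandas as pd

# Hypothetical example: read similar price data from two different sources.
csv_df = pd.read_csv('prices_from_vendor_1.csv', parse_dates=True, index_col=0)
xls_df = pd.read_excel('prices_from_vendor_2.xlsx', parse_dates=True, index_col=0)

# Standardize column names and sort both frames by their DateTimeIndex
# so that they can be compared and merged consistently later.
csv_df.columns = [column.strip().title() for column in csv_df.columns]
xls_df.columns = [column.strip().title() for column in xls_df.columns]
csv_df = csv_df.sort_index()
xls_df = xls_df.sort_index()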
For this exercise, we will use data for three different trading instruments over a period of 5 years, stored in .csv format. The identity of these instruments is deliberately not revealed, since that might give away their expected behavior/relationships, but we will reveal it at the end of this exercise to evaluate intuitively how well we performed EDA on them.
Let's start by loading our available datasets into three DataFrames (A, B, and C), as follows:
A = pd.read_csv('A.csv', parse_dates=True, index_col=0); A
DataFrame A has the following structure:

Figure 2.1 – DataFrame constructed from the A.csv file
Similarly, let's load DataFrame B, as follows:
B = pd.read_csv('B.csv', parse_dates=True, index_col=0); B
DataFrame B has the following structure:

Figure 2.2 – DataFrame constructed from the B.csv file
Finally, let's load the C data into a DataFrame, as follows:
C = pd.read_csv('C.csv', parse_dates=True, index_col=0); C
And we see that C has the following fields:

Figure 2.3 – DataFrame constructed from the C.csv file
As we can observe, all three data sources have the same format, with Open, High, Low, Close, and Adj Close prices and Volume information between approximately 2015-05-15 and 2020-05-14.
Data wrangling/munging
Data rarely comes in a ready-to-use format. Data wrangling/munging refers to the process of manipulating and transforming data from its initial raw source into structured, formatted, and easily usable datasets.
Let's use pandas.DataFrame.join(...) to merge the DataFrames and align them on the same DateTimeIndex. Using the lsuffix= and rsuffix= parameters, we assign the _A, _B, and _C suffixes to the columns coming from the three DataFrames, as follows:
merged_df = (A.join(B, how='outer', lsuffix='_A', sort=True)
              .join(C, how='outer', lsuffix='_B', rsuffix='_C', sort=True))
merged_df
We will inspect the merged_df DataFrame we just created and make sure it has all the fields we expected from all three DataFrames (displaying only the first seven columns). The DataFrame can be seen here:

Figure 2.4 – DataFrame constructed by joining the DataFrames A, B, and C
Notice that the original three DataFrames (A, B, and C) had 1,211, 1,209, and 1,206 rows, respectively, but the combined DataFrame has 1,259 rows. This is because we used an outer join, which uses the union of dates across all three DataFrames. When it cannot find values for a specific DataFrame on a specific date, it places a NaN value there for that DataFrame's fields.
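As a quick sanity check on the join (a small sketch relying on the A, B, C, and merged_df DataFrames defined above), we can confirm that the merged index is the union of the three original indexes and count the missing values the join produced:
# The merged index should be the union of the three original DateTimeIndex objects.
print(len(A), len(B), len(C), len(merged_df))
print(len(A.index.union(B.index).union(C.index)) == len(merged_df))   # expected: True
# Total number of missing values across all columns of the merged DataFrame.
print(merged_df.isnull().sum().sum())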
Data cleaning
Data cleaning refers to the process of addressing data errors coming from missing data, incorrect data values, and outliers.
In our example, merged_df has missing values for many fields, both present in the original datasets and introduced by merging DataFrames with different sets of dates.
Let's first check whether there are any rows where all values are missing (NaN), as follows:
merged_df[merged_df.isnull().all(axis=1)]
The result shows that we do not have any row with all fields missing, as we can see here:

Figure 2.5 – DataFrame showing that there are no rows with all fields missing
Now, let's find out how many rows are missing at least one of the three Close prices, as follows:
merged_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
So, it turns out that 148 of our 1,259 rows are missing one or more Close values, as shown here:
148
For our further analysis, we need valid Close prices. Thus, we can drop all rows where the Close price of any of the three instruments is missing, by running the following code:
valid_close_df = merged_df.dropna(subset=['Close_A', 'Close_B', 'Close_C'], how='any')
After dropping the rows with missing Close prices, we should have no more missing Close price fields, as illustrated in the following code snippet:
valid_close_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
The result confirms that there are no rows left where any of the Close_A, Close_B, or Close_C fields are NaN, as we can see here:
0
Let's inspect the new DataFrame, as follows:
valid_close_df
Here is the result (displaying only the first seven columns):

Figure 2.6 – Resulting DataFrame with no missing/NaN values for any close prices
As expected, we dropped the 148 rows that had missing/NaN values for any of the close prices.
Next, let's deal with rows that have NaN values in any of the other fields, starting by getting a sense of how many such rows exist. We can do this by running the following code:
valid_close_df.isnull().any(axis=1).sum()
Here is the output of that query:
165
So, 165 rows have at least one field with a missing value.
Let's quickly inspect a few of these rows, as follows:
valid_close_df[valid_close_df.isnull().any(axis=1)]
Some of the rows with missing values are displayed here (showing only the first seven columns):

Figure 2.7 – DataFrame showing there are still some rows with some missing values
So, we can see that the Low_C field on 2015-05-18 (not visible in the preceding screenshot) and the Open_B field on 2020-05-01 have NaN values (among 163 others, of course).
Let's use the pandas.DataFrame.fillna(...) method with the backfill method, which uses the next valid value after a missing value to fill it in. The code is illustrated in the following snippet:
valid_close_complete = valid_close_df.fillna(method='backfill')
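Note that in recent pandas releases, the method= argument of fillna(...) is deprecated in favor of the dedicated bfill()/ffill() methods; a sketch of the equivalent call would be as follows:
# Equivalent backfill using the dedicated method.
valid_close_complete_alt = valid_close_df.bfill()
# Both approaches should produce identical DataFrames.
print(valid_close_complete_alt.equals(valid_close_complete))   # expected: True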
Let's see the impact of the backfilling, as follows:
valid_close_complete.isnull().any(axis=1).sum()
Now, this is the output for the query:
0
As we can see, after the backfill operation, there are no more missing/NaN values left in any field of any row.
Obtaining descriptive statistics
The next step is to generate key basic statistics for the data, to build familiarity with each field, using the DataFrame.describe(...) method. The code is illustrated in the following snippet:
pd.set_option('display.max_rows', None)
valid_close_complete.describe()
Notice that we have increased the maximum number of rows of a pandas DataFrame to display.
Here is the output of running pandas.DataFrame.describe(...), displaying only the first seven columns:

Figure 2.8 – Descriptive statistics of the valid_close_complete DataFrame
The preceding output provides quick summary statistics for every field in our DataFrame.
Key observations from Figure 2.8 are outlined here:
- Volume_C has all of its statistics equal to 0, implying that every row has the Volume_C value set to 0. Therefore, we need to remove this column.
- Open_C has a minimum value of -400, which is unlikely to be true for the following reasons:
  a) The other price fields (High_C, Low_C, Close_C, and Adj Close_C) all have minimum values around 9, so it doesn't make sense for Open_C to have a minimum value of -400.
  b) Given that the 25th percentile for Open_C is 12.4, it is unlikely that the minimum value would be so much lower than that.
  c) The price of an asset should be non-negative.
- Low_C has a maximum value of 330, which is again unlikely for the following reasons:
  a) The same reasons outlined previously for why the Open_C minimum is not correct.
  b) In addition, Low_C should always be lower than High_C; by definition, the lowest price in a day has to be lower than the highest price on that day.
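These observations can also be confirmed programmatically. Here is a small sketch of such checks, where the expected values are simply the ones reported in Figure 2.8:
# Confirm the observations drawn from the describe() output.
print((valid_close_complete['Volume_C'] == 0).all())   # expected True: the column carries no information
print(valid_close_complete['Open_C'].min())            # expected to be around -400, an impossible price
print(valid_close_complete['Low_C'].max())             # expected to be around 330
# Low should never exceed High; this should print False if Low_C is corrupted.
print((valid_close_complete['Low_C'] <= valid_close_complete['High_C']).all())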
Let's put the output of all pandas DataFrames back to just two rows, as follows:
pd.set_option('display.max_rows', 2)
Now, let's remove the Volume fields for all three instruments, with the following code:
prices_only = valid_close_complete.drop(['Volume_A', 'Volume_B', 'Volume_C'], axis=1)
prices_only
The prices_only DataFrame has the following data (displaying only the first seven columns):

Figure 2.9 – The prices_only DataFrame
As expected, after we removed the three volume columns, we reduced the DataFrame dimensions to 1,111 × 15; they were previously 1,111 × 18.
Visual inspection of the data
The remaining fields do not seem to have any obvious errors or discrepancies, so let's plot a quick visualization of the prices to see whether it is consistent with what we learned from the descriptive statistics.
First, we will start with the prices of A, since we expect those to be correct based on the descriptive statistics summary. The code is illustrated in the following snippet:
valid_close_complete['Open_A'].plot(figsize=(12,6), linestyle='--', color='black', legend='Open_A')
valid_close_complete['Close_A'].plot(figsize=(12,6), linestyle='-', color='grey', legend='Close_A')
valid_close_complete['Low_A'].plot(figsize=(12,6), linestyle=':', color='black', legend='Low_A')
valid_close_complete['High_A'].plot(figsize=(12,6), linestyle='-.', color='grey', legend='High_A')
The output is consistent with our expectations, and we can conclude that the prices of A are valid based on the statistics and the plot shown in the following screenshot:

Figure 2.10 – Plot showing Open, Close, High, and Low prices for trading instrument A over 5 years
Now, let's plot the prices of C to see if the plot provides further evidence regarding our suspicions about some prices being incorrect. The code can be seen in the following snippet:
valid_close_complete['Open_C'].plot(figsize=(12,6), linestyle='--', color='black', legend='Open_C')
valid_close_complete['Close_C'].plot(figsize=(12,6), linestyle='-', color='grey', legend='Close_C')
valid_close_complete['Low_C'].plot(figsize=(12,6), linestyle=':', color='black', legend='Low_C')
valid_close_complete['High_C'].plot(figsize=(12,6), linestyle='-.', color='grey', legend='High_C')
The output confirms that Open_C and Low_C have some erroneous values extremely far away from the other values; these are the outliers. The following screenshot shows a plot illustrating this:

Figure 2.11 – Plot showing large outliers in the prices of C in both positive and negative directions
We will need to perform some further data cleaning to eliminate these outlier values so that we do not derive incorrect statistical insights from our data.
The two most commonly used methods to detect and remove outliers are the interquartile range (IQR) and the Z-score.
IQR
The IQR method uses a percentile/quantile range of values over the entire dataset to identify and remove outliers.
When applying the IQR method, we usually use extreme percentile values, such as 5% to 95%, to minimize the risk of removing correct data points.
In our example of Open_C, let's use the 25th and 75th percentiles and remove all data points with values outside that range. The 25th-to-75th percentile range is (12.4, 17.68), so we would remove the outlier value of -400.
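A minimal sketch of this filter for Open_C is shown next. It keeps only the values inside the chosen percentile range, as described above; note that the classic Tukey variant of the IQR rule instead keeps values within the wider fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR:
# Keep only the rows whose Open_C value lies inside the chosen quantile range;
# everything outside it (including the -400 outlier) is dropped.
lower, upper = prices_only['Open_C'].quantile([0.25, 0.75])
open_c_inliers = prices_only[prices_only['Open_C'].between(lower, upper)]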
Z-score
The Z-score (or standard score) is obtained by subtracting the mean of the dataset from each data point and normalizing the result by dividing by the standard deviation of the dataset.
In other words, the Z-score of a data point represents the distance in the number of standard deviations that the data point is away from the mean of all the data points.
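As a small illustration of this definition (a sketch using the prices_only DataFrame from earlier), the Z-score computed by hand should match scipy.stats.zscore, which by default normalizes by the population standard deviation (ddof=0):
# Compute the Z-score of one column manually and compare with scipy's result.
open_c = prices_only['Open_C']
manual_z = (open_c - open_c.mean()) / open_c.std(ddof=0)
print(np.allclose(manual_z.values, stats.zscore(open_c)))   # expected: True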
For a normal distribution (a reasonable approximation for large enough datasets), there is the 68-95-99.7 distribution rule, summarized as follows:
- 68% of all data will lie within a range of one standard deviation from the mean.
- 95% of all data will lie within a range of two standard deviations from the mean.
- 99.7% of all data will lie within a range of three standard deviations from the mean.
So, after computing the Z-scores of all data points in our dataset (which is large enough), there is less than a 1% chance (roughly 0.3%) of a data point having an absolute Z-score of 3 or more.
Therefore, we can use this information to filter out all observations with absolute Z-scores of 3 or higher to detect and remove outliers.
In our example, we will remove all rows with values whose Z-score is less than -6 or greater than 6, that is, six standard deviations away from the mean.
First, we use scipy.stats.zscore(...) to compute the Z-scores of each column in the prices_only DataFrame, and then we use numpy.abs(...) to get the magnitude of the Z-scores. Finally, we select the rows where all fields have absolute Z-scores lower than 6 and save the result in a no_outlier_prices DataFrame. The code is illustrated in the following snippet:
no_outlier_prices = prices_only[(np.abs(stats.zscore(prices_only)) < 6).all(axis=1)]
Let's see what impact this Z-score outlier removal had on the price fields of instrument C by plotting its prices again and comparing them to the earlier plot, as follows:
no_outlier_prices['Open_C'].plot(figsize=(12,6), linestyle='--', color='black', legend='Open_C')
no_outlier_prices['Close_C'].plot(figsize=(12,6), linestyle='-', color='grey', legend='Close_C')
no_outlier_prices['Low_C'].plot(figsize=(12,6), linestyle=':', color='black', legend='Low_C')
no_outlier_prices['High_C'].plot(figsize=(12,6), linestyle='-.', color='grey', legend='High_C')
Here's the output:

Figure 2.12 – Plot showing the prices of C after removing outliers by applying data cleaning
The plot clearly shows that the extreme values observed earlier for Open_C and Low_C have been discarded; there is no longer the dip to -400.
Note that while we removed the extreme outliers, we were still able to preserve the sharp spikes in prices during 2015, 2018, and 2020, thus not leading to a lot of data losses.
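To quantify that data loss, a quick check of how many rows the Z-score filter actually discarded might look like this:
# Count how many rows the six-standard-deviation filter removed.
removed_rows = len(prices_only) - len(no_outlier_prices)
print(removed_rows, 'of', len(prices_only), 'rows removed')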
Let's also check the impact of our outlier removal work by re-inspecting the descriptive statistics, as follows:
pd.set_option('display.max_rows', None)
no_outlier_prices[['Open_C', 'Close_C', 'Low_C', 'High_C']].describe()
These statistics look significantly better. As we can see in the following screenshot, the min and max values for all prices now look in line with expectations and do not contain extreme values, so we succeeded in our data cleaning task:

Figure 2.13 – Descriptive statistics for the no_outlier_prices selected columns
Let's set the maximum number of rows to display for a pandas DataFrame back to a small value, as follows:
pd.set_option('display.max_rows', 5)
Advanced visualization techniques
In this section, we will explore univariate and multivariate statistics visualization techniques.
First, let's collect the close prices for the three instruments, as follows:
close_prices = no_outlier_prices[['Close_A', 'Close_B', 'Close_C']]
Next, let's compute the daily close price changes to evaluate whether there is a relationship between the daily price changes of the three instruments.
Daily close price changes
We will use the pandas.DataFrame.shift(...) method to shift the close prices by one period so that each row is aligned with the next day's close, allowing us to compute the daily price changes. The pandas.DataFrame.fillna(...) call here fixes the one missing value generated in the last row as a result of the shift operation. Finally, we will rename the columns to Delta_Close_A, Delta_Close_B, and Delta_Close_C to reflect the fact that these values are price differences and not actual prices. The code is illustrated in the following snippet:
delta_close_prices = (close_prices.shift(-1) - close_prices).fillna(0)
delta_close_prices.columns = ['Delta_Close_A', 'Delta_Close_B', 'Delta_Close_C']
delta_close_prices
The content of the newly generated delta_close_prices DataFrame is shown in the following screenshot:

Figure 2.14 – The delta_close_prices DataFrame
These values look correct, judging from the first few actual prices and the calculated price differences.
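As a side note, the same forward-looking differences can also be obtained with pandas.DataFrame.diff(...); a short sketch verifying the equivalence would be as follows:
# diff(-1) computes today's close minus tomorrow's close, so negating it
# reproduces close_prices.shift(-1) - close_prices used above.
delta_close_alt = (-close_prices.diff(-1)).fillna(0)
print(np.allclose(delta_close_alt.values, delta_close_prices.values))   # expected: True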
Now, let's quickly inspect the summary statistics for this new DataFrame to get a sense of how the delta price values are distributed, as follows:
pd.set_option('display.max_rows', None)
delta_close_prices.describe()
The descriptive statistics on this DataFrame are shown in the following screenshot:

Figure 2.15 – Descriptive statistics for the delta_close_prices DataFrame
We can observe from these statistics that all three delta values' means are close to 0, with instrument A experiencing large price swings and instrument C experiencing significantly smaller price moves (judging from the std field).
Histogram plot
Let's observe the distribution of Delta_Close_A to get more familiar with it, using a histogram plot. The code for this is shown in the following snippet:
delta_close_prices['Delta_Close_A'].plot(kind='hist', bins=100, figsize=(12,6), color='black', grid=True)
In the following screenshot, we can see that the distribution is approximately normal:

Figure 2.16 – Histogram of Delta_Close_A values roughly normally distributed around the 0 value
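To complement the visual impression with numbers, a small sketch using scipy.stats can report the skewness, excess kurtosis, and a formal normality test; financial return-like series typically have fat tails, so a very small p-value would not be surprising even when the histogram looks roughly normal:
# Quantitative checks of approximate normality for Delta_Close_A.
sample = delta_close_prices['Delta_Close_A']
print('skewness:', stats.skew(sample))
print('excess kurtosis:', stats.kurtosis(sample))
print('normality test p-value:', stats.normaltest(sample).pvalue)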
Box plot
Let's draw a box plot, which also helps in assessing the values' distribution. The code for this is shown in the following snippet:
delta_close_prices['Delta_Close_B'].plot(kind='box', figsize=(12,6), color='black', grid=True)
The output can be seen in the following screenshot:

Figure 2.17 – Box plot showing mean, median, IQR (25th to 75th percentile), and outliers
Correlation charts
The first step in multivariate data statistics is to assess the correlations between Delta_Close_A, Delta_Close_B, and Delta_Close_C.
The most convenient way to do that is to plot a correlation scatter matrix that shows the pairwise relationship between the three variables, as well as the distribution of each individual variable.
In our example, we demonstrate the option of using kernel density estimation (KDE), which is closely related to histograms but provides a smoother distribution surface across the plots on the diagonals. The code for this is shown in the following snippet:
pd.plotting.scatter_matrix(delta_close_prices, figsize=(10,10), color='black', alpha=0.75, diagonal='kde', grid=True)
This plot indicates that there is a strong positive correlation between Delta_Close_A and Delta_Close_B and a strong negative correlation between Delta_Close_C and each of the other two variables. The diagonals display the distribution of each individual variable, using KDE.
A scatter plot of the fields can be seen in the following screenshot:

Figure 2.18 – Scatter plot of Delta_Close fields with KDE histogram on the diagonals
Next, let's look at some statistics that quantify the relationships between the variables. DataFrame.corr(...) does that for us by computing the pairwise linear correlations. This can be seen in the following code snippet:
delta_close_prices.corr()
The correlation matrix confirms that Delta_Close_A and Delta_Close_B have a strong positive correlation (very close to 1.0, which is the maximum), as we expected based on the scatter plot. Also, Delta_Close_C is negatively correlated (closer to -1.0 than to 0.0) with the other two variables.
You can see the correlation matrix in the following screenshot:

Figure 2.19 – Correlation matrix for Delta_Close_A, Delta_Close_B, and Delta_Close_C
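As an optional robustness check (not part of the original analysis), a rank-based Spearman correlation, which is less sensitive to outliers, can be computed with the same method:
# Spearman (rank) correlations as a robustness check against outliers.
delta_close_prices.corr(method='spearman')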
Pairwise correlation heatmap
An alternative visualization technique known as a heatmap is available via seaborn.heatmap(...), as illustrated in the following code snippet:
plt.figure(figsize=(6,6))
sn.heatmap(delta_close_prices.corr(), annot=True, square=True, linewidths=2)
In the plot shown in the following screenshot, the rightmost scale shows a legend where the darkest values represent the strongest negative correlations and the lightest values represent the strongest positive correlations:

Figure 2.20 – Seaborn heatmap visualizing pairwise correlations between Delta_Close fields
The heatmap shows graphically the same message as the table in the previous section: there is a very high positive correlation between Delta_Close_A and Delta_Close_B, and a very high negative correlation between Delta_Close_A and Delta_Close_C, as well as between Delta_Close_B and Delta_Close_C.
Revelation of the identity of A, B, and C and EDA's conclusions
The A instrument is the Dow Jones Industrial Average (DJIA), a large-cap equity index exchange-traded fund (ETF). The B instrument is the S&P 500 (SPY), another large-cap equity index ETF. The C instrument is the Chicago Board Options Exchange (CBOE) Volatility Index (VIX), which tracks how volatile markets are at any given time (essentially, a function of equity index price swings).
From our EDA on the mystery instruments, we drew the following conclusions:
- C (VIX) cannot have negative prices or prices above 90, which has historically been true.
- A (DJIA) and B (SPY) had huge drops in 2018 and 2020, corresponding to the late-2018 stock market selloff and the COVID-19 pandemic, respectively. Also, the price of C (VIX) spiked at the same time, indicating heightened market turmoil.
- A (DJIA) has the largest daily price swings, followed by B (SPY), and finally C (VIX), which has very small daily price moves. These observations are consistent with the underlying instruments that were hidden.
A (DJIA) and B (SPY) have very strong positive correlations, which makes sense since both are large-cap equity indices. C (VIX) has strong negative correlations with both A (DJIA) and B (SPY), which also makes sense since during periods of prosperity volatility remains low and markets rise, while during periods of crisis volatility spikes and markets drop.
In the next section, we introduce one special Python library that generates the most common EDA charts and tables automatically.