ChatGPT for Exploratory Data Analysis (EDA)

Introduction

Exploratory data analysis (EDA) refers to the initial investigation of data to discover patterns, identify outliers and anomalies, test hypotheses, and check assumptions with the goal of informing future analysis and model building. It is an iterative, exploratory process of questioning, analyzing, and visualizing data.

Some key aspects of exploratory data analysis include:

Getting to know the data - Examining individual variables, their values, distributions, and relationships between variables.
Data cleaning - Checking and handling missing values, outliers, formatting inconsistencies, etc., before further analysis.
Univariate analysis - Looking at one variable at a time to understand its distribution, central tendency, spread, outliers, etc.
Bivariate analysis - Examining relationships between two variables using graphs, charts, and statistical tests. This helps find correlations.
Multivariate analysis - Analyzing patterns between three or more variables simultaneously using techniques like cluster analysis.
Hypothesis generation - Coming up with potential explanations or hypotheses about relationships in the data based on initial findings.
Data visualization - Creating graphs, plots, and charts to summarize findings and detect patterns and anomalies more easily.

The goals of EDA are to understand the dataset, detect useful patterns, formulate hypotheses, and decide how to prepare and preprocess the data for subsequent modeling and analysis.
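To make the univariate and bivariate steps above concrete, here is a minimal sketch in pandas. The DataFrame and its column names are hypothetical and exist only for illustration; the 1.5 * IQR rule is one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "employees": [12, 15, 14, 13, 16, 15, 400],
    "founded": [2001, 1998, 2005, 2010, 1995, 2003, 1990],
})

# Univariate: central tendency and spread of a single variable.
summary = df["employees"].describe()

# Univariate: flag outliers with the 1.5 * IQR rule.
q1, q3 = df["employees"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["employees"] < q1 - 1.5 * iqr) | (df["employees"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# Bivariate: Pearson correlation between two numeric variables.
corr = df["employees"].corr(df["founded"])

print(summary)
print(outliers)
print(f"Pearson correlation: {corr:.2f}")
```

In this toy data the single very large value is the only row flagged, which is the kind of observation that would then prompt a closer look during data cleaning.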

Why ChatGPT for EDA?

Exploratory data analysis (EDA) is an important but often tedious process with its own challenges and pitfalls. ChatGPT can save hours on repetitive tasks: it helps with preparatory data wrangling, exploration, and documentation, freeing you to focus on insights. Its capabilities are likely to grow, and in time it may profile datasets autonomously and propose multiple exploratory avenues. That makes ChatGPT a useful on-demand assistant for solo data scientists and teams looking to speed up the EDA process. The main drawback is that ChatGPT can only handle small datasets directly. There are two common workarounds: give it a smaller sample of the data, or ask it to generate Python code that you then run yourself against the full dataset.

The following table provides detailed challenges/pitfalls during EDA:

Challenge/Pitfall	Details
Getting lost in the weeds	Spending too much time on minor details without focusing on the big picture. This leads to analysis paralysis.
Premature conclusions	Drawing conclusions without considering all possible factors or testing different hypotheses thoroughly.
Bias	Personal biases, preconceptions or domain expertise can skew analysis in a particular direction.
Multiple comparisons	Testing many hypotheses without adjusting for the inflated risk of Type I errors, leading to false discoveries.
Documentation	Failing to properly document methods, assumptions, and thought processes along the way.
Lack of focus	Jumping randomly between analyses without a clear understanding of the business objective.
Ignoring outliers	Not handling outliers appropriately, which can distort the analysis and the patterns it reveals.
Correlation vs causation	Incorrectly inferring causation based only on observed correlations.
Overfitting	Finding patterns in sample data that may not generalize to new data.
Publication bias	Only focusing on publishable significant or "interesting" findings.
Multiple roles	Wearing data analyst and subject expert hats, mixing subjective and objective analysis.
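The multiple comparisons pitfall in the table above is easy to demonstrate. The sketch below applies a Bonferroni correction, one simple (and conservative) adjustment among several. The p-values are made up for illustration.

```python
import numpy as np

# Hypothetical p-values from five exploratory tests (made-up numbers).
p_values = np.array([0.003, 0.02, 0.04, 0.30, 0.75])
alpha = 0.05

# Unadjusted: each test is compared to alpha on its own.
naive_significant = p_values < alpha

# Bonferroni: compare each p-value to alpha divided by the number of tests.
bonferroni_significant = p_values < alpha / len(p_values)

print("Unadjusted:", int(naive_significant.sum()), "significant")
print("Bonferroni:", int(bonferroni_significant.sum()), "significant")
```

With these numbers, three results pass the unadjusted threshold but only one survives the correction, which shows how quickly apparent discoveries can shrink once the number of tests is taken into account.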

With ChatGPT, you get an AI assistant to be your co-pilot on the journey of discovery. ChatGPT can support EDA at various stages of your data analysis, within the dataset size limits mentioned above. The following table lists types of EDA with example prompts (each prompt either produces the output directly or generates Python code for you to execute separately):

Type of EDA	Prompt
Summary Statistics	Describe the structure and summary statistics of this dataset. Check for any anomalies in variable distributions or outliers.
Univariate Analysis	Create histograms and density plots of each numeric variable to visualize their distributions and identify any unusual shapes or concentrations of outliers.
Bivariate Analysis	Generate a correlation matrix and heatmap to examine relationships between variables. Flag any extremely high correlations that could indicate multicollinearity issues.
Dimensionality Reduction	Use PCA to reduce the dimensions of this high-dimensional dataset and project it into 2D. Do any clusters or groupings emerge that provide new insights?
Clustering	Apply K-Means clustering on the standardized dataset with different values of k. Interpret the resulting clusters and check if they reveal any meaningful segments or categories.
Text Analysis	Summarize the topics and sentiments discussed in this text column using topic modeling algorithms like LDA. Do any dominant themes or opinions stand out?
Anomaly Detection	Implement an isolation forest algorithm on the dataset to detect outliers across the variables jointly. Flag and analyze any suspicious or influential data points.
Model Prototyping	Quickly prototype different supervised learning algorithms like logistic regression, decision trees, random forest on this classification dataset. Compare their performance and feature importance.
Model Evaluation	Generate a correlation matrix between predicted vs actual values from different models. Any low correlations potentially indicate nonlinear patterns worth exploring further.
Report Generation	Autogenerate a Jupyter notebook report with key visualizations, findings, conclusions, and recommendations for the next steps based on the exploratory analyses performed.
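As an illustration of the kind of code the Dimensionality Reduction prompt might lead to, here is a minimal PCA sketch. It uses a plain NumPy SVD rather than a library PCA class, and the data is random and purely hypothetical, so no real clusters should be expected in the projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 rows, 5 numeric features (random, for illustration).
X = rng.normal(size=(100, 5))

# Standardize each column so no feature dominates because of its scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal directions, ordered by variance.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Project the data onto the first two principal components.
X_2d = X_std @ Vt[:2].T

# Share of total variance captured by each component.
explained = S**2 / np.sum(S**2)

print(X_2d.shape)
print(explained[:2].round(3))
```

The two columns of `X_2d` can be passed to a scatter plot to look for groupings, and `explained` indicates how much information the 2D view retains.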

How do we feed data to ChatGPT for EDA?

Describe your dataset through natural language prompts, and ChatGPT will either analyze the data you provide or generate code that does so. For small datasets there is no need to write code yourself. For this article, let’s use the CSV file available at: https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-1000.csv (short link: http://tinyurl.com/mphebj4k)

Here are some examples of how ChatGPT can be used for exploratory data analysis:

Prompts:

Describe the structure and summary statistics of this CSV file: [Pasted URL or file contents]
What variable types are in this DataFrame? Import Pandas and show column data types.
Generate a correlation matrix and heatmap for these variables.
Check for missing values in each column and calculate the percentage missing.
Create a histogram to visualize the distribution of this numeric variable.
Compare the distribution of this numeric variable across the groups of a categorical variable using a boxplot.
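For reference, several of the prompts above correspond to one-line pandas operations. The sketch below assumes a small hypothetical DataFrame whose column names and values are illustrative and not taken from the article's CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names and values are illustrative.
df = pd.DataFrame({
    "country": ["US", "US", "IN", "IN", "DE"],
    "industry": ["Tech", None, "Tech", "Retail", "Retail"],
    "employees": [120, 80, np.nan, 300, 50],
})

# Column data types.
print(df.dtypes)

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Group-wise summary of a numeric column (the numbers behind a boxplot).
group_stats = df.groupby("country")["employees"].describe()
print(group_stats)
```

Knowing what the underlying operations look like makes it easier to check whether the output ChatGPT returns for a prompt is plausible.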

In the example below, I pasted two hundred rows of the file; the screenshot shows the prompt I gave.

chatgpt-for-exploratory-data-analysis-eda-img-0

This resulted in the following.

chatgpt-for-exploratory-data-analysis-eda-img-1

Look at the column details it provided. Asking for the list of categorical and numerical variables in table format produces the following:

chatgpt-for-exploratory-data-analysis-eda-img-2

Asking for a statistical summary of numerical columns would produce the following:

chatgpt-for-exploratory-data-analysis-eda-img-3
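A sample of rows like the two hundred pasted in the example above can be prepared with pandas. This is a minimal sketch: the DataFrame here is a hypothetical stand-in for the real file, and in practice `df` would come from `pd.read_csv` on your own data.

```python
import pandas as pd

# Hypothetical stand-in for the real file: 1,000 rows of made-up data.
df = pd.DataFrame({
    "Index": range(1, 1001),
    "Country": ["US", "IN", "DE", "BR"] * 250,
})

# Take the first 200 rows and serialize them as CSV text for pasting.
sample_csv = df.head(200).to_csv(index=False)

# Rough size check before pasting into the chat window.
n_rows = len(sample_csv.splitlines()) - 1  # subtract the header line
print(f"{n_rows} data rows, {len(sample_csv):,} characters")
```

If the first rows are not representative of the file, `df.sample(200, random_state=0)` is an alternative to `df.head(200)`.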

Python Code generation:

For certain aspects of EDA, ChatGPT generates Python code that uses the pandas library, for example:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 
# Load the CSV file from the URL into a DataFrame
url = "https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-1000.csv"
df = pd.read_csv(url)
 
# Display basic information about the DataFrame
print("Basic Info About the DataFrame:")
df.info()  # info() prints its report directly and returns None
 
# Display the first few rows of the DataFrame
print("\nFirst Few Rows of the DataFrame:")
print(df.head())
 
# Summary statistics of numerical columns
print("\nSummary Statistics of Numerical Columns:")
print(df.describe())
 
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
 
# Visualize data
# Example: Histogram of a numerical column (replace 'col_name' with the column name you want to plot)
# plt.hist(df['col_name'], bins=20)
# plt.xlabel('X-axis Label')
# plt.ylabel('Y-axis Label')
# plt.title('Histogram of col_name')
# plt.show()
 
# You can create more visualizations and explore relationships between columns as needed.
 
# Correlation matrix heatmap (numerical columns only; text columns cannot be correlated)
correlation_matrix = df.select_dtypes(include="number").corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix Heatmap")
plt.show()

Running this in Spyder (Anaconda UI) produces the following output without a single error:

chatgpt-for-exploratory-data-analysis-eda-img-4

EDA on large datasets with millions of records

As mentioned earlier in this article, ChatGPT is powerful, but it still has practical limits on how much data it can take in and process within a single conversation. Here are a few things to keep in mind regarding its capabilities with large datasets:

As a rough guide, ChatGPT works best with datasets under 50-100MB in size. It may handle some operations on larger files up to about 1GB, but performance will degrade.
For initial exploration of very large datasets, ChatGPT is still useful. It can quickly summarize dimensions, types, distributions, outliers, etc., to help shape hypotheses.
Advanced analytics like complex multi-variable modeling may not be feasible on the largest datasets directly in ChatGPT.
However, it can help with the data prep - filtering, aggregations, feature engineering, etc. to reduce a large dataset into a more manageable sample for detailed analysis.
Integration with tools that can load large datasets directly (e.g., BigQuery, Spark, Redshift) allows ChatGPT to provide insights on files too big to import wholesale.
As AI capabilities continue to advance, future versions backed by more computing power may be able to handle larger files for a broader set of analytics tasks.
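The data prep idea from the list above, reducing a large dataset to a manageable sample and a few aggregates, can be sketched with chunked reading in pandas. The in-memory CSV below is a hypothetical stand-in for a large file, and the chunk size and 1% sampling rate are arbitrary choices for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: an in-memory CSV with 10,000 rows.
big_csv = io.StringIO(
    "id,country,employees\n"
    + "\n".join(f"{i},{'US' if i % 2 else 'IN'},{i % 500}" for i in range(10_000))
)

sampled_chunks = []
totals = {}

# Read the file in chunks so only part of it is in memory at a time.
for chunk in pd.read_csv(big_csv, chunksize=2_000):
    # Keep a 1% random sample of each chunk for detailed exploration.
    sampled_chunks.append(chunk.sample(frac=0.01, random_state=0))
    # Accumulate a simple aggregate across chunks.
    for country, total in chunk.groupby("country")["employees"].sum().items():
        totals[country] = totals.get(country, 0) + total

sample = pd.concat(sampled_chunks, ignore_index=True)
print(sample.shape)
print(totals)
```

The resulting `sample` is small enough to paste or upload, while `totals` shows how summaries of the full dataset can be computed without loading it all at once.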

Conclusion

ChatGPT revolutionizes Exploratory Data Analysis (EDA) by streamlining the process and making it accessible to a wider audience. EDA is crucial for understanding data, and ChatGPT automates tasks like generating statistics, visualizations, and even code, simplifying the process.

ChatGPT's natural language interface enables users to interact with data using plain language, eliminating the need for extensive coding skills. While it excels in initial exploration and data preparation, it may have limitations with large datasets or complex modeling tasks.

ChatGPT is a valuable EDA companion, empowering data professionals to uncover insights and make data-driven decisions efficiently. ChatGPT's role in data analytics is expected to expand as AI technology evolves, offering even more support for data-driven decision-making.

Author Bio

Rama Kattunga has been working with data for over 15 years at tech giants like Microsoft, Intel, and Samsung. As a geek and a business wonk with degrees from Kellogg and two technology degrees from India, Rama uses his engineering know-how and strategy savvy to get stuff done with analytics, AI, and unlocking insights from massive datasets. When he is not analyzing data, you can find Rama sharing his thoughts as an author, speaker, and digital transformation specialist. Moreover, Rama also finds joy in experimenting with cooking, using videos as his guide to create delicious dishes that he can share with others. This diverse range of interests and skills highlights his well-rounded and dynamic character.