
ChatGPT for Data Analysis

  • 11 min read
  • 27 Sep 2023



Introduction

As datasets continue to grow in size and complexity, exploring, preparing, and documenting data consumes more and more of a data scientist's time. While coding is crucial for actually implementing analyses, there's an opportunity to make the conceptual work more efficient through a simpler, conversational way of interacting with the data.

Enter ChatGPT: an AI tool that interprets plain-language requests and holds natural back-and-forth conversations. By simply describing what you need to do, you can have it draw on its broad knowledge to generate sample results, code snippets, and documentation without writing a single line yourself.

In this fast-paced world where time is precious, ChatGPT seems like a helpful extra pair of hands. You can bounce ideas off it 24/7 to test hypotheses and get artifacts to aid your own work.

It definitely won't take over your job. However, by facilitating exploratory discussion and quick prototyping in plain language, it opens up new ways of working that make better use of our growing computing power.

In this post, I'll demonstrate how ChatGPT streamlines common analyst tasks through example conversations. While coding is still king, ChatGPT serves as a supplementary brain that speeds up the often overlooked problem-definition stage, arguably the most important part of any analysis. Ultimately, combining human intuition with AI assistance will vastly improve our ability to extract value from huge datasets quickly.

Let's see how it can supplement analysis in practice.

You can access the entire code here: https://github.com/rohanchikorde/chatgpt-for-data-analysis.git

Creating a Sample DataFrame using ChatGPT

Here is a sample prompt to have ChatGPT generate a Pandas DataFrame containing fake customer data:

Prompt:

Hey, can you help me create a sample DataFrame containing customer data that I can use to demonstrate exploring and analyzing a dataset through conversations?
I need it to contain around 10 rows with the following columns:
  • 'CustomerID' (integer index)
  • 'Name' (string of first and last name)
  • 'Age' (integer)
  • 'State' (string of two-letter state abbreviation)
  • 'Annual Income' (integer dollar amount)
You can generate random values for the names, ages between 20-60, state abbreviations from the 50 US states, and incomes between $30,000-$150,000.
It would be great if you could print out the code to create a Pandas DataFrame with this sample data so I have something concrete to reference in example conversations. Having sample data upfront will make it easier to walk through tasks like summarizing, grouping, plotting and more.

Voilà! Here is the Python code generated by ChatGPT:

import pandas as pd
import numpy as np
 
# Define the number of rows
num_rows = 10
# Set random seed for reproducibility
np.random.seed(42)
# Define the state abbreviations for the 50 US states
state_abbreviations = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
                       'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
                       'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
                       'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
                       'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
 
# Generate random sample data
customer_data = {
    'CustomerID': range(1, num_rows + 1),
    'Name': [' '.join(np.random.choice(['John', 'Jane', 'David', 'Sarah', 'Michael', 'Emily'], 1)) +
             ' ' +
             np.random.choice(['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis'], 1)[0]
             for _ in range(num_rows)],
    'Age': np.random.randint(20, 61, num_rows),
    'State': np.random.choice(state_abbreviations, num_rows),
    'Annual Income': np.random.randint(30000, 150001, num_rows)
}
 
# Create the DataFrame
df = pd.DataFrame(customer_data)
 
# Print the DataFrame
print(df)

 After running the above code:

[Screenshot: the generated 10-row sample customer DataFrame]

Descriptive Statistics

Descriptive statistics are a fundamental part of data analysis: they summarize the main characteristics of a dataset and help us understand its distribution, central tendency, and variability. Let's explore some common descriptive statistics and how they can be calculated and interpreted:

Measures of Central Tendency:

  • Mean: It represents the average value of a dataset and is computed by summing all the values and dividing by the number of observations.
  • Median: It corresponds to the middle value of a dataset when it is sorted in ascending or descending order. It is less affected by extreme values compared to the mean.
  • Mode: It is the most frequently occurring value in a dataset.

 Python Code by ChatGPT:

import pandas as pd
# Calculate the mean
age_mean = df['Age'].mean()
income_mean = df['Annual Income'].mean()
 
# Calculate the median
age_median = df['Age'].median()
income_median = df['Annual Income'].median()
 
# Calculate the mode
age_mode = df['Age'].mode().values
income_mode = df['Annual Income'].mode().values
 
# Print the results
print("Age Mean:", age_mean)
print("Age Median:", age_median)
print("Age Mode:", age_mode)
 
print("Income Mean:", income_mean)
print("Income Median:", income_median)
print("Income Mode:", income_mode)

Output in Python environment:

[Screenshot: mean, median, and mode for Age and Annual Income]
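As a side note, Pandas can produce several of these summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) in a single call with the built-in describe method. Here is a minimal sketch of my own, not part of the original ChatGPT output, using the same df:

# Summarize the numeric columns in one call
summary_stats = df[['Age', 'Annual Income']].describe()
print(summary_stats)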

Measures of Dispersion/Variability

  • Range: It is the difference between the maximum and minimum values in a dataset, providing an idea of the spread of the data.
  • Variance: It quantifies the average squared deviation of each data point from the mean. A higher variance indicates greater dispersion.
  • Standard Deviation: It is the square root of the variance and provides a measure of the average distance between each data point and the mean.

 Python code generated by ChatGPT:
 

import pandas as pd
# Calculate the range
age_range = df['Age'].max() - df['Age'].min()
income_range = df['Annual Income'].max() - df['Annual Income'].min()
 
# Calculate the variance
age_variance = df['Age'].var()
income_variance = df['Annual Income'].var()
 
# Calculate the standard deviation
age_std_dev = df['Age'].std()
income_std_dev = df['Annual Income'].std()
 
# Print the results
print("Age Range:", age_range)
print("Age Variance:", age_variance)
print("Age Standard Deviation:", age_std_dev)
 
print("Income Range:", income_range)
print("Income Variance:", income_variance)
print("Income Standard Deviation:", income_std_dev)

 Output in Python environment:

[Screenshot: range, variance, and standard deviation for Age and Annual Income]
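One detail worth keeping in mind: Pandas' var and std use the sample formulas (dividing by n - 1, i.e. ddof=1) by default. If you need the population versions instead, pass ddof=0. A small sketch of my own, not from the ChatGPT output:

# Population variance and standard deviation (divide by n instead of n - 1)
age_pop_variance = df['Age'].var(ddof=0)
age_pop_std_dev = df['Age'].std(ddof=0)
print("Age Population Variance:", age_pop_variance)
print("Age Population Standard Deviation:", age_pop_std_dev)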

Percentiles

  • Percentiles divide a dataset into hundredths, allowing us to understand how values are distributed. The median corresponds to the 50th percentile.
  • Quartiles divide the dataset into quarters, with the first quartile (Q1) representing the 25th percentile and the third quartile (Q3) representing the 75th percentile.

Python code generated by ChatGPT:

import pandas as pd
 
# Calculate the percentiles
age_percentiles = df['Age'].quantile([0.25, 0.5, 0.75])
income_percentiles = df['Annual Income'].quantile([0.25, 0.5, 0.75])
 
# Extract the quartiles
age_q1, age_median, age_q3 = age_percentiles
income_q1, income_median, income_q3 = income_percentiles
 
# Print the results
print("Age Percentiles:")
print("Q1 (25th percentile):", age_q1)
print("Median (50th percentile):", age_median)
print("Q3 (75th percentile):", age_q3)
 
print("\nIncome Percentiles:")
print("Q1 (25th percentile):", income_q1)
print("Median (50th percentile):", income_median)
print("Q3 (75th percentile):", income_q3)

Output in Python environment:

[Screenshot: quartiles (Q1, median, Q3) for Age and Annual Income]
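The quantile method isn't limited to quartiles; it accepts any fractions between 0 and 1. For example, here is a quick sketch of my own for the 10th and 90th income percentiles:

# 10th and 90th percentiles of Annual Income
income_deciles = df['Annual Income'].quantile([0.1, 0.9])
print("10th percentile:", income_deciles[0.1])
print("90th percentile:", income_deciles[0.9])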

Skewness and Kurtosis

  • Skewness measures the asymmetry of a distribution. A positive skew indicates a longer tail on the right, while a negative skew indicates a longer tail on the left.
  • Kurtosis measures the heaviness of the tails of a distribution. High kurtosis implies more extreme values, while low kurtosis indicates a flatter distribution.

 Python Code generated by ChatGPT:

import pandas as pd
 
# Calculate the skewness
age_skewness = df['Age'].skew()
income_skewness = df['Annual Income'].skew()
 
# Calculate the kurtosis
age_kurtosis = df['Age'].kurtosis()
income_kurtosis = df['Annual Income'].kurtosis()
 
# Print the results
print("Age Skewness:", age_skewness)
print("Income Skewness:", income_skewness)
 
print("\nAge Kurtosis:", age_kurtosis)
print("Income Kurtosis:", income_kurtosis)

Output in a Python Jupyter notebook:

[Screenshot: skewness and kurtosis for Age and Annual Income]

Grouping and Aggregation

Grouping and aggregation in Python are powerful techniques for analyzing data by grouping it based on specific criteria and calculating summary statistics or performing aggregate functions on each group. Here's the Python code to group the data by state and find the average age and income for each state:

import pandas as pd
# Group the data by State and calculate the average age and income
grouped_data = df.groupby('State').agg({'Age': 'mean', 'Annual Income': 'mean'})
 
# Print the grouped data
print(grouped_data)

Output in a Python Jupyter notebook:

[Screenshot: average Age and Annual Income grouped by State]

In this code, ChatGPT uses the groupby function from the Pandas library to group the data in the DataFrame df by the 'State' column. It then uses the agg function to specify the aggregation functions we want to apply to each group. In this case, it calculates the mean of the 'Age' and 'Annual Income' columns for each state.

The output of this code will be a new DataFrame containing the grouped data with the average age and income for each state. The DataFrame will have the 'State' column as the index and two additional columns: 'Age' and 'Annual Income', representing the average values for each state.
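If you want 'State' back as a regular column, or more than one statistic per group, agg also accepts named aggregations and reset_index flattens the result. A small extension sketch of my own, assuming the same df:

# Several statistics per state, with State restored as a column
state_summary = (
    df.groupby('State')
      .agg(avg_age=('Age', 'mean'),
           avg_income=('Annual Income', 'mean'),
           customers=('CustomerID', 'count'))
      .reset_index()
)
print(state_summary)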

Data Visualization

Histogram of Age

The histogram provides a visual representation of the distribution of ages in the dataset. The x-axis represents the age values, and the y-axis represents the frequency or count of individuals falling into each age bin. The shape of the histogram can provide insights into the data's central tendency, variability, and any skewness in the distribution.

Scatter Plot: Age vs. Annual Income

The scatter plot visualizes the relationship between age and annual income for each data point. Each point on the plot represents an individual's age and their corresponding annual income. By plotting the data points, we can observe patterns, clusters, or trends in the relationship between these two variables. The scatter plot helps identify any potential correlation or lack thereof between age and income.

Python Code for histogram and scatterplot generated by ChatGPT:

import matplotlib.pyplot as plt
# Plot a histogram of the Age variable
plt.hist(df['Age'])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Age')
plt.show()
 
# Plot a scatter plot between Age and Income
plt.scatter(df['Age'], df['Annual Income'])
plt.xlabel('Age')
plt.ylabel('Annual Income')
plt.title('Scatter Plot: Age vs. Annual Income')
plt.show()

Output in a Python Jupyter notebook:

[Screenshot: histogram of Age]

[Screenshot: scatter plot of Age vs. Annual Income]

In this code, ChatGPT uses the hist function from the matplotlib library to plot a histogram of the 'Age' variable. The histogram visualizes the distribution of ages in the dataset. It sets the x-axis label to 'Age', the y-axis label to 'Frequency' (indicating the count of individuals in each age group), and gives the plot a title.

Next, it uses the scatter function to create a scatter plot of 'Age' against 'Annual Income'. The scatter plot shows the relationship between age and annual income for each data point. It sets the x-axis label to 'Age', the y-axis label to 'Annual Income', and gives the plot a title.
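To put a number on what the scatter plot suggests, you can also compute the Pearson correlation between the two columns. A minimal follow-up sketch of my own, not part of the generated code above:

# Pearson correlation coefficient between Age and Annual Income (ranges from -1 to 1)
correlation = df['Age'].corr(df['Annual Income'])
print("Correlation between Age and Annual Income:", round(correlation, 3))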

Conclusion

In this blog, we explored a couple of examples showing how ChatGPT can streamline various aspects of data analysis through natural conversation. By simply describing our needs, we had it generate sample Python code for us without writing a single line ourselves. While the results require human review, ChatGPT handles much of the prototyping work rapidly.

For data scientists who understand programming but want to focus more on problem definition, ChatGPT serves as a helpful digital assistant to offload some of the repetitive technical work. It also opens up analysis to those without coding skills by abstracting the process into simple question-and-response dialogue. While ChatGPT does not replace human expertise, it makes the analysis process more approachable and efficient overall.

Going forward, as chatbots advance in capabilities, we may see them automating ever more complex portions of the data science lifecycle through natural language. But for now, even with its limitations, ChatGPT has proven quite useful as a dialogue-driven aid for getting initial insights, especially when time is of the essence. I hope this post demonstrates how accessible and powerful conversational data science can be.

Author Bio

Rohan Chikorde is an accomplished AI Architect with a postgraduate degree in Machine Learning and Artificial Intelligence. With almost a decade of experience, he has successfully developed deep learning and machine learning models for various business applications. Rohan's expertise spans multiple domains, and he excels in programming languages such as R and Python, as well as analytics techniques like regression analysis and data mining. In addition to his technical prowess, he is an effective communicator, mentor, and team leader. Rohan's passion lies in machine learning, deep learning, and computer vision.