Analyzing Eurostat Data Using OpenAI Code Interpreter

OpenAI's recent release of the ChatGPT Code Interpreter plugin has introduced a groundbreaking addition to their language model, ChatGPT. This plugin combines the power of Large Language Models (LLMs) with traditional programming capabilities, revolutionizing programming workflows and enhancing data analysis processes. By eliminating the need to write code or set up separate environments, the Code Interpreter plugin simplifies the process of leveraging the capabilities of ChatGPT for data analysis. Let's explore how this plugin can be used to analyze Eurostat data and derive valuable insights.

Introduction to the ChatGPT Code Interpreter Plugin

The ChatGPT Code Interpreter plugin provides a Python interpreter within a secure execution environment. It supports file uploads and downloads, allowing seamless handling of data. The plugin enables persistent code execution within a chat conversation, allowing users to build on previous code executions. Its purpose is to solve mathematical problems, perform data analysis and visualization, and convert files between different formats.

Simplifying Programming Workflows

Before the introduction of the Code Interpreter plugin, users had to generate code separately and execute it outside of the ChatGPT model. However, with the Code Interpreter, the entire process can be completed within ChatGPT itself. This eliminates the need for external runtime environments and offers an easy-to-use interface for both programmers and non-programmers to access programming capabilities.

Analyzing Eurostat Data Using Code Interpreter

To demonstrate the capabilities of the Code Interpreter plugin, let's analyze Eurostat data. Eurostat provides various datasets related to agriculture, trade, energy, and more. We will focus on analyzing livestock production data in this example.

Installing the Required Libraries

To begin, we need to install the eurostat library, which allows us to access Eurostat data in Python. We can use the following command within the Code Interpreter plugin:

!pip install eurostat

And then we need to obtain the table of contents to obtain the necessary data for the analysis.

Obtaining Eurostat Data

Once the library is installed, we can proceed to obtain the table of contents (TOC) of the available Eurostat datasets. This will help us identify the datasets relevant to our analysis. Here is the code to retrieve the TOC:

import eurostat
import pandas as pd
 
toc = eurostat.get_toc()
toc_df = pd.DataFrame(toc[1:],columns=toc[0])
toc_df.head().to_string()

We obtain a table that contains the next information:

title      code     type       last update of data last table structure change data start data end
0                           Balance of payments  MED_ECBP  dataset  2020-05-28T23:00:00+0200    2021-02-08T23:00:00+0100       2005     2018
1       International trade by SITC - 1 000 EUR  MED_ECET  dataset  2020-05-28T23:00:00+0200    2021-02-08T23:00:00+0100       2005     2019
2                     Primary energy production  MED_EG10  dataset  2020-05-28T23:00:00+0200    2021-02-08T23:00:00+0100       2005     2018
3  Secondary energy production (transformation)  MED_EG20  dataset  2016-10-28T23:00:00+0200    2021-02-08T23:00:00+0100       2000     2015
4                            Energy consumption  MED_EG30  dataset  2020-05-28T23:00:00+0200    2021-02-08T23:00:00+0100       2005     2018

Let’s save this into a CSV file:

toc_df.to_csv('toc_df',index=False)

The resulting table will contain information such as the dataset title, code, type, last update of data, and data start/end years.

Identifying Relevant Datasets

Next, we can use the uploaded table of contents file to find the dataset codes that are most relevant to our analysis. In this case, we are interested in datasets related to crop production, livestock numbers, fertilizer usage, and economic accounts for agriculture. We can use the following code within the Code Interpreter:

I have uploaded a file with Eurostat datasets and I need to find the codes for the datasets that most resemble the next indicators:
 
- Crops production in tonnes by NUTS 2 regions.
- Livestock and meat numbers by NUTS 2 regions.
- Poultry production or similar by NUTS 2 regions.
- Mlik or derivates production or similar by NUTS 2 regions.
- Economic Accounts for Agriculture (EAA) by NUTS 2 regions, where we will take the total intermediate consumption as a proxy for input energy.
The assistant will now analyze the uploaded and provided this:
The uploaded file contains the following columns:
 
- `title`: The title of the dataset.
- `code`: The code of the dataset.
- `type`: The type of the entry (all seem to be datasets).
- `last update of data`: The date of the last update of the dataset.
- `last table structure change`: The date of the last structural change of the dataset.
- `data start`: The start year of the data.
- `data end`: The end year of the data.
 
To find the codes for the datasets that most resemble the indicators you mentioned, I will search the dataset titles for relevant keywords. For each indicator, I'll look for the following keywords:
 
1. Crop production: "crop", "production", "NUTS 2".
2. Livestock numbers: "livestock", "NUTS 2".
3. Fertilizer usage: "fertilizer", "NUTS 2".
4. Economic Accounts for Agriculture (EAA): "economic accounts", "agriculture", "NUTS 2".
 
Let's start this search.
 
Here are the dataset codes that most resemble your indicators:
 
1. **Crop production in tonnes by NUTS 2 regions:**
    - `APRO_CPNHR`: Crop production in national humidity by NUTS 2 regions (data available from 2000 to 2023)
    - `APRO_CPNHR_H`: Crop production by NUTS 2 regions - historical data (data available from 1975 to 1999)
    - `APRO_CPSHR`: Crop production in EU standard humidity by NUTS 2 regions (data available from 2000 to 2023)
 
2. **Livestock numbers by NUTS 2 regions:**
    - `TAI06`: Share of main livestock types in total livestock (data available from 2005 to 2016)
    - `EF_KVAAREG`: Key farm variables: area, livestock (LSU), labour force (data available from 2005 to 2013)
    - `EF_KVECSLEG`: Key variables: area, livestock (LSU), labour force by economic size class (data available from 2005 to 2013)
    - `EF_KVFTREG`: Key variables: area, livestock (LSU), labour force by type of farming (data available from 2005 to 2013)
    - `EF_LSK_MAIN`: Main livestock indicators by NUTS 2 regions (data available from 2005 to 2020)
    - `EF_LS_OVAAREG`: Livestock: number of farms and heads by size of the economic output (data available from 1990 to 2007)
    - `EF_LS_OVESU`: Livestock: number of farms and heads by economic size class (data available from 1990 to 2007)
    - `EF_LS_OVLSUREG`: Livestock: number of farms and heads by livestock size class (data available from 1990 to 2007)
    - `EF_OLSAAREG`: Livestock: number of farms and heads of animals by size of the agricultural area (data available from 2005 to 2013)
    - `EF_OLSECSREG`: Livestock: number of farms and heads of animals by economic size class (data available from 2005 to 2013)
    - `EF_OLSLSUREG`: Livestock: number of farms and heads of animals by livestock size class (data available from 2005 to 2013)
    - `AEI_EF_LS`: Share of main livestock types in total livestock (data available from 2005 to 2016)
 
3. **Fertilizer usage or similar by NUTS 2 regions:**
    - Unfortunately, there were no datasets found that match this indicator.
 
4. **Economic Accounts for Agriculture (EAA) by NUTS 2 regions:**
    - `AGR_R_ACCTS`: Economic accounts for agriculture by NUTS 2 regions (data available from 1980 to 2021)
    - `AGR_R_ACCTS_H`: Economic accounts for agriculture by NUTS 2 regions - historical data (data available from 1973 to 2012)
 
Please note that the data ranges for each dataset may not cover the exact period you are interested in. You might need to merge data from multiple datasets or look for additional datasets to fill in any gaps.

Based on this, we can retrieve the next datasets:

Crop production in tonnes by NUTS 2 regions:
- APRO_CPNHR: Crop production in national humidity by NUTS 2 regions (data available from 2000 to 2023)
- APRO_CPSHR: Crop production in EU standard humidity by NUTS 2 regions (data available from 2000 to 2023)
Livestock numbers by NUTS 2 regions:
- EF_LSK_MAIN: Main livestock indicators by NUTS 2 regions (data available from 2005 to 2020)
Economic Accounts for Agriculture (EAA) by NUTS 2 regions:
- AGR_R_ACCTS: Economic accounts for agriculture by NUTS 2 regions (data available from 1980 to 2021)

As a next step, we can start downloading one of the datasets using the Eurostat API.

Analyzing Livestock Production Data

We will focus on the "Livestock numbers by NUTS 2 regions" dataset (EF_LSK_MAIN) for our analysis. Let's proceed with downloading and analyzing this dataset:

# Animal production
livestock_prod_code = 'EF_LSK_MAIN'
livestock_prod_pars = eurostat.get_pars(livestock_prod_code)
print(livestock_prod_pars)
par_values = eurostat.get_par_values(livestock_prod_code, 'geo')
# filter the regions for germany
de_par_values = {
    'unit':'LSU',
    'so_eur':'TOTAL',
    'geo':[p for p in par_values if all([p.startswith('DE'),len(p)==4])]}
# Download data for de filtered regions
livestock_prod_data = eurostat.get_data_df(livestock_prod_code, filter_pars=de_par_values)
print(livestock_prod_data.head().to_string())

Which produces the following result:

['freq', 'lsu', 'animals', 'farmtype', 'agrarea', 'so_eur', 'unit', 'geo']
  freq   lsu animals farmtype  agrarea so_eur unit geo\\TIME_PERIOD  2005  2007  2010  2013  2016  2020
0    A  LSU0   A0010  FT15_SO  HA10-19  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0
1    A  LSU0   A0010  FT15_SO    HA2-4  TOTAL  LSU            DE11   0.0   0.0   NaN   NaN  None   NaN
2    A  LSU0   A0010  FT15_SO  HA20-29  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0
3    A  LSU0   A0010  FT15_SO  HA30-49  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0
4    A  LSU0   A0010  FT15_SO    HA5-9  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0

The code above filters the dataset for Germany's NUTS 2 regions, specifically focusing on the unit of measurement (LSU) and the standard output in Euros (TOTAL). The resulting DataFrame provides a glimpse of the livestock production data for different regions and years.

Some of the codes are still not meaningful, specifically, the ones related to animals, so let’s obtain some results about it:

# Get the values in the dataset
livestock_prod_animal_vals = eurostat.get_par_values(livestock_prod_code, 'animals')
# Obtain their meaning
livestock_prod_animal_codes = [p for p in eurostat.get_dic(livestock_prod_code, 'animals') if p[0] in livestock_prod_animal_vals]
livestock_prod_animal_codes = pd.DataFrame(livestock_prod_animal_codes,columns= ['code','meaning'])
print(livestock_prod_animal_codes.head().to_string())

Which produces the next:

code                                meaning
0  A0010                              Livestock
1  A1000  Live horses, asses, mules and hinnies
2  A2000                    Live bovine animals
3  A3100           Live swine, domestic species
4  A4100                             Live sheep

Now let’s download both datasets and ask the ChatGPT code interpreter to run some analysis for us. We want to see some basic analysis of the data that use these to run some basic analysis.

Visualizing and Analyzing the Livestock Production Data

We can leverage the Code Interpreter plugin to visualize and analyze the downloaded livestock production data. I’ve uploaded the file to the code interpreter and ask it to create an analysis of the data:

analyzing-eurostat-data-using-openai-code-interpreter-img-0

The result is a very good initial evaluation of the data. The complete response its:

The data seems to pertain to livestock production and has the following columns:
 
- `freq`: The frequency of the data
- `lsu`: LSU (Livestock Unit) - a reference unit which facilitates the aggregation of livestock from various species and age according to their nutritional or feed requirement.
- `animals`: The type of animals
- `farmtype`: Type of the farm
- `agrarea`: The agricultural area
- `so_eur`: Unknown, needs further clarification
- `unit`: The unit of measurement
- `geo\\TIME_PERIOD`: Geographical region and time period
- `2005`, `2007`, `2010`, `2013`, `2016`, `2020`: These seem to be the years for which the data was collected, indicating the measure of the livestock production.
 
To provide an insightful analysis, let's check the basic statistics of the numerical data and count the unique values of the categorical data.
 
The basic statistics of the numerical data (years 2005, 2007, 2010, 2013, 2016, and 2020) are as follows:
 
- For each year, we have a substantial number of entries (count).
- The mean (average) varies across the years, with the highest being around 3,250 for the year 2007 and the lowest around 1,759 for the year 2020.
- The standard deviation, which measures the dispersion of the data from the mean, is quite large for all years, suggesting a significant variation in the data.
- The minimum value is 0 for all years, which indicates that in some cases, there was no livestock production.
- The 25th, 50th (median), and 75th percentiles are 0 for many years, indicating that a substantial portion of the data has low or zero values.
- The maximum value is quite large for all years, reaching up to around 3,062,740 in 2020, suggesting that some entities had very high livestock production.
 
For the year 2016, there are no data points available. This might be due to missing data or no data collection in that year.
 
Regarding the categorical data:
 
- `freq` has only one unique value, which does not provide any variability.
- `lsu`, `animals`, `farmtype`, `agrarea`, and `geo\\TIME_PERIOD` have multiple unique values, indicating different categories within these fields.
- `so_eur` and `unit` also have only one unique value each.
 
Next, we can further examine the distribution of the numerical data and the most common categories in the categorical data. We could also investigate the missing values in the dataset.

This is a great starting point, now I’ll ask Code Interpreter to run some basic analysis on the data. I’ve uploaded the codes and ask it to join the data and provide some basic visualizations, which results in the next answer:

The uploaded file contains codes for different types of livestock and their meanings.
Now, let's join this data with the original dataset. After that, we will visualize
the livestock production by year and by different categories.

Up next it creates a visualization of the data:

analyzing-eurostat-data-using-openai-code-interpreter-img-1

The image comes with the next description:

The bar chart illustrates the total livestock production over the years. Some observations from the plot include:
 
- The total livestock production appears to be highest in the years 2005 and 2007.
- There's a noticeable drop in total production in 2010 and 2013.
- The year 2016 does not have any data.
- There's a slight increase in total production in 2020 compared to 2013.
 
Next, let's analyze the livestock production by different categories, such as the type of animals and the type of farms.

Up next it creates a visualization of the joined data by type of animal:

analyzing-eurostat-data-using-openai-code-interpreter-img-2

It not only provides the visualization but also an analysis of the data:

The bar chart shows the top 5 animal types with the highest total livestock production. Some observations from the plot include:
 
- "Livestock" has the highest total production. However, this category may be a general category that includes various types of animals.
- "Live bovine animals" and "Live swine, domestic species" have the next highest total production levels.
- "Live poultry, domestic species" and "Live sheep" also contribute significantly to the total livestock production.
 
Next, let's analyze the livestock production by different farm types.

As we can see, the analysis continues with the next of step of the analysis. First by creating the visualization:

analyzing-eurostat-data-using-openai-code-interpreter-img-3

And then by providing the resulting analysis:

The bar chart shows the top 5 farm types with the highest total livestock production. Some observations from the plot include:
 
- The farm type "FT15_SO" has the highest total livestock production. However, without a proper definition for this farm type, it's hard to interpret what this category entails.
- The farm types "FT16_SO", "FT10_SO", "FT14_SO", and "FT08_SO" also have significant levels of livestock production.
 
These visualizations provide a broad overview of the livestock production trends by year, animal type, and farm type. For a more detailed analysis, it would be beneficial to have more context or information about the different categories, especially the farm types.

The Code Interpreter plugin is particularly useful for conducting data analysis and visualization tasks. It allows users to directly interact with large datasets within ChatGPT, posing questions and extracting valuable insights without the need for writing code. The plugin's ability to handle substantial amounts of data enables efficient exploration and analysis, enhancing decision-making processes. Furthermore, it provides visual representations of complex datasets, simplifying data comprehension and aiding in the communication of findings.

The resulting analyses provide insights into the most common animal types and farm types contributing to livestock production.

Conclusion

The Eurostat dataset, integrated with OpenAI's ChatGPT Code Interpreter plugin, streamlines data analysis. This fusion of programming and language models empowers users to extract insights from vast datasets, regardless of coding expertise. With simplified workflows, the plugin caters to programmers and non-programmers alike, enabling data retrieval, filtering, visualization, and basic analysis. This innovation has applications in marketing, finance, HR, and healthcare, democratizing coding and fostering collaboration. The technology's evolution holds the promise of revolutionizing programming and data analysis, empowering users to glean insights from intricate datasets for informed decision-making across industries.

Author Bio:

Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.