Data Cleaning Made Easy with ChatGPT

  • 5 min read
  • 02 Jun 2023


Identifying inconsistencies and inaccuracies in data is a vital part of the data analysis process. ChatGPT is an AI-powered natural language processing tool that lets users have human-like conversations and helps them complete tasks quickly. In this article, we'll focus on how ChatGPT can make data cleansing (also called data cleaning) more efficient.

 

Data Cleansing/Cleaning with ChatGPT

 

Given the volume, velocity, and variety of data we deal with nowadays, carrying out data cleansing manually is very time-consuming. Data cleansing covers steps such as removing duplicate data and checking validity, uniqueness, consistency, and correctness, all of which increase the quality of the data. Cleansed data provides better business insights and enables business users to make sound decisions. Data cleansing goes through a series of steps, starting with gathering the data and ending with integrating, producing, and normalizing it, as shown in the image below:

 


Image 1: Data cleansing cycle

 

 

Most organizations carry out the following data cleansing tasks as part of exploratory data analysis:

 

  • Identifying and removing duplicate values
  • Filling null values with a default value
  • Correcting inconsistent data
  • Standardizing date formats
  • Standardizing names and addresses
  • Extracting area codes from phone numbers
  • Flattening nested data structures
  • Removing incomplete data
  • Detecting conflicts in the database
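
To make a few of these tasks concrete, here is a minimal sketch of three of them (flattening nested data, removing duplicates, and filling null values) using only the Python standard library. The sample records, the helper names, and the default values are hypothetical and chosen purely for illustration; they do not come from the article's dataset.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def drop_duplicates(rows):
    """Keep only the first occurrence of each identical (flat) record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_nulls(rows, defaults):
    """Replace None with the default configured for that column, if any."""
    return [
        {col: (defaults.get(col, val) if val is None else val)
         for col, val in row.items()}
        for row in rows
    ]


# Hypothetical sample records (not from the article).
records = [
    {"id": 1, "name": "Ann", "address": {"city": "Pune", "zip": None}},
    {"id": 1, "name": "Ann", "address": {"city": "Pune", "zip": None}},
    {"id": 2, "name": None, "address": {"city": "Oslo", "zip": "0150"}},
]

flat_records = [flatten(r) for r in records]
unique_records = drop_duplicates(flat_records)
cleaned = fill_nulls(unique_records, {"name": "Unknown", "address.zip": "N/A"})

for row in cleaned:
    print(row)
```

In practice a library such as pandas offers ready-made equivalents for these steps; the plain-Python version is shown here only to keep the logic visible.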

 

ChatGPT lets us handle time-consuming and tedious tasks such as data cleansing with much less effort. To illustrate, consider an employee dataset from the banking industry with the columns Employee ID, Employee Name, Department Name, and Joining Date. While reviewing the data, we found several data quality issues that must be resolved before the data can be used for analytics.

 

Example: Employee Name is inconsistent, with some values in lowercase and others in uppercase letters. The date format of the Joining Date column is not uniform either.

 

Traditional Way of Working

 

To clean up this data in Excel, we must manually construct formulas and apply functions such as TRIM, UPPER, or LOWER before using the data for analytics. This requires development work and ongoing upkeep of Excel logic without version control or change history. Sounds extremely tedious, doesn't it?

 

Working with ChatGPT

 

We can use ChatGPT to automate this data cleansing task by having it generate Python code. In this example, we'll ask ChatGPT for Python code that standardizes the employee names and the date format of the joining date.


ChatGPT prompt:

Here is the prompt that we can provide in the text format, in case you plan to copy and paste:
 

            Employee ID | Employee Name | Department Name | Joining Date

            214                   john Root                  HR                             1-06-2003

            435                   STEVE Smith             Retail                          21-Feb-05

            654                   Sachin WALA            OPSI                           25-July-1999

 

Above is the employee data source which should be cleaned. Employee names are not consistent, and the joining date is not in a uniform date format. Generate a Python code to create accurate data.

 


Image 2: Input to ChatGPT

As seen in the image above, we pass the dataset together with a description of which columns we want to clean and how.

 

Output from ChatGPT

ChatGPT generates Python code with a set of generic functions that clean the specified columns according to our instructions. The Python code returned by ChatGPT is shown below.

     


Image 3: Output Python code from ChatGPT

 

Along with the generated Python code, ChatGPT also displays a sample result of running it on the supplied data. Employee names are now uniform, and the joining date is shown in a common date format.
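
Because the generated code is only available as an image, the following is a minimal sketch of what such a cleaning script might look like. It is not the actual ChatGPT output, which may differ (for instance, it may rely on pandas). The sketch uses only the standard library and makes two assumptions: names should be written with each word capitalized, and all dates are day-first, so `1-06-2003` is read as 1 June 2003. The function names and the ISO 8601 target format are likewise illustrative choices.

```python
from datetime import datetime

# Sample rows copied from the prompt above.
RAW_ROWS = [
    {"Employee ID": 214, "Employee Name": "john Root",
     "Department Name": "HR", "Joining Date": "1-06-2003"},
    {"Employee ID": 435, "Employee Name": "STEVE Smith",
     "Department Name": "Retail", "Joining Date": "21-Feb-05"},
    {"Employee ID": 654, "Employee Name": "Sachin WALA",
     "Department Name": "OPSI", "Joining Date": "25-July-1999"},
]

# Assumption: every date is day-first. These three formats cover the
# styles that appear in the sample data; extend the tuple for others.
DATE_FORMATS = ("%d-%m-%Y", "%d-%b-%y", "%d-%B-%Y")


def standardize_name(name):
    """Collapse extra whitespace and capitalize each word."""
    return " ".join(part.capitalize() for part in name.split())


def standardize_date(text):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def clean_rows(rows):
    """Return copies of the rows with names and dates standardized."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        new_row["Employee Name"] = standardize_name(row["Employee Name"])
        new_row["Joining Date"] = standardize_date(row["Joining Date"])
        cleaned.append(new_row)
    return cleaned


CLEANED_ROWS = clean_rows(RAW_ROWS)
for row in CLEANED_ROWS:
    print(row)
```

Raising an error on an unrecognized date, rather than guessing, keeps unexpected formats visible so they can be added to `DATE_FORMATS` deliberately.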

 


Image 4: Sample output from ChatGPT

 

This Python code is not tied to the employee dataset; it can be reused whenever another data source with similar issues needs cleaning. Using ChatGPT's capabilities, we can therefore build an automated data cleaning process that is precise and efficient.
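
One way to make cleaning code reusable across data sources is to separate the cleaning rules from the data. The sketch below is an illustration of that idea rather than something taken from the ChatGPT output: the `apply_cleaners` helper and the customer records are hypothetical, and each column is simply mapped to a function that cleans its values.

```python
def apply_cleaners(rows, cleaners):
    """Return new rows with each configured column passed through its cleaner.

    Columns without a cleaner, and None values, are left unchanged.
    """
    return [
        {col: cleaners[col](val) if col in cleaners and val is not None else val
         for col, val in row.items()}
        for row in rows
    ]


# Hypothetical data source with different columns than the employee example.
customers = [
    {"Customer": "  acme   corp ", "Country": "in"},
    {"Customer": "Globex", "Country": None},
]

# Per-column rules: normalize whitespace and casing for names,
# uppercase the country codes.
customer_cleaners = {
    "Customer": lambda s: " ".join(s.split()).title(),
    "Country": str.upper,
}

cleaned_customers = apply_cleaners(customers, customer_cleaners)
for row in cleaned_customers:
    print(row)
```

With this layout, a new data source only needs a new dictionary of per-column rules; the loop that applies them stays the same.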

There are also tools on the market, such as RATH, that integrate with ChatGPT. If you are dealing with a large volume of data and spend a lot of time on data cleaning, such tools can simplify the data analysis workflow and increase your productivity without a lot of manual work.

 

Conclusion

This article gave you a basic understanding of the data cleaning/cleansing process, which helps you use data to make more trustworthy decisions. It also showed how ChatGPT can be used to clean your data simply and effectively, whatever the data volume.

 

Author Bio:

Sagar Lad is a Cloud Data Solution Architect with a leading organisation and has deep expertise in designing and building Enterprise-grade Intelligent Azure Data and Analytics Solutions. He is a published author, content writer, Microsoft Certified Trainer, and C# Corner MVP.

You can follow Sagar on - Medium, Amazon, LinkedIn