Identifying inconsistencies and inaccuracies in data is a vital part of the data analysis process. ChatGPT is an AI-powered natural language processing tool that lets users hold human-like conversations and complete tasks quickly. In this article, we'll focus on how ChatGPT can make data cleansing more efficient.
Given the volume, velocity, and variety of data we deal with today, cleansing data manually is very time-consuming. Data cleansing covers removing duplicate records and checking the data for validity, uniqueness, consistency, and correctness, all with the aim of improving data quality. Cleansed data yields better business insights and helps business users make sound decisions. Data cleansing goes through a series of steps, starting with gathering the data and ending with integrating, producing, and normalizing it, as shown in the image below:
Image 1: Data cleansing cycle
Most organizations carry out the following tasks as part of data cleansing during exploratory data analysis:
ChatGPT makes time-consuming and tedious tasks like data cleansing much easier. To illustrate, let's take an example of employee details from the banking industry, with the columns Employee ID, Employee Name, Department Name, and Joining Date. While reviewing the data, we discovered a number of data quality issues that must be resolved before we can use this data for analytics.
Example: The Employee Name column is inconsistent, with some values in lowercase and others in uppercase. The date format is not uniform in the Joining Date column.
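The two issues described above can also be detected programmatically. The following is a minimal sketch using only the Python standard library; the helper names `is_title_case` and `date_shape` are hypothetical, and the rows are the three sample records used later in this article.

```python
import re

# Sample rows taken from the article's employee table:
# (Employee ID, Employee Name, Department Name, Joining Date)
rows = [
    ("214", "john Root", "HR", "1-06-2003"),
    ("435", "STEVE Smith", "Retail", "21-Feb-05"),
    ("654", "Sachin WALA", "OPSI", "25-July-1999"),
]

def is_title_case(name: str) -> bool:
    """True when every word is a capital letter followed by lowercase letters."""
    return all(word == word.capitalize() for word in name.split())

def date_shape(value: str) -> str:
    """Reduce a date string to its shape: letters become 'A', digits become '9'."""
    shape = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"\d", "9", shape)

# Names whose casing deviates from "First Last" style.
bad_names = [name for _, name, _, _ in rows if not is_title_case(name)]

# Distinct date shapes; more than one shape means the column is not uniform.
shapes = {date_shape(date) for *_, date in rows}

print("Inconsistent names:", bad_names)
print("Date shapes found:", sorted(shapes))
```

Here all three names are flagged, and the date column yields three different shapes, which confirms that both columns need standardizing.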
To clean up this data in Excel, we must manually construct formulas and apply functions like TRIM, UPPER, or LOWER before using it for analytics. This requires development work, and the Excel logic must then be maintained without version control or change history. Sounds extremely tedious, doesn't it?
We can use ChatGPT to automate this data cleansing task by having it generate Python code. In this example, we'll use Python code generated by ChatGPT to standardize the employee names and the date format of the joining date.
Here is the prompt in text form, in case you plan to copy and paste it:
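For comparison, the three Excel functions mentioned above map roughly onto Python string operations. The sketch below is illustrative only; `excel_trim` is a hypothetical helper that approximates TRIM by removing leading and trailing spaces and collapsing repeated spaces between words.

```python
def excel_trim(text: str) -> str:
    """Approximate Excel's TRIM: keep single spaces between words only."""
    return " ".join(part for part in text.split(" ") if part)

raw = "  john   Root "

trimmed = excel_trim(raw)   # counterpart of TRIM
upper = trimmed.upper()     # counterpart of UPPER
lower = trimmed.lower()     # counterpart of LOWER

print(repr(trimmed))  # 'john Root'
print(repr(upper))    # 'JOHN ROOT'
print(repr(lower))    # 'john root'
```

Unlike spreadsheet formulas, a script like this lives in a plain text file, so it can be tracked in version control.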
Employee ID | Employee Name | Department Name | Joining Date
214 | john Root | HR | 1-06-2003
435 | STEVE Smith | Retail | 21-Feb-05
654 | Sachin WALA | OPSI | 25-July-1999
Above is the employee data source which should be cleaned. Employee names are not consistent, and the joining date is not in a uniform date format. Generate a Python code to create accurate data.
Image 2: Input to ChatGPT
As seen in the image above, we pass the dataset along with a description of which columns to clean and how.
ChatGPT generates Python code with generic functions that clean the specified columns according to our description. The Python code it produced is shown below.
Image 3: Output Python code from ChatGPT
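Since the generated code is available only as an image, here is a minimal sketch of what such a cleaning script might look like. It is not ChatGPT's actual output. It uses only the standard library, the function names `clean_name` and `clean_date` are hypothetical, and it assumes that dates are written day-first and that ISO format (YYYY-MM-DD) is the desired common format.

```python
from datetime import datetime

# Candidate input formats, tried in order. Day-first order is an assumption.
DATE_FORMATS = ("%d-%m-%Y", "%d-%b-%y", "%d-%b-%Y", "%d-%B-%y", "%d-%B-%Y")

def clean_name(name: str) -> str:
    """Capitalize each word and collapse extra whitespace."""
    return " ".join(word.capitalize() for word in name.split())

def clean_date(value: str) -> str:
    """Parse a date in any known format and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"Unrecognised date format: {value!r}")

# The three sample records from the prompt.
employees = [
    (214, "john Root", "HR", "1-06-2003"),
    (435, "STEVE Smith", "Retail", "21-Feb-05"),
    (654, "Sachin WALA", "OPSI", "25-July-1999"),
]

cleaned = [
    (emp_id, clean_name(name), dept, clean_date(joined))
    for emp_id, name, dept, joined in employees
]

for record in cleaned:
    print(record)
```

Raising an error for an unknown date format, rather than guessing, is a deliberate choice in this sketch: a silently misparsed date is harder to spot later than a failed run.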
ChatGPT also displays a sample result of applying the generated Python code to the given data. Employee names are now uniform, and the joining date is shown in a common date format.
Image 4: Sample output from ChatGPT
This Python code is not limited to the employee dataset; it can be reused whenever we need to clean other data sources with similar issues. Using ChatGPT's capabilities, we can therefore build an automated data cleaning process that is precise and efficient.
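One way to make cleaning logic reusable across datasets is to separate the per-column rules from the code that applies them. The sketch below illustrates the idea under stated assumptions: `clean_table` is a hypothetical helper, and the customer data with its `Customer Name` and `City` columns is invented purely for illustration.

```python
import csv
import io

def clean_table(rows, cleaners):
    """Apply a cleaner function to each listed column; leave other columns as-is."""
    cleaned = []
    for row in rows:
        cleaned.append({
            column: cleaners[column](value) if column in cleaners else value
            for column, value in row.items()
        })
    return cleaned

# Hypothetical second dataset, read from CSV text instead of a file.
sample_csv = "Customer Name,City\n alice JONES ,pune\nBOB ray, Delhi \n"
reader = csv.DictReader(io.StringIO(sample_csv))

# Column-specific rules; only this mapping changes from dataset to dataset.
cleaners = {
    "Customer Name": lambda v: " ".join(w.capitalize() for w in v.split()),
    "City": lambda v: v.strip().title(),
}

result = clean_table(reader, cleaners)
for row in result:
    print(row)
```

With this layout, cleaning a new data source means supplying a new `cleaners` mapping rather than rewriting the script.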
If you are struggling with a large volume of data and spend a lot of time on data cleaning, there are also tools on the market, such as RATH, that integrate with ChatGPT to simplify the data analysis workflow and increase productivity without much manual work.
This article gave you a fundamental grasp of the data cleansing procedure, which will help you make more trustworthy, data-driven decisions. It also showed how ChatGPT can be used to clean data simply and effectively, whatever the data volume.
Sagar Lad is a Cloud Data Solution Architect with a leading organisation and has deep expertise in designing and building Enterprise-grade Intelligent Azure Data and Analytics Solutions. He is a published author, content writer, Microsoft Certified Trainer, and C# Corner MVP.