Introduction
In the previous chapter, we discussed the basics of data analysis and SQL. We also used CRUD (create, read, update, and delete) operations on a table. These techniques are the foundation for all the work undertaken in analytics. One such task we will implement is the creation of clean datasets.
According to Forbes, it is estimated that almost 80% of the time spent by analytics professionals involves preparing data for use in analysis and building models with unclean data, which harms analysis by leading to poor conclusions. SQL can help in this tedious but important task by providing efficient ways to build clean datasets.
We will start by discussing how to assemble data using JOIN
and UNION
. Then, we will use different functions, such as CASE WHEN
, COALESCE
, NULLIF
, and LEAST/GREATEST
, in order to clean data. We will then discuss how to transform and remove duplicate data from queries using the DISTINCT
command.