Introduction
In the previous chapter, we discussed the basics of SQL and how to work with individual tables in SQL. We also used CRUD (create, read, update and delete) operations on a table. These tables are the foundation for all the work undertaken in analytics. One of the first tasks implemented in analytics is to create clean datasets. According to Forbes, it is estimated that, almost 80% of the time spent by analytics professionals involves preparing data for use in analysis and building models with unclean data which harms analysis by leading to poor conclusions. SQL can help in this tedious but important task, by providing ways to build datasets which are clean, in an efficient manner. We will start by discussing how to assemble data using JOIN
s and UNION
s. Then, we will use different functions, such as CASE WHEN
, COALESCE
, NULLIF
, and LEAST/GREATEST
, to clean data. We will then discuss how to transform and remove duplicate data from queries using the DISTINCT
command.