Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Science with SQL Server Quick Start Guide

You're reading from   Data Science with SQL Server Quick Start Guide Integrate SQL Server with data science

Arrow left icon
Product type Paperback
Published in Aug 2018
Publisher Packt
ISBN-13 9781789537123
Length 206 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Dejan Sarka Dejan Sarka
Author Profile Icon Dejan Sarka
Dejan Sarka
Arrow right icon
View More author details
Toc

Table of Contents (10) Chapters Close

Preface 1. Writing Queries with T-SQL 2. Introducing R FREE CHAPTER 3. Getting Familiar with Python 4. Data Overview 5. Data Preparation 6. Intermediate Statistics and Graphs 7. Unsupervised Machine Learning 8. Supervised Machine Learning 9. Other Books You May Enjoy

Core T-SQL SELECT statement elements

You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage:

  • SELECT: Defines the columns returned, or a projection of all table columns
  • FROM: Lists the tables used in the query and how they are associated, or joined
  • WHERE: Filters the data to return only the rows that satisfy the condition in the predicate
  • GROUP BY: Defines the groups over which the data is aggregated
  • HAVING: Filters the data after the grouping with conditions that refer to aggregations
  • ORDER BY: Sorts the rows returned to the client application

The simplest form of the SELECT statement

Let's start with the simplest concept of SQL that every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the AdventureWorksDW2017 database context and selects all data from the dbo.DimEmployee table:

USE AdventureWorksDW2017;
GO
SELECT *
FROM dbo.DimEmployee;

This query returns 296 rows, all employees with all columns.

Using SELECT * is not recommended in production. Queries with SELECT * can return an unexpected result when the table structure changes and are also not suitable for good optimization.

Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only three columns from the table:

SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee;

Here is the shortened result, limited to the first three rows only:

EmployeeKey FirstName LastName
----------- --------- ----------
1 Guy Gilbert
2 Kevin Brown
3 Roberto Tamburello

Object names in SQL Server, such as table and column, can include spaces. Names that include spaces are called delimited identifiers. To make SQL Server properly understand them as column names, you must enclose delimited identifiers in square brackets. However, if you prefer to have names without spaces, or if you use computed expressions in the column list, you can add column aliases. The following code uses an expression in the SELECT clause to create a calculated column called [Full Name], and then uses the INTO clause to store the data in a table.

The next query retrieves the data from the newly created and populated dbo.EmpFull table:

SELECT EmployeeKey,
FirstName + ' ' + LastName AS [Full Name]
INTO dbo.EmpFUll
FROM dbo.DimEmployee;
GO
SELECT EmployeeKey, [Full Name]
FROM dbo.EmpFUll;

Here is the partial result:

EmployeeKey Full Name
----------- ------------------
1 Guy Gilbert
2 Kevin Brown
3 Roberto Tamburello

As you have seen before, there are 296 employees. If you check the full result of the first query, you might notice that there is a column named SalesPersonFlag in the dbo.DimEmployee table. You might want to check which of the employees are also salespeople. You can filter the results of a query with the WHERE clause, as the following query shows:

SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee
WHERE SalesPersonFlag = 1;

This query returns 17 rows only.

Joining multiple tables

In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, such as employees in the examples you have seen so far. In order to make result sets meaningful for the business your database supports, most of the time you need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. An inner join returns only rows for which the condition in the join predicate for the two joined tables evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false, or NULL. For an inner join, the order of the tables involved in the join is not important.

In the following example, you can see the dbo.DimEmployee table joined with an inner join to the dbo.FactResellerSales table:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
fr.SalesAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;

Here are the partial results:

EmployeeKey FirstName LastName SalesAmount
----------- --------- -------- -----------
285 Tsvi Reiter 2024.994
285 Tsvi Reiter 6074.982
285 Tsvi Reiter 2024.994

In the previous query, you can see that table aliases are used. If a column's name is unique across all tables in the query, you can use it without a table name. If not, you need to use table name in front of the column, to avoid ambiguous column names, in the table.column format. In the previous query, the EmployeeKey column appears in both tables. Therefore, you need to precede this column name with the table name of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.

The previous query returned 60,855 rows. It is always recommended to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or said differently, whether the query is written in a logically correct way. If every sale has an employee, as it should have, then the previous query should have returned exactly the number of rows dbo.FactResellerSales has. You can quickly check the number of rows in the dbo.FactResellerSales table with the help of the COUNT(*) aggregate function, as the following query shows:

SELECT COUNT(*) AS ResellerSalesCount
FROM dbo.FactResellerSales;

The result is, as you probably expected, 60,855 rows.

You can join multiple tables in a single query. The following code joins seven tables in a single query. Note that all of the joins are still inner joins. The query returns 60,855 rows again, with at least 1 column from each table involved in the query:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
r.ResellerKey, r.ResellerName,
d.DateKey, d.CalendarYear, d.CalendarQuarter,
p.ProductKey, p.EnglishProductName,
ps.EnglishProductSubcategoryName,
pc.EnglishProductCategoryName,
fr.OrderQuantity, fr.SalesAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
INNER JOIN dbo.DimDate AS d
ON fr.OrderDateKey = d.DateKey
INNER JOIN dbo.DimProduct AS p
ON fr.ProductKey = p.ProductKey
INNER JOIN dbo.DimProductSubcategory AS ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
INNER JOIN dbo.DimProductCategory AS pc
ON ps.ProductCategoryKey = pc.ProductCategoryKey;

In the dbo.Employees table, there are 17 salespeople. Do all of them have at least one sale, at least one row from the dbo.FactResellerSales table associated with the employee key of that salesperson? You can check how many distinct employees have sales associated with them with the help of the DISTINCT keyword:

SELECT DISTINCT fr.EmployeeKey
FROM dbo.FactResellerSales AS fr;

The query returns 17 rows. Now imagine that you would like to list all sales rows together with the employees' data, but you also need to include in the result the employees that are not salespeople, that do now have any row associated with their EmployeeKey column in the fact table. You can use an outer join to fulfill this task.

With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows, like what you get from an inner join plus the preserved rows. Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, the rows from the left table are preserved. If you use RIGHT OUTER JOIN, the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query uses a left outer join to preserve the rows from the dbo.DimEmployee table:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
fr.SalesAmount
FROM dbo.DimEmployee AS e
LEFT OUTER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;

The query returns 61,134 rows. Did we get all of the employees in the result? You can check this by checking the distinct EmployeeKey after the outer join:

SELECT DISTINCT e.EmployeeKey
FROM dbo.DimEmployee AS e
LEFT OUTER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;

The query returns 296 rows, which is the number of employees.

Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows of an outer join. You can control the join order with parentheses. The following query uses the right outer join to preserve all employees and makes sure that this join is executed after all inner joins:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
r.ResellerKey, r.ResellerName,
d.DateKey, d.CalendarYear, d.CalendarQuarter,
p.ProductKey, p.EnglishProductName,
ps.EnglishProductSubcategoryName,
pc.EnglishProductCategoryName,
fr.OrderQuantity, fr.SalesAmount
FROM (dbo.FactResellerSales AS fr
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
INNER JOIN dbo.DimDate AS d
ON fr.OrderDateKey = d.DateKey
INNER JOIN dbo.DimProduct AS p
ON fr.ProductKey = p.ProductKey
INNER JOIN dbo.DimProductSubcategory AS ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
INNER JOIN dbo.DimProductCategory AS pc
ON ps.ProductCategoryKey = pc.ProductCategoryKey)
RIGHT OUTER JOIN dbo.DimEmployee AS e
ON e.EmployeeKey = fr.EmployeeKey;

The query returns 61,134 rows, as it should. Note that with the usage of the parenthesis, the order of joins is defined in the following way:

  • Perform all inner joins, with an arbitrary order among them

  • Execute the left outer join after all of the inner joins

Grouping and aggregating data

Many times, you need to aggregate data in groups. This is where the GROUP BY clause comes in handy. The following query aggregates the sales data for each employee:

SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey;

The query returns 17 aggregated rows. Here are the results, abbreviated to the first three rows only:

EmployeeKey LastName   EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
284 Vargas 11544 3609447.2163
295 Valdez 6898 1790640.2311
281 Blythe 23058 9293903.0055

In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. That is why the LastName column in the SELECT list is used in the MIN() aggregate function. You need to get a scalar, a single aggregated value for each row for each column not included in the GROUP BY list.

Sometimes, you need to filter aggregated data. For example, you might need to find only the employees for which the sum of the order quantity did not reach 10,000. You can filter the result set on the aggregated data by using the HAVING clause:

SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) < 10000;

The query returns eight rows only. Note that you can't use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all other clauses from the query, and the aliases are not known yet. However, the ORDER BY clause, which sorts the result, executes after the SELECT clause, and therefore the columns aliases are already known and you can refer to them. The following query shows the nine employees with sum of the OrderQuantity variable greater than 10,000, sorted in descending order by this sum:

SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) > 10000
ORDER BY EmpTotalQuantity DESC;

You can see the shortened results as follows:

EmployeeKey LastName   EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
282 Mitchell 27229 10367007.4286
283 Carson 27051 10065803.5429
291 Pak 26231 8503338.6472

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image