Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Science with SQL Server Quick Start Guide

You're reading from   Data Science with SQL Server Quick Start Guide Integrate SQL Server with data science

Arrow left icon
Product type Paperback
Published in Aug 2018
Publisher Packt
ISBN-13 9781789537123
Length 206 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Dejan Sarka Dejan Sarka
Author Profile Icon Dejan Sarka
Dejan Sarka
Arrow right icon
View More author details
Toc

Table of Contents (10) Chapters Close

Preface 1. Writing Queries with T-SQL FREE CHAPTER 2. Introducing R 3. Getting Familiar with Python 4. Data Overview 5. Data Preparation 6. Intermediate Statistics and Graphs 7. Unsupervised Machine Learning 8. Supervised Machine Learning 9. Other Books You May Enjoy

Advanced SELECT techniques

Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only, either one row for the whole input rowset or one row per group. Sometimes, you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, which are queries inside queries.

Introducing subqueries

The next query shows an example of using two subqueries in a single query: in the SELECT clause, a subquery that calculates the sum of quantity for each employee. It returns a scalar value. The subquery refers to the employee key from the outer query. The subquery can't execute without the outer query. This is a correlated subquery. There is another subquery in the FROM clause that calculates the overall quantity for all employees. This query returns a table, although it is a table with a single row and a single column. This query is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table.

Another type of join is used to add the overall total to each detail row. A cross-join is a Cartesian product of two input rowsets: each row from one side is associated with every single row from the other side. No join condition is needed. A cross-join can produce an unwanted, huge result set. For example, if you cross-join just 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, typically you want to avoid a cross-join in production. However, in the example in the following query, 60,855 from the left-side rows is cross-joined to a single row from the subquery, therefore producing only 60,855. Effectively, this means that the overall total column is added to each detail row:

SELECT e.EmployeeKey, e.LastName,
fr.SalesAmount,
(SELECT SUM(fr1.SalesAmount)
FROM dbo.FactResellerSales AS fr1
WHERE fr1.EmployeeKey = e.EmployeeKey)
AS TotalPerEmployee,
frt.GrandTotal
FROM (dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey)
CROSS JOIN
(SELECT SUM(fr2.SalesAmount) AS GrandTotal
FROM dbo.FactResellerSales AS fr2) AS frt
ORDER BY e.EmployeeKey;

Here is the abbreviated output of the previous query:

EmployeeKey LastName   SalesAmount  TotelPerEmployee GrandTotal
----------- ---------- ------------ ---------------- -------------
272 Jiang 1619.52 1092123.8562 80450596.9823
272 Jiang 1445.1898 1092123.8562 80450596.9823
272 Jiang 714.7043 1092123.8562 80450596.9823

In the previous example code, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total in the FROM clause, where it logically executes only once. Although SQL Server can optimize correlated subqueries and convert them to joins, there exists a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using the window functions.

Window functions

The following query uses the SUM window aggregate function to calculate the total over each employee and the overall total. The OVER clause defines the partitions, or the windows of the calculation. The first calculation is partitioned over each employee, meaning that the total quantity per employee is reset to zero for each new employee. The second calculation uses an OVER clause without specifying partitions, meaning the calculation is done over all input rowsets. This query produces exactly the same result as the previous one:

SELECT e.EmployeeKey, e.LastName,
fr.SalesAmount,
SUM(fr.SalesAmount) OVER(PARTITION BY e.EmployeeKey)
AS TotalPerEmployee,
SUM(fr.SalesAmount) OVER()
AS GrandTotal
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
ORDER BY e.EmployeeKey;

Now assume that you need to calculate some statistics of the totals of the employees' orders. You need to calculate the running total amount for employees, and the moving average of this total over the last three employees, ordered by the employee key. This means you need to calculate the totals over employees in advance, and then use aggregate functions on these aggregates. You could do aggregations over employees in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. This subquery is called a common table expression, or CTE.

Common table expressions

CTEs are more readable than derived tables and might be also more efficient. You could use the result of the same CTE multiple times in the outer query. If you use derived tables, you need to define them multiple times if you want to use them multiple times in the outer query. The following query shows the usage of CTE to calculate the total amount for all employees and then just shows the results in the outer query:

WITH EmpTotalCTE AS
(
SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.SalesAmount) AS TotalPerEmployee
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
)
SELECT EmployeeKey, LastName,
TotalPerEmployee
FROM EmpTotalCTE
ORDER BY EmployeeKey;

The query returns 17 rows, one for each employee, with the total sales amount for this employee. Now let's add the running total and the moving average in the outer query:

WITH EmpTotalCTE AS
(
SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.SalesAmount) AS TotalPerEmployee
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
)
SELECT EmployeeKey, LastName,
TotalPerEmployee,
SUM(TotalPerEmployee)
OVER(ORDER BY EmploYeeKey
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW)
AS RunningTotal,
AVG(TotalPerEmployee)
OVER(ORDER BY EmploYeeKey

       ROWS BETWEEN 2 PRECEDING
AND CURRENT ROW)
AS MovingAverage
FROM EmpTotalCTE
ORDER BY EmployeeKey;

Here are the partial results with the first five and last two rows:

EmployeeKey LastName  TotelPerEmployee RunningTotal   MovingAverage
----------- --------- ---------------- -------------- -------------
272 Jiang 1092123.8562 1092123.8562 1092123.8562
281 Blythe 9293903.0055 10386026.8617 5193013.4308
282 Mitchell 10367007.4286 20753034.2903 6917678.0967
283 Carson 10065803.5429 30818837.8332 9908904.659
284 Vargas 3609447.2163 34428285.0495 8014086.0626
…
295 Valdez 1790640.2311 79028786.0571 1425236.791
296 Tsoflias 1421810.9252 80450596.9823 1128325.2026

Note that the running total for the last employee, sorted by the employee key, is the grand total. You can also check whether the running total and moving average are calculated correctly.

In the previous query, you can see that in the OVER() clause, I defined the frame of the calculation of the running total and the moving average for each row. For the running total, the frame is all rows from the first row in the window to the current row, and for the moving average, it is the last three rows, including the current row.

You can use many other functions for window calculations. For example, you can use the ranking functions, such as ROW_NUMBER(), to calculate some rank in the window or in the overall rowset. However, rank can be defined only over some order of the calculation. You can specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, and not the order of the rows returned. A standalone, outer ORDER BY at the end of the query defines the order of the result.

The following query calculates a sequential number, the row number of each row in the output, for each detail row of the input rowset. The row number is calculated once in partitions for each employee and defines the row number or the position of every single reseller for each employee, ordered by the sales amount, in descending order:

WITH EmpResTotalCTE AS
(
SELECT e.EmployeeKey, r.ResellerKey,
MIN(e.LastName) AS LastName,
MIN(r.ResellerName) AS ResellerName,
SUM(fr.SalesAmount) AS EmpResTotal
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
GROUP BY e.EmployeeKey, r.ResellerKey
)
SELECT EmployeeKey, LastName,
ResellerName, EmpResTotal,
ROW_NUMBER()
OVER(PARTITION BY EmployeeKey ORDER BY EmpResTotal DESC)
AS PositionByEmployee
FROM EmpResTotalCTE
ORDER BY EmployeeKey, EmpResTotal DESC;

Here are the partial results:

EmployeeKey LastName  ResellerName      EmpResTotal  PositionByEmployee
----------- -------- -------------- ----------- ------------------
272 Jiang Thorough Parts 198993.3507 1
272 Jiang Sheet Metal 138046.3212 2
272 Jiang Vigorous Exercise 125422.2079 3
272 Jiang Sales and Supply 79994.1743 4
…
281 Blythe Totes & Baskets 463030.757 1
281 Blythe First Bike 437319.784 2
281 Blythe Modular Cycle 389208.6639 3
281 Blythe Fitness Toy Store 375100.3084 4
…
296 Tsoflias Eastside Cycle 2223.7009 30
296 Tsoflias Twin Cycles 1336.23 31
296 Tsoflias Kids and Adults 753.768 32
296 Tsoflias Major Bicycle 200.052 33

In the abbreviated results shown previously, each sales amount total for each reseller and employee combination is unique. That's why all of the row numbers are in the right position – 1 is before 2, which is before 3, and so on. Now imagine that the second and the third reseller (Sheet Metal and Vigorous Exercise) for the first employee (Jiang) would have the same total, for example 138,046.3212. Then you could get the wrong order of row numbers in the results, as shown here:

EmployeeKey LastName  ResellerName      EmpResTotal  PositionByEmployee
----------- -------- -------------- ----------- ------------------
272 Jiang Thorough Parts 198993.3507 1

272         Jiang     Vigorous Exercise 138046.3212  3
272 Jiang Sheet Metal 138046.3212 2
272 Jiang Sales and Supply 79994.1743 4

The order of the result is defined over the sales total, and not over the row number. You can't know in advance which row will get which row number when the order of the calculation is not defined in unique values.

Finding top n rows and using the APPLY operator

Let's find the top-6 sales based on the sales amount. You can do this by using the OFFSET…FETCH clause after the ORDER BY clause:

SELECT SalesOrderNumber,
SalesOrderLineNumber,
SalesAmount
FROM dbo.FactResellerSales
ORDER BY SalesAmount DESC
OFFSET 0 ROWS FETCH NEXT 6 ROWS ONLY;

Here are the results:

SalesOrderNumber  SalesOrderLineNumber SalesAmount
----------------- -------------------- -----------
SO55282 39 27893.619
SO43884 17 27055.7604
SO51131 12 26159.2081
SO43875 12 24938.4761
SO57054 26 23667.8549
SO43875 10 23190.6518

The question that arises is whether order SO43875, line number 10, is the only sale with the sales amount equal to 23190.6518. You could try to execute the previous query again, but with limiting the output to the first seven rows, then eight rows, and so on. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way, you don't know the number of the rows in the output in advance. The following query shows this approach:

SELECT TOP 6 WITH TIES
SalesOrderNumber, SalesOrderLineNumber,
SalesAmount
FROM dbo.FactResellerSales
ORDER BY SalesAmount DESC;

In the results of the last query, you get seven rows:

SalesOrderNumber  SalesOrderLineNumber SalesAmount
----------------- -------------------- -----------
SO55282 39 27893.619
SO43884 17 27055.7604
SO51131 12 26159.2081
SO43875 12 24938.4761
SO57054 26 23667.8549
SO44795 18 23190.6518
SO43875 10 23190.6518

The next task is to get the top three resellers by amount for a specific employee. Here is the query that returns the top four resellers for the employee with the employee key equal to 272:

SELECT TOP 3
fr.EmployeeKey, fr.ResellerKey,
SUM(fr.SalesAmount) AS EmpResTotal
FROM dbo.FactResellerSales AS fr
WHERE fr.EmployeeKey = 272
GROUP BY fr.EmployeeKey, fr.ResellerKey
ORDER BY EmpResTotal DESC;

You need to perform the calculation for each employee. The APPLY Transact-SQL operator comes in handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset if the tabular expression on the right side does not return any row. The OUTER APPLY version preserves the row from the left side, even is the tabular expression on the right side does not return any row, similar to LEFT OUTER JOIN. Of course, columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate top three resellers by amount for each employee that actually does have some sales:

SELECT e.EmployeeKey, e.LastName,
fr1.ResellerKey, fr1.EmpResTotal
FROM dbo.DimEmployee AS e
CROSS APPLY
(SELECT TOP 3
fr.EmployeeKey, fr.ResellerKey,
SUM(fr.SalesAmount) AS EmpResTotal
FROM dbo.FactResellerSales AS fr

  WHERE fr.EmployeeKey = e.EmployeeKey
GROUP BY fr.EmployeeKey, fr.ResellerKey
ORDER BY EmpResTotal DESC) AS fr1
ORDER BY e.EmployeeKey, fr1.EmpResTotal DESC;

The query returns 51 rows. You can see the abbreviated results here:

EmployeeKey LastName   ResellerKey EmpResTotal
----------- ---------- ----------- -----------
272 Jiang 433 198993.3507
272 Jiang 436 138046.3212
272 Jiang 678 125422.2079
281 Blythe 328 463030.757
281 Blythe 670 437319.784
281 Blythe 4 389208.6639
…
296 Tsoflias 573 163398.0205
296 Tsoflias 87 148996.5063
296 Tsoflias 393 145407.7395

Since this was the last query in this chapter, you can clean up your demo database with the following code:

DROP TABLE dbo.EmpFUll;
GO

That's it for T-SQL for now. You will use this knowledge a lot in the forthcoming chapters for concrete data science tasks.

You have been reading a chapter from
Data Science with SQL Server Quick Start Guide
Published in: Aug 2018
Publisher: Packt
ISBN-13: 9781789537123
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime