Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Data Science with SQL Server Quick Start Guide
Data Science with SQL Server Quick Start Guide

Data Science with SQL Server Quick Start Guide: Integrate SQL Server with data science

eBook
$20.98 $29.99
Paperback
$36.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Table of content icon View table of contents Preview book icon Preview Book

Data Science with SQL Server Quick Start Guide

Writing Queries with T-SQL

This book is intended for any SQL Server developer or database administrator (DBA) who wants to start working in the data science field. In addition, this book is also aimed at existing data scientists who want to start using SQL Server with related services and tools. I will use, and show examples in, three programming languages in this book: Transact-SQL (or T-SQL), R, and Python. Therefore, it makes sense to start with a brief introduction of the three languages. This is what the first three chapters are about. If you are already a SQL Server developer, proficient in writing T-SQL queries, you can simply skip the first chapter. If you are already working with R, skip the second chapter. If you are familiar with Python, please feel free to skip the third chapter.

This first chapter is not a comprehensive reference guide to T-SQL; I will focus on the mighty SELECT statement only, the statement you need to use immediately when your data is located in a SQL Server database. However, besides the basic clauses, I will also explain advanced techniques, such as window functions, common table expressions, and the APPLY operator.

This chapter will cover the following points:

  • Core Transact-SQL SELECT statement elements
  • Advanced SELECT techniques

Before starting – installing SQL Server

SQL Server setup 

You just start SQL Server setup, and then from the Feature Selection page select the following:

  • Database Engine Services
  • Underneath Machine Learning (ML) Services (In-Database)
  • With both languages, R and Python, selected, like you can see in the next screenshot

After that, all you need is client tools, and you can start writing the code. The following screenshot shows the SQL Server setup Feature Selection page with the appropriate features selected:

Figure 1.1: SQL Server Setup feature selection

The next step is to install client tools. Of course, you need SQL Server Management Studio (SSMS). You can download it at https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017. If you are not familiar with this tool, please use the SSMS at https://docs.microsoft.com/en-us/sql/ssms/tutorials/tutorial-sql-server-management-studio?view=sql-server-2017 to learn the basics about this tool.

In the next two chapters, I will explain what other tools you need to efficiently write R and Python code, respectively.

Finally, you need some demo data. I will mostly use the AdventureWorksDW2017 demo database in this book. You can find this demo database and other Microsoft samples at https://github.com/Microsoft/sql-server-samples/releases. I will point you to the appropriate sources when I use any other demo data in this book.

After you install everything mentioned so for, you are ready to start learning or renewing the knowledge of the T-SQL SELECT statement.

Core T-SQL SELECT statement elements

You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage:

  • SELECT: Defines the columns returned, or a projection of all table columns
  • FROM: Lists the tables used in the query and how they are associated, or joined
  • WHERE: Filters the data to return only the rows that satisfy the condition in the predicate
  • GROUP BY: Defines the groups over which the data is aggregated
  • HAVING: Filters the data after the grouping with conditions that refer to aggregations
  • ORDER BY: Sorts the rows returned to the client application

The simplest form of the SELECT statement

Let's start with the simplest concept of SQL that every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the AdventureWorksDW2017 database context and selects all data from the dbo.DimEmployee table:

USE AdventureWorksDW2017;
GO
SELECT *
FROM dbo.DimEmployee;

This query returns 296 rows, all employees with all columns.

Using SELECT * is not recommended in production. Queries with SELECT * can return an unexpected result when the table structure changes and are also not suitable for good optimization.

Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only three columns from the table:

SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee;

Here is the shortened result, limited to the first three rows only:

EmployeeKey FirstName LastName
----------- --------- ----------
1 Guy Gilbert
2 Kevin Brown
3 Roberto Tamburello

Object names in SQL Server, such as table and column, can include spaces. Names that include spaces are called delimited identifiers. To make SQL Server properly understand them as column names, you must enclose delimited identifiers in square brackets. However, if you prefer to have names without spaces, or if you use computed expressions in the column list, you can add column aliases. The following code uses an expression in the SELECT clause to create a calculated column called [Full Name], and then uses the INTO clause to store the data in a table.

The next query retrieves the data from the newly created and populated dbo.EmpFull table:

SELECT EmployeeKey,
FirstName + ' ' + LastName AS [Full Name]
INTO dbo.EmpFUll
FROM dbo.DimEmployee;
GO
SELECT EmployeeKey, [Full Name]
FROM dbo.EmpFUll;

Here is the partial result:

EmployeeKey Full Name
----------- ------------------
1 Guy Gilbert
2 Kevin Brown
3 Roberto Tamburello

As you have seen before, there are 296 employees. If you check the full result of the first query, you might notice that there is a column named SalesPersonFlag in the dbo.DimEmployee table. You might want to check which of the employees are also salespeople. You can filter the results of a query with the WHERE clause, as the following query shows:

SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee
WHERE SalesPersonFlag = 1;

This query returns 17 rows only.

Joining multiple tables

In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, such as employees in the examples you have seen so far. In order to make result sets meaningful for the business your database supports, most of the time you need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. An inner join returns only rows for which the condition in the join predicate for the two joined tables evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false, or NULL. For an inner join, the order of the tables involved in the join is not important.

In the following example, you can see the dbo.DimEmployee table joined with an inner join to the dbo.FactResellerSales table:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
fr.SalesAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;

Here are the partial results:

EmployeeKey FirstName LastName SalesAmount
----------- --------- -------- -----------
285 Tsvi Reiter 2024.994
285 Tsvi Reiter 6074.982
285 Tsvi Reiter 2024.994

In the previous query, you can see that table aliases are used. If a column's name is unique across all tables in the query, you can use it without a table name. If not, you need to use table name in front of the column, to avoid ambiguous column names, in the table.column format. In the previous query, the EmployeeKey column appears in both tables. Therefore, you need to precede this column name with the table name of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.

The previous query returned 60,855 rows. It is always recommended to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or said differently, whether the query is written in a logically correct way. If every sale has an employee, as it should have, then the previous query should have returned exactly the number of rows dbo.FactResellerSales has. You can quickly check the number of rows in the dbo.FactResellerSales table with the help of the COUNT(*) aggregate function, as the following query shows:

SELECT COUNT(*) AS ResellerSalesCount
FROM dbo.FactResellerSales;

The result is, as you probably expected, 60,855 rows.

You can join multiple tables in a single query. The following code joins seven tables in a single query. Note that all of the joins are still inner joins. The query returns 60,855 rows again, with at least 1 column from each table involved in the query:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
r.ResellerKey, r.ResellerName,
d.DateKey, d.CalendarYear, d.CalendarQuarter,
p.ProductKey, p.EnglishProductName,
ps.EnglishProductSubcategoryName,
pc.EnglishProductCategoryName,
fr.OrderQuantity, fr.SalesAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
INNER JOIN dbo.DimDate AS d
ON fr.OrderDateKey = d.DateKey
INNER JOIN dbo.DimProduct AS p
ON fr.ProductKey = p.ProductKey
INNER JOIN dbo.DimProductSubcategory AS ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
INNER JOIN dbo.DimProductCategory AS pc
ON ps.ProductCategoryKey = pc.ProductCategoryKey;

In the dbo.Employees table, there are 17 salespeople. Do all of them have at least one sale, at least one row from the dbo.FactResellerSales table associated with the employee key of that salesperson? You can check how many distinct employees have sales associated with them with the help of the DISTINCT keyword:

SELECT DISTINCT fr.EmployeeKey
FROM dbo.FactResellerSales AS fr;

The query returns 17 rows. Now imagine that you would like to list all sales rows together with the employees' data, but you also need to include in the result the employees that are not salespeople, that do now have any row associated with their EmployeeKey column in the fact table. You can use an outer join to fulfill this task.

With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows, like what you get from an inner join plus the preserved rows. Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, the rows from the left table are preserved. If you use RIGHT OUTER JOIN, the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query uses a left outer join to preserve the rows from the dbo.DimEmployee table:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
fr.SalesAmount
FROM dbo.DimEmployee AS e
LEFT OUTER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;

The query returns 61,134 rows. Did we get all of the employees in the result? You can check this by checking the distinct EmployeeKey after the outer join:

SELECT DISTINCT e.EmployeeKey
FROM dbo.DimEmployee AS e
LEFT OUTER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;

The query returns 296 rows, which is the number of employees.

Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows of an outer join. You can control the join order with parentheses. The following query uses the right outer join to preserve all employees and makes sure that this join is executed after all inner joins:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
r.ResellerKey, r.ResellerName,
d.DateKey, d.CalendarYear, d.CalendarQuarter,
p.ProductKey, p.EnglishProductName,
ps.EnglishProductSubcategoryName,
pc.EnglishProductCategoryName,
fr.OrderQuantity, fr.SalesAmount
FROM (dbo.FactResellerSales AS fr
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
INNER JOIN dbo.DimDate AS d
ON fr.OrderDateKey = d.DateKey
INNER JOIN dbo.DimProduct AS p
ON fr.ProductKey = p.ProductKey
INNER JOIN dbo.DimProductSubcategory AS ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
INNER JOIN dbo.DimProductCategory AS pc
ON ps.ProductCategoryKey = pc.ProductCategoryKey)
RIGHT OUTER JOIN dbo.DimEmployee AS e
ON e.EmployeeKey = fr.EmployeeKey;

The query returns 61,134 rows, as it should. Note that with the usage of the parenthesis, the order of joins is defined in the following way:

  • Perform all inner joins, with an arbitrary order among them

  • Execute the left outer join after all of the inner joins

Grouping and aggregating data

Many times, you need to aggregate data in groups. This is where the GROUP BY clause comes in handy. The following query aggregates the sales data for each employee:

SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey;

The query returns 17 aggregated rows. Here are the results, abbreviated to the first three rows only:

EmployeeKey LastName   EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
284 Vargas 11544 3609447.2163
295 Valdez 6898 1790640.2311
281 Blythe 23058 9293903.0055

In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. That is why the LastName column in the SELECT list is used in the MIN() aggregate function. You need to get a scalar, a single aggregated value for each row for each column not included in the GROUP BY list.

Sometimes, you need to filter aggregated data. For example, you might need to find only the employees for which the sum of the order quantity did not reach 10,000. You can filter the result set on the aggregated data by using the HAVING clause:

SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) < 10000;

The query returns eight rows only. Note that you can't use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all other clauses from the query, and the aliases are not known yet. However, the ORDER BY clause, which sorts the result, executes after the SELECT clause, and therefore the columns aliases are already known and you can refer to them. The following query shows the nine employees with sum of the OrderQuantity variable greater than 10,000, sorted in descending order by this sum:

SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) > 10000
ORDER BY EmpTotalQuantity DESC;

You can see the shortened results as follows:

EmployeeKey LastName   EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
282 Mitchell 27229 10367007.4286
283 Carson 27051 10065803.5429
291 Pak 26231 8503338.6472

Advanced SELECT techniques

Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only, either one row for the whole input rowset or one row per group. Sometimes, you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, which are queries inside queries.

Introducing subqueries

The next query shows an example of using two subqueries in a single query: in the SELECT clause, a subquery that calculates the sum of quantity for each employee. It returns a scalar value. The subquery refers to the employee key from the outer query. The subquery can't execute without the outer query. This is a correlated subquery. There is another subquery in the FROM clause that calculates the overall quantity for all employees. This query returns a table, although it is a table with a single row and a single column. This query is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table.

Another type of join is used to add the overall total to each detail row. A cross-join is a Cartesian product of two input rowsets: each row from one side is associated with every single row from the other side. No join condition is needed. A cross-join can produce an unwanted, huge result set. For example, if you cross-join just 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, typically you want to avoid a cross-join in production. However, in the example in the following query, 60,855 from the left-side rows is cross-joined to a single row from the subquery, therefore producing only 60,855. Effectively, this means that the overall total column is added to each detail row:

SELECT e.EmployeeKey, e.LastName,
fr.SalesAmount,
(SELECT SUM(fr1.SalesAmount)
FROM dbo.FactResellerSales AS fr1
WHERE fr1.EmployeeKey = e.EmployeeKey)
AS TotalPerEmployee,
frt.GrandTotal
FROM (dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey)
CROSS JOIN
(SELECT SUM(fr2.SalesAmount) AS GrandTotal
FROM dbo.FactResellerSales AS fr2) AS frt
ORDER BY e.EmployeeKey;

Here is the abbreviated output of the previous query:

EmployeeKey LastName   SalesAmount  TotelPerEmployee GrandTotal
----------- ---------- ------------ ---------------- -------------
272 Jiang 1619.52 1092123.8562 80450596.9823
272 Jiang 1445.1898 1092123.8562 80450596.9823
272 Jiang 714.7043 1092123.8562 80450596.9823

In the previous example code, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total in the FROM clause, where it logically executes only once. Although SQL Server can optimize correlated subqueries and convert them to joins, there exists a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using the window functions.

Window functions

The following query uses the SUM window aggregate function to calculate the total over each employee and the overall total. The OVER clause defines the partitions, or the windows of the calculation. The first calculation is partitioned over each employee, meaning that the total quantity per employee is reset to zero for each new employee. The second calculation uses an OVER clause without specifying partitions, meaning the calculation is done over all input rowsets. This query produces exactly the same result as the previous one:

SELECT e.EmployeeKey, e.LastName,
fr.SalesAmount,
SUM(fr.SalesAmount) OVER(PARTITION BY e.EmployeeKey)
AS TotalPerEmployee,
SUM(fr.SalesAmount) OVER()
AS GrandTotal
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
ORDER BY e.EmployeeKey;

Now assume that you need to calculate some statistics of the totals of the employees' orders. You need to calculate the running total amount for employees, and the moving average of this total over the last three employees, ordered by the employee key. This means you need to calculate the totals over employees in advance, and then use aggregate functions on these aggregates. You could do aggregations over employees in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. This subquery is called a common table expression, or CTE.

Common table expressions

CTEs are more readable than derived tables and might be also more efficient. You could use the result of the same CTE multiple times in the outer query. If you use derived tables, you need to define them multiple times if you want to use them multiple times in the outer query. The following query shows the usage of CTE to calculate the total amount for all employees and then just shows the results in the outer query:

WITH EmpTotalCTE AS
(
SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.SalesAmount) AS TotalPerEmployee
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
)
SELECT EmployeeKey, LastName,
TotalPerEmployee
FROM EmpTotalCTE
ORDER BY EmployeeKey;

The query returns 17 rows, one for each employee, with the total sales amount for this employee. Now let's add the running total and the moving average in the outer query:

WITH EmpTotalCTE AS
(
SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.SalesAmount) AS TotalPerEmployee
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
)
SELECT EmployeeKey, LastName,
TotalPerEmployee,
SUM(TotalPerEmployee)
OVER(ORDER BY EmploYeeKey
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW)
AS RunningTotal,
AVG(TotalPerEmployee)
OVER(ORDER BY EmploYeeKey

       ROWS BETWEEN 2 PRECEDING
AND CURRENT ROW)
AS MovingAverage
FROM EmpTotalCTE
ORDER BY EmployeeKey;

Here are the partial results with the first five and last two rows:

EmployeeKey LastName  TotelPerEmployee RunningTotal   MovingAverage
----------- --------- ---------------- -------------- -------------
272 Jiang 1092123.8562 1092123.8562 1092123.8562
281 Blythe 9293903.0055 10386026.8617 5193013.4308
282 Mitchell 10367007.4286 20753034.2903 6917678.0967
283 Carson 10065803.5429 30818837.8332 9908904.659
284 Vargas 3609447.2163 34428285.0495 8014086.0626

295 Valdez 1790640.2311 79028786.0571 1425236.791
296 Tsoflias 1421810.9252 80450596.9823 1128325.2026

Note that the running total for the last employee, sorted by the employee key, is the grand total. You can also check whether the running total and moving average are calculated correctly.

In the previous query, you can see that in the OVER() clause, I defined the frame of the calculation of the running total and the moving average for each row. For the running total, the frame is all rows from the first row in the window to the current row, and for the moving average, it is the last three rows, including the current row.

You can use many other functions for window calculations. For example, you can use the ranking functions, such as ROW_NUMBER(), to calculate some rank in the window or in the overall rowset. However, rank can be defined only over some order of the calculation. You can specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, and not the order of the rows returned. A standalone, outer ORDER BY at the end of the query defines the order of the result.

The following query calculates a sequential number, the row number of each row in the output, for each detail row of the input rowset. The row number is calculated once in partitions for each employee and defines the row number or the position of every single reseller for each employee, ordered by the sales amount, in descending order:

WITH EmpResTotalCTE AS
(
SELECT e.EmployeeKey, r.ResellerKey,
MIN(e.LastName) AS LastName,
MIN(r.ResellerName) AS ResellerName,
SUM(fr.SalesAmount) AS EmpResTotal
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
GROUP BY e.EmployeeKey, r.ResellerKey
)
SELECT EmployeeKey, LastName,
ResellerName, EmpResTotal,
ROW_NUMBER()
OVER(PARTITION BY EmployeeKey ORDER BY EmpResTotal DESC)
AS PositionByEmployee
FROM EmpResTotalCTE
ORDER BY EmployeeKey, EmpResTotal DESC;

Here are the partial results:

EmployeeKey LastName  ResellerName      EmpResTotal  PositionByEmployee
----------- -------- -------------- ----------- ------------------
272 Jiang Thorough Parts 198993.3507 1
272 Jiang Sheet Metal 138046.3212 2
272 Jiang Vigorous Exercise 125422.2079 3
272 Jiang Sales and Supply 79994.1743 4

281 Blythe Totes & Baskets 463030.757 1
281 Blythe First Bike 437319.784 2
281 Blythe Modular Cycle 389208.6639 3
281 Blythe Fitness Toy Store 375100.3084 4

296 Tsoflias Eastside Cycle 2223.7009 30
296 Tsoflias Twin Cycles 1336.23 31
296 Tsoflias Kids and Adults 753.768 32
296 Tsoflias Major Bicycle 200.052 33

In the abbreviated results shown previously, each sales amount total for each reseller and employee combination is unique. That's why all of the row numbers are in the right position – 1 is before 2, which is before 3, and so on. Now imagine that the second and the third reseller (Sheet Metal and Vigorous Exercise) for the first employee (Jiang) would have the same total, for example 138,046.3212. Then you could get the wrong order of row numbers in the results, as shown here:

EmployeeKey LastName  ResellerName      EmpResTotal  PositionByEmployee
----------- -------- -------------- ----------- ------------------
272 Jiang Thorough Parts 198993.3507 1

272         Jiang     Vigorous Exercise 138046.3212  3
272 Jiang Sheet Metal 138046.3212 2
272 Jiang Sales and Supply 79994.1743 4

The order of the result is defined over the sales total, and not over the row number. You can't know in advance which row will get which row number when the order of the calculation is not defined in unique values.

Finding top n rows and using the APPLY operator

Let's find the top-6 sales based on the sales amount. You can do this by using the OFFSET…FETCH clause after the ORDER BY clause:

SELECT SalesOrderNumber,
SalesOrderLineNumber,
SalesAmount
FROM dbo.FactResellerSales
ORDER BY SalesAmount DESC
OFFSET 0 ROWS FETCH NEXT 6 ROWS ONLY;

Here are the results:

SalesOrderNumber  SalesOrderLineNumber SalesAmount
----------------- -------------------- -----------
SO55282 39 27893.619
SO43884 17 27055.7604
SO51131 12 26159.2081
SO43875 12 24938.4761
SO57054 26 23667.8549
SO43875 10 23190.6518

The question that arises is whether order SO43875, line number 10, is the only sale with the sales amount equal to 23190.6518. You could try to execute the previous query again, but with limiting the output to the first seven rows, then eight rows, and so on. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way, you don't know the number of the rows in the output in advance. The following query shows this approach:

SELECT TOP 6 WITH TIES
SalesOrderNumber, SalesOrderLineNumber,
SalesAmount
FROM dbo.FactResellerSales
ORDER BY SalesAmount DESC;

In the results of the last query, you get seven rows:

SalesOrderNumber  SalesOrderLineNumber SalesAmount
----------------- -------------------- -----------
SO55282 39 27893.619
SO43884 17 27055.7604
SO51131 12 26159.2081
SO43875 12 24938.4761
SO57054 26 23667.8549
SO44795 18 23190.6518
SO43875 10 23190.6518

The next task is to get the top three resellers by amount for a specific employee. Here is the query that returns the top four resellers for the employee with the employee key equal to 272:

SELECT TOP 3
fr.EmployeeKey, fr.ResellerKey,
SUM(fr.SalesAmount) AS EmpResTotal
FROM dbo.FactResellerSales AS fr
WHERE fr.EmployeeKey = 272
GROUP BY fr.EmployeeKey, fr.ResellerKey
ORDER BY EmpResTotal DESC;

You need to perform the calculation for each employee. The APPLY Transact-SQL operator comes in handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset if the tabular expression on the right side does not return any row. The OUTER APPLY version preserves the row from the left side, even is the tabular expression on the right side does not return any row, similar to LEFT OUTER JOIN. Of course, columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate top three resellers by amount for each employee that actually does have some sales:

SELECT e.EmployeeKey, e.LastName,
fr1.ResellerKey, fr1.EmpResTotal
FROM dbo.DimEmployee AS e
CROSS APPLY
(SELECT TOP 3
fr.EmployeeKey, fr.ResellerKey,
SUM(fr.SalesAmount) AS EmpResTotal
FROM dbo.FactResellerSales AS fr

  WHERE fr.EmployeeKey = e.EmployeeKey
GROUP BY fr.EmployeeKey, fr.ResellerKey
ORDER BY EmpResTotal DESC) AS fr1
ORDER BY e.EmployeeKey, fr1.EmpResTotal DESC;

The query returns 51 rows. You can see the abbreviated results here:

EmployeeKey LastName   ResellerKey EmpResTotal
----------- ---------- ----------- -----------
272 Jiang 433 198993.3507
272 Jiang 436 138046.3212
272 Jiang 678 125422.2079
281 Blythe 328 463030.757
281 Blythe 670 437319.784
281 Blythe 4 389208.6639

296 Tsoflias 573 163398.0205
296 Tsoflias 87 148996.5063
296 Tsoflias 393 145407.7395

Since this was the last query in this chapter, you can clean up your demo database with the following code:

DROP TABLE dbo.EmpFUll;
GO

That's it for T-SQL for now. You will use this knowledge a lot in the forthcoming chapters for concrete data science tasks.

Summary

In this chapter, you were given a quick introduction to the mighty T-SQL SELECT statement. You are ready to query the data you have stored in a SQL Server database. However, T-SQL is a language that is specialized to work with data, to query and modify it. It logically operates on a whole dataset at once. When operating on a whole set of data, T-SQL is extremely efficient. However, it is not very efficient for advanced mathematical operations inside a single row and between rows in a row-by-row manner. This is not very appropriate for the data science tasks, where  you often need to perform such advanced mathematical analyses. Therefore, we need another language. In the next chapter, you will learn about the first option, the R language.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Use the features of SQL Server 2017 to implement the data science project life cycle
  • Leverage the power of R and Python to design and develop efficient data models
  • find unique insights from your data with powerful techniques for data preprocessing and analysis

Description

SQL Server only started to fully support data science with its two most recent editions. If you are a professional from both worlds, SQL Server and data science, and interested in using SQL Server and Machine Learning (ML) Services for your projects, then this is the ideal book for you. This book is the ideal introduction to data science with Microsoft SQL Server and In-Database ML Services. It covers all stages of a data science project, from businessand data understanding,through data overview, data preparation, modeling and using algorithms, model evaluation, and deployment. You will learn to use the engines and languages that come with SQL Server, including ML Services with R and Python languages and Transact-SQL. You will also learn how to choose which algorithm to use for which task, and learn the working of each algorithm.

Who is this book for?

SQL Server professionals who want to start with data science, and data scientists who would like to start using SQL Server in their projects will find this book to be useful. Prior exposure to SQL Server will be helpful.

What you will learn

  • • Use the popular programming languages,T-SQL, R, and Python, for data science
  • • Understand your data with queries and introductory statistics
  • • Create and enhance the datasets for ML
  • • Visualize and analyze data using basic and advanced graphs
  • • Explore ML using unsupervised and supervised models
  • • Deploy models in SQL Server and perform predictions
Estimated delivery fee Deliver to South Africa

Standard delivery 10 - 13 business days

$12.95

Premium delivery 3 - 6 business days

$34.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Aug 31, 2018
Length: 206 pages
Edition : 1st
Language : English
ISBN-13 : 9781789537123
Vendor :
Microsoft
Category :
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Estimated delivery fee Deliver to South Africa

Standard delivery 10 - 13 business days

$12.95

Premium delivery 3 - 6 business days

$34.95
(Includes tracking information)

Product Details

Publication date : Aug 31, 2018
Length: 206 pages
Edition : 1st
Language : English
ISBN-13 : 9781789537123
Vendor :
Microsoft
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 146.97
SQL Server 2017 Developer???s Guide
$60.99
Hands-On Data Science with SQL Server 2017
$48.99
Data Science with SQL Server Quick Start Guide
$36.99
Total $ 146.97 Stars icon

Table of Contents

9 Chapters
Writing Queries with T-SQL Chevron down icon Chevron up icon
Introducing R Chevron down icon Chevron up icon
Getting Familiar with Python Chevron down icon Chevron up icon
Data Overview Chevron down icon Chevron up icon
Data Preparation Chevron down icon Chevron up icon
Intermediate Statistics and Graphs Chevron down icon Chevron up icon
Unsupervised Machine Learning Chevron down icon Chevron up icon
Supervised Machine Learning Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
(1 Ratings)
5 star 0%
4 star 100%
3 star 0%
2 star 0%
1 star 0%
Rubén Dec 20, 2018
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
First of all, I want to explain that I have been given a copy of this book and that Dejan works in the same company I work but that is not conditioning my review. I have also read some other books authored or coauthored by Dejan covering topics like T-SQL programming & DW design.Cons----Maybe my expectations were too high or I'm just used to materials which go deeper on the topics than this book. This book pretends to be an introduction to data science using SQL Server an ML Services and it also pretends to cover all the stages of a data science project, all in a book of fewer than 200 pages, which seems a bit too optimistic.The end half of the book is where the most interesting topics are presented but everything seems a bit "accelerated" and some complex concepts are directly applied without many previous explanations. The number of different packages and libraries used in the examples increases a lot page after page making a bit complicated to follow the code (highly recommended to have side by side the libraries documentation). I also find somehow distracting to be jumping from one language to the other (R vs Python) instead of having the examples in both languages. I guess that some graphics engine/libraries and algorithms are only implemented on R or Python libraries so that jumping is not avoidable.Pros-----I think this book is useful for data professionals which had no previous knowledge of data science and R/Python. The first three chapters, a third of the book, are a basic introduction to T-SQL, R, and Python. Then we find an introduction to the data science project life cycle plus some basic statistical concepts. The second half of the book is the most interesting part from my point of view.In the second half of the book, we will get some insights on some discretization algorithms first and how to explore and measure the associations between variables. The unsupervised & supervised algorithms and methods presented are a good selection and they seem useful for lots of real scenarios. Most of the time you will need a "trial and error" approximation to see which one fits better, generate a better model, gets more accurate predictions, etc. I guess that an experienced data scientist will do a good initial "filtering" intuitively before and he/she will select the ones more promising based on previous experience & knowledge.Maybe the potential readers expect more on some topics but this book succeeded in its main goal; getting into the world of data science with SQL Server and its most used languages and algorithms.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela