The simplest form of the SELECT statement
Let's start with the simplest concept of SQL that every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT
and the FROM
clauses. In the select clause, you can use the star character, literally SELECT *
, to denote that you need all columns from a table in the result set. The following code switches to the AdventureWorksDW2017
database context and selects all data from the dbo.DimEmployee
table:
USE AdventureWorksDW2017;
GO
SELECT *
FROM dbo.DimEmployee;
This query returns 296 rows, all employees with all columns.
Note
Using SELECT *
is not recommended in production. Queries with SELECT *
can return an unexpected result when the table structure changes and are also not suitable for good optimization.
Better than using SELECT *
is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only three columns from the table:
SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee;
Here is the shortened result, limited to the first three rows only:
EmployeeKey FirstName LastName
----------- --------- ----------
1 Guy Gilbert
2 Kevin Brown
3 Roberto Tamburello
Object names in SQL Server, such as table and column, can include spaces. Names that include spaces are called delimited identifiers. To make SQL Server properly understand them as column names, you must enclose delimited identifiers in square brackets. However, if you prefer to have names without spaces, or if you use computed expressions in the column list, you can add column aliases. The following code uses an expression in the SELECT
clause to create a calculated column called [Full Name]
, and then uses the INTO
clause to store the data in a table.
The next query retrieves the data from the newly created and populated dbo.EmpFull
table:
SELECT EmployeeKey,
FirstName + ' ' + LastName AS [Full Name]
INTO dbo.EmpFUll
FROM dbo.DimEmployee;
GO
SELECT EmployeeKey, [Full Name]
FROM dbo.EmpFUll;
Here is the partial result:
EmployeeKey Full Name
----------- ------------------
1 Guy Gilbert
2 Kevin Brown
3 Roberto Tamburello
As you have seen before, there are 296 employees. If you check the full result of the first query, you might notice that there is a column named SalesPersonFlag
in the dbo.DimEmployee
table. You might want to check which of the employees are also salespeople. You can filter the results of a query with the WHERE
clause, as the following query shows:
SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee
WHERE SalesPersonFlag = 1;
This query returns 17 rows only.
In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, such as employees in the examples you have seen so far. In order to make result sets meaningful for the business your database supports, most of the time you need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. An inner join returns only rows for which the condition in the join predicate for the two joined tables evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL
keyword. A predicate can thus evaluate to true, false, or NULL. For an inner join, the order of the tables involved in the join is not important.
In the following example, you can see the dbo.DimEmployee
table joined with an inner join to the dbo.FactResellerSales
table:
SELECT e.EmployeeKey, e.FirstName, e.LastName,
fr.SalesAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;
Here are the partial results:
EmployeeKey FirstName LastName SalesAmount
----------- --------- -------- -----------
285 Tsvi Reiter 2024.994
285 Tsvi Reiter 6074.982
285 Tsvi Reiter 2024.994
In the previous query, you can see that table aliases are used. If a column's name is unique across all tables in the query, you can use it without a table name. If not, you need to use table name in front of the column, to avoid ambiguous column names, in the table.column
format. In the previous query, the EmployeeKey
column appears in both tables. Therefore, you need to precede this column name with the table name of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM
clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.
The previous query returned 60,855 rows. It is always recommended to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or said differently, whether the query is written in a logically correct way. If every sale has an employee, as it should have, then the previous query should have returned exactly the number of rows dbo.FactResellerSales
has. You can quickly check the number of rows in the dbo.FactResellerSales
table with the help of the COUNT(*)
aggregate function, as the following query shows:
SELECT COUNT(*) AS ResellerSalesCount
FROM dbo.FactResellerSales;
The result is, as you probably expected, 60,855 rows.
You can join multiple tables in a single query. The following code joins seven tables in a single query. Note that all of the joins are still inner joins. The query returns 60,855 rows again, with at least 1 column from each table involved in the query:
SELECT e.EmployeeKey, e.FirstName, e.LastName,
r.ResellerKey, r.ResellerName,
d.DateKey, d.CalendarYear, d.CalendarQuarter,
p.ProductKey, p.EnglishProductName,
ps.EnglishProductSubcategoryName,
pc.EnglishProductCategoryName,
fr.OrderQuantity, fr.SalesAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
INNER JOIN dbo.DimDate AS d
ON fr.OrderDateKey = d.DateKey
INNER JOIN dbo.DimProduct AS p
ON fr.ProductKey = p.ProductKey
INNER JOIN dbo.DimProductSubcategory AS ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
INNER JOIN dbo.DimProductCategory AS pc
ON ps.ProductCategoryKey = pc.ProductCategoryKey;
In the dbo.Employees
table, there are 17 salespeople. Do all of them have at least one sale, at least one row from the dbo.FactResellerSales
table associated with the employee key of that salesperson? You can check how many distinct employees have sales associated with them with the help of the DISTINCT
keyword:
SELECT DISTINCT fr.EmployeeKey
FROM dbo.FactResellerSales AS fr;
The query returns 17 rows. Now imagine that you would like to list all sales rows together with the employees' data, but you also need to include in the result the employees that are not salespeople, that do now have any row associated with their EmployeeKey
column in the fact table. You can use an outer join to fulfill this task.
With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows, like what you get from an inner join plus the preserved rows. Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN
, the rows from the left table are preserved. If you use RIGHT OUTER JOIN
, the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With FULL OUTER JOIN
, you preserve the rows from both tables, and the order of the tables is not important. The following query uses a left outer join to preserve the rows from the dbo.DimEmployee
table:
SELECT e.EmployeeKey, e.FirstName, e.LastName,
fr.SalesAmount
FROM dbo.DimEmployee AS e
LEFT OUTER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;
The query returns 61,134 rows. Did we get all of the employees in the result? You can check this by checking the distinct EmployeeKey
after the outer join:
SELECT DISTINCT e.EmployeeKey
FROM dbo.DimEmployee AS e
LEFT OUTER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey;
The query returns 296 rows, which is the number of employees.
Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows of an outer join. You can control the join order with parentheses. The following query uses the right outer join to preserve all employees and makes sure that this join is executed after all inner joins:
SELECT e.EmployeeKey, e.FirstName, e.LastName,
r.ResellerKey, r.ResellerName,
d.DateKey, d.CalendarYear, d.CalendarQuarter,
p.ProductKey, p.EnglishProductName,
ps.EnglishProductSubcategoryName,
pc.EnglishProductCategoryName,
fr.OrderQuantity, fr.SalesAmount
FROM (dbo.FactResellerSales AS fr
INNER JOIN dbo.DimReseller AS r
ON r.ResellerKey = fr.ResellerKey
INNER JOIN dbo.DimDate AS d
ON fr.OrderDateKey = d.DateKey
INNER JOIN dbo.DimProduct AS p
ON fr.ProductKey = p.ProductKey
INNER JOIN dbo.DimProductSubcategory AS ps
ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
INNER JOIN dbo.DimProductCategory AS pc
ON ps.ProductCategoryKey = pc.ProductCategoryKey)
RIGHT OUTER JOIN dbo.DimEmployee AS e
ON e.EmployeeKey = fr.EmployeeKey;
The query returns 61,134 rows, as it should. Note that with the usage of the parenthesis, the order of joins is defined in the following way:
Perform all inner joins, with an arbitrary order among them
Execute the left outer join after all of the inner joins
Grouping and aggregating data
Many times, you need to aggregate data in groups. This is where the GROUP BY
clause comes in handy. The following query aggregates the sales data for each employee:
SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey;
The query returns 17 aggregated rows. Here are the results, abbreviated to the first three rows only:
EmployeeKey LastName EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
284 Vargas 11544 3609447.2163
295 Valdez 6898 1790640.2311
281 Blythe 23058 9293903.0055
In the SELECT
clause, you can have only the columns used for grouping, or aggregated columns. That is why the LastName
column in the SELECT
list is used in the MIN()
aggregate function. You need to get a scalar, a single aggregated value for each row for each column not included in the GROUP BY
list.
Sometimes, you need to filter aggregated data. For example, you might need to find only the employees for which the sum of the order quantity did not reach 10,000. You can filter the result set on the aggregated data by using the HAVING
clause:
SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) < 10000;
The query returns eight rows only. Note that you can't use column aliases from the SELECT
clause in any other clause introduced in the previous query. The SELECT
clause logically executes after all other clauses from the query, and the aliases are not known yet. However, the ORDER BY
clause, which sorts the result, executes after the SELECT
clause, and therefore the columns aliases are already known and you can refer to them. The following query shows the nine employees with sum of the OrderQuantity
variable greater than 10,000, sorted in descending order by this sum:
SELECT e.EmployeeKey,
MIN(e.LastName) AS LastName,
SUM(fr.OrderQuantity)AS EmpTotalQuantity,
SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
INNER JOIN dbo.FactResellerSales AS fr
ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) > 10000
ORDER BY EmpTotalQuantity DESC;
You can see the shortened results as follows:
EmployeeKey LastName EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
282 Mitchell 27229 10367007.4286
283 Carson 27051 10065803.5429
291 Pak 26231 8503338.6472