Using SQLite and SQL
SQL, or Structured Query Language, is a programming language for interacting with data in relational databases (sometimes called RDBMSes, for Relational Database Management Systems). SQL has been around since the 1970s and continues to be widely used today. You will likely interact with a SQL database or write SQL queries sooner or later at your workplace if you haven't already. Its speed and the momentum built up over decades of use keep it in widespread use today.
SQL is the standard programming language for interfacing with relational databases, and a large fraction of data and databases use relational models today. In fact, SQL has been approved as an international standard by the International Organization for Standardization (ISO), and the SQL standard is updated every few years. This standard language makes it easier for more people to use these databases, adding to the network effects. NoSQL databases are an alternative to SQL databases, used in situations where we might prefer more flexibility in our data model, or a different model entirely (such as a graph database). For example, if we aren't sure of all the columns or fields we might collect with our data (and they could change frequently over time), a NoSQL document database such as MongoDB may be a better fit than SQL. For big data, NoSQL used to have an advantage over SQL because NoSQL could scale horizontally (that is, by adding more nodes to a cluster). Now, SQL databases have been developed to scale horizontally as well, and cloud services such as Amazon Redshift and Google's BigQuery make scaling SQL databases straightforward.
Much of the world's data is stored in relational databases using SQL. It's important to understand the basics of SQL so we can retrieve our own data for our data science work. We can interact directly with SQL databases through the command line or GUI tools, and through Python packages such as pandas and SQLAlchemy. But first, we will practice SQL with SQLite databases, since the sqlite3 module comes bundled with Python. SQLite is what it sounds like: a lightweight SQL database engine. It lacks the richer functionality of other SQL databases such as MySQL, but is faster and easier to set up and use. It can still hold a lot of data, however, with a maximum database size of around 281 TB.
For our next examples, we'll be using the chinook
database. This is a dataset of songs and purchases from customers and is similar to a dataset of iTunes songs and purchases. We will refer to this dataset as "the iTunes database" from time to time. The chinook.db
file is included in the Chapter 3 folder of this book's GitHub repository, but its source code and data can be found at https://github.com/lerocha/chinook-database. Let's first load the database from the command line and take a look. First, open a terminal or shell and navigate to the directory containing the chinook.db
file. Then run the following command to open the database within a SQLite shell:
sqlite3 chinook.db
Once we are in the SQLite shell, our prompt on the left should look like sqlite>
. Let's try our first command: .tables
. This prints out the tables in the database, which hold the data. Our result should look like the following:
Figure 3.1: The SQLite shell looks like this, and the .tables command lists the tables in the database
SQL databases are organized into tables that can be combined together to extract more information. We can draw an Entity Relationship Diagram (ERD) of the database, which shows the tables and their relationships with one another:
Figure 3.2: This is an ERD of the chinook database and was created with the DbVisualizer software
The preceding ERD shows the tables we listed from the chinook
database. The arrows show which tables connect to other tables, and the highlighted/boxed column names show the common columns that connect them. These connecting columns have matching values that can be paired up to combine rows from the different tables.
Let's try a simple command in our SQLite shell. To retrieve data from a SQL database, we can use the SELECT
command. Let's look at some data from the artists
table:
SELECT * FROM artists LIMIT 5;
By convention, SQL keywords are typed in uppercase, although they don't have to be, since SQL keywords (and, in SQLite, table and column names) are not case-sensitive. So, the SELECT
, FROM
, and LIMIT
commands shown in the preceding snippet are the specific SQL commands we've used. SELECT
tells the SQL shell which columns to select. With *
, we select all columns. Note that from the ERD, we can see that our available columns in the artists
table are ArtistId
and Name
. Next, we choose the table to select our data from – FROM artists
. Lastly, we limit our results to 5 with LIMIT 5
so it doesn't print out all the results (which could be massive).
Importantly, notice that we end the line with a semicolon. Without the semicolon, the SQL shell keeps waiting for more input to run. We should see results printed out like the following:
1|AC/DC
2|Accept
3|Aerosmith
4|Alanis Morissette
5|Alice In Chains
The returned results are minimally formatted within the SQLite shell. Not even the column names are shown, but we know them from the ERD or from looking at the table's list of columns. We can look at the list of columns with the PRAGMA table_info(artists);
command. To exit the SQLite shell when we are done, we can use the .quit or .exit command (Ctrl + D also works on macOS and Linux).
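Before moving on, note that we can also select specific columns by name instead of using *. For example, selecting just the Name column from the artists table should print only the artist names:
SELECT Name FROM artists LIMIT 3;
Based on the results we saw earlier, this should print:
AC/DC
Accept
Aerosmith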
Although there is no official style guide for SQL code like there is for Python in the form of PEP8, there are a few community style guides out there that are generally consistent with each other:
https://about.gitlab.com/handbook/business-ops/data-team/platform/sql-style-guide/
If we have a SQLite file, we can interact with it via the command line or other software such as SQLiteStudio (currently at https://sqlitestudio.pl/). We can also use it from Python with the built-in sqlite3
module. To get started with this method, we import sqlite3
, then connect to the database file, and create a cursor:
import sqlite3
connection = sqlite3.connect('chinook.db')
cursor = connection.cursor()
The string argument to sqlite3.connect()
should be either the relative or absolute path to the database file. The relative path means it is relative to our current working directory (where we are running the Python code from). If we want to use an absolute path, we could supply something like the following:
connection = sqlite3.connect(r'C:\Users\my_username\github\Practical-Data-Science-with-Python\Chapter3\chinook.db')
Notice that the string has the r
character before it. As we mentioned earlier, this stands for raw string and means it treats special characters (such as the backslash, \
) as literal characters. With raw strings, the backslash is simply a backslash, and this allows us to copy and paste file paths from the Windows File Explorer to Python.
The preceding cursor is what allows us to run SQL commands. For example, to run the SELECT
command that we already tried in the SQLite shell, we can use the cursor:
cursor.execute('SELECT * FROM artists LIMIT 5;')
cursor.fetchall()
We use the fetchall
function to retrieve all the results from the query. There are also fetchone
and fetchmany
functions we could use instead, which are described in Python's sqlite3
documentation. These functions retrieve one record (fetchone
) and several records (fetchmany
, which retrieves a number of records we specify).
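For example, here is a quick sketch using the same cursor (the cursor keeps its position in the results, so each call continues where the previous one stopped):
cursor.execute('SELECT * FROM artists LIMIT 5;')
cursor.fetchone()    # returns the first row as a tuple: (1, 'AC/DC')
cursor.fetchmany(2)  # returns the next two rows in a list
cursor.fetchall()    # returns all remaining rows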
When we start executing bigger SQL queries, it helps to format them differently. We can break up a SQL command into multiple lines like so:
query = """
SELECT *
FROM artists
LIMIT 5;
"""
cursor.execute(query)
cursor.fetchall()
We are using a multi-line string for the query
variable with the triple quotes and putting each SQL command on a separate line. Then we give this string variable, query
, to the cursor.execute()
function. Finally, we retrieve the results with fetchall
.
When selecting data, it can be useful to sort it by one of the columns. Let's look at the invoices
table and get the biggest invoices:
cursor.execute(
"""SELECT Total, InvoiceDate
FROM invoices
ORDER BY Total DESC
LIMIT 5;"""
)
cursor.fetchall()
Here, we are using the same connection and cursor as previously. We are selecting a few columns from invoices
: Total
and the InvoiceDate
. We then use the ORDER BY
SQL command and sort by the Total
column, with the addition of DESC
for descending order.
If we use ORDER BY
without the DESC
keyword, the DBMS (Database Management System) sorts the specified column in ascending order (from least to greatest) by default. Text and date columns can be sorted as well: text columns are sorted alphabetically, and dates are sorted from earliest to latest by default.
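For example, here is a sketch that sorts the invoices from earliest to latest by date, using the same cursor as before:
cursor.execute(
"""SELECT Total, InvoiceDate
FROM invoices
ORDER BY InvoiceDate
LIMIT 5;"""
)
cursor.fetchall()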
Another useful SQL command is WHERE
, which allows us to filter data. It's similar to an if
statement in Python. We can filter with Boolean conditions, such as equality (= in standard SQL, although SQLite also accepts ==), inequality (!= or <>), or other comparisons (including less than, <, and greater than, >). Here is an example of getting invoices from Canada:
cursor.execute(
"""SELECT Total, BillingCountry
FROM invoices
WHERE BillingCountry == "Canada"
LIMIT 5;"""
)
cursor.fetchall()
We have a similar SELECT
statement as in the previous examples, except we are filtering by BillingCountry
equal to Canada
using the WHERE
command. Notice we provide Canada
as a string in double quotes, which SQLite accepts, although standard SQL uses single quotes for string literals (double quotes are normally reserved for identifiers such as column names). Since the surrounding Python query string is triple-quoted, we can freely use double quotes inside it. When we fetch all of the results, we see they are only from Canada:
[(8.91, 'Canada'),
(8.91, 'Canada'),
(0.99, 'Canada'),
(1.98, 'Canada'),
(13.86, 'Canada')]
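We can use the comparison operators in the same way. For example, here is a sketch that keeps only invoices with a total greater than 10:
cursor.execute(
"""SELECT Total, BillingCountry
FROM invoices
WHERE Total > 10
LIMIT 5;"""
)
cursor.fetchall()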
As a part of WHERE
, we can filter by pattern matching using LIKE
. This is similar to regular expressions, which we will cover in Chapter 18, Working with Text. We can find any country strings that contain the consecutive letters can (in that order) like so:
cursor.execute(
"""SELECT Total, BillingCountry
FROM invoices
WHERE BillingCountry LIKE "%can%"
LIMIT 5;"""
)
cursor.fetchall()
The LIKE "%can%"
section enables our filtering when combined with the WHERE
statement. The percent signs (%
) mean any number of characters (including none) can appear at the beginning or end of the string. The match is also not case-sensitive, meaning the pattern matches lower- or uppercase letters.
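LIKE also supports the underscore wildcard (_), which matches exactly one character. For example, this sketch matches three-character country names that start with U and end with A, such as USA:
cursor.execute(
"""SELECT Total, BillingCountry
FROM invoices
WHERE BillingCountry LIKE "U_A"
LIMIT 5;"""
)
cursor.fetchall()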
At the time of writing, SQLite's documentation isn't thorough or comprehensive (although it may improve in the future). Other SQL variants have better documentation. For example, Microsoft's SQL documentation on LIKE
describes more of what the command can do: https://docs.microsoft.com/en-us/sql/t-sql/language-elements/like-transact-sql?view=sql-server-ver15. And since SQL is a standard, most SQL variants share many of the same commands and features (but not everything).
It can be useful to group data and get summary statistics. For example, maybe we want to know the total amount of sales by country, and see which countries are bringing in the most revenue. We could accomplish this like so:
cursor.execute(
"""SELECT SUM(Total), BillingCountry
FROM invoices
GROUP BY BillingCountry
ORDER BY SUM(Total) DESC
LIMIT 5;"""
)
cursor.fetchall()
We are using the SUM()
command on the Total
column from the invoices
table here. This sums the Total
column based on the column we group by. Looking at the GROUP BY
clause, we can see we are specifying BillingCountry
to group our data. This means all entries with the same value for BillingCountry
are grouped together, and the sum of the Total
column is then calculated for each group. We are also using ORDER BY
with DESC
to arrange the sum of the Total
column from greatest to least. We can see the USA has the largest total amount of sales:
[(523.0600000000003, 'USA'),
(303.9599999999999, 'Canada'),
(195.09999999999994, 'France'),
(190.09999999999997, 'Brazil'),
(156.48, 'Germany')]
The SUM
function is called an aggregate function. There are others, including MIN
, COUNT
, and so on. SQLite's documentation on these functions can be found at https://sqlite.org/lang_aggfunc.html.
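For example, here is a sketch that applies several aggregate functions to the Total column of the invoices table at once:
cursor.execute(
"""SELECT COUNT(Total), MIN(Total), MAX(Total), AVG(Total)
FROM invoices;"""
)
cursor.fetchall()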
Other SQL variants (such as Microsoft SQL Server, which uses the T-SQL extension) offer more aggregate functions, such as standard deviation.
As we've seen from the ERD of the chinook
database, the data is split into several tables. This is called database normalization, which we do to minimize the space used by the database and to minimize errors when changing data. It also means we need to combine tables in order to extract the data we want. For example, say we want to investigate which tracks were purchased in which countries. Let's start with a simple example, where we look at individual tracks purchased and combine this with the country from the invoice:
query = """
SELECT invoices.BillingCountry, invoice_items.TrackId
FROM invoices
JOIN invoice_items
ON invoices.InvoiceId = invoice_items.InvoiceId
LIMIT 5;
"""
cursor.execute(query)
cursor.fetchall()
We can see in our query that we are selecting TrackId
and BillingCountry
, but we are also specifying the tables these are coming from. Since the invoices
table has the country, but the invoice_items
table has the track ID, we need to combine these two tables. We do this with the JOIN
clause on the third line of the preceding query. After our SELECT … FROM
segment, we then specify the table we want to join with using JOIN invoice_items
. Then, after the ON
keyword, we specify the columns we will use to join the tables. Just like with the SELECT
statement, we need to first specify the table, then use a period, and finally specify the column, in the form of invoices.InvoiceId
here. We also use a single equals sign between the two columns we want to match. By default, this is an INNER JOIN
statement, meaning it only returns rows where there is a match in both tables. This is the most common join, although there are others, including left, right, and full outer joins. These other joins can also return rows without a match in one or both tables, but the vast majority of the time we will use an inner join. So, in this example, each row returned has an exact match between InvoiceId
in the two tables. When using joins, we need to be careful which columns we choose. Almost always, we want to use a "primary key" column from one of the tables. This means the values will be unique in the table that has that column as a primary key. In the ERD from Figure 3.2, the primary and foreign key columns are highlighted in each table. We can also see the primary key by examining a table's information. We can use the PRAGMA
command with table_info()
on the invoices
table to retrieve information on each of the columns:
cursor.execute('PRAGMA table_info(invoices);')
cursor.fetchall()
This query returns the following:
[(0, 'InvoiceId', 'INTEGER', 1, None, 1),
(1, 'CustomerId', 'INTEGER', 1, None, 0),
(2, 'InvoiceDate', 'DATETIME', 1, None, 0),
(3, 'BillingAddress', 'NVARCHAR(70)', 0, None, 0),
(4, 'BillingCity', 'NVARCHAR(40)', 0, None, 0),
(5, 'BillingState', 'NVARCHAR(40)', 0, None, 0),
(6, 'BillingCountry', 'NVARCHAR(40)', 0, None, 0),
(7, 'BillingPostalCode', 'NVARCHAR(10)', 0, None, 0),
(8, 'Total', 'NUMERIC(10,2)', 1, None, 0)]
Each tuple in the results shows a column's index, name, datatype, whether it is required (the NOT NULL flag), its default value, and lastly the pk
field, meaning "primary key." If the value is a 1
, it is a primary key. We can see from the results that InvoiceId
is the primary key column in the invoices
table. Let's expand on our earlier example and count the number of track purchases for each country, sorting the results from greatest to least:
query = """
SELECT
invoice_items.TrackId,
COUNT(invoice_items.TrackId),
invoices.BillingCountry
FROM invoices
JOIN invoice_items
ON invoices.InvoiceId = invoice_items.InvoiceId
GROUP BY invoices.BillingCountry
ORDER BY COUNT(invoice_items.TrackId) DESC
LIMIT 5;
"""
cursor.execute(query)
cursor.fetchall()
Here, we are combining the invoices
and invoice_items
tables again with an inner join, and once again using the InvoiceId
column to join the tables. We are also grouping by country. Finally, we get the track ID and the count of each track ID for each country. The COUNT
function is another aggregate function, which returns the count of items in each group. Finally, we are sorting it by the count of the track ID for each country from greatest to least using ORDER BY
with the DESC
keyword. Our results look like this:
[(99, 494, 'USA'),
(42, 304, 'Canada'),
(234, 190, 'France'),
(738, 190, 'Brazil'),
(2, 152, 'Germany')]
We can see that the US has the most track purchases (494). One caveat: because we grouped only by country, the count is the total number of track purchases per country, and the track ID shown (such as 99 for the US) comes from an arbitrary row in each group, which SQLite allows in aggregate queries (many other databases would reject the query). We also don't know what song that track ID refers to. In order to get results with the song titles, we need to combine our invoices and invoice_items tables with one more table, the tracks table, which has the song titles. We can do this with multiple JOIN
clauses:
query = """
SELECT tracks.Name, COUNT(invoice_items.TrackId), invoices.BillingCountry
FROM invoices
JOIN invoice_items
ON invoices.InvoiceId = invoice_items.InvoiceId
JOIN tracks
ON tracks.TrackId = invoice_items.TrackId
GROUP BY invoices.BillingCountry
ORDER BY COUNT(invoice_items.TrackId) DESC
LIMIT 5;
"""
cursor.execute(query)
cursor.fetchall()
Here, we are combining the invoices
, invoice_items
, and tracks
tables to get the track name, the number of purchases, and the country, and then group the results by country. We can see we first select our columns, specifying which table each of the columns comes from, such as tracks.Name
, to get each song's title from the tracks
table. Then we specify our first table: invoices
. Next, we join with invoice_items
as we did before on the InvoiceId
column. Then we join the tracks
table, which shares a TrackId
column with the invoice_items
table. Finally, we group by country and sort by the count of TrackId
as we did before and limit our results to the top five. The results look like this:
[('Your Time Has Come', 494, 'USA'),
('Right Through You', 304, 'Canada'),
('Morena De Angola', 190, 'France'),
('Admirável Gado Novo', 190, 'Brazil'),
('Balls to the Wall', 152, 'Germany')]
Now each row includes a song title, such as Your Time Has Come for the US. The same caveat applies, though: 494 is the total number of track purchases in the US, and the title comes from an arbitrary row in that country's group, so it is not necessarily the top-selling song there. We might want to get the artist names next, which would require two more joins: the tracks table with the albums table, and the albums table with the artists table. This is left as a challenge to the reader.
To get the true top-selling track for each country, rather than the total purchase count per country shown above, we would need to group by both the country and the track, and then keep the highest-count track within each country. We can do this with temporary tables or subqueries, as sketched below.
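Here is a minimal sketch of the subquery approach, reusing the same connection and cursor (the purchase_count alias is our own naming). The inner query counts purchases for each country and track pair; the outer query keeps the row with the highest count per country. Selecting the bare Name column alongside MAX() relies on a SQLite-specific behavior (the bare column's value comes from the row where the maximum occurs); most other databases would need window functions here:
query = """
SELECT BillingCountry, Name, MAX(purchase_count)
FROM (
    SELECT invoices.BillingCountry AS BillingCountry,
           tracks.Name AS Name,
           COUNT(*) AS purchase_count
    FROM invoices
    JOIN invoice_items
    ON invoices.InvoiceId = invoice_items.InvoiceId
    JOIN tracks
    ON tracks.TrackId = invoice_items.TrackId
    GROUP BY invoices.BillingCountry, tracks.TrackId
)
GROUP BY BillingCountry
ORDER BY MAX(purchase_count) DESC
LIMIT 5;
"""
cursor.execute(query)
cursor.fetchall()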
Once we are finished with a SQLite database in Python, it's usually a good idea to close it like this:
connection.close()
Closing the connection is not strictly required, but it makes our code more explicit (in the spirit of the Zen of Python's "explicit is better than implicit") and thorough. This ensures the connection closes when we expect it to, and can prevent problems if our code becomes part of a larger project.
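If we want to guarantee the connection closes even when an error occurs partway through, one sketch is to wrap the connection in contextlib.closing from the standard library (note that using a sqlite3 connection directly as a context manager manages transactions, not closing):
import sqlite3
from contextlib import closing

with closing(sqlite3.connect('chinook.db')) as connection:
    cursor = connection.cursor()
    cursor.execute('SELECT * FROM artists LIMIT 5;')
    print(cursor.fetchall())
# the connection is closed automatically here, even if an exception was raised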
There are a lot of other SQL commands we have not covered here. In fact, there are entire books focused on learning SQL. If you're interested in learning more beyond these basics we covered, you might look into Learn SQL Database Programming by Josephine Bush from Packt. Another resource is Kaggle's SQL courses at https://www.kaggle.com/learn/overview.
Having a basic grasp of SQL commands will help you greatly in your data science journey. You'll be able to retrieve data from many of the databases out there. We can also use SQLite to store our own data, which we'll cover next.
Creating a SQLite database and storing data
As we saw earlier, we can store data in text or pickle
files. However, as data becomes larger, it can be slow to retrieve our data from a text or pickle
file. We can improve the performance (primarily speed) of reading and writing data by using SQL databases instead, such as a SQLite database. For example, we might use a SQL database to store data we collect from an API or web scraping. SQLite is nice for this, since it saves the entire database to a single file (such as a .db file), which can be shared more easily than exporting data from other SQL database management systems. Let's take a hypothetical example of storing book sales data. We have the dates of sales, the book titles, prices, and quantities:
book_data = [
('12-1-2020', 'Practical Data Science With Python', 19.99, 1),
('12-15-2020', 'Python Machine Learning', 27.99, 1),
('12-17-2020', 'Machine Learning For Algorithmic Trading', 34.99, 1)
]
We can see the data is stored in a list since the outer brackets are square brackets. Each element in the list is a tuple since it is surrounded by parentheses. We have the data in the order of date, book title, price, and quantity. We use a tuple for each data row because that is what the examples in Python's sqlite3
documentation (https://docs.python.org/3/library/sqlite3.html) use, although we can also use a list. Tuples are a good choice when inserting data into a SQL database because they are immutable, meaning they can't be changed.
This means that once each row is a tuple, our data can't be inadvertently changed by a mistake in our code before we enter it into the database, or purposefully changed by a hacker (although it could still be changed earlier while it is a list). We can convert a list to a tuple with tuple()
, in the form of tuple([1, 2, 3])
.
To create our database, we simply connect to a filename and create a cursor:
connection = sqlite3.connect('book_sales.db')
cursor = connection.cursor()
The book_sales.db
file will be created if it does not exist. Then we execute the SQL command for creating a table:
cursor.execute('''CREATE TABLE IF NOT EXISTS book_sales
(date text, book_title text, price real, quantity real)''')
Our query uses a multi-line string with triple single quotes surrounding it. We title our table book_sales
and provide the column names and datatypes for each column next to them, separated by commas. For example, the first column is date
with a text
datatype. We also surround the set of column names and datatypes with parentheses. Once we have created the table, we cannot create it again in the same database, or it throws an error. However, adding the IF NOT EXISTS
statement enables us to run the CREATE TABLE
command, and it will not return an error if the table already exists. If we needed to delete the table to start over, we could use the DROP TABLE book_sales;
command.
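Like CREATE TABLE, the DROP TABLE command accepts an IF EXISTS clause in SQLite, so a sketch for safely resetting the table would be:
cursor.execute('DROP TABLE IF EXISTS book_sales;')
cursor.execute('''CREATE TABLE IF NOT EXISTS book_sales
(date text, book_title text, price real, quantity real)''')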
Once the table is created, we can insert data with the INSERT INTO
command:
cursor.execute("INSERT INTO book_sales VALUES (?, ?, ?, ?)", book_data[0])
connection.commit()
We specify the table name after the INSERT INTO
command, then use the VALUES
keyword, followed by our data to insert into each of the columns. Here, we are using question mark placeholders, which derive their values from the second argument we give to cursor.execute()
. We should have one value or placeholder for each column. Next, we supply book_data
as the second argument to cursor.execute()
, but we only supply the first element of the list with book_data[0]
. The question mark placeholders are then replaced by each value in the tuple from book_data[0]
when the query is executed. We can also use string formatting to place our values into the query, but this is not recommended. String formatting of SQL queries is unsafe because it leaves us open to SQL injection attacks. For example, if a hacker were able to put an arbitrary string into our SQL query, they could insert something like ; DROP TABLE book_sales;
, which would delete the entire book_sales
table.
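To make the contrast concrete, here is a sketch of both patterns, where the title variable is a stand-in for any value that might come from outside our code:
# not recommended: the value is pasted directly into the SQL text
title = 'Python Machine Learning'
cursor.execute(f"SELECT * FROM book_sales WHERE book_title = '{title}'")

# recommended: a placeholder lets sqlite3 handle quoting safely
cursor.execute('SELECT * FROM book_sales WHERE book_title = ?', (title,))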
After inserting the data, we need to call connection.commit()
to save the changes. Otherwise, the data will not persist in the database. We now have the first row of our data in the database. We can check that it's there with a simple SELECT
statement:
cursor.execute('SELECT * FROM book_sales;')
cursor.fetchall()
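This should return the single row we have inserted so far, looking something like:
[('12-1-2020', 'Practical Data Science With Python', 19.99, 1.0)]
The quantity comes back as 1.0 because we declared that column with the real datatype.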
We can also use the executemany()
method to insert several data records at once, like so:
cursor.executemany('INSERT INTO book_sales VALUES (?, ?, ?, ?)', book_data[1:])
connection.commit()
connection.close()
This inserts the rest of our book sales data (the second element through the end of the book_data
list) into our table and saves the changes with commit
. Finally, we close our connection since we are done adding data.
SQLite is a great tool available in Python for saving data. For interacting with data from other SQL database systems, such as Microsoft SQL Server and more, we can use another tool in Python – SQLAlchemy.