Key data concept – data queries
Our final data concept in this chapter is the idea of a data query, which is usually achieved using some flavor of SQL, a standard language to interact with databases. Although SQL varies a little from database to database, the basic syntax is fairly consistent. If you don't already have experience using SQL and you want to develop your data analysis skills, you will have to learn more about this key analytics tool.
When you work with text files and spreadsheets, the only practical way you can manipulate large amounts of data is at the column level, removing entire columns that you don't want. What if you want to manipulate the rows in a large dataset? In a spreadsheet, you could of course, use filters to remove rows.
Overall, however, manipulating data in this way is more tedious and potentially more error prone than using a single SQL statement. What if you want to involve more than one table in your "filter"? You can also take advantage of the enormous power of databases to handle large amounts of data and process complex queries.
Anatomy of a SQL statement
A basic SQL statement has just three elements, highlighted in the following code in all caps:
SELECT column_name_1, column_name_2, …, column_name_n FROM schema_name.data_table_name WHERE condition
You need to provide a schema_name
. Remember when we connected to a database, we saw a set of logical containers into which the tables were organized. These are examples of schemas. The terminology schema may mean slightly different things in different databases, but most organize the tables into a set of logical containers, and you need to prefix table names in your query with the container name using a dot to separate the two elements.
Often, a database administrator will create what is known as a view. These are pre-written, often quite complex, queries spanning multiple tables that define a particular dataset. As an end user, you can use them just like tables.
Tip
If you have a complex data requirement that you feel is beyond your SQL skills, talk to a database administrator or developer and get them to create a view for you. You can then select from the view as though it were a single table.
The SELECT
and FROM
clauses are mandatory, but the WHERE
condition is optional and is just a logical expression to limit the data that is returned. For example:
WHERE column_name_1 = 'some text' AND column_name_2 > 4
If you want to include more than one table in your query, you will have to use what is known as a join.
SELECT a.column_name_1, b.column_name_2, …, etc FROM schema_name.data_table_name_1 a JOIN schema_name.data_table_name_2 b on a.column_name_x = b.column_name_y WHERE condition
Note how the aliases a and b have been used for the two tables. Aliases can be anything you like, but are usually short and make the statement easier to write and read.
The JOIN
statement ensures that you only return rows where the join condition is true, as well as any other condition you defined. There are other types of joins. The following link of the w3schools website provides a helpful tutorial: http://www.w3schools.com/sql/sql_join.asp