Relational Databases
Excel and CSV files are a simple way to connect to data in Tableau, but they can easily be altered through human error and are not dynamic. Most organizations have outgrown Excel and CSV files for the following reasons:
- Organizations require data that can be found quickly when needed and can be trusted to be reliable and accurate
- Their solutions need to comfortably handle the natural growth of data and of the number of people wanting to access and manipulate it
- Files can be duplicated and shared freely, risking unauthorized access
- Alternative solutions offer greater opportunities to connect from other locations, rather than from a single local machine
Relational databases are often a reliable means of achieving these benefits. They are data storage systems that organize information in the familiar tabular structure, with rows and columns; when databases are discussed in a Tableau-specific context, users are usually referring to relational databases. Databases are often hosted on a server, which provides the resources required to run and manage the database; servers can often host multiple databases simultaneously, each with a distinct function.
Tables inside these data repositories are usually set up by developers to capture conceptually distinct types of information. For instance, a marketing center may have a Telephone Enquiries table with each record representing an outgoing call (with columns such as start time, duration, and operator), but store customer-level information (such as phone numbers, addresses, and last-contact dates) in a separate table called Clients.
Common elements allow tables to be related to each other for analytical purposes. This is usually done through keys. Primary keys are either a single field or multiple fields in combination that can be used to identify distinct records. To do this effectively, values in the primary key column(s) must be unique for each row, and primary key columns must be fully populated – that is, all records must have a value (with no missing values, known as null values). Tables typically have just one primary key. Primary keys are useful for identifying duplicate values, which reduce the reliability of the data and result in issues such as double counting.
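The two primary key rules described above (unique and fully populated) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the clients table, its columns, and the sample values are hypothetical, and SQLite stands in for whichever database an organization actually uses.

```python
import sqlite3

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")

# client_id is the primary key. NOT NULL is spelled out because SQLite
# does not imply it for non-integer primary keys.
conn.execute(
    "CREATE TABLE clients ("
    " client_id TEXT NOT NULL PRIMARY KEY,"
    " name TEXT NOT NULL)"
)
conn.execute("INSERT INTO clients (client_id, name) VALUES ('C001', 'Avery Ltd')")


def try_insert(client_id, name):
    """Attempt an insert and report whether the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO clients (client_id, name) VALUES (?, ?)",
            (client_id, name),
        )
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert("C002", "Birch Co"),   # new key value
    try_insert("C001", "Duplicate"),  # repeats an existing key value
    try_insert(None, "No key"),       # null key value
]
row_count = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]

for outcome in results:
    print(outcome)
print("rows stored:", row_count)
```

Only the first attempt is stored. The duplicate and the null key are both refused, which is how a primary key guards against double counting.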
Foreign keys are columns in a table that refer to the primary key in another table. They are used to link tables on a common identifier. To continue the preceding example, the Clients table might have a Client ID column as the primary key, which also appears as a foreign key in the Telephone Enquiries table. Analysts can match the numbers between tables and identify which client was called in each instance. For example, they could identify which clients have had the greatest volume of successful calls and are therefore worth investing in. This process maintains the original values in a single location – the confidential Clients table – to make the data easier to govern.
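As a rough sketch of the matching described above, the following example joins two small tables on a shared client identifier and counts successful calls per client. It uses Python's built-in sqlite3 module. The table layouts, the successful flag, and the sample rows are all assumptions made for illustration and are not taken from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE clients (
    client_id TEXT NOT NULL PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE telephone_enquiries (
    call_id    INTEGER PRIMARY KEY,
    client_id  TEXT NOT NULL REFERENCES clients (client_id),  -- foreign key
    operator   TEXT NOT NULL,
    successful INTEGER NOT NULL  -- hypothetical flag: 1 = successful call
);
INSERT INTO clients VALUES ('C001', 'Avery Ltd'), ('C002', 'Birch Co');
INSERT INTO telephone_enquiries (client_id, operator, successful) VALUES
    ('C001', 'Sam', 1),
    ('C001', 'Sam', 1),
    ('C002', 'Priya', 1),
    ('C002', 'Priya', 0);
""")

# Match calls to clients on the shared key and count successful calls.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS successful_calls
    FROM telephone_enquiries AS t
    JOIN clients AS c ON c.client_id = t.client_id
    WHERE t.successful = 1
    GROUP BY c.client_id, c.name
    ORDER BY successful_calls DESC
""").fetchall()
print(rows)

# A call that refers to a client that does not exist is refused.
try:
    conn.execute(
        "INSERT INTO telephone_enquiries (client_id, operator, successful)"
        " VALUES ('C999', 'Sam', 1)"
    )
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
print("orphan call allowed:", orphan_allowed)
```

Client names live only in the clients table. The calls table carries just the identifier, so a change to a client's details is made in one place.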
To access, update, add, or delete records, users send instructions to the relational database. These instructions are written in a language called Structured Query Language (SQL). SQL is discussed further later, in the Custom SQL Query section.
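The four operations mentioned above map onto four SQL statements: INSERT, SELECT, UPDATE, and DELETE. The minimal sketch below sends each one to an in-memory SQLite database through Python's built-in sqlite3 module. The clients table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (client_id TEXT NOT NULL PRIMARY KEY, phone TEXT)"
)

# Add a record.
conn.execute(
    "INSERT INTO clients (client_id, phone) VALUES (?, ?)",
    ("C001", "+441234567890"),
)

# Access records.
after_insert = conn.execute("SELECT client_id, phone FROM clients").fetchall()

# Update a record.
conn.execute(
    "UPDATE clients SET phone = ? WHERE client_id = ?",
    ("+449876543210", "C001"),
)
after_update = conn.execute(
    "SELECT phone FROM clients WHERE client_id = ?", ("C001",)
).fetchone()

# Delete a record.
conn.execute("DELETE FROM clients WHERE client_id = ?", ("C001",))
remaining = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]

print(after_insert, after_update, remaining)
```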
Relational databases are popular because they often enforce rules that maintain data consistency and accuracy; for example, a rule may allow only values within a certain range when new records are added. In the Clients table, a Telephone Number field may require a 10-digit format with a country code prefix for a new record to be accepted in the table.
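Rules of this kind are commonly written as CHECK constraints. The sketch below, again using Python's built-in sqlite3 module, shows one range rule and one format rule. The exact phone format (a plus sign, a two-digit country code, then 10 digits) and the four-hour limit on call duration are assumptions chosen for the example, as are the table and column names.

```python
import sqlite3

# Assumed format: '+', a two-digit country code, then 10 digits.
PHONE_PATTERN = "+" + "[0-9]" * 12

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients ("
    " client_id TEXT NOT NULL PRIMARY KEY,"
    f" phone TEXT NOT NULL CHECK (phone GLOB '{PHONE_PATTERN}'))"
)
conn.execute(
    "CREATE TABLE telephone_enquiries ("
    " call_id INTEGER PRIMARY KEY,"
    " duration_seconds INTEGER NOT NULL"
    " CHECK (duration_seconds BETWEEN 0 AND 14400))"  # assumed 4-hour limit
)


def accepted(sql, params):
    """Return True if the database accepts the statement, False if a rule rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


add_client = "INSERT INTO clients (client_id, phone) VALUES (?, ?)"
add_call = "INSERT INTO telephone_enquiries (duration_seconds) VALUES (?)"

valid_phone = accepted(add_client, ("C001", "+441234567890"))
short_phone = accepted(add_client, ("C002", "12345"))
valid_duration = accepted(add_call, (180,))
negative_duration = accepted(add_call, (-5,))

print(valid_phone, short_phone, valid_duration, negative_duration)
```

Because the rules live in the table definition, they apply to every new record regardless of which tool or person submits it.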
Popular relational database management systems include PostgreSQL, MySQL, and Oracle.