Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Actionable Insights with Amazon QuickSight
Actionable Insights with Amazon QuickSight

Actionable Insights with Amazon QuickSight: Develop stunning data visualizations and machine learning-driven insights with Amazon QuickSight

eBook
$27.98 $39.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Actionable Insights with Amazon QuickSight

Chapter 1: Introducing the AWS Analytics Ecosystem

As data increases in both volume and variety, organizations from all verticals are adopting cloud analytics services for their data analytics. AWS offers a number of analytics services covering data lakes, data warehousing, big data processing, extract, transform, load (ETL), and data visualization. In this chapter, we will introduce the AWS analytics ecosystem. Some of the services we discuss here will be mentioned again later in the book.

First, we will map the AWS services into categories. Then, we will discuss how Amazon QuickSight fits into the wider AWS analytics ecosystem. We will look more closely at a modern Lake House architecture and we will discuss its benefits and its components. Finally, we will provide a step-by-step guide to set up a data Lake House architecture on AWS and load and query a demo data sample. Some of this information may already be familiar to you, but let's go back over the basics.

In this chapter, we will cover the following topics:

  • Discovering the AWS analytics ecosystem
  • Exploring the data Lake House architecture on AWS
  • Creating a basic Lake House architecture

Technical requirements

To follow along with this chapter, you will need the following pre-requisites:

  • An AWS account with console access
  • AWS CLI access

The code sample for this chapter can be accessed on the GitHub repository for this book at https://github.com/PacktPublishing/Actionable-Insights-with-Amazon-QuickSight/tree/main/chapter_1.

Discovering the AWS analytics ecosystem

AWS provides a large number of analytics services. In addition to that, AWS has a number of partners who specialize in data analytics and offer analytics solutions that run on the AWS infrastructure. Partner solutions are not in the scope of this section, however. This section focuses on the AWS fully managed analytics services. In order to list the services, we will first define the specific categories related to analytics functions. Machine learning and predictive analytics are also out of the scope of this chapter. For every service category, we will then list the AWS services available, and for each service, we will provide a high-level description. Figure 1.1 depicts the commonly used AWS analytics services.

Figure 1.1 – AWS analytics services

Figure 1.1 – AWS analytics services

Business intelligence

More and more organizations aspire to be data-driven and use data to drive their strategic decisions. Business intelligence (BI) tools help organizations to transform data into actionable insights. With the use of BI tools, users can analyze data and then present their findings in reports or dashboards. These reports or dashboards can then be consumed by business users who are interested in getting a picture of the state of the business.

In 2015, AWS launched Amazon QuickSight, a cloud-native BI tool. Since then, AWS has added new features to QuickSight, enriching the standard dashboard functionality with machine learning capabilities and offering embedded dashboard functionality. Amazon QuickSight is the main technology we will be covering in this book. Over the next few chapters, we will start with the basic functionality of Amazon QuickSight, and then we will explore more advanced features. Where possible, we will use practical examples that can be repeated in your own development environment, to give you hands-on experience with Amazon QuickSight.

Data warehousing

Data warehouses are repositories of data; they are important components of the BI process. Data stored in data warehouses is typically structured. Traditionally, data is ingested and centralized into data warehouses from different operational data stores. Data warehouses are optimized to run analytical queries over large amounts of data. The results of analytical queries are usually calculated after an aggregation over multiple rows from one or more tables. BI applications use analytical queries to aggregate data and visualize it. It is a common architectural approach to use a data warehouse to serve data to a BI application.

Back in 2012, AWS launched Amazon Redshift, a cloud-native, fully managed data warehouse service. Today, Redshift is one of the most popular cloud data warehouses with thousands of organizations from different verticals using it to analyze their data. Other popular cloud data warehouses include Snowflake and Google BigQuery. Amazon Redshift integrates with most BI tools and it integrates natively with Amazon QuickSight. We will discuss this topic in more detail in Chapter 3, Preparing Data with Amazon QuickSight, when we look more closely into Amazon QuickSight-supported data sources.

Data lake storage and governance

A data lake is a repository of data where organizations can easily centralize all of their data and apply it in different use cases such as reporting, visualization, big data analytics, and predictive analytics. Data stored in data lakes can be structured or semi-structured. Usually, data is ingested into the data lake in its raw format, and is then transformed and stored back into the data lake for further processing and analysis. A cloud data lake typically uses a cloud object store to store data. AWS introduced Amazon Simple Storage Service (S3) in March 2006, offering developers a highly scalable, reliable, and low-latency data storage infrastructure at very low cost. Amazon S3 can store an unlimited amount of data, a particularly useful feature for data lakes. Organizations have one less thing to worry about because they don't need to think about scaling their storage as the amount of data stored grows.

While scaling data lake storage is something that organizations and CIOs don't need to worry about much anymore, data lake governance needs to be considered carefully. Data lakes do not enforce data schemas or data formats and, without any governance, data lakes can degrade into unusable data repositories, often referred to as data swamps. AWS offers a number of services for data governance.

The AWS Glue Catalog is part of the AWS Glue service. It is a fully managed Apache Hive metastore-compatible data catalog. Big data applications (for example, Apache Spark, Apache Hive, Presto, and so on) use the metadata in the catalog to locate and parse data. The AWS Glue Catalog is a technical metadata repository and can catalog data in Amazon S3, and a number of relational or non-relational data stores including Redshift, Aurora, and DynamoDB, among others.

AWS Lake Formation runs on top of AWS Glue and Amazon S3 and provides a governance layer and access layer for data lakes on Amazon S3. It also provides a set of reusable ETL jobs, called blueprints, that can be used to perform common ETL tasks (for example, loading data from a relational data store into an S3 data lake). Lake Formation allows users to manage access permissions, using a familiar GRANT REVOKE syntax that you might have seen in relational database management systems (RDBMSes).

Amazon Macie is an AWS service for data protection. It provides an inventory of Amazon S3 buckets and it uses machine learning to identify and alert its users about sensitive data, such as personally identifiable information (PII).

Finally, and perhaps most importantly, AWS Identity and Access Management (IAM) is a fundamental AWS service that allows users to assign permissions to principals (for example, users, groups, or roles) and explicitly allow or deny access to AWS resources including data lake locations or tables in the data catalog.

Ad hoc analytics

Ad hoc analytics refers to getting answers from the data on an as-needed basis. Contrary with what happens with scheduled reports, ad hoc querying is initiated by a user when they need to get specific answers from their data. The user typically uses SQL via a workbench type of application or other analytics frameworks (for instance, Apache Spark) using notebook environments or other BI applications. AWS has a number of analytics services that can be used for ad hoc analytics.

Amazon Redshift can be used for ad hoc analysis of data. For ad hoc querying, users will typically connect to Amazon Redshift using a query editor application with the Redshift JDBC/ODBC drivers. Notebook integrations or BI tool integrations are also possible for ad hoc analysis. AWS offers a number of managed notebook environments such as EMR notebooks and SageMaker notebooks. Amazon Redshift also allows its users to query data that is stored outside the data warehouse. Amazon Redshift Spectrum allows Redshift users to query data stored in Amazon S3, eliminating the need to load the data first before querying. Redshift's federated querying capability allows users to query live data in operational data stores such as PostgreSQL and MySQL.

For big data and data lakes, Presto is a popular choice for ad hoc analysis. Presto provides a high-performance parallel SQL query engine. Amazon Athena lets users run Presto queries in a scalable serverless environment. Amazon QuickSight natively supports Amazon Athena. We will talk more about this native integration in Chapter 3, Preparing Data with Amazon QuickSight. Amazon EMR is a fully managed Hadoop cluster, and it comes with a range of applications from the open source big data ecosystem. Presto has two community projects, PrestoSQL and PrestoDB, both of which are part of the Amazon EMR service. Other options included with EMR are Hive on EMR and Spark on EMR.

Extract, transform, load

ETL is a term used to describe a set of processes to extract, transform, and load data usually for analytical purposes. Organizations gather data from different data sources and centralize them in a central data repository. Data from different sources typically has different schemas and different conventions and standards, and therefore it can be challenging to combine them to get the required answers. For that reason, data needs to transformed so that it can work together. For example, cleaning the data, applying certain data quality thresholds, and standardizing to a specific standard (for instance, date and time formats used) are all important tasks to ensure the data is useable. A visual representation of the ETL process is shown in the following figure.

Figure 1.2 – The ETL process

Figure 1.2 – The ETL process

AWS Glue is a fully managed ETL service offered by AWS. When it was first introduced in 2017, Glue ETL offered an Apache Spark environment optimized for ETL. Now, Glue ETL offers a wider range of options:

  • PySpark – Apache Spark using Python
  • Spark with Scala – Apache Spark with Scala
  • Python shell – For smaller ETL jobs that don't need a Spark cluster
  • Glue Studio and Glue Databrew – Visual approach to ETL without the need to write code

Amazon EMR transient clusters, with applications such as Spark or Hive, can be leveraged for ETL workloads. ETL workloads can be bulk or streaming: streaming ETL workloads usually need to be up and running constantly, or at least for as long as the source stream is on; batch ETL workloads don't need to run at all times and they can stop once the data is loaded into the target system. This type of workload fits nicely with the flexibility of the cloud. With the cloud, data architects don't need to think of Hadoop clusters as big monolithic clusters. Instead, users prefer purpose-built transient clusters, optimized and sized to handle specific workloads and data loads.

Now that we've had our overview of the AWS analytics ecosystem, let's learn about data Lake Houses and how they are built.

Exploring the data Lake House architecture on AWS

The data Lake House architecture is a modern data analytics architecture: as the name suggests, it combines the data lake and the data warehouse into a seamless system. This approach extends the traditional data warehouse approach and opens up new possibilities for data analytics. For this reason, it is important to understand this architecture, which can be used as a data backend for Amazon QuickSight or other BI applications. To understand the architecture better, let's first start by understanding the differences between a data lake and data warehouse.

Data lakes versus data warehouses

Data lakes and data warehouses are designed to consume large amounts of data for analytics purposes. Data warehouses are traditional database systems, used by organizations and enterprises for years. Data lakes, on the other side, are relatively young implementations that emerged from the big data and Hadoop ecosystems. Tables stored in data warehouses need to have clearly defined schemas. The schema needs to be defined upfront, before any data is added. This approach is called schema on write, and it ensures that data conforms to a specific structure before being ingested into the data warehouse. However, it can be less flexible, and it may introduce complexity when dealing with evolving schemas. Evolving schemas are an increasingly common scenario because organizations need to capture more and more data points from their customer interactions to drive data-driven decisions.

On the other side, data lakes don't enforce a schema upfront. Instead, applications that have the required permissions can write data to a data lake. Structure and data formats aren't enforced by the data lake: it is a responsibility of the writing application.

Data stored in a data lake has few to no limitations regarding its format: it can be structured, semi-structured, or completely unstructured. For many datasets, a schema can be inferred, either because the data is semi-structured (CSV, JSON, and others), or they follow patterns that can be identified after applying regular expressions and extracting specific columns. In data lakes, the schema is inferred when the data is read by the querying application. This approach is called schema on read, and it gives an organization flexibility regarding the data type stored. However, it also introduces challenges with data complexity and enforcing data quality.

For that reason, it is common that data that lands into the data lake goes through a series of transformations to get to a stage where it is useable. The first stage, often referred to as the raw layer, is where the data first lands, and it is stored as is.

After the data has landed, the first series of transformations is applied and the data is stored at the processed layer. Since the data can be of any format, the types of possible transformations are limitless. To give just some examples, data quality functions can be applied at this stage to remove incomplete rows and standardize the data in line with a specific datetime or time zone format. Other data engineering activities can also be performed at this stage, such as converting data into different file data formats optimized for analytics, or organizing them into folders using specific information (usually temporal) that can be later used as a partition column by the querying application.

Finally, data can then be converted for specific use cases and landed into the target layer. As an example, data can be transformed in a way that is relevant for a specific machine learning algorithm to work with the data. Another use case could be BI applications, such as Amazon QuickSight, where data can be pre-joined or aggregated and therefore reduced from a large dataset into a smaller dataset that is easier to visualize. Additional data engineering can be applied at this stage to optimize for performance.

Figure 1.3 – Data lake layers

Figure 1.3 – Data lake layers

The data warehouse and data lake architectures are now being challenged by a new, hybrid type of storage: the data Lake House.

Lake House architecture on AWS

This section will look more closely at an example Lake House architecture on AWS using AWS managed services. Let's start by defining the key components of the Lake House architecture:

  • Amazon Redshift is the data warehouse service.
  • Amazon S3 is the object store that can be used for cloud data lake storage.
  • AWS Glue is the data lake catalog to store technical metadata.

    Note

    AWS Glue Catalog tables can be stored in Amazon Redshift, providing a unified metadata catalog across both the data warehouse and the S3 data lake.

Amazon Redshift supports functionality that allows it to interact with the data warehouse. Let's look at those features in more detail.

Ability to query the data lake from the data warehouse

Redshift Spectrum is a feature of Redshift that allows you to perform SQL queries against data in the S3 data lake. The queries are triggered directly from the data warehouse, and therefore you don't need to connect to a different environment to submit your queries. You need to define the Spectrum tables as external tables on their data warehouse. The Redshift cluster also needs to have permission to access the data lake S3 location(s). The Redshift cluster will need to be assigned an IAM role, which needs to have access to the desired S3 locations.

Another key characteristic of Redshift Spectrum is that the Spectrum queries are running in the Spectrum nodes that are outside of the Redshift cluster. This effectively extends the Redshift cluster with additional compute capacity when data lake data needs to be queried.

Finally, Spectrum tables and Redshift tables can be combined and joined. Without this feature, you would have to move data and collocate it before joining it.

Ability to load data from the data lake

Redshift can efficiently load data from the S3 data lake. Specifically, Redshift's COPY command can load data in parallel from Amazon S3. You (at a minimum) need to define a table name, the data location (commonly S3), and the authorization to access the data in the source location. When loading multiple files from S3, Redshift parallelizes the loading by allocating each file to a Redshift slice (the unit of processing in Redshift).

Ability to unload data to the data lake

Redshift also comes with the ability to unload data from the data warehouse back to the data lake. Specifically, the UNLOAD command unloads the result of the query onto Amazon S3. You (as a minimum) need to specify the S3 location and the authorization. There are more options, such as defining the file format (using the FORMAT AS option) or applying partitioning (using the PARTITION BY option), and others.

In the following diagram, we see an example data pipeline that is using both a data warehouse and a data lake on AWS. Data is loaded from the operational data stores into the Amazon S3 object store in the raw layer of the data lake. Then, with a set of ETL jobs, the data reaches a stage that can be loaded into the data warehouse for BI purposes. For cost-effectiveness, you might not want to load all the data into the warehouse. Instead, you might want to leave the data in the data lake but have the ability to query the data when needed. This architecture considers the temperature of the data (how frequently the data is accessed) to determine the best storage. Hot data that needs to be accessed frequently is loaded into the data warehouse, while colder data remains in the data lake, a cheaper long-term storage option.

Figure 1.4 – Example data pipeline on AWS

Figure 1.4 – Example data pipeline on AWS

Now that we have had an overview of the Lake House architecture on AWS, let's build a basic data Lake House architecture on AWS.

Creating a basic Lake House architecture

In this section, we will go through a hands-on example to create a basic data Lake House architecture. This tutorial will use the AWS CLI and the AWS console. By the end of this section, we will have spun up a working data lake and a data warehouse environment with demo data loaded.

Important note

The resources for this tutorial might introduce charges to your AWS account. Once you finish with the exercise, make sure you clean up the resources to prevent incurring further charges.

Creating the data lake storage

In this step, we will add the data lake storage. Then we will upload a demo dataset and will discover its schema automatically.

Step 1 – creating the S3 bucket

Let's begin:

  1. If you haven't installed it already, follow the AWS documentation to install and configure the AWS CLI. To complete this tutorial, you will need to use a role that has access to the S3, Glue, Redshift, and IAM services: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.
  2. First, let's create the S3 bucket as the data lake storage. Your S3 bucket needs to have a globally unique name. For that reason, we should introduce some randomness to it. Let's pick a random set of 10 characters and numbers. For this tutorial, you should choose your own random set of characters for this string; for example, SF482XHS7M.

    We will use the random string in the data lake name, to ensure it is globally unique.

    Let's use my-data-lake-<random string> as the bucket name.

  3. To create a bucket, we can type the following command into the CLI. Replace the following random string with your random string:
    % aws s3api create-bucket --bucket data-lake-xxxxxxxxxx --region us-east-1

    And the response should look like this:

    {
        "Location": "/data-lake-xxxxxxxxxx"
    }

Step 2 – adding data into the data lake

Now let's add some data. For this exercise, we will use a subset of the New York City Taxi and Limousine Commission (TLC) Trip Record Data:

  1. Let's first have a look at the dataset:
    % aws s3 ls 's3://nyc-tlc/trip data/' --no-sign-request

    This command will return all the files in the open S3 location:

    2016-08-11 15:32:21   85733063 fhv_tripdata_2015-01.csv
    2016-08-11 15:33:04   97863482 fhv_tripdata_2015-02.csv
    2016-08-11 15:33:40  102220197 fhv_tripdata_2015-03.csv
    …
    2021-02-26 16:54:00  138989555 yellow_tripdata_2020-11.csv
    2021-02-26 16:54:00  134481400 yellow_tripdata_2020-12.csv 

    We don't need to download all of them. For this tutorial, we will copy only the files for 2020.

  2. We can use the S3 CLI exclude and include parameters to apply a pattern to match the desired filenames. The command to copy is as follows:
    % aws s3 cp "s3://nyc-tlc/trip data/" s3://data-lake-xxxxxxxxxx/yellowtrips/ --recursive --exclude "*" --include "yellow_tripdata_2020*" 
  3. Once completed, we can then verify that the files exist in our environment with the aws s3 ls command, which lists the files under a specific S3 location:
    % aws s3 ls s3://data-lake-xxxxxxxxxx/yellowtrips/
    2021-03-27 16:53:41  593610736 yellow_tripdata_2020-01.csv
    2021-03-27 16:53:41  584190585 yellow_tripdata_2020-02.csv
    2021-03-27 16:53:42  278288608 yellow_tripdata_2020-03.csv
    2021-03-27 16:53:41   21662261 yellow_tripdata_2020-04.csv
    2021-03-27 16:53:43   31641590 yellow_tripdata_2020-05.csv
    2021-03-27 16:53:42   50277193 yellow_tripdata_2020-06.csv
    2021-03-27 16:53:44   73326707 yellow_tripdata_2020-07.csv
    2021-03-27 16:53:46   92411545 yellow_tripdata_2020-08.csv
    2021-03-27 16:53:50  123394595 yellow_tripdata_2020-09.csv
    2021-03-27 16:53:54  154917592 yellow_tripdata_2020-10.csv
    2021-03-27 16:53:57  138989555 yellow_tripdata_2020-11.csv
    2021-03-27 16:53:58  134481400 yellow_tripdata_2020-12.csv

    Note

    You can use data in a shared data lake as part of your data lake without the need to actually copy it across to your data lake.

Step 3 – identifying the schema

The next step is to identify the schema of the dataset. For this purpose, we will use the AWS Glue crawlers. AWS Glue crawlers crawl through the data to detect the schema. If a schema can be determined (remember there is no guarantee that the data has a specific schema) then Glue crawlers will populate the Glue Catalog with the schemas identified after crawling the data. Glue tables always belong to a Glue database. A database in Glue is just a logical repository of tables in the Glue Catalog:

  1. Let's start by creating a database using the create-database command:
    % aws glue create-database --database-input "{\"Name\":\"my-data-lake-db\"}" --region us-east-1
  2. We can verify the successful database creation using the get-databases command:
    % aws glue get-databases --region us-east-1
    {
        "DatabaseList": [
            {
                "Name": "default", 
                "CreateTime": 1553517157.0
            }, 
         
            {
                "Name": "my-data-lake-db", 
                "CreateTime": 1616865129.0
            }
        ]
    }
  3. Before we create the Glue Crawler, we need to create an IAM role that will be assigned to the Crawler and allow it to access the data in the data lake. The crawler doesn't need to write to the data lake location, therefore only the read access permission is needed. To give the required permissions to a role, we need to attach policies that define the permissions. Let's define a policy document that allows read access to our data lake:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::data-lake-xxxxxxxxxx",
                    "arn:aws:s3:::data-lake-xxxxxxxxxx/*"
                ]
            }
            ]
    }

    The preceding policy document allows the policy holder to use the S3 ListBucket and the GetObject API. The crawler will use ListBucket to list the objects in our data lake bucket and getObject to read objects as it crawls data. This policy restricts access to the data lake bucket only.

  4. Now, let's create a file and copy the policy text. Replace the random string in the data lake name with the random string in your environment. I used vim, but you can use any text editor:
    % vim policy
  5. Then, let's create the IAM policy using the create-policy CLI command:
    % aws iam create-policy --policy-name DataLakeReadAccess --policy-document file://policy 

    The preceding command created the policy and we should get a confirmation JSON object back. Note the policy ARN, as we will use it in a later step.

  6. Next, let's create the IAM role that the Glue crawler will assume. First, let's define the role policy document:
    {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Action": "sts:AssumeRole",
                        "Effect": "Allow",
                        "Principal": {
                            "Service": "glue.amazonaws.com"
                        }
                    }
                ]
            }
  7. Then create a file called role-policy and copy in the preceding JSON document:
    % vim role-policy

    This role policy document allows the Glue service to assume the role we will create.

  8. To create the role, we will use the iam create-role CLI command:
    % aws iam create-role --role-name GlueCrawlerRole --assume-role-policy-document file://role-policy

    We should get a confirmation JSON message after running the command.

  9. Capture the role ARN, as it will be used later when defining the crawler.
  10. Then, let's attach the required policies to this role. For this role, we want to allocate two policies: the AWSGlueServiceRole policy (this is managed by AWS) and the DataLakeReadAccess policy we created earlier. To attach policies to the IAM role we will use the iam attach-role-policy command. Let's start with the AWSGlueServiceRole policy:
    % aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 
  11. Then we will attach the DataLakeReadAccess policy. We will need the policy ARN that we captured earlier. The policy ARN should look like the following line:
    arn:aws:iam::<accountid>:policy/DataLakeReadAccess

    And the command should look like the following:

    % aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/DataLakeReadAccess
  12. Now, let's create the AWS Glue crawler. For this purpose, we will use the glue create-crawler CLI command. Make sure you replace the role ARN and the data lake location with the values for your environment:
    % aws glue create-crawler --name qs-book-crawler --role arn:aws:iam::xxxxxxxxxxxx:role/GlueCrawlerRole --database-name my-data-lake-db --targets "{\"S3Targets\":[{\"Path\":\"s3://data-lake-xxxxxxxxxx/yellowtrips\"}]}" --region us-east-1
  13. Then, just start the crawler using the glue start-crawler command:
    % aws glue start-crawler --name qs-book-crawler --region us-east-1

    After 1-2 minutes, the Glue crawler should populate the database.

  14. We can confirm this by calling the glue get-tables cli command:
    % aws glue get-tables --database-name my-data-lake-db
  15. You can view the Catalog from the AWS Console. Log in to the AWS Console and navigate to AWS Glue.
  16. Then on the left-hand side menu, under Data Catalog, choose Databases and then find my-data-lake-db. Then click on View tables under my-data-lake-db. It should look like the following screenshot:
Figure 1.5 – Glue console

Figure 1.5 – Glue console

Tip

You can click the checkbox to select the table and then, under Action, you can choose Preview Data. This will open the Amazon Athena console and run an Athena query that returns 10 values from the table.

Step 4 – creating the data warehouse

Let's create our data warehouse next.

To create the data warehouse, we will use the redshift create-cluster CLI command, or you can use the AWS Console:

%aws redshift create-cluster --node-type dc2.large --number-of-nodes 2 --master-username admin --master-user-password R3dsh1ft --cluster-identifier mycluster --region us-east-1

This command should give a response with the cluster metadata. After a few minutes, our cluster will be up and running.

Note

This command will create a Redshift cluster with a public IP address. This is something that should be avoided in real-world scenarios. The instructions provided are oversimplified for the purposes of this tutorial as this book is not focused on Amazon Redshift.

Step 5 – loading the data into the data warehouse

First, let's create an IAM role that we will assign to the Redshift cluster. We will use this role when using the Redshift Spectrum feature to query data in S3. We want the cluster to be able to write and read to our S3 location. We also want the cluster to be able to have read access to the Glue Catalog:

  1. Similarly to what we did earlier, we will create the following role policy document to allow the role to be assumed by the Redshift service:
    {
                "Version": "2012-10-17", 
                "Statement": [
                    {
                        "Action": "sts:AssumeRole", 
                        "Effect": "Allow", 
                        "Principal": {
                            "Service": "redshift.amazonaws.com"
                        }
                    }
                ]
            }
  2. Then copy this JSON object into a policy document:
    % vim role-policy-redshift
  3. Now, let's create the role using the iam create-role command:
    % aws iam create-role --role-name RedshiftSpectrumRole --assume-role-policy-document file://role-policy-redshift

    Note the role ARN, as we will use it later to attach it to the cluster.

  4. Next, we need to give the desired permissions by attaching the correct policies. This time, for simplicity, we will just attach two AWS managed policies. These policies are overly permissive, and normally we would attach policies with narrower permissions, as we did for the Glue crawlers in Step 3. Let's attach AWSFullS3Access and AWSFullGlueAccess:
    % aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess 
    % aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
  5. Next, we will attach this role to our cluster using the redshift modify-cluster-iam-roles CLI command. Note that you need to update the role ARN with the value from your environment:
    % aws redshift modify-cluster-iam-roles --cluster-identifier mycluster --add-iam-roles arn:aws:iam::<ACCOUNT-ID>:role/RedshiftSpectrumRole --region us-east-1
  6. The cluster change will take a few minutes to be applied. After the change is applied, the cluster will be ready to fetch data from the S3 data lake. To connect to the cluster, we will use the built-in query editor found in the AWS Management Console. To find the editor, navigate to the Redshift console, and see the left-hand side menu. The editor will need to establish a connection. Make sure you select the cluster we created earlier, and type dev as the database name and admin as the username.

    Note

    We didn't set a database name earlier. Redshift uses dev as the default value.

  7. In the editor page, we will need to create a table to store the data. Let's name the table yellowtrips_3mo, as we will only store 3 months' worth of data:
    create table yellowtrips_3mo
    (vendorid varchar(10),
    tpep_pickup_datetime datetime,
    tpep_dropoff_datetime datetime,
    passenger_count int,
    trip_distance float,
    ratecodeid varchar(10),
    store_and_fwd_flag char(1),
    pulocationid varchar(10),
    dolocationid varchar(10),
    payment_type varchar(10),
    fare_amount float,
    extra float,
    mta_tax float,
    tip_amount float,
    tolls_amount float,
    improvement_surcharge float,
    total_amount float,
    congestion_surcharge float);
  8. Then, let's copy 3 months' worth of data into the data warehouse. Let's use the COPY command, as follows:
    copy yellowtrips_3mo from 
    's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-10.csv'
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
    copy yellowtrips_3mo from 
    's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-11.csv'
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
    copy yellowtrips_3mo from 
    's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-12.csv'
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
  9. At this stage, we have a data lake with 12 months' worth of data and a data warehouse that contains only the most recent data (3 months). One of the characteristics of the Lake House architecture is that it allows its users to query the data lake from the data warehouse. Feel free to query the data and start getting an understanding of the dataset. Let's create the external schema so that we can enable the Spectrum feature. Use the following command in your Redshift editor. Replace the role ARN with the values from your environment:
    create external schema spectrum_schema from data catalog 
    database 'my-data-lake-db' 
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole'
    create external database if not exists;
  10. Let's just compare the size of the two tables using a simple count(*) query:
    select count(*) from public.yellowtrips_3mo;
    select count(*) from spectrum_schema.yellowtrips;

    The first query will run against the recent data in the data warehouse. The second will run against the first query using the Spectrum nodes using the data in the data lake. As expected, the number of records in the data lake should be much higher than the number of records in the data warehouse. Specifically, the query result was 24,648,499 for the year 2020 and 4,652,013 records for the last 3 months of 2020.

    Note

    The Spectrum queries use the Spectrum nodes and are charged separately from the Redshift cluster. Every query incurs an added cost based on the data it needs to scan. Refer to the AWS pricing for details.

Feel free to experiment with the data and trigger a few queries to understand the dataset. When you finish with the Redshift cluster, you can pause the cluster so that you stop the on-demand billing. Once the cluster is paused you will only pay for the cluster storage.

Summary

Congratulations, you have reached the end of the first chapter! By now, you should have a good understanding of the AWS analytics ecosystem and its data lake and data warehousing options. In this chapter, we discussed in detail the key differences between data warehouses and data lakes. We also discussed the Lake House architecture on AWS, and we looked at its main components in more detail. Finally, during the step-by-step section in this chapter, you had a chance to create a data lake and a data warehouse from scratch, and you loaded an open dataset for further analysis later on. We also defined Spectrum tables and queried the data lake directly for the data warehouse.

In the next chapter, we will discuss the basic concepts of Amazon QuickSight, understand its main benefits, and learn how to set up a QuickSight account.

Questions

  1. What is the difference between data lakes and data warehouses?
  2. What is schema on read and what is schema on write?
  3. How can we identify the schema of new data in a data lake on AWS?
  4. Which AWS storage service is ideal for a data lake on AWS?
  5. What data is better served from the data lake in the Lake House architecture?
  6. How do Redshift Spectrum tables differ from Redshift tables?

Further reading

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Explore Amazon QuickSight, manage data sources, and build and share dashboards
  • Learn best practices from an AWS certified big data solutions architect
  • Manage and monitor dashboards using the QuickSight API and other AWS services such as Amazon CloudTrail

Description

Amazon Quicksight is an exciting new visualization that rivals PowerBI and Tableau, bringing several exciting features to the table – but sadly, there aren’t many resources out there that can help you learn the ropes. This book seeks to remedy that with the help of an AWS-certified expert who will help you leverage its full capabilities. After learning QuickSight’s fundamental concepts and how to configure data sources, you’ll be introduced to the main analysis-building functionality of QuickSight to develop visuals and dashboards, and explore how to develop and share interactive dashboards with parameters and on-screen controls. You’ll dive into advanced filtering options with URL actions before learning how to set up alerts and scheduled reports. Next, you’ll familiarize yourself with the types of insights before getting to grips with adding ML insights such as forecasting capabilities, analyzing time series data, adding narratives, and outlier detection to your dashboards. You’ll also explore patterns to automate operations and look closer into the API actions that allow us to control settings. Finally, you’ll learn advanced topics such as embedded dashboards and multitenancy. By the end of this book, you’ll be well-versed with QuickSight’s BI and analytics functionalities that will help you create BI apps with ML capabilities.

Who is this book for?

This book is for business intelligence (BI) developers and data analysts who are looking to create interactive dashboards using data from Lake House on AWS with Amazon QuickSight. It will also be useful for anyone who wants to learn Amazon QuickSight in depth using practical, up-to-date examples. You will need to be familiar with general data visualization concepts before you get started with this book, however, no prior experience with Amazon QuickSight is required.

What you will learn

  • Understand the wider AWS analytics ecosystem and how QuickSight fits within it
  • Set up and configure data sources with Amazon QuickSight
  • Include custom controls and add interactivity to your BI application using parameters
  • Add ML insights such as forecasting, anomaly detection, and narratives
  • Explore patterns to automate operations using QuickSight APIs
  • Create interactive dashboards and storytelling with Amazon QuickSight
  • Design an embedded multi-tenant analytics architecture
  • Focus on data permissions and how to manage Amazon QuickSight operations

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jan 28, 2022
Length: 242 pages
Edition : 1st
Language : English
ISBN-13 : 9781801072007
Vendor :
Amazon
Category :
Languages :
Tools :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Jan 28, 2022
Length: 242 pages
Edition : 1st
Language : English
ISBN-13 : 9781801072007
Vendor :
Amazon
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 158.97
Actionable Insights with Amazon QuickSight
$48.99
Machine Learning with Amazon SageMaker Cookbook
$54.99
Serverless Analytics with Amazon Athena
$54.99
Total $ 158.97 Stars icon

Table of Contents

14 Chapters
Section 1: Introduction to Amazon QuickSight and the AWS Analytics Ecosystem Chevron down icon Chevron up icon
Chapter 1: Introducing the AWS Analytics Ecosystem Chevron down icon Chevron up icon
Chapter 2: Introduction to Amazon QuickSight Chevron down icon Chevron up icon
Chapter 3: Preparing Data with Amazon QuickSight Chevron down icon Chevron up icon
Chapter 4: Developing Visuals and Dashboards Chevron down icon Chevron up icon
Section 2: Advanced Dashboarding and Insights Chevron down icon Chevron up icon
Chapter 5: Building Interactive Dashboards Chevron down icon Chevron up icon
Chapter 6: Working with ML Capabilities and Insights Chevron down icon Chevron up icon
Chapter 7: Understanding Embedded Analytics Chevron down icon Chevron up icon
Section 3: Advanced Topics and Management Chevron down icon Chevron up icon
Chapter 8: Understanding the QuickSight API Chevron down icon Chevron up icon
Chapter 9: Managing QuickSight Permissions and Usage Chevron down icon Chevron up icon
Chapter 10: Multitenancy in Amazon QuickSight Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.8
(5 Ratings)
5 star 80%
4 star 20%
3 star 0%
2 star 0%
1 star 0%
Apostolos Grevenitis Feb 01, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Pros:- Clear and concise messaging.- Builds knowledge block-by-block without rushing into concepts.- Plenty of hands-on examples and labs to supplement theoretical concepts (best way to learn is to get your hands dirty imo).- Provides holistic view of the AWS Analytics ecosystem + relevant use cases and makes it easy to understand where QuickSight falls and how it integrates with the rest of the AWS Analytics services.- Caters to AWS novices and more advanced users alike.Cons:- Black & white charts and images take from the paperback reader experience, however they are available through the publisher's website.
Amazon Verified review Amazon
Poonam Nair Jan 31, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
True to this books title, the book helped a novice like myself gain more insight in understanding QuickSight and to be able to see its use & importance as a comprehensive guide that aids with building detailed and specific data analytics solutions. This book is definitely a thorough guide that will aid me with future projects at work.
Amazon Verified review Amazon
MuahAmazon Customer Dec 17, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Actionable Insights with Amazon QuickSight is truly a workbook that takes you through the entire life cycle management of data analytics with a QuickSight lens. If you are looking for a comprehensive study of QuickSight, this text is your one-stop text.This book is NOT about how you create visualizations (although there are lots of visualizations!). It provides answers to the "WHYs" if going deep is your cup of tea; and provides prescriptions about the "WHENs".If you don't have any analytics experience, no problem. I was able to learn about data lakes and other AWS services that support analytics through many hands-on examples which I was able to immediately leverage in my work.
Amazon Verified review Amazon
Abhi Jan 28, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book by Manos Samatas is an excellent resource to gain knowledge on Amazon QuickSight. It is quite extensive and covers a wide range of features that are available, including several advanced BI topics for enterprise scale implementations. A nice read and useful for everyone who are in BI analytics and want to understand power of Amazon QuickSight.
Amazon Verified review Amazon
P. Jan 31, 2022
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
Actionable Insights with Amazon Quicksight is an A to Z guide for Quicksight tool. Highly recommended for developers and analysts.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.