Amazon Redshift is one of the database as a service (DBaaS) offerings from AWS, providing a massively scalable data warehouse as a managed service at a significantly lower cost. The data warehouse is based on the open source PostgreSQL database technology; however, not all features offered in PostgreSQL are present in Amazon Redshift.
Today, we will learn about Amazon Redshift and walk through a few steps to create a fully functioning Amazon Redshift cluster. We will also take a look at some of the essential concepts and terminology that you ought to keep in mind when working with Amazon Redshift:
- Clusters: Just like Amazon EMR, Amazon Redshift relies on the concept of clusters. A cluster here is a logical container holding one or more compute nodes and a single leader node that is responsible for the cluster's overall management. Here's a brief look at what each node provides:
- Leader node: The leader node is a single node present in a cluster that is responsible for orchestrating and executing various database operations, as well as facilitating communication between the database and associated client programs.
- Compute node: Compute nodes are responsible for executing the code provided by the leader node. Once executed, the compute nodes send the results back to the leader node for aggregation. Amazon Redshift supports two types of compute nodes: dense storage nodes and dense compute nodes. The dense storage nodes provide standard hard disk drives for creating large data warehouses, whereas the dense compute nodes provide higher-performance SSDs. You can start off with a single node that provides 160 GB of storage and scale up to petabytes by leveraging one or more 16 TB capacity instances.
- Node slices: Each compute node is partitioned into one or more smaller chunks, or slices, by the leader node, based on the cluster's initial size. Each slice contains a portion of the compute node's memory, CPU, and disk resources, and uses these resources to process the workloads assigned to it. The assignment of workloads is, again, performed by the leader node.
- Databases: As mentioned earlier, Amazon Redshift provides a scalable database that you can leverage for data warehousing as well as analytical purposes. With each cluster that you spin up in Redshift, you can create one or more associated databases. The database is based on the open source relational database PostgreSQL (v8.0.2) and thus can be used in conjunction with other RDBMS tools and functionalities. Applications and clients can communicate with the database using standard PostgreSQL JDBC and ODBC drivers.
Here is a representational image of a working data warehouse cluster powered by Amazon Redshift:
With this basic information in mind, let's look at some simple, easy-to-follow steps with which you can set up and get started with your Amazon Redshift cluster.
Getting started with Amazon Redshift
In this section, we will be looking at a few simple steps to create a fully functioning Amazon Redshift cluster that is up and running in a matter of minutes:
- First up, we have a few prerequisite steps that need to be completed before we begin with the actual setup of the Redshift cluster. From the AWS Management Console, use the Filter option to search for IAM. Alternatively, you can launch the IAM dashboard directly by visiting this URL: https://console.aws.amazon.com/iam/.
- Once logged in, we need to create and assign a role that will grant our Redshift cluster read-only access to Amazon S3 buckets. This role will come in handy later on in this chapter when we load some sample data into an Amazon S3 bucket and use Amazon Redshift's COPY command to copy the data into the Redshift cluster for processing. To create the custom role, select the Roles option from the IAM dashboard's navigation pane.
- On the Roles page, select the Create role option. This will bring up a simple wizard that we will use to create the role and attach the required permissions to it.
- Select the Redshift option under the AWS Service group section and opt for the Redshift - Customizable option provided under the Select your use case field. Click Next to proceed with the setup.
- On the Attach permissions policies page, filter and select the AmazonS3ReadOnlyAccess permission. Once done, select Next: Review.
- On the final Review page, type in a suitable name for the role and select the Create Role option to complete the process. Make a note of the role's ARN, as we will be requiring it in later steps. Here is a snippet of the role policy for your reference:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "*"
        }
    ]
}
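If you prefer to script this prerequisite instead of clicking through the console, the following Python (boto3) sketch creates an equivalent role and attaches the AmazonS3ReadOnlyAccess managed policy. The role name used here is just an illustrative placeholder, not something defined by the chapter.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Redshift service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Create the role (the role name is an illustrative placeholder)
role = iam.create_role(
    RoleName="redshift-s3-readonly",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Grants Amazon Redshift read-only access to S3",
)

# Attach the AWS managed read-only S3 policy
iam.attach_role_policy(
    RoleName="redshift-s3-readonly",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# Note the ARN; we will need it when launching the cluster
print(role["Role"]["Arn"])
```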
With the role created, we can now move on to creating the Redshift cluster.
- To do so, log in to the AWS Management Console and use the Filter option to search for Amazon Redshift. Alternatively, you can launch the Redshift dashboard directly by visiting this URL: https://console.aws.amazon.com/redshift/.
- Select Launch Cluster to get started with the process.
- Next, on the CLUSTER DETAILS page, fill in the required information pertaining to your cluster as mentioned in the following list:
- Cluster identifier: A suitable name for your new Redshift cluster. Note that this name supports only lowercase characters.
- Database name: A suitable name for your Redshift database. You can always create more databases within a single Redshift cluster at a later stage. By default, a database named dev is created if no value is provided:
- Database port: The port number on which the database will accept connections. By default, the value is set to 5439; however, you can change this based on your security requirements.
- Master user name: Provide a suitable username for accessing the database.
- Master user password: Type in a strong password with at least one uppercase character, one lowercase character and one numeric value. Confirm the password by retyping it in the Confirm password field.
- Once completed, hit Continue to move on to the next step of the wizard.
- On the NODE CONFIGURATION page, select the appropriate Node type for your cluster, as well as the Cluster type, based on your functional requirements. Since this particular cluster setup is for demonstration purposes, I've opted for dc2.large as the Node type and a Single Node deployment with 1 compute node. Click Continue to move on to the next page once done.
It is important to note here that the cluster that you are about to launch will be live and not running in a sandbox-like environment. As a result, you will incur the standard Amazon Redshift usage fees for the cluster until you delete it. You can read more about Redshift's pricing at:
https://aws.amazon.com/redshift/pricing/.
- On the ADDITIONAL CONFIGURATION page, you can configure additional settings, such as enabling encryption, selecting the VPC for your cluster, whether or not the cluster should have direct internet access, and any preference for a particular Availability Zone out of which the cluster should operate. Most of these settings do not require any changes at the moment and can be left at their default values.
- The only change required on this page is associating the previously created IAM role with the cluster. To do so, select the custom Redshift role that we created in the prerequisite section from the Available Roles drop-down list. Once completed, click on Continue.
- Review the settings and changes on the Review page and select the Launch Cluster option when completed.
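If you would rather launch the cluster programmatically instead of through the wizard, the following boto3 sketch is a rough equivalent of the settings described above; the identifier, database name, credentials, and role ARN are placeholders for the values you chose in the previous steps.

```python
import boto3

redshift = boto3.client("redshift")

# Placeholder values; substitute the ones you chose in the wizard
redshift.create_cluster(
    ClusterIdentifier="my-redshift-cluster",  # lowercase only
    DBName="dev",                             # default database name
    Port=5439,                                # default port
    MasterUsername="awsuser",
    MasterUserPassword="StrongPassw0rd",
    NodeType="dc2.large",
    ClusterType="single-node",  # use "multi-node" with NumberOfNodes for larger clusters
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-readonly"],  # placeholder ARN
)
```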
The cluster takes a few minutes to spin up, depending on whether you opted for a single-node or multi-node deployment. Once it is ready, you should see your cluster listed on the Clusters page, as shown in the following screenshot. Ensure that the status of your cluster is shown as healthy under the DB Health column. You can additionally make a note of the cluster's endpoint for accessing it programmatically:
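If you are scripting the launch, a rough sketch of waiting for the cluster to become available and capturing that same endpoint with boto3 might look like this (reusing the placeholder identifier from the previous sketch):

```python
import boto3

redshift = boto3.client("redshift")

# Block until the cluster reports as available
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="my-redshift-cluster")

# Retrieve the cluster's endpoint for programmatic access
cluster = redshift.describe_clusters(
    ClusterIdentifier="my-redshift-cluster"
)["Clusters"][0]
endpoint = cluster["Endpoint"]
print(endpoint["Address"], endpoint["Port"])
```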
With the cluster all set up, the next thing to do is connect to it.
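Because Redshift speaks the PostgreSQL wire protocol, any standard PostgreSQL driver will work. The following is a minimal sketch using the psycopg2 Python driver; the endpoint and credentials are placeholders standing in for the values from your own cluster, and the query against the STV_SLICES system table simply illustrates the node slices discussed earlier.

```python
import psycopg2

# Placeholder endpoint and credentials from the previous steps
conn = psycopg2.connect(
    host="my-redshift-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="StrongPassw0rd",
)

with conn.cursor() as cur:
    # STV_SLICES shows how slices are distributed across compute nodes
    cur.execute("SELECT node, slice FROM stv_slices ORDER BY node, slice;")
    for node, slice_num in cur.fetchall():
        print(f"node {node} -> slice {slice_num}")

conn.close()
```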
This Amazon Redshift tutorial has been taken from AWS Administration - The Definitive Guide - Second Edition.