Data Engineering with AWS Cookbook: A recipe-based approach to help you tackle data engineering problems with AWS services

By Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, Huda Nofal

Managing Data Lake Storage

Amazon Simple Storage Service (Amazon S3) is a highly scalable and secure cloud storage service. It allows you to store and retrieve any amount of data at any time from anywhere in the world. S3 buckets help enterprises and individuals meet their data backup and delivery needs and serve a variety of use cases, including but not limited to web and mobile applications, big data analytics, data lakes, and data backup and archiving.

In this chapter, we will learn how to keep data secure in S3 buckets and how to configure buckets in a way that best serves your use case from a performance and cost perspective.

The following recipes will be covered in this chapter:

  • Controlling access to S3 buckets
  • Storage types in S3 for optimized storage costs
  • Enforcing encryption of S3 buckets
  • Setting up retention policies for your objects
  • Versioning your data
  • Replicating your data
  • Monitoring your S3 buckets

Technical requirements

The recipes in this chapter assume you have an S3 bucket on which you have admin permissions. If you don’t have admin permissions on the bucket, you will need to configure the required permissions for each recipe as needed.

You can find the code files for this chapter in this book’s GitHub repository: https://github.com/PacktPublishing/Data-Engineering-with-AWS-Cookbook/tree/main/Chapter01.

Controlling access to S3 buckets

Controlling access to S3 buckets through policies and IAM roles is crucial for maintaining the security and integrity of your objects and data stored in Amazon S3. By defining granular permissions and access controls, you can ensure that only authorized users or services have the necessary privileges to interact with your S3 resources. You can restrict permissions according to your requirements by precisely defining who can access your data, what actions they can take, and under what conditions. This fine-grained access control helps protect sensitive data, prevent unauthorized modifications, and mitigate the risk of accidental or malicious actions.

AWS Identity and Access Management (IAM) allows you to create an entity referred to as an IAM identity, which can be granted permission to perform specific actions on your AWS account. This entity can be a person or an application. You can create this identity as an IAM role, which is designed to be assumed by any entity that needs it. Alternatively, you can create IAM users, which represent individual people and are usually used for granting long-term access to specific users. IAM users can be grouped into an IAM group, allowing permissions to be assigned at the group level and inherited by all member users. IAM policies are sets of permissions that can be attached to an IAM identity to grant specific access rights.

In this recipe, we will learn how to create a policy that lets us view all the buckets in the account, gives read access to the contents of one specific bucket, and gives write access to one of its folders.

Getting ready

For this recipe, you need to have an IAM user, role, or group to which you want to grant access. You also need to have an S3 bucket with a folder to grant access to.

To learn how to create IAM identities, go to https://docs.aws.amazon.com/IAM/latest/UserGuide/id.html.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the IAM console.
  2. Choose Policies from the navigation pane on the left and choose Create policy.
  3. Choose the JSON tab to provide the policy in JSON format and replace the existing JSON with this policy:
    {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "AllowListBuckets",
              "Effect": "Allow",
              "Action": [
                  "s3:ListAllMyBuckets"
              ],
              "Resource": "*"
          },
          {
              "Sid": "AllowBucketListing",
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket",
                  "s3:GetBucketLocation"
              ],
              "Resource": [
                  "arn:aws:s3:::<bucket-name>"
              ]
          },
          {
              "Sid": "AllowFolderAccess",
              "Effect": "Allow",
              "Action": [
                  "s3:GetObject",
                  "s3:PutObject",
                  "s3:DeleteObject"
              ],
              "Resource": [
                  "arn:aws:s3:::<bucket-name>/<folder-name>/*"
              ]
          }
      ]
    }
  4. Provide a policy name and, optionally, a description of the policy in the respective fields.
  5. Click on Create Policy.

Now, you can attach this policy to an IAM role, user, or group. However, exercise caution and ensure access is granted only as necessary; avoid providing admin access policies to regular users.
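
If you prefer to script this step, here is a minimal AWS CLI sketch of the same flow; the policy name, local file name, and role name are placeholders chosen for illustration:

    # Create the managed policy from the JSON document saved locally (placeholder file name)
    aws iam create-policy \
        --policy-name S3FolderAccessPolicy \
        --policy-document file://s3-folder-access.json

    # Attach the policy to an existing IAM role (placeholder role name and account ID)
    aws iam attach-role-policy \
        --role-name data-engineer-role \
        --policy-arn arn:aws:iam::<account-id>:policy/S3FolderAccessPolicy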

How it works…

An IAM policy comprises three key elements:

  • Effect: This specifies whether the policy allows or denies access
  • Action: This details the specific actions being allowed or denied
  • Resource: This identifies the resources to which the actions apply

A single statement can apply multiple actions to multiple resources. In this recipe, we’ve defined three statements:

  • The AllowListBuckets statement gives access to list all buckets in the AWS account
  • The AllowBucketListing statement gives access to list the content of a specific S3 bucket
  • The AllowFolderAccess statement gives access to upload, download, and delete objects from a specific folder

There’s more…

If you want to make sure that no access is given to a specific bucket or object in your bucket, you can use a deny statement, as shown here:

{
    "Sid": "DenyListBucketFolder",
    "Effect": "Deny",
    "Action": [
        "s3:*"
    ],
    "Resource": [
        "arn:aws:s3:::<bucket-name>/<folder-name>/*"
    ]
}

Instead of using an IAM policy to set up permissions for your bucket, you can use S3 bucket policies, which can be found on the Permissions tab of the bucket. Bucket policies can be used when you’re trying to set up access at the bucket level, regardless of the IAM role or user.
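
As a rough sketch, a bucket policy can also be applied from the AWS CLI, assuming the policy document is saved locally (the file name is a placeholder):

    # Attach a bucket policy stored in a local JSON file to the bucket
    aws s3api put-bucket-policy \
        --bucket <bucket-name> \
        --policy file://bucket-policy.json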

See also

Storage types in S3 for optimized storage costs

Amazon S3 offers different tiers or classes of storage that allow you to optimize for cost and performance based on your access patterns and data requirements. The default storage class for S3 buckets is S3 Standard, which offers high availability and low latency. For less frequently accessed data, S3 Standard-IA and S3 One Zone-IA can be used. For rarely accessed data, Amazon S3 offers the Glacier archive classes, which are the lowest-cost classes. If you’re not sure how frequently your data will be accessed, S3 Intelligent-Tiering would be optimal as it automatically moves objects between classes based on access patterns. However, be aware that additional costs may be incurred when objects are moved to a higher-cost storage class.

These storage classes provide users with the flexibility to choose the right trade-off between storage costs and access performance based on their specific data storage and retrieval requirements. You can choose the storage class based on your access patterns, durability requirements, and budget considerations. Configuring storage classes at the object level allows for a mix of storage classes within the same bucket. Objects from diverse storage classes, including S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA, can coexist in a single bucket.

In this recipe, we will learn how to enforce the S3 Intelligent-Tiering storage class for an S3 bucket through a bucket policy.

Getting ready

For this recipe, you only need to have an S3 bucket for which you will enforce the storage class.

How to do it…

  1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. Locate and select the S3 bucket on which you want to enable S3 Intelligent-Tiering and navigate to the Permissions tab.
  3. Under the Bucket Policy section, click on Edit.
  4. In the bucket policy editor, add the following statement. Make sure you replace <your_bucket_name> with the actual name of your S3 bucket:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "EnableIntelligentTiering",
          "Effect": "Deny",
          "Principal": {
            "AWS": "*"
          },
          "Action": "s3:PutObject",
          "Resource": "arn:aws:s3:::<your-bucket-name>/*",
          "Condition": {
            "StringNotEquals": {
              "s3:x-amz-storage-class": "INTELLIGENT_TIERING"
            }
          }
        }
      ]
    }
  5. Save the bucket policy by clicking on Save changes.

How it works…

The policy ensures that objects are stored in the Intelligent-Tiering class by denying the PUT operation on the bucket for all users (Principal: *) unless the storage class is set to INTELLIGENT_TIERING. In the console, you can set this by choosing Intelligent-Tiering from the storage class list in the Object properties section. If you’re using the S3 API, add the x-amz-storage-class: INTELLIGENT_TIERING header; when using the AWS CLI, use the --storage-class INTELLIGENT_TIERING parameter.
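
For example, an upload that satisfies this policy could look like the following AWS CLI sketch (the bucket name and file are placeholders):

    # Upload an object with the Intelligent-Tiering storage class set explicitly
    aws s3 cp ./data.csv s3://<your-bucket-name>/data.csv \
        --storage-class INTELLIGENT_TIERING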

There’s more…

Intelligent-Tiering places newly uploaded objects in the Frequent Access tier (equivalent to S3 Standard). If an object hasn’t been accessed for 30 consecutive days, it is moved to the Infrequent Access tier; if it hasn’t been accessed for 90 consecutive days, it is moved to the Archive Instant Access tier. For further cost savings, you can configure Intelligent-Tiering to move your objects to the Archive Access tier and the Deep Archive Access tier if they have not been accessed for a longer period. To do this, follow these steps:

  1. Navigate to the Properties tab for the bucket.
  2. Scroll down to Intelligent-Tiering Archive configurations and click on Create configuration.
  3. Name the configuration and specify whether you want to enable it for all objects in the bucket or on a subset based on a filter and/or tags.
  4. Under Status, click on Enable to enable the configuration directly after you create it.
  5. Under Archive rule actions, enable the Archive Access tier and specify the number of days in which the objects should be moved to this class if they’re not being accessed. The value must be between 90 and 730 days. Similarly, enable the Deep Archive Access tier and set the number of days to a minimum of 180 days. It’s also possible to enable only one of these classes:
Figure 1.1 – Intelligent-Tiering Archive rule action

Figure 1.1 – Intelligent-Tiering Archive rule action

  6. Click on Create to create the configuration.
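
The same archive configuration can also be created from the AWS CLI; the following is a minimal sketch, assuming a configuration named archive-config and the minimum archive thresholds:

    # Enable Archive Access after 90 days and Deep Archive Access after 180 days of no access
    aws s3api put-bucket-intelligent-tiering-configuration \
        --bucket <your-bucket-name> \
        --id archive-config \
        --intelligent-tiering-configuration '{
            "Id": "archive-config",
            "Status": "Enabled",
            "Tierings": [
                {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
                {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
            ]
        }'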

See also

Enforcing encryption on S3 buckets

Amazon S3 encryption increases the level of security and privacy of your data; it helps ensure that only authorized parties can read it. Even if an unauthorized person gains logical or physical access to the data, it remains unreadable if they don’t also obtain the key needed to decrypt it.

S3 supports encrypting data both in transit (as it travels to and from S3) and at rest (while it’s stored on disks in S3 data centers).

For protecting data at rest, you have two options. The first is server-side encryption (SSE), in which Amazon S3 handles the encryption operations on the server side in AWS. By default, Amazon S3 encrypts your data using SSE-S3. However, you can change this to SSE-KMS, which uses KMS keys for encryption, or to SSE-C, where you provide and manage your own encryption key. Alternatively, you can use client-side encryption, where Amazon S3 doesn’t play any role in the encryption process; rather, you are responsible for all the encryption operations.

In this recipe, we’ll learn how to enforce SSE-KMS server-side encryption using customer-managed keys.

Getting ready

For this recipe, you need to have a KMS key in the same region as your bucket to use for encryption. KMS provides a managed key for S3 (aws/s3) that can be utilized for encryption. However, if you desire greater control over the key properties, such as modifying its policies or performing key rotation, you can create a customer-managed key. To do so, follow these steps:

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the AWS Key Management Service (AWS KMS) service.
  2. In the navigation pane, choose Customer managed keys and click on Create key.
  3. For Key type, choose Symmetric, while for Key usage, choose Encrypt and decrypt. Click on Next:
Figure 1.2 – KMS configuration

Figure 1.2 – KMS configuration

  4. Click on Next.
  5. Type an Alias value for the KMS key. This will be the display name. Optionally, you can provide Description and Tags key-value pairs for the key.
  6. Click on Next. Optionally, you can provide Key administrators to administer the key. Click on Finish to create the key.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. In the Buckets list, choose the name of the bucket that you want to change the encryption for and navigate to the Properties tab.
  3. Click on Edit in the Default encryption section.
  4. For Encryption type, choose Server-side encryption with AWS Key Management Service keys (SSE-KMS).
  5. For AWS KMS key, you can select Enter AWS KMS key ARN to enter the key you have created or browse it using Choose from your AWS KMS keys.
  6. Keep Bucket Key enabled and save your changes:
Figure 1.3 – Changing the default encryption

Figure 1.3 – Changing the default encryption

How it works…

By changing the default encryption for your bucket, all newly uploaded objects that don’t specify an encryption setting will be encrypted using the KMS key you have provided. Existing objects in your bucket will not be affected. Enabling the bucket key leads to cost savings on the KMS service calls associated with encrypting or decrypting individual objects. This is achieved by KMS generating a key at the bucket level rather than a separate KMS key for each encrypted object. S3 uses this bucket-level key to generate distinct data keys for objects within the bucket, thereby eliminating the need for additional KMS requests to complete encryption operations.
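
If you want to script this change, the following AWS CLI sketch sets the same default encryption (the bucket name and key ARN are placeholders):

    # Set SSE-KMS as the default encryption and keep the bucket key enabled
    aws s3api put-bucket-encryption \
        --bucket <your-bucket-name> \
        --server-side-encryption-configuration '{
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "<your-kms-key-arn>"
                },
                "BucketKeyEnabled": true
            }]
        }'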

There’s more…

By following this recipe, newly uploaded objects will be encrypted with SSE-KMS, but only if they don’t specify their own encryption setting. You can require that objects specify SSE-KMS encryption in the PUT operation by using a bucket policy, as shown here:

  1. Navigate to the bucket’s Permissions tab.
  2. Go to the Bucket Policy section and click on Edit.
  3. Paste the following policy. Make sure you replace <your-bucket-name> with the actual name of your S3 bucket and <your-kms-key-arn> with the Amazon Resource Name (ARN) of your KMS key:
    {
      "Version": "2012-10-17",
      "Id": "EnforceSSE-KMS",
      "Statement": [
          {
              "Sid": "DenyNonKmsEncrypted",
              "Effect": "Deny",
              "Principal": "*",
              "Action": "s3:PutObject",
              "Resource": "arn:aws:s3:::<your-bucket-name>/*",
              "Condition": {
                  "StringNotEquals": {
                      "s3:x-amz-server-side-encryption": "aws:kms"
                  }
              }
          },
          {
              "Sid": "AllowKmsEncrypted",
              "Effect": "Allow",
              "Principal": "*",
              "Action": "s3:PutObject",
              "Resource": "arn:aws:s3:::<your-bucket-name>/*",
              "Condition": {
                  "StringEquals": {
                      "s3:x-amz-server-side-encryption": "aws:kms",
                      "s3:x-amz-server-side-encryption-aws-kms-key-id": "<your-kms-key-arn>"
                  }
              }
          }
      ]
    }
  4. Save your changes.

This policy contains two statements. The first statement (DenyNonKmsEncrypted) denies the s3:PutObject action for any request that does not include SSE-KMS encryption. The second statement (AllowKmsEncrypted) only allows the s3:PutObject action when the request includes SSE-KMS encryption and the specified KMS key.
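
As a usage sketch, an upload that satisfies this bucket policy could look like this from the AWS CLI (the file, bucket name, and key ARN are placeholders):

    # Upload an object with SSE-KMS encryption and the expected KMS key
    aws s3 cp ./data.csv s3://<your-bucket-name>/data.csv \
        --sse aws:kms \
        --sse-kms-key-id <your-kms-key-arn>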

See also

Setting up retention policies for your objects

Amazon S3’s storage lifecycle allows you to manage the lifecycle of objects in an S3 bucket based on predefined rules. The lifecycle management feature consists of two main actions: transitions and expiration. Transitions involve automatically moving objects between different storage classes after a defined duration, which helps optimize costs by storing less frequently accessed data in a cheaper storage class. Expiration, on the other hand, allows you to set rules that automatically delete objects from an S3 bucket after a specified duration. Additionally, you can apply a combination of transition and expiration actions to objects. Amazon S3’s storage lifecycle provides flexibility and ease of management, and it helps organizations optimize storage costs while ensuring that data is stored according to its relevance and access patterns.

In this recipe, we will learn how to set up a lifecycle policy to archive objects in S3 Glacier after a certain period and then expire them.

Getting ready

To complete this recipe, you need to have a Glacier vault, which is a separate storage container that can be used to store archives, independent from S3. You can create one by following these steps:

  1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the Glacier service.
  2. Click on Create vault to start creating a new Glacier vault.
  3. Provide a unique and descriptive name for your vault in the Vault name field.
  4. Optionally, you can choose to receive notifications for events by clicking Turn on notifications under the Event notifications section.
  5. Click on Create to create the vault.

How to do it…

  1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. Select the desired bucket for which you want to configure the lifecycle policy and navigate to the Management tab.
  3. In the left panel, select Lifecycle and click on Create lifecycle rule.
  4. Under Rule name, name the lifecycle rule to identify it.
  5. Under Choose a rule scope, you can choose Apply to all objects in the bucket or Limit the scope of this rule using one or more filters to specify the objects for which the rule will be applied. You can use one of the following filters or a combination of them:
    • Filter objects based on prefixes (for example, logs)
    • Filter objects based on tags; you can add multiple key-value pair tags to filter on
    • Filter objects based on object size by setting Specify minimum object size and/or Specify maximum object size and specifying the size value and unit

    The following screenshot shows a rule that’s been restricted to a set of objects based on a prefix:

Figure 1.4 – Lifecycle rule configuration

Figure 1.4 – Lifecycle rule configuration

  6. Under Lifecycle rule actions, select the following options:
    • Move current versions of objects between storage classes. Then, choose one of the Glacier classes and set Days after object creation in which the object will be transitioned (for example, 60 days).
    • Expire current versions of objects. Then, set Days after object creation in which the object will expire. Choose a value higher than the one you set for transitioning the object to Glacier (for example, 100).

    Review the transition and expiration actions you have set and click on Create rule to apply the lifecycle policy to the bucket:

Figure 1.5 – Reviewing the lifecycle rule

Figure 1.5 – Reviewing the lifecycle rule

Note

It may take some time for the lifecycle rule to be applied to all the selected objects, depending on the size of the bucket and the number of objects. The rule will affect existing files, not just new ones, so ensure that no applications are accessing files that will be archived or deleted as they will no longer be accessible via direct S3 retrieval.

How it works…

After you save the lifecycle rule, Amazon S3 will periodically evaluate it to find objects that meet the criteria specified in the lifecycle rule. In this recipe, the object will remain in its default storage type for the specified period (for example, 60 days) after which it will automatically be moved to the Glacier storage class. This transition is handled transparently, and the object’s metadata and properties remain unchanged. Once the objects are transitioned to Glacier, they are stored in a Glacier vault and become part of the Glacier storage infrastructure. Objects will then remain in Glacier for the remaining period of expiry (for example, 40 days), after which they will expire and be permanently deleted from your S3 bucket.

Please note that once the objects have expired, they will be queued for deletion, so it might take a few days after the object reaches the end of its lifetime for it to be deleted.

There’s more…

Lifecycle configuration can also be specified as an XML or JSON document when using the S3 API or AWS CLI, which can be helpful if you are planning to apply the same lifecycle rules to multiple buckets. You can read more on setting this up at https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html.
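
As an illustration, the following AWS CLI sketch reproduces this recipe’s rule, assuming a logs/ prefix and the 60- and 100-day thresholds used above:

    # Transition objects under logs/ to Glacier after 60 days and expire them after 100 days
    aws s3api put-bucket-lifecycle-configuration \
        --bucket <your-bucket-name> \
        --lifecycle-configuration '{
            "Rules": [{
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 60, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 100}
            }]
        }'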

See also

Versioning your data

Amazon S3 versioning refers to maintaining multiple variants of an object at the same time in the same bucket. Versioning provides you with an additional layer of protection by giving you a way to recover from unintended overwrites and accidental deletions as well as application failures.

S3 Object Versioning is not enabled by default and has to be explicitly enabled for each bucket. Once enabled, versioning cannot be disabled; it can only be suspended. When versioning is enabled, you will be able to preserve, retrieve, and restore any version of an object stored in the bucket using its version ID. Every version of an object is the whole object, not a delta from the previous version, and you can set permissions at the version level, so different versions of the same object can have different permissions.

In this recipe, we’ll learn how to delete the current version of an object to make the previous one the current version.

Getting ready

For this recipe, you need to have a version-enabled bucket with an object that has at least two versions.

You can enable versioning for your bucket by going to the bucket’s Properties tab, editing the Bucket Versioning area, and setting it to Enable:

Figure 1.6 – Enabling bucket versioning

Figure 1.6 – Enabling bucket versioning

You can create a new version of an object by simply uploading a file with the same name to the versioning-enabled bucket.

It’s important to note that enabling versioning for a bucket is irreversible. Once versioning is enabled, it will be applied to all existing and future objects in that bucket. So, before enabling versioning, make sure that your application or workflow is compatible with object versioning.

Enabling versioning for the first time will take time to take effect, so we recommend waiting 15 minutes before performing any write operation on objects in the bucket.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. In the Buckets list, select the S3 bucket that contains the object for which you want to set the previous version as the current one.
  3. In the Objects tab, click on Show versions. Here, you can view all your object versions:
Figure 1.7 – Object versions

Figure 1.7 – Object versions

  4. Select the current version of the object that you want to delete. It’s the top-most version with the latest modified date.
  5. Click on the Delete button and type permanently delete when prompted on the next screen.

    After deleting the current version, the previous version will automatically become the latest version:

Figure 1.8 – Object versions after version deletion

Figure 1.8 – Object versions after version deletion

  6. Verify that the previous version is now the latest version by checking the Last modified timestamps or verifying this through object listing, metadata, or download.

How it works…

Once you enable bucket versioning, each object in the bucket gets a version ID that uniquely identifies it; in buckets that have never had versioning enabled, objects have a version ID of null. The older versions of an object become non-current but continue to exist and remain accessible. When you delete the current version of an object by specifying its version ID, it is permanently removed, and S3 automatically promotes the previous version to be the current one. If you delete an object without specifying a version ID, Amazon S3 doesn’t delete it permanently; instead, it inserts a delete marker, which becomes the current object version. However, you can still restore the previous versions:

Figure 1.9 – Object with a delete marker

Figure 1.9 – Object with a delete marker
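
The same steps can be performed from the AWS CLI; here is a rough sketch, with the bucket name, object key, and version ID as placeholders:

    # List all versions of an object to find the version ID of the current version
    aws s3api list-object-versions \
        --bucket <your-bucket-name> \
        --prefix <object-key>

    # Permanently delete a specific version; the next newest version becomes current
    aws s3api delete-object \
        --bucket <your-bucket-name> \
        --key <object-key> \
        --version-id <version-id>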

There’s more…

S3 rates apply to every version of an object that’s stored and requested, so keeping non-current versions of objects can increase your storage cost. You can use lifecycle rules to archive the non-current versions or permanently delete them after a certain period and keep the bucket clean from unnecessary object versions.

Follow these steps to add a lifecycle rule to delete non-current versions after a certain period:

  1. Go to the bucket’s Management tab and click on the Lifecycle configuration.
  2. Click on the Add lifecycle rule button to create a new rule.
  3. Provide a unique name for the rule.
  4. Under Apply rule to, select the appropriate resources (for example, the entire bucket or specific prefixes).
  5. Set the action to Permanently delete non-current versions.
  6. Specify Days after objects become noncurrent, after which the deletion will be executed. Optionally, you can specify Number of newer versions to retain, which means that the specified number of newer versions will be kept for the object and all others will be deleted once they are eligible for deletion based on the specified period.
  7. Click on Save to save the lifecycle rule.
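
As a sketch, an equivalent rule can be defined from the AWS CLI; the rule ID and the 30-day/3-version values here are illustrative only:

    # Expire non-current versions 30 days after they become non-current, keeping the 3 newest
    # Note: this call replaces any existing lifecycle configuration on the bucket
    aws s3api put-bucket-lifecycle-configuration \
        --bucket <your-bucket-name> \
        --lifecycle-configuration '{
            "Rules": [{
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 30,
                    "NewerNoncurrentVersions": 3
                }
            }]
        }'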

See also

Replicating your data

Amazon S3 replication is an automatic, asynchronous process that copies objects to one or more destination buckets. Replication can be configured across buckets in the same AWS region with Same-Region Replication (SRR), which can be useful for scenarios such as isolating different workloads, segregating data for different teams, or achieving compliance requirements. Replication can also be configured for buckets across different AWS regions with Cross-Region Replication (CRR), which helps reduce latency for accessing data, especially for enterprises with locations in many geographies, by maintaining copies of objects in multiple regions. It also provides compliance and data redundancy for improved performance, availability, and disaster recovery capabilities.

In this recipe, we’ll learn how to set up replication between two buckets in different AWS regions and the same AWS account.

Getting ready

You need to have an S3 bucket in the destination AWS region to act as a target for the replication. Also, S3 versioning must be enabled for both the source and destination buckets.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. In the Buckets list, choose the source bucket you want to replicate.
  3. Go to the Management tab and select Create replication rule under Replication rules.
  4. Under Replication rule name in the Replication rule configuration section, give your rule a unique name.
  5. Under Status, either keep it Enabled for the rule to take effect once you save it or change it to Disabled to enable it later as required:
Figure 1.10 – Replication rule configuration

Figure 1.10 – Replication rule configuration

  6. If this is the first replication rule for the bucket, Priority will be set to 0. Subsequent rules that are added will be assigned higher priorities. When multiple rules share the same destination, the rule with the highest priority takes precedence during execution, typically the one created last. If you wish to control the priority for each rule, you can achieve this by setting the rule using XML. For guidance on how to configure this, refer to the See also section.
  7. In the Source bucket section, you have the option to replicate all objects in the bucket by selecting Apply to all objects in the bucket or you can narrow it down to specific objects by selecting Limit the scope of this rule using one or more filters and specifying a Prefix value (for example, logs_ or logs/) to filter objects. Additionally, you have the option to replicate objects based on their tags. Simply choose Add tag and input key-value pairs. This process can be repeated so that you can include multiple tags:
Figure 1.11 – Source bucket configuration

Figure 1.11 – Source bucket configuration

  8. Under Destination, select Choose a bucket in this account and enter or browse for the destination bucket name.
  9. Under IAM role, select Choose from existing IAM roles, then choose Create new role from the drop-down list.
  10. Under Destination storage class, you can select Change the storage class for the replicated objects and choose one of the storage classes to be set for the replicated objects in the destination bucket.
  11. Click on Save to save your changes.

How it works…

By adding this replication rule, you grant the source bucket permission to replicate objects to the destination bucket in the specified region. Once the replication process is complete, the destination bucket will contain a copy of the objects from the source bucket. The objects in the destination bucket will have the same ownership, permissions, and metadata as the source objects. When you enable replication on your bucket, several background processes occur to facilitate it. S3 continuously monitors changes to objects in your source bucket; once a change is detected, S3 generates a replication request for the corresponding objects and initiates the transfer of data from the source to the destination bucket.
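
For teams that prefer scripting, here is a rough AWS CLI sketch of an equivalent replication rule; the role ARN and bucket names are placeholders, and the IAM role must already allow S3 to replicate objects on your behalf:

    # Replicate objects under logs/ from the source bucket to the destination bucket
    aws s3api put-bucket-replication \
        --bucket <source-bucket-name> \
        --replication-configuration '{
            "Role": "arn:aws:iam::<account-id>:role/<replication-role-name>",
            "Rules": [{
                "ID": "replicate-logs",
                "Status": "Enabled",
                "Priority": 0,
                "Filter": {"Prefix": "logs/"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::<destination-bucket-name>"}
            }]
        }'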

There’s more…

There are additional options that you can enable while setting up the replication rule under Additional replication options. The Replication metrics option enables you to monitor the replication progress with S3 Replication metrics by tracking bytes pending, operations pending, and replication latency. The Replication Time Control (RTC) option can be beneficial if you have a strict service-level agreement (SLA) for data replication, as it ensures that 99.99% of your objects are replicated within a 15-minute timeframe. It also enables replication metrics to notify you of any instances of delayed object replication. The Delete marker replication option will replicate object versions with a delete marker. Finally, the Replica modification sync option will replicate the metadata changes of objects.

See also


Key benefits

  • Get up to speed with the different AWS technologies for data engineering
  • Learn the different aspects and considerations of building data lakes, such as security, storage, and operations
  • Get hands-on with key AWS services such as Glue, EMR, Redshift, QuickSight, and Athena for practical learning
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Performing data engineering with Amazon Web Services (AWS) combines AWS's scalable infrastructure with robust data processing tools, enabling efficient data pipelines and analytics workflows. This comprehensive guide to AWS data engineering will teach you all you need to know about data lake management, pipeline orchestration, and serving layer construction. Through clear explanations and hands-on exercises, you’ll master essential AWS services such as Glue, EMR, Redshift, QuickSight, and Athena. Additionally, you’ll explore various data platform topics such as data governance, data quality, DevOps, CI/CD, planning and performing data migration, and creating Infrastructure as Code. As you progress, you will gain insights into how to enrich your platform and use various AWS cloud services such as AWS EventBridge, AWS DataZone, and AWS SCT and DMS to solve data platform challenges. Each recipe in this book is tailored to a daily challenge that a data engineering team faces while building a cloud platform. By the end of this book, you will be well-versed in AWS data engineering and have gained proficiency in key AWS services and data processing techniques. You will develop the necessary skills to tackle large-scale data challenges with confidence.

Who is this book for?

If you're involved in designing, building, or overseeing data solutions on AWS, this book provides proven strategies for addressing challenges in large-scale data environments. Data engineers and big data professionals looking to enhance their understanding of AWS features for optimizing their workflows will find value here, even if they're new to the platform. Basic familiarity with AWS security (users and roles) and the command shell is recommended.

What you will learn

  • Define your centralized data lake solution, and secure and operate it at scale
  • Identify the most suitable AWS solution for your specific needs
  • Build data pipelines using multiple ETL technologies
  • Discover how to handle data orchestration and governance
  • Explore how to build a high-performing data serving layer
  • Delve into DevOps and data quality best practices
  • Migrate your data from on-premises to AWS

Product Details

Publication date: Nov 29, 2024
Length: 528 pages
Edition: 1st
Language: English
ISBN-13: 9781805127284



Table of Contents

15 Chapters
Chapter 1: Managing Data Lake Storage
Chapter 2: Sharing Your Data Across Environments and Accounts
Chapter 3: Ingesting and Transforming Your Data with AWS Glue
Chapter 4: A Deep Dive into AWS Orchestration Frameworks
Chapter 5: Running Big Data Workloads with Amazon EMR
Chapter 6: Governing Your Platform
Chapter 7: Data Quality Management
Chapter 8: DevOps – Defining IaC and Building CI/CD Pipelines
Chapter 9: Monitoring Data Lake Cloud Infrastructure
Chapter 10: Building a Serving Layer with AWS Analytics Services
Chapter 11: Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads
Chapter 12: Harnessing the Power of AWS for Seamless Data Warehouse Migration
Chapter 13: Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS
Index
Other Books You May Enjoy

FAQs

What is the digital copy I get with my Print order?

When you buy any Print edition of our Books, you can redeem (for free) the eBook edition of the Print Book you’ve purchased. This gives you instant access to your book as a PDF or EPUB, or via our online reader experience, as soon as you place your order.

What is the delivery time and cost of a print book?

Shipping Details

USA:


Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K. time will start printing from the next business day, so the estimated delivery times also start from the next day. Orders received after 5 PM U.K. time (in our internal systems) on a business day, or anytime on the weekend, will begin printing the second business day after. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are taxes imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to countries listed under EU27 will not bear customs charges; these are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

Customs duty or localized taxes may apply to shipments delivered to countries outside of the EU27. These charges are levied by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my custom duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., Packt Publishing agrees to replace your printed book if it arrives damaged or with a material defect); outside of those cases, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund for the problem items (damaged, defective, or incorrect).
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of the damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal