Data Engineering with AWS Cookbook: A recipe-based approach to help you tackle data engineering problems with AWS services

By Trâm Ngọc Phạm, Gonzalo Herreros González, Viquar Khan, Huda Nofal

Managing Data Lake Storage

Amazon Simple Storage Service (Amazon S3) is a highly scalable and secure cloud storage service. It allows you to store and retrieve any amount of data at any time from anywhere in the world. S3 buckets help enterprises and individuals meet their data backup and delivery needs and serve a variety of use cases, including but not limited to web and mobile applications, big data analytics, data lakes, and data backup and archiving.

In this chapter, we will learn how to keep data secure in S3 buckets and how to configure buckets in a way that best serves your use case from a performance and cost perspective.

The following recipes will be covered in this chapter:

  • Controlling access to S3 buckets
  • Storage types in S3 for optimized storage costs
  • Enforcing encryption of S3 buckets
  • Setting up retention policies for your objects
  • Versioning your data
  • Replicating your data
  • Monitoring your S3 buckets

Technical requirements

The recipes in this chapter assume you have an S3 bucket on which you have admin permissions. If you don’t have admin permissions on the bucket, you will need to configure the required permissions for each recipe as needed.

You can find the code files for this chapter in this book’s GitHub repository: https://github.com/PacktPublishing/Data-Engineering-with-AWS-Cookbook/tree/main/Chapter01.

Controlling access to S3 buckets

Controlling access to S3 buckets through policies and IAM roles is crucial for maintaining the security and integrity of your objects and data stored in Amazon S3. By defining granular permissions and access controls, you can ensure that only authorized users or services have the necessary privileges to interact with your S3 resources. You can restrict permissions according to your requirements by precisely defining who can access your data, what actions they can take, and under what conditions. This fine-grained access control helps protect sensitive data, prevent unauthorized modifications, and mitigate the risk of accidental or malicious actions.

AWS Identity and Access Management (IAM) allows you to create an entity referred to as an IAM identity, which can be granted permission to perform specific actions on your AWS account. This entity can be a person or an application. You can create this identity as an IAM role, which is designed to be assumed by any entity that needs it. Alternatively, you can create IAM users, which represent individual people and are usually used for granting long-term access to specific users. IAM users can be grouped into an IAM group, allowing permissions to be assigned at the group level and inherited by all member users. IAM policies are sets of permissions that can be attached to an IAM identity to grant specific access rights.

In this recipe, we will learn how to create a policy that lets us view all the buckets in the account, gives read access to the contents of one specific bucket, and gives write access to one of its folders.

Getting ready

For this recipe, you need to have an IAM user, role, or group to which you want to grant access. You also need to have an S3 bucket with a folder to grant access to.

To learn how to create IAM identities, go to https://docs.aws.amazon.com/IAM/latest/UserGuide/id.html.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the IAM console.
  2. Choose Policies from the navigation pane on the left and choose Create policy.
  3. Choose the JSON tab to provide the policy in JSON format and replace the existing JSON with this policy:
    {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "AllowListBuckets",
              "Effect": "Allow",
              "Action": [
                  "s3:ListAllMyBuckets"
              ],
              "Resource": "*"
          },
          {
              "Sid": "AllowBucketListing",
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket",
                  "s3:GetBucketLocation"
              ],
              "Resource": [
                  "arn:aws:s3:::<bucket-name>"
              ]
          },
          {
              "Sid": "AllowFolderAccess",
              "Effect": "Allow",
              "Action": [
                  "s3:GetObject",
                  "s3:PutObject",
                  "s3:DeleteObject"
              ],
              "Resource": [
                  "arn:aws:s3:::<bucket-name>/<folder-name>/*"
              ]
          }
      ]
    }
  4. Provide a policy name and, optionally, a description of the policy in the respective fields.
  5. Click on Create Policy.

Now, you can attach this policy to an IAM role, user, or group. However, exercise caution and ensure access is granted only as necessary; avoid providing admin access policies to regular users.
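
If you prefer to script this step, here is a minimal AWS CLI sketch of the same flow; the policy name, local file name, and role name are placeholders chosen for illustration:

    # Create the managed policy from the JSON document saved locally (placeholder file name)
    aws iam create-policy \
        --policy-name S3FolderAccessPolicy \
        --policy-document file://s3-folder-access.json

    # Attach the policy to an existing IAM role (placeholder role name and account ID)
    aws iam attach-role-policy \
        --role-name data-engineer-role \
        --policy-arn arn:aws:iam::<account-id>:policy/S3FolderAccessPolicy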

How it works…

An IAM policy comprises three key elements:

  • Effect: This specifies whether the policy allows or denies access
  • Action: This details the specific actions being allowed or denied
  • Resource: This identifies the resources to which the actions apply

A single statement can apply multiple actions to multiple resources. In this recipe, we’ve defined three statements:

  • The AllowListBuckets statement gives access to list all buckets in the AWS account
  • The AllowBucketListing statement gives access to list the content of a specific S3 bucket
  • The AllowFolderAccess statement gives access to upload, download, and delete objects from a specific folder

There’s more…

If you want to make sure that no access is given to a specific bucket or object in your bucket, you can use a deny statement, as shown here:

{
    "Sid": "DenyListBucketFolder",
    "Effect": "Deny",
    "Action": [
        "s3:*"
    ],
    "Resource": [
        "arn:aws:s3:::<bucket-name>/<folder-name>/*"
    ]
}

Instead of using an IAM policy to set up permissions for your bucket, you can use S3 bucket policies, which can be found on the Permissions tab of the bucket. Bucket policies can be used when you’re trying to set up access at the bucket level, regardless of the IAM role or user.
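
As a rough sketch, a bucket policy can also be applied from the AWS CLI, assuming the policy document is saved locally (the file name is a placeholder):

    # Attach a bucket policy stored in a local JSON file to the bucket
    aws s3api put-bucket-policy \
        --bucket <bucket-name> \
        --policy file://bucket-policy.json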

See also

Storage types in S3 for optimized storage costs

Amazon S3 offers different tiers or classes of storage that allow you to optimize for cost and performance based on your access patterns and data requirements. The default storage class for S3 buckets is S3 Standard, which offers high availability and low latency. For less frequently accessed data, S3 Standard-IA and S3 One Zone-IA can be used. For rarely accessed data, Amazon S3 offers the Glacier archive classes, which are the lowest-cost classes. If you’re not sure how frequently your data will be accessed, S3 Intelligent-Tiering would be optimal as it automatically moves objects between classes based on access patterns. However, be aware that additional costs may be incurred when objects are moved to a higher-cost storage class.

These storage classes provide users with the flexibility to choose the right trade-off between storage costs and access performance based on their specific data storage and retrieval requirements. You can choose the storage class based on your access patterns, durability requirements, and budget considerations. Configuring storage classes at the object level allows for a mix of storage classes within the same bucket. Objects from diverse storage classes, including S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA, can coexist in a single bucket.

In this recipe, we will learn how to enforce the S3 Intelligent-Tiering storage class for an S3 bucket through a bucket policy.

Getting ready

For this recipe, you only need to have an S3 bucket for which you will enforce the storage class.

How to do it…

  1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. Locate and select the S3 bucket on which you want to enable S3 Intelligent-Tiering and navigate to the Permissions tab.
  3. Under the Bucket Policy section, click on Edit.
  4. In the bucket policy editor, add the following statement. Make sure you replace <your_bucket_name> with the actual name of your S3 bucket:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "EnableIntelligentTiering",
          "Effect": "Deny",
          "Principal": {
            "AWS": "*"
          },
          "Action": "s3:PutObject",
          "Resource": "arn:aws:s3:::<your-bucket-name>/*",
          "Condition": {
            "StringNotEquals": {
              "s3:x-amz-storage-class": "INTELLIGENT_TIERING"
            }
          }
        }
      ]
    }
  5. Save the bucket policy by clicking on Save changes.

How it works…

The policy ensures that objects are stored in the Intelligent-Tiering class by denying the PUT operation on the bucket for all users (Principal: *) unless the storage class is set to INTELLIGENT_TIERING. In the console, you can set this by choosing Intelligent-Tiering from the storage class list in the Object properties section. If you’re using the S3 API, add the x-amz-storage-class: INTELLIGENT_TIERING header; when using the AWS CLI, use the --storage-class INTELLIGENT_TIERING parameter.
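
For example, an upload that satisfies this policy could look like the following AWS CLI sketch (the bucket name and file are placeholders):

    # Upload an object with the Intelligent-Tiering storage class set explicitly
    aws s3 cp ./data.csv s3://<your-bucket-name>/data.csv \
        --storage-class INTELLIGENT_TIERING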

There’s more…

Intelligent-Tiering places newly uploaded objects in the Frequent Access tier (equivalent to S3 Standard). If an object hasn’t been accessed for 30 consecutive days, it is moved to the Infrequent Access tier; if it hasn’t been accessed for 90 consecutive days, it is moved to the Archive Instant Access tier. For further cost savings, you can configure Intelligent-Tiering to move your objects to the Archive Access tier and the Deep Archive Access tier if they have not been accessed for a longer period. To do this, follow these steps:

  1. Navigate to the Properties tab for the bucket.
  2. Scroll down to Intelligent-Tiering Archive configurations and click on Create configuration.
  3. Name the configuration and specify whether you want to enable it for all objects in the bucket or on a subset based on a filter and/or tags.
  4. Under Status, click on Enable to enable the configuration directly after you create it.
  5. Under Archive rule actions, enable the Archive Access tier and specify the number of days in which the objects should be moved to this class if they’re not being accessed. The value must be between 90 and 730 days. Similarly, enable the Deep Archive Access tier and set the number of days to a minimum of 180 days. It’s also possible to enable only one of these classes:
Figure 1.1 – Intelligent-Tiering Archive rule action

Figure 1.1 – Intelligent-Tiering Archive rule action

  6. Click on Create to create the configuration.
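
The same archive configuration can also be created from the AWS CLI; the following is a minimal sketch, assuming a configuration named archive-config and the minimum archive thresholds:

    # Enable Archive Access after 90 days and Deep Archive Access after 180 days of no access
    aws s3api put-bucket-intelligent-tiering-configuration \
        --bucket <your-bucket-name> \
        --id archive-config \
        --intelligent-tiering-configuration '{
            "Id": "archive-config",
            "Status": "Enabled",
            "Tierings": [
                {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
                {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
            ]
        }'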

See also

Enforcing encryption on S3 buckets

Amazon S3 encryption increases the level of security and privacy of your data; it helps ensure that only authorized parties can read it. Even if an unauthorized person gains logical or physical access to the data, it remains unreadable if they don’t also obtain the key needed to decrypt it.

S3 supports encrypting data both in transit (as it travels to and from S3) and at rest (while it’s stored on disks in S3 data centers).

For protecting data at rest, you have two options. The first is server-side encryption (SSE), in which Amazon S3 handles the encryption operations on the server side in AWS. By default, Amazon S3 encrypts your data using SSE-S3. However, you can change this to SSE-KMS, which uses KMS keys for encryption, or to SSE-C, where you provide and manage your own encryption key. Alternatively, you can use client-side encryption, where Amazon S3 doesn’t play any role in the encryption process; rather, you are responsible for all the encryption operations.

In this recipe, we’ll learn how to enforce SSE-KMS server-side encryption using customer-managed keys.

Getting ready

For this recipe, you need to have a KMS key in the same region as your bucket to use for encryption. KMS provides a managed key for S3 (aws/s3) that can be utilized for encryption. However, if you desire greater control over the key properties, such as modifying its policies or performing key rotation, you can create a customer-managed key. To do so, follow these steps:

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the AWS Key Management Service (AWS KMS) service.
  2. In the navigation pane, choose Customer managed keys and click on Create key.
  3. For Key type, choose Symmetric, while for Key usage, choose Encrypt and decrypt. Click on Next:
Figure 1.2 – KMS configuration

Figure 1.2 – KMS configuration

  4. Click on Next.
  5. Type an Alias value for the KMS key. This will be the display name. Optionally, you can provide Description and Tags key-value pairs for the key.
  6. Click on Next. Optionally, you can provide Key administrators to administer the key. Click on Finish to create the key.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. In the Buckets list, choose the name of the bucket that you want to change the encryption for and navigate to the Properties tab.
  3. Click on Edit in the Default encryption section.
  4. For Encryption type, choose Server-side encryption with AWS Key Management Service keys (SSE-KMS).
  5. For AWS KMS key, you can select Enter AWS KMS key ARN to enter the key you have created or browse it using Choose from your AWS KMS keys.
  6. Keep Bucket Key enabled and save your changes:
Figure 1.3 – Changing the default encryption

Figure 1.3 – Changing the default encryption

How it works…

By changing the default encryption for your bucket, all newly uploaded objects that don’t specify an encryption setting will be encrypted using the KMS key you have provided. Existing objects in your bucket will not be affected. Enabling the bucket key leads to cost savings on the KMS service calls associated with encrypting or decrypting individual objects. This is achieved by KMS generating a key at the bucket level rather than a separate KMS key for each encrypted object. S3 uses this bucket-level key to generate distinct data keys for objects within the bucket, thereby eliminating the need for additional KMS requests to complete encryption operations.
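
If you want to script this change, the following AWS CLI sketch sets the same default encryption (the bucket name and key ARN are placeholders):

    # Set SSE-KMS as the default encryption and keep the bucket key enabled
    aws s3api put-bucket-encryption \
        --bucket <your-bucket-name> \
        --server-side-encryption-configuration '{
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "<your-kms-key-arn>"
                },
                "BucketKeyEnabled": true
            }]
        }'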

There’s more…

By following this recipe, newly uploaded objects will be encrypted with SSE-KMS, but only if they don’t specify their own encryption setting. You can require that objects specify SSE-KMS encryption in the PUT operation by using a bucket policy, as shown here:

  1. Navigate to the bucket’s Permissions tab.
  2. Go to the Bucket Policy section and click on Edit.
  3. Paste the following policy. Make sure you replace <your-bucket-name> with the actual name of your S3 bucket and <your-kms-key-arn> with the Amazon Resource Name (ARN) of your KMS key:
    {
      "Version": "2012-10-17",
      "Id": "EnforceSSE-KMS",
      "Statement": [
          {
              "Sid": "DenyNonKmsEncrypted",
              "Effect": "Deny",
              "Principal": "*",
              "Action": "s3:PutObject",
              "Resource": "arn:aws:s3:::<your-bucket-name>/*",
              "Condition": {
                  "StringNotEquals": {
                      "s3:x-amz-server-side-encryption": "aws:kms"
                  }
              }
          },
          {
              "Sid": "AllowKmsEncrypted",
              "Effect": "Allow",
              "Principal": "*",
              "Action": "s3:PutObject",
              "Resource": "arn:aws:s3:::<your-bucket-name>/*",
              "Condition": {
                  "StringEquals": {
                      "s3:x-amz-server-side-encryption": "aws:kms",
                      "s3:x-amz-server-side-encryption-aws-kms-key-id": "<your-kms-key-arn>"
                  }
              }
          }
      ]
    }
  4. Save your changes.

This policy contains two statements. The first statement (DenyNonKmsEncrypted) denies the s3:PutObject action for any request that does not include SSE-KMS encryption. The second statement (AllowKmsEncrypted) only allows the s3:PutObject action when the request includes SSE-KMS encryption and the specified KMS key.
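
As a usage sketch, an upload that satisfies this bucket policy could look like this from the AWS CLI (the file, bucket name, and key ARN are placeholders):

    # Upload an object with SSE-KMS encryption and the expected KMS key
    aws s3 cp ./data.csv s3://<your-bucket-name>/data.csv \
        --sse aws:kms \
        --sse-kms-key-id <your-kms-key-arn>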

See also

Setting up retention policies for your objects

Amazon S3’s storage lifecycle allows you to manage the lifecycle of objects in an S3 bucket based on predefined rules. The lifecycle management feature consists of two main actions: transitions and expiration. Transitions involve automatically moving objects between different storage classes after a defined duration, which helps optimize costs by storing less frequently accessed data in a cheaper storage class. Expiration, on the other hand, allows you to set rules that automatically delete objects from an S3 bucket after a specified duration. Additionally, you can apply a combination of transition and expiration actions to objects. Amazon S3’s storage lifecycle provides flexibility and ease of management, and it helps organizations optimize storage costs while ensuring that data is stored according to its relevance and access patterns.

In this recipe, we will learn how to set up a lifecycle policy to archive objects in S3 Glacier after a certain period and then expire them.

Getting ready

To complete this recipe, you need to have a Glacier vault, which is a separate storage container that can be used to store archives, independent from S3. You can create one by following these steps:

  1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the Glacier service.
  2. Click on Create vault to start creating a new Glacier vault.
  3. Provide a unique and descriptive name for your vault in the Vault name field.
  4. Optionally, you can choose to receive notifications for events by clicking Turn on notifications under the Event notifications section.
  5. Click on Create to create the vault.

How to do it…

  1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. Select the desired bucket for which you want to configure the lifecycle policy and navigate to the Management tab.
  3. In the left panel, select Lifecycle and click on Create lifecycle rule.
  4. Under Rule name, name the lifecycle rule to identify it.
  5. Under Choose a rule scope, you can choose Apply to all objects in the bucket or Limit the scope of this rule using one or more filters to specify the objects for which the rule will be applied. You can use one of the following filters or a combination of them:
    • Filter objects based on prefixes (for example, logs)
    • Filter objects based on tags; you can add multiple key-value pair tags to filter on
    • Filter objects based on object size by setting Specify minimum object size and/or Specify maximum object size and specifying the size value and unit

    The following screenshot shows a rule that’s been restricted to a set of objects based on a prefix:

Figure 1.4 – Lifecycle rule configuration

Figure 1.4 – Lifecycle rule configuration

  6. Under Lifecycle rule actions, select the following options:
    • Move current versions of objects between storage classes. Then, choose one of the Glacier classes and set Days after object creation in which the object will be transitioned (for example, 60 days).
    • Expire current versions of objects. Then, set Days after object creation in which the object will expire. Choose a value higher than the one you set for transitioning the object to Glacier (for example, 100).

    Review the transition and expiration actions you have set and click on Create rule to apply the lifecycle policy to the bucket:

Figure 1.5 – Reviewing the lifecycle rule

Figure 1.5 – Reviewing the lifecycle rule

Note

It may take some time for the lifecycle rule to be applied to all the selected objects, depending on the size of the bucket and the number of objects. The rule will affect existing files, not just new ones, so ensure that no applications are accessing files that will be archived or deleted as they will no longer be accessible via direct S3 retrieval.

How it works…

After you save the lifecycle rule, Amazon S3 will periodically evaluate it to find objects that meet the criteria specified in the lifecycle rule. In this recipe, the object will remain in its default storage type for the specified period (for example, 60 days) after which it will automatically be moved to the Glacier storage class. This transition is handled transparently, and the object’s metadata and properties remain unchanged. Once the objects are transitioned to Glacier, they are stored in a Glacier vault and become part of the Glacier storage infrastructure. Objects will then remain in Glacier for the remaining period of expiry (for example, 40 days), after which they will expire and be permanently deleted from your S3 bucket.

Please note that once the objects have expired, they will be queued for deletion, so it might take a few days after the object reaches the end of its lifetime for it to be deleted.

There’s more…

Lifecycle configuration can also be specified as an XML or JSON document when using the S3 API or AWS CLI, which can be helpful if you are planning to apply the same lifecycle rules to multiple buckets. You can read more on setting this up at https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html.
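
As an illustration, the following AWS CLI sketch reproduces this recipe’s rule, assuming a logs/ prefix and the 60- and 100-day thresholds used above:

    # Transition objects under logs/ to Glacier after 60 days and expire them after 100 days
    aws s3api put-bucket-lifecycle-configuration \
        --bucket <your-bucket-name> \
        --lifecycle-configuration '{
            "Rules": [{
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 60, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 100}
            }]
        }'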

See also

Versioning your data

Amazon S3 versioning refers to maintaining multiple variants of an object at the same time in the same bucket. Versioning provides you with an additional layer of protection by giving you a way to recover from unintended overwrites and accidental deletions as well as application failures.

S3 Object Versioning is not enabled by default and has to be explicitly enabled for each bucket. Once enabled, versioning cannot be disabled; it can only be suspended. When versioning is enabled, you will be able to preserve, retrieve, and restore any version of an object stored in the bucket using its version ID. Every version of an object is the whole object, not a delta from the previous version, and you can set permissions at the version level, so different versions of the same object can have different permissions.

In this recipe, we’ll learn how to delete the current version of an object to make the previous one the current version.

Getting ready

For this recipe, you need to have a version-enabled bucket with an object that has at least two versions.

You can enable versioning for your bucket by going to the bucket’s Properties tab, editing the Bucket Versioning area, and setting it to Enable:

Figure 1.6 – Enabling bucket versioning

Figure 1.6 – Enabling bucket versioning

You can create a new version of an object by simply uploading a file with the same name to the versioning-enabled bucket.

It’s important to note that enabling versioning for a bucket is irreversible. Once versioning is enabled, it will be applied to all existing and future objects in that bucket. So, before enabling versioning, make sure that your application or workflow is compatible with object versioning.

Enabling versioning for the first time will take time to take effect, so we recommend waiting 15 minutes before performing any write operation on objects in the bucket.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. In the Buckets list, select the S3 bucket that contains the object for which you want to set the previous version as the current one.
  3. In the Objects tab, click on Show versions. Here, you can view all your object versions:
Figure 1.7 – Object versions

Figure 1.7 – Object versions

  4. Select the current version of the object that you want to delete. It’s the top-most version with the latest modified date.
  5. Click on the Delete button and type permanently delete when prompted on the next screen.

    After deleting the current version, the previous version will automatically become the latest version:

Figure 1.8 – Object versions after version deletion

Figure 1.8 – Object versions after version deletion

  6. Verify that the previous version is now the latest version by checking the Last modified timestamps or verifying this through object listing, metadata, or download.

How it works…

Once you enable bucket versioning, each object in the bucket gets a version ID that uniquely identifies it; in buckets that have never had versioning enabled, objects have a version ID of null. The older versions of an object become non-current but continue to exist and remain accessible. When you delete the current version of an object by specifying its version ID, it is permanently removed, and S3 automatically promotes the previous version to be the current one. If you delete an object without specifying a version ID, Amazon S3 doesn’t delete it permanently; instead, it inserts a delete marker, which becomes the current object version. However, you can still restore the previous versions:

Figure 1.9 – Object with a delete marker

Figure 1.9 – Object with a delete marker
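
The same steps can be performed from the AWS CLI; here is a rough sketch, with the bucket name, object key, and version ID as placeholders:

    # List all versions of an object to find the version ID of the current version
    aws s3api list-object-versions \
        --bucket <your-bucket-name> \
        --prefix <object-key>

    # Permanently delete a specific version; the next newest version becomes current
    aws s3api delete-object \
        --bucket <your-bucket-name> \
        --key <object-key> \
        --version-id <version-id>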

There’s more…

S3 rates apply to every version of an object that’s stored and requested, so keeping non-current versions of objects can increase your storage cost. You can use lifecycle rules to archive the non-current versions or permanently delete them after a certain period and keep the bucket clean from unnecessary object versions.

Follow these steps to add a lifecycle rule to delete non-current versions after a certain period:

  1. Go to the bucket’s Management tab and click on the Lifecycle configuration.
  2. Click on the Add lifecycle rule button to create a new rule.
  3. Provide a unique name for the rule.
  4. Under Apply rule to, select the appropriate resources (for example, the entire bucket or specific prefixes).
  5. Set the action to Permanently delete non-current versions.
  6. Specify Days after objects become noncurrent, after which the deletion will be executed. Optionally, you can specify Number of newer versions to retain, which means that the specified number of newer versions will be kept for the object and all others will be deleted once they are eligible for deletion based on the specified period.
  7. Click on Save to save the lifecycle rule.
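
As a sketch, an equivalent rule can be defined from the AWS CLI; the rule ID and the 30-day/3-version values here are illustrative only:

    # Expire non-current versions 30 days after they become non-current, keeping the 3 newest
    # Note: this call replaces any existing lifecycle configuration on the bucket
    aws s3api put-bucket-lifecycle-configuration \
        --bucket <your-bucket-name> \
        --lifecycle-configuration '{
            "Rules": [{
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 30,
                    "NewerNoncurrentVersions": 3
                }
            }]
        }'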

See also

Replicating your data

Amazon S3 replication is an automatic, asynchronous process that copies objects to one or more destination buckets. Replication can be configured across buckets in the same AWS region with Same-Region Replication (SRR), which can be useful for scenarios such as isolating different workloads, segregating data for different teams, or achieving compliance requirements. Replication can also be configured for buckets across different AWS regions with Cross-Region Replication (CRR), which helps reduce latency for accessing data, especially for enterprises with locations in many geographies, by maintaining copies of objects in multiple regions. It also provides compliance and data redundancy for improved performance, availability, and disaster recovery capabilities.

In this recipe, we’ll learn how to set up replication between two buckets in different AWS regions and the same AWS account.

Getting ready

You need to have an S3 bucket in the destination AWS region to act as a target for the replication. Also, S3 versioning must be enabled for both the source and destination buckets.

How to do it…

  1. Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
  2. In the Buckets list, choose the source bucket you want to replicate.
  3. Go to the Management tab and select Create replication rule under Replication rules.
  4. Under Replication rule name in the Replication rule configuration section, give your rule a unique name.
  5. Under Status, either keep it Enabled for the rule to take effect once you save it or change it to Disabled to enable it later as required:
Figure 1.10 – Replication rule configuration

Figure 1.10 – Replication rule configuration

  6. If this is the first replication rule for the bucket, Priority will be set to 0. Subsequent rules that are added will be assigned higher priorities. When multiple rules share the same destination, the rule with the highest priority takes precedence during execution, typically the one created last. If you wish to control the priority for each rule, you can achieve this by setting the rule using XML. For guidance on how to configure this, refer to the See also section.
  7. In the Source bucket section, you have the option to replicate all objects in the bucket by selecting Apply to all objects in the bucket or you can narrow it down to specific objects by selecting Limit the scope of this rule using one or more filters and specifying a Prefix value (for example, logs_ or logs/) to filter objects. Additionally, you have the option to replicate objects based on their tags. Simply choose Add tag and input key-value pairs. This process can be repeated so that you can include multiple tags:
Figure 1.11 – Source bucket configuration

Figure 1.11 – Source bucket configuration

  8. Under Destination, select Choose a bucket in this account and enter or browse for the destination bucket name.
  9. Under IAM role, select Choose from existing IAM roles, then choose Create new role from the drop-down list.
  10. Under Destination storage class, you can select Change the storage class for the replicated objects and choose one of the storage classes to be set for the replicated objects in the destination bucket.
  11. Click on Save to save your changes.

How it works…

By adding this replication rule, you grant the source bucket permission to replicate objects to the destination bucket in the specified region. Once the replication process is complete, the destination bucket will contain a copy of the objects from the source bucket. The objects in the destination bucket will have the same ownership, permissions, and metadata as the source objects. When you enable replication on your bucket, several background processes occur to facilitate it. S3 continuously monitors changes to objects in your source bucket; once a change is detected, S3 generates a replication request for the corresponding objects and initiates the transfer of data from the source to the destination bucket.
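
For teams that prefer scripting, here is a rough AWS CLI sketch of an equivalent replication rule; the role ARN and bucket names are placeholders, and the IAM role must already allow S3 to replicate objects on your behalf:

    # Replicate objects under logs/ from the source bucket to the destination bucket
    aws s3api put-bucket-replication \
        --bucket <source-bucket-name> \
        --replication-configuration '{
            "Role": "arn:aws:iam::<account-id>:role/<replication-role-name>",
            "Rules": [{
                "ID": "replicate-logs",
                "Status": "Enabled",
                "Priority": 0,
                "Filter": {"Prefix": "logs/"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::<destination-bucket-name>"}
            }]
        }'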

There’s more…

There are additional options that you can enable while setting up the replication rule under Additional replication options. The Replication metrics option enables you to monitor the replication progress with S3 Replication metrics by tracking bytes pending, operations pending, and replication latency. The Replication Time Control (RTC) option can be beneficial if you have a strict service-level agreement (SLA) for data replication, as it ensures that 99.99% of your objects are replicated within a 15-minute timeframe. It also enables replication metrics to notify you of any instances of delayed object replication. The Delete marker replication option will replicate object versions with a delete marker. Finally, the Replica modification sync option will replicate the metadata changes of objects.

See also


Key benefits

  • Get up to speed with the different AWS technologies for data engineering
  • Learn the different aspects and considerations of building data lakes, such as security, storage, and operations
  • Get hands-on with key AWS services such as Glue, EMR, Redshift, QuickSight, and Athena for practical learning
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Performing data engineering with Amazon Web Services (AWS) combines AWS's scalable infrastructure with robust data processing tools, enabling efficient data pipelines and analytics workflows. This comprehensive guide to AWS data engineering will teach you all you need to know about data lake management, pipeline orchestration, and serving layer construction. Through clear explanations and hands-on exercises, you’ll master essential AWS services such as Glue, EMR, Redshift, QuickSight, and Athena. Additionally, you’ll explore various data platform topics such as data governance, data quality, DevOps, CI/CD, planning and performing data migration, and creating Infrastructure as Code. As you progress, you will gain insights into how to enrich your platform and use various AWS cloud services such as AWS EventBridge, AWS DataZone, and AWS SCT and DMS to solve data platform challenges. Each recipe in this book is tailored to a daily challenge that a data engineering team faces while building a cloud platform. By the end of this book, you will be well-versed in AWS data engineering and have gained proficiency in key AWS services and data processing techniques. You will develop the necessary skills to tackle large-scale data challenges with confidence.

Who is this book for?

If you're involved in designing, building, or overseeing data solutions on AWS, this book provides proven strategies for addressing challenges in large-scale data environments. Data engineers and big data professionals looking to enhance their understanding of AWS features for optimizing their workflows will find value here, even if they're new to the platform. Basic familiarity with AWS security (users and roles) and the command shell is recommended.

What you will learn

  • Define your centralized data lake solution, and secure and operate it at scale
  • Identify the most suitable AWS solution for your specific needs
  • Build data pipelines using multiple ETL technologies
  • Discover how to handle data orchestration and governance
  • Explore how to build a high-performing data serving layer
  • Delve into DevOps and data quality best practices
  • Migrate your data from on-premises to AWS

Product Details

Publication date: Nov 29, 2024
Length: 528 pages
Edition: 1st
Language: English
ISBN-13: 9781805127284



Table of Contents

15 Chapters
Chapter 1: Managing Data Lake Storage
Chapter 2: Sharing Your Data Across Environments and Accounts
Chapter 3: Ingesting and Transforming Your Data with AWS Glue
Chapter 4: A Deep Dive into AWS Orchestration Frameworks
Chapter 5: Running Big Data Workloads with Amazon EMR
Chapter 6: Governing Your Platform
Chapter 7: Data Quality Management
Chapter 8: DevOps – Defining IaC and Building CI/CD Pipelines
Chapter 9: Monitoring Data Lake Cloud Infrastructure
Chapter 10: Building a Serving Layer with AWS Analytics Services
Chapter 11: Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads
Chapter 12: Harnessing the Power of AWS for Seamless Data Warehouse Migration
Chapter 13: Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS
Index
Other Books You May Enjoy

FAQs

What is the digital copy I get with my Print order?

When you buy any Print edition of our Books, you can redeem (for free) the eBook edition of the Print Book you’ve purchased. This gives you instant access to your book as a PDF or EPUB, or via our online reader experience, as soon as you place your order.

What is the delivery time and cost of a print book?

Shipping Details

USA:


Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K. time will start printing from the next business day, so the estimated delivery times also start from the next day. Orders received after 5 PM U.K. time (in our internal systems) on a business day, or anytime on the weekend, will begin printing the second business day after. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are taxes imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to countries listed under EU27 will not bear customs charges; these are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

Customs duty or localized taxes may apply to shipments delivered to countries outside of the EU27. These charges are levied by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my custom duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., Packt Publishing agrees to replace your printed book if it arrives damaged or with a material defect); outside of those cases, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund for the problem items (damaged, defective, or incorrect).
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of the damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal