Solving S3 eventual consistency problems using AWS Glue
Let’s assume you have a use case that writes a large volume of data to Amazon S3: for example, a clickstream fact table dataset in Parquet format, where the Spark application fails with a File not found exception. When running Spark jobs on Amazon S3, Spark writes the output to a _TEMPORARY prefix in S3 and then moves the data from _TEMPORARY to its final destination. In S3, a move is a rename operation, which is carried out as a copy followed by a delete because S3 has no native rename. If the move happens immediately after the write operation, the newly written objects may not yet be visible because of eventual consistency, which causes the move operation to fail. You will see the failure reported as a Rename failed or File not found error message. In this section, you will learn how to handle these problematic scenarios and fix them in the long term. The following diagram shows the S3 eventual consistency model:
Figure 15.7 – S3 eventual consistency model
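To make the failure mode described above concrete, here is a minimal sketch in plain Python that imitates it without touching AWS. Everything in it is hypothetical and exists only for illustration: the EventuallyConsistentStore class, its lag_reads parameter, and the rename_with_retry helper are not part of Spark, AWS Glue, or any S3 SDK. The toy store hides a newly written object for a fixed number of lookups, so a rename issued right after the write fails in the same way as the Rename failed or File not found errors. The retry loop is included only to show that the object becomes visible a little later; it is not presented as the fix this section goes on to describe.

```python
class EventuallyConsistentStore:
    """Toy in-memory stand-in for an eventually consistent object store.

    Hypothetical helper for illustration only; this is not an AWS API.
    """

    def __init__(self, lag_reads=2):
        self._objects = {}           # key -> stored bytes
        self._pending = {}           # key -> lookups left before the key is visible
        self._lag_reads = lag_reads  # assumed visibility lag, counted in lookups

    def put(self, key, data):
        """Write an object; it stays invisible for the next `lag_reads` lookups."""
        self._objects[key] = data
        self._pending[key] = self._lag_reads

    def _visible(self, key):
        """Return True once the key exists and its visibility lag has elapsed."""
        if key not in self._objects:
            return False
        remaining = self._pending.get(key, 0)
        if remaining > 0:
            self._pending[key] = remaining - 1
            return False
        return True

    def get(self, key):
        """Read an object, failing if it is missing or not yet visible."""
        if not self._visible(key):
            raise FileNotFoundError(f"File not found: {key}")
        return self._objects[key]

    def rename(self, src, dst):
        """Move src to dst as a copy followed by a delete."""
        if not self._visible(src):
            raise FileNotFoundError(f"Rename failed: {src} not found")
        self._objects[dst] = self._objects[src]  # copy
        del self._objects[src]                   # delete
        self._pending.pop(src, None)


def rename_with_retry(store, src, dst, attempts=5):
    """Retry the rename until the source is visible; return the attempt used."""
    for attempt in range(1, attempts + 1):
        try:
            store.rename(src, dst)
            return attempt
        except FileNotFoundError:
            if attempt == attempts:
                raise  # still not visible after the last attempt
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    store = EventuallyConsistentStore(lag_reads=2)
    store.put("_TEMPORARY/part-0000.parquet", b"clickstream rows")
    used = rename_with_retry(
        store, "_TEMPORARY/part-0000.parquet", "final/part-0000.parquet"
    )
    print(f"rename succeeded on attempt {used}")
```

In this toy model, a single call to store.rename made straight after store.put raises FileNotFoundError, which mirrors the job failure described earlier. A real job would not count lookups like this; the lag is modeled that way only to keep the sketch deterministic.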
First, let’s understand...