Test your knowledge
Before moving on to the next chapter, test your knowledge with the following questions:
- Assume you have integrated the complete ETL pipeline, but when an input file is pushed to the input S3 bucket, the Lambda function does not launch the EMR cluster. When you try to debug the Lambda function execution, you find no logs for the function in the CloudWatch log groups. What might be preventing the Lambda function from writing logs to CloudWatch, and how would you resolve it?
- Assume you have multiple data sources sending input files for processing. Instead of triggering an EMR cluster launch on each S3 file arrival event, you would like to schedule a PySpark job to run at a particular time of day, so that it picks up all the input files available at that point in time. How would you schedule the cluster creation and job execution?
- You have integrated Amazon EMR for your batch analytics workload...