Summary
In this chapter, we tackled feature engineering for a large (~500 GB) dataset. We looked at challenges including scalability, bias, and explainability. We saw how to use SageMaker Data Wrangler, Clarify, and Processing jobs to explore and prepare data.
While there are many ways to use these tools, we recommend Data Wrangler for interactive exploration of small to mid-sized datasets. For processing large datasets in their entirety, switch to running processing jobs programmatically with the Spark framework to take advantage of parallel processing. (At the time of writing, Data Wrangler does not support running on multiple instances, but a processing job can run on multiple instances.) You can always export a Data Wrangler flow as a starting point.
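As a minimal sketch of this approach, the snippet below launches a Spark-based processing job across several instances using the SageMaker Python SDK's `PySparkProcessor`. The role, the `preprocess.py` script, and the S3 paths are placeholders, not values from this chapter:

```python
from sagemaker.spark.processing import PySparkProcessor

# Spark processing job distributed across multiple instances.
# role_arn, preprocess.py, and the S3 URIs are placeholders --
# substitute your own values.
spark_processor = PySparkProcessor(
    base_job_name="feature-engineering-spark",
    framework_version="3.1",      # Spark version of the managed container
    role=role_arn,
    instance_type="ml.m5.4xlarge",
    instance_count=4,             # parallelism across instances
    max_runtime_in_seconds=7200,
)

spark_processor.run(
    submit_app="preprocess.py",   # your PySpark script
    arguments=[
        "--input", "s3://my-bucket/raw/",
        "--output", "s3://my-bucket/processed/",
    ],
)
```

Because the job runs your own PySpark script, you can paste in the transformations exported from a Data Wrangler flow and then scale out simply by raising `instance_count`.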
If your dataset is many terabytes, consider running a Spark job directly in EMR or Glue and invoking SageMaker using the SageMaker Spark SDK. EMR and Glue have optimized Spark runtimes and more efficient integration with...
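As one illustrative sketch (not the chapter's own code), a Spark job already running on EMR can hand training off to SageMaker through the `sagemaker_pyspark` library. The S3 path, role, feature dimension, and instance choices below are assumptions; the input DataFrame must contain a `features` vector column, which the estimator expects by default:

```python
from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Put the SageMaker Spark SDK jars on the Spark classpath.
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
    .getOrCreate()
)

# train_df is assumed to hold features prepared earlier in the same
# EMR/Glue job, with a vector column named "features"; role_arn is a
# placeholder for your SageMaker execution role.
train_df = spark.read.parquet("s3://my-bucket/processed/")

kmeans = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role_arn),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=2,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1,
)
kmeans.setK(10)
kmeans.setFeatureDim(50)   # assumed width of the feature vector

# fit() launches a SageMaker training job and deploys an endpoint;
# transform() then scores the Spark DataFrame against that endpoint.
model = kmeans.fit(train_df)
predictions = model.transform(train_df)
```

This pattern keeps the heavy data preparation inside the EMR or Glue Spark runtime while delegating training and hosting to SageMaker.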