Summary
In this chapter, we continued building advanced NLP solutions to address real-world requirements. We focused on processing PDF documents asynchronously and improving the accuracy of the extracted text by reviewing and modifying low-confidence detections, using Amazon Textract and Amazon A2I.
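As a reminder of what selecting low-confidence detections can look like, here is a minimal sketch rather than the chapter's notebook code. It assumes a Textract-style response dictionary whose `Blocks` entries carry `BlockType`, `Text`, and a `Confidence` score from 0 to 100. The function name `low_confidence_lines` and the threshold of 95 are hypothetical choices made only for illustration.

```python
from typing import Any, Dict, List


def low_confidence_lines(
    response: Dict[str, Any], threshold: float = 95.0
) -> List[Dict[str, Any]]:
    """Return LINE blocks whose confidence falls below the threshold.

    `response` is assumed to have the shape of a Textract text detection
    result: a dict with a "Blocks" list. The threshold is an illustrative
    value, not one taken from the chapter.
    """
    flagged = []
    for block in response.get("Blocks", []):
        # Only whole lines of text are considered; PAGE and WORD blocks are skipped.
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence < threshold:
            flagged.append(
                {
                    "id": block.get("Id"),
                    "text": block.get("Text", ""),
                    "confidence": confidence,
                }
            )
    return flagged
```

The lines returned by a filter like this are the candidates you would route to a human review loop, while lines at or above the threshold pass through unchanged.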
We looked at a use case for registering companies with the SEC, which required us to extract text from the documents, and then validate and modify specific lines of text before the documents could be passed to the Partner Integration team for submission to the SEC. We considered an architecture built for scale and ease of setup. We assumed that you are the chief architect overseeing this project, and we then provided an overview of the solution components in the Introducing the PDF batch processing use case section.
We then went through the prerequisites for the solution build, set up an Amazon SageMaker Notebook instance, cloned our GitHub repository, and started executing the code in the notebook based on the instructions...