Summary
In this chapter, we discussed the extraction stage of an IDP pipeline and how we can leverage Amazon Textract to accurately extract elements from documents. Documents fall into different types: unstructured documents made up of dense text, semi-structured documents such as forms, and structured documents such as tables. We walked through the sample code and its API response for extracting elements from each of these types of scanned document.
We then reviewed the need for accurate extraction of elements from specialized document types, such as ID documents (for example, a US driver’s license or a US passport) and invoices or receipts. We discussed Amazon Textract’s analyze_id and analyze_expense APIs, which extract elements from ID documents and from invoices and receipts, respectively. We also walked through the sample code for extracting elements from these specialized document types.
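As a compact reminder of how the two specialized responses are read, here is a minimal sketch rather than the chapter's own sample code. The helper names (id_fields, expense_summary, analyze_file) and the sample responses are hypothetical and hand-written for illustration; the key names follow the documented AnalyzeID and AnalyzeExpense response shapes, and the sketch assumes boto3 is installed and AWS credentials are configured if analyze_file is actually called.

```python
from typing import Any, Dict, List


def id_fields(response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten an AnalyzeID response into one {field type: value} dict per document."""
    documents = []
    for doc in response.get("IdentityDocuments", []):
        fields = {}
        for field in doc.get("IdentityDocumentFields", []):
            key = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            if key:
                fields[key] = value
        documents.append(fields)
    return documents


def expense_summary(response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten the summary fields of an AnalyzeExpense response, one dict per document."""
    documents = []
    for doc in response.get("ExpenseDocuments", []):
        fields = {}
        for field in doc.get("SummaryFields", []):
            key = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            if key:
                fields[key] = value
        documents.append(fields)
    return documents


def analyze_file(path: str, kind: str) -> Dict[str, Any]:
    """Hypothetical wrapper: send a local file to analyze_id or analyze_expense."""
    import boto3  # imported lazily so the parsing helpers work without AWS access

    client = boto3.client("textract")
    with open(path, "rb") as handle:
        data = handle.read()
    if kind == "id":
        return client.analyze_id(DocumentPages=[{"Bytes": data}])
    return client.analyze_expense(Document={"Bytes": data})


# Hand-written sample responses (made-up values, trimmed to the keys used above).
SAMPLE_ID = {
    "IdentityDocuments": [
        {
            "DocumentIndex": 1,
            "IdentityDocumentFields": [
                {"Type": {"Text": "FIRST_NAME"}, "ValueDetection": {"Text": "JANE"}},
                {"Type": {"Text": "LAST_NAME"}, "ValueDetection": {"Text": "DOE"}},
            ],
        }
    ]
}

SAMPLE_EXPENSE = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Example Store"}},
                {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "12.50"}},
            ]
        }
    ]
}

if __name__ == "__main__":
    print(id_fields(SAMPLE_ID))
    print(expense_summary(SAMPLE_EXPENSE))
```

The parsing helpers are kept separate from the API call so they can be exercised against saved responses without calling the service. A real response also carries confidence scores and, for expenses, line item groups, which this sketch ignores.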
In the next chapter, we will extend the extraction stage...