Document Capture and Categorization
One of the first stages of an Intelligent Document Processing (IDP) pipeline is to collect your documents and store them in a highly available, reliable, and secure data store. Data is our gold mine, and to extract insights from our documents, we first need to understand our data and pre-process it as needed. Most of the time, organizations receive packages of unlabeled documents. Without automation, someone has to review each document manually and assign it to the right category; this step is known as the document classification stage of the IDP pipeline. Because manual review is slow and does not scale, we are looking for an automated process for data collection and document classification.
In this chapter, we will cover the following topics:
- Understanding data capture with Amazon S3
- Understanding document classification with Amazon Comprehend’s custom classifier
- Understanding document categorization with computer vision