Introducing the Intelligent Document Processing pipeline
IDP seems simple, but in reality it is a complex challenge to solve. Imagine a physical library: racks and racks of books, divided and arranged in rows and tagged with the right author and genre. Have you ever wondered about the human workforce behind this diligent, structured work that helps us find the right book in a timely and efficient manner?
Similarly, we deal with documents across industries for various use cases. In the traditional world, you would need many teams to go through the entire set of documents and read each one manually. They would then need to identify the category each document belongs to and tag it with the right keywords and topics so that it is easy to identify and search. Only after this groundwork can you pursue your main goal, which is to extract insights from these documents. This is a massive process that can take months or even years to set up, depending on the volume of data and the skill level of the manual workforce. Manual operations can be time-consuming, error-prone, and expensive. Every time you onboard a new document type, or update or remove an existing one, these steps have to be carried out again. This requires significant investment and effort and puts a lot of pressure on the manual workforce. Sometimes, the time and effort needed are not budgeted for, which can cause significant delays or pause the process altogether. To automate this, we need digitization and digitalization.
Digitization is the conversion of data to a digital format, and it has to happen before digitalization. Digitalization is the process of using these digital copies to derive context-based insights and transform the process. After transformation, there must be a way to consume the information. This entire process is known as the IDP pipeline. Figure 1.6 gives a detailed view of the IDP pipeline and its stages:
Figure 1.6 – The IDP pipeline and its stages
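To make the idea of a staged pipeline concrete, here is a minimal Python sketch that chains placeholder stages in the order this section discusses them. The stage names, the `make_stage` helper, and the dictionary-based document are assumptions made for illustration; a real pipeline would replace each placeholder with actual processing logic.

```python
from typing import Callable, Dict, List

# A document is modeled as a plain dict that each stage annotates (an assumption).
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def make_stage(name: str) -> Stage:
    """Return a placeholder stage that only records that it ran."""
    def stage(doc: Document) -> Document:
        updated = dict(doc)
        updated["history"] = list(doc.get("history", [])) + [name]
        return updated
    return stage


# Hypothetical stage names, following the phases described in this section.
STAGE_NAMES = [
    "data_capture",
    "classification",
    "extraction",
    "enrichment",
    "post_processing",
    "consumption",
]
PIPELINE: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(doc: Document, stages: List[Stage] = PIPELINE) -> Document:
    """Pass the document through every stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline({"id": "doc-001"})
print(result["history"])
```

Keeping each stage as an independent function means a stage can be skipped or swapped without touching the others.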
Now that we know what the IDP pipeline is, let’s understand each phase of IDP in detail.
Data capture
In our library books example, we can go to a library directly, look for books, borrow a book, return a book, or just sit and read a book in the library. We want a place where all books are available, well-organized, easily accessible when we need them, and affordable. At the data capture stage, documents play the role of our library books. During this stage, we collect and aggregate all our data in a secure, centralized, scalable data store. While building the data capture stage for your IDP pipeline, you have to take document sources, document formats, and the document store into consideration:
- Document sources: Data can come from various sources. It can be as simple as mobile capture, such as submitting receipts for reimbursement or submitting digital pictures of all your applications, transcripts, ID documents, and supporting documents during any registration process. Other sources can be simple fax or mail attachments.
- Document format: Documents come in different formats and sizes. Some are just a single page, such as a driver’s license or insurance card, while others span multiple pages, such as a mortgage loan application or an insurance benefit summary. We group data into three broad categories: structured, semi-structured, and unstructured. Structured documents are built around tabular elements. Semi-structured documents contain key-value elements, as in an insurance application form. Unstructured documents consist of dense text, as in legal and contractual documents. Most often, though, a multi-page document mixes elements from all three categories. File types vary as well: some documents are images, such as JPEGs and PNGs, and others are PDF or TIFF files, with varying resolutions and printing angles.
- Document store: To store the untransformed and transformed documents, we need a secure data store. At times, we also need to store metadata about a document, such as the author or date, that is not mentioned in the document itself, so that it can later be mapped to the extraction results. Industries such as healthcare, the public sector, and finance must be able to store their documents and results securely, in line with their security, governance, and compliance requirements. The store should be easy to manage and offer fast access from anywhere. Because the volume of documents is vast, the store must scale as our needs grow. It must also be highly reliable and available so that you can access your data whenever you need it. Finally, given the high volume of documents, the store needs to be cost-effective.
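The three considerations above can be sketched as a small capture record that ties a stored document to its source, its format, and any metadata supplied from outside the document. The `CapturedDocument` class, the `capture` function, the extension-to-format mapping, and the sample key are all hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict

# Hypothetical mapping from file extension to a coarse format label.
KNOWN_FORMATS = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tif": "tiff",
    ".tiff": "tiff",
}


@dataclass
class CapturedDocument:
    key: str         # location of the file in the document store
    source: str      # for example "mobile", "fax", or "email"
    doc_format: str  # coarse format label derived from the extension
    metadata: Dict[str, str] = field(default_factory=dict)


def capture(key: str, source: str, **metadata: str) -> CapturedDocument:
    """Build a capture record; metadata holds facts not found in the document."""
    suffix = PurePosixPath(key).suffix.lower()
    doc_format = KNOWN_FORMATS.get(suffix, "unknown")
    return CapturedDocument(key, source, doc_format, dict(metadata))


record = capture(
    "claims/2023/receipt-17.JPG", "mobile", author="J. Doe", date="2023-01-15"
)
print(record)
```

Storing the metadata next to the document key is one way to map it back to the extraction results later.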
Let’s now move on to the next IDP phase.
Document classification
Going back to our book library example, the books are categorized and stacked by category. For example, for any fiction or non-fiction books, you can directly check the label on the rack and go to the section where you can find the books related to that category. Each section can be further subdivided into sub-sections or can be arranged by the first letter of the author’s name. Can you imagine how difficult it would be to locate a book if it were not categorized correctly?
Similarly, at times, you receive a package of documents, or sometimes a single PDF with all the required documents merged. A human can preview the documents and sort them into their designated folders. This helps later with information extraction and metadata extraction from a variety of complex documents, depending on the document type. This process of categorizing the documents is known as document classification or document splitting.
This process is crucial when we try to automate our document extraction process and when we receive multiple documents and don’t have a clear way to identify each document type. If you are dealing with a single document type or have an identifiable way to locate the document, then you can skip this step in the IDP pipeline. Otherwise, classify those documents correctly before proceeding in the IDP pipeline.
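As a rough illustration of classification and splitting, the following sketch labels each page of a merged package with a keyword rule and then groups consecutive pages that share a label. The keyword lists, the labels, and the sample pages are invented for this example, and keyword matching is only a stand-in for a trained classifier.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical keywords per document type; a real system would use a trained model.
KEYWORDS: Dict[str, List[str]] = {
    "bank_statement": ["account balance", "statement period"],
    "pay_stub": ["gross pay", "net pay"],
    "id_document": ["driver license", "date of birth"],
}


def classify_page(text: str) -> str:
    """Return the label whose keywords appear most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        label: sum(keyword in lowered for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def split_package(pages: List[str]) -> List[Tuple[str, List[int]]]:
    """Group consecutive pages with the same label into one document."""
    labeled = [(classify_page(text), index) for index, text in enumerate(pages)]
    return [
        (label, [index for _, index in group])
        for label, group in groupby(labeled, key=lambda item: item[0])
    ]


pages = [
    "Statement period: Jan 2023. Account balance: $1,200",
    "Account balance carried forward",
    "Gross pay 4,000 Net pay 3,100",
    "Driver License, Date of Birth 01/01/1990",
]
documents = split_package(pages)
print(documents)
```

Note that this grouping would merge two adjacent documents of the same type, so a production splitter needs an additional signal for document boundaries.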
Document extraction
Again, analogous to our library books, now that all the books are accurately categorized and stacked, we can easily find a book of our choice. When we read a book, we might come across text in multiple formats, such as dense paragraphs interwoven with tables and structured or semi-structured elements such as key-value pairs. As human beings, we know how to read a table or a key-value element in addition to a paragraph of text, and we can process that information. Can we automate this step? Can we ask a machine to do the extraction for us?
The process of accurately extracting all elements, including structural elements, is broadly known as document extraction in the IDP pipeline. This helps us to identify key information from documents through extensive, accurate extraction. The intelligent capture of the data elements from documents during the extraction phase of the IDP pipeline helps us derive new insights in an automated manner.
One example of the extraction stage is Named Entity Recognition (NER), which automatically identifies entities in unstructured data. We will look into the details in Chapter 4, Accurate Extraction with Amazon Comprehend.
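To give a feel for what extracting structural elements means, here is a minimal sketch that pulls key-value pairs out of form-like text that has already been digitized. It relies on a simple regular expression and an invented sample form; it is not how a managed extraction service works internally, only a small stand-in for the idea.

```python
import re
from typing import Dict

# Matches lines of the form "Key: value", where the key contains only letters and spaces.
KEY_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.+?)\s*$")


def extract_key_values(text: str) -> Dict[str, str]:
    """Collect key-value pairs, normalizing keys to snake_case."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            key = match.group(1).strip().lower().replace(" ", "_")
            fields[key] = match.group(2)
    return fields


# Invented sample text standing in for the digitized output of a form.
form_text = """Insurance Application
Applicant Name: Jane Doe
Policy Number: AB-12345
Date of Birth: 1990-01-01
The applicant confirms that the statements above are accurate.
"""
fields = extract_key_values(form_text)
print(fields)
```

Lines without a key-value shape, such as the title and the closing sentence, are ignored here; dense text of that kind is where techniques such as NER apply.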
Document enrichment
To get insights and business value out of your document, you will need to understand the dynamic topics and document attributes in your document. During the document enrichment stage, you append or enhance the existing data with additional business or domain-specific context from additional sources.
For example, while processing healthcare claims, we sometimes need to refer to a doctor’s note to verify the medical condition mentioned in the claims form, so additional documents such as doctor’s notes are requested for further processing. From a raw doctor’s note, we derive medical information such as details about medication and medical conditions. Being able to get this information directly from the document is critical to enabling business value such as improving patient care. To achieve this, we need the medical context, metadata, attributes, and domain-specific knowledge. This is an example of the enrichment stage of the IDP pipeline.
While entity recognition can extract metadata from the text of various document types, we also need a way to recognize the non-text elements in our documents. This is where object detection comes in handy. Enrichment can be extended further to identify personal information through Personally Identifiable Information (PII) and Protected Health Information (PHI) detection. We can also de-identify PII or PHI before further downstream processing. We will look into the details in Chapter 5, Document Enrichment in Intelligent Document Processing.
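De-identification can be sketched with a small redaction routine. The two regular expressions below, for email addresses and US Social Security number shapes, are simplifying assumptions that cover only a narrow slice of PII; real detection would rely on a trained model or a managed service rather than patterns like these.

```python
import re
from typing import Dict, List, Tuple

# Simplified patterns for illustration only; they will miss many real-world cases.
PII_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> Tuple[str, List[str]]:
    """Replace each detected item with its label and report which labels were found."""
    found: List[str] = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found


note = "Patient reachable at jane.doe@example.com, SSN 123-45-6789."
clean, labels = redact(note)
print(clean)
print(labels)
```

Replacing a value with its label, rather than deleting it, keeps the sentence readable for downstream processing while hiding the sensitive content.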
Document post-processing (review and verification)
Going back to our book library example, there are certain instances when a library gets a new book and places it in a new book section instead of categorizing it by genre. Libraries follow specific rules like this for certain books, and some post-processing is required to keep the books organized.
Similarly, with document processing, you might want to apply business rules or domain-specific validation to check a document for completeness. For example, in a claims processing pipeline in the insurance industry, you want to validate the insurer ID and other basic information to confirm that the claims form is complete. This is a type of post-processing in the IDP pipeline.
Additionally, the extraction or enrichment steps previously discussed may not give you the accuracy required for your business needs. You may want to include a human workforce for manual review to achieve higher accuracy. Routing selected fields of your documents to a human reviewer in an automated way can also be part of the post-processing phase in IDP. Human review can be expensive, so we send only a limited portion of the data in our documents for review, as dictated by our business needs and requirements. We will further discuss this phase in Chapter 6, Review and Verification of Intelligent Document Processing.
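A completeness check of this kind can be expressed as a short rule. In the sketch below, only the insurer ID comes from the example above; the other required field names and the sample claim are hypothetical.

```python
from typing import Dict, List

# Hypothetical list of required fields; adjust to your own business rules.
REQUIRED_FIELDS = ["insurer_id", "patient_name", "date_of_service"]


def missing_fields(
    claim: Dict[str, str], required: List[str] = REQUIRED_FIELDS
) -> List[str]:
    """Return required fields that are absent or contain only whitespace."""
    return [name for name in required if not claim.get(name, "").strip()]


claim = {"insurer_id": "INS-001", "patient_name": "  ", "claim_amount": "120.00"}
problems = missing_fields(claim)
print(problems)
```

An empty result means the form passed the completeness rule; anything else can be reported back or sent for review.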
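One common way to limit human review, shown here as a sketch rather than a prescribed method, is to route only fields whose extraction confidence falls below a threshold, plus any fields the business always wants checked. The confidence scores, the 0.9 threshold, and the field names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Each extracted field is assumed to carry (value, confidence between 0 and 1).
ExtractedField = Tuple[str, float]


def fields_for_review(
    fields: Dict[str, ExtractedField],
    threshold: float = 0.9,
    always_review: Tuple[str, ...] = (),
) -> List[str]:
    """Return the names of fields a human should check, in sorted order."""
    return sorted(
        name
        for name, (_, confidence) in fields.items()
        if confidence < threshold or name in always_review
    )


extracted = {
    "insurer_id": ("INS-001", 0.99),
    "claim_amount": ("120.00", 0.72),
    "diagnosis": ("fracture", 0.95),
}
to_review = fields_for_review(extracted, threshold=0.9, always_review=("diagnosis",))
print(to_review)
```

Raising the threshold sends more fields to reviewers and increases cost, so the value is a business decision rather than a technical constant.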
Consumption
In our book library example, we always wish for a centralized, unified portal to track all our library books and their statuses. Nowadays, libraries support online catalogs where you can check, in a centralized portal, all the books in a library along with their reservation statuses, the number of copies, and additional information such as author and ratings. Here, we not only maintain and organize a book library but also integrate library information with our portals. We might maintain multiple portals or tracking systems to manage our library books. This is an example of the integration and consumption stage for our library books.
Similarly, in our IDP pipeline, we collect our documents and categorize them during the data capture and classification stages. Then, we accurately extract all the required information from our documents. With the enrichment stage, we derive additional information and transform our data for our business needs. Now is the time to consume the information for our business requirements. We need to integrate with our existing systems to consume the information and insights derived from our documents. Most of the time, I come across customers who already use an existing portal or tracking system and want to integrate the insights derived from their documents with it. This also helps them build a 360-degree view of their product from the consumer perspective. At other times, the customer just wants a data dump in their database for better, faster queries. There can be many different ways and channels through which you consume the extracted information. This stage is known as the consumption or integration stage in our IDP pipeline.
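The "data dump in a database" option can be sketched with Python's built-in `sqlite3` module. The table name, its columns, and the sample results are hypothetical, and an in-memory database stands in for whatever system the customer already runs.

```python
import sqlite3
from typing import Dict


def load_results(conn: sqlite3.Connection, results: Dict[str, Dict[str, str]]) -> None:
    """Store one row per (document, field) pair so results can be queried with SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_fields ("
        "doc_id TEXT, field TEXT, value TEXT, PRIMARY KEY (doc_id, field))"
    )
    rows = [
        (doc_id, name, value)
        for doc_id, fields in results.items()
        for name, value in fields.items()
    ]
    # INSERT OR REPLACE keeps reprocessing a document from creating duplicates.
    conn.executemany("INSERT OR REPLACE INTO extracted_fields VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for illustration
load_results(
    conn,
    {
        "doc-001": {"insurer_id": "INS-001", "claim_amount": "120.00"},
        "doc-002": {"insurer_id": "INS-002"},
    },
)
insurers = conn.execute(
    "SELECT doc_id, value FROM extracted_fields WHERE field = ? ORDER BY doc_id",
    ("insurer_id",),
).fetchall()
print(insurers)
```

A narrow table with one row per field keeps the schema stable when new document types introduce new fields.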
Let’s now summarize the chapter.