
The Evolution of Document Classification, Data Extraction and the Impact of AI on Document Process

Artificial Intelligence has been in the headlines for years. Whether it promises the end of humanity or the end of human suffering, a different world has long been expected as the product of thinking machines. While we are still a long way off from sentient, self-aware computers, several techniques that fall under the umbrella of “Artificial Intelligence” have taken hold in the document management services industry.



For decades, companies have sought to reduce the labor required to organize and extract data from their records. Starting with simple cross-reference indexing, which told users which page in a book or which frame on a roll of film held a document, we’ve tried to make indexing and data capture easier.

After the initial rise of manual key-from-image (typing), or KFI, companies took two paths: seeking cheaper labor by outsourcing to countries with lower wages, or pursuing better technology and automation. Some companies did both, automating what they could and then turning the rest over to inexpensive data entry operators in places like China, India, the Philippines, South America and the Caribbean Islands.

With access to cheap labor reduced, automation has become the last safe harbor for those seeking to reduce the effort and cost of classifying and extracting data from documents. Starting with simple Optical Character Recognition (OCR), companies could leverage full-text search to augment indexed fields. For instance, a company might index a couple of fields in order to organize and manage the files effectively, like client name and date of service, and then use OCR to find keywords in the files they are looking for.

But simply augmenting search wasn’t enough. With the advent of more accurate OCR that could process large volumes of documents, software companies began to combine OCR with templates that described where information on a page was likely to be, attempting to automatically extract field-level data from structured and semi-structured documents like tax forms and checks.
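A template of this kind can be sketched as fixed zones laid over OCR output. Everything below, including the coordinate ranges, the `(text, x, y)` word format and the field names, is a hypothetical simplification for illustration, not any particular product’s format:

```python
# Illustrative sketch of template-based capture: each field is tied to a
# fixed rectangle on the page, and any OCR words falling inside it are
# joined together to form the field value.

TEMPLATE_1040 = {
    "ssn":  {"x": (450, 600), "y": (40, 60)},   # fixed zone, in page units
    "name": {"x": (30, 300),  "y": (40, 60)},
}

def extract_fields(ocr_words, template):
    """ocr_words: list of (text, x, y) tuples from an OCR engine."""
    fields = {}
    for field, zone in template.items():
        hits = [w for w, x, y in ocr_words
                if zone["x"][0] <= x <= zone["x"][1]
                and zone["y"][0] <= y <= zone["y"][1]]
        fields[field] = " ".join(hits)
    return fields

words = [("Jane", 35, 50), ("Doe", 90, 50), ("123-45-6789", 480, 52)]
print(extract_fields(words, TEMPLATE_1040))
```

Note how a page printed with different margins would shift every word out of its zone, which is exactly the fragility that made pure templates so brittle.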

This approach works very well for documents that are highly consistent and were created with this kind of treatment in mind, like U.S. Tax Form 1040s. In contrast, documents without any structure, like legal agreements, and highly variable forms, like tax transcripts, were very hard to capture, because the data keeps moving around from one document to the next.

In addition, just printing a document on a different size of paper or with different margins could throw the template system off and cause it to capture the wrong information from the form. In the information processing world, a false positive (“I think I got it right”) is much worse than capturing nothing at all. So when your automation software starts to confidently deliver the wrong data, you’ve got some real problems.

To deal with the slight variations that occur in even the most structured forms, software developers took another step forward and combined OCR with software-driven business rules that let operators describe the location of information on the page relative to other information that consistently appears there. For instance, you could tell the software, “Look in the top right quadrant of the page for the word ‘Invoice’. If you find it, look to the left of and below it for a sequence of numbers. Capture and place that sequence of numbers in the Invoice Number field.”
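A rule like the one quoted above can be sketched in a few lines. The page dimensions, the quadrant cutoffs and the `(text, x, y)` OCR word format are all illustrative assumptions:

```python
# Sketch of a rules-based ("Intelligent Data Capture") rule: anchor on the
# word "Invoice" in the top right quadrant, then look left of and below it
# for a run of digits and take the closest candidate.
import re

def capture_invoice_number(ocr_words):
    # Find the anchor word in the top-right quadrant (small y, large x,
    # assuming a page roughly 600x800 units with the origin at top left).
    anchors = [(x, y) for t, x, y in ocr_words
               if t.lower() == "invoice" and y < 400 and x > 300]
    if not anchors:
        return None
    ax, ay = anchors[0]
    # Look left of and below the anchor for a numeric sequence.
    candidates = [(t, x, y) for t, x, y in ocr_words
                  if re.fullmatch(r"\d{4,}", t) and x < ax and y > ay]
    if not candidates:
        return None
    # Take the candidate closest to the anchor (Manhattan distance).
    return min(candidates, key=lambda c: abs(c[1] - ax) + abs(c[2] - ay))[0]

words = [("Invoice", 500, 50), ("1004217", 420, 80), ("2023", 100, 700)]
print(capture_invoice_number(words))
```

Because the rule is relative to an anchor word rather than fixed coordinates, it survives margin shifts that would defeat a plain template, but every new layout still needs a human to write another rule.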

This approach allows a document with predictable elements, like an invoice, to have its data captured automatically even when that data moves around on the page. The practice can be very effective on the right kinds of documents. However, both templates and the rules-based approach, called “Intelligent Data Capture,” require process owners to manage large collections of templates in order to automate capture across a large number of document variations.

For instance, if you are a large company, your A/P department may process 5,000 invoices each month, coming from 8,000 different vendors over the course of just a few months. Think of all the templates you’d need to create to capture invoices from 8,000 different vendors! What ends up happening is that templates begin to conflict with one another, and automation becomes the snake eating its own tail. As you fix one problem with capturing a particular line of data from an invoice, the fix creates two new problems on other invoices, sending you into a never-ending game of whack-a-mole template and rules management.

Additionally, templates require management at both a micro and a macro level. If you only receive an invoice from a particular vendor once, does it make sense to spend 30 minutes creating a template to automate what would take 5 minutes to do manually? How do you know when an invoice is going to be common and when it is an edge case that will never be seen again? Templates are messy.

The technologies of OCR, templates and rules have been around for over 25 years. They have improved in both accuracy and speed over that time, bringing us to a place where OCR results on clean documents can be expected to reach the high-90-percent accuracy range while reading a document in a second or less. But the challenge of template and rules management couldn’t be solved just by having better OCR; we needed a better way for software to predict where on a page information would be located, or what type of document it was looking at.

Enter the age of Artificial Intelligence. About five years ago, companies in the document processing services industry began working on new technologies to overcome the problems created by the last decade’s solutions. How do we get the computer to create and manage the rules itself, without our becoming overwhelmed?

So first, what is A.I. in the context of document processing? We are talking about two related technologies being added to existing approaches to overcome the challenges of templates and rules: Deep Learning and Machine Learning.

Deep learning is an AI approach that is similar in nature to the human brain in how it processes data and recognizes patterns for use in decision making. It is a branch of machine learning built on networks capable of learning from unstructured data without operator oversight, and is also known as deep neural learning or deep neural networks.

Machine learning is the ability of a computer program to learn and adapt to new data without a human operator, keeping the computer’s built-in algorithms up to date with the newest inbound information.

What drives success in Artificial Intelligence of the Deep Learning/Machine Learning kind is sheer volume. The more these systems have seen, the better they are at predicting what to do with the next item. In the world of document processing, this works out very nicely, because most document processing companies already have literally millions of documents that have been processed correctly by humans and that can be fed into these systems to build the base knowledge that drives the algorithms to success in the future.

For example, if you want to train your AI to classify documents using this new model, instead of trying to pre-create a map for the computer to follow to identify a form, you simply show the system millions of forms and let the system start to differentiate between items. Once the system is adequately trained, you can start to identify the classes of documents the AI system has identified as unique.

Imagine, if it were a physical world, a robot looking at each document and placing it in a pile of like documents. After it had done this for a few days, you could show up, point at each pile and name it: “Those are 1040s, those are one kind of W-2, and that pile is another kind of W-2…” and so on. Through this method, you haven’t had to create a management layer between the system and the documents; the AI system creates it automatically through its machine learning algorithms.
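The “piles” metaphor can be sketched in miniature. This toy sketch groups documents greedily by word overlap (Jaccard similarity) with no labels given up front; a person names each pile afterwards. The sample documents and threshold are illustrative assumptions:

```python
# Toy unsupervised grouping: put each document in the first existing pile
# whose representative it sufficiently resembles, else start a new pile.

def jaccard(a, b):
    """Word-set overlap between two documents, from 0.0 to 1.0."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def make_piles(docs, threshold=0.5):
    piles = []  # each pile is a list of similar documents
    for doc in docs:
        for pile in piles:
            if jaccard(doc, pile[0]) >= threshold:
                pile.append(doc)
                break
        else:
            piles.append([doc])  # nothing similar yet: start a new pile
    return piles

docs = [
    "form 1040 individual income tax return",
    "form 1040 income tax return for individuals",
    "form w-2 wage and tax statement",
    "w-2 wage and tax statement copy b",
]
for i, pile in enumerate(make_piles(docs)):
    print(f"pile {i}: {len(pile)} documents")
```

Production systems use far richer representations and clustering methods, but the shape of the idea is the same: the groups emerge from the data, and the human only supplies names at the end.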

The tools for performing these tasks are now available to the general public. For document classification, you’ll find “Bag of Words” algorithms that compare word frequencies and word-combination frequencies across documents in order to differentiate them. The programming language Python is commonly used in these implementations due to its ease of use.
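As a minimal Python illustration of the Bag of Words idea, the sketch below represents each document as a word-frequency vector and assigns a new document the label of its most similar labeled exemplar by cosine similarity. The labeled exemplars and their wording are invented for illustration:

```python
# Minimal bag-of-words classifier: word-frequency vectors compared by
# cosine similarity against a small set of labeled exemplars.
from collections import Counter
from math import sqrt

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical labeled exemplars, one per document class.
labeled = {
    "invoice": bag_of_words("invoice number amount due remit payment total"),
    "w2": bag_of_words("wages tips compensation employer federal income tax withheld"),
}

def classify(text):
    vec = bag_of_words(text)
    return max(labeled, key=lambda label: cosine(vec, labeled[label]))

print(classify("total amount due on invoice 1042"))
print(classify("federal income tax withheld wages"))
```

Real systems train on thousands of examples per class rather than one exemplar, but the core representation, a vector of word counts with word order discarded, is exactly what “Bag of Words” means.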



With templates, we were held back from creating more and more document classifications by the overhead and complexity of managing thousands of templates and by the frustration of “one step forward, two steps back” interference. As a result, we ended up classifying only the most common, most structured documents and leaving the rest for manual classification. That’s not how things work out in the AI world.

What does this mean for companies that process inbound documents as part of a business process? With AI running the classification show, we can become much more expansive in our use of individual document classes. When supporting a client who uses a document type of “Miscellaneous” as a catch-all for documents that are not critical to their business process, we throw everything that is not on a predefined list of expected or required documents into the non-descriptive bucket of “Misc.” The cost of this compromise is often lost in the shuffle, literally. Where we could be detecting specific types of documents being submitted, we instead give them all the same ambiguous label and lose the ability to do any analysis or predictive work on those documents. They’ve been anonymized in the database.

With Bag of Words, we can begin to identify an unlimited set of unique documents, limited only by our own effort to identify more and more of the documents in our business processes. What’s more, while we have limited our conversation to the use of AI to classify and extract data from documents, downstream AI could be analyzing the mix of documents submitted under the newly expanded classification scheme to find important trends in document submissions.

For instance, there could be a type of document that is commonly submitted by a particularly profitable type of client but isn’t being noticed because it has traditionally been labeled “Miscellaneous” by the processor. With Bag of Words, the document can now be classified as what it is, at no more cost or effort than calling it Miscellaneous. If certain trends in document submissions can be detected, those trends could expose all kinds of interesting things. But you will never find out if you keep calling everything outside of what you expect “Miscellaneous.”

So, with Bag of Words, we are much less limited in how many different document types we can automatically identify with a high degree of accuracy. This should lead us to greatly reduce our use of catch-all categories in document processing, which in turn yields more specific data about what is being submitted. The submission behavior of clients could reveal buyer personas that are important to your business. In short, through Bag of Words we get more specificity, and more specificity gives us better insight.

Because classification and data extraction have traditionally been human-based activities, we’ve made decisions and compromises in order to achieve our goals in light of the cost and effort required to perform the tasks. Simply put, we did just enough to get the job done. If, suddenly, both the effort and the cost are reduced, we should reconsider the compromises we’ve made.

For instance, most grocery stores do not enter line-item billing from vendors into their ERPs at the store level. Most invoices for grocery stores are delivered with the product in the receiving area. If you are a grocery store chain, it’s very likely you are buying produce from multiple vendors across a geographic region; you probably buy tomatoes from a number of them. Because of the limitations of humans typing invoice data into the ERP from the receiving dock, most grocery chains don’t know how much they are paying each vendor for their tomatoes. As a result, they don’t know who is giving them the best deal.



What if… the invoice, instead of being typed by a person, was read by a computer and entered into the ERP automatically? Would that entice the grocery chain to start capturing more data from its invoices, so it could create regional heat maps of the cost of goods, determine which vendors were providing the best prices and expand purchasing from those vendors? Only if the cost makes sense.

Bag of Words document classification is a much easier automation task than data extraction. With classification, we are only deciding what the entire document is, not what one word on the document is, so we have many more chances to find information that supports the computer’s decision.

With data extraction, most software products using Deep Learning and Machine Learning to extract field-level data from documents are still in the early stages of development. What we are seeing is that AI is producing huge benefits in some commonly troublesome areas. Take invoices, where we attempt to capture line-item data: the variety of grids used to lay out this information is nearly infinite, and for all of the reasons already stated, capturing invoice line items through template creation is very challenging.

With AI, we are seeing systems that can automatically detect the location of line-item grid information and adjust capture accordingly. This is a very significant capability. It shows that the AI can adapt to changing document structures on the fly, without human intervention, and can capture data that is structured in novel ways without an operator pre-mapping documents or creating business-rules logic to assist it. Once trained, the AI determines where the data to be captured is located and goes after it.

These systems are very new. Many struggle with black-box issues: the AI fights with itself over very similar documents and gives the operator no ability to undo whatever is happening inside. This can lead the software to introduce rules that are counterproductive. The same thing has been seen with AI in other industries, where algorithms seem to develop a mind of their own and do unintended things. A great example is Amazon’s hiring AI, which, after reviewing resumes from existing staff, decided that women’s schools must not be considered when hiring engineers. Oops. The AI was biased by Amazon’s own hiring history: when it tried to interpret which past resumes led to a successful hire, the results skewed heavily toward male engineers. Since male engineers didn’t attend women’s universities, the system created a rule that deducted points from any resume containing words consistent with the name of a women’s college.

For the time being, we still have to watch AI as it works and continue to shape and prune its behavior, a lot like parents teaching a child. Mostly, you let the child experience the world, but when it’s getting into trouble, you step in and provide more information. Otherwise, it might become a sexist HR manager at Amazon, and nobody wants that.



The best systems are addressing this challenge with screens that allow an operator to look inside the brain of the AI and pick out rules that are defeating the business process. This means companies will need employees who understand how AI processes information and how to keep it doing a good job without going down the path of Amazon’s HR AI.

It’s still very early in the game to say for certain that AI will completely eliminate document classification and data extraction as human based jobs. But, it’s clear that these technologies are going to greatly assist us in the effort of organizing and extracting data from huge sets of documents for less money, faster and with higher accuracy than we as humans were able to do on our own. Reduced human suffering? Check!

In the future, we should expect an explosion in the number of document types identified and tracked as part of business processes, and the amount of data extracted from each document will increase in both volume and accuracy as OCR and AI continue to improve and mature. For clients with clean source documents, produced through document scanning or the submission of digital originals, the expectation of a Zero Touch process will become a reality in the next few years, if it hasn’t already.

For those with documents that come from uncontrolled sources, like pictures taken from cell phones, the ability to process some documents without human intervention will continue to improve as well, and in the next decade will likely reduce the reliance on human operators by between 15% and 95% depending on the business process complexity and document quality.

As we move towards our goal of a zero-touch world of document processing, we are learning how artificial intelligence can aid us, but not without risk. Like the solutions of the past, the problems AI solves will give rise to new problems, requiring solutions we aren’t yet aware of. But ultimately, we will arrive at a place where everyone has the opportunity to do great work, because we’ve eliminated the manual, repetitive tasks of the past.


Written by

Bill Becker

