In 2000, the Census Bureau implemented OCR/ICR-based data capture for the first time. “I think I’ve been spoiled working with the Census Bureau,” Paxton told DIR. “Those guys are some of the biggest data junkies on the planet. They fuss over every piece of information. They are very familiar with the data coming in and have all sorts of lists to compare it to, to improve the accuracy of the OCR/ICR.
“The work [U.S. Census Bureau contractor] Lockheed Martin does in data capture is also very sophisticated. In addition to the 2000 U.S. Census, Lockheed has used the same team for U.K. and Canadian censuses, and it is now working on its second U.S. census. Lockheed is improving its techniques all the time. When I started working with commercial clients, their technology was nowhere near what I was used to. The good news is that this presents a tremendous opportunity.”
The differentiating factor
Some of the basic improvements Paxton noted that could be used in many commercial operations include improved forms design, performing metrics and analysis at the field level rather than the character level, properly utilizing dictionaries to look up results, and checking results against context. These are all concepts familiar to most advanced forms processing technology vendors and integrators. Where ADI differentiates itself is with the introduction of a model for determining the optimum confidence-level settings for OCR/ICR.
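As a rough illustration of the dictionary look-up idea, the sketch below snaps a raw recognition result to the closest entry in a list of known values. It uses Python's standard difflib module. The function name, the sample surnames, and the 0.8 similarity cutoff are all assumptions made for illustration, not anything ADI or the Census Bureau described.

```python
import difflib

def dictionary_lookup(raw_text, known_values, cutoff=0.8):
    """Return the closest known value to raw_text, or None if nothing is close.

    The cutoff is a hypothetical similarity threshold between 0 and 1.
    """
    matches = difflib.get_close_matches(raw_text, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical list of expected surnames for a name field.
surnames = ["JOHNSON", "JACKSON", "JONES"]

# A zero misread in place of the letter O is a common recognition confusion.
corrected = dictionary_lookup("J0HNSON", surnames)
no_match = dictionary_lookup("XYZ", surnames)
```

A result that matches nothing in the list could then be treated as a reject and sent to an operator rather than passed downstream.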
Confidence-level settings determine the number of fields and/or characters that will be sent to data entry operators for quality assurance (QA) and/or keying. Confidence settings are typically expressed as percentages. For example, if you set your confidence level at 98%, any character or field that the engine is not at least 98% sure it has identified correctly will be sent to an operator for QA or key entry. Typically, the lower you set your confidence levels, the fewer items will be sent to these operators. In the above example, if you reduced your confidence level to 90%, you would lower your “reject rate” and thus reduce your manual labor costs.
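The routing described above can be sketched in a few lines. This is a minimal illustration, assuming the engine reports a confidence score between 0 and 1 for each field. The FieldResult type, the field names, and the scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FieldResult:
    name: str
    text: str
    confidence: float  # engine's self-reported confidence, 0.0 to 1.0 (assumed)

def route(fields, threshold):
    """Split results into (accepted, rejected) using the confidence threshold."""
    accepted = [f for f in fields if f.confidence >= threshold]
    rejected = [f for f in fields if f.confidence < threshold]
    return accepted, rejected

def reject_rate(fields, threshold):
    """Fraction of fields that would be sent to an operator."""
    if not fields:
        return 0.0
    _, rejected = route(fields, threshold)
    return len(rejected) / len(fields)

# Hypothetical results for one form.
fields = [
    FieldResult("last_name", "SMITH", 0.99),
    FieldResult("zip", "20746", 0.95),
    FieldResult("age", "41", 0.91),
    FieldResult("street", "MAIN ST", 0.85),
]

rate_at_98 = reject_rate(fields, 0.98)  # three of the four fields fall below 0.98
rate_at_90 = reject_rate(fields, 0.90)  # only one field falls below 0.90
```

Lowering the threshold from 98% to 90% cuts the reject rate for this sample, which is the labor saving the paragraph above describes.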
However, setting confidence levels lower also increases the potential that incorrect data will be captured and sent to business processes. “It’s a classic trade-off between cost and quality,” said Paxton.
The hidden cost of bad data
The key to ADI’s model is weighing the costs of data errors vs. the cost of ensuring that data is captured correctly. “The cost of capturing data factors in software and hardware, as well as the costs for manual operators,” Paxton told DIR. “The data error costs can vary, depending on the nature of the application.
“U.S. Census data, for example, is used to determine everything from the alignment of seats in the House of Representatives, to how many computers should be bought for schools. So, incorrect data can be very expensive in the long run. In a financial services environment, incorrect data can cause the wrong funds to be deposited in the wrong places, so errors can also be very costly. In contrast, the cost of mistakes might be lower in an application involving survey results.”
Paxton stressed that it’s important for any business doing data capture to realize that their mistakes cost money. “Our model makes no sense if you don’t calculate the cost of errors downstream,” Paxton said. “I’ve had businesses tell me there is no cost attributable to capturing incorrect data. My response was to ask why they didn’t just set their confidence levels at zero and fire all their data entry personnel. Obviously, it didn’t matter if their data was correct or not, so what good are the operators?”
The model in action
In summary, the ADI model goes something like this: As you increase confidence levels, your reject rates increase. As a result, more documents are sent to operators, and your capture costs go up. As you decrease confidence levels, your capture costs go down, but your costs for downstream mistakes rise. Of course, techniques like the “basic improvements” mentioned above can also be implemented to reduce reject rates without sacrificing confidence levels.
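One way to picture that trade-off is as a simple per-document cost function. The formula below is an assumed simplification for illustration only and is not ADI's actual model, which the article does not spell out. All parameter names and sample values are hypothetical.

```python
def cost_per_document(reject_rate, accepted_error_rate,
                      base_capture_cost, keying_cost, error_cost):
    """Illustrative cost per document under assumed, simplified terms.

    reject_rate: fraction of documents sent to operators
    accepted_error_rate: fraction of auto-accepted documents that are wrong
    base_capture_cost: software and hardware cost per document
    keying_cost: operator cost for one rejected document
    error_cost: downstream cost of one wrong document
    """
    operator_cost = reject_rate * keying_cost
    downstream_cost = (1.0 - reject_rate) * accepted_error_rate * error_cost
    return base_capture_cost + operator_cost + downstream_cost

# Hypothetical operating point: 20% rejects, 1% of accepted documents wrong.
example_cost = cost_per_document(
    reject_rate=0.20,
    accepted_error_rate=0.01,
    base_capture_cost=0.50,
    keying_cost=2.00,
    error_cost=50.00,
)
```

In practice the reject rate and the accepted error rate move together as the confidence level changes, so both would have to be measured for each setting being compared.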
For example, in the earliest version of the 2000 Census processing system, with confidence levels set at 98%, ADI determined that approximately 37% of the documents being processed would need to be sent to human operators. After making some improvements to the application, at a 98% confidence level, the reject rate dropped to less than 5%. Once this improved system was in place, ADI began experimenting with higher confidence levels, which increased the reject rate, but guarded better against faulty data being passed downstream.
“For the Census, we figured the cost of errors to be 50-100 times the average cost of capturing data from the document,” said Paxton. “It might have been closer to 1,000x, but we took a stab at it and came up with 50-100x.”
Finding the sweet spot
Using those numbers, ADI plugged in different confidence levels and their subsequent reject rates. The highest reject rate shown on the chart Paxton shared with us was 80%, which equated to a cost per document of approximately $4.00. As the confidence level was decreased, the reduction in cost for operators outweighed the cost for the increase in downstream mistakes until confidence levels were lowered to the point where there was a 20% reject rate. After that, as the confidence level continued to decrease and the reject rate sank below 20%, the amount of bad information going through caused the overall cost per document to shoot up sharply.
Thus for the 2000 Census, confidence levels set to produce a 20% reject rate proved to be optimum, creating a cost per document of approximately $2.50. This contrasted with a cost of around $4.00 for an 80% reject rate and $5 for a less-than-1% reject rate. “The cost savings between setting your confidence levels to create a reject rate of 30% and 25% may only be $.50 per document, but when you’re doing 100,000 documents per day, that adds up pretty quickly,” said Paxton.
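The figures above can be checked with a few lines of arithmetic. The sketch below uses only the three operating points quoted in this article and simply picks the cheapest one. The "less than 1%" reject rate is approximated as 0.01, which is an assumption, and the variable names are illustrative.

```python
# (reject rate, approximate cost per document in dollars), as reported above.
reported_points = [
    (0.80, 4.00),
    (0.20, 2.50),
    (0.01, 5.00),  # "less than 1%" approximated as 0.01 (assumption)
]

# The sweet spot is the operating point with the lowest cost per document.
best_reject_rate, best_cost = min(reported_points, key=lambda point: point[1])

# Paxton's example: a $.50 per-document saving at 100,000 documents per day.
daily_savings = 0.50 * 100_000
```

On Paxton's numbers, a 50-cent difference per document works out to $50,000 per day at that volume.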
Utilizing test decks
One special ingredient that ADI uses for creating its cost-per-document model, especially for applications that involve forms with hand-printed information, is its patented Digital Test Deck (DTD) technology. DTDs can be used to electronically create simulated hand-print filled forms. “One big advantage of a DTD is that, because you created the data on each form, you know what the correct results should be,” said Paxton. “So, you can measure exactly how many errors you are making.”
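The measurement step Paxton describes is easy to sketch once the correct values are known in advance. The code below does not reproduce ADI's patented DTD technology. It only shows, with hypothetical field names and values, how captured results could be scored against known truth at the field level.

```python
def field_error_rate(truth, captured):
    """Fraction of fields whose captured value differs from the known truth.

    A field missing from the captured results counts as an error.
    """
    if not truth:
        return 0.0
    wrong = sum(1 for name, value in truth.items() if captured.get(name) != value)
    return wrong / len(truth)

# Known values written onto a simulated form (hypothetical).
truth = {"last_name": "SMITH", "zip": "20746", "age": "41", "street": "MAIN ST"}

# What the recognition engine returned for that form (hypothetical).
captured = {"last_name": "SMITH", "zip": "20748", "age": "41", "street": "MAIN ST"}

error_rate = field_error_rate(truth, captured)  # one of four fields is wrong
```

Running this kind of scoring across a whole test deck at several confidence levels would give the accepted error rates that a cost-per-document comparison needs.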
Paxton concluded our conversation by showing a slide charting work ADI is currently doing with an unnamed customer. With its current forms processing implementation, as this customer’s reject rate increases, the cost per document always goes down. In other words, the current version of the application is basically useless, because if it had a 100% reject rate, meaning all documents were being keyed, it would achieve its lowest cost per document. Obviously some improvements need to be made.
In its lab, ADI made some of the “basic improvements” we discussed earlier. With the old system, when the confidence level was set at 99%, the customer had a reject rate greater than 75%. With the new system, with the same confidence level, the reject rate was closer to 25%. ADI then set about finding the optimum confidence level and reject rate. The 25% reject rate proved to be the sweet spot, with the cost per document coming in at around $1.25. With its previous system, the lowest cost per document that could be achieved, with reject rates set at around 80%, was $1.85.
Innovations on the rise
ADI’s scientific model for determining optimum confidence levels, as well as its DTDs, are the latest in a series of innovations that are reducing the hocus-pocus factor related to forms processing. Recently, we’ve also done stories on technology from vendors like A2iA and Orbograph that similarly refines processes related to automated data capture. This is on top of all the un- and semi-structured document processing technology we’ve seen introduced and improved over the past few years.
No, forms processing technology will never be for everyone, but with the increasing number of tools available to complement OCR/ICR engines, as well as improving engines themselves, we expect to continue to see an increase in satisfied customers well into the future. Annual capture software market growth of 15-20%, as tabulated by analyst firm Harvey Spencer Associates, is reflective of this.
Assembling solutions
The biggest challenge we see in the market right now is getting the right technology into the right applications. Despite its continued growth, the capture market remains very fragmented, with several vendors owning strengths in different technologies and verticals. The state of the legacy system installed at ADI’s unnamed customer is certainly not an isolated example. Too often, customers are not getting a true forms processing solution—but rather a piece of software that may or may not meet their data capture needs.
As the demand for forms processing technology increases, so must the qualified manpower to service it. We encourage vendors to step up their reseller education programs (and resellers to take advantage of them) in order to clean up forms processing’s reputation once and for all. Now that tools for solid automated data capture processes are finally in place, let’s make sure we all know how and when to use them. After all, this know-how and ability to put together a solution made up of multiple vendor tools is what puts the “value” in value-added reseller.
For more information: www.adillc.net