Multiple methods are available for capturing data from unstructured documents (letters, invoices, email, fax, forms, etc.). The list of methods below is not exhaustive, but it is a guide to the appropriate use of each method in business process automation projects.
As well as the method of data capture, consider the origins of the document(s) to be captured. If the documents are available in their original electronic format, capturing from that format has the potential to massively increase data capture accuracy and remove the need for printing and scanning. Methods of capture from documents in electronic format are identified below.
Whenever a method of capture is considered, it is advisable first to review the original documents, to determine whether the document or form can be updated to improve the capture and recognition process. Investigating the existing line-of-business systems, to determine what additional metadata can be retrieved at no extra cost using a single reference, can also provide significant advantages.
The correct method(s) of metadata capture for a particular business process automation project will be chosen by considering all the methods identified below; a single method or a combination of several may be appropriate.
Manual keying of metadata from unstructured data is appropriate for data that is received in low volumes and that yields low levels of recognition by intelligent data capture products (IDR, ICR). ProcessFlows has a Manual Keying service as part of our Outsourcing Solutions; please click here for more information.
Nearshore keying of Metadata is most appropriate for the following reasons:
ProcessFlows has a Nearshore Keying service as part of our Outsourcing Solutions; please click here for more information.
SingleClick is an Optical Character Recognition (OCR) tool that can be used to capture machine-produced characters in low-volume, ad hoc capture applications and to populate a line-of-business application. For more on SingleClick, please click here.
OCR as a technology provides the ability to capture machine-produced characters in preset zones or across a full page. OCR systems can recognise many different OCR fonts, as well as typewriter and computer-printed characters. Depending on the capabilities of the particular OCR product, OCR can be used to capture low to high volumes of data where the information is in consistent location(s) on the documents. Please click here to learn more about OCR.
Intelligent Character Recognition (ICR) is the computer translation of hand-printed and handwritten characters. Data is entered from hand-printed forms through a scanner, and the image of the captured data is then analysed and translated by ICR software. ICR is similar to OCR but is a more difficult process, since OCR works on printed text whereas ICR must interpret handwritten characters. Please click here to learn more about ICR.
Depending on the type of barcode used, the amount of metadata that can be included is high, as is the level of recognition. The application of single or multiple barcodes to particular document types, such as Proof of Delivery notes, membership forms, application forms and gift aid forms, can dramatically increase the effectiveness of a business process. For more about barcodes, please click here.
The level of capability is dependent upon the individual template-based intelligent capture product. More advanced products are able to identify machine-produced and, to a lesser degree, handwritten characters that are contained in particular area(s) of a document. These applications are used where the number of document types being received is relatively low (typically up to 30 different document types) and the layout of each type is consistent. They are used in applications such as census, inter-bank transfers and application forms. For more information on this technology, please see our ReadSoft Forms page.
The level of capability is dependent upon the individual product. These applications use rules to capture metadata from documents. For example, the product will identify post codes, logos, key words and VAT registration numbers and, through an ongoing learning process, capture information from multiple document types.
This type of capture is used for high-volume invoice processing and digital mailroom applications, where the classification and indexing of incoming documents is key. IDR software applications use rules to identify and capture information from semi-structured documents. Rules, specified by end users, look for specific text on a document to identify the document type. Additional rules can then be applied to each document type, extracting different metadata fields from each.
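The two-stage approach described above (classify by the presence of specific text, then apply type-specific extraction rules) can be illustrated with a minimal sketch. It assumes the document text has already been recognised, and every rule, keyword and field name here is hypothetical; it does not represent how any particular IDR product is configured.

```python
import re

# Hypothetical classification rules: a document is assigned the first type
# whose keywords all appear in its text.
CLASSIFICATION_RULES = {
    "invoice": ["invoice number", "vat"],
    "credit_note": ["credit note"],
}

# Hypothetical extraction rules per document type:
# field name -> regular expression with one capture group.
EXTRACTION_RULES = {
    "invoice": {
        "invoice_number": re.compile(r"invoice number[:\s]+(\w+)", re.IGNORECASE),
        "gross": re.compile(r"total[:\s]+£?([\d.,]+)", re.IGNORECASE),
    },
    "credit_note": {
        "credit_note_number": re.compile(r"credit note[:\s]+(\w+)", re.IGNORECASE),
    },
}


def classify(text):
    """Return the first document type whose keywords are all present, else None."""
    lowered = text.lower()
    for doc_type, keywords in CLASSIFICATION_RULES.items():
        if all(keyword in lowered for keyword in keywords):
            return doc_type
    return None


def extract(text):
    """Classify the text, then apply only the rules for that document type."""
    doc_type = classify(text)
    fields = {}
    for name, pattern in EXTRACTION_RULES.get(doc_type, {}).items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return doc_type, fields


sample = "ACME Ltd\nInvoice Number: INV1001\nVAT: 20.00\nTotal: 120.00"
print(extract(sample))
```

A document that matches no classification rule comes back with a type of `None` and no fields, which in practice would be the cue to route it to manual indexing.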
These applications are commonly used for digital mailroom environments, with the idea that documents are taken out of their envelopes and fed straight into a scanner with very little manual processing.
Specialised applications exist for departmental projects such as invoice processing. IDR applications can hold information about suppliers generated from other line-of-business systems and match invoices to that information, using recognised text such as VAT number, telephone number or post code. The application then looks for keyword identifiers on the invoice and extracts the value nearby. Validation rules are then applied to minimise the chance of errors; for example, the net amount plus the VAT amount must equal the gross amount.
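As a minimal sketch of the supplier matching and validation steps just described, the following assumes the VAT number and amounts have already been captured as strings. The supplier table, its single entry and the function names are hypothetical.

```python
from decimal import Decimal

# Hypothetical supplier data, as might be exported from a line-of-business
# system, keyed by VAT registration number.
SUPPLIERS = {
    "GB123456789": "Acme Supplies Ltd",
}


def match_supplier(vat_number):
    """Look up a supplier by VAT number, ignoring spaces and letter case."""
    normalised = vat_number.replace(" ", "").upper()
    return SUPPLIERS.get(normalised)


def validate_totals(net, vat, gross):
    """Check the rule from the text: net amount plus VAT must equal gross.

    Decimal is used instead of float so that currency values compare exactly.
    """
    return Decimal(net) + Decimal(vat) == Decimal(gross)


print(match_supplier("gb 123 456 789"))
print(validate_totals("100.00", "20.00", "120.00"))
```

An invoice that fails the totals check would typically be flagged for an operator to review rather than rejected outright.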
In our experience, organisations often reduce everything to paper format before going through the process of capturing data. They often do this even when they receive the information in its original digital format. Where this is the case, it is unnecessary, time-consuming and costly, and it often results in a lower level of success in extracting the required data.
Where information is available in its original digital format, tools such as Formate enable organisations to automate the receipt and interrogation of searchable PDFs, Word documents, electronic forms, instant messages, etc. The required data is captured digitally, which removes the need to print and scan these documents before using ICR, OCR, IDR or any of the techniques identified above. As an example, invoices received via email in a searchable PDF format can potentially have the required data automatically extracted with a high level of accuracy and no human input.
Examples include cheque requisition reports, property tax reports, invoice and credit note runs. The reports would be parsed by the application and broken down into individual records or pages. At the same time, index information is extracted from each record or page and associated with that record or page.
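The parsing step described above can be sketched as follows. This is an illustration only: it assumes the report is plain text with a form feed character between pages and uses hypothetical "Invoice No:" and "Customer:" labels; real report layouts and the product's own configuration will differ.

```python
import re

# Hypothetical labels that mark the index fields on each page of the report.
INVOICE_NO = re.compile(r"Invoice No:\s*(\S+)")
CUSTOMER = re.compile(r"Customer:\s*(.+)")


def split_report(report_text):
    """Break a report into pages and attach extracted index data to each page."""
    records = []
    # Assumption: pages are separated by a form feed character.
    for page in report_text.split("\f"):
        page = page.strip()
        if not page:
            continue  # skip the empty page left by a trailing form feed
        invoice = INVOICE_NO.search(page)
        customer = CUSTOMER.search(page)
        records.append({
            "index": {
                "invoice_no": invoice.group(1) if invoice else None,
                "customer": customer.group(1).strip() if customer else None,
            },
            # The full page text is kept so it remains available for searching.
            "text": page,
        })
    return records


report = (
    "Invoice No: 1001\nCustomer: Smith & Co\nTotal: 50.00\f"
    "Invoice No: 1002\nCustomer: Jones Ltd\nTotal: 75.00\f"
)
for record in split_report(report):
    print(record["index"])
```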
The full text content of the document is also made available for searching. To improve the presentation of the document to the end user, an overlay can be added. The overlay can be a representation of the form or paper that the original report would have been printed on, so that an invoice record, for example, resembles the original printed invoice. Datagrabber can also be used to import images or files, along with indexing information extracted from a legacy system or from a manually created file. It can also be used to create the required structure of a database within Alchemy.
The capture of pure voice records and voice forms is becoming as important for businesses as other forms of communication (email, web forms, fax). Applications such as CX-E (CallXpress) provide the ability to capture voice commands to initiate business processes, store voice records alongside all other forms of communication for future reference in a document management system and convert speech to text. In the case of speech to text, this provides the opportunity to utilise OCR, ICR, IDR technology to support the business needs. Contact centres provide a good example of where the combination of voice, instant messaging, email, fax and web forms will all be found supporting a common business process.
For more information, please call us on +44 (0) 1962 835053, or email firstname.lastname@example.org.