Thursday, September 20, 2012

What Are Document Types and How Do They Relate to Document Capture?

Beyond the custom document types you choose to work with in your company (XYZ template, claim form, service termination request, correspondence, etc.), or the typical document types used frequently in all industries (invoices, quotes), there are some generic types of documents that are important to consider in data capture projects for digitized documents.

Remember that document capture is used to fill in metadata automatically or to automate data entry into another enterprise application (ERP, accounting software, etc.), and that this is the job of the capture software.

1. Structured Documents

These are documents in which we know what information we'll find and where it is physically located in the document (coordinates). Examples of this document type are forms, templates, passports, ID cards, etc. For example, all California ID cards contain the same information, and it is located in the same position. In this type of document, it's very easy to find and extract data, since we know where to look.

In this video of Athento's iCapture module we can see how, with these document types, users can very easily define data extraction templates, indicating in Athento's OCR tool what data to extract and the coordinates where that data is located.
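
To make the idea of a coordinate-based template more concrete, here is a minimal Python sketch. It is not Athento's implementation: the `Word` and `Zone` classes, the `extract_by_template` function, the field names and all the coordinates are hypothetical, and it assumes an OCR engine has already returned each word together with its bounding box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word with its bounding box (hypothetical structure)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass(frozen=True)
class Zone:
    """Rectangle on the page where a field is expected to appear."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        # A word belongs to the zone if its center point falls inside it.
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        return self.left <= cx <= self.right and self.top <= cy <= self.bottom


def extract_by_template(words, template):
    """Return {field name: text found inside that field's zone}."""
    result = {}
    for field, zone in template.items():
        hits = [w for w in words if zone.contains(w)]
        # Simple reading order: top to bottom, then left to right.
        hits.sort(key=lambda w: (w.y, w.x))
        result[field] = " ".join(w.text for w in hits)
    return result


# Hypothetical template for an ID card: the coordinates are made up.
ID_CARD_TEMPLATE = {
    "id_number": Zone(100, 10, 300, 30),
    "last_name": Zone(100, 40, 300, 60),
}

# Hypothetical OCR output for one scanned card.
ocr_words = [
    Word("DL", 60, 12, 20, 12),
    Word("I1234568", 110, 12, 90, 12),
    Word("LN", 60, 42, 20, 12),
    Word("CARDHOLDER", 110, 42, 110, 12),
]

fields = extract_by_template(ocr_words, ID_CARD_TEMPLATE)
```

Because every card of the same type has the same layout, the same template can be reused for every scan. The labels "DL" and "LN" fall outside the zones, so only the values are extracted.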

2. Semi-structured Documents

Management challenges keep growing as the content inside documents becomes less structured.
Semi-structured documents are those in which we know what information we'll find, but not exactly where we'll find it. A classic example of this type of document is the invoice.

An invoice must include a VAT amount, a Tax ID, a total amount, and customer and seller data such as an address, among other things. Regardless of the seller, we know that an invoice will include this information, but we can't guarantee that all of our suppliers will place it in the same area of the invoice.

The Tax ID may be on the right of the document, but it could also be on the left. In these cases we know what we're looking for, but we don't know where to find it.

Data can't be extracted from semi-structured documents the same way as in the previous case. Here we need to teach the software how to find what we're looking for. For example, the VAT value, or whatever tax amount applies, should be preceded by the word "VAT" or "Taxes".

More advanced data mining and capture applications, like Athento's Document Capture, not only look for regular expressions but also try to put the information they're looking for in context, relating it to smaller structures within the document such as tables, images, paragraphs, etc.
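
The keyword rule above can be sketched with regular expressions. This is only an illustration under simple assumptions, not how any particular capture product works: the `FIELD_PATTERNS` table, the `extract_fields` function and both sample invoices are made up, and the amount pattern only handles plain numbers with an optional two-digit decimal part.

```python
import re

# A plain amount such as 242, 242.00 or 242,00 (assumed format).
AMOUNT = r"(\d+(?:[.,]\d{2})?)"

# Each field is found by its keyword, followed by the first number
# that appears after it on the same line.
FIELD_PATTERNS = {
    "vat": re.compile(r"\b(?:VAT|Taxes)\b[^\d\n]*" + AMOUNT, re.IGNORECASE),
    "total": re.compile(r"\bTotal\b[^\d\n]*" + AMOUNT, re.IGNORECASE),
}


def extract_fields(text):
    """Return {field name: amount string, or None if the keyword is missing}."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found


# Two hypothetical suppliers that lay out the same data differently.
invoice_a = """ACME Supplies
Tax ID: B12345678
Subtotal 200.00
VAT: 42.00
Total due: 242.00
"""

invoice_b = "TOTAL 242,00   Taxes 42,00\nThank you for your business\n"

fields_a = extract_fields(invoice_a)
fields_b = extract_fields(invoice_b)
```

Both layouts produce the same two fields, because the rule depends on the keyword and not on the position. A rule this simple is easy to fool: a line such as "VAT 21% 42.00" would return the rate instead of the amount, which is why real projects need more context than a single keyword.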

3. Unstructured Documents 

In this document type, we don't know what we'll find, or where. Extracting data from these documents is as difficult as it gets. Here we find reports, letters, technical documents, etc.
I should clarify that, according to some authors, the second group (semi-structured) doesn't exist, and documents such as invoices are included in this group.

Specifically, for AIIM, an unstructured document has three characteristics:
  • The structure of the documents was not designed by the company that now wants to manage them (i.e., they are external documents)
  • The structure of these documents may vary depending on the sender (for example, in the case of invoices, each provider has its own model)
  • They can't be processed by sticking to a template.

This post was originally published here by Verónica Meza on our Spanish blog. We hope you found it useful!

For any help with intelligent document management or capture, feel free to contact us at our brand new Silicon Valley location. We want to hear from you! Yeey!
