Monday, November 26, 2012

Intelligent Document Capture vs Documentation Professionals?

By Joaquin Hierro, Document Management Technical Architect

In the debate about the development of document management professionals and their current and future job opportunities, I find it strange that nobody ever mentions (as far as I remember) advanced capture and classification.
To narrow down the type of products and features I am referring to, here is a brief review, although this is familiar to everyone.

The early scanning systems controlled the scanner, displayed the image and let the operator enter metadata about the document, but with a closed structure and, at times, without any choice of document type or metadata fields.

Gradually, flexibility was introduced to enter various document types with different definitions and, moreover, with the introduction of OCR we could automate the entry of that metadata.

The introduction of OCR (and "specializations" such as OMR, ICR, ...) allowed, on the one hand, the automatic capture of metadata (by looking at fixed positions in the document) and, on the other, document classification (based on words or logos). But all of this applied only to structured documents, i.e. forms with a fixed structure and fixed positions.

Once it became possible to process semi-structured documents (such as invoices or payslips), whose structure is more flexible than that of forms (but not fully flexible), current systems went on to offer more advanced features such as:
  • Processing of unstructured documents that differ from one another (legal documents, contracts, correspondence, etc.).
  • No need to separate documents with blank pages or barcodes; the system automatically recognizes which pages belong to each document.
  • Mixing documents of different types in the same batch.
  • Handling fully handwritten documents (not only forms filled in by hand).
  • Automatically classifying documents based on their content, overall appearance or specified rules (e.g. contains the word "contract" in the top half of the first page).
  • Extracting metadata not only by position (coordinates) on the page but also by other criteria such as regular expressions, location markers, comparison with a database, etc.
  • File integrity checks (Are all the necessary documents present? Are all metadata fields completed? Is field X the same in all documents? Do the Y fields add up to amount Z?).
  • Finally, placing the document in the institution's document management system, with the designated document type and the required metadata assigned.
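To make three of these features concrete (rule-based classification, metadata extraction with regular expressions, and integrity checks), here is a minimal Python sketch. It does not reflect how any of the products mentioned in this post work internally. The `Page` representation, the patterns and the field names are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    # OCR text lines, top to bottom (hypothetical representation of a page)
    lines: List[str]


def top_half_contains(page: Page, word: str) -> bool:
    """Rule from the list: the word appears in the top half of the page."""
    half = (len(page.lines) + 1) // 2
    return any(word.lower() in line.lower() for line in page.lines[:half])


def classify(first_page: Page) -> str:
    """Classify a document from its first page using keyword rules."""
    for doc_type in ("contract", "invoice"):  # hypothetical taxonomy
        if top_half_contains(first_page, doc_type):
            return doc_type
    return "unknown"


# Hypothetical patterns; real documents need patterns tuned to their layout.
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")
TOTAL_RE = re.compile(r"Total\s*:?\s*([0-9]+(?:\.[0-9]{2})?)", re.IGNORECASE)


def extract_metadata(page: Page) -> Dict[str, object]:
    """Extract metadata by regular expression instead of fixed coordinates."""
    text = "\n".join(page.lines)
    meta: Dict[str, object] = {}
    date_match = DATE_RE.search(text)
    if date_match:
        meta["date"] = date_match.group(1)
    total_match = TOTAL_RE.search(text)
    if total_match:
        meta["total"] = float(total_match.group(1))
    return meta


def missing_fields(meta: Dict[str, object], required: List[str]) -> List[str]:
    """Integrity check: which required metadata fields are not completed?"""
    return [name for name in required if name not in meta]


def amounts_add_up(parts: List[float], total: float, tolerance: float = 0.005) -> bool:
    """Integrity check: do the Y fields add up to amount Z?"""
    return abs(sum(parts) - total) <= tolerance


page = Page(["SERVICE CONTRACT", "Date: 26/11/2012", "Between A and B", "Total: 1500.00"])
doc_type = classify(page)
meta = extract_metadata(page)
incomplete = missing_fields(meta, ["date", "total", "customer"])
```

In this sketch the rules and patterns are plain data and small functions, which is the part a documentation specialist would define; the surrounding code stays the same when the taxonomy changes.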


Products that offer these capabilities to varying degrees include:
  • Kofax KTM
  • EMC Captiva
  • ABBYY FlexiCapture
  • A2iA Document Reader 
  • IBM Datacap 
  • Ipsa Lectern 
  • Brainware Distiller 

That is, we are talking about systems that cover the following functions:
  • Receiving a "bunch of papers" 
  • Analyzing the content
  • Separating the pages into documents
  • Classifying each set of pages as a document of a given type
  • Extracting information from different parts of the document itself 
  • Creating the document entry with the extracted data to insert it into a document repository. 
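The functions listed above can be read as a pipeline. The following Python sketch shows one way to chain them, assuming each page is already available as OCR text. The function names, the "first page" rule and the toy classification and extraction rules are hypothetical and exist only to show the flow; they are not taken from any of the products listed.

```python
from typing import Callable, Dict, List

Page = str  # OCR text of one page (hypothetical simplification)


def split_into_documents(
    pages: List[Page], starts_document: Callable[[Page], bool]
) -> List[List[Page]]:
    """Group a batch of pages into documents without blank-page separators.

    A new document starts whenever a page looks like a first page.
    """
    documents: List[List[Page]] = []
    for page in pages:
        if starts_document(page) or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def process_batch(
    pages: List[Page],
    starts_document: Callable[[Page], bool],
    classify: Callable[[List[Page]], str],
    extract: Callable[[str, List[Page]], Dict[str, str]],
) -> List[dict]:
    """Separate, classify and extract, producing one repository entry per document."""
    entries = []
    for doc_pages in split_into_documents(pages, starts_document):
        doc_type = classify(doc_pages)
        entries.append(
            {
                "type": doc_type,
                "pages": len(doc_pages),
                "metadata": extract(doc_type, doc_pages),
            }
        )
    return entries


# Toy rules standing in for the trained or configured ones.
def starts_document(page: Page) -> bool:
    return page.startswith(("INVOICE", "CONTRACT"))


def classify(doc_pages: List[Page]) -> str:
    return doc_pages[0].split()[0].lower()


def extract(doc_type: str, doc_pages: List[Page]) -> Dict[str, str]:
    return {"first_line": doc_pages[0].splitlines()[0]}


batch = [
    "INVOICE 2012-001\nTotal: 100",
    "CONTRACT for services",
    "page 2 of the contract",
]
entries = process_batch(batch, starts_document, classify, extract)
```

The three callables passed to `process_batch` are where the document management knowledge lives; the loop around them is generic.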

What would we call that position if the work were performed by a human?

In many projects, the training and configuration of this type of software is carried out by technicians with a technology background and minimal document management knowledge. I think that training these "automatic document management systems" would fit perfectly into the profile and expertise of a documentation specialist (supported, in some complex cases, by a development team that codes operations which can then be parameterized or trained).

The job is to define and systematize a taxonomy, provide criteria for extracting metadata, define when a record or file is complete and, in short, give guidelines to a "documentation apprentice" so that it can do its job.

I think the situation is similar to what happened with accounting applications: the accountant no longer "counts" but instead customizes the accounting program (or even helps design one). Similarly, document management professionals would define the rules and customize the document capture program.
Additionally, there is a second phase: although high success rates can be achieved (80%-90%), there are always documents for which classification or data capture fails. In that case, a person must review the document, decide what type it is and whether to include it as an example for future entries. Data extraction may require more or less expertise depending on the document and the training process, but identifying the type of document requires more knowledge, and again it seems ideal work for a document management professional.
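This second phase can be pictured as a review queue. The Python sketch below assumes the capture system reports a confidence score for each classification; the `ReviewQueue` class, the 0.8 cut-off and the document identifiers are hypothetical and do not describe any specific product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ReviewQueue:
    # Hypothetical cut-off below which a person must review the document
    threshold: float = 0.8
    # Documents waiting for review, as (doc_id, guessed_type) pairs
    pending: List[Tuple[str, str]] = field(default_factory=list)
    # Reviewed documents kept as examples for future training, by type
    training_examples: Dict[str, List[str]] = field(default_factory=dict)

    def route(self, doc_id: str, guessed_type: str, confidence: float) -> str:
        """Accept confident classifications; queue the rest for a person."""
        if confidence >= self.threshold:
            return "accepted"
        self.pending.append((doc_id, guessed_type))
        return "needs_review"

    def resolve(self, doc_id: str, correct_type: str, keep_as_example: bool) -> None:
        """Record the reviewer's decision and optionally keep the document as an example."""
        self.pending = [(d, t) for d, t in self.pending if d != doc_id]
        if keep_as_example:
            self.training_examples.setdefault(correct_type, []).append(doc_id)


queue = ReviewQueue()
first = queue.route("doc-1", "invoice", 0.95)
second = queue.route("doc-2", "invoice", 0.40)
queue.resolve("doc-2", "contract", keep_as_example=True)
```

The decision taken in `resolve` (the correct type, and whether the document is a good example) is exactly the judgment the text assigns to a document management professional.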

This "looking away" attitude is not one-sided: manufacturers and companies that work with these products also set their sights on computer professionals rather than on document management experts.

What is the origin of this "divorce"? Ignorance? Does it seem like a wasted job opportunity for these DM professionals? Do you consider that this is the right profile?


I think both sides need to move closer together: document professionals could work in an area they barely cover today and where they have a lot to contribute, and companies would gain experts in document analysis and classification who may be best suited to train their systems.

For more information: http://code.google.com/p/openprodoc/
