Beyond the specific document types that we choose to work with in our business (government form number XXX, complaint forms, service termination forms, correspondence, etc.) and the everyday types used across all industries (invoices, purchase orders), there are some general categories of documents which are important to keep in mind in any digital document data extraction project. Let’s not forget that data extraction is used to fill in metadata automatically or to feed data into other business software (such as ERP or accounting software), and that it’s a job for document capture software.
1. Structured Documents
These are the types of documents in which we know both what information we’re going to find and where that information sits within the physical dimensions of the document (its coordinates). Examples include general forms, government forms, passports, and identity cards (national cards, citizenship cards and any other form of identification). Spanish identity cards, for instance, all contain the same information, and that information is always located in the same place. With this type of document, it’s easier to find and extract data because we know where to look for it. In this video about our capture product, we can see how, with this type of document, users themselves can define templates for data extraction, intuitively showing Athento’s OCR which data to extract and the coordinates where they can be found.
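To make the idea of a coordinate template concrete, here is a minimal sketch in Python. It is not how Athento implements templates; it only assumes that an OCR engine has already returned each word with its bounding box. The names `Word`, `Zone`, `extract_fields` and `ID_CARD_TEMPLATE`, as well as every coordinate and sample value, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word with its bounding box (hypothetical page units)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass(frozen=True)
class Zone:
    """A rectangular region of the page where a field is expected."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        # A word belongs to the zone if its centre point falls inside it.
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        return self.left <= cx <= self.right and self.top <= cy <= self.bottom


def extract_fields(words, template):
    """Return {field name: text found inside that field's zone}."""
    result = {}
    for field, zone in template.items():
        hits = [w for w in words if zone.contains(w)]
        hits.sort(key=lambda w: (w.y, w.x))  # rough reading order
        result[field] = " ".join(w.text for w in hits)
    return result


# Hypothetical template: the coordinates are invented for illustration.
ID_CARD_TEMPLATE = {
    "surname": Zone(100, 40, 300, 60),
    "given_name": Zone(100, 70, 300, 90),
    "id_number": Zone(20, 150, 160, 175),
}

# Hypothetical OCR output for one card.
SAMPLE_WORDS = [
    Word("GARCIA", 105, 42, 60, 14),
    Word("LOPEZ", 170, 42, 50, 14),
    Word("MARIA", 105, 72, 55, 14),
    Word("12345678Z", 25, 152, 100, 18),
    Word("ESPAÑA", 200, 10, 70, 14),  # outside every zone, so ignored
]

if __name__ == "__main__":
    print(extract_fields(SAMPLE_WORDS, ID_CARD_TEMPLATE))
```

Because every card of the same type shares one layout, a single template like this can be reused for every scan, provided the images are aligned and scaled consistently.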
2. Semi-Structured Documents
The difficulty starts to increase as the information becomes less structured. Semi-structured documents are those in which we know what information we’re going to find; we just don’t know exactly where we’re going to find it. The classic example of this type of document is the invoice. An invoice has to include a tax registration number, an amount for sales tax, a total amount and client data, among other things. Regardless of who (or what company) it comes from, we know that an invoice is going to include this information, but nobody can guarantee that all of our suppliers will put that information in the same location. The supplier’s tax registration number could be on the right-hand side of the document, but it could just as well be on the left. In short, we know what we’re looking for; we just don’t know where to find it.
Data extraction from semi-structured documents can’t work in the same way as in the previous case; here, we have to show the software what the information we’re looking for looks like. For example, the words “Value Added Tax” or “Sales Tax” are going to appear before the actual amount of the tax itself. More advanced applications dedicated to data extraction and capture, such as Athento Capture, don’t just look for regular expressions within the text: they also try to put the information being searched for in context, relating it to the smaller structures within a document, such as tables, images, paragraphs, etc.
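The label-before-value idea can be sketched with plain regular expressions. The following Python example is an illustration only, not the way Athento Capture works: it covers the simplest approach described above (search for a known label, then read the amount that follows it on the same line) and does none of the contextual analysis of tables or paragraphs. The label lists, the amount format and the sample invoices are all assumptions.

```python
import re
from decimal import Decimal

# Assumed amount format: optional thousands separators and two decimals,
# e.g. "1,210.00" or "605.00". Real invoices vary by locale.
AMOUNT = r"(\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2})"

# Hypothetical labels that may precede each value.
FIELD_LABELS = {
    "tax": r"\b(?:Value Added Tax|Sales Tax|VAT)\b",
    "total": r"\b(?:Total Amount|Total Due|Total)\b",
}


def extract_amounts(text):
    """Find each labelled amount wherever it appears in the text."""
    found = {}
    for field, label in FIELD_LABELS.items():
        # Label, then any non-digit filler on the same line, then the amount.
        pattern = re.compile(label + r"[^\d\n]*" + AMOUNT, re.IGNORECASE)
        match = pattern.search(text)
        if match:
            found[field] = Decimal(match.group(1).replace(",", ""))
    return found


# Two invented suppliers with different layouts and wording.
INVOICE_A = """ACME Supplies
Subtotal: 1,000.00
Sales Tax: 210.00
Total Amount: 1,210.00
"""

INVOICE_B = """Total due    605.00
Value Added Tax    105.00
"""

if __name__ == "__main__":
    print(extract_amounts(INVOICE_A))
    print(extract_amounts(INVOICE_B))
```

The same function handles both layouts because it depends on the label rather than on the position. The word boundary in front of each label keeps "Subtotal" from being mistaken for "Total". A line that puts another number between the label and the amount, such as a tax rate, would defeat this simple pattern, which is one reason the more advanced tools look at context as well.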
3. Unstructured Documents
With this type of document, we don’t know what we’re looking for, much less where to find it. This is the most difficult type of document from which to extract data. Reports, letters, etc. belong to this group. I should clarify that some writers don’t recognize the second group (semi-structured) at all, and place documents like invoices in this group instead.
Specifically, according to AIIM, an unstructured document meets three requirements:
- The structure of the document hasn’t been designed by the business which now wants to manage it (i.e. it’s an external document);
- The structure of these documents can vary, depending on who’s sending them (e.g. every company that sends an invoice has its own format);
- They can’t be processed by adhering to one particular template.
As you’ll have noticed, these points are better suited to the description of semi-structured documents which I’ve given you. Nonetheless, one has to consider the existence of other documents which have no clear structure at all, but which are still important for a business, such as letters of complaint, general correspondence, reports, etc. These days, extracting data from these types of documents is an arduous process which, in most cases, requires custom development and a study of the documents to try to identify some kind of structure in, or understanding of, the data which you want to extract. It also requires more training of the system, applied consistently, so that it can recognize the greatest possible number of regular expressions. Maybe that’s the reason why documents of this type aren’t frequently included in data extraction projects: unstructured documents are usually ignored.