If you’re looking for a solution that automates the classification, storage and/or routing of scanned documents, it’s important to understand the main methods of document classification, which currently include:
- Symbolic classification: This is the most primitive method. It’s called “symbolic” because, in reality, someone has to identify the document before it’s uploaded to the capture system or document manager. The most obvious example of this method is the use of bar codes. Capture software or document management software reads the bar code and then assigns the document to a document type, such as “Service Contract”, but someone must have previously generated the bar code that indicates the document belongs to that type. Some ERPs can generate bar codes within the documents they produce, but businesses don’t work only with documents that they themselves create. For all other documents, this method requires a fair amount of manual labor.
- Analysis of the graphical structure of the document: This method classifies documents according to their appearance. It compares a document with a model (or models) learned by the system. In a sense, this classification works much as a human would, by determining what a document “looks” like to decide what type of document it is. For document identification to work, patterns are defined and the system is trained to recognize them with increasing precision. Many of these algorithms rely on features such as patterns of colors, black and white tones, and document layout.
- Analysis of the graphical structure of the document, together with key words: In addition to the techniques described in the previous item, this method looks for key words that are indicative of a document type. For example: after analyzing the graphical structure, the system assigns a high probability that the document is an invoice, which makes it look for words like “Invoice” or “Tax Number”. This method is more precise than simply comparing the structure of the document. All of these mechanisms are based on statistical algorithms that estimate the probability that a document belongs to a specific type.
- Analysis and text processing: This involves analyzing the text to find terms whose meanings help to describe the document that contains them. Decision trees, support vector machines, Bayesian algorithms and “nearest neighbor” techniques are some of the methods used to extract relevant information from documents. These methods define classification schemata based on the idea that documents can be represented as vectors of characteristics (a group of characteristics that define the document, together with their relative importance) according to the words that appear in them. Each element within a vector represents the importance, or relative weight, of one characteristic of a document. Characteristics are simply words or groups of words taken from a group of documents that belong to a category. Using probability-based methods, the idea is to fit documents into the classification schemata according to the information provided in the vectors.
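To make the last method more concrete, here is a minimal sketch of a Bayesian text classifier in Python. It is only an illustration under stated assumptions: the training texts, the two categories and the function names (`tokenize`, `feature_vector`, `train`, `classify`) are invented for this example, and real capture software would work on OCR output and far larger training sets.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase words, dropping surrounding punctuation."""
    words = (w.strip(".,:;()").lower() for w in text.split())
    return [w for w in words if w]

def train(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = sorted({w for counts in word_counts.values() for w in counts})
    return word_counts, doc_counts, vocabulary

def feature_vector(text, vocabulary):
    """Represent a document as a vector of word counts over the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def classify(text, model):
    """Return the most probable category (naive Bayes, add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    vector = feature_vector(text, vocabulary)
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Start from the prior: how common this category was in training.
        score = math.log(doc_counts[label] / total_docs)
        for word, count in zip(vocabulary, vector):
            if count:
                likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += count * math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data, invented for illustration only.
TRAINING = [
    ("invoice number total amount due tax number", "invoice"),
    ("invoice date payment terms amount tax", "invoice"),
    ("service contract between the parties term and termination", "contract"),
    ("this contract agreement is signed by both parties", "contract"),
]

model = train(TRAINING)
print(classify("Tax number and total amount", model))
print(classify("Signed by both parties to the service agreement", model))
```

Each word in the vocabulary is one characteristic, and its count in the document plays the role of the weight described above. Words the classifier has never seen are simply ignored.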
There isn’t one specific solution that solves all of the problems of document classification. Powerful classification software combines several of these methods to achieve greater precision when classifying documents. Anyone who’s thinking of getting advanced document capture software should learn as much as possible about the mechanisms the system uses, because the simpler those mechanisms are, the more human help the software will need to do the job.
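As a rough illustration of how two of these methods might be combined, the sketch below takes the probabilities produced by a layout analysis and adjusts them according to the key words found in the text. Everything here is an assumption made for the example: the key word lists, the `boost` factor and the weighting formula are a simple heuristic, not the approach of any particular product, and the layout probabilities are supplied by hand rather than computed from an image.

```python
# Hypothetical key words per document type, invented for illustration.
KEYWORDS = {
    "invoice": ["invoice", "tax number", "total due"],
    "contract": ["contract", "parties", "termination"],
}

def keyword_fraction(text, keywords):
    """Fraction of the given key words that appear in the text."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword in lowered)
    return hits / len(keywords)

def combine(layout_probs, text, boost=2.0):
    """Reweight layout probabilities by key word evidence, then normalize."""
    raw = {}
    for doc_type, prob in layout_probs.items():
        fraction = keyword_fraction(text, KEYWORDS.get(doc_type, []))
        raw[doc_type] = prob * (1.0 + boost * fraction)
    total = sum(raw.values())
    return {doc_type: value / total for doc_type, value in raw.items()}

# Assumed output of a layout classifier: the page looks somewhat like an invoice.
layout_probs = {"invoice": 0.6, "contract": 0.4}
scanned_text = "INVOICE No. 123 Tax Number: ES-0000 Amount payable: 250.00"

combined = combine(layout_probs, scanned_text)
best_type = max(combined, key=combined.get)
print(best_type, round(combined[best_type], 2))
```

When the layout analysis alone is ambiguous, the key words can tip the decision one way or the other, which is the kind of gain in precision that comes from combining methods.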