In previous posts, we’ve told you about some of the new document capture features of Athento, such as being able to search by facets.
Today, we’d like to share more about the new functionality of our document capture software: the simple handling of regular expressions.
In this context, regular expressions are words, numbers or fragments of text that we know we’re always going to come across in certain types of documents. They can help us find the data that we need to extract, or even give the software clues about what kind of document it’s handling.
Let’s look at an example: We’ve got a pile of invoices that have been put through digital imaging and we need to get the “total payment” figure. This piece of metadata has several distinguishing characteristics:
- The figure always appears to the right of the expression “Total Payment”
- It’s made up of one complete word
- It can contain up to 30 digits
Defining these characteristics in Athento is easy. All we have to do is set a few parameters:
- Metadata: the name of the Metadata that you want to extract (for example, “Total Payment”).
- Expression: The fragment of text, number, word(s), etc., to be looked for within the text, as a reference for the extraction of metadata. In the example above: “Total Payment”.
- Position: Where the data to be extracted is located, relative to the expression. In the case of these invoices, the Total Payment figure is always going to be to the right of the expression “Total Payment”.
- Number of words: The number of words that make up the data we want to extract, which could be one or more. In the case of the Total Payment, the figure is just one word.
- Maximum length: The maximum number of digits or characters in the word or words that we want to extract. In this case, there would be up to 30 characters.
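To make these parameters a little more concrete, here’s a minimal Python sketch of how a rule like this could be applied to OCR text. This is not Athento’s actual implementation or API: the `ExtractionRule` class, the `extract` function and the sample invoice text are all hypothetical, and we’re assuming that the value sits on the same line as the expression, that an optional colon may follow the expression, and that a value longer than the maximum length is simply rejected.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionRule:
    """Hypothetical container mirroring the five parameters described above."""
    metadata: str     # name of the metadata to extract
    expression: str   # reference text to look for
    position: str     # where the value sits relative to the expression
    num_words: int    # how many words make up the value
    max_length: int   # maximum number of characters in the value


def extract(text: str, rule: ExtractionRule) -> Optional[str]:
    """Return the value found next to the expression, or None if nothing fits."""
    if rule.position != "right":
        # Assumption: this sketch only covers the "right" case from the example.
        raise NotImplementedError("only position='right' is handled here")

    # A "word" is any run of non-whitespace characters; extra words must be
    # separated by spaces or tabs so the match stays on the same line.
    word = r"\S+"
    value = word + (r"[ \t]+" + word) * (rule.num_words - 1)
    pattern = re.escape(rule.expression) + r"[ \t]*:?[ \t]*(" + value + ")"

    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None

    candidate = match.group(1)
    # Assumption: values longer than the maximum length are discarded.
    if len(candidate) > rule.max_length:
        return None
    return candidate


# Made-up OCR output, for illustration only.
sample = "Invoice 2041\nSubtotal 1,100.00\nTotal Payment: 1,234.56 EUR\n"

rule = ExtractionRule(
    metadata="Total Payment",
    expression="Total Payment",
    position="right",
    num_words=1,
    max_length=30,
)

print(extract(sample, rule))  # prints 1,234.56
```

Setting `num_words` to 2 in this sketch would also pick up the word after the figure (“1,234.56 EUR” in the sample), which is why the number of words and the maximum length work together to keep the extracted value tight.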
These parameters are configured in Athento so that the system can process the text and find the information we’re looking for. Alongside this method, there are other mechanisms, such as the definition of templates (a graphical description of the coordinates where OCR should look for metadata), which was already available in previous versions of Athento. In 36 seconds, we’ll see how to configure these regular expressions: