Monday, March 11, 2013

Athento’s Text Analyzer versus the Stanbol Semantic Engine

“Let’s test how accurate Athento’s Text Analyzer is compared to Apache Stanbol, which is used by suppliers of document management systems such as Nuxeo.”

Over the past year, semantics has been gaining ground in document management. Its value lies in the ability to build relationships between documents, which in turn opens paths toward the construction of knowledge. Put another way, it means transforming unstructured data into information that you can use to understand a system, the business, a project, processes, and so on.

At Yerbabuena, we’ve been talking about semantics and intelligent technologies for two years, and these days other DMS providers are beginning to give these technologies more and more importance. Nuxeo is one of the companies that has taken an important step with its “Semantic Entities” package, which uses the Apache Stanbol semantic engine to find the names of people, places and organizations, and then associates them with their respective entries in DBpedia. This Nuxeo plug-in displays each entity found with an image (for example, a photo of the person named, or a flag) and provides a link to all the documents in which that entity appears.




Athento contains a similar semantic module, though a more advanced one: it can find any term deemed important within the context of a document’s text, and convert those terms into tags that let us relate documents which share a theme or have information in common. (See the example of auto-tagging in action, as it manages résumés.)

We wanted to test the accuracy of Athento’s text analyzer against the Stanbol semantic engine when extracting data from documents. To do that, we uploaded the same PDF document to a Nuxeo and Stanbol configuration, and to a Nuxeo and Athento configuration. The document is called “Seis Pasos Para Liberar A Mi Empresa Del Papel” [Six Steps to Free My Company from Paper]; it is in Spanish, and many of our Spanish readers have already read it as an entry on our Spanish blog. The document talks about digitalization. For most people, the most relevant terms in the text would center on these concepts:



Document management
project
costs
digitalization
investment
business
paper
expenses
documents
capture
benefits
Software
Hardware
OCR
scanner
information
digital
distributed
documentation
extract

We wanted to see how effective both Stanbol and Athento would be in extracting these words from the text, and, to be honest, the results were fairly surprising:

Words identified by Stanbol: “como” [as] and “espaa”.

We assumed that the first term came up because of the number of times it appears in the text, and that the second ought to be “España” [Spain], but because of some character-encoding issue the system only extracted “espaa”.
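We don’t know exactly where the “ñ” was lost, but one plausible mechanism is a processing step that silently discards every non-ASCII character. The sketch below is only an illustration of that assumption; the function name `strip_non_ascii` is ours and is not part of Stanbol or Nuxeo.

```python
def strip_non_ascii(text: str) -> str:
    """Lowercase the text and silently drop every non-ASCII character.

    Hypothetical helper: it mimics a lossy step that we assume, but
    have not confirmed, sits somewhere in the extraction pipeline.
    """
    return text.lower().encode("ascii", errors="ignore").decode("ascii")


# "\u00f1" is the precomposed letter "ñ".
word = "Espa\u00f1a"
print(strip_non_ascii(word))  # prints: espaa
```

If something like this is happening, any Spanish term containing an accented letter or “ñ” would come out damaged in the same way.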

Words identified by Athento: 77


Keyword | Found by Stanbol? | Found by Athento?
--- | --- | ---
document management | no | yes
digitalization | no | yes
paper | no | yes
documents | no | yes
software | no | no
scanner | no | yes
distributed | no | yes
project | no | yes
investment | no | no
expenses | no | no
capture | no | yes
hardware | no | no
information | no | yes
documentation | no | yes
costs | no | no
business | no | yes
ICR | no | no
benefits | no | yes
OCR | no | no
digital | no | yes
extract | no | yes

Athento’s hit rate: 61.9%

Stanbol’s hit rate: 0%
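For readers who want to check the arithmetic, a hit rate here is simply the number of reference keywords found divided by the total number of reference keywords. The snippet below is a minimal sketch of that calculation; the `hit_rate` function and the `results` dictionary are our own illustration, filled in with the rows exactly as they appear in the table above.

```python
def hit_rate(found_flags):
    """Return the percentage of reference keywords that were found."""
    flags = list(found_flags)
    if not flags:
        raise ValueError("need at least one keyword")
    return 100.0 * sum(flags) / len(flags)


# keyword -> (found by Stanbol?, found by Athento?), copied from the table
results = {
    "document management": (False, True),
    "digitalization": (False, True),
    "paper": (False, True),
    "documents": (False, True),
    "software": (False, False),
    "scanner": (False, True),
    "distributed": (False, True),
    "project": (False, True),
    "investment": (False, False),
    "expenses": (False, False),
    "capture": (False, True),
    "hardware": (False, False),
    "information": (False, True),
    "documentation": (False, True),
    "costs": (False, False),
    "business": (False, True),
    "ICR": (False, False),
    "benefits": (False, True),
    "OCR": (False, False),
    "digital": (False, True),
    "extract": (False, True),
}

stanbol = hit_rate(s for s, _ in results.values())
athento = hit_rate(a for _, a in results.values())
print(f"Stanbol: {stanbol:.1f}%")
print(f"Athento: {athento:.1f}%")
```

With the rows as listed, this counts 14 of 21 keywords for Athento (about 66.7%), while the 61.9% reported above corresponds to 13 of 21, so one row in the table may differ from the original tally. Either way, the gap with Stanbol’s 0% is unchanged.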



