Tuesday, December 18, 2012

Autotagging, giving context to our documents

In a previous post we talked about Autotagging, the Athento plugin for the automatic labeling of documents, and how it uses semantics to find key terms in documents and turn them into labels that connect them to other documents containing the same terms.

Semantic technology is based on ontologies, which are nothing more than sets of concepts and the relationships between them (a structured collection of information) that we use to describe things (domains).
Basically, what Autotagging does is place marks (labels) on documents that correspond to categories included in an ontology, in order to improve access to and retrieval of documentation (search).
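
To make the idea concrete, here is a minimal sketch of tagging against an ontology. It is not Athento's implementation: the toy ontology, the function name `propose_tags` and the simple word matching are all assumptions made for illustration.

```python
import re

# Hypothetical toy ontology: each category maps to the terms that signal it.
# A real ontology would also describe relationships between concepts.
ONTOLOGY = {
    "Programming languages": {"python", "java"},
    "Databases": {"sql", "postgresql"},
}

def propose_tags(text, ontology):
    """Return the sorted ontology categories whose terms appear in the text."""
    # Lowercase the text and split it into simple word tokens.
    words = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    # A category is proposed when at least one of its terms is present.
    return sorted(cat for cat, terms in ontology.items() if words & terms)

print(propose_tags("Five years of Python and SQL experience.", ONTOLOGY))
# prints ['Databases', 'Programming languages']
```

A document that mentions none of the ontology's terms simply gets no labels, which is the behavior we want when the ontology describes a narrow domain.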

As I have said before, the power lies in identifying which terms are relevant and deciding whether they are worth converting into tags. With very broad ontologies or generic terms, the result might not be entirely relevant to our context. As we saw in the resume example, one proposed tag was "Europass", but maybe what we really wanted to know was which candidates had certain knowledge of computer technology, so "Europass" told us nothing. It is therefore necessary to describe a specific domain, or a narrow part of one.

This is precisely one of the improvements made in the latest version of Autotagging. It is now possible to ask Athento to label documents with the categories defined in our own ontology; we only need to load the new ontology in the manager.

In addition, it is now possible to send label proposals to a "blacklist", which tells Athento not to use those terms as labels. It is also possible to define "synonyms" that are replaced by a single term when they are found, so that terms meaning the same thing are kept under the same category.
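
The blacklist and the synonyms can be pictured as two small filters applied to the label proposals. The sketch below is only an illustration under assumed names (`BLACKLIST`, `SYNONYMS`, `filter_proposals`); in particular, resolving synonyms before checking the blacklist is my assumption, not documented Athento behavior.

```python
# Hypothetical configuration, for illustration only.
BLACKLIST = {"europass"}
SYNONYMS = {"js": "javascript", "ecmascript": "javascript"}

def filter_proposals(proposals, blacklist, synonyms):
    """Normalize proposed labels, merge synonyms and drop blacklisted terms."""
    result = []
    for term in proposals:
        key = term.strip().lower()
        # Replace a synonym with its preferred term (assumed to happen first).
        key = synonyms.get(key, key)
        # Skip blacklisted terms and terms that were already accepted.
        if key in blacklist or key in result:
            continue
        result.append(key)
    return result

print(filter_proposals(["Europass", "JS", "JavaScript", "Python"], BLACKLIST, SYNONYMS))
# prints ['javascript', 'python']
```

Here "Europass" is discarded, while "JS" and "JavaScript" end up under the same label.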

Last but not least, another improvement is the possibility of building document types from labels. A document is recognized as a certain type when it has been tagged with a particular set of labels. This new way of categorizing documents gives us a new way of navigating them virtually: although the documents are stored in different locations, we see them grouped under the same document type.
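
As a rough sketch of how label-based document types could work, assume a type is just a required set of labels and a document matches when it carries all of them. The type names, paths and helper functions below are hypothetical and are not taken from Athento.

```python
# Hypothetical document types: each one is defined by a required set of labels.
DOCUMENT_TYPES = {
    "Developer resume": {"resume", "programming"},
    "Invoice": {"invoice", "vat"},
}

def recognize_types(labels, document_types):
    """Return the sorted types whose required labels are all present."""
    labels = set(labels)
    return sorted(name for name, required in document_types.items()
                  if required <= labels)

def group_by_type(documents, document_types):
    """Group document paths by recognized type, wherever they are stored."""
    groups = {}
    for path, labels in documents.items():
        for name in recognize_types(labels, document_types):
            groups.setdefault(name, []).append(path)
    return groups

# Made-up documents stored in different locations, with their labels.
documents = {
    "/hr/2012/cv_ana.pdf": {"resume", "programming", "python"},
    "/inbox/scan_17.pdf": {"resume", "programming"},
    "/finance/inv_001.pdf": {"invoice", "vat"},
}
print(group_by_type(documents, DOCUMENT_TYPES))
```

The two resumes live in different folders, yet both appear under "Developer resume", which is the kind of virtual navigation described above.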
