Monday, September 30, 2013

Athento releases the 2.3.32 version of its software and moves towards version 3.0

Silicon Valley, September 26, 2013. Athento, the smart document management software, is today releasing version 2.3.32 as it moves toward version 3.0. This new release includes bug fixes and new functionality.

Just over two weeks ago, Athento launched its version 2.0 beta as an “open beta” version. The aim of that release was to increase awareness of the product and to obtain feedback from a large number of users regarding the design and functionality of the product. Athento’s “release to manufacturing” version should be ready this coming November.

The 2.0 beta version of Athento includes various service packs, with software updates, bug fixes and performance improvements. The first of these service packs was released last week under the release number 2.3.32 and can be downloaded from the Athento web page.

This new release corrects some minor bugs and adds functionality, such as the configuration of a CMIS path to download documents from repositories that support the CMIS protocol, like Alfresco or SharePoint. It also includes a folder monitoring system that uses the same protocol, and the Athento Scheduler, which lets users configure folder monitoring tasks using Hot Folders. (Release notes)

User feedback has been positive and highlights the ease with which the application can be brought into a workplace environment.

“We hope to carry out agile development of this product, which has been sustained by the experience of users of capture and document management,” says José Luis de la Rosa, the CEO of Yerbabuena Software Inc., the company that created Athento.

Looking ahead, the next RTM version, due for release in November, will surprise the users of Alfresco, Salesforce and Box. In the case of Box, an updated integration already exists, but the new version will go beyond the semantic auto-tagging feature, permitting the uploading of documents from this file-sharing application.

About Yerbabuena Software, Inc.:
Yerbabuena Software is made up of a large group of document management software experts and currently has offices in Spain and Silicon Valley, California, in addition to important partnership agreements in countries such as Spain, Argentina, Chile, Peru, Colombia and Mexico. Its Athento product is in charge of managing documents in businesses such as the DIA group, BNP Paribas or Leroy Merlin.


Friday, September 27, 2013

The market for collaboration on documents

This past July, Forrester published a report on the state of file-sync and file-sharing applications. According to this report, the share of information workers worldwide who use file-sync and file-sharing applications has increased from 5% in 2010 to 25% today. There’s no doubt that this topic deserves the attention of all of us who dedicate ourselves to the document management market. In fact, a few weeks ago, we started our own study which, with your help, we’ll be wrapping up shortly.

This study also gave us other interesting facts which we’d like to share with you today:   

Determining factors that influence which of these services is picked:

  • Ease of use
  • Trustworthiness
  • Relationship with the provider
  • Price
  • Security

Main concerns of businesses concerning these products:

  • Modality of the service (on-premise or cloud)
  • Waiting for the large suppliers to offer the same functionality, or going provisionally with other providers (smaller, but with more functionality), or not knowing what to do
  • Costs vs. functionality

Main players in the market:

  • Box  (leader)
  • EMC (leader)
  • IBM  (leader)
  • Alfresco (performs well)
  • WatchDox (performs well)
  • Microsoft (performs well)
  • AirWatch (performs well)
  • Accellion (performs well)
  • Dropbox (popular with consumers)
  • Google Drive (popular with consumers)

Meanwhile, we will continue to find out how we can improve the experience and work for users of these applications. Don’t forget to collaborate and give us your feedback.


Thursday, September 26, 2013

Yerbabuena Software releases more than 20 plug-ins for Nuxeo, the Open Source platform for Enterprise Content Management

Silicon Valley, September 26, 2013.- Yerbabuena Software Inc., the business that created Athento, the smart capture and document management software, is today releasing more than twenty plug-ins that add functionality to Nuxeo.

The company has announced today that it will make available some twenty plug-ins which add to or improve the functionality of this software. Nuxeo is one of the most prominent open-source solutions currently on the Enterprise Content Management market, thanks to its clients in more than 145 countries.

Most of the modules being released by Yerbabuena Software provide capture functionality, such as the AFM plug-in for monitoring folders (Hot Folder), for mass, automatic uploading of documents from folders; or the OCR plug-in, which allows users to add an OCR engine to Nuxeo. These plug-ins are the result of the path taken by Yerbabuena, which now has Athento, its own document and capture product, yet still remains an active part of the community of Nuxeo developers.

In addition to capture features, these plug-ins provide Nuxeo with equally attractive functionality, such as integration with Google Docs, which allows users to edit documents stored in Nuxeo online using Google Drive. Several digital signature plug-ins have also been released, such as the one for integration with Realsec’s CryptoSign server, which permits digital signing with signatures that have been previously stored on a signature server.

The plug-ins released by Yerbabuena Software are available for different versions of Nuxeo and can be downloaded from the company’s repository.

“At Yerbabuena Software, we hold the community of Nuxeo developers, of which we’re a part, in very high esteem. With the release of these plug-ins, we’re trying to contribute to that community and to the thousands of Nuxeo users around the world who want to add functionality to their ECM platforms,” says José Luis de la Rosa, the CEO of Yerbabuena Software Inc.

Yerbabuena Software has an outstanding track record in software development and has become a partner of large ECM software suppliers such as Nuxeo and Alfresco. It is currently focusing on developing its own capture and document management product, which is compatible with the above-mentioned platforms and also works with SharePoint, OpenText and Documentum. Barely a week ago, Yerbabuena announced the launch of version 2.0 beta of Athento, which can be tried and downloaded from the product web site.

About Yerbabuena Software, Inc.:
Yerbabuena Software is made up of a large group of document management software experts and currently has offices in Spain and Silicon Valley, California, in addition to important partnership agreements in countries such as Spain, Argentina, Chile, Peru, Colombia and Mexico. Its Athento product is in charge of managing documents in businesses such as the DIA group, BNP Paribas or Leroy Merlin.


Wednesday, September 25, 2013

Extracting metadata by using bar codes or QR codes

Some businesses have decided to extract data from their documents by using bar codes or QR codes. As is the case with document classification, bar codes are one of the simplest technologies available to document capture software. With bar codes, capture applications don’t have to search for data within the content of the document, since it’s enough to “read” the information that is included in the code.

  • Advantages: Faster data extraction. What’s more, many business applications such as ERPs can generate bar codes each time that new documents are created, meaning that it might be easier to adopt this type of data extraction.

  • Disadvantages: It might be expensive if the documents to be handled are generated outside our business, because that means preparing the bar codes manually and sticking them to the documents before digitizing them.

As always, the recommendation is to analyze the documentation from which you want to get the data, as well as the amount of work that using bar codes or QR codes would involve.

For those businesses that work with codes, Athento is set up to handle them.

Working with bar codes and QR codes in Athento 

In order for Athento to extract the information from a code, it’s necessary to define the associated piece of metadata (or, said another way, to indicate which piece of data is the one that we want to extract). This is done during the creation of Models (“Model” menu > “Metadata”).

In Athento 2.0, metadata can be one of four types: text (which Athento looks for in document content by using regular expressions), image (a clipping of a portion of the original image), zonal text (Athento looks for text at certain coordinates), or codes (when a bar code or QR code has been previously placed on the document). The steps for a “code”-type metadata are straightforward:

  1. Define a “code”-type metadata: In (“Model” > “Metadata”), indicate the “code” metadata and give the metadata a name. It’s not necessary to indicate any other data for code-type metadata.

  2. Define the location of the bar code: As you’re defining the template, define where the bar code is located within the document (“Models” > “Define Template”). Defining the location of the code allows the system to rule out codes that are not to be read, and increases the precision of the reading.

There you go: Athento is ready to process bar codes or QR codes within your documents! :-)
Click here to see the formats of bar codes and QR codes that are supported by Athento.
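
To illustrate why defining the location of the code helps, here is a minimal Python sketch. It is not Athento's implementation: the `Region` and `DecodedCode` classes and the `read_code_metadata` function are hypothetical names, and we assume that some bar code reader has already decoded every code on the page and reported its bounding box. The sketch only shows the filtering idea: codes outside the area defined in the template are ruled out.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Rectangle in page coordinates (pixels), origin at the top-left corner."""
    x: int
    y: int
    width: int
    height: int

    def contains(self, other: "Region") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (
            other.x >= self.x
            and other.y >= self.y
            and other.x + other.width <= self.x + self.width
            and other.y + other.height <= self.y + self.height
        )


@dataclass
class DecodedCode:
    """A bar code or QR code already decoded by some reader (hypothetical)."""
    value: str
    box: Region


def read_code_metadata(codes, template_region):
    """Return the value of the first code inside the template region, or None."""
    for code in codes:
        if template_region.contains(code.box):
            return code.value
    return None


# Two codes were found on the page, but only one sits in the template area.
codes = [
    DecodedCode("SHIPPING-LABEL-77", Region(50, 900, 200, 80)),
    DecodedCode("INV-2013-0042", Region(420, 30, 150, 150)),
]
template = Region(400, 0, 200, 200)  # top-right corner of the page
invoice_number = read_code_metadata(codes, template)
print(invoice_number)
```

With the region in place, the shipping label at the bottom of the page is ignored and only the code in the expected area feeds the metadata.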

Try Athento 2.0 beta now!

Tuesday, September 24, 2013

Types of document imaging projects

There are basically two types of document imaging projects. The differences between the two are related to where the documents are going to go once they’ve been converted to digital format. Below, we’ll briefly explain the two classes:
  • Digitizing for long-term storage: Scanning documents which will be sent to a document repository for their long-term preservation. The main idea behind these projects is digital storage, whether for the accessibility of the information or for its preservation. With this class of document imaging projects, the documents involved might be old documents or those which are worked with on a day-to-day basis. These are the most basic type of document imaging projects and their return on investment isn’t always clear; but they’re usually taken on because paper-based documents are difficult to consult, or out of a desire to eliminate the costs of storing them. These processes don't normally require data extraction, since activity is mostly focused on converting documents to a digital format and storing them. That said, they might at least include the use of OCR to make it easier to search for information in the future. One example: a library wants to digitize its collection, or the staff of a hospital wants to digitize old patient histories, while it’s probably going to keep working with paper.

  • Digitizing business processes: These projects have a pretty clear return on investment for businesses because they mean automating tasks which are normally carried out by people. Aside from converting documents to digital format, these projects seek to obtain information from documents so that it can be used to feed different business processes. Take travel allowances as an example. The invoice for a train ticket is scanned; the system recognizes the invoice and extracts data such as the name of the worker who used the ticket and the total amount for the trip. Once extracted, the data is sent to the program that the business uses to approve the reimbursement of the total amount of the ticket. This kind of project needs advanced document capture software and integration with the various software applications used by the business.
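
The travel allowance example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the OCR text, the field labels ("Passenger:", "Total:"), the regular expressions and the shape of the reimbursement request are all made up for the example, and a real capture product would be far more tolerant of layout and OCR errors.

```python
import re
from decimal import Decimal

# Assumed OCR output for a scanned train ticket invoice (invented sample).
OCR_TEXT = """\
RAIL TICKET INVOICE
Passenger: Maria Lopez
Route: Madrid - Sevilla
Total: 48.75 EUR
"""

# One regular expression per field we want to extract (hypothetical labels).
PATTERNS = {
    "worker": re.compile(r"^Passenger:\s*(.+)$", re.MULTILINE),
    "total": re.compile(r"^Total:\s*(\d+(?:\.\d{2})?)\s*EUR$", re.MULTILINE),
}


def extract_fields(text):
    """Return a dict with the captured value for each field, or None if absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


def build_reimbursement_request(fields):
    """Shape the extracted data for a (hypothetical) expense approval system."""
    if fields["worker"] is None or fields["total"] is None:
        raise ValueError("the invoice is missing a required field")
    return {
        "employee": fields["worker"],
        "amount": Decimal(fields["total"]),
        "currency": "EUR",
    }


fields = extract_fields(OCR_TEXT)
request = build_reimbursement_request(fields)
print(request)
```

The point of the sketch is the hand-off: once the name and the total are extracted, they become structured data that another business application can act on without anyone retyping them.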


Thursday, September 19, 2013

Document Management Glossary, Letter I

I have to apologize for skipping letter "I" in our glossary. So, let's go back to "I".

Image Processing 
Image processing is any kind of signal processing in which the input is an image (or a video) and the output is another image, a video, or a set of data extracted from the input. Currently, image processing is done digitally, although there could be specific applications in which it is done by analog means, using optics. Image processing is a two-dimensional process, in which the image is represented in the computer by a matrix whose cells are pixels. When the incoming information is a video, each frame or still image of the video is processed as an individual image, as a matrix, with a time relation existing between the images. Image processing is useful in a variety of fields, such as the automatic identification of document types, working with X-ray images to obtain information, image compression (.jpeg) and video compression (.mpeg), etc. In document management, image processing can be used in document digitization or document imaging.
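
As a small illustration of the "matrix of pixels" idea, the following Python sketch treats a tiny grayscale image as a list of rows and applies thresholding, a simple operation often used to clean up scanned pages. The pixel values, the threshold of 128 and the function names are assumptions chosen for the example.

```python
def binarize(image, threshold=128):
    """Image in, image out: turn each pixel into pure black (0) or white (255)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]


def dark_pixel_ratio(image, threshold=128):
    """Image in, data out: the fraction of pixels darker than the threshold."""
    total = sum(len(row) for row in image)
    dark = sum(1 for row in image for pixel in row if pixel < threshold)
    return dark / total


# A 3x4 grayscale "page": values near 255 are paper, values near 0 are ink.
page = [
    [250, 245, 30, 240],
    [248, 20, 25, 242],
    [251, 247, 35, 239],
]

binary = binarize(page)
ratio = dark_pixel_ratio(page)
for row in binary:
    print(row)
print(ratio)
```

The first function shows processing whose output is another image; the second shows processing whose output is data extracted from the image.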

Document Imaging (Image digitizing)
This is a category in information technology which uses image processing to reproduce, process and represent documents. Photocopiers, multi-function printers and scanners are some of the hardware used in image digitization. Imaging (or document imaging) refers to software which is used within the capture cycle of Enterprise Content Management, and consists of obtaining digital copies of paper documents.

Indexing
In document management, indexing is used to create a list (or index) of terms which describes each document in the system so that documents can be located again in the future. To do this, electronic systems go over the content of a document and extract a summary representation of its meaning (full-text), and they include metadata in the index, such as the title of the document or its author. The index is meant to make queries easier and to reduce the amount of time needed to carry out a search.

Index
In document management, an index is an ordered list of terms or words that belong to a document. When a document is indexed, the idea is to describe its content by using a list of terms.
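
To make the idea concrete, here is a minimal sketch of an inverted index in Python, one common way of implementing this kind of term list. The sample documents, the field names (`title`, `author`, `content`) and the functions are invented for the example; real document management systems use dedicated search engines for this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        # Index the full text and the metadata (title, author) alike.
        for field in ("content", "title", "author"):
            for term in tokenize(doc[field]):
                index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


documents = {
    "doc1": {
        "title": "Supplier contract",
        "author": "Ana Ruiz",
        "content": "Contract for paper supply signed in 2013",
    },
    "doc2": {
        "title": "Travel invoice",
        "author": "Luis Gil",
        "content": "Invoice for a train ticket to Madrid",
    },
}

index = build_index(documents)
print(search(index, "invoice train"))
print(search(index, "ruiz"))
```

Because the index is built once and consulted many times, a search only has to look up a few terms instead of rereading every document.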

Document Inventory
A document inventory is a record of the documents held by a company or an institution at a particular date. It is used to take count of all the documents stored by the institution in order to, for example, dispose of or set aside those which are at the end of their document life cycles.


Wednesday, September 18, 2013

Athento launches 2.0 beta version to add intelligence to the capture process and document management

Silicon Valley, September 18th, 2013 - Athento, the smart document management software, presented its 2.0 beta version today.

The 2.0 version of Athento offers advanced features for capture and document management, such as automatic classification and extracting information from documents by applying powerful algorithms for text and image analysis. The 2.0 version incorporates machine learning support, as well as an improved version of its semantic auto-tagging feature for documents, which is already available through

Another feature offered by this version of Athento is an easy-to-use, friendly and fast interface, which allows users to configure practically all of the capture process from the front-end. The strengths of this beta version are rounded out by an API and a series of web services that increase the tool's capacity for integration and also allow for mobile capture and processing.

“Athento combines intelligence, ease of use, precision and power to meet the needs of businesses that are not happy with the software that is currently available on the market,” says José Luis de la Rosa, CEO of Yerbabuena Software Inc. “For them, this product represents significant savings, thanks to process automation and a powerful yet flexible product for smart document management.”

Starting from today, users can try the 2.0 beta version of Athento through the product’s website. The company plans on having version 3.0 ready for November, and will present the 3.0 system at various international ECM events. The idea behind releasing a 2.0 beta version is to get user feedback, learn about user opinions and needs, and to show users that they have an active role to play in the design road map for Athento. 

Athento is an innovative product that has become known for its partnerships with big players such as Nuxeo and Alfresco, as one of its most attractive points has always been the way it has bet on interoperability. On its own, Athento also provides powerful solutions for business. Athento uses CMIS connectors which allow it to work with ECM platforms like the ones mentioned above, but also with SharePoint, OpenText or Documentum. This new version also offers the possibility of uploading documents from Dropbox, as well as other capture mechanisms such as capture from e-mail and hot folders.

About Yerbabuena Software, Inc.:
Yerbabuena Software is made up of a large group of Enterprise Content Management experts and currently has offices in Spain and USA, in addition to partnership agreements in countries such as Spain, Argentina, Chile, Peru, Colombia and Mexico. Its Athento product is in charge of managing documents in businesses such as the DIA group, BNP Paribas or Leroy Merlin.

Try Athento 2.0 beta now!


Tuesday, September 17, 2013

Document Management Glossary, Letter J


JEE stands for Java Enterprise Edition or Java Platform, Enterprise Edition, previously known as Java 2, Enterprise Edition (or J2EE). It is an application development platform for apps developed in the Java programming language.

Java is a general-purpose programming language developed in the 1990s by James Gosling of Sun Microsystems. Gosling based Java on the C and C++ programming languages, which were already widely used by then (C dates from the early 1970s and C++ from the 1980s).

The advantage that Java offered, compared to other languages, is that it was created with the idea of having the fewest implementation dependencies possible. Put another way: programmers would write the program once and it could then be executed on any platform. This has made the language easier to adopt, and it has spread to the point where it has become one of the main (and most versatile) languages for developing applications.

JPEG is the acronym for Joint Photographic Experts Group, the group of photographic experts who created the standard for compressing and encoding digital images, which is known by the same initials.

Image files which have been compressed and coded with this standard are said to be in JPEG format, using the .jpeg or .jpg extension.

JPEG and the Internet are closely related, both because they emerged at around the same time and because each has needed the other. JPEG has become widespread because of the Internet; but the Internet has also used JPEG to create pages that are more visually pleasing without seriously affecting their loading times. This is because images in JPEG format take up less space than other image formats, since they’re compressed into smaller files.

The development of digital cameras and smartphones with cameras has established the dominance of this standard and its associated format.

In document management, JPEG is used, among other formats, for the digitization of documents, or document imaging. The Capture mode in Athento permits the processing of documents digitized in various formats, such as JPEG (.jpeg or .jpg), TIFF or PDF.

Joomla! is a free, open-source Content Management System (CMS) for publishing content on the internet. This CMS is widely used, thanks to its wide-reaching community of users, who use it to publish things such as personal and professional pages and blogs.


Monday, September 16, 2013

Why extract data from documents?

One of the keys to a smart document management process is data extraction: using advanced capture software to read and save part or all of the information contained in a document so that it can be used for other purposes. So why do businesses undertake the work of extracting information from documents? What can they do with it?

Below, we’ll talk about some of the uses which, according to the AIIM, businesses have for captured data:

  • Indexing: making documents more accessible; or, put another way, making retrieval and searching easier. Indexing a document’s content means that we can search for words included in the body of the text. Indexing its metadata means that we can look for documents by using the distinct values which those metadata can take on. That’s the reason why most businesses capture data from their documents.
  • To use the information in business processes: One of the most obvious reasons why businesses extract information from documents is so that it can be used as a starting point for procedures such as account payments, check processing, etc. There are countless examples: take sales people who have client information on business cards, when the business might need that information to be available to the whole team in a CRM system and not in a card file. In that case, client information can be extracted from the cards and sent to an application like SugarCRM.

The following image shows you more specific uses that businesses have for the information that they can get out of their documents:

(Taken from the AIIM white paper, "Forms Processing - user experiences of text and handwriting recognition (OCR/ICR)".)


Thursday, September 12, 2013

How to get the documents separated once they’ve been scanned in one batch

Those of our readers who have worked on document imaging projects know that it’s impossible to scan documents individually if you’re in an environment that has reams of documents to be digitized. Scanners are becoming more and more powerful; some have auto-feed functions, and the most modern ones can process more than 200 pages per minute.

It’s one thing, however, to scan those documents and quite another to store them. Most likely, if I scan 10,000 invoices in one go, I don’t want to save them as a single file in my repository. This problem is a lot more obvious with businesses or organizations that work with files or case files. Normally, groups of documents are scanned and each file contains a set of distinct documents. In these cases, once the documents are digitized, you’d want to be able to access all the documents that form part of the same case file.

Scanners are faster these days, but they’re not smarter. They can’t tell where one document ends and another one starts, or when it’s done with file 2013-A899 and has moved on to 2013-A900. That’s where document capture software comes in handy.

Separating Groups of Documents: How can you get the documents separated once they’ve been scanned in one batch?

Basically, there are three ways to do it: 
  • Separation by bar codes: This consists of putting labels with bar codes on the documents before they’re digitized. Bar codes establish the limits between documents or case files that need to be separated. 

  • Separation using blank pages: Blank pages are placed between documents (or between case files), to be used as separators during the document imaging process.

  • Smart separation of document lots: The previous methods are expensive. Someone’s got to print and stick those labels on the documents, or slip blank pages between the groups in a pre-digitizing phase. That’s what smart separation tries to avoid. Thanks to their ability to analyze the structure of the document and the text it contains, smart capture systems are capable of identifying where one document starts and where it ends. These systems do need training: every time identification and separation of documents is performed, the system continues to learn and becomes more accurate.
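
The first two methods can be sketched with one small Python function. This is a simplified illustration, not how any particular capture product works: we assume each scanned page is already available as a dictionary with its OCR `text` and an optional decoded `barcode`, and the names `split_batch`, `is_blank` and `has_label` are made up for the example. Smart separation is not shown, since it depends on a trained model.

```python
def split_batch(pages, starts_new_document, keep_boundary_page):
    """Split a scanned batch into documents at every boundary page.

    starts_new_document: function that tells whether a page marks a boundary.
    keep_boundary_page: True for labelled first pages, False for blank sheets.
    """
    documents, current = [], []
    for page in pages:
        if starts_new_document(page):
            if current:
                documents.append(current)
            current = [page] if keep_boundary_page else []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


def is_blank(page):
    return page["text"].strip() == ""


def has_label(page):
    return page["barcode"] is not None


# Batch 1: two invoices separated by a blank sheet, which is discarded.
batch_blank = [
    {"text": "Invoice 1, page 1", "barcode": None},
    {"text": "Invoice 1, page 2", "barcode": None},
    {"text": "", "barcode": None},
    {"text": "Invoice 2, page 1", "barcode": None},
]
by_blank = split_batch(batch_blank, is_blank, keep_boundary_page=False)

# Batch 2: case files whose first page carries a bar code label, which is kept.
batch_codes = [
    {"text": "Case file cover", "barcode": "2013-A899"},
    {"text": "Supporting document", "barcode": None},
    {"text": "Case file cover", "barcode": "2013-A900"},
]
by_code = split_batch(batch_codes, has_label, keep_boundary_page=True)

print([len(doc) for doc in by_blank])
print([doc[0]["barcode"] for doc in by_code])
```

The only difference between the two methods is what happens to the boundary page: a blank separator is thrown away, while a labelled page belongs to the document it opens.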

It goes without saying that Athento is able to carry out smart separations of groups of documents! In this success story (free to download), you’ll learn more about how separating groups of documents works.
I hope that you find it useful! :-)


Wednesday, September 11, 2013

Athento Answers – for your questions

Over the last few months, we’ve been making great efforts so that developers and Athento users can count on all the help they need when they work with Athento. Our first step was to open the Athento Documentation Center to the public. The Center is a place where developers, system administrators and users in general can find all the documentation connected to Athento.

Now it’s time to create a space in which the community’s questions can be addressed. Athento Answers is a space where you can ask how Athento works, raise questions about development, rollouts and integrations, and discuss anything that could be useful to the Athento community. You’ll need to register to ask questions, although you can browse the existing questions and answers without registering.

We’d like to encourage you to take full advantage of all of the material that Athento has made available for you to use.

You can also follow the updates to the Athento Documentation Center and Athento Answers by following Athento’s Twitter feed: @athento.


Tuesday, September 10, 2013

Why do businesses undertake processes of document imaging and capture?

Interest in document imaging projects continues to grow. We’re not just talking about simple projects of “mass scanning”, which is what we typically think of when we think of digitization. Truth is, this part of the document management market is shrinking. According to PwC, negative growth of between 3% and 5% is expected in this business (according to the "White Book of Document Management 2011"). The main reason this consultancy gives for the decline of the sector is that government bodies (who, traditionally, have been the ones who have most needed to digitize old documents) have stopped asking for those services, due to cutbacks.

These days, both the public and private sectors are interested in projects that offer real value to their business processes, and this “value”, doubtless, is found in being able to automate processes to save time and labor costs. Digitization, in and of itself, doesn’t offer these benefits. That’s where document capture software comes in: it gives users the chance to obtain information from documents, to be used to trigger automatic processes.

Knowledge, productivity and cost reductions are, therefore, the motivating factors behind the rise of “intelligent” digitization processes. That’s how the AIIM defines it in its study called "The Paper-Free Office – Dream or Reality". According to the study, the main reason why businesses undertake document imaging and capture processes is to increase their ability to search for information within their documents, which is also connected to Knowledge Management. Businesses also cite various other factors, such as cost reduction (consumables, storage space) and process improvement (productivity).

In the following diagram, you can see for yourself the reasons that guide users when they undertake document imaging and capture projects:


Thursday, September 5, 2013

Why use open-source code in document management?

The first answer to this question is that it makes things easier. The document management sector, which still relies on proprietary solutions such as SharePoint or Documentum, is increasingly being joined by open-source solutions. Through interoperability, these solutions share resources with other existing applications (free or proprietary), which makes it easier for businesses to work together and lets them save money through lower development and programming costs.

Some defenders of proprietary software place value on the safety and stability you get from always having a company who can answer questions about the software. This couldn’t be further from the truth, because in the world of document management there’s more value in having scalable solutions that can be personalized according to the needs of each business. What’s more, proprietary programs very rarely include service agreements (and when they do, they are, in most cases, just to provide client testimonials). The business model behind proprietary software is based on the assumption that a large number of users will have identical needs, though this isn’t the case with businesses that implement document management solutions (each of which has its own problems and interests). What would be both useful and practical would be the creation of document management solutions for specific sectors. Nonetheless, this practice, typical of an open-source point of view, is far removed from a generalist production model whose main source of profitability is the sale of user licenses (which favors the producer and doesn’t consider the satisfaction of each client). Consumers of proprietary software don’t usually need specific adaptations to the program, whereas businesses interested in document management do.

To satisfy the needs of each business, the best thing to do would be to bet on open-source software, which offers significant advantages. According to some open-source evangelists, businesses opt for open-source software because having free access to source code lets them scale their applications themselves and, in that way, keep control over growth, without having to depend exclusively on the business that originally created the program. What’s more, the possibility of incorporating open-source tools in the program guarantees that it will always be up-to-date in terms of functionality. Done another way, it would be a lot more difficult, because it would mean having to develop each one of the product components from nothing.
The use of open-source code adds a component of innovation (thanks to contributions from the community) in the area of business information management, as well as better flexibility (since it counts on the possibility of using functionalities from other applications) and considerable support savings that come from reduced costs of previous production.

Counter to what some skeptics might think, being more affordable doesn't mean lacking quality. This type of software is being used in the banking sector, in aerospace projects and in big businesses for management processes that require high performance.

From a technical point of view, the fact that the vast majority of solutions now offer supported use in the cloud makes open technology the most interesting choice, especially where technological compatibility is concerned. Beyond the fact that a large number of servers around the world are administered with versions of Linux and that many web technologies (such as PHP or CSS) are free, there should be an atmosphere of freedom which favors the creation of solutions that can be adapted to each specific case, with the possibility of sharing functionality via interoperability (thanks to the CMIS standard) and with an accessible API that permits new integrations and future development.

I’d like to finish with something that seems important to me. The information society, with the cloud and cloud computing projects as its protagonists, is an atmosphere of cooperation and shared knowledge in which collaboration among professionals, and among machines, applications and digital objects, is vital to the survival of businesses. In my opinion, proprietary software, with its restrictive model of production and profit, is left out of this atmosphere, which is made up of scalable applications, software sold as a service, community contributions (thanks to APIs) and more flexible investment models for businesses (such as pay-per-use).

The key lies in regulatory norms, which are responsible for opening up or blocking one or more models. If we choose open-source software we will find, among its licenses, one that will allow us to integrate the solution with other applications. Though some of those applications might be proprietary (such as using Microsoft Office for documents or being able to communicate with SharePoint), the license is not a strict legal instrument designed to protect exclusivity. Quite the opposite is true of proprietary software, which is normally protected by intellectual property legislation: opening it up is neither possible nor legal, and there are no community contributions, which means that the innovation component is reduced to whatever the company that produces it can bring to it by itself.

This way of thinking goes against the free and open spirit of the web and doesn’t permit users to take effective advantage of existing technologies, nor of the creative potential of other people. Document management unfolds in businesses as a complex information system that includes various interconnected processes, such as document capture, information retrieval and the use of functionality from applications produced by third parties, as well as software hosted in the cloud, which is advantageous for businesses that don’t have much technological infrastructure. The way in which all of these components work together in one system depends on many parties (the scanner manufacturer, the programmer of the main repository, the creator of the invoice management application, etc.). With the CMIS interoperability standard, it’s possible to get the system to integrate everything, and open-source licenses give us the freedom needed to integrate current technologies and those that might come in the future. That would not be allowed (and would not even be legally possible) if we produced proprietary document management software, not to mention the huge investment in infrastructure and programming hours involved, an investment with no guaranteed return, made in the hope of earning higher profits.

About the Author:  This article was written by Adrían Macias, Managing Director of, which provides resources to Spanish and Latin American professionals working with information and documentation.

Wednesday, September 4, 2013

How does Athento classify documents and extract the data?

Some of you are already familiar with Athento’s document capture features. We’ve also told you that, very soon, version 2.0 of Athento will be available. But for some of you who are new to the platform, we’d like to explain how Athento works.

Athento basically works using the definition of models. A model represents a type of document and tells Athento two things:

  • The physical appearance of a document and its content;
  • The metadata that should be extracted from a document type.

Defining a model in Athento makes two things possible:
  • Athento can identify a document that is uploaded to the system and determine that it belongs to a particular document type (such as “invoice from Amazon”).
  • Athento can extract the metadata that have been previously defined as data to be extracted from this particular type of document (such as the total invoice amount).

That means that it’s necessary to create models before beginning to classify documents and extract data from them. Creating a model means defining the following characteristics for a certain class or type of document:

  • Basic data: A name and, most importantly, a document that Athento can use as an example to learn the physical characteristics of this type of document (layout, colors, limits, etc.).
  • Key words (regular expressions): These are expressions, words or groups of terms which normally appear together in a document of this type.
  • Metadata: Expressions that help us find the metadata to be extracted within the text of the documents.
  • Extraction templates: Templates which define the physical location (coordinates) of the metadata within a document, so that the OCR system can extract them.
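
As a rough illustration of the four characteristics listed above, here is a minimal Python sketch of how a model could be represented and used to classify text that has already been extracted by OCR. It is not Athento's actual schema or API: the `DocumentModel` class, its field names and the `classify` function are all hypothetical, and the matching rule (the model with the most matching key words wins) is an assumption made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DocumentModel:
    """Hypothetical stand-in for a model; not Athento's real schema."""
    name: str             # basic data: the name of the document type
    sample_document: str  # basic data: path to the example document
    # Key words: regular expressions that normally appear in this type.
    keywords: List[str] = field(default_factory=list)
    # Metadata: field name -> reference expression used to locate it.
    metadata: Dict[str, str] = field(default_factory=dict)
    # Extraction templates: field name -> (x, y, width, height) for OCR.
    templates: Dict[str, Tuple[int, int, int, int]] = field(default_factory=dict)


def classify(text: str, models: List[DocumentModel]) -> Optional[str]:
    """Return the name of the model with the most matching key words.

    Returns None when no key word of any model matches the text.
    """
    best_name, best_hits = None, 0
    for model in models:
        hits = sum(
            1 for kw in model.keywords if re.search(kw, text, re.IGNORECASE)
        )
        if hits > best_hits:
            best_name, best_hits = model.name, hits
    return best_name
```

With a model named “invoice from Amazon” whose key words are `amazon` and `invoice`, calling `classify` on the text of an uploaded Amazon invoice would return that model's name, while a text that matches no key words would return `None`.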

It’s really easy to create models in Athento. To see how it’s done, we invite you to consult our Athento Documentation Center and, specifically, the entry called “How to create a new model in Athento”.

Discover how smart document capture can make any document imaging process more efficient.

Tuesday, September 3, 2013

The capture life cycle in Athento

When we talk about the life cycle of documents (or content, in general), we’re talking about the way in which documents are always subjected to a process (or evolution) with recognizable, predictable states. These are repeating processes. The term "life cycle" is familiar in document management applications and, more generally, in Enterprise Content Management platforms. Capture applications, however, can also work with life cycles for the documents that they process.

In the case of Athento, every time that we upload a document to the application to be automatically classified and to have its data extracted, the document goes through a two-stage life cycle:

  • Review: This is the state of documents when they’re uploaded to the system and during processing. It indicates that the information obtained by Athento (document type and extracted metadata) has not yet been certified as correct by a human user.
  • Validated: This is the state a document reaches when a person reviews the information obtained by Athento and approves it as valid. This is the final state that documents receive in Athento.
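
The two states above can be pictured as a very small state machine. The following Python sketch is only an illustration under assumed names (`CaptureState`, `CapturedDocument` and `validate` are hypothetical and do not come from Athento): a document starts in review and can move to validated exactly once, when a person approves it.

```python
from enum import Enum


class CaptureState(Enum):
    REVIEW = "review"
    VALIDATED = "validated"


class CapturedDocument:
    """Hypothetical document record; not Athento's real data model."""

    def __init__(self, name, doc_type=None, metadata=None):
        self.name = name
        self.doc_type = doc_type            # type guessed by the software
        self.metadata = dict(metadata or {})
        self.state = CaptureState.REVIEW    # every upload starts in review
        self.validated_by = None

    def validate(self, reviewer):
        """Move the document from review to validated, its final state."""
        if self.state is not CaptureState.REVIEW:
            raise ValueError("only documents in review can be validated")
        self.state = CaptureState.VALIDATED
        self.validated_by = reviewer
```

Because validated is the final state, the sketch raises an error if `validate` is called a second time on the same document.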

To see the state of a document, consult the list of documents.

For more information on capture life cycles of documents in Athento, please visit our Athento Documentation Center.

In the video below, you can see how to move documents between states when working with our capture software.

Processing documents on Athento from Athento on Vimeo.


Monday, September 2, 2013

Regular Expressions in Athento 2.0

In previous posts, we’ve told you about some of the new document capture features of Athento, such as being able to search by facets.

Today, we’d like to share more about the new functionality of our document capture software: the simple handling of regular expressions.

Regular expressions are texts, words, numbers, fragments of text, etc. which we know we’re always going to come across in certain types of documents, and which can help us find the data that we need to extract from documents, or even give the software clues about what kind of document it’s managing.

Let’s look at an example: We’ve got a pile of invoices that have been put through digital imaging and we need to get the “total payment” figure. This metadata has several unique characteristics:

  • The figure always appears to the right of the expression “Total Payment”
  • It’s made up of one complete word
  • It can contain up to 30 digits

Here’s an example:

Defining these characteristics in Athento is easy: all we have to do is define certain parameters:
  • Metadata: The name of the metadata that you want to extract (for example, “Total Payment”).
  • Expression: The fragment of text, number, word(s), etc., to be looked for within the text, as a reference for the extraction of metadata. In the example above: “Total Payment”.
  • Position: The location of the data to be extracted, relative to the expression. In the case of these invoices, the Total Payment figure is always going to be to the right of the expression “Total Payment”.
  • Number of words: The value we want to extract could be made up of one or more words. In the case of the Total Payment, the figure is just one word.
  • Maximum length: The number of digits or characters contained in the word or expression that we want to extract. In this case, there would be up to 30 characters.
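
To make the parameters above more concrete, here is a minimal Python sketch of this kind of extraction. It is not Athento's implementation: the function `extract_right_of` and its argument names are hypothetical, it only handles the position "to the right of the expression", it matches the expression case-sensitively, and it assumes the text has already been obtained by OCR.

```python
from typing import Optional


def extract_right_of(text: str, expression: str,
                     number_of_words: int = 1,
                     max_length: int = 30) -> Optional[str]:
    """Return the words found to the right of `expression` on the same line.

    At most `number_of_words` words are taken. A candidate longer than
    `max_length` characters is rejected and the search continues on the
    following lines. Returns None when nothing suitable is found.
    """
    for line in text.splitlines():
        _, found, rest = line.partition(expression)
        if not found:
            continue  # the expression does not appear on this line
        # Drop separators such as ":" before splitting into words.
        words = rest.strip(" :\t").split()[:number_of_words]
        value = " ".join(words)
        if value and len(value) <= max_length:
            return value
    return None
```

For an invoice line such as `Total Payment: 1,250.00 EUR`, calling the function with the expression `Total Payment` and one word would return the figure alone, and asking for two words would return the figure together with the currency.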

These parameters are configured in Athento so that the system can process the text and find the information we’re looking for. Alongside this method there are other mechanisms, such as the definition of templates (a graphical description of the coordinates where OCR should look for metadata), which was already available in previous versions of Athento. In 36 seconds, we’ll see how to configure these regular expressions:


Document Management Glossary, Letter G

In 1983, an American programmer named Richard Stallman came up with the idea of an open operating system, which he called GNU (a recursive acronym for “GNU’s Not UNIX”). UNIX, at the time, was a popular operating system with a stable architecture, but it wasn’t open. What Richard Stallman was creating wasn’t just an operating system: it was also a social movement. By creating GNU and releasing it as copyleft (for free distribution, the opposite of copyright) under a GPL license (see below), he helped create a global community of developers who, to this day, share code and develop systems collaboratively. Years after Stallman published his ideas on the internet, Linus Torvalds, who in 1991 was a computer science student in Finland, created the Linux kernel with the collaboration of many programmers and later published it under a GPL license. Afterwards, GNU and Linux were combined to create an open operating system that was completely functional. It’s typically known as "GNU/Linux" or as a "Linux distribution".

GPL (General Public License) was the first copyleft license for general use: derived works can only be distributed under the same license. The GPL is one of the most widely used free software licenses in the world and guarantees end users the rights to use, share, study and modify software licensed under it. Companies or users who distribute their software under a GPL license can do so without charging users, but commercial use is also allowed, provided that any derived work that is distributed stays under the GPL rather than becoming proprietary software.

Smart Document Management helps Public Transit Companies to Automate Ticket Refunds