Friday, June 28, 2013

How to install and configure WebDAV with Nuxeo

WebDAV (Web Distributed Authoring and Versioning) is a web communication protocol (an extension to HTTP/1.1) that gives you quick access to the content of a server or website from a folder on your computer, as if it were local content. This means that you can work with folders and files on your server as if they were on your local drive, as long as you have access rights.

Today we would like to show you how to install and configure WebDAV with Nuxeo on Linux and Windows systems. We'll start with Linux users; in the next post, Windows users will learn how to do it too.

How to install and configure WebDAV with Nuxeo on Linux

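The only software you need is davfs2, a Linux file system driver that allows you to mount a WebDAV server as a disk drive.
You can install it from your distribution's official repositories by running one of the following commands in a console window, depending on your distribution.

On Debian or Ubuntu:

sudo apt-get install davfs2

On Fedora, CentOS or Red Hat (the package name may differ depending on the repository you use):

sudo yum install fuse-davfs2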

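Now you should be ready to mount your Nuxeo drive on your local system. Use the following command, replacing the placeholders with your own user ID, group ID, Nuxeo WebDAV URL and mount point (the mount point directory must already exist). Note that mount options are separated by commas:

sudo mount.davfs -o uid=linux-user-id,gid=linux-group-id http://nuxeo-repository-url /mnt/where-you-would-like-to-mount-nuxeo-on-your-computer/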
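
If you mount the repository from a script, it can help to build the command from variables rather than typing it by hand. The following is only a minimal sketch in Python: the function name build_davfs_mount_command is our own invention, the URL and mount point are placeholders, and the script only prints the command instead of running it.

```python
import os
import shlex


def build_davfs_mount_command(webdav_url, mount_point, uid=None, gid=None):
    """Return the argument list for mounting a WebDAV URL with mount.davfs.

    Hypothetical helper for illustration; it is not part of davfs2 or Nuxeo.
    """
    # Default to the user and group running the script (Unix only).
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    # mount options are a single comma-separated string.
    options = "uid={},gid={}".format(uid, gid)
    return ["sudo", "mount.davfs", "-o", options, webdav_url, mount_point]


if __name__ == "__main__":
    # Placeholder values: replace them with your own Nuxeo URL and mount point.
    command = build_davfs_mount_command(
        "http://nuxeo-repository-url", "/mnt/nuxeo", uid=1000, gid=1000
    )
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(command))
```

Printing the command first lets you check the options before anything is mounted.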

If you have done everything right :-), you should be able to access your Nuxeo repository from the directory where you mounted it.


For detailed information, please visit our Athento Documentation Center.


Popular posts:
Comparing Document Capture Solutions (Athento, Kofax, Ephesoft etc.)
Document Management Success Case at BBVA, managing 7 million records.
Comparing ECM Systems (including Alfresco, OpenText, Documentum, Filenet, Sharepoint or Nuxeo).

LikeUs Yerbabuena Software on LinkedIn


Monday, June 24, 2013

Document Types and Data Extraction

Beyond the custom document types that we choose to work with in our business (government form number XXX, complaint forms, service termination forms, correspondence, etc.), or the everyday types that are used frequently across all industries (invoices, purchase orders), there are some general types of documents which are important to keep in mind when it comes to digital document data extraction projects. Let’s not forget that data extraction is used to fill in metadata automatically or to feed other business software (such as ERP or accounting software), and that it’s a job for document capture software.

1. Structured documents
These are the types of documents in which we know what information we’re going to find, and the position that the information occupies within the physical dimensions of the document (its coordinates). Some examples of these documents would include general forms, government forms, passports, and identity cards (national cards, citizenship cards and any other form of identification). As an example, Spanish identity cards all contain the same information, and that information is always located in the same place. With this type of document, it’s easier to find and extract data because we know where to look for it. In this video about our capture product, we can see how, with this type of document, users themselves can define templates for data extraction, intuitively showing Athento’s OCR which data to extract and the coordinates where they can be found.
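
To make the idea of a coordinate template more concrete, here is a minimal sketch in Python. It is not Athento’s implementation: the Word and Zone types, the field names and the coordinates are all hypothetical, and we assume an OCR step has already produced each word together with its bounding box.

```python
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    """A word recognized by OCR, with its bounding box (hypothetical format)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


class Zone(NamedTuple):
    """A rectangle in the template that is expected to contain one field."""
    left: float
    top: float
    right: float
    bottom: float


def extract_with_template(words: List[Word], template: Dict[str, Zone]) -> Dict[str, str]:
    """Collect, for each field, the words whose center falls inside its zone."""
    result = {}
    for field, zone in template.items():
        inside = [
            w for w in words
            if zone.left <= w.x + w.width / 2 <= zone.right
            and zone.top <= w.y + w.height / 2 <= zone.bottom
        ]
        # Approximate reading order: top to bottom, then left to right.
        inside.sort(key=lambda w: (w.y, w.x))
        result[field] = " ".join(w.text for w in inside)
    return result


# Made-up OCR output for an identity card and a made-up template for it.
words = [
    Word("ESPAÑA", 10, 10, 60, 12),
    Word("GARCIA", 120, 40, 60, 12),
    Word("LOPEZ", 185, 40, 50, 12),
    Word("12345678Z", 120, 200, 90, 12),
]
template = {
    "surname": Zone(100, 30, 300, 60),
    "id_number": Zone(100, 190, 300, 220),
}
fields = extract_with_template(words, template)
print(fields)
```

Because every card of the same type shares the layout, one template is enough for all of them.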

2. Semi-Structured Documents
The difficulty in managing things starts to increase as the information becomes less structured. Semi-structured documents are those in which we know what information we’re going to find; we just don’t know where, exactly, we’re going to find it. One classic example of this type of document is the invoice. An invoice has to include a tax registration number, an amount for sales tax, a total amount and client data, among other things. Regardless of who (or what company) it comes from, we know that an invoice is going to include this information, but nobody can guarantee that the information is going to be put in the same location by all of our suppliers. The supplier’s tax registration number could be on the right-hand side of the document, but it could also be on the left. When this happens, we know what we’re looking for; we just don’t know where to find it.

Data extraction from semi-structured documents can’t work in the same way as in the previous case; here, we have to show the software what the information we’re looking for looks like. For example, the words “Value Added Tax” or “Sales Tax” are going to appear before the actual amount of the tax itself. More advanced applications dedicated to data extraction and capture, such as Athento Capture, don’t just look for regular expressions within texts: they also try to contextualize the information being searched for, and identify its relationship(s) with the smaller structures within a document, such as tables, images, paragraphs, etc.
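
The simplest version of this idea, looking for an amount that follows a known label, can be sketched with regular expressions. This is only an illustration in Python, not how Athento Capture works: the label lists, the find_amount function and the two sample invoices are assumptions of ours, and a real product would add the contextual analysis described above.

```python
import re
from decimal import Decimal

# Hypothetical label lists: the words we expect to see before each amount.
FIELD_LABELS = {
    "tax": ["Value Added Tax", "Sales Tax"],
    "total": ["Total Amount", "Total"],
}


def find_amount(text, labels):
    """Return the first amount that follows one of the labels on the same line."""
    for label in labels:
        pattern = re.compile(
            r"\b" + re.escape(label) + r"\b[^\n]*?([\d,]+\.\d{2})",
            re.IGNORECASE,
        )
        match = pattern.search(text)
        if match:
            # Drop thousands separators before converting to a number.
            return Decimal(match.group(1).replace(",", ""))
    return None


def extract_invoice_fields(text):
    """Apply every label list to the text, wherever the labels appear."""
    return {field: find_amount(text, labels) for field, labels in FIELD_LABELS.items()}


# Two made-up invoices with the same data in different places.
invoice_a = "ACME Ltd.\nSubtotal: 1,000.00\nSales Tax: 210.00\nTotal: 1,210.00\n"
invoice_b = "Total Amount 605.00 EUR\nValue Added Tax (21%) 105.00\n"

print(extract_invoice_fields(invoice_a))
print(extract_invoice_fields(invoice_b))
```

The position of each value no longer matters; what matters is the label that precedes it.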

3. Unstructured Documents
With this type of document, we don’t know what we’re looking for, much less where to find it. This is the most difficult type of document for extracting data. Included in this group are reports, letters, etc. I should clarify by saying that, according to some writers, the second group (semi-structured) doesn’t seem to exist, and documents like invoices should be included in this group.
Specifically, according to AIIM, an unstructured document meets three requirements: 
  • The structure of the document hasn’t been designed by the business which now wants to manage it (i.e. it’s an external document); 
  • The structure of these documents can vary, depending on who’s sending them (e.g. every company that sends an invoice has its own format); 
  • They can’t be processed by adhering to one particular template. 

As you’ll have noticed, these points are better suited to the description of semi-structured documents which I’ve given you. Nonetheless, one has to consider the existence of other documents which have totally unclear structures, but which are still important for a business, such as letters of complaint, general correspondence, reports, etc. These days, extracting data from these types of documents is an arduous process which, in most cases, requires custom development and a study of the documents to try to identify some kind of structure in, or understanding of, the data which you want to extract. It also requires more training of the system and the consistent application of as many regular expressions as possible. Maybe that’s the reason why documents of this type aren’t frequently included in data extraction projects: unstructured documents are usually ignored. 

Discover how Athento's intelligent document capture technologies work: download it.


Wednesday, June 19, 2013

Café in Silicon Valley with Antonio de las Nieves

English: AOL's Silicon Valley office in Mountain View. (Photo credit: Wikipedia)
A few days ago, our VP of America at Yerbabuena Software was interviewed by the California Chamber of Commerce. In this interview, Antonio de las Nieves talked about our business story here in Silicon Valley, why we ended up here and what makes Athento different from other document capture software.

Let's watch the video!


Tuesday, June 11, 2013

Intelligent Document Capture: The first step towards managing Big Data (Part 2)

Part Two 
In the first post, we attempted to contextualize the problem of Big Data, and in this second part, we’ll take on the job of explaining why document capture is the first step toward a solution. 

Go to part one

The first step: Make the information immersed in our digital content accessible
Computers are machines. They can’t “understand” or give context to the content of our digital assets. If the content of a digital asset, or the data that explain the content and nature of that asset (metadata), isn’t added to something like an index table in our applications, it’s impossible for the machines to find relationships between data or to put our digital content into the kind of context that we humans would like to get out of it. Nowadays, we debate whether relational databases will be able to handle these index tables in the near future, when we haven’t even taken the first step of indexing the existing content.

So, we’ve taken on the work of digitizing everything which we don’t have in digital format, and we’ve come up against the same problem. If we’re not able to let the machine access the real information that our digital assets contain, we’re just throwing them into a bottomless pit. We’re forgetting that a scanned document is nothing more than an image, which our human brains can read, but the same isn’t true of the processors of our machines.

We can use OCR software to take care of part of this problem. We add the content of our documents and our digital assets to the index table of our applications, but if there are specific data which need to be shared or retrieved for use in specific applications, all we can do is make OCR carry out a huge amount of work looking for those data in the vast content of the digital assets. Why not make these data easier to access, working with them as if they were metadata? For example, if our accounting software needs to know the number of each invoice, why put the software to work looking for this number inside every invoice, every time it’s required? Wouldn’t it be easier to find it the first time and store it as metadata of the digital asset?
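
The “find it once, store it as metadata” idea can be sketched in a few lines of Python. Everything here is hypothetical: the DocumentStore class, the invoice number pattern and the sample text are our own assumptions for illustration, and the text is assumed to come from an OCR step that has already run.

```python
import re

# Hypothetical pattern: "Invoice No: INV-2013-0042", "Invoice Number 44", etc.
INVOICE_NUMBER = re.compile(
    r"Invoice\s+(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)


class DocumentStore:
    """Keeps extracted values as metadata so the content is scanned only once."""

    def __init__(self):
        self._metadata = {}
        self.content_scans = 0  # how many times we searched the full content

    def capture(self, doc_id, ocr_text):
        """Search the OCR text once, at capture time, and store the result."""
        self.content_scans += 1
        match = INVOICE_NUMBER.search(ocr_text)
        self._metadata[doc_id] = {
            "invoice_number": match.group(1) if match else None
        }

    def invoice_number(self, doc_id):
        """Later lookups read the stored metadata and never touch the content."""
        return self._metadata[doc_id]["invoice_number"]


store = DocumentStore()
store.capture("doc-1", "ACME Ltd.\nInvoice No: INV-2013-0042\nTotal: 1,210.00")

# The accounting software can ask as often as it likes.
for _ in range(3):
    number = store.invoice_number("doc-1")

print(number, store.content_scans)
```

However many times the number is requested, the content of the invoice is searched only once.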

Right, so we’ve fixed the problem of fast access to specific data contained within our digital assets, but there’s still one more problem to solve. With the quantity of documents which we receive every day, is it viable to spend time searching through each of these documents for data? No; if we’re talking about Big Data, it certainly isn’t viable. But if we’ve already managed to get the machine to read the content inside our scanned documents, why not also get it to extract the data that we most need to have accessible?

That, folks, is the key: to get the machine to work for us, and that’s the first place where we should invest our money. When we know how to get the most out of the data in our digital content, we’ll discover that the problem of Big Data becomes more of a hardware dilemma, because we can provide the software with the entry point to the information that it needs. Document capture software is just the beginning, but without it there’s nowhere to go when trying to deal with Big Data.

Downloads: We explain how Athento helped Crisa manage technical documents.


Monday, June 10, 2013

Intelligent Document Capture: The first step towards managing Big Data

Part One
In this first post, we’ll go over the challenge that Big Data represents for businesses, in a conceptual manner; in the next post, we’ll explain how document capture could be the first step towards resolving this problem. 

What is Big Data?
Up to now, there has been no clear agreement on the point at which we start to talk about Big Data. Some speak of petabytes, others of exabytes; still others don’t need to reach capacities that big. For each organization, the limit between what is (or isn’t) a Big Data problem can be determined by the capacity of its information systems to manage that information.

How can we understand the problem of Big Data?
Gartner lays out the Big Data problem as a problem with three dimensions:

  • Enormous volumes of information: With the rise of new technologies, we’re not just creating more digital content; we’re also creating digital content that takes up more space. According to the “Digital Universe” report published by IDC, by 2015 we will reach 7,910 exabytes of information worldwide.  
  • Digital content is becoming increasingly varied: These days, we use diverse gadgets and formats to create and store information: voice messages, text messages, videos, e-mails, social networks, bank transactions, etc. All of these different pieces of data can’t be treated in the same way. 
  • These volumes, and the variety of information, need increasingly faster means of processing and retrieval. Many businesses have had to confront system collapses or, simply, users who balk at using information applications because they’re just too slow. Speed, however, also has to be seen from another point of view: how quickly we are creating and storing information now.

The biggest challenge in the era of Big Data
It’s not just about getting our applications to survive the size of the data; it’s also about making sure that our information systems don’t become black, bottomless pits where we toss our digital content every day. We need to be able to generate real information or knowledge, starting from our digital content.

Don’t miss the second part of our post.


Wednesday, June 5, 2013

Improvements with Digitized Document Images

As you know, we here at Athento have been tirelessly researching document capture for some time. Our objective is to get many manual tasks, such as data extraction or document classification, done in a completely automatic way, with the highest precision possible.

In order that these tasks can be automated, especially the extraction of data, the images have to meet certain minimum quality criteria. Anyone who’s ever had to scan a document knows that once the thing’s been scanned, the document can end up with defects like blurring, black (or white) edges, being off-center, etc.

When data is extracted from a document, one of the base technologies applied to it is OCR (Optical Character Recognition). Current OCR engines have problems reading the content of a document when it ends up with quality defects like noise. “Salt and pepper” noise, which is nothing more than a bunch of grainy spots spread throughout the image, negatively affects the performance of OCR.

Below, you can see a digitized image which is grainy and contains a fair bit of noise:

In order for data extraction to be as precise as possible, noise has to be eliminated from the image. Francisco González, one of our engineers (affectionately known as “Kurro”), has made it possible for Athento to significantly “clean the noise” from digitized images.
Here’s the same image, but after being improved and cleaned up by Athento:

Congratulations, Kurro: impressive work!
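
This post doesn’t describe the technique Kurro used, so the following is not Athento’s method. It is just a minimal Python sketch of a median filter, a common textbook way of removing salt-and-pepper noise. The function name and the tiny test image are assumptions for illustration, and the image is represented as a plain list of rows of grayscale values (0 = black, 255 = white).

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Replace each pixel with the median of its neighbourhood.

    Isolated black or white specks are outvoted by their neighbours,
    which is why this works well against salt-and-pepper noise.
    """
    height = len(image)
    width = len(image[0])
    cleaned = []
    for row in range(height):
        new_row = []
        for col in range(width):
            # Gather the pixels in the window, clipped at the image borders.
            neighbours = [
                image[r][c]
                for r in range(max(0, row - radius), min(height, row + radius + 1))
                for c in range(max(0, col - radius), min(width, col + radius + 1))
            ]
            # median_low always returns one of the original pixel values.
            new_row.append(median_low(neighbours))
        cleaned.append(new_row)
    return cleaned


# A white 5x5 image with a single black "pepper" speck in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0

cleaned = median_filter(noisy)
print(cleaned[2][2])
```

A larger radius removes bigger specks, at the cost of blurring fine details such as thin characters.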


Tuesday, June 4, 2013

Infographic: Searches on Document Management Systems and ECM Platforms

A promise made is a debt unpaid. As promised, let's take a quick look at our awesome infographic.

How do users look for documents on DMS and ECM platforms?
Without a doubt, the most commonly-used method to find documents is by using search forms. Even then, a significant number of users still find documents by browsing within folders and files.

What should you know about searches?
The first thing you should know about searches is that they cost you money. Every minute you spend looking for a document costs at least $0.09. The slower the system is, the more expensive it becomes. Another interesting fact about searches is that more than half of the respondents to our survey said that they usually search for documents by words within their content. The bad news for those users is that not all ECM platforms and DMS support full-text indexing of content.
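
To show what full-text indexing means in practice, here is a minimal Python sketch of an inverted index, the basic structure behind searching by words within the content. It isn’t taken from any particular ECM platform; the function names and the sample documents are assumptions for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up documents standing in for the text content of a repository.
documents = {
    "contract.pdf": "Service termination form for ACME",
    "invoice-44.pdf": "Invoice for ACME, total amount 1,210.00",
    "letter.txt": "Letter of complaint about a late invoice",
}
index = build_index(documents)

print(search(index, "acme invoice"))
```

The index is built once when documents are stored, so each search is a quick lookup rather than a scan of every file.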

If you would like to check out the whole infographic, you can download it for free from our website.

