In the first post, we attempted to contextualize the problem of Big Data, and in this second part, we’ll take on the job of explaining why document capture is the first step toward a solution.
Go to part one
The first step: Make the information embedded in our digital content accessible
Computers are machines. They can’t “understand” or give context to the content of our digital assets. If the content of a digital asset, or the data that describe the content and nature of that asset (its metadata), isn’t added to something like an index table in our applications, it’s impossible for machines to find relationships between data and to put things in context the way we humans would like to with our digital content. Nowadays we debate whether relational databases will be able to handle these index tables in the immediate future, when we haven’t even taken the first step of indexing the existing content.
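To make the idea of an index table concrete, here is a minimal sketch in Python. The field names and values are purely illustrative, not taken from any particular product: the point is only that once metadata lives in an index, an application can relate assets without re-reading their content.

```python
# Minimal sketch of an index table: metadata entries that let an
# application find digital assets without re-reading their content.
# All names and values here are illustrative assumptions.
index_table = [
    {"asset_id": 1, "type": "invoice", "customer": "ACME Corp."},
    {"asset_id": 2, "type": "contract", "customer": "ACME Corp."},
    {"asset_id": 3, "type": "invoice", "customer": "Initech"},
]

def find_assets(**criteria):
    """Return the ids of assets whose metadata matches all criteria."""
    return [row["asset_id"] for row in index_table
            if all(row.get(key) == value for key, value in criteria.items())]

print(find_assets(type="invoice", customer="ACME Corp."))  # [1]
```

In a real system this role is played by a database index rather than an in-memory list, but the principle is the same: the machine relates assets through their metadata, not through the raw content.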
So, we’ve taken on the work of digitizing everything we don’t yet have in digital format, and we’ve run up against the same problem. If we can’t let the machine access the real information our digital assets contain, we’re just throwing them into a bottomless pit. We’re forgetting that a scanned document is nothing more than an image, which our human brains can read -- but the same isn’t true of our machines’ processors.
We can use OCR software to take care of part of this problem: it adds the content of our documents and digital assets to the index table of our applications. But when specific data need to be shared or retrieved for use in particular applications, plain OCR forces us to do a huge amount of work searching for those data within the vast content of our digital assets. Why not make these data easier to access by treating them as metadata? For example, if our accounting software needs to know the number of each invoice, why make the software search for that number inside every invoice, every time it’s required? Wouldn’t it be easier to find it once and store it as metadata of the digital asset?
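The invoice example can be sketched in a few lines of Python. The OCR text, the label "Invoice No:", and the function name are all assumptions for illustration; in practice the text would come from an OCR engine such as Tesseract, and the extracted value would be written to the asset's metadata in a document repository.

```python
import re

# Hypothetical OCR output for a scanned invoice (in practice this text
# would be produced by an OCR engine, e.g. Tesseract).
ocr_text = """ACME Corp.
Invoice No: INV-2013-0042
Date: 2013-05-14
Total: 1,250.00 EUR
"""

def extract_invoice_number(text):
    """Pull the invoice number out of OCR'd text once, so it can be
    stored as metadata instead of being re-searched on every access."""
    match = re.search(r"Invoice No:\s*(\S+)", text)
    return match.group(1) if match else None

# Store the extracted value alongside the digital asset as metadata;
# from now on the accounting software reads this field directly.
asset_metadata = {"invoice_number": extract_invoice_number(ocr_text)}
print(asset_metadata["invoice_number"])  # INV-2013-0042
```

The design choice is simple: pay the extraction cost once at capture time, then every later lookup is a cheap metadata read instead of a full-text search.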
Right, so we’ve fixed the problem of fast access to specific data contained within our digital assets, but there’s still one more problem to solve. Given the quantity of documents we receive every day, is it viable to spend time searching through each of them for data? No; if we’re talking about Big Data, it certainly isn’t. But if we’ve already managed to get the machine to read the content inside our scanned documents, why not also get it to extract the data we need and make them readily accessible?
That, folks, is the key: getting the machine to work for us, and that’s the first place we should invest our money. Once we know how to get the most out of the data in our digital content, we’ll discover that the problem of Big Data becomes more a question of hardware, because we can provide the software with the entry point to the information it needs. Document Capture Software is just the beginning, but without it there’s nowhere to go in dealing with Big Data.
Document Capture Software: Athento
Comparing Document Capture Solutions (Athento, Kofax, Ephesoft etc.)
Document Management success story at BBVA: managing 7 million records.