Monday, November 28, 2016

Elasticsearch fine tuning for optimal ECM performance in Nuxeo

One of the key elements of any Enterprise Content Management tool is its ability to scale. At Athento, we base our core ECM on Nuxeo, precisely for scalability reasons.

In October of 2014, Nuxeo reached an important milestone in ECM scalability and performance, achieving One Billion Documents with Elasticsearch and PostgreSQL as the database.

Alfresco, another key player in Enterprise Content Management and perhaps one of the leaders according to Gartner, announced a similar milestone, but only a full year later (October 2015).

Nuxeo has kept making progress ever since, innovating by adding NoSQL support through MongoDB. But recently we found an article named "Goodbye MongoDB hello PostgreSQL" about transitioning back from Mongo to Postgres once again.

The Nuxeo documentation gives some hints on tuning an Elasticsearch node, including giving the Elasticsearch heap half of the machine's total memory (RAM): in its example, 6g on a machine with 12g in total. This post, however, is mainly about setting up Page Providers to query Elasticsearch instead of a less efficient SQL database. In our experience, this can bring slow queries of 5, 7 or 10 seconds down to under 1 second, typically in the range of 0 to 200 ms.
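As a side note on the heap sizing rule quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 31 GB upper bound is an assumption taken from general Elasticsearch guidance on keeping the heap below the compressed-pointers threshold, not from this post. The output uses the standard JVM -Xms/-Xmx flags; where those flags are set depends on your Elasticsearch version.

```python
def heap_size_gb(total_ram_gb: int, cap_gb: int = 31) -> int:
    """Half of the machine's RAM, never above cap_gb and never below 1.

    cap_gb is an assumption (general Elasticsearch guidance), not a Nuxeo rule.
    """
    return max(1, min(total_ram_gb // 2, cap_gb))


def jvm_heap_flags(total_ram_gb: int) -> str:
    """Standard JVM flags that pin the minimum and maximum heap to the same size."""
    size = heap_size_gb(total_ram_gb)
    return f"-Xms{size}g -Xmx{size}g"


if __name__ == "__main__":
    # The 12g machine from the example: half of it goes to the heap.
    print(jvm_heap_flags(12))  # -Xms6g -Xmx6g
```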

The documentation covers this in "How to make a Page Provider or Content View Query Elasticsearch", but Page Providers can also be set up from the Administrator Panel:

Scrolling all the way down, you can select the "Advanced setup" button here:

You'll be warned that the advanced setup is intended for advanced users.

Inside the advanced setup area, scroll down to the elasticsearch.override.pageproviders property.

By default only "default_search" appears, but you can add most providers. Here is a complete list based on our experience:

elasticsearch.override.pageproviders=default_search,document_content,section_content,tree_children,default_document_suggestion,simple_search,advanced_search,nxql_search,DEFAULT_DOCUMENT_SUGGESTION,GET_TASKS_FOR_ACTORS,GET_TASKS_FOR_PROCESS,GET_TASKS_FOR_PROCESS_AND_ACTORS,GET_TASKS_FOR_PROCESS_AND_NODE,GET_TASKS_FOR_TARGET_DOCUMENT,GET_TASKS_FOR_TARGET_DOCUMENT_AND_ACTORS,GET_TASKS_FOR_TARGET_DOCUMENTS,GET_TASKS_FOR_TARGET_DOCUMENTS_AND_ACTORS,GET_TASKS_FOR_TARGET_DOCUMENTS_AND_ACTORS_OR_DELEGATED_ACTORS,SAVED_SEARCHES,user_sections,user_workspaces,user_documents,user_favorites,domain_published_documents,GET_TASKS_FOR_ACTORS_OR_DELEGATED_ACTORS,domain_documents

One of the few Page Providers NOT to use with Elasticsearch is the one used by the content view orderable_document_content.

This is because ordering involves reindexing the position of documents, which has a high cost.
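A comma-separated value as long as the elasticsearch.override.pageproviders list is easy to mistype when edited by hand. As a minimal sketch (the helper names are hypothetical and not part of Nuxeo), a few lines of Python can trim stray spaces and drop repeated names before the line is pasted into the configuration. The comparison is case-sensitive, on the assumption that names such as default_document_suggestion and DEFAULT_DOCUMENT_SUGGESTION refer to different providers.

```python
PROPERTY = "elasticsearch.override.pageproviders"


def normalize_page_providers(raw: str) -> str:
    """Trim whitespace, drop empty entries and exact duplicates, keep the order."""
    seen = set()
    names = []
    for name in raw.split(","):
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return ",".join(names)


def to_conf_line(raw: str) -> str:
    """Build the full property line from a hand-edited list of provider names."""
    return f"{PROPERTY}={normalize_page_providers(raw)}"


if __name__ == "__main__":
    # A short, deliberately messy sample using names from the list in this post.
    sample = "default_search,document_content, section_content,document_content, tree_children"
    print(to_conf_line(sample))
```

Keeping the original order makes the cleaned line easy to compare against the one you started from.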

To summarize: running Elasticsearch as a separate instance on its own physical or virtual machine (using the embedded version that Nuxeo provides by default is not recommended) and adding the Page Providers discussed above will improve performance a great deal.


Thursday, February 11, 2016

Athento integrates its document capture software with Alfresco 5.x

Now you can export processed documents from Athento’s document capture module to Alfresco Enterprise and Community editions.

Thanks to Athento’s document capture module, Athento Smart Engine, auto-classified documents can be sent to Alfresco with their metadata already extracted.

San Jose, CA. February 10, 2016

Athento Smart Engine, the document capture software by Athento, is now integrated with Alfresco. The integration, available for versions 5 and up, will provide Alfresco ECM software users with a data capture application for carrying out automatic operations related to digital document handling, including dividing batches, automatically classifying documents, OCR and data extraction.

Integration is carried out using CMIS 1.0, which lets users export documents and data from Athento to the Alfresco platform.

According to the Athento CEO Jose Luis de la Rosa,

“The benefit of using CMIS as the vehicle for integration is that it allows integration to become easily configurable and can potentially be used for any version of Alfresco that supports CMIS”

Athento’s Documentation Center now provides information on how to integrate Athento and Alfresco. The integration between the two platforms means storage folders and routes in Alfresco can be dynamically defined, according to metadata values or document types. It also allows data extracted from documents to be sent to Alfresco.

In addition to the CMIS integration, Athento Smart Engine has an API that provides access to its features as services. Athento SE is available as a cloud service and also for on-premise deployments.

About Athento:

Athento incorporates cutting-edge technology such as Machine Learning, Semantics and Image Processing to automate processes related to working with document capture, document management, storage and all those operations needed to cover the complete life cycle of documents. Athento currently works with more than 100 clients in Europe, Africa and the Americas. It also works with a wide-reaching network of authorized partners, and is the product that has been chosen by Barclaycard, Reed Elsevier, Leroy Merlin, Yellow Pages and the Spanish General Traffic Directorate to manage documents.


Tuesday, December 30, 2014

Tuesday, September 30, 2014

The driverless ECM

Humans have been convinced for years that the car industry was really advanced. We have been proud of "smart" cars that were able to read traffic signs, warn about short distances and tell you to buckle up.

That was until Google presented its driverless car in 2011, when we all saw real automation in the car industry. This is a really smart car and a real milestone for an industry, giving humans the possibility to move without the manual tasks that lead to many errors and dangerous situations. Since the Motorwagen was invented in 1886, it has taken the industry about 125 years to reach this magic.

The ECM industry is evolving really fast, but we are still anchored in "smart" features that are really far from the real "driverless ECM".

Current ECM solutions need you to drive them throughout their whole life to get anywhere you want to go:
  • They need you to create, and sometimes code, your document types
  • The same goes for metadata, types of metadata, schemas, layouts, etc.
  • They require you to build the whole hierarchy and structure of your document organization
  • And you have to agree on all of this with your company’s team to create an effective organization of the brand new ECM solution :-)

Many of you surely know what I mean. Setting up an ECM system in any organization is like driving the old Route 66 in a Chevrolet Bel Air. Nice, but with zero automation.

Many experts would call the ECM market a mature one, but until we get real automation, we can only compare it to the mature industry of manually driven cars.

The current ECM industry lacks a few gadgets for the driverless challenge:

A high-definition map of the document types, metadata and features that make up the ECM world.

An important step towards this milestone will take us through a deep process of “feature engineering”, whose result would be a “knowledge base” from which any existing ECM could take information, allowing users to reuse document types, metadata and overall configuration, and even proposing these to them.

A good light radar to measure the distance (and other things) between documents, metadata and features.

In ECM operations we commonly ask “are these documents the same?”, “are they similar?”, “do they have the same value for some metadata?”. Many of these questions are typical searches for any ECM user. The ECM industry still lacks standards for comparison algorithms and solutions.

A very precise range finder laser to model the real world around a document.

This is what we call “Document Analysis Systems” today. Really incredible work has been done here, but the only solution that seems truly industrialized is OCR. For layout recognition, language detection, decoding and some other features there are solutions, but none of them really reaches this milestone yet.

Some computation pieces to decide where to go and what to do with the documents.

The ECM and BPM worlds have a permanent affair with some ups and downs. ECM involves really complex workflows, not because of the workflows themselves but because of the elements that interact: security issues, external and internal users, metadata updates, tasks and the many side effects that take place when a document goes from one state to another.
The ECM industry really needs to create a “driverless experience” in this affair. Users deserve to be able to count on existing workflows, on ECMs that suggest a workflow to use, a route to take. Imagine your ECM telling you “This is an invoice, do you want to send it to the accounting department?”

A remote farm of computers to do the complex tasks

The driverless car has been possible thanks to many technologies that now exist, and one of them is the ability to do distributed computing for complex tasks. If the driverless car is doing it now, the ECM world must do it now. Some of the “smart” features that we expect from a driverless ECM really require a huge amount of computing: lots of image processing, lots of text processing and lots of machine learning that need to be done outside. The ECM world needs a standard for offloading heavy tasks so that it can focus on the driverless experience of the user.

Real automation is still very far from ECMs. How long will it take us to see the driverless ECM?

Jose Luis de la Rosa
CEO at Athento 


Tuesday, August 26, 2014

Press Release: Athento Becomes an HP Silver Partner

Today we want to share some awesome news!!!

Miami, August 26th, 2014: Athento, the smart document management software, has become part of Hewlett-Packard’s prestigious Solutions Business Partner (SBP) program, as a Silver Partner for Europe and the Middle East (EMEA). (Source: Yerbabuena Software Inc.)

By means of its Solutions Business Partner Program, Hewlett-Packard is looking to expand the value it offers its customers through its LaserJet multifunction turnkey solutions that are integrated with its technology. HP Silver partners are selected according to their business potential, technology, their contributions to HP’s product line, their presence in the market and their development capabilities.

After having gone through the selection process of the program during the last quarter, Athento joins this select list of software providers that work with HP in EMEA. The program also includes other companies, such as OpenText.

Thanks to this partnership, customers of the LaserJet Flow (MFP) range can count on an embedded document capture client that enables them to enjoy the intelligent functionality of Athento’s Smart Engine. That means that these customers can carry out operations from their scanners, such as automatically classifying documents, extracting semantic tags from a document’s content or extracting metadata.

Access to these features is available directly from HP touch screen devices, and is linked to user accounts in Athento’s cloud service.

According to José Luis de la Rosa, 

"For us, becoming a Silver partner means great support for the work we have done in the field of document capture. From now on, customers of HP’s Flow range will be able to enjoy Athento smart technology with the touch of a button."


Monday, July 28, 2014

Athento Applies Intelligence to Correspondence Management in Public Entities

Athento, the smart document management software, helps provide public administrations with digital correspondence management that covers the entire life cycle of official correspondence

Thanks to the growing interest shown by government agencies of all levels, in various countries, in the digital management of official correspondence, the company that created Athento has issued a series of case studies and use cases to help such entities learn how they can solve their challenges in managing their mail.

Athento has helped several customers in Spain with receiving, sorting, opening, routing, controlling, and distributing incoming and outgoing mail. One of its clients in Spain is CEDER La Serena, a public agency dedicated to rural development. José Luis de la Rosa, the company’s CEO, says: 

"What these companies want is to improve the time it takes them to respond to the public and other organizations, and Athento can help them do that."

Most customers who are still managing paper-based correspondence report that the most serious problem they have with these processes is that paper correspondence is slow to reach the official who has to deal with it. This leads to another problem: in many cases, public entities are required to meet strict response times. What’s more, routing all of this paper increases the risk that it could be lost.

What Athento is helping to make possible with its new version is complete management of the life cycle of this correspondence. Its functionality allows users to scan correspondence directly into the system and store the correspondence in a records management structure (under appropriate Classification Schemes and Business Retention Schedules). According to de la Rosa, Athento’s contribution is that:

 "it not only offers traditional document management functionality, but it also enables the automation of the capture of correspondence and preserving information within the context of records management."

Among the pieces of information that Athento is making available to government agencies that need to improve correspondence management is a use case which explains the challenge in detail and also shows how Athento can help them. This case study can be downloaded for free from the product’s web site.


Monday, June 2, 2014

Press Release: Athento helps the Andalusia Technology Park with the management of more than 36,500 documents

Andalusia Technology Park, one of the most prominent business parks in Spain, has entrusted the management of its documents to Athento, the smart document management software.

Popularly known as “Malaga Valley”, as a tip of the hat to the legendary California valley that is the home of the world’s most important tech companies, Andalusia Technology Park (known by its Spanish initials, PTA) is one of the important economic engines in the south of Spain. This space, which measures 222 hectares in area, is home to 600 businesses, including prominent names such as Oracle, Anovo, ADIF, Vodafone and China’s Huawei.

Athento helps the PTA with the organization and control of documents in its key areas. Having better control over documents is fundamental for the Park because it gives Park staff a more accurate idea about the life of their projects and the businesses they provide service to. Users of the system also point out that the system is easy to use when it comes time to search for and recover information. For the Park, working with Athento means having more than 36,500 documents that are accessible, safe and well-organized.

“Having notable clients such as PTA, and the important work they do for the region, is a source of pride for us,” says Jose Luis de la Rosa, the CEO of Yerbabuena Software, the company that created Athento. The technology firm, with offices in Spain and the United States, has published a report containing all the details of the PTA success story. The report can be accessed from the Athento web page.

About Athento:
Athento incorporates cutting-edge technology like machine learning, semantics and image processing to automate processes that are related to document capture, management, preservation, and all of the operations necessary to cover the entire life cycle related to documents. Athento has over 100 customers in Europe, Africa and the Americas. It also has a wide network of authorized partners, and is the product of choice for institutions such as Grupo Día, Leroy Merlin or the General Traffic Directorate of Spain to manage their documents.