Monday, October 3, 2011

Use Case for Autotagging in Athento iDM: From 8 minutes to 8 seconds to extract important information from a resume

This is not the first time we have told you about our labeling module. For those who do not know it yet, labeling, as we call it for short, is one of the features included in Athento iDM, our intelligent document management solution for enterprise content management. Basically, this tool works together with OCR to extract the keywords of a document, which can then be used as labels (tags) to help us find documents from a tag cloud. It is a way to create quick access to documents that share a certain theme.

Is that all? Yes. Simple, right? Simple, but extremely useful. We will explain it with an example, which is the best way to understand how to take advantage of something: putting it to use!

Take the case of a temporary employment agency such as Manpower, or any company (virtual or physical) or HR department dedicated to providing companies with qualified personnel to fill vacancies.

Such companies often receive curricula vitae (or resumes, to our American readers) on paper or as digital files. Some, especially those that are web based, make the candidate fill in the data the application needs to match applicants with vacancies according to their qualifications. However, they still let candidates attach their own resume as a file, because they know a resume holds much more information than can be collected through inflexible web forms.

Either way, getting the important information remains a manual, tedious and long process, even when it is external users who carry it out.

For example, filling in the first form on the InfoJobs portal (famous in Europe) takes a user accustomed to the web an average of 2 minutes (this form only collects the information for the account to be created), and the user still has at least 3 major sections to fill in (Studies, Experiences and Future Use). At the very least, the whole process will take a user 8 minutes.

Americans (who know a lot about web usability and many other topics) know that this is long enough to lose many users. LinkedIn is a wonderful example of how to reduce the time a user takes to complete a resume. LinkedIn lets users upload a resume in PDF, Microsoft Word or other formats to complete their profiles. The application extracts data from the resume and adds it to the user's profile. We will not study the effectiveness of this particular tool here; let's just say that in most cases it helps to complete a resume.

In employment agencies and Human Resources departments, it is even more common for an employee to have to extract the information from resumes in paper or digital format.

If, for example, 50 resumes are received daily by any route (e-mail attachments, paper, attached to a created profile, etc.), and we assume that extracting the important information from a resume takes an employee as long as it would take a user on a job portal, we are talking about a little over 6.5 hours consumed daily by this process.
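The figure above can be checked with a quick calculation. This is only a sketch of the arithmetic, using the 50 resumes per day and 8 minutes per resume assumed in the text; the function name is made up for illustration.

```python
def daily_hours(resumes_per_day, minutes_per_resume):
    """Hours an employee spends per day extracting resume data by hand."""
    return resumes_per_day * minutes_per_resume / 60


# Figures taken from the example in the text: 50 resumes, 8 minutes each.
hours = daily_hours(50, 8)
print(f"{hours:.2f} hours per day")  # 400 minutes, about 6.67 hours
```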

And when we want to find someone to fill a position? Companies that have digitized their candidates' data have it a little easier: their applications should offer a way to query the database and cross position requirements with candidate skills. However, we still have the problem that, in many cases, the most comprehensive information is found in the resumes attached as files to a profile. In companies where resumes are still handled on paper, someone will have to review these documents one by one to find out whether they meet a certain requirement.

So we have two problems: obtaining information from resumes is still a manual job that takes up too much time, and quick, accurate access to the candidates who possess a given knowledge or skill is not an efficient process (and sometimes not even a useful one). Let us now look at how someone with Athento iDM could dramatically improve both processes using the OCR and autotagging modules. Let's see it step by step.

1. Obtaining and indexing the entire content of the resume
Through its OCR engine (Tesseract), Athento extracts data from image-based files (TIFF, PNG, PDF, DOC, XLS, GIF, JPEG). Extracting data from text documents (.doc, .odt, etc.), which are not images, poses no problem either and is fast. This process is almost immediate (it takes a few seconds per document) and, best of all, it is transparent to the user: all you have to do is upload a file to the repository (by scanning a resume, emailing it, adding it through WebDAV drag and drop, etc.). We go from 8 minutes to get all the data of a resume to no more than 5 seconds. The OCR used in Athento has an average success rate in data extraction of 96%.
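To put those two figures side by side, here is a rough comparison under the assumptions already used in this post (8 minutes by hand, at most 5 seconds with OCR, 50 resumes a day). The constant and function names are illustrative only.

```python
MANUAL_SECONDS = 8 * 60   # 8 minutes per resume, from the job portal estimate
AUTOMATED_SECONDS = 5     # upper bound quoted for automatic extraction
RESUMES_PER_DAY = 50      # daily volume assumed in the example


def speedup(manual_seconds, automated_seconds):
    """How many times faster the automated extraction is."""
    return manual_seconds / automated_seconds


def daily_minutes(resumes_per_day, seconds_per_resume):
    """Total minutes per day spent on extraction."""
    return resumes_per_day * seconds_per_resume / 60


print(speedup(MANUAL_SECONDS, AUTOMATED_SECONDS))         # times faster per resume
print(daily_minutes(RESUMES_PER_DAY, MANUAL_SECONDS))     # minutes per day by hand
print(daily_minutes(RESUMES_PER_DAY, AUTOMATED_SECONDS))  # minutes per day with OCR
```

With these numbers, a day's worth of resumes drops from 400 minutes of manual work to a little over 4 minutes of processing.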

2. Generating Labels
Athento iDM uses its Autotagging module to search the indexed content for the most relevant words. These words become labels that group all the documents containing them. For example, in a programmer's resume the word JAVA is relevant. It is important to note that a document also contains many words, such as articles and prepositions, that have no relevance. If we grouped under the same category, label or tag all the documents that contain, for example, the word "by", the group would almost certainly include every document in the repository, which would do us no good. This is what we call "Document Intelligence": Athento can reason about which terms are relevant within the content.
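To make the idea concrete, here is a minimal sketch of one way relevant words could be picked out. It is not Athento's actual algorithm, which this post does not describe. It assumes a small hypothetical stop-word list and a simple rule: a word that appears in almost every document (like "by") is useless as a tag. The names `STOP_WORDS`, `tokenize` and `suggest_tags` are invented for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller,
# language-specific one.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def suggest_tags(documents, max_doc_ratio=0.8, top_n=3):
    """Return up to top_n candidate tags for each document.

    documents maps a document name to its extracted text. A word is
    discarded if it is a stop word or if it appears in more than
    max_doc_ratio of all documents, since such a tag would group
    almost the whole repository.
    """
    tokens = {name: tokenize(text) for name, text in documents.items()}

    # Count in how many documents each word appears.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    limit = max_doc_ratio * len(documents)
    tags = {}
    for name, words in tokens.items():
        counts = Counter(
            w for w in words
            if w not in STOP_WORDS and doc_freq[w] <= limit
        )
        tags[name] = [word for word, _ in counts.most_common(top_n)]
    return tags


resumes = {
    "ana.txt": "Programmer with experience in Java. Java projects built by Ana.",
    "luis.txt": "Accountant with experience in payroll. Payroll audits done by Luis.",
    "eva.txt": "Programmer with experience in Python. Python tools written by Eva.",
}
print(suggest_tags(resumes))
```

In this toy repository, "by" and "experience" are dropped because every resume contains them, while "java" survives as a tag for the first resume.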

3. Searching content by tags
Continuing with the example of the word JAVA in a programmer's resume, clicking this tag in our tag cloud would bring up all the resumes of developers who have included that programming language in their skills and knowledge. Surely our tag cloud would also contain the tag "programmer", which would give us access, with just one click, to all the programmers whose resumes we hold. The search for candidates with particular knowledge is thus reduced to a simple click on a label. As an added bonus, Athento would offer a link to Wikipedia for each tag in the system, in case we want to know what a label means.
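One click on a tag amounts to a lookup in an index that maps each tag to the documents carrying it. The sketch below shows that idea with plain Python dictionaries. It is an illustration under assumed names (`build_tag_index`, `documents_for_tag`, `tag_cloud`) and says nothing about how Athento stores its tags internally.

```python
from collections import defaultdict


def build_tag_index(tags_by_document):
    """Invert {document: [tags]} into {tag: sorted list of documents}."""
    index = defaultdict(set)
    for document, tags in tags_by_document.items():
        for tag in tags:
            index[tag.lower()].add(document)
    return {tag: sorted(docs) for tag, docs in index.items()}


def documents_for_tag(index, tag):
    """Simulate a click on a tag: return every document labeled with it."""
    return index.get(tag.lower(), [])


def tag_cloud(index):
    """Return (tag, document count) pairs, most used tags first."""
    return sorted(
        ((tag, len(docs)) for tag, docs in index.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical tags, as an autotagging step might have produced them.
tags_by_document = {
    "ana.pdf": ["java", "programmer"],
    "eva.pdf": ["python", "programmer"],
    "luis.pdf": ["payroll", "accountant"],
}
index = build_tag_index(tags_by_document)
print(documents_for_tag(index, "JAVA"))
print(documents_for_tag(index, "programmer"))
print(tag_cloud(index))
```

Because the index is built once, when documents are tagged, each click costs only a dictionary lookup, however many resumes the repository holds.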

With this example we have seen how Athento iDM reduces the extraction of the information contained in a resume from 8 minutes to 5 seconds, and how it turns the search process, whether done by hand or through a form (as applications commonly offer), into a single click. We hope the example has proved illuminating.
