
Information Extraction

PART V. APPENDICES

5.3 Information Extraction

Information Extraction (IE) is a method of filtering information from large volumes of text. It involves retrieving documents from relevant sources and tagging certain terms in the text. The desired output of an IE tool is a structured representation, such as a database, of selected types of information from the texts.

In this section some of the basic techniques and tools of Information Extraction are presented. The classification is not intended to be fixed or exhaustive, but simply basic and easy to understand; for more information about this wide field, refer to the specialised literature. The same applies to the tools classification that follows the techniques classification.

5.3.1 Why the interest in IE?

Usually, Web data is retrieved by browsing and keyword searching, which are intuitive forms of accessing data on the Internet. However, these search strategies present some limitations.

Browsing, for example, is not very suitable for finding particular items of data, because following the links is tedious and it is easy to get lost.

Keyword searching is usually more efficient than browsing, but it often returns vast amounts of data that the user can hardly handle, so a more structured output is needed when searching for information.

[Figure: comparison of an IR system, which applies a query to reduce the original document set to a smaller set of documents, and an IE system, which applies templates to the document set to produce filled template applications.]

Information Retrieval & Information Extraction Survey

During the last decade Information Extraction has become increasingly interesting, mainly due to the explosive growth of the Web. A vast amount of information is available on the Web today, but most of it exists only in natural language form, for example in newspaper articles. Extracting information from such natural language sources into traditional database form could provide easier access to information that is already present on the Web.

5.3.2 IE Techniques

Information Extraction relies on several different techniques, which are described in the following.

It is important to note that an IE tool would normally be tailored to perform well in one specific domain. A domain in this sense could be, for instance, a collection of newspaper articles, police reports, medical reports, a branch of scientific journals, or fiscal reports. In each of these domains information is structured in different ways, so the same IE tool would perform differently across them. An IE tool must, of course, be configured for the type of domain it is supposed to operate in.

Below is a brief description of the main techniques:

Pattern matching

In many Information Extraction systems most of the text analysis is performed by matching the text against regular expressions. If a text segment matches one of these regular expressions, it is given a label such as “name”, “time” or “place”.
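This kind of pattern matching can be sketched as follows. The label patterns and the sample sentence are illustrative assumptions, not taken from any real IE system:

```python
import re

# Hypothetical label patterns: each regular expression is paired with the
# label assigned to any text segment it matches.
PATTERNS = [
    ("time",  re.compile(r"\b\d{1,2}:\d{2}\b")),
    ("name",  re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+[A-Z][a-z]+\b")),
    ("place", re.compile(r"\bin\s+[A-Z][a-z]+\b")),
]

def label_segments(text):
    """Return (label, matched_text) pairs for every pattern match."""
    results = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            results.append((label, match.group(0)))
    return results

print(label_segments("Mr. Smith arrived in Copenhagen at 14:30."))
```

A real system would use far richer pattern sets, typically developed and refined per domain, but the principle of mapping regular-expression matches to labels is the same.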

Syntactic structure

The identification of the complete syntactic structure of a sentence is a difficult task, but identifying even partial syntactic structure can simplify the information extraction phase itself. The arguments to be extracted are often noun phrases and the relations between them, so it is very important to be able to identify noun groups in the text. Verb groups should also be identified, as these carry information on tense (past, present or future).
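Noun-group identification can be sketched as a small chunker over part-of-speech-tagged tokens. The tag set and the hand-tagged sentence below are illustrative assumptions; a real system would obtain the tags from a part-of-speech tagger:

```python
# A minimal noun-group chunker: collects maximal runs matching the
# pattern "optional determiner, any adjectives, one or more nouns".

def noun_groups(tagged):
    """tagged is a list of (word, tag) pairs; returns noun-group strings."""
    groups, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        first_noun = j
        while j < len(tagged) and tagged[j][1] == "NOUN":
            j += 1
        if j > first_noun:  # at least one noun: a valid noun group
            groups.append(" ".join(w for w, _ in tagged[i:j]))
            i = j
        else:
            i += 1
    return groups

sentence = [("the", "DET"), ("young", "ADJ"), ("analyst", "NOUN"),
            ("read", "VERB"), ("a", "DET"), ("report", "NOUN")]
print(noun_groups(sentence))
```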

Name recognition (NE)

Names appear very often in natural language texts, and identifying and classifying them as, for example, person names or place names is important for supplying argument values in many extraction tasks.

Names can be identified in several ways: either by using large dictionaries or by using a set of patterns common for names. Such patterns include capitalisation of the first letter of a name and a preceding title such as “Mr.”; companies can often be recognised by their final token, such as “Inc.”. It is also important to be able to identify aliases, i.e. to identify IBM as International Business Machines.

Name recognition systems that perform at nearly the level of manual name recognition are available today.
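The two strategies mentioned above, dictionary lookup (including alias resolution) and surface patterns such as company suffixes or capitalisation, can be combined in a small sketch. The dictionary entries and suffix list are illustrative assumptions:

```python
# A toy name recogniser: dictionary/alias lookup first, then surface
# patterns (company-suffix token, capitalised token sequence).
KNOWN_ALIASES = {"IBM": "International Business Machines"}
COMPANY_SUFFIXES = {"Inc.", "Corp.", "Ltd."}

def classify_name(tokens):
    """Classify a token sequence as a company, a person/place name, or unknown."""
    phrase = " ".join(tokens)
    if phrase in KNOWN_ALIASES:
        return ("company", KNOWN_ALIASES[phrase])   # alias resolution
    if tokens and tokens[-1] in COMPANY_SUFFIXES:
        return ("company", phrase)                  # suffix pattern
    if all(t[0].isupper() for t in tokens):
        return ("name", phrase)                     # capitalisation pattern
    return ("unknown", phrase)

print(classify_name(["IBM"]))
print(classify_name(["Acme", "Inc."]))
print(classify_name(["John", "Smith"]))
```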


Applying an ontology

Before explaining this technique, an important concept has to be defined for a better understanding. A minimalist introduction to ontologies is given below; more details about ontologies are given in further sections (see chapter 4).

What is an ontology?

The word "ontology" comes from the world of philosophy. It has also been used for a long time within the artificial intelligence and knowledge representation community.

In the context of knowledge sharing a short answer for this question could be that an ontology is a specification of a conceptualization. That is, an ontology is a description of the concepts and relationships that can exist for an agent or a community of agents.

This definition is certainly a different sense of the word than its use in philosophy.

Apart from other uses, ontologies help to guide and detail what kind of knowledge to harvest from unstructured text on the Web. They use concepts and relations for classifying domain knowledge; these are the basic elements of an ontology.

Ontology-based technique

To use this technique, an ontology has to be constructed beforehand to describe the data of interest, including relationships, lexical appearance and context keywords. Tools that use this technique parse the ontology to automatically produce a database by recognising and extracting data from web pages given as input. Before applying the ontology, it is necessary to automatically extract chunks of text containing the data “items” of interest.
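The idea can be sketched with a toy ontology in which each concept is described by its lexical appearance (a regular expression) and optional context keywords. The concept names, patterns, and input text are all illustrative assumptions:

```python
import re

# A minimal ontology-driven extractor: parse the ontology description
# and turn each text chunk into a database-style record.
ONTOLOGY = {
    "price": {"lexical": r"\$\d+(?:\.\d{2})?", "context": ["price", "cost"]},
    "year":  {"lexical": r"\b(?:19|20)\d{2}\b", "context": []},
}

def extract(chunk):
    """Produce one record (concept -> matched values) for a text chunk."""
    record = {}
    for concept, desc in ONTOLOGY.items():
        values = re.findall(desc["lexical"], chunk)
        keywords = desc["context"]
        # keep the matches only if a context keyword appears (or none is required)
        if values and (not keywords or any(k in chunk.lower() for k in keywords)):
            record[concept] = values
    return record

print(extract("Used car for sale, 2004 model, price $3500"))
```

Real ontology-based tools describe the data of interest far more richly (relationships between concepts, cardinalities, and so on), but the pipeline of parsing the ontology and emitting structured records is the same in spirit.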

NLP-based Tools

A couple of existing Natural Language Processing tools for Information Extraction will be mentioned here.

Robust Automated Production of Information Extraction Rules (RAPIER) is one such tool; it extracts information from free text. The input to the tool is the document from which to extract information and a filled template that tells the tool which data to extract. From this template RAPIER “learns” data extraction patterns to follow during information extraction. It is a “single-slot” tool, as it generates one record per document.

WHISK is another tool for extraction of information from natural language text.

WHISK starts out with an empty set of extraction rules. The user tags all information to be extracted in a series of training documents, and from this tagging WHISK induces a set of extraction rules to be used. It is a “multi-slot” tool, as it can create several records from one document.
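The single-slot versus multi-slot distinction can be illustrated with a small sketch. The rental-ad style document and the extraction rule (here just a regular expression) are illustrative assumptions, not WHISK's actual rule language:

```python
import re

# One document containing two logical records.
DOC = "Sunny flat, 2 rooms, $900. Cosy studio, 1 room, $600."

RULE = re.compile(r"(\d+) rooms?, \$(\d+)")

def single_slot(doc):
    """One record per document: only the first match is kept."""
    m = RULE.search(doc)
    return {"rooms": m.group(1), "price": m.group(2)} if m else None

def multi_slot(doc):
    """Several records per document: every match becomes a record."""
    return [{"rooms": r, "price": p} for r, p in RULE.findall(doc)]

print(single_slot(DOC))
print(multi_slot(DOC))
```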


5.3.3 Evaluation metrics for Information Extraction

Much of the research on IE in the last decade has been connected with the MUC (Message Understanding Conference) competitions, which were sponsored by the Defense Advanced Research Projects Agency (DARPA) from 1991 to 1998. In these competitions the participants compared their results with each other and against human annotators’ key templates. In the process, many IE systems and methods for the formal evaluation of IE systems were developed (some of them are still in use by the US government).

So it is not surprising that the MUC evaluation metrics of precision and recall still tend to be used, with slight variations. These metrics have a very long tradition in the field of IR [22].

“Precision measures the number of correctly identified items as a percentage of the number of items identified [22]”. In other words, it measures how many of the items that the system identified were actually correct, regardless of whether it also failed to retrieve correct items. The higher the precision, the better the system is at ensuring that what is identified is correct.

There is another metric, called error rate, which is the inverse of precision: it measures the number of incorrectly identified items as a percentage of the items identified. It is sometimes used as an alternative to precision.

“Recall measures the number of correctly identified items as a percentage of the total number of correct items [22]”. In other words, it measures how many of the items that should have been identified were actually identified, regardless of the number of false identifications made. The higher the recall, the better the system is at not missing correct items.

Obviously, there must be a balance between these two rates: a system can easily achieve very high precision by identifying only the few items it is most certain of (and so making almost no mistakes in what it identifies), or 100% recall by identifying everything (and so not missing anything).
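The two metrics can be computed directly from the set of items a system identified and the human-annotated key, in the MUC style described above. The item sets below are illustrative:

```python
# Precision and recall over sets of extracted items.

def precision_recall(identified, correct):
    """Both arguments are sets of items; returns (precision, recall)."""
    true_positives = len(identified & correct)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall

identified = {"IBM", "Paris", "2004", "banana"}   # system output
correct    = {"IBM", "Paris", "London", "2004"}   # annotators' key

p, r = precision_recall(identified, correct)
print(f"precision={p:.2f} recall={r:.2f}")  # 3 of 4 correct on both counts
```

Note the trade-off: shrinking the identified set toward only certain items pushes precision up at the cost of recall, and vice versa.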


6 Annotations Survey

Please note that the annotations survey presented in this chapter is carried out using the GATE software tool as a basis. This survey is part of the research phase carried out during the thesis and has been included in this document to give the reader a better understanding of the tools involved in the system development.