
21.2.2.3 How to transform from multiple-inheritance to simple-inheritance

Duplicating the boundary entities

This is a very simple way of transforming multiple inheritance into simple inheritance: the entities that inherit from more than one entity are duplicated, once under each of their parents.

For example, in the case of the butter multiple-inheritance classification shown in picture 24, the resulting tree classification will look like the one shown in picture 25.
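As a rough illustration of this first option (a sketch, not part of the original design), the classification can be held as a mapping from each entity to its parents; every entity with more than one parent is then copied once under each of them. The entity names follow pictures 24 and 25.

```python
# Minimal sketch, using invented Python structures: a classification is a
# mapping from each entity to the list of entities it inherits from.
multi = {
    "Dairy_product": ["Ingredient"],
    "Fat": ["Ingredient"],
    "Butter": ["Dairy_product", "Fat"],   # Butter inherits from two entities
}

def duplicate_boundary_entities(classification):
    """Turn multiple inheritance into simple inheritance by duplicating
    every entity that has more than one parent, once per parent."""
    simple = {}
    for entity, parents in classification.items():
        if len(parents) <= 1:
            simple[entity] = parents
        else:
            for parent in parents:
                # one copy of the entity under each of its former parents
                simple[f"{entity}_under_{parent}"] = [parent]
    return simple

print(duplicate_boundary_entities(multi))
# {'Dairy_product': ['Ingredient'], 'Fat': ['Ingredient'],
#  'Butter_under_Dairy_product': ['Dairy_product'], 'Butter_under_Fat': ['Fat']}
```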

Remaking the schema by turning the problematic criteria into attributes instead of classification criteria

For example, in the case of the butter, a possible solution is to eliminate the fat group and add an attribute (fat percentage, for example) to all the ingredients (an attribute stating whether an ingredient is a dairy product or not would also be possible; it all depends on the system purpose, as explained above).
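A minimal sketch of this attribute-based alternative (the class and attribute names are illustrative, not the final schema of this project): the Fat group disappears and the fat content becomes an attribute of every ingredient.

```python
from dataclasses import dataclass

@dataclass
class Ingredient:
    name: str
    fat_percentage: float = 0.0   # the former "Fat" group becomes an attribute
    # an is_dairy: bool flag could likewise replace the "Dairy product" branch

@dataclass
class DairyProduct(Ingredient):
    pass                          # single-inheritance path kept: Ingredient -> DairyProduct

butter = DairyProduct(name="butter", fat_percentage=82.0)
print(butter)   # DairyProduct(name='butter', fat_percentage=82.0)
```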

The simple-inheritance classification will look like picture 26:


Other examples of how to transform multiple-inheritance into simple-inheritance are shown in [Appendix-8].


A decision has to be made in order to reach the final classification: which criterion or criteria are the most suitable for the project purpose?

It has to be clear what the purpose of the project is: to extract (which will guide the IE process) and structure (which will determine the database in which the extracted information is going to be stored) information from recipe web pages. Since the ontology is the main structure behind these tasks, it has to fit the IE purposes.

[Pictures 24-26: classification trees for Butter, built from the entities Ingredient, Dairy product and Fat, with Fat_percentage appearing as an attribute of the ingredients]

As explained throughout this chapter, there are several possible classifications to describe the ingredients context. All of them are valid, but they are useful for different purposes. So the classification has to be selected according to the system requirement specifications [Chapter 13]. The important things this classification has to take into account are:

The kind of information the system will manage

The database constraints

The information extraction constraints.

The final classification is not presented here. Several classifications were selected and afterwards discarded as the characteristics of the database and the IE tools were discovered. In order to make this project more understandable and easier to read, I would like to first define these characteristics and present the final model afterwards.


This chapter will explain the choice of the technologies that fulfill the Information Extraction task based on the Ontologies approach.

The analysis and requirements specification phases have already stated the scope of the project, the domain, the objectives and the functionality. So it is now time to design the system and to plan how to accomplish this task.

As briefly introduced in chapter [12.1.2], Ontologies have several uses in the Semantic Web context. This chapter will study each one, highlighting the kind of system that is going to be implemented.

The integration of web resources [12.1.2] is not a feasible task within the recipes domain. It can be done within restricted domains that state some rules or standards about their contents. For example, in the World Heritage domain, if different pages about the same site are found, they can (and should) be integrated into a single one, using the Ontology if a middleware approach is followed, or consolidating the database by looking for duplicates if the data warehousing approach is implemented. However, in the recipes context duplicates cannot be treated as exactly the same thing. If more than one recipe is found with the same title, they cannot be fused into a single one, because they probably have different ingredients, different preparation steps, cooking times, etc., as different people have their own particular way of making the same recipe. The only possible duplicate management in the recipes context is to provide the user with a list of the recipes with the same title, the same kind of ingredients, the same cooking time, etc., and he/she will decide which one fits best.
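As a small, hypothetical illustration of this kind of duplicate handling (the recipe records below are invented), recipes sharing a field value are simply grouped and shown to the user rather than merged:

```python
from collections import defaultdict

recipes = [
    {"title": "Pancakes", "cooking_time": 20, "ingredients": ["flour", "milk", "egg"]},
    {"title": "Pancakes", "cooking_time": 35, "ingredients": ["flour", "butter", "egg"]},
    {"title": "Omelette", "cooking_time": 10, "ingredients": ["egg", "butter"]},
]

def group_by(records, field):
    """Group the recipes by the value of one field (title, cooking_time, ...)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    return groups

# Present every "Pancakes" variant and let the user pick the one that fits best.
for variant in group_by(recipes, "title")["Pancakes"]:
    print(variant["cooking_time"], variant["ingredients"])
```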

Restructuring current sites [12.1.2]: this feature has, to some extent, been implemented in this project. Once the relevant information has been extracted and structured, the user can make several queries, so different views of the same page can be presented as the result. It can be said that this utility is a consequence of the main purpose of this project: querying the web.

This master thesis project focuses on the Web Querying use of Ontologies. Its aim is to extract information from several web sites.

Now that the main task has been stated as extracting and structuring relevant information about the recipes context, the next thing to consider is how to accomplish this mission.

There are two different ways, described in the next chapter.


There are two ways of designing an IE system. The first one is a middleware approach: the information is extracted from the Web each time the user requests it. The second approach is to create a data warehouse: the information extraction is done only once, and all this information is stored in a structured way for subsequent use (each time the user makes a request, the system fetches the desired information from the storage device).

This project follows the second approach, the data warehouse: the Ontology will be used to guide the information extraction from some web pages, structure it and populate a database.
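A minimal sketch of the difference between the two designs, under invented helper names (fetch_page and extract_with_ontology are placeholders, and the sqlite table is illustrative only):

```python
import sqlite3

def fetch_page(url):
    # placeholder for an HTTP fetch; returns fake HTML for illustration
    return f"<html><h1>Recipe from {url}</h1></html>"

def extract_with_ontology(html):
    # placeholder for the ontology-guided extraction step
    title = html.split("<h1>")[1].split("</h1>")[0]
    return title, html

def middleware_answer(sources):
    """Middleware: every user request re-parses all the sources."""
    return [extract_with_ontology(fetch_page(url)) for url in sources]

def build_warehouse(sources):
    """Data warehouse: the sources are parsed once and the result is stored."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE recipe (title TEXT, body TEXT)")
    con.executemany("INSERT INTO recipe VALUES (?, ?)",
                    (extract_with_ontology(fetch_page(u)) for u in sources))
    con.commit()
    return con

def warehouse_answer(con, title):
    """Later requests become ordinary queries against the stored copy."""
    return con.execute("SELECT body FROM recipe WHERE title = ?", (title,)).fetchall()

con = build_warehouse(["http://example.com/pancakes"])
print(warehouse_answer(con, "Recipe from http://example.com/pancakes"))
```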

Both approaches have been taken into account. Finally, the warehousing approach has been chosen for several reasons:

The middleware approach wastes a lot of computational time each time a request is made: the entire web (or a certain domain, if the search is restricted) has to be parsed looking for the desired information.

In the warehousing approach the web pages are parsed only once, when the database is created.

Once they are stored in the database, the information extraction is faster and can be done with ordinary queries. But this approach also presents some inconveniences: some flexibility is lost, and some inconsistency problems may appear. If a web source changes, the database may still hold obsolete information; also, if a web page disappears or new interesting ones appear, the database will be behind the times. This should be taken into account, and the web should be tracked every so often following some defined criteria.
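One possible (hypothetical) way to track the web every so often is to keep a hash of each source page next to the stored data and flag the pages whose live content no longer matches it:

```python
import hashlib

def fetch_page(url):
    return f"<html>page at {url}</html>"   # placeholder fetch, as above

def stale_sources(stored_hashes):
    """Return the sources whose live content differs from the warehoused copy."""
    stale = []
    for url, old_hash in stored_hashes.items():
        new_hash = hashlib.md5(fetch_page(url).encode()).hexdigest()
        if new_hash != old_hash:
            stale.append(url)   # schedule this page for re-extraction
    return stale

stored = {"http://example.com/pancakes": "0" * 32}   # invented previous hash
print(stale_sources(stored))                          # -> the pages to re-parse
```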

Besides this inconvenience, the data warehouse approach is the most common method for implementing information extraction from the web (as I could see in all the articles consulted).

( "&-00

Now the way of proceeding is clear: extract information, structure it, store it and retrieve it. The next step is to design how to accomplish these tasks using an Ontology. Can it help guide this process? The next chapter will explain how to fulfill this task:

[Figure: Information Extraction turns unstructured web pages into structured information]
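A toy sketch of the idea in the figure (the concept list and the patterns are invented, not the actual extraction rules of this project): each ontology concept drives what is pulled out of an unstructured page and where it goes in the structured record.

```python
import re

ontology_concepts = {
    "title": r"<h1>(.*?)</h1>",
    "cooking_time": r"(\d+)\s*minutes",
    "ingredients": r"<li>(.*?)</li>",
}

def extract(html):
    """Fill one slot per ontology concept from an unstructured page."""
    record = {}
    for concept, pattern in ontology_concepts.items():
        matches = re.findall(pattern, html)
        record[concept] = matches if len(matches) > 1 else (matches[0] if matches else None)
    return record

page = "<h1>Pancakes</h1><li>flour</li><li>milk</li>Bake for 20 minutes."
print(extract(page))
# {'title': 'Pancakes', 'cooking_time': '20', 'ingredients': ['flour', 'milk']}
```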

At the present time, there are two ways of storing knowledge: the classical approach of a relational database, or the use of a semi-structured database.

A classic relational database always has the same structure: a set of tables; each table is a set of records, each record is a set of fields, and each field consists of a name/value pair.

Every record of the same table has the same number and type of fields. Relational databases have a fixed structure given by the ER diagram, and therein lies their main drawback: their rigid structure.

In a relational database approach, data integration among different sources might be a difficult task because of this rigid structure.
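A small illustration of this rigidity (the table and values are invented): every row of a relational table must match the declared schema exactly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE recipe (title TEXT, cooking_time INTEGER, servings INTEGER)")

# Every row must supply exactly the declared fields; extra fields are rejected.
con.execute("INSERT INTO recipe VALUES (?, ?, ?)", ("Pancakes", 20, 4))
try:
    con.execute("INSERT INTO recipe VALUES (?, ?, ?, ?)", ("Omelette", 10, 1, "easy"))
except sqlite3.OperationalError as err:
    print("rejected:", err)   # table recipe has 3 columns but 4 values were supplied
```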

Semistructured databases are the newest generation of databases.

In semistructured databases the structure is more flexible, as they have more freedom than relational ones because of their only partially specified structure. A semistructured database might be incomplete: some records can have missing or incomplete fields, which is not possible in a relational database.

Another nice feature of semistructured databases that relational ones do not have is that they are web-oriented. Their main purpose is to ease the exchange of information through wide area networks, especially through the Internet. Because of this, the data are stored together with their structure so that they can be easily exchanged over the net. In this way, the data embed their meaning.

With semistructured data it is easy to integrate different documents, since the data are coupled with information regarding their meaning.
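A tiny illustration of this flexibility (the recipe fragments are invented): two records with different, partly missing fields can live in the same semistructured document and still be read back uniformly, because each record carries its own tags.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<recipes>
  <recipe><title>Pancakes</title><cooking_time>20</cooking_time></recipe>
  <recipe><title>Omelette</title><servings>1</servings></recipe>
</recipes>
""")

for recipe in doc:
    # each record carries its own structure; fields it lacks are simply absent
    print({child.tag: child.text for child in recipe})
# {'title': 'Pancakes', 'cooking_time': '20'}
# {'title': 'Omelette', 'servings': '1'}
```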

The weak point of the semistructured data approach is its immaturity. Standardization is still in the definition phase: there are some standards, but there is still a lot of work to do in this field. Thus the available tools for this emerging approach do not always follow the same specifications. Another problem is that, since all the available tools are quite new, the performance parameters (like response time) are worse than those obtained with relational database tools.

Semistructured databases have several nice features, like their flexibility or their integration facilities, but the most important one for this project's purpose is their web orientation. By storing the information in a semistructured database, it can be easily exchanged all over the Web. The data and their meaning can travel around, helping to build the next Web generation.

The XML language is the one chosen to represent the information in the database. This tagged and flexible language has several nice properties that make it particularly useful in some applications.
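As a final, hypothetical example of what this representation could look like (the element names are illustrative, not the final schema of this project), an extracted record can be serialized to XML so that the data travels together with its meaning:

```python
import xml.etree.ElementTree as ET

record = {"title": "Pancakes", "cooking_time": "20", "ingredients": ["flour", "milk"]}

recipe = ET.Element("recipe")
for field, value in record.items():
    for item in value if isinstance(value, list) else [value]:
        ET.SubElement(recipe, field).text = item   # the tag name carries the meaning

print(ET.tostring(recipe, encoding="unicode"))
# prints (wrapped here for readability):
# <recipe><title>Pancakes</title><cooking_time>20</cooking_time>
# <ingredients>flour</ingredients><ingredients>milk</ingredients></recipe>
```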