
6.3.3.2 Composition properties: Transitivity

What makes composition relationships so important is the transitivity property they sometimes have. Additional knowledge can be inferred from a transitive relationship:

Definition: “If A has a relationship with B and B has a relationship with C then we can infer that A has that relationship with C.”

The composition relationship is sometimes transitive, but we have to check carefully whether both premises are based on the same kind of composition relationship.

If the premises use the same kind of composition relationship, they usually produce a valid composition-related conclusion (although it can occasionally still be wrong).

Premises that use two different kinds of composition relationship usually lead to a wrong conclusion, although it can sometimes be right.

Example 1:

An ingredient is partly vitamins (material-object composition)

An ingredient is part of a course (component-integral object)

Conclusion: A vitamin is part of a course. This conclusion is wrong because, although some vitamins are part of the final dish, others are lost during cooking.

Example 2:

The loaf is partly flour (material-object)

A slice of bread is part of a loaf of bread (portion-object)

Conclusion: A slice of bread is partly flour. This conclusion is correct even though the premises use different kinds of relationships.

The transitivity property is very useful for propagating operations along composition relationships.

If some ingredients are part of the ingredient description, and that description is part of a recipe, it can be concluded by the transitivity property that the ingredients are part of the recipe.

When the composition relationship is the same, the propagation can be made through as many levels as we want. But if the kinds are not the same, each level has to be studied carefully for validity.
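The following minimal sketch illustrates this rule for part-of facts. The PartOf structure and the example facts are illustrative assumptions, not the project's actual data model; inference only fires when both premises share the same kind of composition relationship.

from collections import namedtuple

# Illustrative representation of a part-of fact (not the project's model).
PartOf = namedtuple("PartOf", ["part", "whole", "kind"])

facts = [
    PartOf("ingredient", "ingredient_description", "component-integral"),
    PartOf("ingredient_description", "recipe", "component-integral"),
    PartOf("vitamin", "ingredient", "material-object"),
]

def propagate(facts):
    """Infer 'A part-of C' from 'A part-of B' and 'B part-of C' of the same kind."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for a in list(inferred):
            for b in list(inferred):
                if a.whole == b.part and a.kind == b.kind:
                    new = PartOf(a.part, b.whole, a.kind)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

for f in sorted(propagate(facts)):
    print(f"{f.part} is part of {f.whole} ({f.kind})")
# 'ingredient' is inferred to be part of 'recipe' (same kind in both premises);
# the vitamin fact is not propagated further because its kind differs.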

When modelling the domain, the first step is to identify the different kinds of relationships described above. This helps to establish the utility of these relationships (to identify whole-part associations and to help us while developing the system).

The inheritance relationship is modelled in the diagrams as the IS-A relationship. This is normally the backbone of a diagram; the other kinds of relationships can then be added in order to complete the taxonomy. This approach, presented in [Andreasen et al., Nilsson 2001], is to divide the domain using the IS-A structure as the backbone of the classification, and afterwards to divide the resulting groups based on other kinds of relationships, such as partitive, causative, locative, temporal, etc.

The parent elements have the common basic structure and characteristics that their children will inherit. The child elements have their parents’ characteristics plus their own (they specialize their parent elements).

In this process the most general top category is analyzed and progressively specialized into more concrete concepts, ending up in the leaves of the ontology, which are the instances of the lowest entities in the hierarchy they inherit from.

Example:

Ingredient → animal_origin → sea_animal_origin → fish → flesh → salmon

The sequence above represents a path down through the ontological ingredient structure, where the entity ingredient is the top element in the hierarchy and salmon is an instance of the last entity, flesh.
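A minimal sketch of such a path, assuming the hierarchy is stored as a simple parent map (the is_a dictionary below is an illustrative assumption mirroring the path above):

is_a = {
    "salmon": "flesh",                      # instance -> lowest entity
    "flesh": "fish",
    "fish": "sea_animal_origin",
    "sea_animal_origin": "animal_origin",
    "animal_origin": "ingredient",          # top element of the hierarchy
}

def path_to_top(entity, is_a):
    """Follow IS-A links from an instance up to the top of the hierarchy."""
    path = [entity]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return path

print(" -> ".join(reversed(path_to_top("salmon", is_a))))
# ingredient -> animal_origin -> sea_animal_origin -> fish -> flesh -> salmon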

But the IS-A relationship has to be used carefully in order to keep the diagram consistent; a very common mistake while developing these kinds of diagrams is to make a class inherit from a class whose characteristics differ from its own.

Summing up: the most important kinds of relationships are inheritance and composition, and these are the ones on which the ontology model is based.

Course classification:

course
    beverage
        cocktail
        infusion
        juice
        milk_shake
    dessert
        non_sweet_dessert
        sweet_dessert
    fish_course
        fatty_fish_course
        lean_fish_course
    seafood_course
    meat_course
    pasta_course
        pasta_with_red_sauce
        pasta_with_white_sauce
            pasta_with_cream
            pasta_with_non_cream_white_sauce
    rice_course
    soup
    vegetarian_course

Ingredients classification:

food
    baking_supply
        leaven
            yeast
    cereal_grain
        cereal
            wheat
        grain
    cacao
    coffee
    condiment
        oil
        sauce
            tomato sauce
        vinegar
    dairy_product
        butter
        cheese
        cream
        milk
        yogurt
    dairy_product_substitute
        butter_substitute
        cheese_substitute
        cream_substitute
        milk_substitute
        yogurt_substitute
    egg
    fat_oil
        butter
        margarine
        oil
    fish
        caviar_roe
        crab
        fatty fish
        lean fish
        shellfish
        smoked_dry_fish
    flavoring
        sweetener
            cacao
            chocolate
                milk_chocolate
                dark_chocolate
            honey
            jam
            sugar
            syrup
        salt
        spice
        herb
        tea
    fruit
        fresh_fruit
        nut
    legume
    meat
        cured_precooked_meat
        mammal_meat
        poultry_meat
        reptile_meat
    pasta_bread
        bread
        pasta
    stimulant
        cacao
        coffee
        tea
    vegetable
        herb
        tea
        sea_vegetable
        common_vegetable
            soy
            tomato
    drink
        beer
        hard_drink
        infusion
            tea
            coffee
        juice
            fruit_juice
            vegetable_juice
        milk
        soda
        water
        wine


The following example shows the ingredient taxonomy divided with the IS-A relationship according to the kind of origin (animal or vegetal). Some knowledge about the place where the animal lives has been added as well, in order to improve the classification. The resulting taxonomy is shown in the next picture:

This classification has a poly-hierarchical structure. The leaf entities inherit from a kind of animal entity and from a place of living entity.

Some reorganization can be done to avoid some of the multiple inheritance:

But the classification still has a poly-hierarchical structure due to the entity Mammal.

Mammals normally live on earth, but the instance whale lives in the sea (this is called a boundary case). The objective is to create a classification where each instance inherits from only one entity.

[Figure: initial ingredient taxonomy. Ingredient is divided into Animal and Vegetal; the animal branch is classified both by kind of animal (Bird, Poultry, Reptile, Mammal, Fish, Seafood) and by place of living (Air, Earth, Sea).]

[Figure: reorganized taxonomy. Under Animal, the intermediate entities Kind of animal (Poultry, Reptile, Mammal, Fish, Seafood, Bird) and Place of living (Air, Earth, Sea) group the two criteria; instances such as Chicken, Cow and Whale still inherit from both branches.]

At this point two solutions are possible: to duplicate the entity whale in both earth and sea entities, or to transform one of the classification criteria (place of living or origin) into an attribute of the other.

When this classification was reconsidered, place of living was found not to be an important feature to bear in mind when studying the content of a recipe.

For example, even for the boundary case whale it does not matter where it comes from.

Although it lives in the sea, its taste resembles meat more than fish (which is what matters from the culinary point of view). (Moreover, the place of living makes no sense for a vegetal ingredient.)

Since the place of living is not a basic feature to bear in mind, it can be removed as a classification criterion and turned into an attribute. The resulting classification is shown in the next picture.
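A minimal sketch of this transformation, assuming a simple class-based model (the class and attribute names below are illustrative, not the project's):

# Place of living is stored as an attribute instead of a superclass,
# so every instance inherits from exactly one kind-of-animal entity.
class Animal:
    def __init__(self, place_of_living):
        self.place_of_living = place_of_living

class Mammal(Animal):
    pass

class Fish(Animal):
    pass

# The boundary case: a whale is still just a mammal; the fact that it
# lives in the sea no longer forces multiple inheritance.
whale = Mammal(place_of_living="sea")
cow = Mammal(place_of_living="earth")
salmon = Fish(place_of_living="sea")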

The next diagram shows an example of how to transform a multiple-inheritance classification in the sea-animal context.

Imagine a first stage in which the classification is made following three different criteria. The next picture shows how the taxonomy would look:

[Figure: ingredient taxonomy with place of living modelled as an attribute. Ingredient is divided into Animal (Poultry, Reptile, Mammal, Fish, Seafood, Bird) and Vegetal; Place of living appears as an attribute rather than a classification criterion.]

If the multiple inheritance is to be removed, the first step is to analyze the subdivision relationships and find out the least important one. In this case the kind-of-water relationship is considered not crucial. In this first stage, modelling it as an attribute has been chosen (as in the first example), as shown in the next picture:

Finally, the remaining classifications are compared to find the least important one. As explained in the project, the part-of relationship varies depending on the kind of animal it is applied to. This time, instead of turning this characteristic into an attribute, it has been decided to place it as a sub-classification of the IS-A classification.

[Figures: the sea-animal classification at the stages described above. Sea-animal is divided by kind of animal (Fish with Fatty fish and Lean fish, Shellfish with Mollusk and Crustacean, Mammal), by part (Flesh, Roe) and by kind of water (Fresh, Salty); instances include Salmon flesh, Salmon roe, Shark flesh, Whale flesh, Shrimp, Lobster roe, Clam and Octopus. In the later stages Kind_of_water becomes an attribute (Fresh|Salty or =Salty) and Part (Flesh, Roe) becomes a sub-classification under the kind-of-animal entities.]

"

"

"

" # # # # $ % # $ % # $ % # $ % #

& ' (

& ' ( & ' (

& ' (

Retrieval performance is measured in terms of precision and recall.

First of all, here are some definitions of Precision and Recall:

Precision: “It is the measure of the purity of retrieval”

Recall: “It is the measure of the completeness of retrieval”

It has been shown that precision and recall are inversely related: precision tends to drop as recall increases. Although all researchers would prefer high precision and high recall at the same time, the nature of this relationship is a handicap for that objective. The relationship follows a curve resembling a tangent parabola.

It has also been shown that if the information retrieval is carried out in more than one step, it is possible to improve both parameters at the same time.

Several studies have been made in this field, some considering precision and recall as continuous functions (Heine 1973, Robertson 1975, Gordon and Kochen 1989) and others modelling them as a two-Poisson discrete model (Bookstein 1974).

The precision and recall definitions are based on two traditional assumptions (although some authors question their validity):

Every retrievable item in a text of study is “relevant” or “not relevant”

Information retrieval is an extensive process; the retrieval can be broadened in order to retrieve more items, and hence increase the recall.

Based on the first assumption, all the retrievable items are classified according to the table below (each item belongs to one and only one cell):

Retrieval matrix    Relevant                    Not relevant                 TOTAL
Retrieved           N(retrieved∩relevant)       N(retrieved∩~relevant)       Nretrieved
Not retrieved       N(~retrieved∩relevant)      N(~retrieved∩~relevant)      N~retrieved
TOTAL               Nrelevant                   N~relevant                   Ntotal
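A minimal sketch filling this matrix from two made-up sets of item identifiers ("universe" stands for the whole collection; the data is purely illustrative):

# Illustrative sets; in practice these would come from a retrieval run.
universe = {"rice", "salt", "salmon", "whale", "table", "chair"}
retrieved = {"rice", "salt", "whale", "table"}
relevant = {"rice", "salt", "salmon"}

matrix = {
    ("retrieved", "relevant"): len(retrieved & relevant),
    ("retrieved", "not relevant"): len(retrieved - relevant),
    ("not retrieved", "relevant"): len(relevant - retrieved),
    ("not retrieved", "not relevant"): len(universe - retrieved - relevant),
}
print(matrix)  # every item of the collection falls into exactly one cell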


This matrix states two possible classifications for an item: whether it is retrieved or not retrieved, and whether it is relevant or not relevant.

Definition: Generality: the percentage of texts that are relevant among the whole collection of texts from which information is retrieved.

Once familiar with these notions, it is easier to understand the following definitions, which are more complete than the first ones:

Recall: “The number of retrieved relevant items (N(retrieved∩relevant)) among all the relevant items in the texts (Nrelevant)”.

So it is calculated by the formula:

Recall = N(retrieved∩relevant) / Nrelevant


It is a measure of the effectiveness in retrieving relevant information from the texts.

The number of relevant items in a given set is fixed, so it is clear that the higher the recall, the bigger the retrieved relevant set.

It can happen that the retrieved set grows (more information is retrieved) but the new information is wrong or non-relevant, so the recall remains the same. However, if the entire document is retrieved, a recall of 100% is always achieved.

But this is clearly nonsense, since the user is only interested in some relevant information, not in the entire text. So a recall of 100% is not necessarily good; there is a need to measure the Precision of the retrieved information.

Precision: “The number of retrieved relevant items (N(retrieved∩relevant)) among all the items retrieved from the document (Nretrieved)”.

So it is calculated by the formula:

Precision = N(retrieved∩relevant) / Nretrieved


It is a measure of the purity in retrieving relevant information from the texts.

It measures the efficiency in extracting relevant information while not retrieving irrelevant information.

As with Recall, a high rate of Precision is desired, but a Precision of 100% does not mean that the information retrieval is necessarily good: it can be achieved by retrieving just a very few items from the text, all of them relevant, while a lot of useful information is lost.

So the objective is to combine high rates of Precision and Recall at the same time. The ideal situation would be to obtain a 100% rate of both simultaneously.
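A minimal sketch computing both measures from the formulas above, using the same kind of made-up sets as before:

retrieved = {"rice", "salt", "whale", "table"}
relevant = {"rice", "salt", "salmon"}

retrieved_relevant = retrieved & relevant               # N(retrieved∩relevant)
recall = len(retrieved_relevant) / len(relevant)        # 2 / 3
precision = len(retrieved_relevant) / len(retrieved)    # 2 / 4

print(f"recall = {recall:.2f}, precision = {precision:.2f}")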

Precision vs. Recall

Empirical cases show the inverse relationship between Precision and Recall: if one of them improves, the other tends to decrease. Here lies the big challenge all information extraction experts deal with: evading this behaviour and obtaining good rates of both precision and recall.


The input corpus is a set of texts annotated by the user. Each relevant element is surrounded by SGML (XML in fact) tags. For example:

<quantity>5</quantity> <measure>kg</measure> <ingredient>rice</ingredient>

The algorithm puts all the instances annotated by the user into what is called the positive pool (which contains the positive, or relevant, examples). There is also a negative pool, which contains all the negative examples (the rest of the words in the text).

The algorithm covers all the training examples sequentially; when a new induced rule covers some positive examples, these are removed from the positive examples pool. The induction finishes when the positive pool is empty.
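A minimal sketch of this covering loop; induce_rule and covers are placeholder functions standing in for the real rule-induction machinery, not the project's actual API:

def learn_rules(positive_pool, negative_pool, induce_rule, covers):
    """Induce rules until every positive (annotated) example is covered."""
    rules = []
    positive_pool = set(positive_pool)
    while positive_pool:
        seed = next(iter(positive_pool))
        rule = induce_rule(seed, negative_pool)   # build and generalize one rule
        rules.append(rule)
        # remove every positive example the new rule covers
        positive_pool = {p for p in positive_pool if not covers(rule, p)}
        positive_pool.discard(seed)               # the seed itself is always covered
    return rules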

The algorithm makes use of different techniques: lemmatization [see Glossary], upper/lower case letters [see Glossary], POS tags [see Glossary], and gazetteers (synonyms and acronyms).

The training is carried out in two steps:

1. The first set of induced rules makes no use of linguistic information.

1.1. First of all it induces tagging rules. These rules are the ones that will tag information (in order to annotate and then extract it) from new untagged texts.

The tagging rules are of two kinds:

1.1.1. Initial tagging rules: the rules that will annotate future texts.

1.1.2. Contextual rules: rules that complete and improve the initial tagging rules.

1.2. Afterwards, correction rules are induced. These rules remove or correct mistakes and imprecisions that the previous rules may have made while annotating.

2. The second set of induced rules makes use of linguistic and additional information.

10.1.1.1 Tagging rules

The algorithm iterates in the following way: it selects a positive example from the positive pool, builds an initial tagging rule from it, generalizes the rule, and finally keeps the k best generalizations (k is set by the user).

All the elements in the positive pool covered by this new rule are removed from the pool.

An initial tagging rule has a pattern of conditions on its left-hand side, and its right-hand side is an action that inserts XML tags into the texts.
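A minimal sketch of this per-example step; the rule representation and the generalize and score functions are illustrative assumptions rather than the project's actual implementation:

def induce_tagging_rules(example, generalize, score, k):
    """Build an initial rule for one positive example and keep its k best generalizations."""
    initial_rule = {
        "pattern": example["context"],   # conditions (left-hand side)
        "action": example["tag"],        # XML tag to insert (right-hand side)
    }
    candidates = generalize(initial_rule)   # e.g. drop conditions, use POS tags or lemmas
    candidates.append(initial_rule)
    # keep only the k best-scoring rules (k is set by the user)
    return sorted(candidates, key=score, reverse=True)[:k]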