6.3.3.2 Composition properties: Transitivity
What makes composition relationships so important is the transitivity property they sometimes have. Additional knowledge can be inferred from a transitive relationship:
Definition: “If A has a relationship with B and B has a relationship with C, then we can infer that A has that relationship with C.”
The composition relationship is sometimes transitive, but we have to check carefully whether both premises are based on the same kind of composition relationship. If the premises use the same kind of composition relationship, they usually produce a valid composition-related conclusion (although it may occasionally be wrong). If the premises use two different kinds of composition relationships, the conclusion is usually wrong, although it can occasionally be right.
Example 1:
An ingredient is partly vitamins (material-object composition).
An ingredient is part of a course (component-integral object composition).
Conclusion: A vitamin is part of a course. This conclusion is wrong because, although some vitamins are part of the final dish, others are lost during cooking.
Example 2:
The loaf is partly flour (material-object composition).
A slice of bread is part of a loaf of bread (portion-object composition).
Conclusion: A slice of bread is partly flour. This conclusion is correct even though the premises use different kinds of relationships.
The transitivity property is very useful for propagating operations along composition relationships. If some ingredients are part of the ingredient description, and that description is part of a recipe, it can be concluded by transitivity that the ingredients are part of the recipe. When the composition relationship is the same, the propagation can be carried out across as many levels as we want; but if the kinds differ, each level has to be checked carefully for validity.
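The propagation described above can be sketched in a few lines of Python. This is a minimal illustration only: the fact tuples and relation-kind names are made up for the example, not taken from the project.

```python
# Each fact: (part, whole, kind_of_composition) -- illustrative data only.
facts = [
    ("ingredient", "ingredient_description", "component-integral"),
    ("ingredient_description", "recipe", "component-integral"),
    ("vitamin", "ingredient", "material-object"),
]

def propagate(facts):
    """Close the part-of relation transitively, one composition kind at a time."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, b, k1) in list(inferred):
            for (b2, c, k2) in list(inferred):
                # only chain premises of the SAME composition kind,
                # as required for a valid conclusion
                if b == b2 and k1 == k2 and (a, c, k1) not in inferred:
                    inferred.add((a, c, k1))
                    changed = True
    return inferred

closure = propagate(facts)
# ("ingredient", "recipe", "component-integral") is inferred by transitivity;
# no conclusion mixes the material-object and component-integral premises.
```

The kind check in the inner loop is what blocks the wrong vitamin-to-course style of inference discussed above.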
When modelling the domain, the first step is to identify the different kinds of relationships described above. This helps to establish the role of each relationship (to identify whole-part associations and to guide the development of the system).
The inheritance relationship is modelled in the diagrams as the IS-A relationship, which is normally the backbone of a diagram; the other kinds of relationships can then be added to complete the taxonomy. This approach, presented in [Andreasen et al., Nilsson 2001], is to divide the domain using the IS-A structure as the backbone of the classification, and afterwards divide the resulting groups according to other kinds of relationships, such as partitive, causative, locative, temporal, etc.
The parent elements hold the common basic structure and characteristics that their children inherit. The child elements have their parents’ characteristics plus their own (they specialize the parent elements). In this process the most general top category is analyzed and progressively specialized into more concrete concepts, ending in the leaves of the ontology, which are the instances of the lowest entity in the hierarchy they inherit from.
Example:
Ingredient → animal_origin → sea_animal_origin → fish → flesh → salmon
The sequence above represents a path down through the ontological ingredient structure, where the entity ingredient is the top element of the hierarchy and salmon is an instance of the last entity, flesh.
However, the IS-A relationship has to be used carefully in order to produce a consistent diagram; a very common mistake while developing these kinds of diagrams is to inherit from a class whose characteristics differ from those of our own class.
Summing up: the most important kinds of decomposition are inheritance and composition, and they are the ones on which the ontology model is based.
Course classification:
course
    beverage
        cocktail
        infusion
        juice
        milk_shake
    dessert
        non_sweet_dessert
        sweet_dessert
    fish_course
        fatty_fish_course
        lean_fish_course
        seafood_course
    meat_course
    pasta_course
        pasta_with_red_sauce
        pasta_with_white_sauce
            pasta_with_cream
            pasta_with_non_cream_white_sauce
    rice_course
    soup
    vegetarian_course

Ingredients classification:
food
    baking_supply
        leaven
        yeast
    cereal_grain
        cereal
            wheat
        grain
            cacao
            coffee
    condiment
        oil
        sauce
            tomato_sauce
        vinegar
    dairy_product
        butter
        cheese
        cream
        milk
        yogurt
    dairy_product_substitute
        butter_substitute
        cheese_substitute
        cream_substitute
        milk_substitute
        yogurt_substitute
    egg
    fat_oil
        butter
        margarine
        oil
    fish
        caviar_roe
        crab
        fatty_fish
        lean_fish
        shellfish
        smoked_dry_fish
    flavoring
        sweetener
            cacao
            chocolate
                milk_chocolate
                dark_chocolate
            honey
            jam
            sugar
            syrup
        salt
        spice
        herb
    tea
    fruit
        fresh_fruit
        nut
    legume
    meat
        cured_precooked_meat
        mammal_meat
        poultry_meat
        reptile_meat
    pasta_bread
        bread
        pasta
    stimulant
        cacao
        coffee
        tea
    vegetable
        herb_tea
        sea_vegetable
        common_vegetable
            soy
            tomato
    drink
        beer
        hard_drink
        infusion
            tea
            coffee
        juice
            fruit_juice
            vegetable_juice
        milk
        soda
        water
        wine
The following example shows the ingredient taxonomy divided, using the IS-A relationship, by kind of origin (animal or vegetal). Some knowledge about the place where the animal lives has been added as well, in order to improve the classification. The resulting taxonomy is shown in the next picture.
This classification has a poly-hierarchical structure: the leaf entities inherit both from a kind-of-animal entity and from a place-of-living entity. Some reorganization can be done to avoid part of the multiple inheritance, but the classification still has a poly-hierarchical structure due to the entity Mammal: mammals normally live on land, but the instance whale lives in the sea (this is called a boundary case). The objective is to create a classification where each instance inherits from only one entity.
[Figures: the ingredient taxonomy divided by origin (Animal, Vegetal) and by place of living (Air, Earth, Sea). The kind-of-animal entities Poultry, Reptile, Mammal, Fish and Seafood, together with Bird, appear under Animal, and instances such as Chicken, Cow and Whale inherit both from a kind-of-animal entity and from a place-of-living entity.]
At this point two solutions are possible: to duplicate the entity whale under both the earth and sea entities, or to transform one of the classification criteria (place of living or origin) into an attribute of the other.
When this classification was reconsidered, place of living was found not to be an important feature to bear in mind when studying the content of a recipe. For example, even for the boundary case whale, it does not matter where it comes from: although it lives in the sea, its taste resembles meat more than fish (which is the culinary point of view). (Moreover, the place of living makes no sense for a vegetal ingredient.)
Since the place of living is not a basic feature to bear in mind, it can be removed as a classification criterion and kept as an attribute. The classification now looks like the following picture:
The next diagrams show an example of how to transform a multiple-hierarchical classification, taken from the sea-animal context. Imagine a first stage in which the classification is made following three different criteria. The next picture shows how the taxonomy would look:
[Figure: the ingredient taxonomy with place of living as an attribute. Ingredient is divided into Animal and Vegetal, and Animal into Poultry, Reptile, Mammal, Fish, Seafood and Bird, each carrying place of living as an attribute instead of a parent entity.]
If the multiple inheritance is to be removed, the first step is to analyze the subdivision relationships and find the least important one. In this case the kind-of-water relationship is considered not crucial, so in this first stage it is modelled as an attribute (as in the first example), as shown in the next picture.
Finally, the remaining classifications are compared to find the least important one. As explained in the project, the part-of relationship varies depending on the kind of animal it is applied to. This time, instead of turning this characteristic into an attribute, it has been decided to keep it as a sub-classification inside the IS-A classification.
[Figure: the sea-animal classification after modelling kind-of-water as an attribute. Sea-animal is divided by two criteria: Kind-of-animal (Fish, subdivided IS-A into Fatty fish and Lean fish; Shellfish, subdivided into Mollusk and Crustacean; and Mammal) and Part (Flesh, Roe). The Kind_of_water attribute takes the value Fresh|Salty for fish and =Salty for shellfish and mammals. Instances include Salmon flesh, Salmon roe, Shark flesh, Shrimp, Whale flesh, Lobster roe, Clam and Octopus.]
[Figure: the sea-animal classification in the first stage, divided by three criteria: Kind-of-animal (Fish: Fatty fish, Lean fish; Shellfish: Mollusk, Crustacean; Mammal), Part (Flesh, Roe) and Kind-of-water (Fresh, Salty). Instances include Salmon flesh, Salmon roe, Shark flesh, Shrimp, Whale flesh, Lobster roe, Clam and Octopus.]
[Figure: the final sea-animal classification, with Part kept as a sub-classification of the IS-A hierarchy. Sea-animal is divided IS-A into Fish (Fatty fish, Lean fish), Shellfish (Mollusk, Crustacean) and Mammal; the fish entities are further subdivided into Flesh and Roe, while Kind_of_water remains an attribute (Fresh|Salty for fish, =Salty for shellfish and mammals). Instances include Salmon flesh, Salmon roe, Shark flesh, Shrimp, Whale flesh, Lobster roe, Clam and Octopus.]
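The criterion-to-attribute transformation applied above to place of living and kind-of-water can be sketched in Python. This is a hypothetical illustration: the class names mirror the diagrams, but the code is not from the project.

```python
# Instead of making Whale inherit from both Mammal and a place-of-living
# entity, place_of_living becomes a plain attribute, so every instance
# inherits from exactly one entity.

class Ingredient:
    pass

class Animal(Ingredient):
    def __init__(self, place_of_living):
        # "air", "earth" or "sea" -- no longer a second parent class
        self.place_of_living = place_of_living

class Mammal(Animal):
    pass

whale = Mammal(place_of_living="sea")    # the boundary case
cow = Mammal(place_of_living="earth")
# Both instances now inherit from a single entity (Mammal); the place of
# living no longer forces multiple inheritance into the hierarchy.
```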
Retrieval performance is measured in terms of precision and recall. First of all, here are some initial definitions of precision and recall:
Precision: “It is the measure of the purity of retrieval.”
Recall: “It is the measure of the completeness of retrieval.”
It has been shown that precision and recall are inversely related: precision tends to drop as recall increases. Although all researchers would prefer high precision and high recall at the same time, the nature of this relationship is an obstacle to that objective; the relationship has the shape of a tangent parabola.
It has also been shown that if the information retrieval is carried out in more than one step, it is possible to improve both parameters at the same time. Some studies have been made in this field, some considering precision and recall as continuous functions (Heine 1973, Robertson 1975, Gordon and Kochen 1989) and others modelling them with a two-Poisson discrete model (Bookstein 1974).
The precision and recall definitions are based on two traditional assumptions (although some authors question their validity):
Every retrievable item in a text under study is either “relevant” or “not relevant”.
Information retrieval is an extensive process; the retrieval can be widened in order to retrieve more items, and hence to increase the recall.
Based on the first assumption, all the retrievable items can be classified according to the table below (each item belongs to one and only one cell):
Retrieval matrix    Relevant                    Not relevant                  TOTAL
Retrieved           N(retrieved∩relevant)       N(retrieved∩~relevant)        N(retrieved)
Not retrieved       N(~retrieved∩relevant)      N(~retrieved∩~relevant)       N(~retrieved)
TOTAL               N(relevant)                 N(~relevant)                  N(total)
This matrix states two possible classifications for an item: whether it is retrieved or not retrieved, and whether it is relevant or not relevant.
Definition: Generality: the percentage of texts that are relevant within the whole collection of texts from which information is to be retrieved.
Once familiar with these notions, it is easier to understand the following definitions, which are more complete than the first ones:
Recall: “The number of retrieved relevant items (N(retrieved∩relevant)) over all the relevant items in the texts (N(relevant)).”
So it is calculated by the formula:
Recall = N(retrieved∩relevant) / N(relevant)    (6)
It is a measure of the effectiveness in retrieving relevant information from the texts. The number of relevant items in a given set is fixed, so clearly the higher the recall, the bigger the set of relevant retrieved items. It can happen that the retrieved set grows (more information is retrieved) while the additional information is wrong or non-relevant, so the recall stays the same. And if the entire document is retrieved, a 100% recall is always achieved. But this is clearly nonsense, since the user is interested only in the relevant information, not in the entire text. So a 100% recall is not necessarily good; there is also a need to measure the precision of the retrieved information.
Precision: “The number of retrieved relevant items (N(retrieved∩relevant)) over all the retrieved items (N(retrieved)) in the document.”
So it is calculated by the formula:
Precision = N(retrieved∩relevant) / N(retrieved)    (7)
It is a measure of the purity in retrieving relevant information from the texts: it measures the efficiency in extracting relevant information while not retrieving irrelevant information. As with recall, a high precision is desired, but a 100% precision does not mean that the information retrieval is necessarily good: it can be achieved by retrieving just a very few items from the text, all of them relevant, while a lot of useful information is lost. So the objective is to combine high rates of precision and recall at the same time; the ideal situation would be a 100% rate of both simultaneously.
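The two formulas above can be checked on a toy example. The following Python sketch uses made-up retrieval data, not results from the project.

```python
# Illustrative sets: what is actually relevant vs. what the system retrieved.
relevant  = {"rice", "salt", "salmon", "butter", "sugar"}
retrieved = {"rice", "salt", "salmon", "pan"}

# N(retrieved ∩ relevant): items that are both retrieved and relevant
n_retrieved_relevant = len(retrieved & relevant)

recall    = n_retrieved_relevant / len(relevant)    # completeness of retrieval
precision = n_retrieved_relevant / len(retrieved)   # purity of retrieval

print(recall, precision)  # 0.6 0.75
```

Retrieving every word of the document would push recall to 100% while precision collapses, which is exactly the trade-off discussed above.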
Precision vs. Recall
The empirical cases show the inverse relationship between precision and recall: if one of them improves, the other tends to decrease. Here lies the big challenge all information extraction experts deal with: evading this behavior and obtaining good rates of precision and recall at the same time.
The input corpus is a set of texts annotated by the user. Each relevant element is surrounded by SGML (in fact XML) tags. For example:
<quantity>5</quantity> <measure>kg</measure> <ingredient>rice</ingredient>
The algorithm puts all the instances annotated by the user into what is called the positive pool (it contains the positive, i.e. relevant, examples). There is also a negative pool, which contains all the negative examples (the rest of the words in the text).
The algorithm covers all the training examples sequentially: when a newly induced rule covers some positive examples, these are removed from the positive pool. The induction finishes when the positive pool is empty.
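The covering loop just described can be sketched in Python. The rule-induction step is stubbed out here (only the pool-emptying control flow is illustrated); the names and the toy examples are assumptions, not the actual system.

```python
def induce_rule(seed, positives):
    """Stub for rule induction: the 'rule' simply covers every
    positive example that carries the same tag as the seed."""
    tag = seed[1]
    return lambda example: example[1] == tag

def covering_loop(positive_pool):
    positive_pool = set(positive_pool)
    rules = []
    while positive_pool:                     # induction stops when pool is empty
        seed = next(iter(positive_pool))     # pick a positive example
        rule = induce_rule(seed, positive_pool)
        rules.append(rule)
        # remove every positive example the new rule covers
        positive_pool = {e for e in positive_pool if not rule(e)}
    return rules

# Toy annotated instances: (word, tag)
pool = [("5", "quantity"), ("kg", "measure"),
        ("rice", "ingredient"), ("2", "quantity")]
rules = covering_loop(pool)
print(len(rules))  # one stub rule per tag: 3
```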
The algorithm makes use of different techniques: lemmatization [see Glossary], upper/lower case letters [see Glossary], POS tags [see Glossary], and gazetteers (synonyms and acronyms).
The training is carried out in two steps:
1. The first set of induced rules makes no use of linguistic information.
1.1. First of all, tagging rules are induced. These are the rules that will tag information (in order to annotate and then extract it) from new, untagged texts. The tagging rules are of two kinds:
1.1.1. Initial tagging rules: the rules that will annotate future texts.
1.1.2. Contextual rules: they complete and improve the initial tagging rules.
1.2. Afterwards, correction rules are induced. These rules remove or correct the mistakes and imprecisions that the previous rules may make while annotating.
2. The second set of induced rules makes use of linguistic and additional information.
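A correction rule of the kind induced in step 1.2 can be illustrated with a sketch. The pattern and action below are hypothetical hand-written examples; in the system such rules are induced, not written by hand.

```python
import re

def correct(text):
    """Hypothetical correction rule: the initial tagging rules wrongly
    included the unit word inside <quantity>; the correction moves the
    closing tag back to just after the number."""
    return re.sub(r"<quantity>(\d+)\s+(\w+)</quantity>",
                  r"<quantity>\1</quantity> \2",
                  text)

print(correct("<quantity>5 kg</quantity> of rice"))
# <quantity>5</quantity> kg of rice
```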
10.1.1.1 Tagging rules
The algorithm iterates in the following way: it selects a positive example from the positive pool, builds an initial tagging rule from it, generalizes the rule, and finally keeps the k best generalizations (k is set by the user). All the elements in the positive pool covered by this new rule are then removed from the pool.
An initial tagging rule has a pattern of conditions on its left-hand side, and its right-hand side is an action that inserts XML tags into the texts.
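Such a rule can be illustrated with a small Python sketch. The condition pattern and the tag-insertion action are hypothetical examples, not rules induced by the system.

```python
import re

# Left-hand side: a pattern of conditions on the words (a number followed
# by a unit). Right-hand side: an action that inserts XML tags.
rule = {
    "pattern": re.compile(r"\b(\d+)\s+(kg|g|l)\b"),
    "action":  r"<quantity>\1</quantity> <measure>\2</measure>",
}

def apply_rule(rule, text):
    """Apply the tagging rule: wherever the conditions match,
    insert the tags given by the action."""
    return rule["pattern"].sub(rule["action"], text)

print(apply_rule(rule, "add 5 kg of rice"))
# add <quantity>5</quantity> <measure>kg</measure> of rice
```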