Document Grammars - Regular Tree Grammars

3.8 Regular Tree Grammars

3.8.1 Document Grammars

Every XML document that is well-formed and satisfies the namespace constraints is said to have an XML Information Set (Infoset). An Infoset describes the abstract data model of the information that is stored in an XML document.

A non-empty set of items in that Infoset contains some⁹ information items that constitute the abstract representation of one or more elements in the document.

In turn, each information item has a set of associated named properties. The current specification defines 11 types of information items, 8 of which are relevant to our schema:

* Document Information Item,

* Element Information Items,

* Attribute Information Items,

* Processing Instruction Information Items,

* Character Information Item,

* Comment Information Items,

* Document Type Declaration Information Item,

* Namespace Information Items.
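To make the list above more tangible, the following minimal sketch walks a small document and names the kind of item behind each node. It uses Python's standard `xml.dom.minidom` rather than the tools used in this work, and the DOM is only an approximation of the Infoset (a DOM text node groups many character information items into one). The sample document and the names `KIND`, `list_items` and `SAMPLE` are assumptions made for illustration.

```python
from xml.dom import Node, minidom

# Map DOM node types to the information items they roughly correspond to.
KIND = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.COMMENT_NODE: "comment",
    Node.TEXT_NODE: "characters",
}

def list_items(node, found=None):
    """Collect the kinds of items met in a depth-first walk of the tree."""
    found = [] if found is None else found
    found.append(KIND.get(node.nodeType, "other"))
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes are not child nodes in the DOM; count them separately.
        found.extend("attribute" for _ in range(node.attributes.length))
    for child in node.childNodes:
        list_items(child, found)
    return found

# Hypothetical sample document, not taken from the application.
SAMPLE = ('<?xml-stylesheet type="text/xsl" href="cfp.xsl"?>'
          '<!-- note --><cfp id="1">hello</cfp>')
items = list_items(minidom.parseString(SAMPLE))
```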

Following is a simple, yet somewhat formal description of a language grammar that well-formed documents must abide by.

3.8.1.1 Context-free Grammars

By a grammar, we understand a set of rules that govern how valid sentences in a language are constructed. By grammar validity, we understand two sets of rules:

Syntax :

Rules that only concern the form of a document. An informal description was given in section 2.1.5.1; a slightly more formal one follows below.

⁹ Two or more information items.

Semantic :

Rules that concern the meaningfulness of a well-formed document. As we have seen in section 2.1.5.2, semantic correctness implies syntactic correctness.

Informally, this correctness is enforced by some validity constraints. By schema validation, the following definition is understood:

Let σ be an instance schema expressed in the grammar of XML Schema, and let δ(σ) be a function that defines the set of XML documents that are valid relative to σ:

δ(σ) = {X | X is valid relative to σ}    (3.1)

Given an XML document X, we need to determine the membership of X, i.e. whether X ∈ δ(σ).
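The membership test can be sketched as a predicate that first checks well-formedness and only then checks validity. The sketch below is a toy under stated assumptions: the "schema" is a hypothetical dictionary mapping each element name to the child names it allows, which stands in for a real XML Schema; `TOY_SCHEMA` and `is_member` are illustrative names, not part of the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: element name -> set of allowed child element names.
TOY_SCHEMA = {"cfp": {"title", "deadline"}, "title": set(), "deadline": set()}

def is_member(xml_text, schema, root_name):
    """Return True if xml_text belongs to delta(schema) under the toy rules."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed, hence it cannot be valid either
    if root.tag != root_name:
        return False
    stack = [root]
    while stack:
        element = stack.pop()
        allowed = schema.get(element.tag)
        if allowed is None:
            return False  # element is unknown to the schema
        for child in element:
            if child.tag not in allowed:
                return False
            stack.append(child)
    return True
```

Note that a document which fails to parse is rejected before any schema rule is consulted, which mirrors the observation that semantic correctness presupposes syntactic correctness.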

Context-free Grammars (CFG), introduced by Chomsky in the mid-1950s, describe the phrase structure of natural language sentences. A CFG describes all possible 'sentences' of a language; if the grammar is unambiguous, a valid sentence has a single interpretation. Whether a sentence belongs to the language is decidable, which is important when the above membership is verified.

Below, well-formedness constraints are specified using a simple Extended Backus-Naur Form (EBNF) notation: if an XML document does not conform to these rules, it is not well-formed XML.

Regular expressions: the union and closure operators follow POSIX extended regular expression notation; non-special characters match themselves.

R|S : union; matches any string that is matched by either R or S,

R,S : the comma denotes concatenation,

R* : matches 0 or more repetitions of R,

R+ : matches 1 or more repetitions of R,

R? : matches 0 or 1 repetition of R,

R - S : matches any string that matches R but not S,

(..) : groups operators with their arguments.
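As a quick illustration of the operators listed above, the following sketch expresses each of them with Python's `re` module. The expressions `R` and `S` are arbitrary placeholders and the helper `matches` is an assumption introduced for this example only.

```python
import re

R, S = "ab", "cd"  # two arbitrary example expressions

def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

union = f"(?:{R})|(?:{S})"     # R|S
concat = f"(?:{R})(?:{S})"     # concatenation of R and S
star = f"(?:{R})*"             # R*
plus = f"(?:{R})+"             # R+
optional = f"(?:{R})?"         # R?
```

Here `(?:...)` plays the role of the grouping parentheses, so that a closure operator applies to the whole of `R` rather than to its last character.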

Production rules follow:

document ::= prolog element Misc*
prolog ::= XMLDecl? PI* Misc* (doctype Misc*)?
XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
element ::= EmptyTag | StartTag content EndTag
content ::= CharData? ((element | Reference | CDataSect | PI | Comment) CharData?)*

EncodingDecl ::= S 'encoding' Equals ('"' EncodingType '"' | "'" EncodingType "'")

EncodingType ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
VersionInfo ::= S 'version' Equals ("'" XMLVersion "'" | '"' XMLVersion '"')
Equals ::= S? '=' S?
XMLVersion ::= '1.0'
doctype ::= ' '
Misc ::= Comment | | S

Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
Char ::= CarrReturn | Tab | Space | LineFeed | UniCodeChars
UniCodeChars ::= [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

WhiteSpace ::= (CarrReturn | LineFeed | Tab | Space)+
CarrReturn ::= '\r' | #xD
LineFeed ::= '\n' | #xA
Tab ::= '\t' | #x9
Space ::= ' ' | #x20

StartTag ::= '<' Name (S Attribute)* S? '>'
EndTag ::= '</' Name S? '>'
Attribute ::= Name Equals AttValue

Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';'
CharRef ::= '&#' [0-9]+ ';'
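Some of the productions above translate almost directly into regular expressions. The sketch below does this for EncodingType, CharRef and Comment using Python's `re` module. It is an approximation under one stated assumption: `Char - '-'` is written as `[^-]`, which is slightly wider than the Char rule because it also admits control characters.

```python
import re

# EncodingType ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
ENCODING_TYPE = re.compile(r"[A-Za-z](?:[A-Za-z0-9._]|-)*")

# CharRef ::= '&#' [0-9]+ ';'   (decimal references only, as in the rule)
CHAR_REF = re.compile(r"&#[0-9]+;")

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Assumption: [^-] stands in for (Char - '-').
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")

def is_comment(text):
    """True if text as a whole is a comment according to the rule."""
    return COMMENT.fullmatch(text) is not None
```

The Comment rule never lets two hyphens appear in a row inside the body, so a string such as `<!-- a -- b -->` is rejected.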

Adding Namespace Support:

Given a namespace ns with a valid URI, an element identifier may appear as

<ns:element>value</ns:element>

QName ::= (Prefix ':')? LocalPart
Prefix ::= NCName
LocalPart ::= NCName
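The QName rule can be sketched as a small function that splits a qualified name into its prefix and local part. NCName is not defined in the rules above, so the sketch assumes a simplified, ASCII-only NCName; `NCNAME`, `QNAME` and `split_qname` are illustrative names.

```python
import re

# Assumption: a simplified ASCII-only NCName (a name without colons).
NCNAME = r"[A-Za-z_][A-Za-z0-9._-]*"

# QName ::= (Prefix ':')? LocalPart
QNAME = re.compile(rf"(?:(?P<prefix>{NCNAME}):)?(?P<local>{NCNAME})")

def split_qname(name):
    """Return (prefix, local_part); prefix is None for unprefixed names."""
    match = QNAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a QName: {name!r}")
    return match.group("prefix"), match.group("local")
```

Because neither part may contain a colon, a name with two colons is rejected rather than split at an arbitrary position.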

Adding Processing Instructions (PIs):

The XML declaration, which is itself a special PI, signified by the rule XMLDecl above, is not matched here. Since we need to append a transformation grammar or style sheets (that is, either XSL or CSS) to an XML document, we must allow for the xml-stylesheet processing instruction in the prologue of the document:

PI ::= '<?' Target (S (Char* - (Char* '?>' Char*)))? '?>'
Target ::= 'xml-stylesheet'

The target names "XML" and "xml" are reserved words in the XML specification. Under normal circumstances, all PI targets except any case combination of (('X' | 'x') ('M' | 'm') ('L' | 'l')) would be allowed; to simplify the notation, however, no PIs except XSLT stylesheets are expected, their target being 'xml-stylesheet'. Also, a document type declaration is not deemed necessary at the current stage.
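A minimal sketch of how the restricted PI rule could be recognized with a regular expression follows. The whitespace class `[ \t\r\n]` stands in for S, the negative lookahead expresses "any characters not containing `?>`", and the file name in the usage example is hypothetical.

```python
import re

# PI ::= '<?' Target (S (Char* - (Char* '?>' Char*)))? '?>'
# Target ::= 'xml-stylesheet'
PI_RE = re.compile(
    r"<\?(xml-stylesheet)"              # '<?' followed by the only allowed target
    r"(?:[ \t\r\n]+((?:(?!\?>).)*))?"   # optional S and data without '?>'
    r"\?>",
    re.DOTALL,
)

def parse_pi(text):
    """Return (target, data) for an xml-stylesheet PI, or None otherwise."""
    match = PI_RE.fullmatch(text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For example, `parse_pi('<?xml-stylesheet type="text/xsl" href="cfp.xsl"?>')` yields the target together with the pseudo-attributes as one data string, while the XML declaration is not matched.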

Adding extended character data:

Often there is a need to store characters that would otherwise be recognized as markup, without having to escape them. Data within a "CDATA" section is interpreted as characters, not as markup or entity references. Large chunks of text data, such as the whole content of a CFP document, may also be stored within CDATA.

CDataSect ::= CDStart CData CDEnd
CDStart ::= '<![CDATA['
CData ::= (Char* - (Char* ']]>' Char*))
CDEnd ::= ']]>'
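The effect of a CDATA section can be observed with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree`, not the tools of this work, and a hypothetical `abstract` element: characters that would normally start markup arrive as plain text.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; the CDATA section holds unescaped '<', '>' and '&'.
DOC = "<abstract><![CDATA[if a < b && c > d then <b>bold</b>]]></abstract>"

root = ET.fromstring(DOC)
text = root.text              # the CDATA content, delivered as characters
child_count = len(list(root)) # 0: '<b>' was not parsed as an element
```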

Notice these two features:

Threshold :

The minimal items required by the "document" rule, one document information item and one element information item, make up the smallest XML document conforming to the Infoset paradigm.

Unicode :

The XML version 1.0 specification, currently in its fourth edition, only allows characters which are defined in Unicode 2.0. The "Char" production rule implies all Unicode characters, except the surrogate blocks, FFFE, and FFFF [XML06]. Despite this abundance,

The XML specification seems to indicate that valid character data in XML comprises all Unicode characters, excluding the surrogate blocks. The production rule UniCodeChars defines such a range (definition from the XML specification).
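The character ranges discussed here can be checked with a few comparisons. The following sketch tests a single character against the Char rule as given above (tab, line feed, carriage return and the three UniCodeChars ranges); `is_xml_char` is an illustrative name.

```python
def is_xml_char(ch):
    """True if the single character ch is allowed by the Char rule."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # Tab, LineFeed, CarrReturn
        or 0x20 <= cp <= 0xD7FF        # Space up to just below the surrogates
        or 0xE000 <= cp <= 0xFFFD      # after the surrogates, without FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )
```

The surrogate blocks (D800 to DFFF) and the code points FFFE and FFFF fall in the gaps between the ranges and are therefore rejected.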

Figure 3.3: Application architecture

Figure 3.4: Hierarchy and dependency relations between classes

Figure 3.5: Simple role-based access control

Figure 3.6: Processing sequence of FLWR expressions

Figure 3.7: XSL Transformation scheme and components used.

Figure 3.8: Visualization of the first schema design: Russian Doll

Figure 3.9: Visualization of the second, more lenient version of the schema: Venetian Blind design

Figure 3.10: Document elements hierarchy, shown with cardinalities

Implementation Details

In this chapter we describe a number of important issues within the implementation. Some amendments or additions to the conditions mentioned in the previous chapter are also found here. Not every line of code will be examined and explained, but we will try to explain the main ideas and include code when it helps in understanding the proposed solution.

Before we delve into the details, for the sake of reference:

Prerequisites:

* PHP5 with default extensions (CURL, XSL, SimpleXML), safe mode off

* MySQL 5

* Mozilla Firefox (JavaScript enabled)

* libXML (part of the GNOME project, see 4.3.2.1)

libXML needs to be compiled in the local environment; no dependencies on additional libraries were observed. See appendix A.1 for a short installation guide.

4.1 Mail-interface Application

In the following sections, we will look into how parts of the domain problem can be solved using remote servers and local configurations as a part of a distributed system.
