acedb --- A C.elegans Database
Syntactic Definitions for the
ACEDB Data Base Manager

Jean Thierry-Mieg
CNRS--CRBM
Route de Mende, BP 5051, 34033 Montpellier, France
Email: mieg@crbm1.cnusc.fr

Richard Durbin
MRC Laboratory for Molecular Biology
Hills Road, Cambridge, CB2 2QH, UK
Email: rd@mrc-lmba.cam.ac.uk

Abstract

ACEDB is an object-oriented database management system that we have developped for scientific research projects where many data are very incomplete. The aim of this technical report is to give the formal syntactic definition of the schema of an ACEDB database, of the associated query language and of the update file mechanism. Corresponding syntactic validation tools are available on request.

December 10, 1992.

Introduction

Not yet written.

The ACEDB Data Model

Description

A thorough description of the ACEDB semantics will be found in [Durbin, Thierry-Mieg, 1993]. In this section we just hope to clarify, using a simple example, the meaning of the 3 BNF grammars described in this technical document.

Figure 1 shows a simple ACEDB schema consisting of four models, three for classes and one for a constructed type.

?Paper     Title   UNIQUE Text Int
           Author ?Author XREF Paper
           Page Int UNIQUE Int
           Reference UNIQUE Journal ?Journal XREF Paper
                            Book #Paper
                            Personal_communication
           Language UNIQUE  French
                            English
           Complete_Text ?LongText
                            
?Journal Paper ?Paper XREF Journal

?Author Address Int #Address
        Paper ?Paper XREF Author
        Director UNIQUE ?Author XREF Colleagues
        Colleagues ?Author XREF Colleagues

#Address  Mail Text
          Phone Int REPEAT
Data are organised into objects and classes. Each object belongs to exactly one class and has an externally visible name which is unique within that class. Each class and constructed type is identified by its externally visible model name, which is unique within the schema.

A model is presented as a tree [Figure 1]. It contains five types of node: tags (e.g. Title, French), which are words that must be enumerated in a separate file, basic data types ( Int, Float, Text), modifiers ( XREF, UNIQUE, REPEAT), class names (e.g. ?Author) and constructed type names (e.g. #Address). The tags must form a rooted subtree of the model and can appear only once in a model (except those following XREF). Note that tags may appear as leaves in the tree. This implements in a natural way Boolean flags, via the presence or absence of a tag, and restricted vocabularies, as following Language in model Paper.

acePaper Title "ACEDB - a database for Genomic Information"
         Author "Thierry-Mieg J."
                "Durbin R."
         Reference Book Title "Genome Mapping and Sequencing"
                        Author "Richard Myers" -C "et al"
         Language French

"Thierry-Mieg J." Address 1989 Mail "Observatoire de Meudon"
                            1992 Mail "CNRS, Montpellier"
                                 Phone  33 67 61 33 24 -C "Direct line"
                                                    21 -C "Secretariat"
                                              52 15 59 -C "Fax"
                    Paper  acePaper


                Example of ACEDB objects\label
An instance of a class, i.e. an object belonging to that class, is also presented as a tree [Figure 2]. It contains tags, data items and pointers to other objects. The object must match the model for its class: tags must appear in corresponding positions, data items must be of the type specified by the model, and pointers must refer to objects of the class specified by the model. Hence, ACEDB is a strongly typed system. However, the -C construction allows to add anywhere in the objects free comments that will be visible to the human users but will not, in any way, affect the behaviour of client programs relying on the database server.

The single most important feature of the ACEDB design is that an object need only match a rooted subtree of its model; any branch in an object can legally be pruned off. This allows one to extend the schema without touching the objects and implies that small partially specified objects cost little memory.

UNIQUE can follow any non-leaf node in a model. If present, the matching node in an object will be followed immediately to its right by at most one node chosen from among those allowed by the model at that position. This implements exclusive choice between branches (e.g. after Language or Reference in class paper) and unicity of pointers or data items (e.g. after Title or Journal). By default, a data or class type in the model allows a non repetitive set of nodes of the given type, displayed in a vertical column (e.g after Author in Paper acePaper).

XREF establishes automatic cross-referencing between objects. It must follow a pointer and be followed by a tag name, which tells the system where to create or remove the back pointer. Note however that XREF acts in only one direction and does not propagate; both XREF's should be stated if one wants to guarantee the existence of the double link. This is quite versatile as shown in the last 2 lines of Class Author.

REPEAT must follow a data or pointer node and must end a branch. It allows in the object the existence of an arbitrary tree of nodes of the given type (e.g. phone numbers in object Thierry-Mieg). In the same conditions UNIQUE REPEAT forbids branching and hence just allows an ordered list of nodes with possible repetitions.

Constructed types allow a recursive specification of data within a model (e.g. #Paper in ?Paper). The #Type node in the model is not represented by a node in the object but simply switches at that point the matching rule to the root of the included model. #Type is always UNIQUE and cannot be REPEAT.

Finally, in addition to the classes defined using models, ACEDB allows the existence of Array classes, which are simply treated by the database manager as binary buffers. Such classes can appear in models, since pointers to them can exist, but they do not support XREF clauses. It is up to the application to deal with the content of the buffer. The standard ACEDB implementation defines LongText and DNA Array classes, optimised for text and DNA sequence storage.

BNF Grammar for the ACEDB Models

<models> ::= <model> 
           | <model> <models>
           ;
<model> ::= ?<model name> <unique> <tag column>  /* For classes */
          | #<model name> <unique> <tag column>  /* For constructed types */
          ;
<tag column> ::= <tag node>
               | <tag node> NL <tag column> 
               ;
<tag node> ::= <tag>
             | <tag> <unique> <data cluster>		
             | <tag> <unique> START_INDENT <tag column> END_INDENT 
/* In addition, in ACEDB 1-x, we allowed   */
             | <tag> <unique> START_INDENT <data cluster> NL
                                           <tag column> END_INDENT
/* This construction however can lead to ambiguities when parsing data and */
/* will be forbidden in release 2-x */
             ;
<data cluster> ::= <data type list>
                 | <data type list> REPEAT      
                 | <data type list> #<model name>
                 | #<model name>
                 ;
<data type list> ::= <data type>
                   | <data type> <unique> <data type list> 
                   ;   
<unique> ::= <null> 
           | UNIQUE
           ;
<data type> ::= <primitive data type>
              | <class reference>
              | ANY	/* reserved for kernel use */	
              ;
<primitive data type> ::= Int 
                        | Text 
                        | Float
                        ;
<class reference> ::= ? <class name>
                    | ? <class name> XREF <tag name>	
                    ;






<tag> ::=  /* string starting with a letter or digit,
                       and containing letters, digits and `_' */
        ;
<model name> ::= /* string starting with a letter or digit, 
                       and containing letters, digits and `_' */

Preprocessor

There is a preprocessor ran before parsing the models. it takes the following actions.

A double slash "//" is considered as the beginning of a comment and input is skipped until the end of the line.

Tabs are reinterpreted as rounding to 8 spaces.

Additional Rules

A tag name must be unique within a model.

An ACEDB model is defined as a hierarchical tree. Items at the same level in the hierarchy must be identically indented on consecutive lines. This is indicated in the BNF grammar by indentation ( START_INDENT and END_INDENT), and new line separators ( NL).

If "?C XREF T", appears in a < class reference > clause in model M, then "T ?M" must appear in the model for class C. XREF clauses included recursively within a constructed subtype are ignored.

The ACEDB Query Language

Introduction

Acedb is object oriented. In contradistinction with relational databases like Oracle or Sybase, each data item is addressed by name rather than by content. This precludes a direct implementation of the standard relation oriented query language SQL. Rather, the central operation is to construct sets of objects with desirable properties, to combine and to filter these sets and to follow the pointers to neighbouring objects as in the following example:

     samPapers := {Find Paper Author = Sam } ;
     { Find Journal Proceedings* ; Follow paper }
           AND $samPapers ;
      Year >= 1992 ;
In the first line, we construct the set of all papers known in the database such that Sam is one of the authors, and name this set samPapers. In the second line, we search for journals named "Proceedings of something or other" and then obtain all the papers published in those journals. In the third line, we intersect this unnamed set with the former. The resulting intersection is then piped to the fourth line and filtered on the year of publication.

Obviously, there are many equivalent ways to express the same query. Given the automatic cross-referencing between Journal and Paper expresed in figure 1, we could have asked directly:

   Find Paper 
     Author = Sam AND Journal = Proceedings* AND Year >= 1992 ;

Target Set

Apart from the usual nightmare of learning yet another language, with its peculiar punctuation and symbols, the only difficulty is to understand properly which data are screened against which part of the query.

The first operation is to select a target set. a) One can start from a whole class, as in "Find Paper". b) One can select a previous set by picking it on screen with the mouse or by the clause "From samPapers". c) One may follow pointers as in "Follow Paper" which will create the set of all objects following the tag Paper in objects of the active keyset. d) One may rely on the semicolon to pipe the result of the previous query as the default target of the next query. Hence "From samPapers & lt; condition & gt; ;" is equivalent to the double query "$samPapers ; < condition > ;".

If a query is of the form < set operators >, each < set > is entered with the same default target.

The scope of the set names is limited to a session or a command file.

Active object and node

Given a target set, the < condition > clause is applied in turn to each object in the set. This "active object" is kept if the condition is satisfied, discarded otherwise. A cursor, defining the "active node", is attached to the active object, initially at the root, and a twin cursor points to the root of the corresponding model. Each succesful < locator > clause will then move these cursors. A < tag > will move to the tag, if present in the object. NEXT will move once right. HERE will stay on the same node. A < locator > < comparator > < string > will move the cursor to the column right of the locator, and then, within that column, to the first node matching the comparison. It is important to realise that < locators > are evaluated by order of precedence of the operators linking them, and only as far as required to evaluate the boolean expression, as in C language. Thus, "A OR B OR C" will leave the cursor on the first tag among A,B, and C present in the object. "A = *" and "A AND NEXT" are equivalent clauses which leave the cursor on the node right of tag A and return TRUE if present, or else return FALSE.

Atomic conditions

The rest of the syntax is straightforward. Elementary conditions return TRUE or FALSE and are combined by the usual operators NOT, AND, XOR, OR. Subexpressions can be paranthesised. CLASS and IS refer to the class and name of the active object and must be followed by a comparator and a string. As a short hand, IS as the first operator of a < condition > can be omitted at the user's risk, provided the string is not misinterpreted as a tag name by the system. Also "=" may be dropped following IS. e.g. "IS < string2 >" is accepted as shorthand for "IS = < string2 >".

< strings > can contain wildcard characters (`*' for any text, `?' for a single character). For example, a*ba* will match strings starting with `a' aand containing `ba'. We may in the future allow more general Unix style regular expressions.

< locator > is a Boolean check on the existence of a node.

Finally, the # symbol evaluates to TRUE if both its left and right hand sides do. The difference with AND is that # jumps the implicit model cursor, and hence the matching process, into a constructed type (which must be present at that point in the model), without moving the cursor in the active object. Recall that no node in the object corresponds to the #type node of the model.

Command files

In place of a simple query, it is possible to invoke a command file that contains query text (which might possibly include further recursive invocations of command files). Any strings after the command file name will be used as parameters, substituting for %1, %2, %3 etc. in the command file text. This is done by macro string substitution. Named sets are inherited in the included command, without possible conflict with those declared locally.

Extraction of data

The system described in this section allows one to select objects, but not to extract numerical data. This operation can be performed in ACEDB, either in interactive or in batch mode, by using a facility called the table maker which essentially constructs a spreadsheet-like table. However, this facility has been introduced only recently and is still evolving, so it is too early to give its formal definition.

BNF Grammar for the ACEDB Query Language

<query> ::= <single query>
          | <single query> ";" <query>
          ;
<single query> ::= <target set>
                 | <target set> <condition>
                 | "@" <file_name> <parameters>   /* command file */

/*       Set operations:  not implemented in release  1-8
                 | <set>
                 | <set name> := <set>          
                 ;
<set> ::= <simple set>
        | <set> <set operator> <set>
        ;
<simple set> ::= { <query> }
               | ( <set> )
               | $ <set name >       
               ;
<set operator> ::= "|" | OR
                 | "&" | AND
                 | "^" | XOR     // exclusive or
                 | "-" | MINUS
*/
                 ;

<target set> ::= <null>         /* then the current key set is used */
               | <find class> <class name>
               | <follow> <tag> 
     /* we could possibly allow <follow> <atomic condition> */

/*         Set operations:  not implemented in release  1-8         
               | <from set> <set name>      

<from set>   ::= ">$" | FROM
*/
               ;

<find class> ::= ">?" | FIND
               ;
<follow>     ::= ">"  | FOLLOW
               ;
<condition> ::= <atomic condition>
              | <not> <atomic condition>
              | <condition> <operator> <condition>
              ;

  /* Order of precedence: OR, XOR, AND, NOT, #, comparator, IS, CLASS */
<operator> ::= "|" | OR   
             | "^" | XOR 
             | "&" | AND
             | "#"      /* AND, with jump in constructed type */
             ;
<atomic condition> ::= IS <regexp>     /* short hand for IS = <regexp> */
                     | CLASS <regexp>  /* Checks class of object matches regexp */
                     | COUNT <locator> /* counts the number of
                                          entries in the column right of locator */
                     | <locator>
                     | <atomic condition> <comparator> <regexp>
                     | "(" <condition> ")"
                     ;
<locator> ::= <tag> /* go to tag according to current model (which
                       will depend on # symbols encountered previously) */
            | HERE  /* stay put */
            | NEXT  /* move one node right */
            | IS    /* relocates at root and stands for the name of the object */
            ;
    /* All comparators have equal precedence */
<comparator> ::= "<"  | LT
               | ">"  | GT
               | "<=" | LE
               | ">=" | GE
               | "="  | EQ
               | "!=" | NEQ
               ;

<tag> ::= /* string starting with a letter or a digit 
                             and containing letters, digits or `_' */
        ;
<class name> ::= /* string starting with a letter or a digit
                             and containing letters, digits or `_' */
               ;
<set name> ::= /* string starting with a letter or a digit
                             and containing letters, digits or `_' */
               ;
<reg exp> ::= /* integer or float decimal number */
            | /* string of letters, digits, _, 
                 ? (matching one letter) and * (matching any substring */
            | /* arbitrary ascii string protected by double quotes */
            ;

Preprocessor

There is a preprocessor ran before parsing the queries. it takes the following actions.

A double slash "//" is considered as the beginning of a comment and input is skipped until the end of the line.

The backslash can be used to escape a backslash, a double quote or a newline.

Tabs are reinterpreted as rounding to 8 spaces.

Spaces inside names must be protected by double quotes, but to prevent runaway, a new line ends an ongoing open double quote.

Additional Rules

< comparators > try to evaluate their operands as numbers and if they cannot fall back on comparing strings of characters. However ACEDB does not use strict alphabetic ordering. It recognises embeded integers and will order a3x left of a25x.

The < condition > is evaluated in succession on each object of the < target set >. When the condition is entered a cursor is set at the root of the object. This implicit cursor will then move with each successful < locator > clause. A < locator > < comparator > < string > clause will move the cursor if possible to the right of the locator and then down the column onto the first node satisfying the comparator or to the end of that column. This position can then be used as the start of the next search.

< set name > must appear for the first time to as left hand side of an assignment, < set name > := < set >. The scope of set names is the life of a query window or vt100 client, or a single command file. Set operations are not implemented in ACEDB 1-8.

Description

The ace file update language is used to import and export data as ascii text files. The following properties make it versatile and powerful.

Entries are based on records (lines) and paragraphs (groups of lines separated by blank lines).

Each paragraph starts with a class name and object name, and corresponds to a set of actions on that object.

Each subsequent line in a paragraph starts with a tag name and defines, according to the model, a path in the object to a single node. Any parts of the branch up to the node not yet present in the object will be created.

In addition to simply specifying data, simple directives at the beginning of lines allow deletion of data, renaming of objects, and establishment of aliases. Altogether this allows the propagation of any transformation of one database to a remote site.

Hence, in contrast to ASN-1 files, ace files are easy to read and hand edit and are highly compatible with the standard line oriented Unix commands. Furthermore, it is extremely easy to export data from one ace database and import part of these into another ace database with a rather different schema. One just has to delete or comment out with a ! those classes and tags unknown to the second schema. The fields common to the two schemas will then be read by the receiving ACEDB.

Property (c) relies on the construction of ACEDB models, which permits unambiguous addressing of each node of an object using an abbreviated path from the root. Because tags are unique in a model, "interior" tags, i.e. tags that are followed by another tag, are not required.

In figure 4, we give an ace file sufficient to reconstruct the objects of figure 2.


Paper acePape
Title  "ACEDB - a database for Genomic Information"
Author "Thierry-Mieg J."
Author  "Durbin R."
Book  Title "Genome Mapping and Sequencing"
Book  Author "Richard Myers" -C "et al"
Language French


Author "Thierry-Mieg J."
Address 1989 Mail "Observatoire de Meudon"
Address 1992 Mail "CNRS, Montpellier"
Address 1992 Phone  33 67 61 33 24 -C "Direct line"

Author "Thierry-Mieg J."
Address 1992 Phone  33 67 61 33 21 -C "Secretariat"
Address 1992 Phone  33 67 52 15 59 -C "Fax"


            Example of ACE File\label

BNF Grammar for the ace-update language

<ace file> ::= <object update>
	     |  <object update> NL NL <ace file>
	     ;
<object update> ::= <class name> <name>
             | <class name> <name> NL <data lines>
             | -D <class name> <name>         /* Delete */
             | -R <class name> <name> <name>  /* Remame */
             | -A <class name> <name> <name>  /* Alias */
             ;

<data lines> ::= <data line>
               | <data line> NL <data lines>
               ;
<data line> ::= <path>
              | -D <path>          /* Deletion */
              | -T <path>          /* Tops within column, not in 1_8 */
              | -R <datum> <path>  /* Substitution, not in 1_8 */
              ;
<path> ::= <simple path>
         | <simple path> # <path>
         ;
<simple path> ::= <tag>
	        | <tag> <data list>
                ;
<data list> ::= <datum>
              | <datum> <data list>
              ;

<tag> ::= /* string starting with a letter or a digit 
                             and containing letters, digits or `_' */
             ;
<class name> ::= /* string starting with a letter or a digit
                             and containing letters, digits or `_' */
               ;
<datum> ::= /* string  containing letters, digits or `_' */
          | " /* arbitrary printable string protected by double quotes */ "
          ;

Preprocessor

There is a preprocessor ran before parsing the ace files. it takes the following actions.

Paragraph starting with "!" are skipped.

Lines starting with "!" are skipped.

A double slash "//" is considered as the beginning of a comment and input is skipped until the end of the line.

The backslash can be used to escape a backslash, a double quote or a newline. In the case of a newline all space and tabs at the beginning of the next line are also skipped. Therefore, one often wants to break a long line as follows:

  here is some text from a very long line.  There will be only \
  one space between "only" and "one".
Other files can be included recursively by the command @fileName, with @ the first character on a line.

Tabs are reinterpreted as 8 spaces. Spaces inside names must be protected by double quotes, but to prevent runaway, a new line ends an ongoing open double quote. Furthermore, ACEDB will remove leading and trailing spaces from names even if the are protected by a double quotes.

Additional Rules

< path > and < data list > are interpreted according to the model, and must conform. In particular, if the model specifies a class, a < datum > will be interpreted as the name of an object in that class, and this object will silently be created if it does not already exist. If the model specifies Float or Int < datum > will be interpreted as a numerical value if possible, else a parse error will be generated. If the model specifies Text then the < datum > will be entered as a string. There is also a parse error if a tag is given that can not be found in the currently relevant model.

In case of a parsing error, the whole paragraph is discarded with a warning message.

It is not an error if a -Delete, -Alias or -Rename clause does not find its target. This property ensures that the same ace file can be read twice without error (which is very useful).

bibliography

Durbin R. and Thierry-Mieg J, "ACEDB, a database system for classical and quantum genetics", "to be published"