Richard Durbin
MRC Laboratory for Molecular Biology
Hills Road, Cambridge, CB2 2QH, UK
Email: rd@mrc-lmba.cam.ac.uk
December 10, 1992.
Figure 1 shows a simple ACEDB schema consisting of four models, three for classes and one for a constructed type.
?Paper Title UNIQUE Text Int
Author ?Author XREF Paper
Page Int UNIQUE Int
Reference UNIQUE Journal ?Journal XREF Paper
Book #Paper
Personal_communication
Language UNIQUE French
English
Complete_Text ?LongText
?Journal Paper ?Paper XREF Journal
?Author Address Int #Address
Paper ?Paper XREF Author
Director UNIQUE ?Author XREF Colleagues
Colleagues ?Author XREF Colleagues
#Address Mail Text
Phone Int REPEAT
Data are organised into objects and classes. Each object belongs to
exactly one class and has an externally visible name which is unique
within that class. Each class and constructed type is identified by
its externally visible model name, which is unique within the
schema.A model is presented as a tree [Figure 1]. It contains five types of node: tags (e.g. Title, French), which are words that must be enumerated in a separate file, basic data types ( Int, Float, Text), modifiers ( XREF, UNIQUE, REPEAT), class names (e.g. ?Author) and constructed type names (e.g. #Address). The tags must form a rooted subtree of the model and can appear only once in a model (except those following XREF). Note that tags may appear as leaves in the tree. This implements in a natural way Boolean flags, via the presence or absence of a tag, and restricted vocabularies, as following Language in model Paper.
acePaper Title "ACEDB - a database for Genomic Information"
Author "Thierry-Mieg J."
"Durbin R."
Reference Book Title "Genome Mapping and Sequencing"
Author "Richard Myers" -C "et al"
Language French
"Thierry-Mieg J." Address 1989 Mail "Observatoire de Meudon"
1992 Mail "CNRS, Montpellier"
Phone 33 67 61 33 24 -C "Direct line"
21 -C "Secretariat"
52 15 59 -C "Fax"
Paper acePaper
Example of ACEDB objects\label
An instance of a class, i.e. an object belonging to that class, is
also presented as a tree [Figure 2]. It contains tags, data items and
pointers to other objects. The object must match the model for
its class: tags must appear in corresponding positions, data items must
be of the type specified by the model, and pointers must refer to
objects of the class specified by the model. Hence, ACEDB is a
strongly typed system. However, the -C construction allows to add
anywhere in the objects free comments that will be visible to the
human users but will not, in any way, affect the behaviour of
client programs relying on the database server.The single most important feature of the ACEDB design is that an object need only match a rooted subtree of its model; any branch in an object can legally be pruned off. This allows one to extend the schema without touching the objects and implies that small partially specified objects cost little memory.
UNIQUE can follow any non-leaf node in a model. If present, the matching node in an object will be followed immediately to its right by at most one node chosen from among those allowed by the model at that position. This implements exclusive choice between branches (e.g. after Language or Reference in class paper) and unicity of pointers or data items (e.g. after Title or Journal). By default, a data or class type in the model allows a non repetitive set of nodes of the given type, displayed in a vertical column (e.g after Author in Paper acePaper).
XREF establishes automatic cross-referencing between objects. It must follow a pointer and be followed by a tag name, which tells the system where to create or remove the back pointer. Note however that XREF acts in only one direction and does not propagate; both XREF's should be stated if one wants to guarantee the existence of the double link. This is quite versatile as shown in the last 2 lines of Class Author.
REPEAT must follow a data or pointer node and must end a branch. It allows in the object the existence of an arbitrary tree of nodes of the given type (e.g. phone numbers in object Thierry-Mieg). In the same conditions UNIQUE REPEAT forbids branching and hence just allows an ordered list of nodes with possible repetitions.
Constructed types allow a recursive specification of data within a model (e.g. #Paper in ?Paper). The #Type node in the model is not represented by a node in the object but simply switches at that point the matching rule to the root of the included model. #Type is always UNIQUE and cannot be REPEAT.
Finally, in addition to the classes defined using models, ACEDB allows the existence of Array classes, which are simply treated by the database manager as binary buffers. Such classes can appear in models, since pointers to them can exist, but they do not support XREF clauses. It is up to the application to deal with the content of the buffer. The standard ACEDB implementation defines LongText and DNA Array classes, optimised for text and DNA sequence storage.
<models> ::= <model>
| <model> <models>
;
<model> ::= ?<model name> <unique> <tag column> /* For classes */
| #<model name> <unique> <tag column> /* For constructed types */
;
<tag column> ::= <tag node>
| <tag node> NL <tag column>
;
<tag node> ::= <tag>
| <tag> <unique> <data cluster>
| <tag> <unique> START_INDENT <tag column> END_INDENT
/* In addition, in ACEDB 1-x, we allowed */
| <tag> <unique> START_INDENT <data cluster> NL
<tag column> END_INDENT
/* This construction however can lead to ambiguities when parsing data and */
/* will be forbidden in release 2-x */
;
<data cluster> ::= <data type list>
| <data type list> REPEAT
| <data type list> #<model name>
| #<model name>
;
<data type list> ::= <data type>
| <data type> <unique> <data type list>
;
<unique> ::= <null>
| UNIQUE
;
<data type> ::= <primitive data type>
| <class reference>
| ANY /* reserved for kernel use */
;
<primitive data type> ::= Int
| Text
| Float
;
<class reference> ::= ? <class name>
| ? <class name> XREF <tag name>
;
<tag> ::= /* string starting with a letter or digit,
and containing letters, digits and `_' */
;
<model name> ::= /* string starting with a letter or digit,
and containing letters, digits and `_' */
A double slash "//" is considered as the beginning of a comment and input is skipped until the end of the line.Tabs are reinterpreted as rounding to 8 spaces.
A tag name must be unique within a model.An ACEDB model is defined as a hierarchical tree. Items at the same level in the hierarchy must be identically indented on consecutive lines. This is indicated in the BNF grammar by indentation ( START_INDENT and END_INDENT), and new line separators ( NL).
If "?C XREF T", appears in a < class reference > clause in model M, then "T ?M" must appear in the model for class C. XREF clauses included recursively within a constructed subtype are ignored.
samPapers := {Find Paper Author = Sam } ;
{ Find Journal Proceedings* ; Follow paper }
AND $samPapers ;
Year >= 1992 ;
In the first line, we construct the set of all papers known in the
database such that Sam is one of the authors, and name this set
samPapers. In the second line, we search for journals named
"Proceedings of something or other" and then obtain all the papers
published in those journals. In the third line, we intersect this
unnamed set with the former. The resulting intersection is then piped
to the fourth line and filtered on the year of publication.Obviously, there are many equivalent ways to express the same query. Given the automatic cross-referencing between Journal and Paper expresed in figure 1, we could have asked directly:
Find Paper
Author = Sam AND Journal = Proceedings* AND Year >= 1992 ;
The first operation is to select a target set. a) One can start from a whole class, as in "Find Paper". b) One can select a previous set by picking it on screen with the mouse or by the clause "From samPapers". c) One may follow pointers as in "Follow Paper" which will create the set of all objects following the tag Paper in objects of the active keyset. d) One may rely on the semicolon to pipe the result of the previous query as the default target of the next query. Hence "From samPapers & lt; condition & gt; ;" is equivalent to the double query "$samPapers ; < condition > ;".
If a query is of the form < set operators >, each < set > is entered with the same default target.
The scope of the set names is limited to a session or a command file.
< strings > can contain wildcard characters (`*' for any text, `?' for a single character). For example, a*ba* will match strings starting with `a' aand containing `ba'. We may in the future allow more general Unix style regular expressions.
< locator > is a Boolean check on the existence of a node.
Finally, the # symbol evaluates to TRUE if both its left and right hand sides do. The difference with AND is that # jumps the implicit model cursor, and hence the matching process, into a constructed type (which must be present at that point in the model), without moving the cursor in the active object. Recall that no node in the object corresponds to the #type node of the model.
<query> ::= <single query>
| <single query> ";" <query>
;
<single query> ::= <target set>
| <target set> <condition>
| "@" <file_name> <parameters> /* command file */
/* Set operations: not implemented in release 1-8
| <set>
| <set name> := <set>
;
<set> ::= <simple set>
| <set> <set operator> <set>
;
<simple set> ::= { <query> }
| ( <set> )
| $ <set name >
;
<set operator> ::= "|" | OR
| "&" | AND
| "^" | XOR // exclusive or
| "-" | MINUS
*/
;
<target set> ::= <null> /* then the current key set is used */
| <find class> <class name>
| <follow> <tag>
/* we could possibly allow <follow> <atomic condition> */
/* Set operations: not implemented in release 1-8
| <from set> <set name>
<from set> ::= ">$" | FROM
*/
;
<find class> ::= ">?" | FIND
;
<follow> ::= ">" | FOLLOW
;
<condition> ::= <atomic condition>
| <not> <atomic condition>
| <condition> <operator> <condition>
;
/* Order of precedence: OR, XOR, AND, NOT, #, comparator, IS, CLASS */
<operator> ::= "|" | OR
| "^" | XOR
| "&" | AND
| "#" /* AND, with jump in constructed type */
;
<atomic condition> ::= IS <regexp> /* short hand for IS = <regexp> */
| CLASS <regexp> /* Checks class of object matches regexp */
| COUNT <locator> /* counts the number of
entries in the column right of locator */
| <locator>
| <atomic condition> <comparator> <regexp>
| "(" <condition> ")"
;
<locator> ::= <tag> /* go to tag according to current model (which
will depend on # symbols encountered previously) */
| HERE /* stay put */
| NEXT /* move one node right */
| IS /* relocates at root and stands for the name of the object */
;
/* All comparators have equal precedence */
<comparator> ::= "<" | LT
| ">" | GT
| "<=" | LE
| ">=" | GE
| "=" | EQ
| "!=" | NEQ
;
<tag> ::= /* string starting with a letter or a digit
and containing letters, digits or `_' */
;
<class name> ::= /* string starting with a letter or a digit
and containing letters, digits or `_' */
;
<set name> ::= /* string starting with a letter or a digit
and containing letters, digits or `_' */
;
<reg exp> ::= /* integer or float decimal number */
| /* string of letters, digits, _,
? (matching one letter) and * (matching any substring */
| /* arbitrary ascii string protected by double quotes */
;
A double slash "//" is considered as the beginning of a comment and input is skipped until the end of the line.The backslash can be used to escape a backslash, a double quote or a newline.
Tabs are reinterpreted as rounding to 8 spaces.
Spaces inside names must be protected by double quotes, but to prevent runaway, a new line ends an ongoing open double quote.
< comparators > try to evaluate their operands as numbers and if they cannot fall back on comparing strings of characters. However ACEDB does not use strict alphabetic ordering. It recognises embeded integers and will order a3x left of a25x.The < condition > is evaluated in succession on each object of the < target set >. When the condition is entered a cursor is set at the root of the object. This implicit cursor will then move with each successful < locator > clause. A < locator > < comparator > < string > clause will move the cursor if possible to the right of the locator and then down the column onto the first node satisfying the comparator or to the end of that column. This position can then be used as the start of the next search.
< set name > must appear for the first time to as left hand side of an assignment, < set name > := < set >. The scope of set names is the life of a query window or vt100 client, or a single command file. Set operations are not implemented in ACEDB 1-8.
Entries are based on records (lines) and paragraphs (groups of lines separated by blank lines).Hence, in contrast to ASN-1 files, ace files are easy to read and hand edit and are highly compatible with the standard line oriented Unix commands. Furthermore, it is extremely easy to export data from one ace database and import part of these into another ace database with a rather different schema. One just has to delete or comment out with a ! those classes and tags unknown to the second schema. The fields common to the two schemas will then be read by the receiving ACEDB.Each paragraph starts with a class name and object name, and corresponds to a set of actions on that object.
Each subsequent line in a paragraph starts with a tag name and defines, according to the model, a path in the object to a single node. Any parts of the branch up to the node not yet present in the object will be created.
In addition to simply specifying data, simple directives at the beginning of lines allow deletion of data, renaming of objects, and establishment of aliases. Altogether this allows the propagation of any transformation of one database to a remote site.
Property (c) relies on the construction of ACEDB models, which permits unambiguous addressing of each node of an object using an abbreviated path from the root. Because tags are unique in a model, "interior" tags, i.e. tags that are followed by another tag, are not required.
In figure 4, we give an ace file sufficient to reconstruct the objects of figure 2.
Paper acePape
Title "ACEDB - a database for Genomic Information"
Author "Thierry-Mieg J."
Author "Durbin R."
Book Title "Genome Mapping and Sequencing"
Book Author "Richard Myers" -C "et al"
Language French
Author "Thierry-Mieg J."
Address 1989 Mail "Observatoire de Meudon"
Address 1992 Mail "CNRS, Montpellier"
Address 1992 Phone 33 67 61 33 24 -C "Direct line"
Author "Thierry-Mieg J."
Address 1992 Phone 33 67 61 33 21 -C "Secretariat"
Address 1992 Phone 33 67 52 15 59 -C "Fax"
Example of ACE File\label
<ace file> ::= <object update>
| <object update> NL NL <ace file>
;
<object update> ::= <class name> <name>
| <class name> <name> NL <data lines>
| -D <class name> <name> /* Delete */
| -R <class name> <name> <name> /* Remame */
| -A <class name> <name> <name> /* Alias */
;
<data lines> ::= <data line>
| <data line> NL <data lines>
;
<data line> ::= <path>
| -D <path> /* Deletion */
| -T <path> /* Tops within column, not in 1_8 */
| -R <datum> <path> /* Substitution, not in 1_8 */
;
<path> ::= <simple path>
| <simple path> # <path>
;
<simple path> ::= <tag>
| <tag> <data list>
;
<data list> ::= <datum>
| <datum> <data list>
;
<tag> ::= /* string starting with a letter or a digit
and containing letters, digits or `_' */
;
<class name> ::= /* string starting with a letter or a digit
and containing letters, digits or `_' */
;
<datum> ::= /* string containing letters, digits or `_' */
| " /* arbitrary printable string protected by double quotes */ "
;
Paragraph starting with "!" are skipped.Lines starting with "!" are skipped.
A double slash "//" is considered as the beginning of a comment and input is skipped until the end of the line.
The backslash can be used to escape a backslash, a double quote or a newline. In the case of a newline all space and tabs at the beginning of the next line are also skipped. Therefore, one often wants to break a long line as follows:
here is some text from a very long line. There will be only \ one space between "only" and "one".Other files can be included recursively by the command @fileName, with @ the first character on a line.Tabs are reinterpreted as 8 spaces. Spaces inside names must be protected by double quotes, but to prevent runaway, a new line ends an ongoing open double quote. Furthermore, ACEDB will remove leading and trailing spaces from names even if the are protected by a double quotes.
In case of a parsing error, the whole paragraph is discarded with a warning message.
It is not an error if a -Delete, -Alias or -Rename clause does not find its target. This property ensures that the same ace file can be read twice without error (which is very useful).