Jean Thierry-Mieg
CNRS--CRBM
Route de Mende, BP 5051, 34033 Montpellier, France
Email: mieg@crbm1.cnusc.fr
December 18, 1992
WARNING: You do not need to read anything in this manual if you are merely keeping up a version of the standard worm database. In fact, if you start applying anything in this manual then you may possibly get the database into a state where it will not accept the official updates, in which case you will have to start again from ace files.
Remember that the acedb kernel uses its own capabilities to manage the system. It is therefore extremely important that you do not edit the system configuration files, i.e. sysclass, sysoptions, systags and the beginning of the model file, without prior consultation with us.
2 Overview
In acedb the organisation of the basic classes and what they contain,
and the overall flow of the program, are not built into the
core database software, but are determined from a set of around 15
configuration files gathered in the wspec directory. Fortunately,
it is not too difficult to manipulate them because they serve very
different purposes and fall into several independent categories:
security, machine configuration, link between the kernel and the
applications, and data semantics. All of them are read at run time; a
few are also read during compilation. You can probably edit any one of
those files rather easily since they are self-documented. However, the
purpose of this chapter is to explain the overall logic of the
system.
The environment variable ACEDB should give the name of the parent directory of the wspec and database directories. If you use acedb for several sets of data, you may have a single executable. However, for each set you must have a pair of directories: database (to hold the data) and wspec (to explain the meaning of the data), and at least the value of the semaphore (file semkey) must differ.
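As a rough illustration of the directory layout just described, the following Python sketch derives the two directories from the ACEDB variable. The function name acedb_dirs is hypothetical and is not part of acedb.

```python
import os
from pathlib import Path


def acedb_dirs(env=None):
    """Return the (wspec, database) directories under the ACEDB parent directory."""
    env = os.environ if env is None else env
    parent = env.get("ACEDB")
    if not parent:
        raise RuntimeError("the ACEDB environment variable is not set")
    root = Path(parent)
    # Both directories are expected side by side under the same parent.
    return root / "wspec", root / "database"
```

Passing a dictionary instead of the real environment, e.g. acedb_dirs({"ACEDB": "/home/worm"}), makes it easy to try the function with one parent directory per data set.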
3 Security
For its security, acedb relies on Unix. The way to set up the read and
write permissions of the various directories is explained in the
installation manual. To summarize, the program runs setuid (4755).
Only the administrator, i.e. the Unix owner of the
database directory, can reorganize or reinitialise the database; only a
few users, listed in passwd, can gain write access;
everybody else can read the data.
To implement these ideas, wspec contains just 2 very simple files dealing with security: passwd and semkey. Empty lines do not count. Anything following a pair of forward slashes // is a comment. The only remaining lines of passwd list the (Unix) login names of those users who may gain write access.
In semkey, the only meaningful line reads SEM = n where n is some positive integer. Type the unix command ipcs to see if you have semaphores on your machine, and which ones are available. Choose 0 (zero) to disable the semaphore system. You should do this if you do not have semaphores mounted on your machine. Otherwise choose some unused small number. If you use acedb for several purposes, say Arabidopsis and C.elegans data, select a different semaphore value for each database.
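The two file formats above are simple enough to sketch in a few lines of Python. This is only an illustration of the rules as stated (empty lines ignored, // starts a comment, one SEM = n line); the function names are hypothetical and the real acedb reader is written in C.

```python
def meaningful_lines(text):
    """Yield the non-empty lines of a wspec file with // comments removed."""
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()
        if line:
            yield line


def read_passwd(text):
    """Return the login names listed in a passwd file."""
    return list(meaningful_lines(text))


def read_semkey(text):
    """Return the integer n of the SEM = n line (0 disables semaphores)."""
    for line in meaningful_lines(text):
        name, sep, value = line.partition("=")
        if sep and name.strip() == "SEM":
            return int(value)
    raise ValueError("no SEM = n line found")
```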
4 Machine configuration
A few files are specific to the X11 implementation (xace) of acedb.
These are GraphPackage and
xfonts. GraphPackage strictly follows the
X11 syntax and we refer you to your relevant manuals.
xfonts lists the fonts acedb will use. The syntax is as
above. Empty lines do not count. // starts a comment. As implied by
the enum FONTYPE listed in comments in the xfonts
file, acedb expects a list of 18 font names: six plain fonts of
various sizes, then six italics, then six bold. The first font is the
default font. At least that one must really be available on your
machine. If any other is missing, then acedb will issue a message and
use the default font instead.
It is completely harmless to edit this file, but you must restart the program to see the effect, since it is read just once during initialisation. At the top of that file, you will find instructions on how to find out what fonts are available on your machine.
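The fallback rule for missing fonts can be pictured with the following Python sketch. It assumes that the caller already knows which font names exist on the machine; resolve_fonts is a hypothetical name and the messages are invented for illustration.

```python
FONT_COUNT = 18  # six plain, six italic, six bold, as described in the text


def resolve_fonts(requested, available):
    """Replace every unavailable font by the default (first) font.

    Returns the resolved list and the warning messages that were produced.
    """
    if len(requested) != FONT_COUNT:
        raise ValueError(f"expected {FONT_COUNT} font names, got {len(requested)}")
    default = requested[0]
    if default not in available:
        raise RuntimeError(f"default font {default} is not available")
    resolved, warnings = [], []
    for name in requested:
        if name in available:
            resolved.append(name)
        else:
            warnings.append(f"font {name} missing, using {default}")
            resolved.append(default)
    return resolved, warnings
```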
5 Display control
Acedb produces many windows on screen that, as far as the user is
concerned, behave more or less independently. We call them displays.
They can communicate and in a few cases are subordinate to one another, but generally,
each display has a life of its own. As of release 1-8, we distinguish
15 different species of displays, enumerated in the file
disptype. The number of allowed displays of each type varies.
You always have one Main, to control the overall behaviour of the
program; if you kill it, you exit acedb. You have at most one display
for each of the general tools, like Status or File_Chooser, and
you may have an arbitrary number of data displays like TREE, GMAP,
Biblio etc.
The file disptype is read at compile time. It therefore follows the C syntax. It is also read at run time to establish a correspondence with the displays.
displays is read at run time. The meaningful lines start with _DDisplayType, where DisplayType must match the above enumeration, with a number of options corresponding to this display following on the same line. Continuation on the next line is however possible, a la Unix, when a backslash is the last character. The possible options are listed at the top of the file. As of release 1-8 we allow
The displays can be registered as the preferred display type of a class in the file wspec/options.wrm. In that case, you must provide a displayFunction and register it in wspec/quovadis.wrm.
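The line format of the displays file (a _DDisplayType keyword, options on the same line, backslash continuation) can be read along the following lines. This Python sketch assumes that // comments work as in the other wspec files; the function name is hypothetical and the options are kept as plain words without interpretation.

```python
def read_display_lines(text):
    """Map each display type to the list of option words that follow it."""
    logical, pending = [], ""
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].rstrip()
        if line.endswith("\\"):
            # A trailing backslash continues the entry on the next line.
            pending += line[:-1] + " "
            continue
        logical.append(pending + line)
        pending = ""
    if pending:
        logical.append(pending)
    displays = {}
    for line in logical:
        words = line.split()
        if words and words[0].startswith("_D"):
            displays[words[0][2:]] = words[1:]
    return displays
```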
6 Help
The help file may be tailored to the data you
manipulate. The syntax is simple. Each section of the help starts with
**Name and ends when the next one begins. You just type whatever help
page you want and register it with a display type using the -Help
option in the displays file.
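Splitting a help file into its **Name sections takes only a few lines. The Python sketch below follows the rule stated above; split_help is a hypothetical name and text before the first ** line is simply ignored, which is an assumption.

```python
def split_help(text):
    """Return a dictionary mapping each section name to its help page."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("**"):
            # A new section starts here and the previous one ends.
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}
```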
7 Tags and Classes enumerations
In acedb, the data are organized in objects and classes (see the
users' manual). First of all, the classes are enumerated in the files
sysclass, for the system classes, and classes, for
the data classes. These files are read at compile time and follow the C
syntax. Each class name is #define'd to an integer between 1 and
255. Classes are accessed in the source code as _VClass, and in the
interface and ace files simply as Class. It is the responsibility of
the programmer to ensure that each class is #define'd to a
different number and that these numbers are not modified throughout
the lifetime of the database.
Exactly the same caveat applies to the files systags and tags which enumerate the vocabulary understandable to the system. Again the systags are used by the database manager and should not be edited by the end user, the tags correspond to the regular data. The tags (in fact their name preceded by an underscore) are #define'd to positive integers; they must appear in order. Gaps in the list of integers are allowed, but they use up space in the system, so don't jump to 100,000! Numbers less than 128 are reserved for systags.wrm. The integers are of course the numbers used to represent them in the database. The first few tags in systags.wrm are used to specify the data types. The pseudo-tags _LastC and _LastN terminate the lists of character types and numerical types respectively. There can be up to 2^24 = 16777216 different tags. This is also the maximum number of objects in any one class (since tags are in fact the objects in class 0, the System class).
If a tag or class name is doubly defined, or if a number is used twice, or if the numbers have been modified between sessions, the system will complain at run time. There is however no required consistency across machines. In fact, this is a desirable property of acedb, because it allows communications between implementations of acedb which do not implement exactly the same schema. For example, acedb can easily exchange sequences of DNA between C.elegans and Arabidopsis, although the complete lists of tags and classes differ.
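The consistency rules for the tags file can be checked before compiling. The following Python sketch is a hypothetical pre-flight check, not the test acedb itself performs: it looks for doubly defined names, reused numbers, numbers below 128, numbers beyond 2^24 and definitions that are out of order.

```python
import re

DEFINE = re.compile(r"^\s*#define\s+_(\w+)\s+(\d+)")
FIRST_USER_TAG = 128  # smaller numbers are reserved for systags
MAX_TAGS = 2 ** 24    # upper bound on the number of tags


def check_tags(text):
    """Return a list of problems found in the #define lines of a tags file."""
    problems = []
    names, numbers, previous = set(), set(), 0
    for lineno, line in enumerate(text.splitlines(), 1):
        match = DEFINE.match(line)
        if not match:
            continue
        name, number = match.group(1), int(match.group(2))
        if name in names:
            problems.append(f"line {lineno}: tag {name} defined twice")
        if number in numbers:
            problems.append(f"line {lineno}: number {number} used twice")
        if number < FIRST_USER_TAG:
            problems.append(f"line {lineno}: {number} is reserved for systags")
        if number >= MAX_TAGS:
            problems.append(f"line {lineno}: {number} is too large")
        if number < previous:
            problems.append(f"line {lineno}: {number} is out of order")
        names.add(name)
        numbers.add(number)
        previous = max(previous, number)
    return problems
```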
The reason why the numbers must not be modified is that these numbers are actually written on disk and in the compiled code in lieu of the word they stand for. If you change the name after compilation of the code, or after entering, and hence compiling, some data, you invalidate that correspondence. It is theoretically safe to change the name of a tag, without changing its intended meaning. It will create no display problem. But you will have to change it in the source code before recompiling and in the ace files before parsing them. The correct way to rename something in acedb is to use the alias/rename option of the main menu, or the -R -A options in an ace file. As of 1-8, we have not extended the alias possibility to tags, although this is clearly feasible, but we allow a dynamic renaming of the classes as one of the possibilities offered in the options file.
As of October 1992, everybody is still running the same acedb executable. We coordinate the use of tags by attributing slots to the various interested people, as we have been doing between ourselves during the development of the program. So to have your own tags, and remain binary compatible, contact us.
8 Class options
The options and sysoptions files
are read only at run time. They give some detail on how classes should
be managed. The format is the same as that of the
displays file. The meaningful lines start with _VClass, where
_VClass must match the enumeration in classes and
sysclass.
First of all a class must be of type A, B or X. These three types are mutually exclusive and should never be modified. X is purely for kernel use. A classes are arrays, or tuples; they are manipulated by content, as in a relational database. B classes are trees; they are manipulated according to their models. The libraries available to access A and B objects are totally different, and if the class type were changed during the lifetime of the database, or simply after the application code has been written, the system would complain bitterly.
The other options can be changed at will. Hidden/Visible indicates if the class is listed or not in the main acedb display. Note that you can nevertheless list the members of a hidden class by typing the Query ``>?Class'' (Find all members of Class Class) in the query window.
Display indicates the preferred display type of that class. The chosen display type must be listed in displays and registered in quovadis.
Title is an option used only in tree-displays. If a given object of the class is referred to by another object in a tree-display, then the given object will be indicated by the Text following the chosen Title field if available, rather than by its own name. For instance, we display the title of the papers, rather than their cryptic medline or cgc identifier.
Rename lets you alias the name of a class to some newName. NewName will show in lists, old and new names will be recognized in ace files and queries, and the old name is used in the code. In case some newName matches some old name, the new name takes precedence.
CaseSensitive makes the names in this class case sensitive. Never undo this option once it has been used, since you would start mixing in an unpredictable way the objects whose names differ only by upper/lower case letters.
As of 1-8, classes default as -B -V -D TREE; A classes further default as -H -D ZERO.
A classes may have user-defined parse and dump functions.
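The defaulting rule can be written down as a small Python sketch. The flags -B, -V, -H and -D appear in the text; the flags -A and -X for the other two class types are an assumption made by analogy, and class_options is a hypothetical name.

```python
def class_options(words):
    """Apply the documented defaults to the option words of one class line."""
    kind = "B"
    for word in words:
        if word in ("-A", "-B", "-X"):  # -A and -X are assumed spellings
            kind = word[1]
    # Every class starts as -B -V -D TREE; A classes start as -H -D ZERO.
    options = {"type": kind, "visible": True, "display": "TREE"}
    if kind == "A":
        options.update(visible=False, display="ZERO")
    words = iter(words)
    for word in words:
        if word == "-H":
            options["visible"] = False
        elif word == "-V":
            options["visible"] = True
        elif word == "-D":
            display = next(words, None)
            if display is None:
                raise ValueError("-D must be followed by a display type")
            options["display"] = display
    return options
```

Explicit options on the line override the defaults, so an A class can still be made visible with -V.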
9 Linking
The file quovadis is used to link new applications
to the kernel. This file is read at compile time and follows the
ANSI-C syntax. We hope that the file is sufficiently self-documented
to be clear to C programmers. It contains the correct prototypes, then
the main menu, i.e. the pop-up menu of the main acedb window. Note the
menuSpacer, which produces empty lines in the menu and also delimits
menu paragraphs that can be excluded or reactivated via the
menuSuppress and menuInit calls.
Then inside the function displayInit, we find the actual link to the display, parse and dump functions. You need to set up a display function for every display type which is declared as the preferred display type of a given class in options. Apart from that, you must provide a parse function and a dump function for every visible A class, to allow reading and writing members of this class in ace format. In principle, you should also provide for them an asnParse and an asnDump function to allow communication in ASN.1 notation, although we have not done this systematically so far.
10 Models
We finish with the most interesting file of this directory:
models. Each B class must have a model for it in this file.
The zeroth entry in the class is an object with the name
?ClassName, whose associated tree is used to store the model
for all the other members of the class. You can display this tree
like any other in the database by selecting
?ClassName in the selection window.
Each model in models.wrm must be on a set of consecutive lines. Comments are from a // to the right of the line. The tree structure is indicated by indentation, in a similar fashion to the way that it is displayed in the tree display windows within the database. Items at the same level are indented the same amount. Lines where the first non-whitespace character does not line up with an initial character in a line above will cause the parser to halt. Tab characters can be used to indent. It is assumed that there are tab stops every eight characters.
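The alignment rule can be illustrated in Python. The sketch below expands tabs to stops every eight characters and reads "an initial character in a line above" as the starting column of any word on an earlier line of the same model; that reading, and the name check_alignment, are assumptions.

```python
import re


def word_columns(line):
    """Return the starting column of every word, with tab stops every 8."""
    expanded = line.expandtabs(8)
    return [match.start() for match in re.finditer(r"\S+", expanded)]


def check_alignment(model_lines):
    """Return the number of the first misaligned line, or None if all line up."""
    seen = set()
    for number, line in enumerate(model_lines, 1):
        columns = word_columns(line.split("//", 1)[0])
        if not columns:
            continue
        if seen and columns[0] not in seen:
            return number
        seen.update(columns)
    return None
```

For instance, a second line indented with two tabs lines up with a tag that starts at column 16 on the first line, whereas a three-space indent does not.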
The models contain three types of entries, corresponding to the three types of items that can be found in the object trees: tags, pointers to other objects, and data such as numbers or constructed types. Tags are represented by themselves, pointers by ?ClassName for the relevant class, and data by a code word determining the data type (actually a tag -- see above). Currently the only code words used are Int, Float and Text. There is a potential source of confusion over text, because there is also a ?Text class. The distinction is that text in the ?Text class is all stored together and contains back pointers to the objects that contain it. Thus it is possible to search all the text in ?Text for e.g. a keyword. However many small text items will never need to be searched, and it is much more efficient to store them with the rest of the tree. These are stored as data with type Text (they can still be searched using the Query system, but in a much slower way). Look at the current models.wrm to see how we used this alternative. In addition comments can be inserted at any point in the tree. Comments are also searched when searching the ?Text class. To enter comments via the .ace file mechanism, precede the comment by -C. The comment must follow some Tag or some data.
Long texts and raw DNA sequences are handled as arrays of characters. The corresponding annotations are handled in a B class object with the same name, allowing the full flexibility of the tree and model structure.
The two fundamental restrictions on the model are that all the tags used must be unique in that model, and that the path rightwards from the root to each tag must only pass through other tags, not through object or data entries. Note that a sequence of data or pointer entries or even a constructed data type can (and often does) follow a tag. This allows vectors of several different types of information to be stored together, e.g. a contig pointer and the left and right integer endpoints in that contig determine the physical map position for a clone, and are stored together in a vector.
There are several special symbols that can also be used in a model:
    pairs Int UNIQUE Int
allows you to tabulate a function: the tag pairs will be followed by pairs of numbers, where the first number of each pair is not repeated in its column, and a UNIQUE second number is associated to the first number. On the other hand:
    gMap ?Chromosome UNIQUE Float UNIQUE Float
associates a unique pair of numbers to each possible chromosome. This allows you to give the position of a gene both on chromosome X, the official X chromosome, and myX, your own idea of the map of the X.
    Location ?Laboratory #Lab_Location
where the Lab_Location class contains a tag Freezer.
10.2 Syntax
We will try to give here a la Yacc, the exact grammar of the models.
    /* The models file is made of a succession of 'model' separated by a space line */
    models : | model models ;
    /* Each individual model starts with ?ClassName, followed by descriptors */
    model : ?Class Tree SpaceLine ;
    Tree : Vector | [UNIQUE] Branch Branch ... Branch ;
    Branch : Tag | Tag Tree ;
    Vector : Data Vector | Data | #Class ;
    Data : UNIQUE DataType | DataType ;
    DataType : Int | Float | Text | Pointer ;
    Pointer : ?Class | ?Class XREF Tag ;
Int, Float, Text are the basic C types, Tag is an entry in the tags or systags file, Class an entry in classes or sysclass.
There are additional global constraints: every B class must have a model; a given tag may appear explicitly only once inside a model, but may be hidden again in a constructed #Type; the XREFed tag must appear in the target class.
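The constraint that a tag appears explicitly only once in a model is easy to test mechanically. In the Python sketch below, every word that is not a pointer (?Class), a constructed type (#Class), a keyword, or the tag named after XREF (which belongs to the target class) is counted as a tag; duplicate_tags is a hypothetical helper, not an acedb function.

```python
from collections import Counter

KEYWORDS = {"UNIQUE", "XREF", "Int", "Float", "Text"}


def model_tags(model_text):
    """List the tags that appear explicitly in one model."""
    tags = []
    for line in model_text.splitlines():
        skip_next = False
        for word in line.split("//", 1)[0].split():
            if skip_next:
                skip_next = False      # tag of the target class, not ours
            elif word == "XREF":
                skip_next = True
            elif word in KEYWORDS or word[0] in "?#":
                continue
            else:
                tags.append(word)
    return tags


def duplicate_tags(model_text):
    """Return the sorted list of tags used more than once in the model."""
    counts = Counter(model_tags(model_text))
    return sorted(tag for tag, count in counts.items() if count > 1)
```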
10.3 Examples
This simple example shows the layout, the way XREF must be used, and
an example of a constructed type.
?Gene   Reference_Allele ?Allele
        Molecular_information   Clone ?Clone XREF Gene
                                Sequence ?Sequence XREF Gene
        Map     Physical        bMap ?Contig XREF Gene UNIQUE Int UNIQUE Int
                                Autopos
                Genetic         gMap ?Chromosome XREF Gene UNIQUE Float UNIQUE Float
        Mapping_data    2Point ?2Point
                        3Point ?3Point
        Location ?Laboratory #Lab_Location

?Lab_Location   Freezer Text
                LiquidN2 Text

ced-4   Reference_Allele n1162
        Molecular_information   Clone MT#JAL1
        Map     Genetic gMap III -2.7
        Mapping_data    2Point ced-4/unc-32
        Location Cambridge Freezer "Second Floor"
11 The ace edit language
Data can be exchanged between acedb implementations without knowing
the precise state of the receiving database, and even if the schemas of
the two databases are not identical. This very useful mechanism is
implemented by our ace edit command language.
11.1 Editing B objects
Data for objects in the B classes, for which there are models, can be
entered by placing it in a standard file, whose name ends in ``.ace''.
These files can be read in using the ``Read .ace files'' option of the
main selection window menu.
The format of a .ace file is pretty straightforward. An object is specified by leaving a blank line, then giving its class name, followed by an optional colon and the object name. Following this, without any blank lines, come data items for that object. Each is specified by a tag, followed by an optional colon and then the data (recall that each tag is unique within a model). If several items of data follow the tag then they will be entered in a vector to the right of the tag (assuming the model allows this). To enter data about a constructed type, just give the in between tag, say
    Allele e2310
    Location mylab Freezer myfreezer
Everything is done according to the model, but be aware that new objects are silently created at pointer sites if the name given is not currently known in the class expected by the model. Also, items in different classes can have the same name. So it is possible to give, for example, an allele with a tag that expects a gene according to the model, in which case the program will generate a new object in the gene class with the name of the allele. Probably not what was intended. We plan to develop a previewing system that would tell you what new items will be created on parsing a file, to allow a dry run before actual data entry.
Items are separated by white space (spaces or tabs) or commas. If your item contains white space or a comma (such as a title, and many object names), then it must be surrounded by double quotes. The finishing double quote can be omitted at the end of the line. If you want to include a double quote or a backslash then you can escape it with a backslash.
If you want to continue onto the next line then put a backslash at the end of the line. Note that it must directly precede the end of line character. Following this all the white space at the beginning of the next line is ignored, so in long pieces of text it is standard to finish a line with a space, followed by a backslash, then newline, then tab, then continuation. Unfortunately this system means there is no way to enter a hard carriage return as part of the text. acedb makes its own line breaks when displaying text. See however the LongText class.
Two consecutive forward slashes (//) signify the start of a comment. Any text following them is ignored. You can escape one of them with a backslash if you really want two forward slashes. It is useful to put a comment at the start of a .ace file to say what it contains, and where it comes from. Don't put a pure comment line inside an object description, because it will look like a blank line, and cause a new object to start (we should probably fix this).
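The lexical rules of the last three paragraphs (separators, quoting, backslash escapes, line continuation and // comments) can be summarised in a short Python sketch. It is not the acedb parser: the function names are hypothetical, // inside double quotes is treated as ordinary text (an assumption), and a line that ends in an escaped backslash is not handled.

```python
def join_continuations(text):
    """Merge lines ending in a backslash with the line that follows them."""
    logical, pending = [], None
    for raw in text.split("\n"):
        if pending is not None:
            raw = pending + raw.lstrip(" \t")  # leading white space is ignored
            pending = None
        if raw.endswith("\\"):
            pending = raw[:-1]
        else:
            logical.append(raw)
    if pending is not None:
        logical.append(pending)
    return logical


def tokenize(line):
    """Split one logical line into items."""
    tokens, current = [], []
    in_quotes = started = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i + 1])        # escaped character, taken literally
            started = True
            i += 2
            continue
        if in_quotes:
            if ch == '"':
                in_quotes = False
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes = started = True
        elif line.startswith("//", i):
            break                              # the rest of the line is a comment
        elif ch in " \t,":
            if started:
                tokens.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if started:                                # also covers a missing closing quote
        tokens.append("".join(current))
    return tokens
```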
11.2 More details
The ! (exclamation mark) lets you skip an object, if it is the
first character of a paragraph, or a tag, if it is the first
character of a line:
    !Gene myGene
    gMap X 3.8

    Gene myOtherGene
    !gMap V
    Phenotype "Pink with green stars"
will skip myGene completely and skip the gMap entry of myOtherGene. Note that in the second case, the syntax
    //gMap V
would be recognised as an empty line, so the parser would treat Phenotype as a class name, which would be wrong.
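The difference between ! and // can be seen in the following Python sketch, which splits an ace text into paragraphs and applies the skipping rule. It is a simplified model rather than the real parser: comments are stripped without regard to quoting, and read_objects is a hypothetical name.

```python
import re

HEADER = re.compile(r"^(\w+)\s*:?\s*(.*)$")


def read_objects(text):
    """Return (class, name, data lines) for every object that is not skipped."""
    paragraphs, current = [], []
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()
        if line:
            current.append(line)
        elif current:                 # an empty or comment-only line ends the object
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    objects = []
    for lines in paragraphs:
        if lines[0].startswith("!"):
            continue                  # the whole object is skipped
        match = HEADER.match(lines[0])
        if match is None:
            raise ValueError(f"bad object header: {lines[0]}")
        data = [line for line in lines[1:] if not line.startswith("!")]
        objects.append((match.group(1), match.group(2).strip('"'), data))
    return objects
```

Fed with the example above, the sketch returns a single Gene object with only its Phenotype line; with //gMap V in place of !gMap V it returns two objects, the second one wrongly in a class called Phenotype.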
The only inaccuracy is that -C cannot start a line. It must come on a data line anywhere after the initial tag, e.g.
    Paper "abc 1970"   // NB quotes needed since the name contains a space
    Author "Jones PB" -C "He did all the work"
is OK, as is
    Paper "abc 1970"
    Page 23 -C "Half way down" 32
but not
    Paper "abc 1970"
    -C "a great paper"
-D can be used either to delete a whole object, as in
    -D Paper "[abc 1970]"
or to delete a single field, as in
    Paper "abc 1970"
    -D Author "Jones PB"
This will delete just the one author, and the author tag only if that author was the only author. If you say
    Paper "abc 1970"
    -D Author
that will delete all authors, plus the author tag. To be precise, the last thing on the line is deleted, along with everything that follows it in the tree, and then if the current entry was unique at its level the tree is pruned back from this point until a branch point is reached.
Both Rename and Alias (-R and -A) apply only to objects, and they fuse data if both names already had data attached. There is no unalias function (how would you know how to unfuse?). I think that you cannot rename after aliasing (should look into that!).
An example of the subobject notation is
    Clone F32H7
    Gene unc-3
    Lab_Location Cambridge Freezer C2
    Lab_Location "St. Louis" Freezer 23
    Lab_Location "St. Louis" Minus70 A
I hope this is helpful.
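The pruning behaviour of -D can be modelled on a toy tree made of nested dictionaries. This Python sketch only illustrates the rule as worded in this section (delete the last item and everything to its right, then prune back to the nearest branch point); the data structure and the name delete_path are hypothetical and say nothing about how acedb stores its trees.

```python
def delete_path(tree, path):
    """Delete the entry reached by path and prune ancestors left without children."""
    trail, node = [], tree
    for key in path:
        if key not in node:
            return False              # nothing to delete
        trail.append((node, key))
        node = node[key]
    for parent, key in reversed(trail):
        del parent[key]               # removes the entry and its whole subtree
        if parent:
            break                     # a branch point: siblings remain, stop pruning
    return True
```

With two authors under Author, deleting one of them leaves the tag in place; deleting the second also removes the Author tag, and deleting the path that ends at Author removes all authors at once.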
11.3 Editing A objects
Each visible A class should have its own dumper and parser. The
easiest way to see the format is to ace dump a few objects of the
relevant class from the KeySet menu. The general format is:
    class : name
on the first line, followed by stuff and an object breaker.
DNA is entered as:
    DNA mysequence
    atgcgcgctatgatgctaaghtc
    tgttacagagt
Acceptable letters are the usual atgc, but also nwsky etc. for ambiguous bases (see the help file on dna). Upper and lower cases are both acceptable. The lines can be arbitrarily long or short. The entry is stopped on the first empty line. If the sequence is acceptable, a sequence B-object of the same name is created automatically and cross-referenced to the DNA. In general we break on the first empty line, except for the LongText class, which breaks on an isolated line reading ***LongTextEnd***.
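Reading a DNA entry as described here might look like the following Python sketch. The set of accepted letters is assumed to be the standard IUPAC nucleotide codes; the text only names atgc and nwsky, so the help file on dna remains the authority. parse_dna is a hypothetical name.

```python
IUPAC_CODES = set("acgtumrwsykvhdbn")  # assumption: the full IUPAC nucleotide set


def parse_dna(text):
    """Return (name, sequence) for a DNA entry that ends at the first empty line."""
    lines = text.splitlines()
    header = lines[0].split() if lines else []
    if len(header) != 2 or header[0] != "DNA":
        raise ValueError("the entry must start with: DNA name")
    pieces = []
    for line in lines[1:]:
        if not line.strip():
            break                               # the first empty line ends the entry
        pieces.append("".join(line.split()).lower())
    sequence = "".join(pieces)
    unknown = sorted(set(sequence) - IUPAC_CODES)
    if unknown:
        raise ValueError("unexpected letters: " + "".join(unknown))
    return header[1], sequence
```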
Long texts, such as paper abstracts are entered as:
    Paper : myPaper
    Abstract myAbstract

    LongText myAbstract
    What ever I want, including empty lines,
    until I reach verbatim a line saying:
    ***LongTextEnd***
Note that in this case you must yourself cross-reference the abstract to its paper.
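Because a LongText entry may contain empty lines, it has to be read up to its end marker rather than up to the first blank line. A minimal Python sketch, with the hypothetical name read_long_text and assuming the header has the form LongText name:

```python
END_MARK = "***LongTextEnd***"


def read_long_text(lines):
    """Return (name, text) for a LongText entry starting at its header line."""
    lines = iter(lines)
    header = next(lines, "").split(None, 1)
    if len(header) != 2 or header[0] != "LongText":
        raise ValueError("the entry must start with: LongText name")
    body = []
    for line in lines:
        if line.strip() == END_MARK:
            return header[1].strip(), "\n".join(body)
        body.append(line)             # empty lines are part of the text
    raise ValueError("missing " + END_MARK)
```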
KeySets can be read in using the form:
    KeySet myKeySet
    class1:name1
    class1:name2
    class3:name3
    class7:name4