acedb --- A C.elegans Database
III. Configuration guide

Richard Durbin
MRC Laboratory for Molecular Biology
Hills Road, Cambridge, CB2 2QH, UK
Email: rd@mrc-lmba.cam.ac.uk

Jean Thierry-Mieg
CNRS--CRBM
Route de Mende, BP 5051, 34033 Montpellier, France
Email: mieg@crbm1.cnusc.fr

December 18, 1992

1 Introduction

Acedb is the database program that we are writing for the nematode genome project. The basic introduction and instructions for use as part of the nematode project are included in the users' manual, to be found in the file wdoc/users_guide.tex. How to install the system is explained in wdoc/installation_guide.tex. The present document is about entering new data, editing data, and reconfiguring the database. It is assumed that you have already read the users' guide and are already familiar with the system. You do not need to be a Unix expert, but you must know how to move between directories and be comfortable with a good text editor. We strongly recommend the powerful, portable and copy-left gnu-emacs. The draft of a programmers' guide is now available in doc/programmers.guide to help you write new applications.

WARNING: You do not need to read anything in this manual if you are merely keeping up a version of the standard worm database. In fact, if you start applying anything in this manual then you may possibly get the database into a state where it will not accept the official updates, in which case you will have to start again from ace files.

Remember that the acedb kernel uses its own capabilities to manage the system. It is therefore extremely important that you do not edit the system configuration files, i.e. sysclass, sysoptions, systags and the beginning of the model file, without prior consultation with us.

2 Overview

In acedb the organisation of the basic classes, what they contain, and the overall flow of the program are not built into the core database software, but are determined by a set of around 15 configuration files gathered in the wspec directory. Fortunately, it is not too difficult to manipulate them because they serve very different purposes and fall into several independent categories: security, machine configuration, the link between the kernel and the applications, and data semantics. All of them are read at run time; a few are also read during compilation. You can probably edit any one of these files rather easily since they are self-documented. However, the purpose of this chapter is to explain the overall logic of the system.

The environment variable ACEDB should give the name of the parent directory of the wspec and database directories. If you use acedb for several sets of data, you may have a single executable. However, for each set you must have a pair of directories: database (to hold the data) and wspec (to explain the meaning of the data), and at least the value of the semaphore (file semkey) must differ.
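As a minimal sketch of the layout just described, the two directories can be derived from the ACEDB variable as follows. The function name acedb_dirs is ours and purely illustrative; it is not part of acedb.

```python
import os
from pathlib import Path


def acedb_dirs(environ=None):
    """Return the (wspec, database) directories found under $ACEDB."""
    env = os.environ if environ is None else environ
    root = Path(env["ACEDB"])  # raises KeyError if ACEDB is not set
    return root / "wspec", root / "database"
```

Each data set would then have its own value of ACEDB, and therefore its own pair of directories.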

3 Security

For its security, acedb relies on Unix. How to set up the read and write permissions of the various directories is explained in the installation manual. To summarize, the program runs setuid (4755). Only the administrator, i.e. the Unix owner of the database directory, can reorganize or reinitialise the database; only a few users, listed in passwd, can gain write access; everybody else can only read the data.

To implement these ideas, wspec contains just 2 very simple files dealing with security: passwd and semkey. Empty lines do not count. Anything following a pair of forward slashes // is a comment. The only remaining lines of passwd list the (unix) login names of those users who may gain write access.

In semkey, the only meaningful line reads SEM = n where n is some positive integer. Type the unix command ipcs to see if you have semaphores on your machine, and which ones are available. Choose 0 (zero) to disable the semaphore system. You should do this if you do not have semaphores mounted on your machine. Otherwise choose some unused small number. If you use acedb for several purposes, say Arabidopsis and C.elegans data, select a different semaphore value for each database.
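The conventions shared by these two files (empty lines ignored, // starting a comment) can be sketched as below. This is only an illustration, not the acedb code: the function names are ours, and we assume one login name per line in passwd.

```python
def meaningful_lines(text):
    """Yield lines with // comments and surrounding blanks removed, skipping empty ones."""
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()
        if line:
            yield line


def read_passwd(text):
    """Return the login names allowed write access (assumes one name per line)."""
    return list(meaningful_lines(text))


def read_semkey(text):
    """Return the integer n of the 'SEM = n' line; 0 disables the semaphores."""
    for line in meaningful_lines(text):
        name, sep, value = line.partition("=")
        if sep and name.strip() == "SEM":
            return int(value)
    raise ValueError("no 'SEM = n' line found")
```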

4 Machine configuration

A few files are specific to the X11 implementation (xace) of acedb. These are GraphPackage and xfonts. GraphPackage strictly follows the X11 syntax and we refer you to your relevant manuals. xfonts lists the fonts acedb will use. The syntax is as above. Empty lines do not count. // starts a comment. As implied by the enum FONTYPE listed in comments in the xfonts file, acedb expects a list of 18 font names: six plain fonts of various sizes, then six italics, then six bold. The first font is the default font. At least that one must really be available on your machine. If any other is missing, then acedb will issue a message and use the default font instead.

It is completely harmless to edit this file, but you must restart the program to see the effect, since it is read just once during initialisation. At the top of that file, you will find instructions on how to find out what fonts are available on your machine.
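The fallback rule for xfonts (18 names, the first being the default that replaces any missing font) might be modelled as follows. This is a sketch under our own assumptions: the style labels and the resolve_fonts function are hypothetical, and the real program asks the X server which fonts exist rather than being handed a set.

```python
STYLES = ("plain", "italic", "bold")  # order given in the text
SIZES_PER_STYLE = 6


def resolve_fonts(names, available):
    """Map (style, size_index) to a usable font name, falling back to the default."""
    if len(names) != len(STYLES) * SIZES_PER_STYLE:
        raise ValueError("expected 18 font names")
    default = names[0]
    if default not in available:
        raise ValueError("the default font must be available")
    table = {}
    for i, name in enumerate(names):
        style = STYLES[i // SIZES_PER_STYLE]
        # A missing font is replaced by the default one.
        table[(style, i % SIZES_PER_STYLE)] = name if name in available else default
    return table
```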

5 Display control

Acedb produces many windows on screen that, as far as the user is concerned, behave more or less independently. We call them displays. They can communicate and in a few cases one is subordinate to another, but generally each display has a life of its own. As of release 1-8, we distinguish 15 different species of displays, enumerated in the file disptype. The number of allowed displays of each type varies. You always have one Main, to control the overall behaviour of the program; if you kill it, you exit acedb. You have at most one display corresponding to each of the general tools, like Status or File_Chooser, and you may have an arbitrary number of data-displays like TREE, GMAP, Biblio etc.

The file disptype is read at compile time. It therefore follows the C syntax. It is also read at run time to establish a correspondence with the displays.

displays is read at run time. The meaningful lines start with _DDisplayType, where DisplayType must match the above enumeration, with a number of options corresponding to this display following on the same line. Continuation on the next line is however possible, a la Unix, when a backslash is the last character. The possible options are listed at the top of the file. As of release 1-8 we allow

GraphType default as TEXT_FIT. x and y as 0, w h as .3 .5

The displays can be registered as the preferred display type of a class in the file wspec/options.wrm. In that case, you must provide a displayFunction and register it in wspec/quovadis.wrm.

6 Help

The help file may be tailored to the data you manipulate. The syntax is simple. Each section of the help starts with **Name and ends when the next one begins. You just type whatever help page you want and register it with a display type using the -Help option in the displays file.
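To make the section rule concrete, here is a small sketch that splits a help file into pages. It assumes that **Name stands at the start of a line; split_help is an illustrative name, not an acedb routine.

```python
def split_help(text):
    """Return {name: body} for sections introduced by a line starting with '**'."""
    sections = {}
    name = None
    for line in text.splitlines():
        if line.startswith("**"):
            name = line[2:].strip()
            sections[name] = []
        elif name is not None:
            sections[name].append(line)
    return {key: "\n".join(body).strip() for key, body in sections.items()}
```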

7 Tags and Classes enumerations

In acedb, the data are organized in objects and classes (see the users' manual). First of all, the classes are enumerated in the files sysclass, for the system classes, and classes, for the data classes. These files are read at compile time and follow the C syntax. Each class name is hash-defined to an integer between 1 and 255. Classes are accessed in the source code as _VClass, and in the interface and ace files simply as Class. It is the responsibility of the programmer to ensure that each class is hash-defined to a different number and that these numbers are not modified throughout the lifetime of the database.

Exactly the same caveat applies to the files systags and tags which enumerate the vocabulary understandable to the system. Again the systags are used by the database manager and should not be edited by the end user, the tags correspond to the regular data. The tags (in fact their name preceded by an underscore) are #define'd to positive integers; they must appear in order. Gaps in the list of integers are allowed, but they use up space in the system, so don't jump to 100,000! Numbers less than 128 are reserved for systags.wrm. The integers are of course the numbers used to represent them in the database. The first few tags in systags.wrm are used to specify the data types. The pseudo-tags _LastC and _LastN terminate the lists of character types and numerical types respectively. There can be up to 2^24 = 16777216 different tags. This is also the maximum number of objects in any one class (since tags are in fact the objects in class 0, the System class).

If a tag or class name is doubly defined, or if a number is used twice, or if the numbers have been modified between sessions, the system will complain at run time. There is however no required consistency across machines. In fact, this is a desirable property of acedb, because it allows communications between implementations of acedb which do not implement exactly the same schema. Say, acedb can easily exchange sequences of DNA between C.elegans and Arabidopsis, although the complete lists of tags and classes differ.
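The constraints on the tags file (unique names, unique numbers, increasing order, numbers below 128 reserved for systags, fewer than 2^24 tags) can be checked before compiling with something like the sketch below. It assumes lines of the form "#define _Name number"; the checker and the tag numbers in the usage example are ours, not part of acedb.

```python
import re

MAX_TAGS = 2 ** 24      # limit quoted in the text
FIRST_USER_TAG = 128    # smaller numbers are reserved for systags

DEFINE = re.compile(r"^\s*#define\s+_(\w+)\s+(\d+)")


def check_tags(text):
    """Return a list of problems found in '#define _Name number' lines."""
    problems, seen_names, seen_numbers, previous = [], set(), set(), -1
    for line in text.splitlines():
        match = DEFINE.match(line)
        if not match:
            continue  # comments and other C lines are ignored
        name, number = match.group(1), int(match.group(2))
        if name in seen_names:
            problems.append(f"{name}: defined twice")
        if number in seen_numbers:
            problems.append(f"{name}: number {number} used twice")
        if number < previous:
            problems.append(f"{name}: out of order")
        if not FIRST_USER_TAG <= number < MAX_TAGS:
            problems.append(f"{name}: number {number} out of range")
        seen_names.add(name)
        seen_numbers.add(number)
        previous = max(previous, number)
    return problems
```

For instance, check_tags("#define _Author 130\n#define _Title 131\n") returns an empty list, gaps between numbers being allowed.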

The reason why the numbers must not be modified is that these numbers are actually written on disk and in the compiled code in lieu of the word they stand for. If you change the name after compilation of the code, or after entering, and hence compiling, some data, you invalidate that correspondence. It is theoretically safe to change the name of a tag, without changing its intended meaning. It will create no display problem. But you will have to change it in the source code before recompiling and in the ace files before parsing them. The correct way to rename something in acedb is to use the alias/rename option of the main menu, or the -R -A options in an ace file. As of 1-8, we have not extended the alias possibility to tags, although this is clearly feasible, but we allow a dynamic renaming of the classes as one of the possibilities offered in the options file.

As of October 1992, everybody is still running the same acedb executable. We coordinate the use of tags by attributing slots to the various interested people, as we have been doing between ourselves during the development of the program. So to have your own tags, and remain binary compatible, contact us.

8 Class options

The options and sysoptions files are read only at run time. They give some detail on how classes should be managed. The format is the same as that of the displays file. The meaningful lines start with _VClass, where _VClass must match the enumeration in classes and sysclass.

First of all a class must be of type A, B or X. These three types are mutually exclusive and should never be modified. X is purely for kernel use. A classes are Arrays, or tuples; they are manipulated by content, as in a relational database. B classes are Trees; they are manipulated according to their models. The libraries available to access A and B objects are totally different, and if the class type was changed during the lifetime of the database, or simply after the application code has been written, the system would complain bitterly.

The other options can be changed at will. Hidden/Visible indicates whether the class is listed in the main acedb display. Note that you can nevertheless list the members of a hidden class by typing the Query ">?Class" (Find all members of Class Class) in the query window.

Display indicates the preferred display type of that class. The chosen display type must be listed in displays and registered in quovadis.

Title is an option used only in tree-displays. If a given object of the class is referred to by another object in a tree-display, then the given object will be indicated by the Text following the chosen Title field if available, rather than by its own name. For instance, we display the title of the papers, rather than their cryptic medline or cgc identifier.

Rename lets you alias the name of a class to some newName. NewName will show in lists, both the old and the new name will be recognized in ace files and queries, and the old name is used in the code. In case some newName matches some old name, the new name takes precedence.

CaseSensitive: with this option set, the names in this class will be case sensitive. Never undo this option once it has been used, since you would start mixing in an unpredictable way the objects whose names differ only by upper/lower case letters.

A classes may have user-defined parse and dump functions.

As of 1-8, classes default as -B -V -D TREE, A classes further default as -H -D ZERO.

9 Linking

The file quovadis is used to link new applications to the kernel. This file is read at compile time and follows the ANSI-C syntax. We hope that the file is sufficiently self-documented to be clear to C programmers. It contains the correct prototypes, then the main menu, i.e. the pop-up menu of the main acedb window. Note the menuSpacer, which inserts empty lines in the menu and also delimits menu paragraphs that can be excluded or reactivated via the menuSuppress and menuInit calls.

Then inside the function displayInit, we find the actual link to the display, parse and dump functions. You need to set up a display function for every display type which is declared as the preferred display type of a given class in options. Apart from that, you must provide a parse function and a dump function for every visible A class, to allow reading and writing members of this class in ace format. In principle, you should also provide for them an asnParse and an asnDump function to allow communication in ASN.1 notation, although we have not done this systematically so far.

10 Models

We finish with the most interesting file of this directory: models. Each B class must have a model for it in this file. The zero'th entry in the class is an object with the name ?ClassName, whose associated tree is used to store the model for all the other members of the class. You can display this tree like any other in the database by selecting ?ClassName in the selection window.

Each model in models.wrm must be on a set of consecutive lines. Comments are from a // to the right of the line. The tree structure is indicated by indentation, in a similar fashion to the way that it is displayed in the tree display windows within the database. Items at the same level are indented the same amount. Lines where the first non-whitespace character does not line up with an initial character in a line above will cause the parser to halt. Tab characters can be used to indent. It is assumed that there are tab stops every eight characters.
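The alignment rule can be checked mechanically. The sketch below expands tabs to stops every eight characters and reports lines whose first item does not start at a column where some item starts in a line above. It is our own reading of the rule, not the acedb parser, and the ?Paper model in the usage note is invented for the illustration.

```python
def indent_columns(line, tabsize=8):
    """Return the start column of each item on the line, after tab expansion."""
    expanded = line.split("//", 1)[0].expandtabs(tabsize)  # drop comments first
    columns, in_item = [], False
    for col, char in enumerate(expanded):
        if char.isspace():
            in_item = False
        elif not in_item:
            columns.append(col)
            in_item = True
    return columns


def check_alignment(model_lines):
    """Return the 1-based numbers of lines whose first item lines up with nothing above."""
    bad, known = [], set()
    for number, line in enumerate(model_lines, start=1):
        cols = indent_columns(line)
        if not cols:
            continue
        if known and cols[0] not in known:
            bad.append(number)
        known.update(cols)
    return bad
```

With the lines "?Paper  Title  Text", a tab followed by "Author ?Author", and six spaces followed by "Journal ?Journal", only the third line is reported: the tab brings Author to column 8, under Title, whereas column 6 matches nothing above.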

The models contain three types of entries, corresponding to the three types of items that can be found in the object trees: tags, pointers to other objects, and data such as numbers or constructed types. Tags are represented by themselves, pointers by ?ClassName for the relevant class, and data by a code word determining the data type (actually a tag -- see above). Currently the only code words used are Int, Float and Text. There is a potential source of confusion over text, because there is also a ?Text class. The distinction is that text in the ?Text class is all stored together and contains back pointers to the objects that contain it. Thus it is possible to search all the text in ?Text for e.g. a keyword. However many small text items will never need to be searched, and it is much more efficient to store them with the rest of the tree. These are stored as data with type Text (they can still be searched using the Query system, but in a much slower way). Look at the current models.wrm to see how we used this alternative. In addition comments can be inserted at any point in the tree. Comments are also searched when searching the ?Text class. To enter comments via the .ace file mechanism, precede the comment by -C. The comment must follow some Tag or some data.

Long texts and raw DNA sequences are handled as arrays of characters. The corresponding annotations are handled in a B class object with the same name, allowing the full flexibility of the tree and model structure.

The two fundamental restrictions on the model are that all the tags used must be unique in that model, and that the path rightwards from the root to each tag must only pass through other tags, not through object or data entries. Note that a sequence of data or pointer entries or even a constructed data type can (and often does) follow a tag. This allows vectors of several different types of information to be stored together, e.g. a contig pointer and the left and right integer endpoints in that contig determine the physical map position for a clone, and are stored together in a vector.

There are several special symbols that can also be used in a model:

UNIQUE
In general, where there is a ?class name in the model, any number of objects from that class can be entered (similarly for the data types Int etc.). The entries form a vertical column ordered by the time of addition. However, UNIQUE specifies that only one entry can be present. Whenever a new piece of data is added in that location it overwrites the old one. This is particularly useful where the data will be used to determine some property of the object, such as a map position. This must be unique in order to know where to draw it! As of 1-8, the scope of UNIQUE is one step right. Thus, the model

pairs Int UNIQUE Int
allows you to tabulate a function: the tag pairs will be followed by pairs of numbers, where the first number of each pair is not repeated in its column, and a UNIQUE second number is associated with the first number. On the other hand:

gMap ?Chromosome UNIQUE Float UNIQUE Float
associates a unique pair of numbers with each possible chromosome. This allows you to give the position of a gene both on chromosome X, the official X chromosome, and myX, your own idea of the map of the X.
XREF
This can follow after any ?class name entry. It itself must be followed by a tag which is present in the model for the target class. The consequence of this is that, whenever a new pointer is entered there, a cross reference back to the original parent will be made in the object pointed to, at the specified tag position in that object. Look at models.wrm to see the examples.
ANY
This specifies that a pointer to an object of any class can follow. ANY should only be used in X classes, i.e. the system classes searched by the Text Search (Grep) command.
REPEAT
This allows an indefinite number of items of the same type to be listed on one line. The order of the items is guaranteed but there is an implied dependency: if you update some item along the list, all the ones to its right are lost.
FREE
This is like a combination of ANY and REPEAT, but yet more general. It allows an application to add an arbitrary subtree to the right of the current location. Needless to say, no type checking or cross referencing is done. Indeed data cannot be entered into such a location by the general ace file parser, nor exported in ASN.1. It is basically provided for private data structures maintained by sections of user code that want to defeat the database's type checking facilities. Caveat user.
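To illustrate the effect of UNIQUE described earlier in this list, here is a toy model of a column of entries and of the "pairs Int UNIQUE Int" construct. It is only a sketch of the behaviour as we describe it, using plain Python lists and dictionaries rather than the acedb tree library; the function names are invented.

```python
def add_entry(column, value, unique=False):
    """Add value to a column of entries; with unique=True it replaces the old entry."""
    if unique:
        column[:] = [value]          # the new datum overwrites the old one
    elif value not in column:
        column.append(value)         # entries stay ordered by time of addition
    return column


def add_pair(table, first, second):
    """Model 'pairs Int UNIQUE Int': a single second number per first number."""
    table[first] = second            # a repeated first number overwrites its partner
    return table
```

Entering the pairs (1, 10), (2, 20) and then (1, 30) leaves the table holding 30 for 1 and 20 for 2, which is what tabulating a function means here.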

10.1 Constructed types

In the model, to the right of a tag or even of a pointer, you can put the Type Constructor #Class_name. This type must be alone in its column. This is functionally equivalent to adding in that place the whole model of the corresponding class. The difference, however, is that the path to a terminal leaf may now pass through tags, then data, then #Class and more tags. This is very useful in a lot of cases. Say that you want to give the freezer number where you keep an allele. This number is attached in fact to the pair allele-lab. We store this information by having in the model for class allele a Tag

Location ?Laboratory #Lab_Location
where the Lab_Location class contains a tag Freezer.

10.2 Syntax

We will try to give here, in the style of Yacc, the exact grammar of the models.

 /* The models file is made of a succession of 'model' 
	separated by a space line */

   models : 
| model models 
; 

/* Each individual model starts with ?ClassName,
 followed by descriptors */

   model : ?Class Tree SpaceLine   
; 
 
   Tree : Vector 
| [UNIQUE] Branch 
Branch  
...     
Branch  
; 

   Branch : Tag 
| Tag Tree 
;

   Vector : Data Vector 
| Data  
| #Class  
;  

   Data   : UNIQUE DataType  
| DataType  
;  

   DataType : Int  
| Float  
| Text  
| Pointer  
;  

   Pointer : ?Class  
| ?Class XREF Tag  
;  
Int, Float, Text are the basic C types, Tag is an entry in the tags or systags file, Class an entry in classes or sysclass.

There are additional global constraints: every B-class must have a model; a given tag may appear explicitly only once inside a model, but may be hidden again in a constructed #Type; and the XREFed tag must appear in the target class.

10.3 Examples

This simple example shows the layout, the way XREF must be used, and an example of a constructed type.

?Gene  Reference_Allele  ?Allele 
       Molecular_information  Clone  ?Clone XREF Gene
                              Sequence  ?Sequence XREF Gene
       Map     Physical  bMap ?Contig XREF Gene UNIQUE Int UNIQUE Int
                         Autopos
               Genetic   gMap ?Chromosome XREF Gene UNIQUE Float UNIQUE Float
                         Mapping_data  2Point  ?2Point 
                                       3Point  ?3Point
       Location ?Laboratory #Lab_Location


?Lab_Location  Freezer   Text
               LiquidN2  Text
                



ced-4  Reference_Allele  n1162
       Molecular_information  Clone  MT#JAL1 
       Map     Genetic  gMap  III -2.7
                        Mapping_data  2Point  ced-4/unc-32
       Location  Cambridge  Freezer  "Second Floor" 
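The cross-referencing requested by "Clone ?Clone XREF Gene" in the ?Gene model above can be pictured with the following toy store. It is a sketch only: the Store class and its method names are invented for the illustration and do not reflect the acedb kernel.

```python
from collections import defaultdict


class Store:
    """Toy object store: objects[class][name][tag] is a list of values."""

    def __init__(self):
        self.objects = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def add_pointer(self, cls, name, tag, target_cls, target, xref=None):
        """Add a pointer under tag; with xref, also record the back reference."""
        self._append(cls, name, tag, target)
        if xref is not None:
            # The target object receives a pointer back to the parent at tag xref.
            self._append(target_cls, target, xref, name)

    def _append(self, cls, name, tag, value):
        values = self.objects[cls][name][tag]
        if value not in values:
            values.append(value)
```

Calling store.add_pointer("Gene", "ced-4", "Clone", "Clone", "MT#JAL1", xref="Gene") records the clone under ced-4 and, in the same step, ced-4 under the Gene tag of clone MT#JAL1.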

10.4 Reconfigurations

The models file is read in when you create the database and when you choose the options Read-Models or Add-Update-File from the pop-up menu of the main acedb window. So it is possible to change it without changing the database. In general this will be harmless, in that the system will still work, and the tree displays will still work (assuming you didn't change the names of any of the existing tags). However the parts of the program that search for particular items in an object, such as the genetic map position, will only work if the leftward path back up to the root has remained the same. It is always safe to add to the model, and we recommend that if you do anything other than this you remake the database from .ace files (see below). To do this requires a way to dump the state of the database into one or more .ace files. This is done using the dump option from the main menu, to dump the whole database, or from the dump option of a selection window to dump a list of objects. When you get the .ace file, you could reenter it into a rearranged system which used the same tags, or edit it with some global edit script if you want to do something more complicated.

11 The ace edit language

Data can be exchanged between acedb implementations without knowing the precise state of the receiving database and even if the schemas of the two databases are not identical. This very useful mechanism is implemented by our ace edit command language.

11.1 Editing B objects

Data for objects in the B classes, for which there are models, can be entered by placing it in a standard file, whose name ends in ".ace". These files can be read in using the "Read .ace files" option of the main selection window menu.

The format of a .ace file is pretty straightforward. An object is specified by leaving a blank line, then giving its class name, followed by an optional colon and the object name. Following this, without any blank lines, come data items for that object. Each is specified by a tag, followed by an optional colon and then the data (recall that each tag is unique within a model). If several items of data follow the tag then they will be entered in a vector to the right of the tag (assuming the model allows this). To enter data about a constructed type, just give the intermediate tag, say

Allele e2310
Location mylab Freezer myfreezer
Everything is done according to the model, but be aware that new objects are silently created at pointer sites if the name given is not currently known in the class expected by the model. Also, items in different classes can have the same name. So it is possible to give, for example, an allele with a tag that expects a gene according to the model, in which case the program will generate a new object in the gene class with the name of the allele. Probably not what was intended. We plan to develop a previewing system that would tell you what new items will be created on parsing a file, to allow a dry run before actual data entry.

Items are separated by white space (spaces or tabs) or commas. If your item contains white space or a comma (such as a title, and many object names), then it must be surrounded by double quotes. The finishing double quote can be omitted at the end of the line. If you want to include a double quote or a backslash then you can escape it with a backslash.
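These separation rules can be sketched as a small tokenizer. This is an illustration of the rules as stated above, not the acedb parser; the name split_items is ours, and comments (//) are deliberately not handled here.

```python
def split_items(line):
    """Split one .ace line into items, honouring double quotes and backslash escapes."""
    items, current, in_quotes, started = [], [], False, False
    chars = iter(line)
    for char in chars:
        if char == "\\":
            current.append(next(chars, ""))  # take the escaped character literally
            started = True
        elif char == '"':
            in_quotes = not in_quotes
            started = True                   # "" counts as an (empty) item
        elif not in_quotes and (char.isspace() or char == ","):
            if started:
                items.append("".join(current))
            current, started = [], False
        else:
            current.append(char)
            started = True
    if started:                              # also closes a quote left open at end of line
        items.append("".join(current))
    return items
```

For example, the line Author "Jones PB", "Smith J" yields the three items Author, Jones PB and Smith J.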

If you want to continue onto the next line then put a backslash at the end of the line. Note that it must directly precede the end of line character. Following this all the white space at the beginning of the next line is ignored, so in long pieces of text it is standard to finish a line with a space, followed by a backslash, then newline, then tab, then continuation. Unfortunately this system means there is no way to enter a hard carriage return as part of the text. acedb makes its own line breaks when displaying text. See however the LongText class.
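The continuation rule can be sketched as follows. The function is ours and simplifies one point: it does not distinguish a continuation backslash from an escaped backslash that happens to end a line.

```python
def join_continuations(text):
    """Merge each line ending in a backslash with the line that follows it."""
    logical, pending = [], None
    for line in text.split("\n"):
        if pending is not None:
            line = pending + line.lstrip(" \t")  # leading white space is ignored
        if line.endswith("\\"):
            pending = line[:-1]                   # drop the backslash, keep reading
        else:
            logical.append(line)
            pending = None
    if pending is not None:                       # file ended on a continuation
        logical.append(pending)
    return logical
```

This is why it is standard to leave a space before the backslash: the white space at the start of the next line is dropped, so the space before the backslash is the only separator left between the two words.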

Two consecutive forward slashes (//) signify the start of a comment. Any text following them is ignored. You can escape one of them with a backslash if you really want two forward slashes. It is useful to put a comment at the start of a .ace file to say what it contains, and where it comes from. Don't put a pure comment line inside an object description, because it will look like a blank line, and cause a new object to start (we should probably fix this).

11.2 More details

The exclamation mark, !, lets you skip an object, if it is the first character of a paragraph, or a tag, if it is the first character of a line:

!Gene myGene
gMap X 3.8

Gene myOtherGene
!gMap V
Phenotype "Pink with green stars"
will skip myGene completely and skip the gMap entry of myOtherGene. Note that in the second case, the syntax

//gMap V
would be recognised as an empty line so the parser would treat Phenotype as a Class name, which would be wrong.
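The difference between ! and // can be seen in the sketch below, which splits a .ace text into paragraphs the way the text describes: comments are removed first, a line left empty ends the paragraph, and ! skips a paragraph or a line. The functions are illustrative only, and the comment stripping is simplified (it ignores quotes and backslash escapes).

```python
def strip_comment(line):
    """Remove a trailing // comment (quotes and escapes are not handled here)."""
    return line.split("//", 1)[0].rstrip()


def active_paragraphs(text):
    """Return the paragraphs of a .ace text as lists of lines, after '!' skipping."""
    paragraphs, current = [], []
    for raw in text.splitlines() + [""]:      # the extra "" flushes the last paragraph
        line = strip_comment(raw)
        if line.strip():
            current.append(line)
            continue
        # An empty line (or a pure comment line) ends the current paragraph.
        if current and not current[0].startswith("!"):
            paragraphs.append([l for l in current if not l.startswith("!")])
        current = []
    return paragraphs
```

Run on the example above, this keeps a single paragraph for myOtherGene with its Phenotype line, whereas replacing !gMap V by //gMap V splits the object in two and leaves Phenotype at the head of a new paragraph.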

The only inaccuracy is that -C cannot start a line. It must come on a data line anywhere after the initial tag, e.g.

Paper "abc 1970" // NB quotes needed since the name contains a space
Author "Jones PB"  -C "He did all the work"
is OK, as is

Paper "abc 1970"
Page 23 -C "Half way down" 32
but not

Paper "abc 1970"
-C "a great paper"
-D can be used either to delete a whole object, as in

-D Paper "[abc 1970]"
or to delete a single field, as in

Paper "abc 1970"
-D Author "Jones PB"
This will delete just the one author, and the author tag only if that author was the only author. If you say

Paper "abc 1970"
-D Author
that will delete all authors, plus the author tag. To be precise: the last thing on the line is deleted, along with everything that follows it in the tree, and then if the current entry was unique at its level the tree is pruned back from this point until a branch point is reached.

Both Rename and Alias (-R and -A) apply only to objects, and they fuse data if both names already had data attached. There is no unalias function (how would you know how to unfuse?). I think that you cannot rename after aliasing (should look into that!).

An example of the subobject notation is

Clone F32H7
Gene unc-3
Lab_Location Cambridge Freezer C2
Lab_Location "St. Louis" Freezer 23
Lab_Location "St. Louis" Minus70 A
I hope this is helpful.
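The -D rule described in this section (delete the last thing on the line with everything to its right, then prune back to the nearest branch point) can be sketched on a toy tree made of nested dictionaries. This representation and the function name are ours; the real B objects are stored quite differently.

```python
def delete_path(tree, path):
    """Delete the node at path (a list of keys), then prune emptied ancestors."""
    if not path:
        return
    trail = []                     # (parent dict, key) pairs along the path
    node = tree
    for key in path:
        if key not in node:
            return                 # nothing to delete
        trail.append((node, key))
        node = node[key]
    for parent, key in reversed(trail):
        del parent[key]            # removes the key and its whole subtree
        if parent:                 # siblings remain: a branch point is reached
            break
```

With a paper holding two authors, deleting the path Author, "Jones PB" leaves the other author in place; deleting the last remaining author also removes the Author tag; deleting the path Author alone removes all authors at once.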

11.3 Editing A objects

Each visible A class should have its own dumper and parser. The easiest way to see the format is to ace dump a few objects of the relevant class from the KeySet menu. The general format is:

class : name
on the first line, followed by the class-specific data and an object breaker.

DNA is entered as:

DNA mysequence
atgcgcgctatgatgctaaghtc
tgttacagagt
Acceptable letters are the usual atgc, but also nwsky etc for ambiguous bases (see the help file on dna). Upper and lower cases are both acceptable. The lines can be arbitrarily long or short. The entry is stopped on the first empty line. If the sequence is acceptable, a sequence B-object of the same name is created automatically and cross-referenced to the DNA. In general we break on the first empty line, except for the LongText class, which breaks on an isolated line reading ***LongTextEnd***.
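A reader for this DNA format might look like the sketch below. Two points are our assumptions rather than facts about acedb: the accepted alphabet is taken to be the standard IUPAC nucleotide codes (the text names only atgc and "nwsky etc"), and the result is lower-cased.

```python
IUPAC = set("acgtrykmswbdhvn")  # assumed alphabet, see the help file on dna


def read_dna(lines):
    """Concatenate sequence lines up to the first empty line, lower-cased."""
    sequence = []
    for line in lines:
        line = line.strip()
        if not line:
            break                      # the entry stops on the first empty line
        bad = set(line.lower()) - IUPAC
        if bad:
            raise ValueError(f"unacceptable letters: {sorted(bad)}")
        sequence.append(line.lower())
    return "".join(sequence)
```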

Long texts, such as paper abstracts are entered as:

Paper : myPaper 
Abstract myAbstract

LongText myAbstract

Whatever I want, including empty lines, until I reach verbatim a
line saying: 

***LongTextEnd***
Note that in this case you must cross-reference the abstract to its paper yourself.
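Since a LongText may contain empty lines, its reader cannot stop on the first one and must look for the end marker instead. A minimal sketch, with a function name of our own choosing:

```python
END_MARK = "***LongTextEnd***"


def read_long_text(lines):
    """Return the text that precedes the isolated end-marker line."""
    body = []
    for line in lines:
        if line.rstrip("\r\n") == END_MARK:   # the marker must be matched verbatim
            return "\n".join(body)
        body.append(line.rstrip("\r\n"))
    raise ValueError("missing " + END_MARK)
```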

KeySets can be read in, in the form:

KeySet myKeySet
class1:name1
class1:name2
class3:name3
class7:name4
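Reading such a list back amounts to splitting each line at the first colon. The sketch below assumes that the list ends on the first empty line, as for the other A classes, and it does not handle quoted names; read_keyset is an illustrative name.

```python
def read_keyset(lines):
    """Return (class, name) pairs from 'class:name' lines, stopping at an empty line."""
    members = []
    for line in lines:
        line = line.strip()
        if not line:
            break
        cls, sep, name = line.partition(":")
        if not sep:
            raise ValueError(f"expected class:name, got {line!r}")
        members.append((cls.strip(), name.strip()))
    return members
```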