(1) MRC Laboratory of Molecular Biology
Hills Road, Cambridge CB2 2QH, UK
(email: rd@mrc-lmb.cam.ac.uk)
(2) CNRS Physique Mathematique and CRBM
BP 5051, 34033 Montpellier, France
(email: mieg@kaa.cnrs-mop.fr)
However, in another sense, the fear of data overload has some justification. This is because the issue is not just one of simple storage, but of integrating the new systematic data with all our accumulated experimental knowledge from classical and molecular genetics, so as to be able to select what is relevant for each scientific question. Until recently, with the results disseminated by standard scientific publication and discussion, this process of integration has taken place in the minds of the researchers. Even if all details could not be followed, the salient facts involved with some specialised function could be managed. It is this storage medium, the human brain, that is incapable of handling all the genomic data, not the computer.
Clearly what is required is a database system that, in addition to storing the results of large scale sequencing and mapping projects, allows all sorts of experimental genetic data to be maintained and linked to the maps and sequences in as flexible a way as possible. Since this is a new type of system, it seems very desirable to have a database whose structure can evolve as experience is gained. However, this is in general very difficult with existing database systems, both relational, such as Sybase and Oracle, and object-oriented, such as ObjectStore.
We were faced with these issues three years ago, when starting a pilot project for obtaining the complete genomic sequence of the nematode C. elegans. There were dual needs: first for a system in which to maintain data for internal purposes, and second for one in which to make it public. We wanted to build on previous experience gained while building the physical clone map for C. elegans, which had been done using the program CONTIG9 (Sulston et al., 1988). An adapted form of this program, called PMAP, was made publicly available to the C. elegans research community, together with regularly updated copies of the in house data. This rapid and complete access to the map, even in incomplete form, proved to be extremely popular and successful, soon becoming a crucial resource when cloning worm genes. It therefore seemed sensible to extend the same approach, and develop a single database to hold sequence, physical and genetic map, and references, that we could use in house, and that we could distribute in read-only form freely within the worm community.
This led directly to the database program ACEDB, which is described in this chapter. Rather than being limited to the specific data that we could envisage when we started, we decided to write a general database management system that would allow easy and frequent extension and adaptation of the database schema as the project developed. For this reason, it has been comparatively easy to adapt ACEDB to be used by other genome projects working with other organisms. At the time of writing (March, 1993) there are public databases for the model plant Arabidopsis thaliana, and the mycobacteria M. leprae and M. tuberculosis, which are the pathogens for leprosy and tuberculosis. Several other databases for public distribution are under development. ACEDB is also being used internally at several sites, for example for storage of physical mapping results from human and Drosophila projects. Finally, it is being used as one of the core pieces of software in the IGD project (Chapter ??? this book), which plans to bring together all public human genome data in an integrated genome database. ACEDB is being used both as the primary graphics front end of IGD and as one of the alternative back-end data storage systems.
The internal structures of the system, which are more general, and which contain some of the more novel features, will be described in later sections. There are interactive tools for many of these more general features available to the user as part of the graphical interface, but discussion of these will be delayed until later. In this section we will just briefly describe the windows used to display the different classes of object in the database.
Figure 1 Main window, Genetic map, and text window of one gene
If the user double clicks on any item a new window pops up with text information about the object. This text information is laid out hierarchically, in what is called a tree structure. The section below on ``Organisation of data'' describes this tree structure further; it is in fact the primary way of storing information in ACEDB. The maps are merely derived from the data stored in tree form with each object.
Figure 2 Physical map window and hybridisation grid window.
Figure 3 one or perhaps 2 sequence windows.
As well as displaying annotations and precalculated information, the sequence window supports several types of calculation. In particular there is a range of facilities derived from the Genefinder program (Green and Hillier, personal communication) for predicting gene structures in genomic DNA sequence based on likelihood predictions of splice sites and codon usage. These are in fact used to annotate the nematode genomic DNA sequence before submission to EMBL. There are also restriction site detection tools, and tools for extracting subsequences and translations of predicted genes.
Each object is represented by a unique identifier, its name, which is followed by an ensemble of attributes organised into a tree. The nodes at branchpoints of the tree are all named. The branches typically terminate either in pointers to other objects or in data, which are numerical values or character strings. A bare branch ending just in the named branchpoint can be used to indicate presence of a binary property. There is also the possibility of constructed subobjects, similar to expanding a leaf node in place recursively into a full object with its own branches, rather than maintaining merely a pointer to an external object. Arbitrary text comments can be attached freely at any point in the tree.
The objects are allocated to classes. Each class has a model, specifying the maximal extension of the branches, and the types of data or classes of pointer permitted at each position. Individual objects, which are instances of the class, in general only have a part of the branching pattern permitted by the model. This approach gives a triple advantage:
?Gene   Reference_allele ?Allele
        Molecular_information  Clone ?Clone XREF Gene
                               Sequence ?Sequence XREF Gene
        Map  Physical  pMap UNIQUE ?Contig XREF Gene UNIQUE Int
                       Autopos
             Genetic   gMap ?Chromosome XREF Gene UNIQUE Float UNIQUE Float
                       Mapping_data  2point ?2point_data
                                     3point ?3point_data
        Location ?Laboratory #Lab_Location

?Lab_Location  Freezer Text
               Liquid_N2 Text

ced-4   Reference_allele n1162
        Molecular_information  Clone MT#JAL1
        Map  Genetic  gMap III -2.7
                      Mapping_data  2point "ced-4 unc-32/+ +"
        Location Cambridge Freezer A6

Note that each object belongs to just one class. We have deliberately chosen to avoid multiple inheritance. This concept is at the same time notoriously difficult to implement efficiently and very difficult to use, because the inheritance graph easily becomes encumbered with potential conflicts amongst super-classes.
Our alternative to multiple inheritance is to restrict the number of classes, but allow a wider variety of objects within a class. In this system, it is possible that two objects in the same class have few or no branches in common. For example consider two genes, the first studied by classical genetics and uncloned, the second cloned by similarity to a protein in another organism. These objects could be considered as archetypes of two subclasses of the class gene. But such simple cases are relatively rare, a third gene could have data for some fields of one type, and some of the other, and one is rapidly led to a combinatorial explosion in the number of classes. Our approach lets us capture without difficulty all the intermediate cases, and we only need around fifty classes to hold nearly a hundred thousand heterogeneous objects.
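The idea that one model can cover very different objects of the same class can be sketched in a few lines of C. This is only an illustration, not ACEDB's internal representation: the `Node` type, the `model_allows` function and the cut-down `gene_model` (including its nesting) are assumptions made for the example. Each object is reduced to the tag paths it uses, and a path is accepted if it can be followed through the model tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A named branchpoint in a model tree; children is a NULL-terminated array.
   (Hypothetical type for this sketch, not the ACEDB kernel structure.) */
typedef struct Node {
    const char *tag;
    const struct Node *const *children;
} Node;

/* Return 1 if the tag path path[0..n-1] can be followed from the model root. */
int model_allows(const Node *model, const char *const *path, size_t n)
{
    assert(model != NULL && path != NULL);
    const Node *cur = model;
    for (size_t i = 0; i < n; i++) {
        const Node *next = NULL;
        if (cur->children)
            for (const Node *const *c = cur->children; *c; c++)
                if (strcmp((*c)->tag, path[i]) == 0) { next = *c; break; }
        if (!next)
            return 0;               /* branch not permitted by the model */
        cur = next;
    }
    return 1;
}

/* A cut-down version of the ?Gene model (nesting assumed for the example). */
static const Node n_clone   = {"Clone", NULL};
static const Node n_seq     = {"Sequence", NULL};
static const Node *const mol_kids[]  = {&n_clone, &n_seq, NULL};
static const Node n_mol     = {"Molecular_information", mol_kids};
static const Node n_gmap    = {"gMap", NULL};
static const Node *const gen_kids[]  = {&n_gmap, NULL};
static const Node n_genetic = {"Genetic", gen_kids};
static const Node *const map_kids[]  = {&n_genetic, NULL};
static const Node n_map     = {"Map", map_kids};
static const Node n_ref     = {"Reference_allele", NULL};
static const Node *const gene_kids[] = {&n_ref, &n_mol, &n_map, NULL};
const Node gene_model = {"?Gene", gene_kids};
```

A gene known only from classical genetics might use just the path Map, Genetic, gMap, while a gene cloned by similarity might use just Molecular_information, Clone. Both paths pass `model_allows` against the same `gene_model`, and a third gene can mix the two without any new class being defined.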
As well as the classes of tree objects described above, also denoted type B classes, we have type A classes, which contain general arrays of data, and which allow more rigid but more efficient storage of data such as DNA sequences.
The schema itself is stored in objects within the database, allowing a simplified startup procedure and dynamic editing of some features.
Each paragraph in these files corresponds to one object, and must be separated from the next by one or more blank lines. The first line indicates the class and name of the object to be created or edited. Following lines start with the name of a branch node, followed by numerical or text data, or names of other objects to be pointed to. They are interpreted according to the model. Keywords such as -D or -R specify actions to be taken, with the default action being to add the data into the database. As in C++, // indicates a comment in the file.
Example:
// First let us define a sequence:
Sequence ACT3
Title "C. elegans actin gene (3)"
Library EMBL CEACT3 X16798

// next the corresponding DNA (A class with special reader)
DNA ACT3
aagagagacatcctcccgctcccttcccacacccacttgctcttttctat
tgaccacacattatgaagataaccatgttactaatcaaattcgtgttctt
ttccaatttctttttc

// here we change the name zk643 (if it exists)
-R Sequence zk643 ZK643    // R for "rename"

// here we change one of the authors of the paper [wbg101]
Paper Nurture:7:234-242
-D Author "Kimble JE"      // deletion of Kimble
Author "Ahringer JA"       // addition of Ahringer

It is specifically because the objects have a public unique ascii identifier, the doublet [class:name], that these edit commands are well-defined and can refer to precise objects. If the object is not known yet, it is created, else it is modified. If a delete or rename operation finds nothing to delete or rename it moves on silently. If an instruction makes no sense according to the model, for example by referring to an unknown branch point, the user is warned and the paragraph is skipped. Together these properties also allow repeated reading of the same file without changing the database contents, and transfer of information between databases that may not match exactly, something which is very hard with traditional database systems. Indeed they even allow transfer of commonly meaningful data between systems whose schemas differ.
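The property that a file can be read repeatedly without changing the database follows from two rules: adding data that is already present does nothing, and deleting data that is absent does nothing. The C sketch below illustrates just those two rules on a toy in-memory store. The `Store` and `Fact` types and the functions `ace_add`, `ace_delete` and `apply_example` are hypothetical names invented for the example; they are not the ACEDB parser.

```c
#include <assert.h>
#include <string.h>

#define MAX_FACTS 64

/* One stored fact: object identifier ("class:name"), tag, value.
   Strings are borrowed, not copied, to keep the sketch short. */
typedef struct { const char *obj, *tag, *val; } Fact;
typedef struct { Fact facts[MAX_FACTS]; int n; } Store;

/* Return the index of a matching fact, or -1 if there is none. */
static int find(const Store *s, const char *obj, const char *tag, const char *val)
{
    for (int i = 0; i < s->n; i++)
        if (!strcmp(s->facts[i].obj, obj) && !strcmp(s->facts[i].tag, tag)
            && !strcmp(s->facts[i].val, val))
            return i;
    return -1;
}

/* Default action: add the data. Adding an existing fact changes nothing. */
void ace_add(Store *s, const char *obj, const char *tag, const char *val)
{
    if (find(s, obj, tag, val) < 0) {
        assert(s->n < MAX_FACTS);
        s->facts[s->n++] = (Fact){obj, tag, val};
    }
}

/* -D action: delete the data. Nothing to delete means move on silently. */
void ace_delete(Store *s, const char *obj, const char *tag, const char *val)
{
    int i = find(s, obj, tag, val);
    if (i >= 0)
        s->facts[i] = s->facts[--s->n];   /* overwrite with the last fact */
}

/* The author edit from the example file, expressed as two operations. */
void apply_example(Store *s)
{
    ace_delete(s, "Paper:Nurture:7:234-242", "Author", "Kimble JE");
    ace_add(s, "Paper:Nurture:7:234-242", "Author", "Ahringer JA");
}
```

Calling `apply_example` once or many times leaves the store in the same state, which is the behaviour described above for repeated reading of an ace file.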
Advantages of this simple file structure - awk etc. cf ASN.1!
As well as reading in data, we can also export a set of objects in ace file form. An external program, acediff, takes as input two such ace files, and generates a third that would have the effect of transforming a database containing data as in the first file into one containing data as in the second file. This program can be used to generate update files for remote copies of a central database, and in fact this is the procedure we use to distribute the nematode genomic database. There are also facilities within ACEDB for certain types of specialised data output, such as DNA in FASTA format.
Since the class of an object is known at all levels, it is easy to selectively optimise the storage of certain classes, as we do for example with DNA.
Within any particular session, all modified objects are rewritten to new disk locations, which allows us to store multiple versions of an object, and recover from crashes by going back to the last verified save state of the database.
Only one user at a time is permitted write access. The set of changes made until write access is given up constitutes a session. When a session is saved, first the changed objects are flushed to disk, then the indices and hash tables for any altered classes are written (as type A class objects, and hence also to new disk locations). Finally a pointer in the superblock is changed to point to the new index information. Once this is done the system will start up with the new indices. Any crash before this point will leave the system so it starts up by retrieving the old indices, and hence the old objects from before the aborted session.
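The save order described here (objects first, then indices, then a single superblock pointer) can be illustrated with a toy model in C. Everything in this sketch is an assumption made for illustration: the `Disk` structure, the block numbering, and the `crash_before_commit` flag that simulates a crash. It holds a single object and is not the ACEDB disk layout.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS    16
#define BLOCK_SIZE 32

/* Toy disk holding one object. Blocks are never overwritten in place. */
typedef struct {
    int  superblock;                  /* number of the block holding the live index */
    int  index[NBLOCKS];              /* if block b is an index block, index[b] is
                                         the block holding the object's data */
    char data[NBLOCKS][BLOCK_SIZE];   /* contents of data blocks */
    int  next_free;                   /* next unused block */
} Disk;

/* Create a disk whose initial saved state contains the given text. */
void disk_init(Disk *d, const char *text)
{
    memset(d, 0, sizeof *d);
    strncpy(d->data[1], text, BLOCK_SIZE - 1);   /* object in block 1 */
    d->index[2]   = 1;                           /* index in block 2 */
    d->superblock = 2;
    d->next_free  = 3;
}

/* Read the object as the system would see it after start-up. */
const char *disk_read(const Disk *d)
{
    return d->data[d->index[d->superblock]];
}

/* Save a session. A nonzero crash_before_commit stops before the final step. */
void session_save(Disk *d, const char *text, int crash_before_commit)
{
    assert(d->next_free + 2 <= NBLOCKS);
    int obj_block = d->next_free++;
    int idx_block = d->next_free++;

    strncpy(d->data[obj_block], text, BLOCK_SIZE - 1); /* 1. changed object, new location */
    d->index[idx_block] = obj_block;                   /* 2. new index, also a new location */
    if (crash_before_commit)
        return;                /* superblock untouched: old index and object still live */
    d->superblock = idx_block;                         /* 3. one pointer switch commits */
}
```

After `session_save(&d, "v2", 1)` the disk still reads back the old contents, because nothing reachable from the superblock was modified; only a save that reaches step 3 makes the new version visible.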
void paperDate (KEY paper)
{ int year ; OBJ obj ;

  /* open the object; only read and destroy it if that succeeded */
  if ((obj = bsCreate (paper)))
    { if (bsGetData (obj, _Year, &year))
        printf ("Paper %s was published in year %d\n", name(paper), year) ;
      bsDestroy (obj) ;
    }
}
The resulting keysets can be used in various ways: single items can be looked at interactively, the whole set can be a starting point for further queries, it can be dumped out as an ASCII ace file (see above), or it can be saved in the database with a user-specified name. Boolean set operations can also be used to combine sets. An important feature is that sets can contain objects from many classes. One example of how this is used is via another query operation, ``text search''. This performs a search on all text stored in the database, and returns a list of all objects that either have names matching the search string, or contain text that matches it. For example a search for ``muscle'' might return genes with muscle phenotypes, papers with ``muscle'' in the title, sequences of muscle proteins, etc.
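As an illustration of combining keysets with Boolean operations, the following C sketch treats a keyset as a sorted array of keys without duplicates. That representation, the `KEY` typedef and the function names `keyset_and`, `keyset_or` and `keyset_minus` are assumptions for the example; the actual ACEDB library routines may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int KEY;   /* assumed here to be a plain integer handle */

/* Intersection of sorted, duplicate-free sets; out needs room for min(na, nb) keys. */
size_t keyset_and(const KEY *a, size_t na, const KEY *b, size_t nb, KEY *out)
{
    size_t i = 0, j = 0, n = 0;
    assert(out != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (b[j] < a[i]) j++;
        else { out[n++] = a[i]; i++; j++; }
    }
    return n;
}

/* Union of sorted, duplicate-free sets; out needs room for na + nb keys. */
size_t keyset_or(const KEY *a, size_t na, const KEY *b, size_t nb, KEY *out)
{
    size_t i = 0, j = 0, n = 0;
    assert(out != NULL);
    while (i < na || j < nb) {
        if (j >= nb || (i < na && a[i] < b[j])) out[n++] = a[i++];
        else if (i >= na || b[j] < a[i]) out[n++] = b[j++];
        else { out[n++] = a[i]; i++; j++; }   /* key in both: emit once */
    }
    return n;
}

/* Keys of a that are not in b; out needs room for na keys. */
size_t keyset_minus(const KEY *a, size_t na, const KEY *b, size_t nb, KEY *out)
{
    size_t i = 0, j = 0, n = 0;
    assert(out != NULL);
    while (i < na) {
        if (j >= nb || a[i] < b[j]) out[n++] = a[i++];
        else if (b[j] < a[i]) j++;
        else { i++; j++; }                    /* key in both: drop it */
    }
    return n;
}
```

Because keys here carry no class restriction, the same routines work on mixed sets, for instance intersecting the result of a text search for ``muscle'' with a previously saved set of genes.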
The query package is available both to the user, via an interactive interface that allows saving, recovery and reuse of queries, and to the programmer via a library of subroutines. In fact the main control window is implemented by setting up a limited set of straightforward queries.
There is another facility for general data presentation based on the query package, called the ``Table Maker''. This allows the user to construct tables of displayed information in a similar way to using a spreadsheet. The difference is that, in the table maker, new columns are derived from previous columns by queries, not by calculations. Once again, the instructions for defining a table can be built up interactively, and stored in a file once they are correct.
The way this is done is for the server to contain a full copy of the database, and for the client to start with an empty database. When a query is generated by the client, the server resolves it and sends the result back to the client as an ace file, which is parsed into the client database in the normal fashion. In fact the server is acting in exactly the same fashion as an independent textace program, except that it is connected to the client by a pair of sockets, rather than by the standard I/O. This type of structure is only possible because ACEDB allows meaningful data transfer between non-identical databases, via ace files. The client database can either be allowed to accumulate during the session, acting as a local cache, or can be restricted so that all calls for data are resolved by passing them back to the server. The former can be much more efficient, but is susceptible to data becoming stale if it is edited by another process. It is clear that when editing data, the objects must be retrieved from the server and locked there, rather than updated based on a local copy.
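The trade-off between an accumulating client and a pass-through client can be sketched as follows. This is a simplified illustration under stated assumptions: the `Client` structure, `client_get` and the stand-in `demo_server` are hypothetical, the "server" is an ordinary function rather than a socket connection, and results are kept as raw ace text instead of being parsed into a local database.

```c
#include <assert.h>
#include <string.h>

#define CACHE_MAX 32

/* A server is modelled as a function returning ace text for an object name. */
typedef const char *(*ServerFetch)(const char *name);

typedef struct {
    const char *name[CACHE_MAX];   /* borrowed pointers: must outlive the client */
    const char *ace[CACHE_MAX];
    int n;
    int accumulate;                /* nonzero: keep results as a local cache */
    int server_calls;              /* how many requests went to the server */
    ServerFetch fetch;
} Client;

/* Resolve a request locally when allowed and possible, otherwise ask the server. */
const char *client_get(Client *c, const char *name)
{
    assert(c->fetch != NULL);
    if (c->accumulate)
        for (int i = 0; i < c->n; i++)
            if (strcmp(c->name[i], name) == 0)
                return c->ace[i];          /* served locally, may be stale */

    const char *ace = c->fetch(name);
    c->server_calls++;
    if (c->accumulate && ace && c->n < CACHE_MAX) {
        c->name[c->n] = name;
        c->ace[c->n]  = ace;
        c->n++;
    }
    return ace;
}

/* Hypothetical stand-in for the server: knows a single object. */
const char *demo_server(const char *name)
{
    return strcmp(name, "ced-4") == 0
        ? "Gene ced-4\nReference_allele n1162\n" : NULL;
}
```

With `accumulate` set, a second request for ced-4 never reaches the server, which is the source of both the efficiency gain and the staleness risk; with it cleared, every request is passed back to the server.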
Although we make the source code available (under a licence restricting commercial exploitation), and we encourage development of new specific application code, we hope that the community of groups using ACEDB can keep to a single database kernel. This can be achieved by establishing good contact between groups that are doing development work, and folding kernel changes back into the official release version described above. With this policy, we believe that ACEDB can continue to support a growing community of genome database providers, covering many different genome projects.