Note: Text-only version. Figures 1 through 5 are not included.
The development of a specialized database involves considerable resources and expertise. An important question that might be asked by the director of a new project is whether an existing database can be transferred to a different site and quickly reconfigured for new (but similar) purposes, thus saving the time and expense of development. Unfortunately, ease of reuse--in the sense that a biologist can attempt the task without the assistance of specialized computer personnel--has not been a general property of most complex databases. This limitation brings with it two broad consequences. First, many databases are constructed de novo, without significant reference to existing databases; essentially, an expensive wheel is reinvented over and over again. Second, and more importantly, biologists yield the direct responsibility for database configuration to computer experts, even though biologists are both the producers and consumers of the data in the database.
Recently a database tool designed specifically for use by biologists has been developed as part of the Caenorhabditis elegans genome effort. This database software is known as ACEDB. ACEDB is the creation of Richard Durbin (MRC-LMB, Cambridge, United Kingdom) and Jean Thierry-Mieg (CNRS, Montpellier, France). The name ACEDB, A Caenorhabditis elegans Data Base, is also the designation of the nematode genome database (which uses the ACEDB software) maintained by Durbin and Thierry-Mieg. ACEDB has many outstanding features which we will review below, but two of them need to be stated briefly at the outset. First, ACEDB is a generalized genome database. It can be used to create new databases without the need for any reprogramming or in fact any sophisticated computer skills. Second, ACEDB permits biologists to describe and organize their data in a manner that closely resembles how they typically think about their information. The result is that ACEDB offers both ease of access for reconfiguration, and ease of reuse for transfer from one project to another. We will return to these points later after introducing the methods used to interact with the ACEDB software.
ACEDB employs several features that allow the user easy access to the variety of types of information in a genome database. Information is presented using both text and graphic display windows and is navigated through the use of menus and a mouse. Each window used by the software to display information can be used to retrieve related information. ACEDB allows all parts of the database to be cross-referenced with each other thus providing a dense navigable network in which to locate information. Therefore there is no one path to information in a database constructed with ACEDB.
Since the cross-references can be explored via a mouse click, database browsing becomes a simple matter. Figure 1 shows two windows containing text displays presented by the ACEDB software. The text items in boldface type indicate a cross-reference to other information contained in the database. The cross-referenced information can be retrieved by clicking with the mouse on the boldface text. In figure 1, a cross-reference in the window describing a bibliographic citation was followed to the window for the corresponding sequence annotation. In this example the user is also one click away from the DNA sequence, peptide homology, and genetic map information.
One of the special features of the ACEDB software is the ability to create a pictorial representation in one of several specialized graphic displays using information contained within the database. Currently available graphic displays include a genetic map, physical map, sequence display, gridded clone display and simulated agarose gel display.
The genetic map display provides a graphical representation of a variety of genetically defined sites such as genetically defined map locations of mutations or molecular genetic markers. The genetic map can be associated with regions covered by mapped deficiencies and duplications, and the contact points between the physical and genetic maps (figure 2). A locus can be included on more than one genetic map, each genetic map representing a collection of information. The primary results (an estimate of recombination distance) of two point or three point recombination experiments can also be presented in a manner that allows each experimental result to be visually compared with the location of the involved markers in any defined genetic map.
The physical map display provides a view of the continuous overlapping set of cosmid and YAC clones referred to as a contig. The associations of these clones with DNA sequences and genetic markers are also displayed (figure 3). The contig display provides, as do all the graphic displays, a scrolling device which allows easy movement along large contigs. A small summary view is presented in the lower part of the display. The pictorial objects presented in this graphic display can be selected to navigate to other displays. For example, each small box located in the center of the physical map display represents a DNA sequence associated with one of the members of the contig. Double clicking on a small box will cause the appropriate sequence display window to be presented. A text display with information about each of the cosmids or YACs is available by selecting a clone, which is represented by a line segment. The genetic map display can also be reached by selecting the central strip that divides the window in half or one of the locus names just below the central strip.
The sequence display provides a view of the standard sequence features used by the DNA sequence databases such as coding regions, regions of similarity to DNA or peptide sequences, repeat units, promoter and binding sites (figure 4). Selecting one of the sequence features causes the corresponding region of the sequence to be highlighted with color. Several analysis features are available from the sequence display, including the ability to produce a restriction map, show predicted fingerprint bands and find a specific site in a DNA or protein sequence. ACEDB also produces a codon usage table from all or a subset of sequences in the database and can generate a table of splice-site consensus sequences from its database entries that have the exon-intron boundaries annotated. Beyond those analysis functions, the GeneFinder software from Phil Green (Washington University, St. Louis) has been incorporated into the ACEDB software and provides a number of predictive analyses useful in the determination of coding regions. The GeneFinder results are displayed graphically in the sequence display window (figure 5). These include splice-site potentials, codon region potentials, terminator locations, open reading frames and start-site potentials. The GeneFinder functions built into ACEDB will also automatically select possible coding regions and display the resulting exon-intron structure prediction.
ACEDB also provides the ability to trigger the action of an external program and to have this event associated with a specific object. In the Arabidopsis genome database this feature is used to present scanned images of autoradiograms. The ACEDB software activates an external graphic image display program to actually display the image. This feature can be used to execute specialized analysis programs or to display information not available to ACEDB.
Information in an ACEDB database is stored in classes. A class represents a compartment that defines a collection of information on a common topic. Most ACEDB databases contain a variety of general classes such as Sequence, Locus, Clone, Chromosome, and Paper. These can be modified as required; in addition it is possible to create entirely novel classes tailored for new kinds of data. The basic metaphor is that a compartment (class) can have slots (data entry points) within it that are able to contain data. The task of configuration is to define what those slots contain and how they are connected to other compartments in the database. Although in principle it would be possible to configure a database with a single class, in practice this would create a database that would have a dearth of associations and be clumsy to use. It is recommended to use many small classes instead because a modular design allows information to be easily shared through the database.
The description of a class (known as the model) sets forth the types of information that are appropriate for a given topic. A class can contain unique pieces of information or "repeating" pieces of information, and either kind can cross-reference information contained within another class. For example, the Paper class definition might specify that a particular paper could have a single title, one or more authors, one or more keywords, and have associated with it one or more genes, clones, strains, and so forth. Any of these items could be defined as a cross-reference to another class. As mentioned earlier (figure 1) it is the cross-references that make it possible to click on one piece of information and move to another related item. The items that cross-reference should be chosen thoughtfully since users will expect them to make sense. It is also important to realize that the description of a class determines, in part, how information in that class will be presented in a text window by ACEDB. This fact may also influence the choices made.
Once a class is defined, data can be entered into it to create data objects. To extend the metaphor, a class is just a template which functions to give structure to real data. In ACEDB very little data is required to create an object; in fact the only requirement is to supply a unique name. Filling in the other slots is optional, although normally an "empty" object, one with just a name, has little utility. Returning for a moment to the Paper class example, it is sufficient for the ACEDB software to identify a paper object with a name like "xyz123" without also stating the title or any authors.
When slots are filled with data within a class the information is placed into a tree of information as defined in the model. The tree structure that describes a class can be built using just a few components and rules. First, each slot is identified by a label which is unique within the class. These labels are called tags. Second, the slots themselves come in two types: either they are cross-references to other classes or are simple data items which can be integers, text strings, or decimal numbers. Finally, each tag can be used to label a series of data items or cross-references or a mixture of the two.
ACEDB allows a richness in defining the class models which will not be completely presented here. However, consider the following model, which is a simple example of a class for cataloging plasmids.
    ?Plasmid  Location   Freezer    Text
                         Shelf_Box  Int Text Text
              Sequence   ?Sequence
              Reference  ?Paper
In this example the Plasmid class contains the information about the specific freezer location of the plasmid sample and cross-references for two other classes (Sequence and Paper). The anatomy of this class illustrates the points made above. Tags (Location, Freezer, Shelf_Box, Sequence, and Reference) label the slots into which data can be entered. Some tags (for example Freezer) are followed by single data entry points, in this case identified by the term Text. Text indicates the kind of data allowed into this slot (123, hello, and "Hi there!" are all legal Text entries). Other tags (for example Shelf_Box) are followed by multiple slots, in this case a slot for an integer (Int) and two Text slots. Finally, some tags (for example Sequence) are followed by a cross-reference to another class. The "?" in "?Sequence" means that the slot should contain the name of an object in the Sequence class.
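The anatomy of the Plasmid model can also be sketched as a small tree data structure. This is only an illustration in Python, not how ACEDB stores models internally; the Tag class and the helper functions are hypothetical names, and the sketch assumes that Freezer and Shelf_Box are nested under Location.

```python
from dataclasses import dataclass, field

@dataclass
class Tag:
    """One labelled node in a class model (illustrative, not ACEDB's own type)."""
    name: str
    slots: list = field(default_factory=list)     # e.g. ["Text"] or ["?Sequence"]
    children: list = field(default_factory=list)  # nested tags

# Assumed layout: Freezer and Shelf_Box sit under the Location tag.
PLASMID_MODEL = Tag("?Plasmid", children=[
    Tag("Location", children=[
        Tag("Freezer", slots=["Text"]),
        Tag("Shelf_Box", slots=["Int", "Text", "Text"]),
    ]),
    Tag("Sequence", slots=["?Sequence"]),
    Tag("Reference", slots=["?Paper"]),
])

def tag_names(tag):
    """List every tag below the class root, in model order."""
    names = []
    for child in tag.children:
        names.append(child.name)
        names.extend(tag_names(child))
    return names

def cross_referenced_classes(tag):
    """Collect the classes that the model points at through '?Class' slots."""
    found = [slot[1:] for slot in tag.slots if slot.startswith("?")]
    for child in tag.children:
        found.extend(cross_referenced_classes(child))
    return found
```

Walking the tree with tag_names(PLASMID_MODEL) gives the five tags named in the text, and cross_referenced_classes(PLASMID_MODEL) gives the two classes (Sequence and Paper) that the Plasmid class refers to.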
To create an object in the Plasmid class we must name the object and perhaps populate its slots with data. By far the simplest method for accomplishing this is to create a list of information associated with the plasmid. Here is an example:
    Plasmid pBR322-199a
    Reference shirl-1992-aaaxc
    Reference jones-1990-aabrc
    Location "9th floor freezer"
    Sequence ATCHS
Without going into detail it is a simple matter to enter data in this form into ACEDB (see next section). The plasmid entry shows the style required. The entry must start with the object name itself (the plasmid). Thereafter other data is entered by specifying the slot name (tag), then the data itself. A slot in the model may be "repeating" as is true for the Reference slot. In such cases more than one data item can go into this slot as long as each entry is preceded by the slot name. This is nearly a complete description of the necessary syntax. Notice also what is not required: some information (for example, the Shelf_Box) was not supplied. It could be added later if desired. Additionally, the order in which the tags are listed is not identical to the order of the tags in the Plasmid model (Reference is last in the model). Data order--other than the all-important object name (pBR322-199a in this example), which must appear first--is irrelevant to ACEDB.
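The entry style described above is simple enough to read with a few lines of code. The following Python sketch is not the ACEDB parser; parse_entry is a hypothetical helper that handles only the style shown in this article (object name first, then one tag per line, with quoted text allowed), and real ace files may use conventions not covered here.

```python
import shlex

ENTRY = '''
Plasmid pBR322-199a
Reference shirl-1992-aaaxc
Reference jones-1990-aabrc
Location "9th floor freezer"
Sequence ATCHS
'''

def parse_entry(text):
    """Parse one entry: first line names the object, later lines are 'tag data...'."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    class_name, object_name = shlex.split(lines[0])
    entry = {"class": class_name, "name": object_name, "data": {}}
    for line in lines[1:]:
        # shlex keeps a quoted string such as "9th floor freezer" as one item
        tag, *values = shlex.split(line)
        # a repeating tag simply accumulates one item per line
        entry["data"].setdefault(tag, []).append(values)
    return entry

entry = parse_entry(ENTRY)
```

After parsing, entry["data"]["Reference"] holds two items because the tag was repeated, and no Shelf_Box key exists because that slot was never supplied.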
The reader may be wondering exactly how the information above will appear when it is viewed in a text window in ACEDB. First, the order of information will follow the order specified in the model. However, empty slots will not appear. This means that for the plasmid example the window will not display a Shelf_Box tag followed by an empty space. Missing data is not shown and the space it would occupy is removed. The result is a much more compact display. It is critical to understand the beneficial consequences this has on model design. Specifically there is no penalty, in terms of cluttering a window with empty fields, for defining slots that will be rarely filled with data. This is a feature and is designed with biologists in mind, since most biological data is sparse--meaning that while much can be known about an object, little usually is.
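The two display rules just described (model order wins, and empty slots vanish) can be sketched in a few lines of Python. This is an illustration of the behavior only, not ACEDB's display code; render, MODEL_TAG_ORDER and the dictionary layout are hypothetical.

```python
# Tag order as it appears in the Plasmid model (assumed from the example model).
MODEL_TAG_ORDER = ["Location", "Freezer", "Shelf_Box", "Sequence", "Reference"]

# Data as entered: the tag order differs from the model and Shelf_Box is absent.
plasmid_data = {
    "Reference": ["shirl-1992-aaaxc", "jones-1990-aabrc"],
    "Location": ["9th floor freezer"],
    "Sequence": ["ATCHS"],
}

def render(object_name, data, tag_order):
    """Return display lines in model order, skipping tags that hold no data."""
    lines = [object_name]
    for tag in tag_order:
        for value in data.get(tag, []):  # tags with no data contribute nothing
            lines.append(f"  {tag}  {value}")
    return lines

display = render("pBR322-199a", plasmid_data, MODEL_TAG_ORDER)
```

The resulting display lists Location, then Sequence, then the two Reference lines; no line is produced for Freezer or Shelf_Box, which is why rarely filled slots cost nothing on screen.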
Second, the cross-references in the plasmid window will be boldface as they are in figure 1. In our example these will be Sequence and Paper object names (there will be two distinct paper names, one under the other). However, while we have deliberately chosen cryptic names for these objects the database is not constrained to use them. When a cross-reference to another class occurs the name of the cross-referenced object is shown by default. The default can be changed for all occurrences of the cross-referenced class so that the name that is displayed is more informative. The Paper and Sequence classes can be configured to display the title of the paper or sequence if a title is available as is the case in figure 1.
Once a class is created it can be referenced by another class. For example a Sequence object could also contain a cross-reference to the Plasmid class. These cross-references are automatically managed by the database. If a Sequence is entered and the slot cross-referenced to the Plasmid class is filled, the information associated with the Plasmid object will be associated with the Sequence. All that is required is that the cross-reference be specified in the relevant model.
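One way to picture automatically managed cross-references is a store that records the link in both directions whenever one side is filled in. The Python sketch below is a toy under stated assumptions: TinyStore and its methods are hypothetical, and the table of reciprocal tags stands in for whatever the relevant models declare.

```python
from collections import defaultdict

class TinyStore:
    """Toy object store that keeps cross-references consistent in both directions."""

    def __init__(self, reciprocal):
        # (class, tag) -> (target class, tag on the target that points back)
        self.reciprocal = reciprocal
        # (class, object name) -> tag -> list of referenced object names
        self.objects = defaultdict(lambda: defaultdict(list))

    def _add(self, cls, name, tag, value):
        values = self.objects[(cls, name)][tag]
        if value not in values:
            values.append(value)

    def link(self, cls, name, tag, target_name):
        """Fill a cross-reference slot and record the matching back-reference."""
        target_cls, back_tag = self.reciprocal[(cls, tag)]
        self._add(cls, name, tag, target_name)
        self._add(target_cls, target_name, back_tag, name)

# Assumed model declarations: Sequence and Plasmid refer to each other.
store = TinyStore({
    ("Sequence", "Plasmid"): ("Plasmid", "Sequence"),
    ("Plasmid", "Sequence"): ("Sequence", "Plasmid"),
})
store.link("Sequence", "ATCHS", "Plasmid", "pBR322-199a")
```

Only the Sequence side was entered, yet the Plasmid object pBR322-199a now lists ATCHS under its Sequence tag, so the user can navigate the link from either end.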
A major advantage of ACEDB is the ability to add tags to a class without rebuilding the database. In other words, the models file can be edited to add the new tag, the database started, and the new information (which references the new tag) read in. However for more radical changes to the database model it is best to have ACEDB write out all of its known information into an ace file, edit the models and ace file to match the classes and tags, and then read the information into a newly initialized database.
For sharing the database with others a mechanism to produce and then incorporate updates is provided by ACEDB. The update files are text files in the ace format. These files can be distributed via a variety of methods and allow separate groups to update their copy of the database. Each of these groups could also store their own private data in their copy of the database.
A powerful query language, available using the query interface, is provided by the database software. Any piece of information can be located by searching for text or numerical values associated with a specific data item, or even searching for the presence or absence of a tag. The answer to a query is expressed as a list of objects that contain information satisfying the query. It is also possible, once an "answer" is generated, to use the answer as the input to another query. The next answer may contain a list of objects from a completely different class. For example, by querying the Author and Paper classes (assuming the cross-references are in place) it is simple to find all the papers ever published by author John Doe (a list of Paper objects) and then find all the authors who ever co-published with him (a list of Author objects).
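The idea of feeding one answer into the next query can be sketched without any ACEDB query syntax. In the Python illustration below, the paper names, author names and both functions are hypothetical; the point is only that the first answer is a list of Paper objects and the second is a list of Author objects.

```python
# Hypothetical Paper objects, each with a repeating Author slot.
papers = {
    "paper-1": {"Author": ["Doe J", "Smith A"]},
    "paper-2": {"Author": ["Doe J", "Lee K"]},
    "paper-3": {"Author": ["Smith A"]},
}

def papers_by(author, papers):
    """First query: names of all Paper objects that list the author."""
    return sorted(name for name, paper in papers.items()
                  if author in paper.get("Author", []))

def coauthors(author, paper_names, papers):
    """Second query: use the first answer as input and return Author names."""
    found = set()
    for name in paper_names:
        found.update(papers[name]["Author"])
    found.discard(author)  # the author is not his own co-author
    return sorted(found)

doe_papers = papers_by("Doe J", papers)                  # a list of Paper names
doe_coauthors = coauthors("Doe J", doe_papers, papers)   # a list of Author names
```

Here the first answer contains paper-1 and paper-2, and the second answer, derived from it, contains Lee K and Smith A.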
A new query interface is available with the 1.9 version of ACEDB that provides easy access to all the information in the database without knowing the names of the tags in the database. The new query interface written by Gary Aochi (Lawrence Berkeley Laboratory) provides menus that aid the user in requesting just the right information and minimizes the need to understand formal query syntax.
Another query feature provided by ACEDB is the ability to search the database for specific data items and generate a table of the results. The table can display the names of objects (even from different classes) as well as the information in slots within objects. For example, all genetically mapped loci on chromosome 4 that have also been placed on the physical map could be presented in a table containing columns labelled genetic marker name, genetic map location in cM, name of the clone it is associated with on the physical map, and map location of this clone on the physical map. This provides a useful method of presentation for several types of information. Once the specification for a table is created it can be saved and used again later. The interface to the table editor is similar to the query interface: the user makes choices from menus to build a complicated multiple query easily.
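The chromosome 4 example amounts to joining two classes and keeping four columns. The Python sketch below illustrates that logic only; it is not the ACEDB table editor, and the locus names, clone names, positions and tag names are all made up for the illustration.

```python
# Hypothetical Locus objects; only some have a Clone cross-reference.
loci = {
    "locus-a": {"Chromosome": "4", "cM": 1.5, "Clone": "clone-1"},
    "locus-b": {"Chromosome": "4", "cM": 3.0},  # not on the physical map
    "locus-c": {"Chromosome": "2", "cM": 0.7, "Clone": "clone-2"},
}
# Hypothetical Clone objects with a physical map position.
clones = {"clone-1": {"Position": 120}, "clone-2": {"Position": 45}}

def build_table(loci, clones, chromosome):
    """Rows of (locus, cM, clone, clone position) for loci on one chromosome."""
    rows = []
    for name in sorted(loci):
        locus = loci[name]
        clone = locus.get("Clone")
        # keep only loci on the requested chromosome that reach the physical map
        if locus.get("Chromosome") == chromosome and clone in clones:
            rows.append((name, locus["cM"], clone, clones[clone]["Position"]))
    return rows

table = build_table(loci, clones, "4")
```

Only locus-a survives both conditions, so the table has a single row drawing its first two columns from the Locus class and its last two from the Clone class.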
To date two species databases have been released that utilize the ACEDB software: the nematode database ACEDB and the Arabidopsis database AAtDB (1). Both of these databases are available via anonymous FTP from ncbi.nlm.nih.gov or on the NCBI Repository CD-ROM available from the National Center for Biotechnology Information at the National Library of Medicine. For more information about the NCBI and the Repository CD-ROM send electronic mail to info@ncbi.nlm.nih.gov or to the postal address: NCBI Data Repository; NCBI/NLM/NIH; Building 38A; Bethesda, MD 20894 USA.
Several other species or specialized databases are under development using the ACEDB software, including soybean (SoyBase), wheat (GrainGenes), pine tree (APtDB), canine (DogBase), mycobacterium (MycDB), and human chromosomes 21 and 22. It has also been announced that the Drosophila genome database project (FlyBase) will use ACEDB as one of the means to distribute its information.
Second, we strongly recommend that personnel involved in processing large amounts of data learn how to use Emacs, a text editor available for both Unix and VMS computers. The time required to learn Emacs will be repaid manyfold because of the power this editor provides. Several excellent books are available and are appropriate for the new user, including the GNU Emacs manual itself, available from the Free Software Foundation.
Finally we urge anyone interested in reconfiguring ACEDB to contact the growing community of ACEDB database groups. The tone in this community is cooperative and friendly. These groups provide support to new and experienced groups in their design of classes and inform each other of new ACEDB features as they appear.
We wish to acknowledge the authors of ACEDB, Richard Durbin and Jean Thierry-Mieg, who have created an exceptional tool which is revolutionizing the ability of biologists to store and manipulate their data.