ACEDB, A tool for biological information

J. Michael Cherry and Samuel W. Cartinhour
Department of Molecular Biology
Massachusetts General Hospital
Boston, Massachusetts

Note: Text-only version. Figures 1 through 5 are not included.

Introduction

Traditionally biological data has been managed using a variety of computer software. Much of the data produced by a small research project can easily be stored in word processor and spreadsheet documents as well as several types of database applications. For larger scale projects, particularly those requiring the integration of diverse kinds of data, this solution quickly becomes inadequate. In these cases project coordinators have developed specialized databases to analyze and store information needed for their specific situations, often utilizing one of the commercially available relational database management systems.

The development of a specialized database involves considerable resources and expertise. An important question that might be asked by the director of a new project is whether an existing database can be transferred to a different site and quickly reconfigured for new (but similar) purposes, thus saving the time and expense of development. Unfortunately ease of reuse--in the sense that a biologist without the assistance of specialized computer personnel can attempt the task--has not been a general property of most complex databases. This limitation brings with it two broad consequences. First, many databases are constructed de novo, without significant reference to existing databases; essentially, an expensive wheel is reinvented over and over again. Second and more importantly, biologists yield the direct responsibility for database configuration to computer experts, even though biologists are both the producers and consumers of the data in the database.

Recently a database tool designed specifically for use by biologists has been developed as part of the Caenorhabditis elegans genome effort. This database software is known as ACEDB. ACEDB is the creation of Richard Durbin (MRC-LMB, Cambridge, United Kingdom) and Jean Thierry-Mieg (CNRS, Montpellier, France). The name ACEDB, A Caenorhabditis elegans Data Base, is also the designation of the nematode genome database (which uses the ACEDB software) maintained by Durbin and Thierry-Mieg. ACEDB has many outstanding features which we will review below, but two of them need to be stated briefly at the outset. First, ACEDB is a generalized genome database. It can be used to create new databases without the need for any reprogramming or in fact any sophisticated computer skills. Second, ACEDB permits biologists to describe and organize their data in a manner that closely resembles how they typically think about their information. The result is that ACEDB offers both ease of access for reconfiguration, and ease of reuse for transfer from one project to another. We will return to these points later after introducing the methods used to interact with the ACEDB software.

ACEDB Display of the Information

Perhaps one of the most critical features of any database system is the effectiveness of its user interface, i.e., whether or not users can reliably retrieve useful information from it, whether they feel intimidated by it, and indeed whether or not they enjoy using it. Although the ultimate database does not yet exist, ACEDB is becoming widely accepted as providing one of the best examples of an effective and helpful interface. This interface is part of the ACEDB software and requires no programming in order to become active. Below we introduce the major ACEDB display types and some of their properties. This will make clearer the approach ACEDB takes to data display although we must omit many details for the sake of brevity.

ACEDB employs several features that allow the user easy access to the variety of types of information in a genome database. Information is presented using both text and graphic display windows and is navigated through the use of menus and a mouse. Each window used by the software to display information can be used to retrieve related information. ACEDB allows all parts of the database to be cross-referenced with each other thus providing a dense navigable network in which to locate information. Therefore there is no one path to information in a database constructed with ACEDB.

Since the cross-references can be explored via a mouse click, database browsing becomes a simple matter. Figure 1 shows two windows containing text displays presented by the ACEDB software. The text items in boldface type indicate a cross-reference to other information contained in the database. The cross-referenced information can be retrieved by clicking with the mouse on the boldface text. In figure 1 a cross-reference has been followed from a window containing information about a bibliographic citation to the window containing the corresponding sequence annotation. In this example the user is also one click away from the DNA sequence, peptide homology, and genetic map information.

One of the special features of the ACEDB software is the ability to create a pictorial representation in one of several specialized graphic displays using information contained within the database. Currently available graphic displays include a genetic map, physical map, sequence display, gridded clone display and simulated agarose gel display.

The genetic map display provides a graphical representation of a variety of genetically defined sites such as genetically defined map locations of mutations or molecular genetic markers. The genetic map can be associated with regions covered by mapped deficiencies and duplications, and the contact points between the physical and genetic maps (figure 2). A locus can be included on more than one genetic map, each genetic map representing a collection of information. The primary results (an estimate of recombination distance) of two point or three point recombination experiments can also be presented in a manner that allows each experimental result to be visually compared with the location of the involved markers in any defined genetic map.

The physical map display provides a view of the continuous overlapping set of cosmid and YAC clones referred to as a contig. The associations of these clones with DNA sequences and genetic markers are also displayed (figure 3). The contig display provides, as do all the graphic displays, a scrolling device which allows easy movement along large contigs. A small summary view is presented in the lower part of the display. The pictorial objects presented in this graphic display can be selected to navigate to other displays. For example each small box located in the center of the physical map display represents a DNA sequence associated with one of the members of the contig. Double clicking on a small box will cause the appropriate sequence display window to be presented. A text display with information about each of the cosmids or YACs is available by selecting a clone, represented by a line segment. The genetic map display can also be reached by selecting the central strip that divides the window in half or one of the locus names just below the central strip.

The sequence display provides a view of the standard sequence features used by the DNA sequence databases such as coding regions, regions of similarity to DNA or peptide sequences, repeat units, promoter and binding sites (figure 4). Selecting one of the sequence features causes the corresponding region of the sequence to be highlighted with color. Several analysis features are available from the sequence display, including the ability to produce a restriction map, show predicted fingerprint bands and find a specific site in a DNA or protein sequence. ACEDB also produces a codon usage table from all or a subset of sequences in the database and can generate a table of splice-site consensus sequences from its database entries that have the exon-intron boundaries annotated. Beyond those analysis functions the GeneFinder software from Phil Green (Washington University, St. Louis) has been incorporated into the ACEDB software and provides a number of predictive analyses useful in the determination of coding regions. The GeneFinder results are displayed graphically in the sequence display window (figure 5). These include splice-site potentials, coding region potentials, terminator locations, open reading frames and start-site potentials. The GeneFinder functions built into ACEDB will also automatically select possible coding regions and display the resulting exon-intron structure prediction.

ACEDB also provides the ability to trigger the action of an external program and to have this event associated with a specific object. In the Arabidopsis genome database this feature is used to present scanned images of autoradiograms. The ACEDB software activates an external graphic image display program to actually display the image. This feature can be used to execute specialized analysis programs or to display information not available to ACEDB.

Database Design

The process of creating a new database using the ACEDB software involves constructing a "model" of the information to be represented. Initially the models included with one of the released ACEDB-based databases provide a wealth of starting points for immediate use. Later, however, the new database manager will probably elect to modify the models in one or more ways to accommodate specialized data. The following discussion presents what is involved in creating an ACEDB model.

Information in an ACEDB database is stored in classes. A class represents a compartment that defines a collection of information on a common topic. Most ACEDB databases contain a variety of general classes such as Sequence, Locus, Clone, Chromosome, and Paper. These can be modified as required; in addition it is possible to create entirely novel classes tailored for new kinds of data. The basic metaphor is that a compartment (class) can have slots (data entry points) within it that are able to contain data. The task of configuration is to define what those slots contain and how they are connected to other compartments in the database. Although in principle it would be possible to configure a database with a single class, in practice this would create a database that would have a dearth of associations and be clumsy to use. It is recommended to use many small classes instead because a modular design allows information to be easily shared through the database.

The description of a class (known as the model) sets forth the types of information that are appropriate for a given topic. A class can contain unique pieces of information or "repeating" pieces of information, and either kind can cross-reference information contained within another class. For example, the Paper class definition might specify that a particular paper could have a single title, one or more authors, one or more keywords, and have associated with it one or more genes, clones, strains, and so forth. Any of these items could be defined as a cross-reference to another class. As mentioned earlier (figure 1) it is the cross-references that make it possible to click on one piece of information and move to another related item. The items that cross-reference should be chosen thoughtfully since users will expect them to make sense. It is also important to realize that the description of a class determines, in part, how information in that class will be presented in a text window by ACEDB. This fact may also influence the choices made.

Once a class is defined data can be entered into it to create data objects. To extend the metaphor, a class is just a template which functions to give structure to real data. In ACEDB very little data is required to create an object and in fact the only requirement is to supply a unique name. Filling in the other slots is optional although normally an "empty" object, one with just a name, has little utility. Returning for a moment to the Paper class example, it is sufficient for the ACEDB software to identify a paper object with a name like "xyz123" without also stating the title or any authors.

When slots are filled with data within a class the information is placed into a tree of information as defined in the model. The tree structure that describes a class can be built using just a few components and rules. First, each slot is identified by a label which is unique within the class. These labels are called tags. Second, the slots themselves come in two types: either they are cross-references to other classes or are simple data items which can be integers, text strings, or decimal numbers. Finally, each tag can be used to label a series of data items or cross-references or a mixture of the two.

ACEDB allows a richness in defining the class models which will not be completely presented here. However, consider the following model, which is a simple example of a class for cataloging plasmids.

?Plasmid	Location	Freezer	Text
				Shelf_Box Int Text Text
		Sequence	?Sequence
		Reference	?Paper

In this example the Plasmid class contains the information about the specific freezer location of the plasmid sample and cross-references to two other classes (Sequence and Paper). The anatomy of this class illustrates the points made above. Tags (Location, Freezer, Shelf_Box, Sequence, and Reference) label the slots into which data can be entered. Some tags (for example Freezer) are followed by single data entry points, in this case identified by the term Text. Text indicates the kind of data allowed into this slot (123, hello, and "Hi there!" are all legal Text entries). Other tags (for example Shelf_Box) are followed by multiple slots, in this case a slot for an integer (Int) and two Text slots. Finally, some tags (for example Sequence) are followed by a cross-reference to another class. The "?" in "?Sequence" means that the slot should contain the name of an object in the Sequence class.
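
As a rough illustration of the anatomy just described, the Plasmid model can be mirrored in an ordinary data structure. The Python sketch below is not part of ACEDB. The names PLASMID_MODEL, check_slot and check_entry are hypothetical, and the rules it applies (Text accepts any non-empty string, a tag may be given fewer values than it has slots) are simplifying assumptions rather than a statement of how ACEDB validates data.

```python
# Hypothetical mirror of the Plasmid model: each data-bearing tag maps to the
# ordered list of slot types that follow it. "?Class" marks a cross-reference.
PLASMID_MODEL = {
    "Freezer": ["Text"],
    "Shelf_Box": ["Int", "Text", "Text"],
    "Sequence": ["?Sequence"],
    "Reference": ["?Paper"],
}


def check_slot(slot_type, value):
    """Return True if the string `value` is acceptable for `slot_type`."""
    if slot_type == "Int":
        try:
            int(value)
            return True
        except ValueError:
            return False
    if slot_type == "Text" or slot_type.startswith("?"):
        # Assumption: Text takes any non-empty string, and a cross-reference
        # slot simply holds the name of an object in the referenced class.
        return len(value) > 0
    return False


def check_entry(model, tag, values):
    """Check one tag and its values against the model."""
    if tag not in model:
        return False  # unknown tag
    slots = model[tag]
    if len(values) > len(slots):
        return False  # more values than the model provides slots for
    return all(check_slot(t, v) for t, v in zip(slots, values))
```

Under these assumptions, check_entry(PLASMID_MODEL, "Shelf_Box", ["3", "B", "12"]) is accepted, while a Shelf_Box entry whose first value is not an integer is rejected.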

To create an object in the Plasmid class we must name the object and perhaps populate its slots with data. By far the simplest method for accomplishing this is to create a list of information associated with the plasmid. Here is an example:

	Plasmid	pBR322-199a
	Reference	shirl-1992-aaaxc
	Reference	jones-1990-aabrc
	Freezer	"9th floor freezer"
	Sequence	ATCHS
Without going into detail it is a simple matter to enter data in this form into ACEDB (see next section). The plasmid entry shows the style required. The entry must start with the object name itself (the plasmid). Thereafter other data is entered by specifying the slot name (tag), then the data itself. A slot in the model may be "repeating" as is true for the Reference slot. In such cases more than one data item can go into this slot as long as each entry is preceded by the slot name. This is nearly a complete description of the necessary syntax. Notice also what is not required: some information (for example, the Shelf_Box) was not supplied. It could be added later if desired. Additionally, the order in which the tags are listed is not identical to the order of the tags in the Plasmid model (Reference is last in the model). Data order--other than the all-important object name (pBR322-199a in this example), which must appear first--is irrelevant to ACEDB.
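
To make the entry syntax concrete, the following Python sketch reads an entry of the simple form shown above into a dictionary. It is only an illustration and not the ACEDB reader. The function name parse_ace_entry is hypothetical, and the sketch assumes exactly the layout used in this article: an object line holding the class and the object name, followed by one tag per line, with double quotes around values that contain spaces.

```python
import shlex


def parse_ace_entry(text):
    """Parse one ace-style entry into (class_name, object_name, slots).

    `slots` maps each tag to a list of rows, so a repeated tag such as
    Reference simply contributes one more row.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The first line names the object: the class, then the unique object name.
    class_name, object_name = shlex.split(lines[0])
    slots = {}
    for line in lines[1:]:
        # shlex.split keeps a double-quoted value together as one item.
        tag, *values = shlex.split(line)
        slots.setdefault(tag, []).append(values)
    return class_name, object_name, slots


entry = """
Plasmid   pBR322-199a
Reference shirl-1992-aaaxc
Reference jones-1990-aabrc
Sequence  ATCHS
"""
cls, name, slots = parse_ace_entry(entry)
```

Because the tags are collected into a dictionary, the order of the lines after the object name does not matter to this sketch, which mirrors the point made above about data order.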

The reader may be wondering exactly how the information above will appear when it is viewed in a text window in ACEDB. First, the order of information will follow the order specified in the model. However, empty slots will not appear. This means that for the plasmid example the window will not display a Shelf_Box tag followed by an empty space. Missing data is not shown and the space it would occupy is removed. The result is a much more compact display. It is critical to understand the beneficial consequences this has on model design. Specifically there is no penalty, in terms of cluttering a window with empty fields, for defining slots that will be rarely filled with data. This is a feature and is designed with biologists in mind, since most biological data is sparse--meaning that while much can be known about an object, little usually is.
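
The two display rules just described (model order, no empty slots) can be sketched in a few lines of Python. This is a toy rendering and not ACEDB's display code. MODEL_ORDER and render_text_window are hypothetical names, and the column layout is an arbitrary choice made for the example.

```python
# Order of the data-bearing tags as they appear in the Plasmid model.
MODEL_ORDER = ["Freezer", "Shelf_Box", "Sequence", "Reference"]


def render_text_window(object_name, slots, model_order):
    """Lay out an object's filled slots in model order, skipping empty ones."""
    lines = [object_name]
    for tag in model_order:
        # A tag with no data yields no values, so it produces no line at all.
        for i, value in enumerate(slots.get(tag, [])):
            label = tag if i == 0 else ""  # print the tag once per group
            lines.append(f"  {label:<10}{value}")
    return "\n".join(lines)


# Slots given in entry order (Reference first), as in the plasmid example.
plasmid_slots = {
    "Reference": ["shirl-1992-aaaxc", "jones-1990-aabrc"],
    "Sequence": ["ATCHS"],
}
window = render_text_window("pBR322-199a", plasmid_slots, MODEL_ORDER)
```

In the resulting text Sequence is listed before Reference, following the model rather than the entry, and neither Freezer nor Shelf_Box appears because no data was supplied for them.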

Second, the cross-references in the plasmid window will be boldface as they are in figure 1. In our example these will be Sequence and Paper object names (there will be two distinct paper names, one under the other). However, while we have deliberately chosen cryptic names for these objects the database is not constrained to use them. When a cross-reference to another class occurs the name of the cross-referenced object is shown by default. The default can be changed for all occurrences of the cross-referenced class so that the name that is displayed is more informative. The Paper and Sequence classes can be configured to display the title of the paper or sequence if a title is available as is the case in figure 1.

Once a class is created it can be referenced by another class. For example a Sequence object could also contain a cross-reference to the Plasmid class. These cross-references are automatically managed by the database. If a Sequence is entered and the slot cross-referenced to the Plasmid class is filled, the information associated with the Plasmid object will be associated with the Sequence. All that is required is that the cross-reference be specified in the relevant model.
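
The idea of automatically managed cross-references can be pictured with the small Python sketch below. It is an assumption-laden toy and does not show how ACEDB implements or declares such links. The class TinyStore, the RECIPROCAL table and the Plasmid tag on the Sequence class are all hypothetical names invented for the example.

```python
from collections import defaultdict


class TinyStore:
    """Toy object store that fills in the reverse side of each cross-reference."""

    def __init__(self, reciprocal):
        # reciprocal maps (class, tag) -> (target class, tag on the target).
        self.reciprocal = reciprocal
        # (class, object name) -> tag -> set of referenced object names.
        self.objects = defaultdict(lambda: defaultdict(set))

    def link(self, cls, name, tag, target_name):
        """Record a cross-reference and its mirror image on the target object."""
        target_cls, back_tag = self.reciprocal[(cls, tag)]
        self.objects[(cls, name)][tag].add(target_name)
        self.objects[(target_cls, target_name)][back_tag].add(name)


# Hypothetical pairing: a Sequence's Plasmid slot mirrors a Plasmid's Sequence slot.
RECIPROCAL = {
    ("Sequence", "Plasmid"): ("Plasmid", "Sequence"),
    ("Plasmid", "Sequence"): ("Sequence", "Plasmid"),
}
store = TinyStore(RECIPROCAL)
# Only the Sequence side is entered; the Plasmid side is filled in for us.
store.link("Sequence", "ATCHS", "Plasmid", "pBR322-199a")
```

After the single call to link, the plasmid object also lists the sequence, which is the behavior the paragraph above attributes to the database.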

Inputting Information

Earlier we mentioned that data can be entered into ACEDB via simple lists in which object slots are specified and filled. This method is extremely efficient and allows large quantities of data to be input in bulk form. Files containing information in this format are referred to as ace files. The format of these files is simply the tag name at the beginning of the line, a space and then the data item(s) or the unique name of the object in another class associated with that tag. The simple text file format can easily be produced from existing sources of information with the use of Unix utilities or programmable editors such as Emacs. A growing community of ACEDB database groups assists new users and provides access to specialized utility programs that aid in the conversion of standard data formats.
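
As an example of producing this format from an existing source, the Python sketch below converts a small comma-separated table into ace-style text. It is a minimal illustration under stated assumptions and not one of the community utilities mentioned above. The names quote and table_to_ace are hypothetical, the column headings are assumed to match the tag names, a semicolon inside a cell is our own convention for a repeating slot, and separating entries with a blank line is an assumption not described in this article.

```python
import csv
import io


def quote(value):
    """Wrap a value in double quotes if it contains whitespace."""
    return f'"{value}"' if any(ch.isspace() for ch in value) else value


def table_to_ace(csv_text, class_name, name_column):
    """Turn each table row into one ace-style entry."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The object name always comes first.
        lines = [f"{class_name} {quote(row[name_column])}"]
        for tag, cell in row.items():
            if tag == name_column or not cell:
                continue  # empty cells are simply left out of the entry
            # Several values in one cell become repeated tag lines.
            for item in cell.split(";"):
                lines.append(f"{tag} {quote(item.strip())}")
        entries.append("\n".join(lines))
    return "\n\n".join(entries) + "\n"


csv_text = (
    "Plasmid,Sequence,Reference\n"
    "pBR322-199a,ATCHS,shirl-1992-aaaxc;jones-1990-aabrc\n"
)
ace_text = table_to_ace(csv_text, "Plasmid", "Plasmid")
```

The sketch does not handle values that themselves contain double quotes; a real conversion script would need to decide how to escape them.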

A major advantage of ACEDB is the ability to add tags to a class without rebuilding the database. In other words, the models file can be edited to add the new tag, the database started, and the new information (which references the new tag) read in. However for more radical changes to the database model it is best to have ACEDB write out all of its known information into an ace file, edit the models and ace file to match the classes and tags, and then read the information into a newly initialized database.

For sharing the database with others a mechanism to produce and then incorporate updates is provided by ACEDB. The update files are text files in the ace format. These files can be distributed via a variety of methods and allow separate groups to update their copy of the database. Each of these groups could also store their own private data in their copy of the database.

Queries and the Table Editor

One of the strongest features of ACEDB is that it facilitates browsing. Browsing is a useful feature because it takes advantage of an important phenomenon in learning, namely, that even when the information one is trying to remember is not immediately available for recall, it can be recognized when it is seen. ACEDB facilitates this "I'll know it when I see it" method of retrieval by making it easy to follow cross-references with the mouse. Nonetheless there are times when more formal queries are useful.

A powerful query language, available using the query interface, is provided by the database software. Any piece of information can be located by searching for text or numerical values associated with a specific data item, or even searching for the presence or absence of a tag. The answer to a query is expressed as a list of objects that contain information satisfying the query. It is also possible, once an "answer" is generated, to use the answer as the input to another query. The next answer may contain a list of objects from a completely different class. For example, by querying the Author and Paper classes (assuming the cross-references are in place) it is simple to find all the papers ever published by author John Doe (a list of Paper objects) and then find all the authors who ever co-published with him (a list of Author objects).
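
The chained query in the John Doe example can be pictured with the following Python sketch. It does not use ACEDB's query syntax, which is not presented in this article. The functions find and follow, the Author tag, and all paper and author names are hypothetical and serve only to show how the answer to one query becomes the input to the next.

```python
# Hypothetical Paper objects, each with a repeating Author slot.
papers = {
    "paper-1": {"Author": ["Doe J", "Roe R"]},
    "paper-2": {"Author": ["Doe J", "Poe E"]},
    "paper-3": {"Author": ["Roe R"]},
}


def find(objects, tag, value):
    """First query: names of the objects whose `tag` slot contains `value`."""
    return sorted(n for n, slots in objects.items() if value in slots.get(tag, []))


def follow(objects, names, tag):
    """Second query: collect everything reachable through `tag` from `names`."""
    found = set()
    for name in names:
        found.update(objects[name].get(tag, []))
    return sorted(found)


# Step 1: a list of Paper objects. Step 2: a list of Author names.
doe_papers = find(papers, "Author", "Doe J")
coauthors = [a for a in follow(papers, doe_papers, "Author") if a != "Doe J"]
```

Here the first answer is a list from the Paper class and the second is a list of author names, illustrating how a chain of queries can move from one class to another.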

A new query interface is available with the 1.9 version of ACEDB that provides easy access to all the information in the database without knowing the names of the tags in the database. The new query interface, written by Gary Aochi (Lawrence Berkeley Laboratory), provides menus that aid the user in requesting just the right information and minimize the need to understand formal query syntax.

Another query feature provided by ACEDB is the ability to search the database for specific data items and generate a table of the results. The table can display the names of objects (even from different classes) as well as the information in slots within objects. For example all genetically mapped loci contained on chromosome 4 that also have been placed on the physical map could be presented in a table containing columns labelled genetic marker name, genetic map location in cM, name of the clone it is associated with on the physical map and map location of this clone on the physical map. This provides a useful method of presentation for several types of information. Once the specification for a table is created it can be saved and used again later. The interface to the table editor is similar to the query interface where the user makes choices from menus to easily build a complicated multiple query.
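
The chromosome 4 example amounts to joining information from two classes into rows. The Python sketch below shows that idea only; it is not the table editor and says nothing about how ACEDB stores or evaluates a table specification. All tag names (Chromosome, Genetic_position, Clone, Physical_position), object names and values are invented for the illustration.

```python
# Hypothetical Locus and Clone objects.
loci = {
    "locus-a": {"Chromosome": "4", "Genetic_position": 3.2, "Clone": "clone-1"},
    "locus-b": {"Chromosome": "4", "Genetic_position": 5.0},  # not on physical map
    "locus-c": {"Chromosome": "2", "Genetic_position": 1.1, "Clone": "clone-2"},
}
clones = {
    "clone-1": {"Physical_position": 1200},
    "clone-2": {"Physical_position": 450},
}


def build_table(loci, clones, chromosome):
    """Rows of (locus, genetic position, clone, physical position)."""
    rows = []
    for name in sorted(loci):
        locus = loci[name]
        clone = locus.get("Clone")
        # Keep only loci on the requested chromosome that have a clone.
        if locus.get("Chromosome") != chromosome or clone is None:
            continue
        rows.append(
            (name, locus["Genetic_position"], clone,
             clones[clone]["Physical_position"])
        )
    return rows


table = build_table(loci, clones, "4")
```

Only locus-a satisfies both conditions in this toy data set, so the table has a single row combining a slot from the locus with a slot from its clone.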

Availability of ACEDB and Current Genome Database Projects

Currently the ACEDB software is available for a variety of Unix workstations utilizing the X Windows environment (including Sun Microsystems SPARCstation, Silicon Graphics IRIS, NeXT, Digital Equipment DECstation, and IBM RS/6000). The C program source and the application executables are available free of charge via anonymous FTP through the Internet computer network from the host ncbi.nlm.nih.gov.

To date two species databases have been released that utilize the ACEDB software: the nematode database ACEDB and the Arabidopsis database AAtDB (1). Both of these databases are available via anonymous FTP from ncbi.nlm.nih.gov or on the NCBI Repository CD-ROM available from the National Center for Biotechnology Information at the National Library of Medicine. For more information about the NCBI and the Repository CD-ROM send electronic mail to info@ncbi.nlm.nih.gov or to the postal address: NCBI Data Repository; NCBI/NLM/NIH; Building 38A; Bethesda, MD 20894 USA.

Several other species or specialized databases are under development using the ACEDB software including soybean (SoyBase), wheat (GrainGenes), pine tree (APtDB), canine (DogBase), mycobacterium (MycDB), and human chromosomes 21 and 22. It has also been announced that the Drosophila genome database project (FlyBase) will use ACEDB as one of the means to distribute their information.

Experiences of the AAtDB Project

The Arabidopsis genome database, AAtDB, uses exactly the same software as the C. elegans genome database, ACEDB. No programming was involved to change from an animal database to a plant genome database. The update mechanism provided by the ACEDB software is utilized by the AAtDB Project to distribute information to more than 40 sites worldwide. It is difficult to summarize the variety of experiences involved in configuring the AAtDB database and populating it with information. At the risk of oversimplification we offer a few observations. First, we would recommend that the guiding principle of model design be simplicity. Although it may seem possible to specify every possible kind of data in advance, in practice it is not and such effort is wasted. Instead allow real data to drive the configuration process so that the models reflect what is actually happening in the laboratory. ACEDB imposes little penalty on trying new configurations so there is no need to get everything "right" the first time.

Second, we strongly recommend that personnel involved in processing large amounts of data learn how to use Emacs, a text editor available for both Unix and VMS computers. The time required to learn Emacs will be repaid manyfold because of the power this editor provides. Several excellent books are available and are appropriate for the new user, including the GNU Emacs manual itself, available from the Free Software Foundation.

Finally we urge anyone interested in reconfiguring ACEDB to contact the growing community of ACEDB database groups. The tone in this community is cooperative and friendly. These groups provide support to new and experienced groups in their design of classes and inform each other of new ACEDB features as they appear.

We wish to acknowledge the authors of ACEDB, Richard Durbin and Jean Thierry-Mieg, who have created an exceptional tool which is revolutionizing the ability of biologists to store and manipulate their data.

Figure legends

Figure 1 Example text displays from the Arabidopsis database presented by the ACEDB software.

The upper display contains the sequence annotations for a sequence obtained from GenBank. The lower window is the reference information for the sequence. Text in bold typeface indicates a cross-reference to other information.

Figure 2 Genetic map display from ACEDB presenting a portion of C. elegans chromosome IV.

On the far left familiar marker genes are listed to provide landmarks. Farther to the right deficiencies and duplications are indicated representing the region involved. Continuing to the right is a cartoon of the chromosome. The wide regions represent areas covered by the overlapping cosmid and YAC physical map. To the right of the genetic distance scale the genetic loci are placed. In this particular view the number of loci is very large and ACEDB automatically spreads out the locus names so that none are occluded. The Zoom In button at the top of this window allows the user to examine a smaller magnified region of the chromosome.

Figure 3 Physical map display from ACEDB presenting part of the overlapping cosmid and YAC contig on C. elegans chromosome IV.

This single window is divided into two regions by the centrally located grey horizontal bar. The physical map display is closely associated with the genetic map. The lower half of the window contains genetically defined loci, comments and remarks, and an overview of the current region on the genetic map. In the upper half of this window the YAC clones are represented with a line segment illustrating the extent of their overlap with other members of the contig. Clones drawn with a bold line are cosmids and YACs that are present on one of the gridded clone filter panels included in the database.

Figure 4 Sequence display from the Arabidopsis database using the ACEDB software.

The exon-intron structure is illustrated in the center of this window. To the right of the coding region cartoon are shaded boxes which represent regions of peptide similarity. These peptide similarity boxes indicate the portion of the sequence that was found similar via a database searching program, such as BLASTX which is used for the Arabidopsis sequence. The similarity boxes are located in columns representing the reading frame in which they were found. The DNA sequence and nucleotide number scale are also shown.

Figure 5 GeneFinder results presented in a Sequence display.

To the right of the nucleotide scale and the exon-intron structure are three sets of columns representing the location of terminator codons, open reading frames, start-site potentials and coding sequence potentials calculated by the GeneFinder algorithms incorporated within the ACEDB software. On the right side the splice-site potentials are shown with a longer line segment indicating a higher potential. Donor site potentials are illustrated by a downturned flag and acceptor site potentials have an upturned flag.