John Barnett and Doug Bigwood
Genome Informatics Group, National Agricultural Library, USDA, Beltsville, MD 20705
Sam Cartinhour
Crop Biotechnology Center, Texas A&M University, College Station, Texas 77843
Built in to MS are the rules for constructing HTML versions of objects, lists, and query forms. MS outputs objects in ordinary tace style with one important difference: it uses the HTML construction rules to overlay "clickable" xace items (links between objects) with "Universal Resource Locators" (URLs). These become clickable in the WWW environment. The links are simply ACEDB queries expressed in the required URL syntax. For example, the link from a Paper object to a Person "Tom" is essentially "find person tom". The URL also contains the name of the database and explicitly names MS as an executable in cgi-bin:
...cgi-bin/MS/sorghumdb?find+Person+%22tom%22
For the user, the transition between X-windows and a web browser (such as Mosaic) is straightforward, although links are navigated with one mouse click instead of two.
MS is positioned between the web server (e.g., httpd) and one or more databases. Typically it resides in the cgi-bin directory in the web server hierarchy. When a URL is activated by a mouse click in a web browser, the information is received by httpd and passed to MS for action. MS interprets the query and returns an HTML-formatted reply, which is sent back to the web browser by httpd.
One advantage of the WWW is that links to external data sources are possible. These are an important 'value-added' feature since genome databases typically contain a variety of potential links for strains, sequences, publications, and other data. MS does not provide for external links but they can be implemented by filtering server output through a perl script. The filter, essentially a series of "if-then" clauses, examines MS output line-by-line, 'grabs' the appropriate text and adds a hypertext link. The precise markup depends upon the database, the class, and the text (tags and data) on a particular line. For example, the line:
Library GenBank ATTS1273 Z25996
in an AAtDB Sequence object would be modified to include a link to GenBank using the GenBank accession number as the visible part of the anchor. The result is that "Z25996" is replaced by a URL-style query directed towards the GenBank server:
<a href="http://ncbi.nlm.nih.gov:2555/htbin/birx_by_acc?genbank+Z25996">Z25996</a>
The MS filter provides "on the fly" generation of external links and also gives the option of modifying MS output in other ways. A simple example is the removal of the "ISINDEX" HTML markup used by MS that invites users to "enter search keywords" into a text box. The caption, which is automatically presented when "ISINDEX" occurs, is misleading since correct ACEDB query syntax is required rather than a keyword. This feature can be deleted and replaced with a different interface for formal queries.
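One "if-then" clause of such a filter can be sketched as follows. This is a minimal illustration rather than the filter we actually run: the subroutine name link_genbank and the exact line pattern are assumptions, and only the GenBank URL prefix is taken from the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: add a GenBank link to one line of server output.
# The URL prefix is the one quoted in the text; the line pattern is an assumption.
sub link_genbank {
    my ($line) = @_;
    my $base = 'http://ncbi.nlm.nih.gov:2555/htbin/birx_by_acc?genbank+';
    # Match "Library GenBank <identifier> <accession>" and wrap the accession.
    if ($line =~ /^(\s*Library\s+GenBank\s+\S+\s+)(\S+)\s*$/) {
        return qq{$1<a href="$base$2">$2</a>};
    }
    return $line;    # every other line passes through unchanged
}

print link_genbank('Library GenBank ATTS1273 Z25996'), "\n";
```

A complete filter would hold one such clause per database, class and tag, which is where the maintenance burden discussed in the next paragraph comes from.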
Unfortunately the limitations of the filter are apparent once output from more than a few ACEDB databases with different models is involved. The filter is difficult to maintain and is not flexible enough as a foundation for more sophisticated tasks, such as connecting our collection of plant databases into a single "integrated" environment. These problems led us to consider a different approach to ACEDB data retrieval and presentation via the web.
We are reluctant to create another major version of the ACEDB software. The server is based on stock tace, modified to output objects in a special way. The modifications have been integrated into the standard release (one immediate benefit is that the cgi-bin scripts are robust with respect to new versions of tace). To associate tace with a port, we use a perl wrapper. Tace is started from within a perl process (hence "Perl Tace Server" or "PTS") and remains running while the wrapper monitors the port for incoming data. Data (at this point in proper tace syntax) is passed to and from tace without modification (although processing could occur here). Each database is assigned a distinct port. Note that a database and its tace server need not be located on the same machine as the web server.
One major barrier in MS to alternative object representation is that certain information about the object has been lost. For example, tags and data cannot be distinguished reliably and thus it would be impossible to cast them in different fonts. This limitation exists in tace as well. The modified tace conserves information by delivering objects in a special annotated form (see Appendix 1). The output is not intended to be human-readable but rather is parsable by perl's "eval" statement to create a pointer. The annotation accomplishes two purposes. First, it preserves the tree structure of the object, which makes it feasible to perform a variety of "treeish" operations, including column alignment, pruning, grafting, and so forth if we should choose. Second, it distinguishes each object element (tag, link to another object, integer, floating-point number, datetype) and additionally identifies links to empty objects or to "Titled" objects in classes registered in options.wrm as "-T MyTag". No assumptions about markup (HTML or otherwise) are made.
This single modification makes it possible to design a variety of text-based WWW interfaces to ACEDB databases. An obvious incentive exists to extend the tace command set in different ways. For example, the "attach" command is potentially very useful but is not yet part of the tace repertoire. Maps and other specialized graphical displays cannot currently be accessed with direct tace commands. Eventually we would like to extend the tace command set to include these functions and perhaps ACEDB data analysis utilities as well.
The modularization is reflected in the form taken by the URLs:
.../cgi-bin/command/[database][/arg1][/arg2]...
For example, the cgi-bin script "find" is used to request a single object:
find/database/class/object
find/ricegenes/locus/wx
The "find" command is used extensively to represent links between objects. Other fundamental commands include: "classes" (to list available classes), "list" (to list all objects in a class), and "model" (to display the model for a particular class). These basic commands are named after their tace equivalents but more complex and specialized commands are also possible. For example, "range" is used to create collapsed lists of objects. An "imap" command could be used to present a genetic map as a table of intervals, requiring the construction of a table definition, use of the tablemaker, and further processing to calculate the intervals. Other possibilities include generating queries which depend on results from previous queries, retrieving genetic maps from more than one database and comparing them to generate a syntenic map, and combining objects from different databases to create customized virtual objects.
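The URL layout can be illustrated with a short sketch that splits a request path into command, database and arguments. The subroutine name parse_request and the hash keys are hypothetical; our scripts may divide the work differently.

```perl
use strict;
use warnings;

# Hypothetical helper: split ".../cgi-bin/command/database/arg1/arg2..."
# into its parts. The hash keys are illustrative names.
sub parse_request {
    my ($path) = @_;
    # Keep only what follows "cgi-bin/".
    return undef unless $path =~ m{cgi-bin/(.+)$};
    my ($command, $database, @args) = split m{/}, $1;
    return { command => $command, database => $database, args => [@args] };
}

my $req = parse_request('/cgi-bin/find/ricegenes/locus/wx');
print "$req->{command} $req->{database} @{ $req->{args} }\n";
```

For the "find" example the arguments are the class and the object name; a command such as "classes" would simply receive an empty argument list.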
In all of the cases above several cgi-bin scripts are involved. The "find" command uses the script collection to interpret the URL, converse with the tace server, and produce a formatted object. As much as possible we have isolated common procedures to make it straightforward to build new commands.
The server-markup separation also makes it more convenient to add markup to objects--for example, standard headers and footers, or URLs which point elements in an object to external databases (Appendix 3). As noted earlier, both are possible with MS by intercepting the server's output and modifying it before httpd sends it back to the web browser. However, since MS output is already marked up in HTML and formatted by indentation this is an awkward affair, made even more complex if multiple databases are involved (as is the case at the National Agricultural Library). When different kinds of markup are combined it is definitely more convenient to start with the unadorned object. The object can be examined and processed in various ways, with markup "directives" being stored as necessary, before actual markup is applied in a single pass.
Note that markup for external databases is simplified by the representation of ACEDB objects as trees. It is relatively easy to write rules which can recurse through the nodes of a tree, testing for certain patterns, then trigger an action if the test is satisfied (see Examples in Appendix 3). The action can involve attaching additional information to a node, for example the URL for the external data source. The same testing procedures could be used to accomplish other tasks, such as handling objects from a particular database or integer values in a special way.
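The recursion just described can be sketched in a few lines. The subroutine walk, the "url" field it attaches, the example.org address and the sample data are all hypothetical; the node fields (ty, va, cl, Pn) follow the format given in Appendix 1.

```perl
use strict;
use warnings;

# Hypothetical helper: visit every node of a tree, call $test on each, and
# call $action on the nodes for which the test is true.
sub walk {
    my ($node, $test, $action) = @_;
    $action->($node) if $test->($node);
    walk($_, $test, $action) for @{ $node->{Pn} || [] };
}

# A small tree in the node format of Appendix 1 (the data is made up).
my $root = { ty => 'ob', cl => 'Sequence', va => 'ATTS1273', Pn => [
    { ty => 'tg', va => 'Length',  Pn => [ { ty => 'in', va => 412 } ] },
    { ty => 'tg', va => 'Library', Pn => [ { ty => 'tx', va => 'Z25996' } ] },
] };

# Attach a (hypothetical) "url" field to every text node.
walk($root,
     sub { $_[0]{ty} eq 'tx' },
     sub { $_[0]{url} = "http://example.org/lookup?$_[0]{va}" });

print $root->{Pn}[1]{Pn}[0]{url}, "\n";
```

Because the test and the action are ordinary subroutines, the same traversal can serve for other purposes, such as treating integer nodes in a special way.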
Tace can be modified to deliver objects with their tree structure and other characteristics preserved. Web (or other) functionality can then be handled externally. We believe this approach shows promise and points out the need to create a text-based ACEDB server which is capable of representing ACEDB in toto, including access to its analysis tools and representation of its displays. Output should probably be in as general a form as possible; for example, a textual representation of a map display which could be used to build an image, rather than the image itself or PostScript.
The advantages conferred by the general approach make it possible to develop novel WWW and other interfaces to ACEDB. However, the potential is not limited to end-user data delivery. For example, another application would be a front end to tace which could be used to automate the update of remote databases. Two PTS could communicate on a master/slave basis with the master sending updates periodically to the remote server. Indeed, many such enhancements are possible because development can take place independently of ACEDB development.
# ty: node type, one of: tag, text, int, float, datetype, object
#                        tg   tx    in   fl     dt        ob
# va: name of tag, value of a data node (text, int, float, datetype), or name of object
# cl: name of class (only defined for type object)
# ti: title (only defined for objects using -T option)
# mt: empty (only defined for empty objects)
# Pn: pointer to array of pointers to the nodes on the right
# Pm: pointer to array of max field widths of fields in cols to right
#     (filled in after object is retrieved from ACEDB)
# db: database of origin (occurs at root node)
#     (filled in after object is retrieved from ACEDB)
A typical node might contain:
{ty=>'ob',
cl=>'Paper',
va=>'jones-1995-aabxc',
ti=>'Sequence of ADH-1'}
Nodes are connected to each other via pointers. This is similar to
what is done internally in ACEDB (using RIGHT and DOWN) but with a
slightly different interpretation. In particular, the pointers are
arranged so that nodes at the same branching "level" of indentation on
the tree are not directly connected; instead, a node only points to
nodes at the next level. For example, given an object like the one
below, where tags are numbers and text fields are letters (with 0 as
the root node):
0  1  a
      b
   2  c  d
the tree can be drawn as:
0---1-a
 \   \
  \   b
   \
    2-c-d
i.e.,
{ty=>ob, va=>0, Pn=>[{node 1},{node 2}]} 0 points to 1 and 2
{ty=>tg, va=>2, Pn=>[{node c}]} 2 points to c
Here a pointer to an object/associative array is represented by a pair
of braces {} while a pointer to a list is represented by a pair of
brackets [].
Thus, ignoring other information, the object above could be represented with the following expression, which can be directly evaluated to produce a pointer to the root node:
{ty=>ob, va=>0, Pn=>[{ty=>tg, va=>1, Pn=>[{ty=>tx, va=>a},
                                          {ty=>tx, va=>b}
                                         ]
                     },
                     {ty=>tg, va=>2, Pn=>[{ty=>tx, va=>c, Pn=>[{ty=>tx, va=>d}
                                                              ]
                                          }
                                         ]
                     }
                    ]
}
Note that nodes 1 and 2 are connected via node 0 in this scheme; i.e.,
node 2 is not "DOWN" from node 1 in the sense that node 1 points to
node 2.
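The evaluation step can be tried out with the sketch below. It uses a copy of the expression above with the values quoted so that it also runs under "use strict"; the subroutine show is a hypothetical helper that prints one node per line, indented by depth.

```perl
use strict;
use warnings;

# The example tree as a string, standing in for server output.
my $text = q{
{ty=>'ob', va=>0, Pn=>[{ty=>'tg', va=>1, Pn=>[{ty=>'tx', va=>'a'},
                                              {ty=>'tx', va=>'b'}]},
                       {ty=>'tg', va=>2, Pn=>[{ty=>'tx', va=>'c',
                                               Pn=>[{ty=>'tx', va=>'d'}]}]}]}
};

my $root = eval $text;          # pointer (reference) to the root node
die "eval failed: $@" if $@;

# Hypothetical helper: render the tree, one node per line, indented by depth.
sub show {
    my ($node, $depth) = @_;
    my $out = ('  ' x $depth) . "$node->{ty} $node->{va}\n";
    $out .= show($_, $depth + 1) for @{ $node->{Pn} || [] };
    return $out;
}
print show($root, 0);
```

Following Pn from the root reaches nodes 1 and 2 as siblings; node d is reached only through node c, for instance as $root->{Pn}[1]{Pn}[0]{Pn}[0].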
[DELIMITER]....cats can't swim....[DELIMITER]
[DELIMITER]....cats can' . "'" . 't swim....[DELIMITER]
The data delimiter is then replaced with single quote marks:
'....cats can' . "'" . 't swim....'
The rationale for handling single quotes in this fashion is that it prevents execution of any perl commands embedded in the data. When the object is evaluated using perl's "eval", a pointer to the root node is recovered and any "native" single quotes are restored.
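The quoting scheme can be reproduced in perl itself, which makes the round trip easy to check. In the modified tace the work is done by the server; the subroutine quote_data below is only a hypothetical stand-in for that step.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw data into a single-quoted perl expression.
# Each embedded single quote becomes  ' . "'" . '  as described in the text.
sub quote_data {
    my ($data) = @_;
    $data =~ s/'/' . "'" . '/g;
    return "'" . $data . "'";
}

my $expr = quote_data("cats can't swim");
print "$expr\n";                 # 'cats can' . "'" . 't swim'
my $back = eval $expr;           # evaluating restores the original text
print "$back\n";                 # cats can't swim
```

Since nothing is interpolated inside single quotes, data such as '$x' or '@list' comes back as literal text. Backslashes are a separate matter that this sketch does not address: a backslash immediately before a quote mark is still special inside a single-quoted perl string.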
{ty=>ob,
 cl=>'Author',
 va=>'Adams, S.',
 db=>'foodb',
 Pn=>[{ty=>tg,
       va=>'Full_name',
       Pn=>[{ty=>ob,
             cl=>'Contact',
             va=>'Adams, Sam'}]},
      {ty=>tg,
       va=>'Paper',
       Pn=>[{ty=>ob,
             cl=>'Paper',
             va=>'adams-1992-aagad',
             ti=>'My very first publication'},
            {ty=>ob,
             cl=>'Paper',
             va=>'jones-1992-aahfp',
             mt=>1},
            {ty=>ob,
             cl=>'Paper',
             va=>'smith-1993-aahhz',
             ti=>'Cats can' . "'" . 't swim'}]}]
}
To use ACEDB objects in Perl, a module was created with several basic methods of handling objects. The module is included in a perl script with the command
use Aceobj;
which looks for a file called Aceobj.pm and evaluates it before the execution of the rest of the program. Every module which defines an object class must contain the method (i.e., subroutine) 'new'; in Aceobj.pm this accepts as input a perl string (output from the modified tace) and calls the function eval to convert this string into a tree structure. The value returned from the method is a pointer to the tree structure; this is the object itself. Other methods available to objects of class Aceobj are:
However, the web offers the opportunity to add significant value to a database by supplying additional links to external data resources. Often these take the form of links between individual objects and data at a remote site, for example between a sequence object and NCBI's GenBank server. In this case the originating (ACEDB) database ordinarily does not provide sufficient information to generate the URL. In particular it does not know that the data is available at a particular host using certain conventions. Creating the URL may also involve computing a key from one or more items in the object, testing that certain conditions are met, and so on. Each link may vary considerably in what is required, and a single object may need to be linked to several external sources.
To supply the additional information we have added support for "external markup" to the ACEDB-WWW interface. The external markup routines exploit the fact that the object is a tree in which node properties have been preserved. Markup rules (described in detail below) determine if an object is eligible for external markup and how the markup is to be generated. The information is attached to the relevant node for later interpretation. The actual generation of HTML for both internal and external links occurs later.
Our markup rules have evolved considerably since the project began as we required greater and greater power from them. The current form is by no means stable. In addition, it is possible that their complexity should be hidden by another layer, i.e. a simplified language from which the rules are generated. We have explored several approaches to this but have reached no conclusions.
First, a rule contains a description of the root node of the object. The root specification mainly serves to identify the class of object that is affected by the rule, but it is also possible to set a requirement on the object name (in general, any node characteristic can be stated as a criterion):
'root' => {cl=>'Species'}
'root' => {cl=>'Species',va=>'Arabidopsis thaliana'}
If the root specification is omitted then the rule can potentially
apply to any object, not just objects from a single class.
Note that the database name, which is in fact part of the root node, need not be explicit. This is because rules for each database are isolated in their own files and used when appropriate.
Second is the description of a series of nodes that form a branch or part of one. These nodes must form a contiguous structure; i.e., there cannot be gaps. However, a node need not be described beyond the fact that it exists. The branch ultimately determines where the external markup will appear; typically, the URL will be associated with the last node in the list.
'branch' => [{ty=>tg,va=>'Taxonomic_information'}, {ty=>tx}]
'branch' => [{ty=>tg,va=>'GeneFamily'},{cl=>'GeneFamily'}]
'branch' => [{ty=>tg,va=>'Library'},{cl=>'Source',va=>'GenBank'},{},{ty=>tx}]
These branches correspond respectively to objects with the structure
Taxonomic_information Text
GeneFamily ?GeneFamily
Library ?Source Text Text
In the final case, the ?Source field is required to contain the value "GenBank". The empty braces could represent any kind of node; in this case they serve as a placeholder for the first Text field.
A valid rule can omit the branch specification. This implies that any markup will be associated with the root node. It is an error to omit specifications for both the root and branch.
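How a branch specification might be tested against an object is sketched below. The subroutines node_matches and match_branch are hypothetical, and the sketch makes the simplifying assumption that the branch starts directly beneath the node it is given; the sample object is made up to fit the third branch above.

```perl
use strict;
use warnings;

# True if every field named in the specification has the same value in the node.
sub node_matches {
    my ($node, $spec) = @_;
    for my $field (keys %$spec) {
        return 0 unless defined $node->{$field} && $node->{$field} eq $spec->{$field};
    }
    return 1;
}

# Hypothetical matcher: return every chain of nodes below $node that satisfies
# the branch specification, one node per specification, with no gaps.
sub match_branch {
    my ($node, @specs) = @_;
    return ([]) unless @specs;                 # nothing left to match
    my ($first, @rest) = @specs;
    my @chains;
    for my $child (@{ $node->{Pn} || [] }) {
        next unless node_matches($child, $first);
        push @chains, [ $child, @$_ ] for match_branch($child, @rest);
    }
    return @chains;
}

my $root = { ty=>'ob', cl=>'Sequence', va=>'ATTS1273', Pn=>[
    { ty=>'tg', va=>'Library', Pn=>[
        { ty=>'ob', cl=>'Source', va=>'GenBank', Pn=>[
            { ty=>'tx', va=>'ATTS1273', Pn=>[ { ty=>'tx', va=>'Z25996' } ] } ] } ] } ] };

my @found = match_branch($root,
    { ty=>'tg', va=>'Library' }, { cl=>'Source', va=>'GenBank' }, {}, { ty=>'tx' });
print $found[0][-1]{va}, "\n";    # value of the last node in the branch
```

An empty specification {} matches any node, which is how the placeholder for the first Text field works. Each returned chain lists the matched nodes in branch order, so a 'node' index can select any of them.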
Third is the node specification. This identifies the node in the branch list with which the external link will be associated.
'node' => 1
If the node is not specified the value defaults to the last node in the branch list if there is one or the root node if there is not. Note that the first node is indexed as '0', not '1'.
Fourth is the procedure used to generate the key or keys in the URL. This is an anonymous subroutine that returns a key value or a list of values. By 'key' we mean any data-dependent elements required to complete the URL, not necessarily a single value like an accession number. Multiple values may be required to construct a complete URL. A null return value signals that markup is not to occur. The procedure can be quite complex if necessary. Information can be drawn from any node in the object or drawn from another source.
The example below starts at the root node and traverses the tree to test for the value "Arabidopsis thaliana"; if it is found, it returns it as the key.
'keys' => sub {my ($node,$root) = @_;
if (@{find_nodes $root ({va=>"Arabidopsis thaliana"})}) {
$node->{va};
} else { undef;}
}
This example checks for an entry in another database which lists items
that should not be marked up. If the value is found the key is null;
otherwise, a key is generated.
'keys' => sub {my ($node,$root) = @_;
dbmopen(%grin,"/kaos/WWW/8200/cgi-bin/grin",undef);
my $omit = $grin{$node->{va}};
dbmclose(%grin);
if ($omit) {undef} else {$node->{va}}
}
Special variables ($node and $root) are provided to make it easy to
refer to the "current" node (in the 'node' specification) and the root
node.
Very often URLs are formed in a simple manner from a value in a field or from the name of the object itself. The default for keys takes this into account. If keys are not specified a key will be created using the "value" of the last node in the branch list and from the root node:
$keys[0] = $node->{va} #usually, the value of a field
#from 'branch'; if no 'branch' is
#specified, then value from root
#(the object's name)
$keys[1] = $root->{va} #the name of the object
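The choice between a rule's own procedure and the default can be sketched as follows. The subroutine keys_for is hypothetical, and the contact name and e-mail address are made-up sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: run a rule's 'keys' procedure if it has one,
# otherwise fall back to the default pair described above.
sub keys_for {
    my ($rule, $node, $root) = @_;
    return $rule->{keys}->($node, $root) if $rule->{keys};
    return ($node->{va}, $root->{va});
}

my $root = { ty => 'ob', cl => 'Contact', va => 'Adams, Sam' };
my $node = { ty => 'tx', va => 'sam@example.org' };

# No 'keys' entry in the rule: the default pair is used.
my @default = keys_for({}, $node, $root);
print join(' | ', @default), "\n";

# A rule whose procedure declines to produce a key.
my @none = keys_for({ keys => sub { undef } }, $node, $root);
print "no markup\n" unless defined $none[0];
```

A caller would test the first key and skip markup when it is undefined, which corresponds to the null return value described earlier.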
The last part of the rule is the URL specification. The URL can be
constructed in situ or referred to by name (in this case, the URL has
been "registered" in a file). The latter is useful if a URL is built
the same way again and again. An example of the in situ method is
'urls' => [{'name'=>'WeedLocus',
'URL'=>'http://weed.org:/cgi-bin/dbrun/aatdb?find+Locus+%22$keys[0]%22'}]
where @keys is the key list generated by the keys procedure. The URL
handling routines are designed so that the information in @keys is
sufficient to complete the URL.
Alternatively, a URL "name" can be supplied if the corresponding information has been registered in another file, with @keys playing the same role:
'urls' => ['MetabolicEC']
{'root' => {cl=>'Sequence'},
'branch' => [{ty=>tg,va=>'General'},
{ty=>tg,va=>'Mendel_Gene_Family'},
{cl=>'Text'}],
'urls' => [{'name'=>'Mendel', 'URL'=>'http://origin.nalusda.gov:8200/cgi-bin/find/mendel37/GeneFamily/$keys[0]'}]
}
The root specification ensures that the rule applies only to objects
from the Sequence class. The branch specification further requires
the object to have used this part of the ?Sequence model:
?Sequence General Mendel_Gene_Family ?Text
The rule takes advantage of two useful defaults. First, 'node' defaults to the last node in the branch list (the Text field). Second, '@keys' contains the value extracted from that field, which will be a gene family name. The URL is constructed in situ simply by filling in the family name $keys[0] in the appropriate place.
{'branch' => [{cl=>'GenBank'}],
'urls' => ['GenBankAC','EMBLAC','GenoBaseAC']
}
External links of this sort are likely to be reused, so it is
economical to refer to them by name and register them in a file:
GenBankAC:GenBank:http://ncbi.nlm.nih.gov:2555/htbin/birx_by_acc?genbank+$keys[0]
EMBLAC:EMBL:http://www.ebi.ac.uk/htbin/expasyfetch?$keys[0]
GenoBaseAC:GenoBase:http://genome.cornell.edu:8300/cgi-bin/partialgenobase.pl?$keys[0]
The default key is used and this time contains the contents of the ?GenBank field.
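Reading a registration line and completing its URL can be sketched as follows. The subroutine names parse_registration and fill_url are hypothetical, and treating the three colon-separated fields as name, label and URL template is an assumption based on the entries above.

```perl
use strict;
use warnings;

# Hypothetical parser for one registration line: "name:label:template".
# Only the first two colons separate fields; the rest belongs to the URL.
sub parse_registration {
    my ($line) = @_;
    my ($name, $label, $template) = split /:/, $line, 3;
    return { name => $name, label => $label, URL => $template };
}

# Hypothetical filler: replace each $keys[N] placeholder with the Nth key.
sub fill_url {
    my ($template, @keys) = @_;
    (my $url = $template) =~ s/\$keys\[(\d+)\]/$keys[$1]/g;
    return $url;
}

my $entry = parse_registration(
    'EMBLAC:EMBL:http://www.ebi.ac.uk/htbin/expasyfetch?$keys[0]');
print "$entry->{label}: ", fill_url($entry->{URL}, 'Z25996'), "\n";
```

Limiting the split to three fields keeps the colons inside the URL (after "http" and before a port number) intact. The same fill_url call would serve for URLs written in situ, since they use the same $keys[N] placeholders.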
?Contact Address E_mail Internet Text
{'root' => {cl=>'Contact'},
'branch' => [{ty=>tg,va=>'Address'},
{ty=>tg,va=>'E_mail'},
{ty=>tg,va=>'Internet'},
{ty=>tx}],
'urls' => ['mailform']
},
The named URL is registered as:
mailform:mail form:http://origin.nalusda.gov:8300/cgi-bin/mailform.pl/$keys[1]/$keys[0]
The default for @keys provides mailform.pl with its two arguments: the name of the object in $keys[1] (the person to whom mail is being sent) and the e-mail address in $keys[0] (the contents of the Text field).
This rule (from Mendel) is used to create an external link from a species object to the Germplasm Resources Information Network (GRIN) database only if GRIN contains data about the species. The species name is extracted from the Text field in
?Species Taxonomic_information Text
and is checked against a dbm file containing a list of all species known to GRIN. Only if the check is successful is a key defined, in this case the species name.
{'root' => {cl=>'Species'},
'branch' => [{ty=>tg,va=>'Taxonomic_information'},
{ty=>tx}],
'keys' => sub {my ($node,$root) = @_;
dbmopen(%grin,"/kaos/WWW/8200/cgi-bin/grin",undef);
my $omit = $grin{$node->{va}};
dbmclose(%grin); #close dbm file
if ($omit) {undef} else {$node->{va}}
},
'urls' => ['GRINtax']
}
Note that this rule has exactly the same branch definition as the rule in Example 2. Any field can be the focus for more than one rule and both can contribute to the final markup.
{'branch' => [{cl=>'GenBank'}],
'keys' => sub {my ($node,$root) = @_;
if (@{find_nodes $root ({va=>"Arabidopsis thaliana"})}) {
$node->{va};
} else { undef;}
},
'urls' => ['AAtDB']
}