How to ask complex questions in ACEDB
via queries, tables and scripts

Jean Thierry-Mieg, CNRS

Tutorial for the ace97 workshop
Cornell, July 1997


Introduction

The simplest way to extract information from acedb is to start the graphic interface and browse using the built-in hypertext facilities. In many cases, the graphic maps that appear on screen give very valuable information which would be very hard to extract in any other way. A good schematic drawing is more valuable than a long talk, said Napoleon.

However, there are many cases where one wants a direct answer to a particular question. In such a case, the casual user may type a name in the top box of the main window, search for a word in all classes using the lower box of the main window, or use one of the user-friendly graphic query interfaces: Query Builder and Query By Example.

More advanced users will be able to type their queries directly in the graphic Query interface, or using the new query button of the keyset window, and to extract tables using the table-maker interface.

There are, however, many more possibilities in acedb which are little known and poorly documented. The first purpose of this document is to cover the advanced features of the query language, in particular the arithmetic and set logic operators.

Then, in the rest of the document, I will try to explain how one can mix, in a very general way, ace queries and tables, edit commands and keyset stacks, embedded scripts and embedding scripts, servers and clients, or the table maker. The general idea is that, following the standard UNIX methodology, there is a unique code layer in acedb which actually executes any given elementary query, but the system can pile layer upon layer and pass parameters down the line, provided one includes in the definitions the correct random number of quotes and backslashes.

Model file to run this demo


?A b ?B
   c ?C
   myTag Int Text

?B Ct ?Ct #CTT

#CTT Length Int

?C c1 Int
   c2 Int

?Sequence DNA ?DNA Int

Data file to run this demo

A myA
myTag 1 x1
myTag 2 x2
b b1
b b2
C cc1

A hisA
b b3
c cc2
c cc3

B myB
Ct Ct1 Length 17

B hisB
Ct Ct1 Length 18

C myC
c1 7
c2 9
c2 11


C hisC
c1 4
c2 13
c2 22

>mydna
atgcgtgtgac

Auxiliary files

File 'a b' # space included in the name of the file
// Notice that acedb can read fasta format files

>S1
atgcatgc

>S2
tgtgtcgat
tgg

////////////
File update.script
$echo `date` >> update.log
$echo -n "Parsing file %1 ... " >> update.log
Parse %1
$echo "Done" >> update.log

File cl.serv.csh, set up for the Saint Louis lab
#! /bin/csh -f
setenv ACEDB `pwd`
setenv hh peg
setenv pp 2345678  # some crazy port number

aceserver $ACEDB -port $pp &
sleep 1 # needed to let the server detach itself from this shell

aceclient $hh -port $pp << END >! toto
Query Find sequence
show -a DNA
shut  // close server now
quit
END

nawk 'BEGIN {n=0;nn=0}/^DNA/{n++;nn += $3;}END {printf("%d seq, %d bases\n",n,nn)}' toto
#\rm toto

Current keyset

A query in acedb is a keyset pipe. It receives a set of keys, possibly empty, and either filters this set for those objects matching a complex condition, or constructs a new set and then filters it.

The simplest constructor is the FIND command. It ignores the input set and simply returns all members of the given class. The second constructor is the FOLLOW command. It returns the set of all keys to the right of the given tag in the original set of objects. The last two constructors are NEIGHBOURS, which returns all first neighbours of the original set, and GREP, which looks for objects whose name matches the given text, or which are pointed at by a matching AUTO-XREF object, i.e. a ?Text, a ?Keyword or an object of any other class declared type X in options.wrm.

A complex query is a pipeline of simple queries separated by semicolons ';', which are applied one after the other.

A simple query is made of a series of elementary queries, described below, glued together by the logical operators AND, OR, XOR, NOT.

Elementary queries

An elementary query is a boolean filter. It receives a developed acedb object, with a pointer at the current location, and tries to move the pointer to the new location described by the elementary query. If it fails, it returns FALSE and the position of the pointer is not guaranteed. If it succeeds, the pointer is positioned on the successful data item.

The original location of the pointer only matters if the query is geographic and contains the reserved word NEXT or a # operator. Such queries are supported, and indeed needed by the kernel, but should be avoided because they are hard to read and not robust to schema modifications.

The recommended style is to use a tag name, or the relatively new colon operator, to move right of a given tag. Following the usual convention of the C language, tag:0 is the tag itself, tag:1 the value immediately to its right, tag:2 the next one, and so on. However, this syntax is not always applicable. Consider the object:

?A myTag Int Text

myA myTag 1 "x1"
          2 "x2"

The following queries will succeed:
Query Find A myTag   // since myTag is present in the object
Query Find A myTag:0 // for the same reason
Query Find A myTag:1 // since there is a number '1' right of myTag
Query Find A myTag:2 // since there is a value right of '1'

Query Find A myTag = 1 // because a comparator moves one right
Query Find A myTag:1 = 1 // unless the : was specified
Query Find A myTag:2 = x1  // because : moves along first line
Query Find A myTag = 2 AND NEXT = x2 // a correct use of NEXT

Those will fail
Query Find A myTag:0 = 1 // because :0 is the tag itself
Query Find A myTag:2 = x2 // because :2 will be positioned on 'x1'
The problem in the last query is that the query evaluation routine never backtracks: once :2 has moved along the first line to 'x1', the second line is not tried. Finally, recall that in the case of a constructed type you must use the # operator to move into the subobject.
Models:
?B Ct ?Ct #CTT

#CTT Length Int

Object:

myB Ct ct1 Length 17 // # does NOT appear in the ace file

Successful query:
Query Find B Ct:1 # Length = 17
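
As a reading aid, the behaviour of the colon operator on myA can be mimicked by a toy model in Python. This is only a sketch of the rules stated in this section, not ACEDB code: the dictionary layout and the names locate and colon_equals are assumptions made for the illustration, and tag names are compared exactly.

```python
# Toy model of the colon operator; not ACEDB code.
# An object maps each tag to the rows of values found right of that tag.
my_a = {"myTag": [[1, "x1"], [2, "x2"]]}

def locate(obj, tag, n):
    """Return the item reached by tag:n, or None if nothing is there.

    n == 0 is the tag itself; n >= 1 walks right along the FIRST row
    only, which mimics the absence of backtracking.
    """
    if tag not in obj:
        return None
    if n == 0:
        return tag
    first_row = obj[tag][0] if obj[tag] else []
    return first_row[n - 1] if n <= len(first_row) else None

def colon_equals(obj, tag, n, value):
    """Model of the elementary query 'tag:n = value'."""
    item = locate(obj, tag, n)
    return item is not None and str(item) == str(value)

print(colon_equals(my_a, "myTag", 1, 1))     # myTag:1 = 1
print(colon_equals(my_a, "myTag", 2, "x1"))  # myTag:2 = x1
print(colon_equals(my_a, "myTag", 0, 1))     # myTag:0 = 1
print(colon_equals(my_a, "myTag", 2, "x2"))  # myTag:2 = x2
```

In this model the first two calls succeed and the last two fail, in line with the lists above. The model does not cover NEXT, the # operator or a comparator applied to a bare tag.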


Set operations

Set operations can be used to combine different sets. The operators are SETOR, SETXOR, SETAND and SETMINUS, abbreviated as $|, $^, $& and $-. They are applied to pairs of sets, recognised by the curly bracket delimiters {}. These brackets establish the set context. The queries inside the brackets run on the same input set. However, it is often easier to use the stack operations described below.

Example

?A b ?B
   c ?C

Query Find A ; {FOLLOW b} SETOR {FOLLOW c}

Inside the brackets, you can have an arbitrary query, including in principle nested {}, although I am not sure that this works.
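
To make the four operators concrete, here is a toy model in Python built on the demo data file. It is a sketch only, not ACEDB code: the dictionaries B_OF and C_OF and the function follow are names invented for the illustration, and Python's own set operators stand in for SETOR, SETAND, SETXOR and SETMINUS.

```python
# Toy model of the set context; not ACEDB code.
# Links copied from the demo data file: the b and c values of each A object.
B_OF = {"myA": {"b1", "b2"}, "hisA": {"b3"}}
C_OF = {"myA": {"cc1"}, "hisA": {"cc2", "cc3"}}

def follow(keyset, links):
    """Model of FOLLOW: every key found right of the tag, over the whole set."""
    result = set()
    for key in keyset:
        result |= links.get(key, set())
    return result

all_a = {"myA", "hisA"}        # Find A
left = follow(all_a, B_OF)     # {FOLLOW b}, run on the input set
right = follow(all_a, C_OF)    # {FOLLOW c}, run on the same input set

set_or = left | right          # SETOR,    $|
set_and = left & right         # SETAND,   $&
set_xor = left ^ right         # SETXOR,   $^
set_minus = left - right       # SETMINUS, $-

print(sorted(set_or))
```

With the demo data, the SETOR result holds the three B keys and the three C keys. Since the two sides belong to different classes, SETAND is empty here, SETXOR equals SETOR and SETMINUS returns the B keys alone.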

Arithmetic operations

The square brackets establish the arithmetic context. Inside [] the system will always try to recover numbers and is allowed to combine them via the four elementary arithmetic operators +, -, *, /.

The following operators evaluate to numbers: [], COUNT, AVG, SUM, MIN, MAX

Finally, in the arithmetic context, an attempt is made to cast the item to the right of the locator to a number. This should succeed for Int, Float, Date, may succeed for Text, but is not attempted for any class name, including ?Text. Note that arithmetic operations on dates use their internal unix representation and are rather ill defined, except min, max, avg. If casting fails, the query returns FALSE.

Examples:

?C c1 Int
   c2 Int

myC c1 7
    c2 9
       11

The following queries should return myC

Query Find C [ 1 + 2] = 3      // all C objects !
Query Find C [[c1] + [c2]] = 16   // the first value of c2 is used as usual
Query Find C [[c1] + AVG c2 ] = 17
Query Find C [[c1] + MAX c2 ] = 18
Query Find C [[c1] + 100*(COUNT c2)] = 207
Query Find C c2 && [[c2] - [c1]] = [MAX c2 - MIN c2] // also returns hisC (to be tested)
Query Find A COUNT {FOLLOW b} = 2 // a query on class A: should return myA, which has two b values
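
The numbers on the right-hand side of these queries can be checked by hand, or with the small Python sketch below. It only redoes the arithmetic on the values of myC and hisC from the data file; it is not ACEDB code, and the helper names first and checks are invented for the illustration.

```python
# Toy check of the arithmetic examples; not ACEDB code.
from statistics import mean

my_c = {"c1": [7], "c2": [9, 11]}
his_c = {"c1": [4], "c2": [13, 22]}

def first(obj, tag):
    """Model of [tag]: the first value right of the tag."""
    return obj[tag][0]

def checks(obj):
    """Left-hand sides of the example queries, computed for one C object."""
    c1, c2 = first(obj, "c1"), obj["c2"]
    return {
        "sum_first": c1 + first(obj, "c2"),   # [[c1] + [c2]]
        "with_avg": c1 + mean(c2),            # [[c1] + AVG c2]
        "with_max": c1 + max(c2),             # [[c1] + MAX c2]
        "with_count": c1 + 100 * len(c2),     # [[c1] + 100*(COUNT c2)]
        # [[c2] - [c1]] = [MAX c2 - MIN c2]
        "spread": first(obj, "c2") - c1 == max(c2) - min(c2),
    }

print(checks(my_c))
print(checks(his_c))
```

For myC this gives 16, 17, 18 and 207, as in the queries. The last condition holds for both objects (9 - 7 = 11 - 9 and 13 - 4 = 22 - 13), which is why hisC is expected in that answer too.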

Stack and edit commands

To be completed

The stack works the reverse Polish way: you have a stack of keysets, and each command combines the active keyset with the top of the stack.

Example

Find sequence
spush
Follow DNA
sor
spop
list  // gives dna and sequences
write  filename  // exports
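
The example can be followed step by step with a toy model in Python. This is a sketch, not ACEDB code: the class KeysetSession is invented for the illustration, sor is assumed to replace the top of the stack by its union with the active keyset (which is what the example relies on), and the keys S1 and S2 are assumed to exist both as Sequence and as DNA objects.

```python
# Toy model of the keyset stack; not ACEDB code.
class KeysetSession:
    def __init__(self):
        self.active = set()   # the active keyset
        self.stack = []       # the stack of saved keysets

    def spush(self):
        """Push a copy of the active keyset on the stack."""
        self.stack.append(set(self.active))

    def spop(self):
        """Pop the top of the stack; it becomes the active keyset."""
        self.active = self.stack.pop()

    def sor(self):
        """Replace the top of the stack by its union with the active keyset."""
        self.stack[-1] = self.stack[-1] | self.active


session = KeysetSession()
session.active = {("Sequence", "S1"), ("Sequence", "S2")}   # Find sequence
session.spush()
session.active = {("DNA", "S1"), ("DNA", "S2")}             # Follow DNA
session.sor()
session.spop()
print(sorted(session.active))   # the DNA keys and the sequence keys together
```

After spop the stack is empty again and the active keyset holds the four keys, which is what list then displays.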

The edit command, available in tace, in aceclient and from the edit button of the keyset windows, applies ONE ace line to every object of the active set.
Example

FIND C
edit c1 13 // adds c1 13 to all C objects
edit -D c2 // removes all c2 tags
edit c2 9

Embedded scripts

There are two special character commands that are interpreted in a particular way if they occur at the beginning of a line:

@filename parm1 parm2 ... includes the file in the flow and substitutes parm1 parm2 ... in place of %1 %2 ... It is an error to provide an insufficient number of parameters; the result of the error is, I think, to close the included file.

$shell_command_line sends the uninterpreted line directly to the Unix shell. Since this may constitute a breach of security, it is only possible if, before running acedb, you setenv ACEDB_SUBSHELL, and $ is forbidden in aceserver.

Inclusion and shell calls can be recursive and as deep as one may wish, as in the example below.

The parameter substitution is a macro substitution done at run time, when the %n token is reached. The substituted string is the result of unprotecting parm, that is, double quotes are removed, backslashes are interpreted and so on. We are missing an equivalent of the csh single-quote '' syntax, which would provide for substitution as is, without interpretation.

Example
acedb> $cat update.script // from John Morris
// Update script for tace
// Modified Thu Aug  4 08:44:09 EDT 1994 JWM
$echo `date` >> update.log
$echo -n "Parsing file %1 ... " >> update.log
Parse %1
$echo "Done" >> update.log

acedb> @update.script "\"a b\""

will actually read the file with the stupid name 'a b', blank included,
and report about it in the update log file.
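
Why two levels of quotes are needed can be seen in a toy model of the substitution, written in Python. It is a sketch under stated assumptions, not the acedb code: unprotect and substitute are invented names, unprotecting is modelled as removing one pair of enclosing double quotes and then letting each backslash make the next character literal, and a missing parameter raises an error instead of closing the file.

```python
# Toy model of the %n parameter substitution; not ACEDB code.
import re

def unprotect(parm):
    """Remove enclosing double quotes, then drop each protecting backslash."""
    if len(parm) >= 2 and parm.startswith('"') and parm.endswith('"'):
        parm = parm[1:-1]
    return re.sub(r"\\(.)", r"\1", parm)

def substitute(line, parms):
    """Replace each %n in line by the unprotected n-th parameter."""
    def replace(match):
        index = int(match.group(1))
        if not 1 <= index <= len(parms):
            raise ValueError("missing parameter %%%d" % index)
        return unprotect(parms[index - 1])
    return re.sub(r"%(\d+)", replace, line)

# The parameter exactly as typed after @update.script: "\"a b\""
typed = r'"\"a b\""'
print(substitute("Parse %1", [typed]))
```

In this model the line Parse %1 becomes Parse "a b": the outer quotes are consumed by the substitution and the inner, backslash-protected ones survive, so that the Parse command still sees a single quoted file name.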

More examples are shown in the document on tace by John Morris,
available on the acedb.doc server.

Embedding client/server scripts

Conversely, one may embed tace scripts inside a larger csh script. However, this may be costly, because of the overhead of starting tace. Rather, we suggest using aceclient. This can be done even if you do not have a daemon server. Just invoke the server in the foreground. In this example, I wish to find the cumulated length of all sequences using a mixture of ace and nawk.
#!/bin/tcsh -f
setenv hh `hostname`
setenv pp 2345678  # some crazy port number

aceserver $ACEDB -port $pp &
sleep 1 # needed to let the server detach itself from this shell

aceclient $hh -port $pp << END >! ~/__toto
Query Find sequence
show -a DNA
shut  // close server now
quit
END

nawk 'BEGIN {n=0;nn=0}/^DNA/{n++;nn += $3;}END {printf("%d seq, %d bases\n",n,nn)}' ~/__toto
\rm ~/__toto
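
For readers less familiar with awk, the summary step of this script can be written in Python as follows. This is a sketch: the function name summarise is invented, and the sample text is a hypothetical rendering of what show -a DNA might print for the two sequences of the auxiliary fasta file, not captured output.

```python
# Python equivalent of the nawk one-liner; the sample text is hypothetical.
def summarise(text):
    """Count the lines starting with DNA and add up their third field."""
    count, bases = 0, 0
    for line in text.splitlines():
        if line.startswith("DNA"):
            fields = line.split()
            count += 1
            if len(fields) >= 3:
                # assumed to be an integer length, as in ?Sequence DNA ?DNA Int
                bases += int(fields[2])
    return count, bases

sample = '''Sequence : "S1"
DNA "S1" 8

Sequence : "S2"
DNA "S2" 12
'''
print("%d seq, %d bases" % summarise(sample))
```

On this sample the sketch reports 2 sequences and 20 bases, the lengths 8 and 12 being those of S1 and S2 in the fasta file.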

Note that a small perl library written by Steve Rozen, ace/wtools/ace_perl.pm, gives you access to acedb queries from perl, and the larger system by Barnett, Bigwood et al. provides more perl functionalities.

Mailing lists

Consider also the aceclient report mode, for mailings. It would be very useful to extend this to the export of html documents. I suggest that as an exercise for the meeting. Subject: start from the code wrpc/aceclient.c, which is only 300 lines long, and extend its functionality as follows. In the present system
aceclient host -port port -f input_file
will substitute in input_file any occurrence of #(...) by the result of asking ... to the server. This should be modified a little bit so that the input could be a meta-html page and the output a correctly formatted html document. In the present system there would be undesirable blank lines and so on.
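
As a starting point for the exercise, the #(...) substitution itself can be modelled in a few lines of Python. This is a toy sketch, not the code of wrpc/aceclient.c: fill_report and fake_server are invented names, the server is replaced by a stand-in that simply echoes the command, and commands containing a closing parenthesis are not handled.

```python
# Toy model of the report mode substitution; not the code of aceclient.c.
import re

def fill_report(template, ask_server):
    """Replace every #(command) in template by ask_server(command)."""
    return re.sub(r"#\(([^)]*)\)", lambda m: ask_server(m.group(1)), template)

def fake_server(command):
    """Hypothetical stand-in for a real server: echoes the command."""
    return "<answer to: %s>" % command

text = "These are the alleles #(Show Allele)\n# a lone sharp is left alone\n"
print(fill_report(text, fake_server), end="")
```

Only a # immediately followed by an opening parenthesis triggers a call; any other text, including a lone #, is copied unchanged.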

Here is an example of a correct input file

// @(#)client.report.example	1.2 1/23/96
// Hoping that my computer is running, try the following
// that I developed with Danielle
// ftp from ncbi my executable
// sparc.aceclient.Z
//
// uncompress it and type on your command line
// usage:
// "usage : aceclient host  [-time_out nn_in_seconds] [-ace_out] [-f reportfile parameters]\n",
// if filename is omitted, you run interactively
//
// for example try:
// aceclient 193.49.111.71 -f thisfilename genename
//
// 193.49.111.71 is my dec alpha in montpellier, broken by a thunderstorm these days, sorry.
//
// you will get a developed form of the gene
// embedded in the non // lines
//
// lines starting with // are skipped in the output
// %1 and so on stand for command line parameter 1 and so on.
// 
// #(command text) invokes the ace_server
// 
#(Find Gene %1) 

Consider gene %1 // %1 stands for the name of the gene

These are the known papers:#(Show Reference)

$ls -ls

Or in a more standard bibliographic format:
#(Biblio)

These are the alleles #(Show Allele)
that's all, folks


All I know on tata box:
#(Grep tata) #(List)
# Bonsoir

// end

// Modified July 1997