efetch
The Blixem display in ACEDB needs to get the sequences for the alignments
from an external source (unless the database curator has decided to load all the
blast hits into the database itself ...). To do this it sends the system the
command 'efetch -q 'sequence name'', and expects to get a raw sequence back
from this.
The present program is a rough replacement for the standard efetch distributed
with ACEDB, since that requires that the databases of sequences are on disk and
indexed in a specific way. Because ACEDB expects the program to be called
efetch this program is called exactly that.
However, this program has its own peculiarities and special demands. These will
be documented in the following. There are also comments in the code itself that
you should read.
efetch takes one argument, namely the sequence name (e.g. efetch EMBL:M12345).
For historic (and practical) reasons, it can optionally be given
a -q switch (e.g. efetch -q EMBL:M12345) (ACEDB does this). In either case it
will return only the raw sequence.
Here it gets a bit hairy: the variable $geturlapp should be set to a command that gets a document
over the web, given the URL. The perl module bundle libwww contains a perl
script called GET that does exactly what we need. libwww requires Perl 5.002 or
better. Some modules within this bundle depend on modules that are distributed
separately from perl. Recommended modules are: MIME-Base64, IO (bundled with
perl5.004 and better), libnet and MD5. You may also have to specifiy the
complete path to the application.
OK, we have now got to the point where we can define the databases we should
know about. We do this in a hash, keyed on database names. The mapping from what
you have in your database to these keys comes later. The databases efetch knows
about at the moment are SWISS-PROT, PIR, EMBL, TREMBL and GenPept.
The hash should contain, for each database the following parts:
-
'url' [mandatory]
-
a complete URL, minus the accession number. Substitute that part of the URL with
__ACCESSION__. This will later be substituted with the correct accession.
-
'pretreat' [optional]
-
should be code for things to do before actually trying to fetch the
sequence. This may include sanity-tests on the accession number, or some
trickery such as substituting the accession number with something else.
-
'extract' [mandatory]
-
should be code for extracting the raw sequence from what is returned by the
external server.
pretreat and extract should be anonymous code references. The following is.a complete example of the code for one database.
$dbdef = {
'GenPept' => {'url' => "$srsserver?[TREMBL-acc:__ACCESSION__]+-sf+EMBL",
'extract' => $emf2raw,
'pretreat' => sub {
if ($id !~ /_\d+$/) {
print STDERR "\n$db|$acc|$id is incomplete\n";
exit 1;
} elsif ($id eq '') {
print STDERR "\nNo ID in $getthis\n";
exit 1;
}
$idold = $id; $id = $acc; $acc = $idold;
}
},
};
Note that the url part contains a global variable $srsserver and that it has
__ACCESSION__ as a placeholder for the accession number. extract is a
pointer to a subroutine that was defined before, and pretreat is explicitly
defined here -- it tests the values the program has and aborts if they are
missing or truncated; it also swaps $id and $acc.
Now we need to get hold of the interesting parts of what ACEDB passes us, which is
the name of the sequence object that the hit is on.
This version knows about two different formats: database|accno|id as in blast
results from NCBI (among others), and database:accno which historically has
been the way MycDB formulates sequence names.
If you have another format, you need to insert the code for parsing it.
Author: Staffan Bergh, <staffan.bergh@eu.pnu.com>
The latest version of this program should be available on the ACEDB
documentation server, at URL xxx