efetch


A perl replacement for the ACEDB efetch

The Blixem display in ACEDB needs to get the sequences for the alignments from an external source (unless the database curator has decided to load all the blast hits into the database itself ...). To do this it sends the system the command 'efetch -q 'sequence name'', and expects to get a raw sequence back from this.

The present program is a rough replacement for the standard efetch distributed with ACEDB, since that requires that the databases of sequences are on disk and indexed in a specific way. Because ACEDB expects the program to be called efetch this program is called exactly that.

However, this program has its own peculiarities and special demands. These will be documented in the following. There are also comments in the code itself that you should read.

Calling efetch

efetch takes one argument, namely the sequence name (e.g. efetch EMBL:M12345). For historic (and practical) reasons, it can optionally be given a -q switch (e.g. efetch -q EMBL:M12345) (ACEDB does this). In either case it will return only the raw sequence.

GETting html documents

Here it gets a bit hairy: the variable $geturlapp should be set to a command that gets a document over the web, given the URL. The perl module bundle libwww contains a perl script called GET that does exactly what we need. libwww requires Perl 5.002 or better. Some modules within this bundle depend on modules that are distributed separately from perl. Recommended modules are: MIME-Base64, IO (bundled with perl5.004 and better), libnet and MD5. You may also have to specifiy the complete path to the application.

Known databases

OK, we have now got to the point where we can define the databases we should know about. We do this in a hash, keyed on database names. The mapping from what you have in your database to these keys comes later. The databases efetch knows about at the moment are SWISS-PROT, PIR, EMBL, TREMBL and GenPept.

The hash should contain, for each database the following parts:

'url' [mandatory]
a complete URL, minus the accession number. Substitute that part of the URL with __ACCESSION__. This will later be substituted with the correct accession.

'pretreat' [optional]
should be code for things to do before actually trying to fetch the sequence. This may include sanity-tests on the accession number, or some trickery such as substituting the accession number with something else.

'extract' [mandatory]
should be code for extracting the raw sequence from what is returned by the external server.

pretreat and extract should be anonymous code references. The following is.a complete example of the code for one database.

$dbdef = { 'GenPept' => {'url' => "$srsserver?[TREMBL-acc:__ACCESSION__]+-sf+EMBL", 'extract' => $emf2raw, 'pretreat' => sub { if ($id !~ /_\d+$/) { print STDERR "\n$db|$acc|$id is incomplete\n"; exit 1; } elsif ($id eq '') { print STDERR "\nNo ID in $getthis\n"; exit 1; } $idold = $id; $id = $acc; $acc = $idold; } }, };

Note that the url part contains a global variable $srsserver and that it has __ACCESSION__ as a placeholder for the accession number. extract is a pointer to a subroutine that was defined before, and pretreat is explicitly defined here -- it tests the values the program has and aborts if they are missing or truncated; it also swaps $id and $acc.

Parsing the input

Now we need to get hold of the interesting parts of what ACEDB passes us, which is the name of the sequence object that the hit is on.

This version knows about two different formats: database|accno|id as in blast results from NCBI (among others), and database:accno which historically has been the way MycDB formulates sequence names.

If you have another format, you need to insert the code for parsing it.

If you have problems:

Author: Staffan Bergh, <staffan.bergh@eu.pnu.com>

The latest version of this program should be available on the ACEDB documentation server, at URL xxx