embl2ace.pl


embl2ace.pl

embl2ace.pl is a perl script that parses embl style sequence records into .ace format. It was written for the MycDB database over a period of 5 years, and is very good (if I may say so myself).

An overview

The script reads filenames on standard input and writes the parsed results on standard out, as default.

The input files can be either individual records, or entire *.dat files from the distribution of EMBL. Files may also be from an SRS server, or from the Staden lip program (both these have their peculiarities).

The parsing can be somewhat customized:

The output can be directed into individual files, by saying 'embl2ace.pl -f 'directory''.

Finally there are some help functions: 'embl2ace.pl -h' prints a short summary of options; 'embl2ace.pl -v' prints the current version [currently $Rev$]; and 'embl2ace.pl -d #' will print debugging messages, where # is the level, from 0 (no messages) to 6 ('tell me everything ...').

Caveats

This script was developed for the mycobacterial database, MycDB. It naturally has some model dependencies, and also it reflects some of the design decisions behind MycDB. Feel free to change the code as you wish, the structure of the script should make it fairly easy to adapt the output to your favourite organism/models.wrm.

The script also uses one external file of perl functions, called util5.pl (this should be present in the same directory as this file), and you may have to adjust the require statement at the beginning of this script to point at the right place.

The future

This script will probably continue to evolve. The following are some of the plans:

.

How it all works

OK, here is a blow-by-blow account of how the script works. This part should probably be read in the script, and not in the html-file, but ...

Initialize
Before doing anything else we load in the auxiliary routines from util5.pl, initialize some globals, thru the init subroutine.

Then we read the command line and look for arguments. Anything that is not a legal argument is considered a filename. Also do some rudimentary checking on the arguments.

Loop over filenames
Then we loop over all the files we have left over from the command line, open them one-by-one and read the data in.

While reading in the data, it is broken into an list of records. Some checking goes on here: if the record is in some way broken (the // terminator may be missing, or we find a ID field code in the middle of another record), the script commits suicide with a note of the reason.

Loop over records in file
When finished with reading the file, we loop over all the complete records we have in the list of records. First step is some house-keeping, removing XX lines and filler spaces put on the end of lines by the Staden lip program.

ID field code [required]
Then parse the ID line, getting all the details out of it. Not all of it is used, at the moment (the second word, the class is not used).

Here goes the opening of filehandles for output: if the script is called with the -d argument, the output is to a file (and the script dies if it cannot open that file), else it attaches it output to STDOUT.

AC field code [required]
Then read all the accession numbers attached to the record, and check them for consistency. If any is faulty skip to next record.

Unless the -i switch was given at startup, the first accession number is used as the basis of the name of the sequence object, as mentioned above.

The Database tag is filled in with references, and Other_sequence tags are added unless running with the -i switch.

At this point, we actually print the first line of output, with some administrative stuff about files and dates etcetera. The rest of the output is held for later, in the variable $sequence.

DT field code [required]
Here we get the Creation and Update dates as well as the database version numbers corresponding to those events. The dates are reformatted into ace format and the whole shebang is added to $sequence.

If the -p argument was used the Parse date is also added at this point.

If the DT lines are missing, we abort from this record and go to the next (records must have DT lines according to the definition of the EMBL format).

DE field code [required]
The description is read in, redundant whitespace stripped, and the description added to $sequence under a Title tag. Again, if it is missing, we skip to the next record.

NI field code [required]
Parse out the NID accession number and make a Database tag line of that.

KW field code [required]
Keywords are treated essentially as the DE field, but put under Keyword tags in the $sequence variable.

OS, OC and OG field codes [required, required, and optional]
This is a place where this script shows its roots: for MycDB we only want to get hold of enough of this to work out which species of Mycobacterium the sequence comes from, and the OC field may help in that, but not necessarily. The OG field is completely redundant to bacteria, so that we simply ignore.

There is a further complication in that the feature key source also holds taxonomic data, and these may sometime conflict, or extend each other. Therefore we hold the organism name in a list until further notice.

Reference info [required]
This is one of the complicated bits... There are seven different fields code for references, and these are used both for real references, and patents, and for submitter info. Also there may be more than one reference, so there is a loop here.

In addition, there is a design decision buried here: if the reference has a RX field code, ie a pointer to an external literature database (this is usually MEDLINE) we put a pointer to a Paper into the Sequence object and ignore the rest - it is so much cleaner and easier to take the reference info directly from MEDLINE in those cases. The name of those are simply DATABASE_ACCESSION.

The Submitter info is not made into a Paper object, instead it is put into a Date tag.

The Paper objects formed are stored until later. The names are formed from the Sequence name, plus ref#, where the '#' is a counter.

DR field code [optional]
Another design decision lurks here -- SwissProt crossreferences are treated specially, ie they are put in as Sequences of their own, with a Corresponding_peptide tag pointing to them in the original Sequence object. THIS IS REALLY OBSOLETE, and should be rewritten into using Peptide.

There is a complication here, in that Features may also contain /db_xref qualifiers. The handling of these guys should be really reworked!

FT field code
There are several levels to this, that needs to be handled: there is a feature key, a feature location and one or more feature qualifiers. Step one is to simply get those parts out, cleaned up and stored in a hash.

When that is done we need to treat some features special: CDS, mat_peptide, sig_peptide, mRNA, rRNA, tRNA. So we loop over all the features and treat them one by one.

First have a look at the location, and parse that into a useful format. Also keep track of if the range is open-ended in either end (Start_not_found or End_not_found)

Then we look at the feature key, and treat the one that have special needs.

allele should have a replace-sequence.

attenuator should be more than one nucleotide long.

CDS (or CDS_pept, as they may come out of SRS) needs to be formatted into a separate Sequence objects (by design with a name as the parent sequence with '_cds#' added. Some of the feature qualifiers for this key also needs special handling (eg gene, db_xref, translation, codon_start and transl_table.)

conflict should have a replacement sequence or at least a position.

mat_peptide needs to be attached to an eventual subsequence that it belongs to, but since we cannot know at the time we see the mat_peptide feature key that we have seen the CDS, we have to store it for later treatment.

misc_binding should be bigger than 1 nucleotide.

misc_difference should have a replacement sequence or at least a real position.

misc_feature

misc_recomb

misc_RNA

misc_signal

misc_structure

modified_base

mRNA gets the same kind of treatment the CDSes get.

mutation

old_sequence

precursor_RNA

prim_transcript is really an mRNA, only not processed yet.

primer_bind

promoter

protein_bind

RBS

repeat_region the definition of these in EMBL is confused, and we have to do some sanity testing on the positions etc.

repeat_unit see above

rep_origin

rRNA is treated in the same way as CDS and mRNA.

sig_peptide has to be treated as mat_peptide

source This may contain a more elaborate organism definition, so this actually does not go into features, instead it has to be crosschecked against the OS data.

stem_loop

STS

terminator

tRNA is on a par with CDS and the other RNA types.

unsure

variation

virion

3'clip

3'UTR

5'clip

5'UTR

-10_signal

-35_signal

Then do the loops across mat_peptides and sig_peptides, trying to match them agains CDSes.

Thru all this we add text to the variable $features. Final step is to add $features to $sequence.

CC field code

Sequence header

... and the sequence

Finishing up
Check that nothings left of the record.

Print out
Print the sequence, and then all its subsequences, and then all of the references.

.

Author

Staffan Bergh (staffan.bergh@eu.pnu.com), 1997-08-06