The input files can be either individual records, or entire *.dat files from the distribution of EMBL. Files may also be from an SRS server, or from the Staden lip program (both these have their peculiarities).
The parsing can be somewhat customized:
embl2ace.pl -i'). The '-i' format also suppresses output of
Other_sequence tags, and instead puts all accession numbers on the Database tag.
embl2ace.pl -p' you can get the parse date into the Date tag, with a
descriptive comment.
embl2ace.pl -f 'directory''.
Finally there are some help functions: 'embl2ace.pl -h' prints a short summary of options;
'embl2ace.pl -v' prints the current version [currently $Rev$]; and 'embl2ace.pl -d #' will
print debugging messages, where # is the level, from 0 (no messages) to 6 ('tell me everything ...').
The script also uses one external file of perl functions, called util5.pl (this should be present in the same directory as this file), and you may have to adjust the require statement at the beginning of this script to point at the right place.
Then we read the command line and look for arguments. Anything that is not a legal argument is considered a filename. Also do some rudimentary checking on the arguments.
While reading in the data, it is broken into an list of records. Some checking goes on here: if the record is in some way broken (the // terminator may be missing, or we find a ID field code in the middle of another record), the script commits suicide with a note of the reason.
XX
lines and filler spaces put on the end of lines by the Staden lip program.
Here goes the opening of filehandles for output: if the script is called with
the -d argument, the output is to a file (and the script dies if it cannot
open that file), else it attaches it output to STDOUT.
Unless the -i switch was given at startup, the first accession number is used
as the basis of the name of the sequence object, as mentioned above.
The Database tag is filled in with references, and Other_sequence tags are
added unless running with the -i switch.
At this point, we actually print the first line of output, with some administrative stuff about files and dates etcetera. The rest of the output is held for later, in the variable $sequence.
If the -p argument was used the Parse date is also added at this point.
If the DT lines are missing, we abort from this record and go to the next (records must have DT lines according to the definition of the EMBL format).
There is a further complication in that the feature key source also holds
taxonomic data, and these may sometime conflict, or extend each other. Therefore
we hold the organism name in a list until further notice.
In addition, there is a design decision buried here: if the reference has a RX
field code, ie a pointer to an external literature database (this is usually
MEDLINE) we put a pointer to a Paper into the Sequence object and ignore the
rest - it is so much cleaner and easier to take the reference info directly from
MEDLINE in those cases. The name of those are simply DATABASE_ACCESSION.
The Submitter info is not made into a Paper object, instead it is put into a Date tag.
The Paper objects formed are stored until later. The names are formed from the
Sequence name, plus ref#, where the '#' is a counter.
There is a complication here, in that Features may also contain /db_xref qualifiers. The handling of these guys should be really reworked!
feature key,
a feature location and one or more feature qualifiers. Step one is
to simply get those parts out, cleaned up and stored in a hash.
When that is done we need to treat some features special: CDS, mat_peptide, sig_peptide, mRNA, rRNA, tRNA. So we loop over all the features and treat them one by one.
First have a look at the location, and parse that into a useful format. Also keep track of if the range is open-ended in either end (Start_not_found or End_not_found)
Then we look at the feature key, and treat the one that have special needs.
allele should have a replace-sequence.
attenuator should be more than one nucleotide long.
CDS (or CDS_pept, as they may come out of SRS) needs to be formatted into
a separate Sequence objects (by design with a name as the parent sequence with
'_cds#' added. Some of the feature qualifiers for this key also needs special handling (eg
gene, db_xref, translation, codon_start and transl_table.)
conflict should have a replacement sequence or at least a position.
mat_peptide needs to be attached to an eventual subsequence that it belongs to, but since we cannot know at the time we see the mat_peptide feature key that we have seen the CDS, we have to store it for later treatment.
misc_binding should be bigger than 1 nucleotide.
misc_difference should have a replacement sequence or at least a real position.
misc_feature
misc_recomb
misc_RNA
misc_signal
misc_structure
modified_base
mRNA gets the same kind of treatment the CDSes get.
mutation
old_sequence
precursor_RNA
prim_transcript is really an mRNA, only not processed yet.
primer_bind
promoter
protein_bind
RBS
repeat_region the definition of these in EMBL is confused, and we have to do some sanity testing on the positions etc.
repeat_unit see above
rep_origin
rRNA is treated in the same way as CDS and mRNA.
sig_peptide has to be treated as mat_peptide
source This may contain a more elaborate organism definition, so this actually does not go into features, instead it has to be crosschecked against the OS data.
stem_loop
STS
terminator
tRNA is on a par with CDS and the other RNA types.
unsure
variation
virion
3'clip
3'UTR
5'clip
5'UTR
-10_signal
-35_signal
Then do the loops across mat_peptides and sig_peptides, trying to match them agains CDSes.
Thru all this we add text to the variable $features. Final step is to add $features to $sequence.