Blixem: a graphical BLAST viewer

Blixem, as implemented in ACEDB, creates a graphical representation of BLAST output. This is useful when BLASTing long genomic sequences, which results in a large number of hits. Blixem is described in "Sonnhammer and Durbin (1994) A Workbench for large-scale sequence homology analysis. CABIOS 10:301-307". Blixem stands for "BLast matches In an X-windows Embedded Multiple alignment". It contains two displays: the bottom display shows the actual alignment of DNA or proteins to the genomic DNA sequence, and the top display shows the "Big Picture" of matches around the alignment window.
 

How to get the Blixem display

In the ACEDB map display, links to Blixem data are represented as colored boxes. Cyan boxes link to protein alignments with the mapped sequence data, and yellow boxes link to DNA alignments with the mapped sequence data (a sample map is shown below with protein (cyan) and DNA (yellow) Blixem links). To select a Blixem option, place the mouse cursor over the colored box and press the right button, which yields the following options:
 
  1. Show multiple protein alignment in Blixem
  2. Show multiple protein alignment of just this kind of homologies
  3. Develop a table of these homologies
  4. (slow) import this protein over the web
The first two selections in the right-button menu on the Blixem indicator open the Blixem display (described further below).

With the "Develop a table..." selection, one or two messages may appear before you are launched into the Forest display. The Forest display allows sequence hits to be sorted for preparing bibliographic records (see Forest Display).

The selection "(slow) import this protein ..." is a link to a program in ACEDB's wscripts directory, which retrieves the highlighted sequence over the WWW interface.
 

Models for Blixem Display

?Sequence
Homol    DNA_homol ?Sequence XREF DNA_homol ?Method Float Int UNIQUE Int Int UNIQUE Int
              Pep_homol ?Protein XREF DNA_homol ?Method Float Int UNIQUE Int Int UNIQUE Int
              Motif_homol ?Motif XREF DNA_homol ?Method Float Int UNIQUE Int Int UNIQUE Int

?Protein
Homol    DNA_homol ?Sequence XREF Pep_homol ?Method Float Int UNIQUE Int Int UNIQUE Int
              Pep_homol ?Protein XREF Pep_homol ?Method Float Int UNIQUE Int Int UNIQUE Int
              Motif_homol ?Motif XREF Pep_homol ?Method Float Int UNIQUE Int Int UNIQUE Int

?Motif
Homol    DNA_homol ?Sequence XREF Motif_homol ?Method Float Int UNIQUE Int Int UNIQUE Int
              Pep_homol ?Protein XREF Motif_homol ?Method Float Int UNIQUE Int Int UNIQUE Int
              Motif_homol ?Motif XREF Motif_homol ?Method Float Int UNIQUE Int Int UNIQUE Int

Raw data entry:

Sequence : "B0001"
DNA_homol    "CESAA60F"       "BLASTN"        174.000000   2072   2022  109  159
DNA_homol    "CESAA60F"       "BLASTN"        148.000000  20890  20856  109  143
DNA_homol    "yk15d5.3"       "BLASTN"        157.000000  23719  23780   93  154
DNA_homol    "yk15d5.3"       "BLASTN"        145.000000   3795   3749   57  103
Pep_homol    "SW:KPPR_ARATH"  "BLASTX"         79.000000  10026   9922  161  195
Pep_homol    "SW:KPPR_ARATH"  "BLASTX"         64.000000   9867   9718  213  262
Pep_homol    "SW:KPPR_MESCR"  "BLASTX"         58.000000  10276  10184  123  153
Pep_homol    "SW:KPPR_MESCR"  "BLASTX"         73.000000  10026   9922  162  196
Motif_homol  "PS:PS00017"     "Queryprosite"   13.800000   1881   1902    1    8
Motif_homol  "PS:PS00077"     "Queryprosite"   19.900000  35626  35638    1    5
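To make the layout of these Homol lines concrete, the following Python sketch splits one line into typed fields. It is an illustration only, not part of ACEDB: the class and function names are hypothetical, and the reading of the four integers (start and end on the ?Sequence object, then start and end on the matching object, with start greater than end meaning a reverse-strand match) is an assumption based on the model above and the sample data.

```python
import shlex
from typing import NamedTuple


class Homol(NamedTuple):
    tag: str       # DNA_homol, Pep_homol or Motif_homol
    target: str    # name of the matching object
    method: str    # e.g. BLASTN, BLASTX, Queryprosite
    score: float
    q_start: int   # assumed: start on the ?Sequence object
    q_end: int     # assumed: end on the ?Sequence object
    t_start: int   # assumed: start on the matching object
    t_end: int     # assumed: end on the matching object


def parse_homol(line: str) -> Homol:
    """Split one Homol line into typed fields."""
    fields = shlex.split(line)  # also strips the double quotes around names
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    tag, target, method, score, *coords = fields
    q_start, q_end, t_start, t_end = (int(c) for c in coords)
    return Homol(tag, target, method, float(score), q_start, q_end, t_start, t_end)


def is_reverse(h: Homol) -> bool:
    """Assumed convention: start > end marks a reverse-strand match."""
    return h.q_start > h.q_end


hit = parse_homol('DNA_homol "CESAA60F" "BLASTN" 174.000000 2072 2022 109 159')
print(hit.target, hit.score, is_reverse(hit))
```

The same function handles Pep_homol and Motif_homol lines, since all three tags share the same field layout in the models.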
 

The Blixem Display


 

Top display:

The top display shows the "Big Picture" of genes and matches around the alignment window, which is shown as a blue frame. Each match is represented as a black line, placed on a percentage identity scale. Scrolling with the middle mouse button moves the center of the blue frame. Each sequence is selectable: selecting a black line highlights the span of the sequence in cyan and highlights the corresponding sequence in the bottom display in reverse video. Zooming in and out only changes the amount of sequence shown in the Big Picture around the alignment window. The bottom of the top display also shows intron/exon information, as in the Map display.

The left mouse button highlights sequences. With a second click (not a double click) on a highlighted sequence you can fetch annotations from the local ACEDB database, from local databases using efetch, or over the World-Wide Web using WWW-efetch. Be patient: access to a remote database may take some time. If retrieval does not work, either efetch or the database itself is not installed. In that case the sequence will not be displayed, and the annotated range will be represented by dashed lines instead. The percent identity with the query sequence cannot be calculated either, so the originally reported BLAST score will be displayed.

Blixem calls efetch in two modes: in mode 1 it calls "efetch -q seqname", while in mode 2 it calls "efetch seqname". Your efetch script wrapper must therefore check for the -q option. If -q is given, the wrapper should return the raw sequence on one line only. If it is not, the wrapper should return the annotation as raw text on multiple lines.

Switch to the opposite strand by clicking on "Strand v^". "Goto..." can be used either by picking, to go to an absolute position, or as a pull-down menu with the right mouse button, to go to the beginning or end of the query sequence currently in Blixem. By default, the query sequence is a 20 kb region around the box that was used in ACEDB to call Blixem.

Sequence retrieval tools other than efetch can also be used (see the discussion of efetch below). If you want to use your own in-house retrieval system, you can make a script wrapper that simulates efetch. This would be placed in the $ACEDB/wscripts directory of the ACEDB database. The method for retrieving sequences (acedb, efetch, www-efetch) can be selected from the Blixem settings menu (shown below).
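As an illustration of such a wrapper, the Python sketch below follows the contract stated above: with -q it returns the raw sequence on one line, without -q it returns the annotation as multi-line text. Everything else is an assumption made for the example: the RECORDS dictionary stands in for a real in-house retrieval system, the entry SW:EXAMPLE and its contents are placeholders, and the non-zero exit status for a missing entry is a guess at sensible behaviour rather than something efetch is documented to do.

```python
#!/usr/bin/env python3
"""Hypothetical efetch-compatible wrapper around an in-house sequence store."""
import sys

# Stand-in for a real in-house database: name -> (annotation, sequence).
# The entry below is a placeholder, not a real database record.
RECORDS = {
    "SW:EXAMPLE": (
        "ID   EXAMPLE\nDE   Placeholder entry for illustration.",
        "MKTAYIAKQR\nQISFVKSHFS",
    ),
}


def fetch(argv):
    """Return the text to print, or None if the entry cannot be found."""
    raw_only = "-q" in argv
    names = [a for a in argv if a != "-q"]
    if len(names) != 1 or names[0] not in RECORDS:
        return None
    annotation, sequence = RECORDS[names[0]]
    if raw_only:
        # -q given: raw sequence on one line only
        return "".join(sequence.split())
    # no -q: annotation as raw text on multiple lines
    return annotation


def main(argv):
    """Print the requested text; return 1 on failure (assumed convention)."""
    out = fetch(argv)
    if out is None:
        return 1
    print(out)
    return 0


# A real wrapper script would end with: sys.exit(main(sys.argv[1:]))
```

Saved as an executable named efetch and given the final sys.exit line, a script of this shape could be called as "efetch -q SW:EXAMPLE" or "efetch SW:EXAMPLE".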

The right mouse button invokes a menu with additional functions, such as Dotter, which creates a dot plot from the alignment. The general mouse key selections are shown below:
 

 

 

Bottom display:

The yellow highlighted sequence at the top of the bottom display is the genomic sequence translated in three frames for the current orientation. Matching sequences are listed below the respective reading frame. Residues in a homologous protein are shown in one of three ways: cyan (bright blue) for the same residue, dark blue for a conserved substitution, and no colour for a mismatch. The Start and End coordinates refer to the entire match, not just the part displayed at the time. Scrolling is done either with the scroll buttons or with the middle mouse button, which also reports the coordinate of the residue under the crosshair.
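For readers unfamiliar with three-frame translation, here is a minimal Python sketch of the idea. It is not Blixem's own code: it assumes the standard genetic code, translates only the strand it is given, drops incomplete trailing codons, and shows any codon containing an unrecognised base as X. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame):
    """Translate one reading frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with ambiguous bases (e.g. N) are shown as X (an assumption).
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna):
    """Return the translations of frames 0, 1 and 2 of the given strand."""
    return [translate(dna, frame) for frame in range(3)]


print(three_frames("ATGGCCTAA"))
```

For the opposite strand, the same functions would be applied to the reverse complement of the sequence.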


 
Sequences (HSPs) can be sorted by score, %ID, position, or name; the sort order is selected from the Settings box. Sorting by score lists all proteins with the highest-scoring first. Sorting by identity lists the most identical first. Sorting by name lists all proteins alphabetically. Sorting by position lists the most N-terminal first. The graphics display can be customized using the menu choices for Background colour (1 of 32), Grid colour (1 of 32), Identical residues (1 of 32), and Conserved residues (1 of 32).
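The four sort orders can be pictured with a short Python sketch. This is only an illustration of the orderings described above, not Blixem's implementation. The Hsp record and its field names are hypothetical; the names, scores and start coordinates are taken from the raw data example earlier on this page, while the percent identity values are invented for the example.

```python
from operator import attrgetter
from typing import NamedTuple


class Hsp(NamedTuple):
    name: str
    score: float
    percent_id: float  # invented values, for illustration only
    start: int         # start coordinate of the match on the query


# Each sort mode maps to a key function and a direction.
SORT_MODES = {
    "score": {"key": attrgetter("score"), "reverse": True},          # highest first
    "identity": {"key": attrgetter("percent_id"), "reverse": True},  # most identical first
    "name": {"key": lambda h: h.name.lower(), "reverse": False},     # alphabetical
    "position": {"key": attrgetter("start"), "reverse": False},      # lowest coordinate first
}


def sort_hsps(hsps, mode):
    """Return a new list of HSPs ordered according to the chosen mode."""
    return sorted(hsps, **SORT_MODES[mode])


hsps = [
    Hsp("SW:KPPR_MESCR", 73.0, 48.0, 10026),
    Hsp("SW:KPPR_ARATH", 79.0, 45.0, 10026),
    Hsp("SW:KPPR_ARATH", 64.0, 52.0, 9867),
]
for mode in SORT_MODES:
    print(mode, [(h.name, h.score) for h in sort_hsps(hsps, mode)])
```

Because sorted() is stable, HSPs that tie on the chosen key keep their original relative order.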

Toggle selections include (shown below):
 

Blixem's main menu (right mouse button anywhere):

Dotter is a graphical dot plot tool. It is integrated in Blixem and can be used for DNA-DNA, Protein-Protein and DNA-Protein comparisons. In the case of DNA-Protein the nucleic acid sequence is translated in all three reading frames. The Alignment Tool window shows which frame hits the other Protein sequence.

A selected Dotter alignment will display a screen of all reading frames aligned with the query sequence (shown below):


 
And the actual Dotter dot plot is displayed (shown below):

Right button mouse menus for dotter are:

Quit:
Help:
Greyramp Tool:
Alignment Tool:
Print:
Crosshair:
Crosshair coordinates:
Grid:
Zoom in with parameter control:
Save Current plot:
Load features from file:
Change size of sliding window:
Draw BLAST HSPs (grey pixels):
Draw BLAST HSPs (red lines):
Draw BLAST HSPs (colour=f(score)):
Remove BLAST HSPs:
Pixelmap:
 

The results can be filtered with the graphical tool once the comparison is finished.
 

efetch - notes from README file

The retrieval tool is called efetch, and comes with a set of tools to create the index files. Currently six sequence databases are supported: GenBank, EMBL, SWISS-PROT, PIR, Prosite and ProDom. ProDom is a comprehensive collection of protein domain families clustered and aligned by the method of Sonnhammer and Kahn (1994). The tools for creating the indices are based on Rodger Staden's programs, and the index system conforms to the standards proposed by the EMBL data library. Any database in a Fasta-like format is supported, and new formats can easily be handled. All programs are written in C and the source code is freely available. Efetch can also be used as a stand-alone program to retrieve records on the command line. Retrieval is possible either by entry name or by accession number. It currently supports some five output formats, and additional formats can easily be added.

Efetch is a stand-alone program used by acedb to retrieve sequence data from external databases, such as SwissProt and PIR. This saves acedb from storing all this information internally while allowing acedb to retrieve any sequence entry on the fly, in order to display sequence annotation and sequence alignments from within acedb.

The efetch program provided with acedb only works on databases with indices that conform to the standard used on the EMBL CD-ROM. The only external sequence references stored in acedb are PIR and SwissProt. The indices for SwissProt can be taken directly from the EMBL CD-ROM, whereas the indices for PIR can be created with the Staden package.

SwissProt and EMBL. Information about getting the EMBL CD-ROM is available from Peter Stoehr, e-mail stoehr@embl-heidelberg.de. Copy the database file swissprot/seq.dat and some indices from indices/swissprot/ to a separate directory. You need these files:
 

Other databases. For other databases, you have to generate entrynam.idx and division.lkp yourself. The Staden package contains programs for PIR, GenBank, EMBL and SwissProt. Mail Rodger Staden, rs@mrc-lmb.cam.ac.uk, if you need to obtain his package. The Staden package doesn't create the file division.lkp, which is therefore provided here.

Installing the efetch program

Set the environment variables SWDIR, PIRDIR and OTHERDIR to the directories where you keep SwissProt, PIR and Other. These directories MUST NOT BE THE SAME. The directory names must end with a slash (/).

Do:

setenv SWDIR SWDIRpath/
setenv PIRDIR PIRDIRpath/
setenv OTHERDIR OTHERDIRpath/

Then test efetch, e.g.:

% efetch SW:HBA_HUMAN

This should return the entire record of HBA_HUMAN.

An alternative method, which is more general and can be used for any database, is based on adding the prefix and directory to the environment variable EFETCH_PREFIX. For example, to link the prefix mydb to the directory /mydbdir, the syntax would be:

setenv EFETCH_PREFIX "mydb:/mydbdir/;"

By default, mydb is assumed to be in fasta format. If it is in flatfile format, this is set by using mydb(flat): instead of mydb: in the prefix definition. Any number of prefix:dir; entries can be added to EFETCH_PREFIX.
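The prefix:dir; syntax can be made concrete with a small Python sketch that splits an EFETCH_PREFIX value into its entries. This is not how efetch itself parses the variable (efetch is written in C and its parsing rules are not given here); the function name, the regular expression and the trailing slash on the example directories are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class PrefixEntry(NamedTuple):
    prefix: str     # database prefix, e.g. mydb
    fmt: str        # "fasta" unless given in parentheses, e.g. mydb(flat)
    directory: str  # directory holding the database files


# prefix, optional (format), a colon, then the directory.
_ENTRY = re.compile(r"^(?P<prefix>[^:()]+)(?:\((?P<fmt>[^)]+)\))?:(?P<dir>.+)$")


def parse_efetch_prefix(value):
    """Split a 'prefix:dir;prefix(flat):dir;' string into entries."""
    entries = []
    for chunk in value.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final ';'
        match = _ENTRY.match(chunk)
        if match is None:
            raise ValueError(f"malformed EFETCH_PREFIX entry: {chunk!r}")
        entries.append(
            PrefixEntry(match["prefix"], match["fmt"] or "fasta", match["dir"])
        )
    return entries


# Hypothetical value with one fasta and one flatfile database.
for entry in parse_efetch_prefix("mydb:/mydbdir/;otherdb(flat):/otherdbdir/;"):
    print(entry)
```

In practice the string would be read from the environment, for example with os.environ.get("EFETCH_PREFIX", "").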

Now add the setenv commands above to a file that is run when you log in, for instance .cshrc. If efetch is in the path and the environment variables are set, efetch should now work from within acedb and you can see the sequence alignments in Blixem.

Sonnhammer, E.L.L. and Durbin, R. (1994) A workbench for large-scale sequence homology analysis. Comput. Applic. Biosci., 10:301-307.

Sonnhammer, E.L.L. and Kahn, D. (1994) Modular structure of proteins as inferred from analysis of homology. Protein Science, 3:4