Seizing the Opportunity to Forge Practical Standards for Componentry in
Bioinformatics

Nathan Goodman and Matthew Cowley

The Jackson Laboratory, Bar Harbor, ME


The genome informatics community is coming to realize something that has
long been common knowledge in the ACEDB world, namely that software sharing
is both a good idea and a practical one.  But good intentions are not
enough.  For software sharing to actually occur, one also needs high-quality
software that is worth sharing, and technology and architecture that make it
easy to adapt the software for new applications.  Arguably, the most
important factor in ACEDB's success, in addition to the usefulness of the
base software, has been the ease of extending its models for new databases,
while the factor that has most limited its success has been the difficulty
of extending the base software.

We and many others have been advocating component-based architectures as a
basis for constructing sharable software.  A component is an independent
program that is designed to be used as a modular building block.  A
component-based system is one constructed in a modular fashion from components.

The component-based approach depends on the adoption of standards in four
main areas: data models, graphical user interfaces, inter-program
communication, and database infrastructure.  Technological progress is being
made in all these areas: CORBA provides good technology for data modeling
and inter-program communication; Java promises to solve the graphical user
interface problem; and object-relational databases will likely be an
effective base for our database infrastructure.  Moreover, the technologies
are conceptually compatible, and industry is showing strong interest in
making them truly compatible.

This leaves us with two not-so-little challenges.  One is to devise an
architecture and standards employing this technology that are extensible and
can support the software we need.  The other is to use these elements to
build good software that people want to use.

A key architectural issue is to maintain client-side caches of persistent
data.  This is needed to provide multiple, coordinated views of the same
data, for example to display both a map and a sequence view of a genomic
region at the same time, as well as for many other purposes.  ACE
client/server offers an interesting solution to this problem, which has also
been investigated at length in computer science and industry. 
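
As a rough illustration of the caching issue (not a description of ACE
client/server or of any existing system), the sketch below keeps one cached
copy of a genomic region on the client and notifies every registered view
when that copy changes.  All names here (ClientCache, MapView, SequenceView,
fake_server_fetch) and the sample record are hypothetical and made up for
the example.

```python
from typing import Callable, Dict, List


def fake_server_fetch(key: str) -> dict:
    """Stand-in for a real server call; returns a made-up region record."""
    return {"name": key, "start": 100, "end": 108, "sequence": "ACGTACGT"}


class ClientCache:
    """Client-side cache of persistent objects, shared by several views."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch                    # hypothetical server access
        self._objects: Dict[str, dict] = {}    # key -> cached object
        self._listeners: List[Callable[[str, dict], None]] = []
        self.fetch_count = 0                   # number of server round trips

    def subscribe(self, listener: Callable[[str, dict], None]) -> None:
        self._listeners.append(listener)

    def get(self, key: str) -> dict:
        # Go to the server only on a cache miss.
        if key not in self._objects:
            self._objects[key] = self._fetch(key)
            self.fetch_count += 1
        return self._objects[key]

    def update(self, key: str, **changes) -> None:
        # Change the single cached copy, then tell every view about it.
        obj = self.get(key)
        obj.update(changes)
        for listener in self._listeners:
            listener(key, obj)


class MapView:
    """Shows where the region lies; redraws when the cache changes."""

    def __init__(self, cache: ClientCache, key: str):
        self.key = key
        self.text = ""
        cache.subscribe(self.refresh)
        self.refresh(key, cache.get(key))

    def refresh(self, key: str, obj: dict) -> None:
        if key == self.key:
            self.text = f"{obj['name']} at {obj['start']}..{obj['end']}"


class SequenceView:
    """Shows the region's sequence; redraws when the cache changes."""

    def __init__(self, cache: ClientCache, key: str):
        self.key = key
        self.text = ""
        cache.subscribe(self.refresh)
        self.refresh(key, cache.get(key))

    def refresh(self, key: str, obj: dict) -> None:
        if key == self.key:
            self.text = obj["sequence"]


if __name__ == "__main__":
    cache = ClientCache(fake_server_fetch)
    map_view = MapView(cache, "region1")
    seq_view = SequenceView(cache, "region1")
    cache.update("region1", end=112, sequence="ACGTACGTACGT")
    print(map_view.text)
    print(seq_view.text)
    print("server fetches:", cache.fetch_count)
```

Both views read the same cached object, so a single update keeps them
consistent and the server is contacted only once.  A real design would also
have to handle writing changes back to the server and invalidating stale
entries, which this sketch leaves out.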

A second key architectural issue is to design a library of generic classes
that can serve as foundation classes for data models; these are classes that
provide basic mechanisms that can be incorporated into data model classes.
Each such class will provide a standard repertoire of core functions, such
as data management, data exchange, and display. For a foundation class to be
valuable, the mechanism it implements must be both useful and difficult to
implement, so that a data model developer can save a lot of work by using
the class.  Examples include the Materials and Steps of LabBase, which
encapsulate much of the bookkeeping complexity encountered in laboratory
databases; we will also need such classes for maps and sequences,
alignments, pathways, tag/value structures, and more.  The goal is to design
a relatively small set of such classes, and to build the much larger set of
data model classes using these.  In the past, we have used the term "nugget"
for software of this type, to connote that the class embodies a hard bit of
software.
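
To make the idea of a foundation class concrete, here is a minimal sketch of
one for tag/value structures, together with a data model class built on it.
It is only an illustration under our own assumptions: the names TagValueBase
and Gene, the record format, and the sample values are hypothetical and are
not taken from LabBase, ACEDB, or any other system.

```python
class TagValueBase:
    """Hypothetical foundation class: generic tag/value storage.

    Supplies the three core functions named in the text: data management,
    data exchange, and display.
    """

    def __init__(self, name):
        self.name = name
        self._tags = {}   # tag -> list of values, in insertion order

    # --- data management ---
    def add(self, tag, value):
        self._tags.setdefault(tag, []).append(value)

    def values(self, tag):
        return list(self._tags.get(tag, []))

    # --- data exchange ---
    def to_record(self):
        """Export to a plain dictionary that another program could read."""
        return {
            "class": type(self).__name__,
            "name": self.name,
            "tags": {tag: list(vals) for tag, vals in self._tags.items()},
        }

    @classmethod
    def from_record(cls, record):
        """Rebuild an object from a dictionary produced by to_record."""
        obj = cls(record["name"])
        for tag, vals in record["tags"].items():
            for value in vals:
                obj.add(tag, value)
        return obj

    # --- display ---
    def display(self):
        lines = [f"{type(self).__name__} {self.name}"]
        for tag, vals in self._tags.items():
            lines.append(f"  {tag}: {', '.join(str(v) for v in vals)}")
        return "\n".join(lines)


class Gene(TagValueBase):
    """Hypothetical data model class; adds only gene-specific behavior."""

    def add_map_position(self, chromosome, position):
        self.add("Map", f"{chromosome}:{position}")


if __name__ == "__main__":
    gene = Gene("geneA")                 # made-up example data
    gene.add("Symbol", "geneA")
    gene.add_map_position("1", 10.5)
    print(gene.display())
    copy = Gene.from_record(gene.to_record())
    print(copy.display() == gene.display())
```

The data model class is only a few lines long because storage, exchange, and
display all come from the foundation class.  A tag/value store this simple
would not by itself meet the "difficult to implement" test stated above; the
point of the sketch is the division of labor, not the mechanism.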

We have an opportunity today to forge practical standards for
bioinformatics.  If we seize the opportunity and are successful, the impact
will be enormous.