Lecture 30 Oct 2001 Per Kraulis
To make biological data available to scientists.
As much as possible of a particular type of information should be
available in one single place (book, site, database). Published data
may be difficult to find or access, and collecting it from the
literature is very time-consuming. Moreover, not all data is actually
published explicitly in an article (genome sequences, for example).
To make biological data available in computer-readable form.
Since analysis of biological data almost always involves computers,
having the data in computer-readable form (rather than printed on
paper) is a necessary first step.
One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s. Its data became the foundation for the PIR database.
The computer became the storage medium of choice as soon as it was accessible to ordinary scientists. Databases were distributed on tape, and later on various kinds of disks. Once universities and academic institutes were connected to the Internet or its precursors (national computer networks), it is easy to understand why the network became the distribution medium of choice. It is even easier to see why the World Wide Web (WWW, built on the Internet protocol HTTP) has been the standard method of communication and access for nearly all biological databases since the beginning of the 1990s.
As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR. A new field of science dealing with the issues, challenges and new possibilities created by these databases has emerged: bioinformatics. Other types of data that are or will soon be available in databases are metabolic pathways, gene expression data (microarrays), protein-protein interactions and other types of data relating to biological function and processes.
One very important issue is the frequency and type of errors among the entries of a database. Naturally, this depends strongly on the type of data, and on whether the database is curated (modified by a defined group of people) or not. For the sequence databases, the errors may be either in the sequence itself (a misprint, a mistake at data entry, a genuine experimental error, ...) or in the annotation (mistaken features, errors in references, ...). In the 3D structure database (PDB), structures have been deposited which were later discovered to contain severe errors. The error-handling policy differs considerably between databases. If one bases new experiments or analysis on the data in a particular database, then the implications of its particular error-handling policy need to be considered.
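As a small illustration of the simplest kind of sequence error mentioned above (a misprint or data-entry mistake), the following Python sketch flags characters in a protein sequence that are not among the 20 standard one-letter amino-acid codes. The function name and the example entry are hypothetical and not taken from any particular database. Such a check cannot detect genuine experimental errors or annotation errors, and it would also flag ambiguity codes such as X, B or Z, which some databases use legitimately.

```python
# Standard one-letter codes for the 20 common amino acids.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def find_suspect_residues(sequence):
    """Return (position, character) pairs for every character that is
    not a standard amino-acid code. Positions are 1-based."""
    suspects = []
    for position, char in enumerate(sequence.upper(), start=1):
        if char not in STANDARD_RESIDUES:
            suspects.append((position, char))
    return suspects


# Hypothetical entry in which the digit "1" was typed by mistake.
entry = "MKTAYIAKQRQ1SFVKSHFSRQ"
print(find_suspect_residues(entry))  # [(12, '1')]
```

A curated database might run checks of this kind at submission time, whereas an uncurated one may leave them to the user; that is one concrete way in which error-handling policies can differ.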