Databases in bioinformatics

5. Protein sequence databases

The two protein sequence databases SWISS-PROT and PIR are different from the nucleotide databases in that they are both curated. This means that groups of designated curators (scientists) prepare the entries from literature and/or contacts with external experts.

SWISS-PROT, TrEMBL www.expasy.ch/sprot/

SWISS-PROT is a protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

It was started in 1986 by Amos Bairoch in the Department of Medical Biochemistry at the University of Geneva. This database is generally considered one of the best protein sequence databases in terms of the quality of the annotation. Release 39.12 (11 Jan 2001) contained 92,211 entries.

TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT. The procedure that is used to produce it was developed by Rolf Apweiler. Release 15.14 (5 Jan 2001) contained 378,152 entries. The annotation of an entry in TrEMBL has not (yet) reached the standards required for inclusion into SWISS-PROT proper.
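The core idea behind TrEMBL, translating a coding nucleotide sequence into a protein sequence, can be illustrated with a minimal sketch. This is not the procedure actually used to produce TrEMBL, which is not described in these notes; the function name translate and the toy input sequences are made up for illustration, and the sketch assumes the standard genetic code and a coding sequence that starts in frame.

```python
# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Hypothetical helper for illustration; unknown codons become 'X'.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequences, not taken from any database entry.
print(translate("ATGGCCTAA"))
print(translate("atgaaatgg"))
```

A real pipeline would also have to locate the coding region in the nucleotide entry and choose the genetic code appropriate for the organism; both steps are omitted here.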

SWISS-PROT and TrEMBL are developed by the SWISS-PROT groups at the Swiss Institute of Bioinformatics (SIB) and at EBI. The databases can be accessed and searched through the SRS system at ExPASy, or one can download the entire database as a single flat file. An example of what an entry looks like is given for the human raf oncogene protein, ID KRAF_HUMAN.
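For readers who download the flat file, the following is a minimal sketch of how its entries could be read. It relies on the general layout of the format: each line starts with a two-letter line code (ID, AC, DE, SQ, ...), sequence lines are indented, and an entry ends with a line containing "//". The function name parse_entries and the toy entry are hypothetical; the toy entry is not a real database record and omits most line types.

```python
def parse_entries(lines):
    """Yield one dict per entry: line code -> list of line contents.

    The sequence is collected separately under the key 'sequence'.
    Hypothetical helper for illustration, not an official parser.
    """
    entry, seq_parts = {}, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):          # end-of-entry marker
            if entry or seq_parts:
                entry["sequence"] = "".join(seq_parts)
                yield entry
            entry, seq_parts = {}, []
        elif line.startswith("  "):        # indented sequence data
            seq_parts.append(line.replace(" ", ""))
        elif len(line) >= 2:               # annotated line: code + content
            entry.setdefault(line[:2], []).append(line[5:])


# Toy entry, made up for illustration only.
toy_file = """\
ID   TEST_HUMAN     STANDARD;      PRT;    20 AA.
AC   X00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
SQ   SEQUENCE   20 AA;
     MKTAYIAKQR QISFVKSHFS
//
""".splitlines()

entries = list(parse_entries(toy_file))
print(entries[0]["ID"][0].split()[0], len(entries[0]["sequence"]))
```

Because the function is a generator, it can be applied to an open file handle and will process a large flat file one entry at a time instead of loading it whole.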

The SWISS-PROT database has some legal restrictions: the entries themselves are copyrighted, but freely accessible and usable by academic researchers. Commercial companies must pay a license fee to SIB to use SWISS-PROT.

PIR pir.georgetown.edu

The Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation (NBRF) in the US. It collaborates with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). Release 67.00 (31 Dec 2000) contained 198,801 entries.

PIR grew out of Margaret Dayhoff's work in the mid-1960s. It strives to be comprehensive, well-organized, accurate, and consistently annotated. However, it is generally believed that its entry annotation is less complete than that of SWISS-PROT. Although SWISS-PROT and PIR overlap extensively, there are still many sequences that can be found in only one of them.

One can search for entries or do sequence similarity searches at the PIR site. The database can also be downloaded as a set of files. An example of what an entry looks like is given for the human raf-1 oncogene protein, ID TVHUF6.

PIR also produces NRL-3D, a database of sequences extracted from the three-dimensional structures in the Protein Data Bank (PDB) (see also the following page in this lecture). The NRL-3D database makes the sequence information in PDB available for similarity searches and retrieval, and provides cross-reference information for use with the other PIR Protein Sequence Databases.
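To make the idea of extracting sequences from structure files concrete, here is a minimal sketch that reads the SEQRES records of a PDB-format file and converts each chain to a one-letter sequence. How NRL-3D itself is built is not described in these notes, so this is only one plausible approach; the function name seqres_to_sequences and the toy records are hypothetical, and non-standard residues are simply written as 'X'.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequences(pdb_lines):
    """Return {chain_id: one-letter sequence} from SEQRES records.

    Hypothetical helper for illustration. Uses the fixed-column layout of
    SEQRES records: chain ID in column 12, residue names from column 20.
    """
    chains = {}
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]
        residues = line[19:70].split()
        chains.setdefault(chain_id, []).extend(
            THREE_TO_ONE.get(name, "X") for name in residues
        )
    return {chain: "".join(seq) for chain, seq in chains.items()}


# Toy records, made up for illustration; not a real PDB entry.
toy_pdb = [
    "HEADER    TOY RECORDS, NOT A REAL PDB ENTRY",
    "SEQRES   1 A    5  MET LYS THR ALA TYR",
    "SEQRES   1 B    3  GLY SER HIS",
]

sequences = seqres_to_sequences(toy_pdb)
print(sequences)
```

The resulting per-chain sequences are in the same one-letter form as the entries of the sequence databases, which is what makes them usable in similarity searches.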