Databases in bioinformatics

3. Accession codes vs identifiers

Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways. Basically, it has two names:

Identifier
Accession code (or number)

The question how to deal with changed, updated and deleted entries in databases is a very tricky problem, and the policies for how accession codes and identifiers are changed or kept constant are not completely consistent between databases or even over time for one single database.

The exact definition of what the identifier and accession code are supposed to denote varies between the different databases, but the basic idea is the following.

Identifier

An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.

SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second part denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.

An identifier can usually change. For example, the database curators may decide that the identifier for an entry no longer is appropriate. However, this does not happen very often. In fact, it happens so rarely that it's not really a big problem.

Accession code (number)

An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.

The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry, or its ancestors. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.

In the case where two entries are merged into one single, then the new entry will have both accession codes, where one will be the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also have the old accession code as secondary codes.