Hidden Markov models

2. PROSITE patterns and rules

The PROSITE database has already been mentioned in this course. It was initiated and is maintained by Amos Bairoch and colleagues, now at the Swiss Institute of Bioinformatics. It is based on the protein sequences in SWISS-PROT. It aims to describe characteristic patterns for some domain families using regular expressions, and contains about 1400 patterns, rules and profiles/matrices. It is still being maintained, but it is fair to say that in practical terms it has been superseded by other search methods and databases, such as Pfam (mentioned before, and discussed later).

PROSITE makes a distinction between patterns and rules, which are both described by regular expressions:

A pattern is intended to capture the characteristic fingerprint of a protein domain family.
A rule, on the other hand, is intended to highlight features in a protein sequence that do not necessarily have anything to do with a specific protein family. For example, potential glycosylation sites and phosphorylation sites can be found in many protein sequences, and have little to do with the family of a protein.

In this section we will just describe the notation used in PROSITE, which is the same for patterns and rules. Unfortunately, the PROSITE notation for sequence patterns differs from that of UNIX-type regular expressions. However, the concepts are the same, and it is not difficult to translate a PROSITE pattern into a UNIX-type regular expression.

As an example, let us use the PROSITE pattern CBD_FUNGAL (accession code PS00562). The PROSITE web site shows a nicer view of the entry. Below is the original text entry as it is given in the downloadable PROSITE data file.

ID   CBD_FUNGAL; PATTERN.
AC   PS00562;
DT   DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Cellulose-binding domain, fungal type.
PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=21(18); /POSITIVE=21(18); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=1; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=4;
CC   /SITE=1,disulfide; /SITE=7,disulfide; /SITE=9,disulfide;
CC   /SITE=16,disulfide;
DR   Q00023, CEL1_AGABI, T; Q12714, GUN1_TRILO, T; P07981, GUN1_TRIRE, T; 
DR   P07982, GUN2_TRIRE, T; P43317, GUN5_TRIRE, T; P46236, GUNB_FUSOX, T; 
DR   P46239, GUNF_FUSOX, T; P45699, GUNK_FUSOX, T; P15828, GUX1_HUMGR, T; 
DR   Q06886, GUX1_PENJA, T; P13860, GUX1_PHACH, T; P00725, GUX1_TRIRE, T; 
DR   P19355, GUX1_TRIVI, T; Q92400, GUX2_AGABI, T; P07987, GUX2_TRIRE, T; 
DR   P49075, GUX3_AGABI, T; P46238, GUXC_FUSOX, T; P50272, PSBP_PORPU, T; 
DR   O59843, GUX1_ASPAC, N; 
DO   PDOC00486;
//

The central line is the PA line, which contains the pattern:

PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C.

Let us go through the elements in the pattern to see what they mean:

Each non-x letter defines one particular type of amino-acid residue in that position in the pattern. Here, we must have a tripeptide Cys-Gly-Gly in the beginning of the matching segment of a protein chain. The dash characters '-' add no information to the pattern, and are added to make the pattern slightly easier to read.
The notation x(4,7) means that at least 4 and at most 7 residues of any type may occur at this position. This corresponds to the notation .{4,7} in a UNIX-type regular expression.
The notation [NHG] means the same thing as in a UNIX-type regular expression: in this position any of the residues within the brackets may be chosen. One and only one such residue must be at this position.
The notation x(2) means that exactly two residues of any type must occur at this position. This corresponds to the notation .. or .{2,2} in a UNIX-type regular expression.
The notation {GP} (not shown in this example) means that all residues except Gly and Pro are allowed in this position.
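
The translation rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not an official PROSITE tool: the function name prosite_to_regex is made up for this example, it handles only the notation elements described in this section, and the test sequence is synthetic, constructed by hand so that it matches.

```python
import re

# One pattern element: a residue letter, x, [..] or {..},
# optionally followed by a repeat count (n) or a range (n,m).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate the pattern from a PROSITE PA line into a regular expression.

    Hypothetical helper; covers only the elements described in the text.
    """
    pattern = pattern.strip().rstrip(".")  # the PA line ends with a period
    parts = []
    for element in pattern.split("-"):     # dashes only separate positions
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            regex = "."                          # any residue
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"   # any residue except these
        else:
            regex = residue                      # literal letter or [..] set
        if low is not None and high is not None:
            regex += "{" + low + "," + high + "}"
        elif low is not None:
            regex += "{" + low + "}"
        parts.append(regex)
    return "".join(parts)

CBD_FUNGAL = "C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C."
regex = prosite_to_regex(CBD_FUNGAL)
print(regex)  # CGG.{4,7}G.{3}C.{5}C.{3,5}[NHG].[FYWM].{2}QC

# Synthetic sequence (not a real protein), built to satisfy the pattern.
synthetic = "MK" + "CGG" + "AAAA" + "G" + "AAA" + "C" + "AAAAA" + "C" + "AAA" + "N" + "A" + "F" + "AA" + "QC" + "LL"
print(re.search(regex, synthetic) is not None)  # True
```

Note that re.search reports the first match only; a protein with several copies of the domain would need re.finditer.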

The lines marked DR list the protein sequence entries in SWISS-PROT that match (character T) or do not match (character N) the regular expression. In this case, the protein GUX1_ASPAC does not match the PROSITE pattern, although it should; it is a false negative (compare /FALSE_NEG=1 in the NR lines).
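
As a small illustration of how the DR lines can be read, the following sketch splits them into (accession, entry name, flag) triples and separates the T entries from the N entries. The function name parse_dr_lines is hypothetical, and the sketch assumes the simple "accession, name, flag;" layout seen in the entry above; it uses only two of the DR lines from that entry as input.

```python
def parse_dr_lines(lines):
    """Collect (accession, entry_name, flag) triples from PROSITE DR lines.

    Hypothetical helper; assumes items of the form 'accession, name, flag;'.
    """
    entries = []
    for line in lines:
        if not line.startswith("DR"):
            continue                      # skip all other line types
        for item in line[2:].split(";"):
            fields = [field.strip() for field in item.split(",")]
            if len(fields) == 3:          # ignore the empty tail after the last ';'
                entries.append(tuple(fields))
    return entries

# Two DR lines copied from the CBD_FUNGAL entry shown earlier.
sample = [
    "DR   P49075, GUX3_AGABI, T; P46238, GUXC_FUSOX, T; P50272, PSBP_PORPU, T; ",
    "DR   O59843, GUX1_ASPAC, N; ",
]
entries = parse_dr_lines(sample)
matching = [name for _, name, flag in entries if flag == "T"]
missed = [name for _, name, flag in entries if flag == "N"]
print(matching)  # ['GUX3_AGABI', 'GUXC_FUSOX', 'PSBP_PORPU']
print(missed)    # ['GUX1_ASPAC']
```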