Lecture 13 Nov 2001, Per Kraulis
The PROSITE database has already been mentioned in this course. It was initiated and is maintained by Amos Bairoch and colleagues, now at the Swiss Institute of Bioinformatics. It is based on the protein sequences in SWISS-PROT. It aims at describing characteristic patterns for some domain families using regular expressions, and contains about 1400 patterns, rules and profiles/matrices. It is still being maintained, but it is fair to say that it has been superseded in practical terms by other search methods and databases, such as Pfam (mentioned before, and discussed later).
PROSITE makes a distinction between patterns and rules, which are both described by regular expressions:
In this section we will describe only the notation for patterns and rules in PROSITE, which is the same for both. Unfortunately, the PROSITE notation for sequence patterns differs from that of UNIX-type regular expressions. However, the concepts are the same, and it is not difficult to translate a PROSITE pattern into a UNIX-type regular expression.
As an example, let us use the PROSITE pattern CBM_1 (formerly CBD_FUNGAL) (accession code PS00562). The preceding link shows a nicer view of the entry. Below is the original text entry as it is given in the downloadable PROSITE data file.
ID   CBD_FUNGAL; PATTERN.
AC   PS00562;
DT   DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Cellulose-binding domain, fungal type.
PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=21(18); /POSITIVE=21(18); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=1; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=4;
CC   /SITE=1,disulfide; /SITE=7,disulfide; /SITE=9,disulfide;
CC   /SITE=16,disulfide;
DR   Q00023, CEL1_AGABI, T; Q12714, GUN1_TRILO, T; P07981, GUN1_TRIRE, T;
DR   P07982, GUN2_TRIRE, T; P43317, GUN5_TRIRE, T; P46236, GUNB_FUSOX, T;
DR   P46239, GUNF_FUSOX, T; P45699, GUNK_FUSOX, T; P15828, GUX1_HUMGR, T;
DR   Q06886, GUX1_PENJA, T; P13860, GUX1_PHACH, T; P00725, GUX1_TRIRE, T;
DR   P19355, GUX1_TRIVI, T; Q92400, GUX2_AGABI, T; P07987, GUX2_TRIRE, T;
DR   P49075, GUX3_AGABI, T; P46238, GUXC_FUSOX, T; P50272, PSBP_PORPU, T;
DR   O59843, GUX1_ASPAC, N;
DO   PDOC00486;
//
The central line is the PA line, which contains the pattern. Let us go through this pattern step by step.
PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C.
Let us go through the elements in the pattern to see what they mean:
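As a rough illustration of how the elements map onto a UNIX-type regular expression, the following Python sketch translates a PROSITE pattern element by element. The function name prosite_to_regex and the toy peptide are made up for this example and are not part of PROSITE. The sketch assumes only the common notation (single residues, x, [..], {..}, repeat counts, and the < and > anchors outside brackets) and rejects anything else.

```python
import re

# One pattern element: optional "<" anchor, a residue specification,
# an optional repeat count "(n)" or "(n,m)", and an optional ">" anchor.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate the content of a PA line into a Python regex string."""
    parts = []
    # The trailing period ends the pattern; "-" separates the elements.
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = m.groups()
        if core == "x":
            core = "."                       # x: any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {AB}: anything except A or B
        # A single letter and a [AB] class are already valid regex syntax.
        if repeat:
            core += "{" + repeat + "}"       # (n) or (n,m): repeat count
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)

regex = prosite_to_regex(
    "C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C."
)

# A made-up peptide built to satisfy each element in turn (not a real sequence).
toy = ("CGG" + "AAAA" + "G" + "AAA" + "C" + "AAAAA" + "C"
       + "AAA" + "N" + "A" + "F" + "AA" + "QC")
match = re.search(regex, toy)
```

Here regex comes out as CGG.{4,7}G.{3}C.{5}C.{3,5}[NHG].[FYWM].{2}QC, which can be used with any tool that accepts this regular expression syntax.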
The lines marked DR list the protein sequence entries in SWISS-PROT that match (character T) or do not match (character N) the regular expression. In this case, the protein GUX1_ASPAC does not match the PROSITE pattern, although it should; it is a false negative, consistent with /FALSE_NEG=1 in the NR lines.
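The DR lines are regular enough to be tallied with a few lines of code. The following Python sketch parses the DR lines of the entry above into (accession, entry name, flag) triples. The function name parse_dr is hypothetical, and the sketch assumes that every item has exactly three comma-separated fields, as in this entry.

```python
from collections import Counter

# DR lines copied from the CBD_FUNGAL entry shown above.
dr_lines = [
    "DR Q00023, CEL1_AGABI, T; Q12714, GUN1_TRILO, T; P07981, GUN1_TRIRE, T;",
    "DR P07982, GUN2_TRIRE, T; P43317, GUN5_TRIRE, T; P46236, GUNB_FUSOX, T;",
    "DR P46239, GUNF_FUSOX, T; P45699, GUNK_FUSOX, T; P15828, GUX1_HUMGR, T;",
    "DR Q06886, GUX1_PENJA, T; P13860, GUX1_PHACH, T; P00725, GUX1_TRIRE, T;",
    "DR P19355, GUX1_TRIVI, T; Q92400, GUX2_AGABI, T; P07987, GUX2_TRIRE, T;",
    "DR P49075, GUX3_AGABI, T; P46238, GUXC_FUSOX, T; P50272, PSBP_PORPU, T;",
    "DR O59843, GUX1_ASPAC, N;",
]

def parse_dr(lines):
    """Return a list of (accession, entry_name, flag) tuples from DR lines."""
    hits = []
    for line in lines:
        body = line[2:]                      # drop the two-letter line code
        for item in body.split(";"):
            item = item.strip()
            if not item:                     # text after the last ";" is empty
                continue
            accession, name, flag = (field.strip() for field in item.split(","))
            hits.append((accession, name, flag))
    return hits

hits = parse_dr(dr_lines)
counts = Counter(flag for _, _, flag in hits)
false_negatives = [name for _, name, flag in hits if flag == "N"]
```

For this entry the tally gives 18 entries flagged T and one flagged N (GUX1_ASPAC), which agrees with the sequence counts in parentheses in the NR lines and with /FALSE_NEG=1.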