Lecture 19 Jan 2001 Per Kraulis
Very early in the days of protein sequence analysis, it was observed that some protein sequences contained long segments that were very similar to other proteins, while the rest of the sequence in that protein had no detectable similarity. Today, we take more or less for granted that proteins are composed of domains, segments of sequence which have been joined together by genetic events during evolution so that the new protein has a function that is based on the activities of the domains it contains.
Often the domains detectable by sequence analysis correspond to structural domains in the 3D structure as well. There are now many well-documented cases where it has been shown that domains can exist perfectly well in isolation when excised from the original protein. Surprisingly often, a domain can be expressed and folded all on its own.
Several databases now keep track of which domains have been discovered and which proteins contain them, and store the multiple sequence alignments of the relevant segments of the protein sequences. We have already discussed one such database, Pfam. Several of the primary sequence databases now also contain information about the domains in their sequence entries.
The idea behind Pfam is twofold:
The multiple alignment used to define a domain (protein family) in Pfam is called the seed alignment. It is created by a curator, or taken from the literature. It is used to generate a profile HMM, which is then used to identify other sequences in the databases (SWISS-PROT and TrEMBL) that contain the domain. The search results are inspected to decide which score cutoff should be used for that particular Pfam entry. The search hits that pass the cutoff are then aligned automatically into a so-called full alignment.
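The seed-to-full workflow described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: an ungapped log-odds profile stands in for the profile HMM (a real profile HMM also models insertions and deletions), the background residue frequencies are taken as uniform, and the sequences, the cutoff value and all function names are invented for the example. It is not how Pfam itself is implemented.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seed, pseudocount=1.0):
    """Per-column log-odds scores from an ungapped seed alignment.

    Assumption: uniform background frequencies; a simplified stand-in
    for a profile HMM, with no insert or delete states.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*seed):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def score(profile, segment):
    """Sum of the column scores for a segment as long as the profile."""
    return sum(col[aa] for col, aa in zip(profile, segment))

def best_hit(profile, sequence):
    """Best-scoring ungapped window in the sequence, as (score, start)."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

def full_alignment(profile, database, cutoff):
    """Collect the matching segment of every sequence scoring >= cutoff."""
    hits = {}
    for name, seq in database.items():
        if len(seq) < len(profile):
            continue
        hit_score, start = best_hit(profile, seq)
        if hit_score >= cutoff:
            hits[name] = seq[start:start + len(profile)]
    return hits

# Toy data, invented for illustration only.
seed = ["GKGSFG", "GEGAFG", "GQGSYG"]
database = {
    "protA": "MKLVGKGAFGEVR",   # contains a segment resembling the seed
    "protB": "MSTNPLLDDERWA",   # unrelated
}

profile = build_profile(seed)
hits = full_alignment(profile, database, cutoff=5.0)
for name, segment in hits.items():
    print(name, segment)
```

The cutoff here is picked by hand after looking at the scores, which mirrors the manual inspection step described above: a curator chooses, per family, the threshold that separates true members from noise.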
There are a number of other useful databases of multiple sequence alignments, such as:
These databases allow a new sequence to be analysed in terms of the domains that can be detected in it. This is often more useful, and sometimes also more sensitive (although this is somewhat controversial), than doing sequence-to-sequence comparisons. For instance, if a new protein has a kinase domain, then it is more helpful to use a domain database (with some appropriate search software, such as HMMER for Pfam) to identify the domain directly in the sequence. The alternative, using BLAST or FASTA to find similar sequences, would return thousands of sequences, and it would take some work to recognize that they were found because the query sequence contains a very common kinase domain.
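To show what "identifying the domain directly" looks like in practice, here is a minimal sketch that reads a per-domain hit table and reports which domain was found and where. It assumes the whitespace-separated table written by `hmmscan --domtblout` in HMMER 3, which postdates this lecture; the column positions follow the HMMER user guide and should be checked against the installed version. The sample line is made up for illustration and is not real HMMER output, and `DomainHit` and `parse_domtblout` are hypothetical names.

```python
from typing import NamedTuple

class DomainHit(NamedTuple):
    domain: str      # name of the matching profile HMM (the Pfam family)
    query: str       # name of the query sequence
    i_evalue: float  # independent E-value of this domain hit
    score: float     # bit score of this domain hit
    ali_from: int    # first aligned residue in the query
    ali_to: int      # last aligned residue in the query

def parse_domtblout(text):
    """Parse per-domain table text (assumed HMMER 3 --domtblout layout)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 22 fixed columns, then a free-text description that may hold spaces
        f = line.split(maxsplit=22)
        hits.append(DomainHit(
            domain=f[0],
            query=f[3],
            i_evalue=float(f[12]),
            score=float(f[13]),
            ali_from=int(f[17]),
            ali_to=int(f[18]),
        ))
    return hits

# Made-up example line for illustration, not real HMMER output.
SAMPLE = """\
# target name accession tlen query name accession qlen ...
Pkinase - 260 query1 - 500 1e-40 140.0 0.1 1 1 1e-43 2e-40 139.5 0.1 5 255 120 380 118 382 0.95 Protein kinase domain
"""

for hit in parse_domtblout(SAMPLE):
    print(f"{hit.query}: {hit.domain} at residues {hit.ali_from}-{hit.ali_to}")
```

A single line per domain, with coordinates in the query, is the kind of answer that would otherwise have to be inferred from a long list of pairwise hits.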