Lecture 12 Nov 2001, Per Kraulis
The problem of how to compare two sequences (nucleotide or protein) through pairwise alignment has been discussed earlier in this course. By generating a pairwise alignment with one of the available methods (for example Needleman & Wunsch for global alignment, or Smith & Waterman for local alignment), one can compute how similar the sequences are (as percent identity or similarity) and see which parts of them are similar. These methods rest on specific assumptions about the evolutionary processes (point mutations, insertions and deletions) by which the sequences diverged from a common ancestral sequence, and they rely on appropriate parameters, such as a substitution matrix and gap penalties, for quantifying these events.
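As a small illustration of the identity measure mentioned above, here is a minimal Python sketch that computes percent identity from a pairwise alignment that has already been produced. The function name percent_identity and the example strings are made up for illustration. The choice of denominator (only columns where both sequences have a residue) is an assumption; other conventions, such as dividing by the full alignment length, are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap_chars: str = "-.") -> float:
    """Percent identity over the columns where both sequences have a residue.

    Assumption: columns containing a gap in either sequence are skipped,
    so they count neither as matches nor as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0   # columns with a residue in both sequences
    identical = 0  # of those, columns where the residues agree
    for a, b in zip(aligned_a, aligned_b):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a.upper() == b.upper():
            identical += 1

    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned pair: five gap-free columns, four of them identical.
identity = percent_identity("ACGT-A", "ACCTTA")
print(identity)
```

A similarity score would use the same loop, but count a column as a match whenever the two residues score positively in a substitution matrix rather than only when they are identical.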
When using one of the popular sequence searching programs (FASTA, BLAST) to find similar sequences in a database, one very often obtains many sequences that are significantly similar to the query sequence. Comparing every sequence to every other in separate figures may be possible when one has just a few sequences, but it quickly becomes impractical as the number of sequences increases: n sequences give n(n-1)/2 pairs, so 20 sequences already require 190 pairwise comparisons.
What we need is a multiple sequence alignment, where all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other in a common coordinate system. In this coordinate system, each row is the sequence for one protein, and each column is the 'same' position in each sequence. Each column corresponds to a specific residue in the 'prototypical' protein.
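The row and column picture can be made concrete with a short Python sketch. It stores a multiple alignment as equal-length gapped strings, one per sequence, and reads off the columns of the common coordinate system. The sequence names and residues are invented for illustration, and the helper name columns is hypothetical, not taken from any alignment package.

```python
# A toy multiple alignment: one row per sequence, all rows the same length.
alignment = {
    "seq1": "MKT-AYIAK",
    "seq2": "MKTLAY-AK",
    "seq3": "MRT-AYIGK",
}


def columns(alignment: dict) -> list:
    """Return the alignment column by column, one string per column."""
    rows = list(alignment.values())
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of a multiple alignment must have equal length")
    # zip(*rows) transposes the rows, giving one tuple of characters per column.
    return ["".join(col) for col in zip(*rows)]


cols = columns(alignment)
for index, col in enumerate(cols):
    print(index, col)
```

Column 1 here reads KKR (a substitution in seq3), while column 3 reads -L-, meaning that only seq2 has a residue at that position.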
As with pairwise alignments, there will be gaps in some sequences, most often shown by the dash '-' or dot '.' character. Note that to construct a multiple alignment, one may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pairwise alignment, because an insertion in any one sequence forces a gap in every other sequence at that column. As a result, a sequence typically carries more gaps in a multiple alignment than it does in any single pairwise alignment.
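The effect on gap counts can be seen by projecting two rows out of a multiple alignment and dropping every column in which both rows have a gap. The following is a minimal Python sketch with invented rows; project_pair is a hypothetical helper name. Note that the result is the pairwise alignment implied by the multiple alignment, which is not necessarily the optimal pairwise alignment of the two sequences.

```python
def project_pair(row_a: str, row_b: str, gap_chars: str = "-.") -> tuple:
    """Extract the pairwise alignment implied by two rows of a multiple alignment.

    Columns where both rows contain a gap carry no information about this
    pair, so they are removed.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    kept = [
        (a, b)
        for a, b in zip(row_a, row_b)
        if not (a in gap_chars and b in gap_chars)
    ]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# Two rows of a toy multiple alignment. Each carries a gap at column 3,
# which exists only because a third sequence has an extra residue there.
pair = project_pair("MKT-AYIAK", "MRT-AYIGK")
print(pair)
```

In this toy case the projected pair contains no gaps at all, although each of the two rows had one gap in the multiple alignment.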