Stockholm Bioinformatics Center, SBC
Lecture notes: Structural biochemistry and bioinformatics 2001

Lecture 12 Nov 2001, Per Kraulis

Multiple sequence alignments

5. A standard multiple alignment program: ClustalW

From what we have learned in previous sections, doing a simultaneous N-wise alignment is not a realistic option if we have, say, 50 sequences to align. What to do? The obvious alternative is to use a so-called progressive alignment method: The alignment is built up in stages where a new sequence is added to an existing alignment, using some rules to determine in which order the sequences should be added, and how.

ClustalW (Thompson, Higgins & Gibson, 1994) is one of the standard programs implementing one variant of the progressive method in wide use today for multiple sequence alignment. The W denotes a specific version that has been developed from the original Clustal program.

The basic steps of the algorithm implemented in ClustalW are:

  1. Compute the pairwise alignments for all against all sequences. The similarities are stored in a matrix (sequences versus sequences).
  2. Convert the sequence similarity matrix values to distance measures, reflecting evolutionary distance between each pair of sequences.
  3. Construct a tree (the so-called guide tree) for the order in which pairs of sequences are to be aligned and combined with previous alignments. This is done using a neighbour-joining clustering algorithm. In the case of ClustalW, a method by Saitou & Nei is used.
  4. Progressively align the sequences/alignments together into each branch point of the guide tree, starting with the least distant pairs of sequences. At each branch point, one must do either a sequence-sequence, sequence-profile, or profile-profile alignment.

Note that the original idea of using a simultaneous N-wise dynamic programming alignment method had the algorithmic complexity O(c2n), whereas this method has something like O(n2) (which comes from the all-against-all pairwise comparison step).

A number of rules (tricks, some would say) are used to increase the success rate of the procedure:

There are some specific cases where ClustalW is know to have problems.

ClustalW is an example of an algorithm that has given up on trying to be perfect (because it takes too much time), and instead uses an approximation strategy, combined with more-or-less intelligent tricks that guide the computation towards a successful (but not necessarily optimal) result. This is called a heuristic algorithm.

One important point to keep in mind is that since ClustalW is a heuristic algorithm, it cannot produce a solution that is guaranteed to be optimal. But in practice, the results it produces are good enough, and one should perhaps worry more about the quality of the input data. For example, if one has sequences that are just barely significantly similar, one should worry more about if all of them really belong in the alignment at all, rather than if the alignment is perfect or not (which it in such a case almost certainly isn't).

ClustalW has a number of parameters that the user can change. This will affect the exact manner in which the computation proceeds, and it may be useful to compare runs with different parameters; the near-perfect parameter set varies with the specific case.

One shouldn't really be content with just ClustalW. There are other multiple-alignment software out there, and it is by no means clear that ClustalW is the best. It is usually very difficult to make good, unbiased comparisons between multiple-alignment software. And of course, one still has the basic problem of how to choose the objective criterion to determine what the best multiple alignment really is.


Copyright © 2001 Per Kraulis $Date: 2001/11/09 14:35:31 $