Stockholm Bioinformatics Center, SBC
Lecture notes: Structural biochemistry and bioinformatics 2001
Lecture 12 Nov 2001,
Per Kraulis
5. A standard multiple alignment program: ClustalW
From what we have learned in previous sections, doing a simultaneous
N-wise alignment is not a realistic option if we have, say, 50
sequences to align. What to do? The obvious alternative is to use a
so-called progressive alignment method: The alignment
is built up in stages where a new sequence is added to an existing
alignment, using some rules to determine in which order the sequences
should be added, and how.
ClustalW
(Thompson,
Higgins & Gibson, 1994) is one of the standard programs
implementing one variant of the progressive method in wide use today
for multiple sequence alignment. The W denotes a specific version
that has been developed from the original Clustal program.
The basic steps of the algorithm implemented in ClustalW are:
-
Compute the pairwise alignments for all against all
sequences. The similarities are stored in a matrix (sequences versus
sequences).
-
Convert the sequence similarity matrix values to
distance measures, reflecting evolutionary distance
between each pair of sequences.
-
Construct a tree (the so-called guide tree) for the
order in which pairs of sequences are to be aligned and combined with
previous alignments. This is done using a neighbour-joining clustering
algorithm. In the case of ClustalW, a method by Saitou & Nei is
used.
-
Progressively align the sequences/alignments together
into each branch point of the guide tree, starting with the least
distant pairs of sequences. At each branch point, one must do either a
sequence-sequence, sequence-profile, or profile-profile alignment.
Note that the original idea of using a simultaneous N-wise dynamic
programming alignment method had the algorithmic complexity
O(c2n), whereas this method has something
like O(n2) (which comes from the
all-against-all pairwise comparison step).
A number of rules (tricks, some would say) are used to increase the
success rate of the procedure:
-
Each sequence is weighted according to how different it is from the
other sequences. This accounts for the case where one specific
subfamily is overrepresented in the data set.
-
The substitution matrix used for each alignment step depends on the
similarity of the sequences (a somewhat circular argument, but what
the hell...).
-
Position-specific gap-open penalties are modified according to residue
type using empirical observations in a set of alignments based on 3D
structures. In general, hydrophobic residues have higher gap penalties
than hydrophilic, since they are more likely to be in the hydrophobic
core, where gaps should not occur.
-
Gap-open penalties are decreased if the position is spanned by a
consecutive stretch of five or more hydrophilic residues.
-
Both gap-open and gap-extend penalties are increased if there are no
gaps in a column, but gaps occur nearby in the alignment.
-
The guide tree can in some circumstances be overriden, for instance by
deferring joining two branches if they are too dissimilar, until more
information has been added by processing other branches.
There are some specific cases where ClustalW is know to have
problems.
-
If the sequences are similar only in some smaller
regions, while the larger parts are not recognisably similar,
then ClustalW may have problems aligning all sequences properly. This
is because ClustalW tries to find global alignments, not local. In
such a case, it may be wise to cut out the similar parts with some
other tool (text editor).
-
If one sequence contains a large insertion compared
to the rest, then there may be problems, for much the same reason as
the previous point.
-
If one sequence contains a repetitive element (such
as a domain), while another sequence only contains one copy of the
element, then ClustalW may split the single domain into two
half-domains to try to align the first half with the first the domain
in the first sequence, and the other half to the second domain in the
first sequence. There are many proteins that contain multiple, very
similar copies of a domain, so one should watch out for this.
ClustalW is an example of an algorithm that has given up on trying to
be perfect (because it takes too much time), and instead uses an
approximation strategy, combined with more-or-less intelligent tricks
that guide the computation towards a successful (but not necessarily
optimal) result. This is called a heuristic
algorithm.
One important point to keep in mind is that since ClustalW is a
heuristic algorithm, it cannot produce a solution that is guaranteed
to be optimal. But in practice, the results it produces are good
enough, and one should perhaps worry more about the quality of the
input data. For example, if one has sequences that are just barely
significantly similar, one should worry more about if all of them
really belong in the alignment at all, rather than if the alignment is
perfect or not (which it in such a case almost certainly isn't).
ClustalW has a number of parameters that the user can change. This
will affect the exact manner in which the computation proceeds, and it
may be useful to compare runs with different parameters; the
near-perfect parameter set varies with the specific case.
One shouldn't really be content with just ClustalW. There are other
multiple-alignment software out there, and it is by no means clear
that ClustalW is the best. It is usually very difficult to make good,
unbiased comparisons between multiple-alignment software. And of
course, one still has the basic problem of how to choose the
objective criterion to determine what the best multiple alignment
really is.
Copyright © 2001
Per Kraulis
$Date: 2001/11/09 14:35:31 $