Lecture 19 Jan 2001 Per Kraulis
How do we generate a multiple alignment? There is an obvious solution: Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. But is this as straightforward as it looks? Consider a case with three short fake sequence that are pair-wise homologous in the following way:
dront AGAC t-rex --AC dront AGAC unicorn AG-- t-rex AC unicorn AG
It is not self-evident how these sequences are to be aligned together. Here are some possibilities:
dront AGAC t-rex --AC unicorn AG-- dront AGAC unicorn AG-- t-rex --AC t-rex AC-- unicorn AG-- dront AGAC t-rex --AC unicorn --AG dront AGAC
It depends not only on the various parameters (insertion/deletion penalties, substitution coefficients,...) but also on the order in which sequences are added to the multiple alignment.
In pairwise alignments, one has a two-dimensional matrix with the sequences on each axis, and the elements in the matrix are initially the substitution coefficients, which are then operated on (as you have done previously) to locate the best "path" through the matrix. The number of operations required to do this is approximately proportional to the product of the lengths of the two sequences.
A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamical-programming algorithm in N dimensions. Algorithmically, this is not difficult to do.
In the case of three sequences to be aligned, one can visualize this reasonable easily: One would set up a three-dimensional matrix (a cube) instead of the two-dimensional matrix for the pairwise comparison. Then one basically performs the same procedure as for the two-sequence case. This time, the result is a path that goes diagonally through the cube from one corner to the opposite.
The problem here is that the time to compute this N-wise alignment becomes prohibitive as the number of sequences grows. The algorithmic complexity is something like O(c2n), where c is a constant, and n is the number of sequences. (See the next section for an explanation of this notation.) This is disastrous, as may be seen in a simple example: if a pairwise alignment of two sequences takes 1 second, then four sequences would take 104 seconds (2.8 hours), five sequences 106 seconds (11.6 days), six sequences 108 seconds (3.2 years), seven sequences 1010 seconds (317 years), and so on.