Multiple sequence alignments

3. How to generate it?

How do we generate a multiple alignment? There is an obvious solution: Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. But is this as straightforward as it looks? Consider a case with three short fake sequence that are pair-wise homologous in the following way:

  dront   ...AGAC...
  t-rex   ...--AC...

  dront   ...AGAC...
  unicorn ...AG--...

  t-rex   ...AC...
  unicorn ...AG...

It is not self-evident how these sequences are to be aligned together. Here are some possibilities:

  dront   ...AGAC...
  t-rex   ...--AC...
  unicorn ...AG--...

  dront   ...AGAC...
  t-rex   ...AC--...
  unicorn ...AG--...

  dront   ...AGAC...
  t-rex   ...--AC...
  unicorn ...--AG...

A multiple alignment depends in a non-trivial way on the various parameters (insertion/deletion penalties, substitution coefficients,...), and, importantly, also on the order in which sequences are added to the growing multiple alignment.

In pairwise alignments, one has a two-dimensional matrix with the sequences on each axis, and the elements in the matrix are initially the substitution coefficients, which are then operated on (as you have done previously) to locate the best "path" through the matrix. The number of operations required to do this is approximately proportional to the product of the lengths of the two sequences.

A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamical-programming algorithm in N dimensions. Technically, it is not difficult to implement such an algorithm.

In the case of three sequences to be aligned, one can visualize this reasonable easily: One would set up a three-dimensional matrix (a cube) instead of the two-dimensional matrix for the pairwise comparison. Then one basically performs the same procedure as for the two-sequence case. This time, the result is a path that goes diagonally through the cube from one corner to the opposite.

The problem here is that the time to compute this N-wise alignment becomes prohibitive as the number of sequences grows. The algorithmic complexity is something like O(c²ⁿ), where c is a constant, and n is the number of sequences. (See the next section for an explanation of this notation.) This is disastrous, as may be seen in a simple example: if a pairwise alignment of two sequences takes 1 second, then four sequences would take 10⁴ seconds (2.8 hours), five sequences 10⁶ seconds (11.6 days), six sequences 10⁸ seconds (3.2 years), seven sequences 10¹⁰ seconds (317 years), and so on.