Sequence alignment

7. How to generate a multiple alignemnt?

How do we generate a multiple alignment? There is an obvious solution: Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. But is this as straightforward as it looks? Consider a case with three short fake sequence that are pair-wise homologous in the following way:

  dront   AGAC
  t-rex   --AC

  dront   AGAC
  unicorn AG--

  t-rex   AC
  unicorn AG

It is not self-evident how these sequences are to be aligned together. Here are some possibilities:

  dront   AGAC
  t-rex   --AC
  unicorn AG--

  dront   AGAC
  unicorn AG--
  t-rex   --AC

  t-rex   AC--
  unicorn AG--
  dront   AGAC

  t-rex   --AC
  unicorn --AG
  dront   AGAC

It depends not only on the various parameters (insertion/deletion penalties, substitution coefficients,...) but also on the order in which sequences are added to the multiple alignment.

In pairwise alignments, one has a two-dimensional matrix with the sequences on each axis, and the elements in the matrix are initially the substitution coefficients, which are then operated on (as you have done previously) to locate the best "path" through the matrix. The number of operations required to do this is approximately proportional to the product of the lengths of the two sequences.

A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamical-programming algorithm in N dimensions. Algorithmically, this is not difficult to do.

In the case of three sequences to be aligned, one can visualize this reasonable easily: One would set up a three-dimensional matrix (a cube) instead of the two-dimensional matrix for the pairwise comparison. Then one basically performs the same procedure as for the two-sequence case. This time, the result is a path that goes diagonally through the cube from one corner to the opposite.

The problem here is that the time to compute this N-wise alignment becomes prohibitive as the number of sequences grows. The algorithmic complexity is something like O(c²ⁿ), where c is a constant, and n is the number of sequences. (See the next section for an explanation of this notation.) This is disastrous, as may be seen in a simple example: if a pairwise alignment of two sequences takes 1 second, then four sequences would take 10⁴ seconds (2.8 hours), five sequences 10⁶ seconds (11.6 days), six sequences 10⁸ seconds (3.2 years), seven sequences 10¹⁰ seconds (317 years), and so on.