2. Some definitions for sequence alignments

Gaps and insertions

In an alignment, one may achieve much better correspondence between two sequences if one allows a gap to be introduced in one sequence. Equivalently, one could allow an insertion in the other sequence. Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.

Optimal alignment

The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments. There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry? All sequence alignment procedures make some such assumptions.

Global alignment

An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing. A tiny example:

        LGPSTKDFGKISESREFDN
        |      ||||    | 
        LNQLERSFGKINMRLEDA

Local alignment

An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:

        ----------FGKI----------
                  ||||
        ----------FGKI----------

It may seem that one should always use local alignments. However, it may be difficult to spot an overall similary, as opposed to just a domain-to-domain similarity, if one uses only local alignment. So global alignment is useful in some cases. The popular programs BLAST and FASTA for searching sequence databases produce local alignments.

Substitution matrix

A substitution matrix describes the likelihood that two residue types would mutate to each other in evolutionary time. This is used to estimate how well two residues of given types would match if they were aligned in a sequence alignment. The matrix is a symmetrical 20*20 matrix, where each element contains the score for substituting a residue of type i with a residue of type j in a protein, where i and j are one of the 20 amino-acid residue types. Same residues should obviously have high scores, but if we have different residues in a position, how should that be scored? There are many possibilities:

The same residues in a position give the score value 1, and different residues give 0.
The same residues give a score 1, similar residues (for example: Tyr/Phe, or Ile/Leu) give 0.5, and all others 0.
One may calculate, using well established sequence alignments, the frequencies (probabilities) that a particular residue in a position is exchanged for another. This was done originally be Margaret Dayhoff, and her matrices are called the PAM (Point Accepted Mutation) matrices, which describe the exchange frequencies after having accepted a given number of point mutations over the sequence. Typical values are PAM 120 (120 mutations per 100 residues in a protein) and PAM 250. There are many other substitution matrices: BLOSUM, Gonnet, etc.

Gap penalty

The gap penalty is used to help decide whether on not to accept a gap or insertion in an alignment when it is possible to achieve a good alignment residue-to-residue at some other neighbouring point in the sequence. One cannot let gaps/insertion occur without penalty, because an unreasonable 'gappy' alignment would result. Biologically, it should in general be easier for a protein to accept a different residue in a position, rather than having parts of the sequence chopped away or inserted. Gaps/insertions should therefore be more rare than point mutations (substitutions). Some different possibilities:

A single gap-open penalty. This will tend to stop gaps from occuring, but once they have been introduced, they can grow unhindered.
A gap penalty proportional to the gap length. This will work against larger gaps.
A gap penalty that combines a gap-open value with a gap-length value.