Stockholm Bioinformatics Center, SBC

Lecture notes 19 Jan 2001 Per Kraulis

2. Some definitions for sequence alignments

Gaps and insertions

In an alignment, one may achieve much better correspondence between two sequences if one allows a gap to be introduced in one sequence. Equivalently, one could allow an insertion in the other sequence. Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.

Optimal alignment

The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments. There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry? All sequence alignment procedures make some such assumptions.
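To illustrate how optimality depends on the chosen parameters, here is a minimal sketch that scores two candidate alignments of the same pair of sequences under two different gap penalties. The function name score_alignment, the toy DNA sequences and the numeric values for match, mismatch and gap are all assumptions made for this example; they do not come from the notes.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length alignment rows; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap          # column contains a gap
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


# Two candidate alignments of the toy sequences ACGTT and CGTTA.
ungapped = ("ACGTT", "CGTTA")     # no gaps: 1 match, 4 mismatches
gapped = ("ACGTT-", "-CGTTA")     # 2 gaps: 4 matches

for gap in (-1, -5):
    s_ungapped = score_alignment(*ungapped, gap=gap)
    s_gapped = score_alignment(*gapped, gap=gap)
    print(f"gap={gap}: ungapped={s_ungapped}, gapped={s_gapped}")
```

With a mild gap penalty (-1) the gapped alignment scores higher (2 against -3); with a severe one (-5) the ungapped alignment wins (-3 against -6). Which alignment is "optimal" therefore changes with the assumptions.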

Global alignment

An alignment that assumes that the two proteins are basically similar over their entire length. The alignment attempts to match them to each other from end to end, even if parts of the alignment are not very convincing. A tiny example:

        LGPSTKDFGKISESREFDN
        |      ||||    | 
        LNQLERSFGKINMRLEDA
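End-to-end alignments of this kind are commonly computed by dynamic programming (the Needleman-Wunsch approach). The following is a minimal sketch that returns only the optimal global score, not the alignment itself. It assumes a simple match/mismatch score and a fixed penalty per gap position; the function name and the parameter values are illustrative choices, not taken from the notes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (score only, no traceback)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gap positions.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair    # align a[i-1] with b[j-1]
            up = prev[j] + gap           # a[i-1] aligned against a gap
            left = curr[j - 1] + gap     # b[j-1] aligned against a gap
            curr.append(max(diag, up, left))
        prev = curr
    # The last cell covers both sequences from end to end.
    return prev[-1]


# The two sequences from the example above.
print(global_alignment_score("LGPSTKDFGKISESREFDN", "LNQLERSFGKINMRLEDA"))
```

Because the score is read from the last cell of the table, every residue of both sequences must be accounted for, either paired with a residue or with a gap.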

Local alignment

An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:

        ----------FGKI----------
                  ||||
        ----------FGKI----------

It may seem that one should always use local alignments. However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment. So global alignment is useful in some cases. The popular programs BLAST and FASTA for searching sequence databases produce local alignments.
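A local alignment can be computed with a small change to the dynamic programming used for global alignment (the Smith-Waterman approach): scores are never allowed to drop below zero, and the alignment is read back from the highest-scoring cell rather than from the corner of the table. The sketch below assumes a simple match/mismatch score and a fixed penalty per gap position; the function name, the parameter values and the toy sequences are illustrative assumptions. Note that BLAST and FASTA use faster heuristic methods, not this exhaustive procedure.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, segment_a, segment_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start afresh anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    seg_a, seg_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            seg_a.append(a[i - 1]); seg_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            seg_a.append(a[i - 1]); seg_b.append("-")
            i -= 1
        else:
            seg_a.append("-"); seg_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


# Toy sequences that share only the segment FGKI.
print(local_alignment("XXXFGKIYYY", "ZZFGKIWW"))   # (4, 'FGKI', 'FGKI')
```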

Substitution matrix

A substitution matrix describes the likelihood that two residue types would mutate to each other in evolutionary time. This is used to estimate how well two residues of given types would match if they were aligned in a sequence alignment. The matrix is a symmetrical 20 × 20 matrix, where each element contains the score for substituting a residue of type i with a residue of type j in a protein, and i and j each denote one of the 20 amino-acid residue types. Identical residues should obviously have high scores, but if we have different residues in a position, how should that be scored? There are many possibilities:
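As an illustration of the data structure, the simplest conceivable scheme is an identity matrix, which scores identical residues 1 and all other pairs 0. The sketch below builds such a matrix as a nested dictionary and uses it to score an ungapped alignment. The names AMINO_ACIDS, identity_matrix and score_ungapped, and the score values, are assumptions for this example. Matrices used in practice, such as the PAM and BLOSUM families, have the same 20 × 20 shape but hold scores derived from observed substitutions.

```python
# One-letter codes of the 20 standard amino-acid residue types.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_matrix(match=1, mismatch=0):
    """Build a symmetric 20 x 20 matrix as a dict of dicts: matrix[i][j]."""
    return {
        i: {j: (match if i == j else mismatch) for j in AMINO_ACIDS}
        for i in AMINO_ACIDS
    }


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the matrix scores over the columns of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[x][y] for x, y in zip(seq_a, seq_b))


matrix = identity_matrix()
print(score_ungapped("FGKI", "FGKI", matrix))   # 4: four identical pairs
print(score_ungapped("FGKI", "FGRI", matrix))   # 3: K/R scores 0 here
```

The identity matrix treats the K/R pair exactly like any other mismatch. A matrix that scores substitutions between chemically similar residues more favourably would assign such a pair a higher value than an unrelated pair.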

Gap penalty

The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment when it is possible to achieve a good residue-to-residue alignment at some other neighbouring point in the sequence. One cannot let gaps/insertions occur without penalty, because an unreasonably 'gappy' alignment would result. Biologically, it should in general be easier for a protein to accept a different residue in a position than to have parts of the sequence chopped away or inserted. Gaps/insertions should therefore be rarer than point mutations (substitutions). Some different possibilities:
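Two commonly used schemes can be sketched as follows: a linear penalty, where every gap position costs the same, and an affine penalty, where opening a gap is expensive but extending it is cheap. The function names and numeric values are assumptions for this example, and conventions differ between programs (some charge the extension cost for the first gap position as well); these are not necessarily the schemes the notes went on to list.

```python
def linear_gap_penalty(length, per_residue=2):
    """Penalty proportional to gap length: every gap position costs the same."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Opening cost for the first position, extension cost for each further one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of three residues versus three separate one-residue gaps.
print(affine_gap_penalty(3))        # 12
print(3 * affine_gap_penalty(1))    # 30
print(linear_gap_penalty(3))        # 6, the same as 3 * linear_gap_penalty(1)
```

Under the affine scheme a single long gap is much cheaper than several short ones, which fits the idea that one mutation event can remove or insert a whole stretch of sequence. The linear scheme cannot make that distinction.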


Copyright © 2001 Per Kraulis $Date: 2001/01/18 08:50:07 $