4. Database searching

When we have a sequence and want to find other sequences similar to it in a database, we do not really need the full alignment of this sequence against all the others. All we want is a value, a score, that tells us how similar our probe sequence is to every other sequence. This score should be sensitive (so that as many of the true homologs as possible are found) and specific (so that as few false positives as possible are hit).

There is a simple rule of thumb: a database hit with a sequence identity of 25% or more (for protein lengths of 200 residues or more) is almost certainly a true hit, provided one uses reasonable parameter settings for the common programs BLAST or FASTA. There are cases where this is not true, for example when the sequences contain many low-complexity regions (Ser-Thr-rich regions and the like), but this can usually be dealt with by applying a low-complexity filter.
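The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the identity is computed over aligned columns in which neither sequence has a gap (search programs differ in which denominator they use), and the 25% and 200-residue thresholds are simply the ones quoted above.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: both strings are rows of the same pairwise alignment,
    so they must have equal length. Other definitions of identity
    (e.g. dividing by the full alignment length) give different numbers.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def passes_rule_of_thumb(identity_pct: float, length: int,
                         min_identity: float = 25.0,
                         min_length: int = 200) -> bool:
    """True if a hit meets the 25% identity / 200 residue rule of thumb."""
    return identity_pct >= min_identity and length >= min_length


if __name__ == "__main__":
    pid = percent_identity("ACDE-FG", "ACDQWFG")
    print(f"identity: {pid:.1f}%")
    print("likely true hit:", passes_rule_of_thumb(pid, length=7))
```

Note that such a check says nothing about low-complexity regions; those still need to be filtered before the identity is trusted.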

But what about hits with a lower degree of identity? The basic problem is how to judge whether a score is significant or not. Could a given score be the result of pure chance? The various search programs (BLAST, FASTA) attempt to answer this question by computing an expectation value (or something similar). This is an estimate of how many hits with a score at least as good would be expected by pure chance, given the size of the database. The calculation uses probability theory and various (reasonable) assumptions. The expectation value should be as low as possible. If it is not far below 1 (say, 0.01), rather than 0.0 or 1.0e-45, then the hit is suspect.
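To make the dependence on database size concrete, here is a minimal sketch of the Karlin-Altschul form of the expectation value that underlies BLAST-style statistics, E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m the query length and n the total database length. The function names are hypothetical, and the default values of lambda and K are placeholders for illustration only; the real values depend on the scoring matrix and gap penalties and are reported by the search program. FASTA estimates its statistics differently, so this should not be read as a description of every program.

```python
import math


def expect_value(raw_score: float, query_len: float, db_len: float,
                 lam: float = 0.267, k: float = 0.041) -> float:
    """Expected number of chance hits scoring at least raw_score.

    Implements E = K * m * n * exp(-lambda * S).
    lam and k are illustrative placeholders, not authoritative values.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Hypothetical query of 250 residues against a database of 1e8 residues.
    for score in (40, 80, 160):
        e = expect_value(score, query_len=250, db_len=1e8)
        print(f"score {score:4d}  E = {e:.3g}  P = {p_value(e):.3g}")
```

Two properties are worth noticing: the expectation value falls exponentially as the score rises, and it grows in proportion to the database size, so the same score is less convincing in a larger database.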