Lecture 22 Jan 2001 Per Kraulis
How can we measure how good a pattern is at detecting a particular sequence feature? There are two different, complementary measures that can be used to describe how well a pattern performs:
Sensitivity is the fraction of the true matches that actually are correctly predicted as matches by the pattern. If the pattern is too stringent (to avoid garbage), then too few of the true matches will be identified by it.
Specificity is the fraction of the sequences predicted as matches that really are true matches. If the pattern is too inclusive (to catch as many as possible), then a lot of garbage will also be falsely identified by it.
We want a pattern to be both sensitive and specific, so that all true matches are found, and nothing but true matches. But as usual, nature will rarely allow us to find such a pattern; we must try to find a good compromise.
A regular expression can easily be designed that has 100% sensitivity: just use the expression that matches anything: .* (dot, star). Everything will be identified as matching this expression, so all true positives will also be identified, and none will be left out (i.e. no false negatives).
Conversely, it is easy to prepare a regexp that is 100% specific: just use a regexp that exactly matches one of the known members of the domain family. Only that single member will be predicted, and no others (i.e. no false positives).
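The two extreme cases above can be tried out on a toy data set. The following is only a minimal sketch: the sequences, their labels, and the helper name `evaluate` are hypothetical and chosen for illustration. It computes sensitivity and specificity as the fractions described earlier in this lecture.

```python
import re

# Hypothetical toy data: each sequence is labelled True if it really is
# a member of the family, False otherwise.
sequences = {
    "ACDEFGHIK": True,
    "ACDQFGHIK": True,
    "MNPQRSTVW": False,
    "WYWYWYWYW": False,
}

def evaluate(pattern, labelled):
    """Return (sensitivity, specificity) of a regexp on labelled sequences.

    Assumes the pattern matches at least one sequence and that there is
    at least one true member, otherwise a division by zero occurs.
    """
    regex = re.compile(pattern)
    tpos = fneg = fpos = 0
    for seq, is_member in labelled.items():
        predicted = regex.fullmatch(seq) is not None
        if predicted and is_member:
            tpos += 1          # correctly predicted match
        elif predicted:
            fpos += 1          # garbage falsely identified as a match
        elif is_member:
            fneg += 1          # true match that was missed
    sensitivity = tpos / (tpos + fneg)
    specificity = tpos / (tpos + fpos)
    return sensitivity, specificity

# The expression that matches anything: fully sensitive, poorly specific.
match_anything = evaluate(".*", sequences)

# A regexp that is exactly one known member: fully specific, poorly sensitive.
match_one = evaluate("ACDEFGHIK", sequences)

print(match_anything, match_one)
```

With this toy data, the catch-all pattern finds both true members but also both non-members, while the exact pattern finds only one of the two true members and nothing else.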
Let us use an analogy: we have an exam (Swedish: "tentamen") to check that a class of students has actually learned what it should about bioinformatics. The exam must of course be designed so that those students who have actually learned their stuff pass, while those who haven't learned anything fail.
If the exam is too difficult, then those who manage to pass it will most likely really know their stuff. But there will be many students among those who fail (for stupid reasons, like being too stressed by the difficult questions) who really should have passed it but didn't (false negatives). The exam is specific, but it is not sensitive.
If the exam is too easy, then there will be many among those who pass it who have just made wild guesses, and really haven't learned anything. But on the other hand, of the students who really have learned what they should, not many will fail (even the stress-sensitive will get some answers right). The exam is sensitive, but it is not specific.
Let us define this in more precise, technical terms. We define the notions of true and false positive and negative matches in the following way:
| | The pattern says: Yes, a match | The pattern says: No, not a match |
|---|---|---|
| Reality: Yes, a match | True positive; TPos | False negative; FNeg |
| Reality: No, not a match | False positive; FPos | True negative; TNeg |
If we use these definitions, then the following holds:
Sensitivity = TPos / (TPos + FNeg)
Specificity = TPos / (TPos + FPos)
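Warning: in the medical context (diagnosis), specificity is defined differently: there it is TNeg / (FPos + TNeg). The quantity called specificity in this lecture, TPos / (TPos + FPos), is elsewhere often called the positive predictive value, or precision.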
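To see how much the two definitions of specificity can differ, here is a small worked example. It is only a sketch: the counts are made up for illustration, and the function names are hypothetical.

```python
def sensitivity(tpos, fneg):
    # Fraction of the true matches that the pattern finds.
    return tpos / (tpos + fneg)

def specificity(tpos, fpos):
    # Definition used in this lecture: fraction of predicted matches
    # that really are true matches.
    return tpos / (tpos + fpos)

def medical_specificity(tneg, fpos):
    # Definition used in medical diagnosis: fraction of the true
    # non-matches that are correctly rejected.
    return tneg / (fpos + tneg)

# Hypothetical counts for a pattern scanned against 1000 sequences,
# of which 100 are true members of the family.
tpos, fneg, fpos, tneg = 80, 20, 10, 890

sens = sensitivity(tpos, fneg)             # 80 / 100
spec = specificity(tpos, fpos)             # 80 / 90
med_spec = medical_specificity(tneg, fpos) # 890 / 900

print(f"sensitivity         = {sens:.3f}")
print(f"specificity         = {spec:.3f}")
print(f"medical specificity = {med_spec:.3f}")
```

When true non-matches vastly outnumber true matches, as is typical when scanning a sequence database, the medical definition stays close to 1 even for a pattern that produces a noticeable share of false positives among its predictions. This may be one reason why the two numbers should not be compared directly.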