
A fascinating (mis)use of statistics

I’m actually working on a blog post about unschooling and math education right now, but sometimes you just read such a cool article about statistics that you have to share immediately, right? I’m going to put forward the hypothesis that statistics classes would be a lot more interesting (and informative) if they all had a unit on police and crime stats.

Picture this:

Detectives with a partially degraded/mixed DNA sample from a crime scene hit a roadblock in an investigation. The case goes cold. Months or maybe years later, someone breaks out that DNA sample and runs it through law enforcement databases, hoping to find a hit among previously convicted felons. Out of the million records, a hit! Case closed, right?

What I didn’t know before reading DNA’s Dirty Little Secret (mostly because it’s never come up on Law and Order) is that DNA samples are like fingerprints: analyzed based on how many “markers” the sample contains. (This fact about fingerprints, covered on Law and Order extensively, I knew.) The more markers the sample contains, the more complete the picture and the less likely the sample will match a person by random chance alone. Similarly, the fewer markers contained in the sample, the more likely that others will turn up as a match for that sample. And here’s where statistics comes in.

If you run a suspect’s DNA against a crime scene sample and determine that there’s a one in a million chance that this is not your guy, you may feel pretty confident you’ve solved the case. But remember the scenario mentioned earlier: you had no leads and simply put a weak DNA sample into a database containing a million known offenders. The fact that you found a match isn’t all that much of a coincidence; in fact, it’s expected.

If your test is known to have a one in a million chance of matching someone else, and you run that test with a million random samples, you expect to get one hit due to random chance alone. In other words, we would expect one person in every million to match the sample, not just the owner of the DNA. One of those matches would be the owner of the DNA sample, but the rest are referred to as false positives: people who pass the conditions of the test despite not being the correct match.
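To put numbers on that, here is a minimal Python sketch of the one-in-a-million test run against a million profiles. It assumes every profile is an independent draw and that none of them is the true owner of the sample; the function names are made up for illustration.

```python
def expected_false_positives(match_probability: float, database_size: int) -> float:
    """Average number of coincidental matches across the whole database."""
    return match_probability * database_size


def chance_of_any_false_positive(match_probability: float, database_size: int) -> float:
    """Probability that at least one innocent profile matches by chance.

    Assumes each profile matches independently with the same probability.
    """
    return 1.0 - (1.0 - match_probability) ** database_size


p = 1 / 1_000_000  # one-in-a-million chance of a coincidental match
n = 1_000_000      # a million profiles in the database

expected = expected_false_positives(p, n)
at_least_one = chance_of_any_false_positive(p, n)

print(f"Expected coincidental matches: {expected:.2f}")
print(f"Chance of at least one: {at_least_one:.1%}")
```

Under these assumptions the expected number of coincidental matches is one, and the chance of getting at least one is roughly 63%, so a hit on its own is close to the default outcome rather than a surprise.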

As stated in the Washington Monthly article, the good news is that in cases where DNA samples have all 13 markers intact (are fresh, complete and untainted with other DNA samples), the odds of a false positive plummet to one in many trillions.

If a sample only contains 9 (out of 13) intact markers, the FBI considers the probability of a false positive to be approximately one in 750 million. Yet, an Arizona state employee apparently ran a series of tests in 2005 on Arizona’s state DNA database and found multiple people in that database of only 65,000 profiles who shared nine or more identical markers. Another test in 2006 found 903 pairs with nine or more matching DNA markers out of an Illinois database of 233,000 profiles. Of course, that’s not to say that 1806 people would show up as a match for any one single DNA sample. (We don’t know how many of those pairs share markers.) But it does mean that there are 903 random 9-marker DNA samples you could put into that database and get at least 2 matches.
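Part of what is going on here is that comparing every profile against every other profile is a very different exercise from comparing one sample against the database, because the number of pairs grows so quickly. The sketch below counts the pairs in each database and, purely as a rough assumption, applies the one in 750 million figure to every pair. That is a simplification: the figure describes one particular nine-marker profile, while the searches counted pairs that agreed on nine or more markers, so the results are not a like-for-like comparison with the reported counts. The function names are hypothetical.

```python
import math


def number_of_pairs(database_size: int) -> int:
    """How many distinct pairs of profiles a database contains."""
    return math.comb(database_size, 2)


def expected_matching_pairs(database_size: int, per_pair_probability: float) -> float:
    """Expected number of coincidentally matching pairs.

    Assumes every pair matches independently with the same probability,
    which is a rough simplification.
    """
    return number_of_pairs(database_size) * per_pair_probability


per_pair = 1 / 750_000_000  # FBI figure, applied to every pair as an assumption

for name, size in [("Arizona", 65_000), ("Illinois", 233_000)]:
    pairs = number_of_pairs(size)
    expected = expected_matching_pairs(size, per_pair)
    print(f"{name}: {pairs:,} pairs, about {expected:.1f} expected matching pairs")
```

With these inputs the Arizona database contains about 2.1 billion pairs and the Illinois database about 27 billion, so even at one in 750 million you would expect roughly 3 and 36 matching pairs respectively. The 903 pairs reported in Illinois is still far more than this crude estimate predicts.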

This is why there is an alternative probability statistic called the Database Match Probability, which for a nine-marker match in a system the size of Arizona’s is roughly one in 11,000 (65,000 profiles times the one in 750 million figure). Read that as the chance that searching the whole database turns up a match by coincidence alone: still small, but tens of thousands of times more likely than one in 750 million. And whoever the search does turn up is not guaranteed to be guilty. How significant, then, is a “cold hit” in a criminal database? What if this were not only the way the police identified you as a suspect, but also their main (only?) evidence against you?

Unfortunately, according to the article linked above, the FBI rejects the use of this alternative statistic, either because it contradicts their own figures or because it makes evidence appear less compelling. Consequently, judges (who typically rule from precedent and typically aren’t statisticians) have often deemed the Database Match Probability inadmissible in trials. Even getting to a ruling, of course, assumes that defense lawyers are aware of, understand and are able to adequately explain this statistical concept to a judge in the first place, which is itself not a common occurrence.

I myself am only a casual statistical enthusiast, but I think I understand what’s going on. Here’s how I interpret things, based on some basic statistical reasoning. Feel free to point out any errors.

If you get a DNA match with someone from your list of suspects (generally a sample size of not more than a few dozen), you can be fairly confident that, if the test itself is accurate, you don’t have a false positive. The chance of someone whom you reasonably suspect in a crime (access, motive, M.O. etc.) matching a DNA sample and *not* being the perp is relatively small, and this is where the one in so many million/trillion statistics come into play. It’s not that you need a million or a trillion actual people to validate the assumption; rather, your sample size is so small that hits become overwhelmingly meaningful and the chances of false positives drop to near zero.

But, once you open your sample size up to millions (or even thousands) of random people, matches aren’t nearly as meaningful. Imagine being a juror on an Arizona case and hearing that there’s a one in 750 million chance that the guy on trial didn’t do it, when in fact simply running the DNA sample through the 65,000-profile database had about a one in 11,000 chance of producing a match by coincidence. That’s a difference of multiple orders of magnitude and really should affect how much weight you give to that DNA evidence.
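The Arizona figures can be checked with a little arithmetic. This sketch assumes the Database Match Probability is simply the random match probability multiplied by the number of profiles searched; the post doesn’t spell out the formula, so treat that definition, and the function name, as assumptions.

```python
from fractions import Fraction


def database_match_probability(random_match_probability: Fraction,
                               database_size: int) -> Fraction:
    """Approximate chance that a database search yields a coincidental match.

    Assumed definition: random match probability times database size,
    capped at 1 so the result stays a valid probability.
    """
    return min(Fraction(1), random_match_probability * database_size)


rmp = Fraction(1, 750_000_000)  # FBI figure for a nine-marker match
arizona_profiles = 65_000

dmp = database_match_probability(rmp, arizona_profiles)
one_in = round(1 / dmp)          # express the result as "one in N"
times_more_likely = dmp / rmp    # how far apart the two statistics are

print(f"Database Match Probability: about one in {one_in:,}")
print(f"That is {int(times_more_likely):,} times the one in 750 million figure")
```

This works out to about one in 11,538, which matches the “roughly one in 11,000” figure, and it is 65,000 times larger than the number a jury would hear.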

And notice that the process is completely different, too: in one situation, we are testing one person to see whether he matches a DNA sample. Since the chances of any one random person matching any one random DNA sample are so small (even when the sample has fewer than 13 markers), a hit is a big statistical deal. In the other situation, we are testing thousands or millions of people, and anyone with basic statistical understanding expects the appropriate number of hits based on probability. I’m not going to say hits are meaningless, but simply expecting a match in this situation vs. not expecting a match in the former situation highlights just how different these processes are and how different the significance of a match really is.

In very simplistic terms, it’s the difference between

- knowing your killer is left-handed, and discovering that out of the four people who had access to the crime scene, only one of them was left-handed (still not one in millions kind of proof, but damning nonetheless) and

- just searching for left-handed people in the general population and using their left-handedness as evidence against them in this particular murder

And that doesn’t even take into account the fact that some people who identify primarily as either left- or right-handed in fact use their left hand for certain tasks (would murdering be one of them?) and their right hand for others.

I highly recommend reading the article linked above and discussing the statistical concepts as part of your math education, but must warn you that it does reference disturbing crimes. I would suggest that parents preview the article before using it instructionally. And please let me know if you enjoyed it as much as I did! (Well, except for the killing and all.)
