17 March 2010

DNA databases

The NY Times ran an op/ed today by Michael Seringhaus about a proposal to create a DNA database of everyone in the country. It revolves around the idea that current databases, which include people arrested for crimes as well as those accused of them, are racially unbalanced. That's not a can of worms I want to get into, but this is:
"Obviously, the more individuals profiled in the database, the more likely a crime-scene sample can be identified"
That's not self-evidently true. Expanding the database can only make it easier to find the correct match if the perpetrator was not in the original, smaller database. If you add a bunch of people who never commit crimes, all you are doing is adding lots of hay and no needles to the stack you're searching in. Remember that while more data generally leads to better conclusions, more imbalanced data leads to worse conclusions. If the proportion of criminals among the people not currently in the database is lower than the proportion among those already in it, then expanding it may actually make finding a match more difficult.
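Here's a back-of-the-envelope sketch of the hay-and-needles argument in Python. Everything in it is an assumption made for illustration: the function name `share_of_hits_that_are_true`, the database sizes, the chance that the perpetrator is on file, and especially the per-comparison false match rate, which is not a real CODIS figure. It treats each innocent profile as an independent chance of a coincidental match and asks what fraction of the expected hits point at the actual perpetrator.

```python
def share_of_hits_that_are_true(n_profiles, p_perp_in_db, false_match_rate):
    """Expected fraction of database hits that belong to the perpetrator.

    n_profiles       -- number of profiles searched (treated as ~all innocent)
    p_perp_in_db     -- probability the perpetrator is in the database at all
    false_match_rate -- chance an innocent profile matches by coincidence
                        (a made-up illustrative value, not a measured one)
    """
    expected_true_hits = p_perp_in_db
    expected_false_hits = n_profiles * false_match_rate
    return expected_true_hits / (expected_true_hits + expected_false_hits)


# Hypothetical "before": a smaller database with a high share of offenders.
before = share_of_hits_that_are_true(
    n_profiles=1_000_000, p_perp_in_db=0.50, false_match_rate=1e-7
)

# Hypothetical "after": 100x more profiles, almost all of them hay, so the
# chance that the perpetrator is on file barely moves.
after = share_of_hits_that_are_true(
    n_profiles=100_000_000, p_perp_in_db=0.55, false_match_rate=1e-7
)

print(f"share of hits that are the perpetrator, before: {before:.3f}")
print(f"share of hits that are the perpetrator, after:  {after:.3f}")
```

Under these invented numbers the needle gets slightly more likely to be in the stack, but the stack grows so much faster that a typical hit is far more likely to be a coincidence. With a small enough false match rate the effect shrinks, so the conclusion depends entirely on values I'm only guessing at here.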

The other thing to remember is that there are two very different tasks when it comes to DNA matching, or any other kind of biometric analysis.  One is the verification problem: This guy says he's John Smith. Does this guy's sample match the one on file which is from John Smith: yes or no? That's a fairly easy problem to solve, and we're quite good at it.

The other is the identification problem: We have a sample. Which of the many samples on file does it best match? That's what needs to be done when a DNA sample (or a fingerprint, etc.) is recovered from a crime scene. This task is much more difficult, we are not as good at it, and it becomes more difficult the more people are in the DB.

(NB If you already have a primary suspect, or a very short list of suspects, then your task upon recovering a DNA sample from the crime scene shifts from the identification to the verification problem. But you need the suspects first.)
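To put rough numbers on the difference between the two tasks, here's a minimal Python sketch. It assumes every comparison is independent and has the same chance of a coincidental match; the function name `prob_any_false_match` and the rate used are hypothetical, picked for illustration rather than taken from any real DNA system. Verification is a single comparison, while identification is one comparison per profile on file.

```python
import math


def prob_any_false_match(n_comparisons, false_match_rate):
    """Probability of at least one coincidental match in n comparisons.

    Computes 1 - (1 - p)^n using log1p/expm1 so tiny rates don't get
    lost to floating-point rounding. Assumes independent comparisons.
    """
    return -math.expm1(n_comparisons * math.log1p(-false_match_rate))


ASSUMED_RATE = 1e-6  # illustrative only, not a measured figure

# Verification: one sample against the one profile on file.
verification = prob_any_false_match(1, ASSUMED_RATE)

# Identification: one sample against every profile in the database.
for db_size in (1_000, 1_000_000, 100_000_000):
    identification = prob_any_false_match(db_size, ASSUMED_RATE)
    print(f"{db_size:>11,} profiles: P(some false match) = {identification:.4f}")

print(f"verification (1 comparison): {verification:.2e}")
```

The same per-comparison error rate that is negligible for verification adds up quickly once it is multiplied across a whole database, which is why identification gets harder as the DB grows.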

It's been a long time since I've studied the DNA modality specifically, and I'm not familiar with the markers used in the CODIS DB, so I'm not sure exactly what the state of the art is for the identification task with DNA. Rest assured, though, that it is much, much harder than it looks on CSI.

I also want to point out this line:
A universal record would be a strong deterrent to first-time offenders — after all, any DNA sample left behind would be a smoking gun for the police — and would enable the police to more quickly apprehend repeat criminals.
I confess to not understanding the latter part of that assertion. How could adding a bunch of people to the database who haven't committed crimes help apprehend people who have committed crimes?

As to the former part of the assertion, you wouldn't have a smoking gun — you'd have a battery of smoking guns, all but one of which would be pointing in the wrong direction. Sure, if there's only one DNA sample at the scene not belonging to the victim that's a great lead. But what about all the other people who may have left DNA behind at the crime scene? They're all going to get ID'ed in the database too. I'd be interested in knowing how many crimes leave behind only one DNA sample, and in how many of those the sample belonged to the perpetrator.

(Via Ron Bailey)

