24 January 2011

Statistical significance, science, and numerate journalism

NY Times | Benedict Carey | You Might Already Know This ...

The statistical approach that has dominated the social sciences for almost a century is called significance testing. The idea is straightforward. A finding from any well-designed study — say, a correlation between a personality trait and the risk of depression — is considered “significant” if its probability of occurring by chance is less than 5 percent.
There's a simple solution that the Computer Science papers I read use: simply report the value of p. That's it. You can call your finding "significant," but if right after that you have "(p = .499)" people aren't going to be very convinced. In contrast, if you call your finding significant and have, as one paper I read yesterday did, "(p << .000001)" people will be pretty convinced, and likely rightfully so.
In at least one area of medicine — diagnostic screening tests — researchers already use known probabilities to evaluate new findings. For instance, a new lie-detection test may be 90 percent accurate, correctly flagging 9 out of 10 liars. But if it is given to a population of 100 people already known to include 10 liars, the test is a lot less impressive.

It correctly identifies 9 of the 10 liars and misses one; but it incorrectly identifies 9 of the other 90 as lying. Dividing the so-called true positives (9) by the total number of people the test flagged (18) gives an accuracy rate of 50 percent. The “false positives” and “false negatives” depend on the known rates in the population.
Hold the horses. Carey's getting at a valuable point, but this is completely wrong.

The experiment described has 100 trials consisting of 81 true negatives (correctly predicting honesty), 1 false negative (incorrectly predicting that a liar was honest), 9 false positives (incorrectly predicting dishonesty from an honest subject) and 9 true positives (correctly predicting dishonesty from a liar). That's an accuracy of (TP+TN)/(TP+TN+FN+FP) = 90%, not the 50% given in the second paragraph.

The quantity Carey describes in the second paragraph as "true positives [divided] by the total number of people the test flagged" is TP/(TP+FP) = 50%, which is the precision of the test, not the accuracy.

The description in the first paragraph, "correctly flagging 9 out of 10 liars," actually corresponds to the recall (TP/(TP+FN) = 90%), not the accuracy of the test.

Other values of interest may be the specificity TN/(TN+FP) = 90%, sensitivity TP/(TP+FN) = 90%, false positive rate FP/(FP+TN) = 10%, false negative rate FN/(TP+FN) = 10%, and f1-score or f-measure 2TP/(2TP+FN+FP) = 64.3%. I could go on; there are plenty of other values derived from these such as the likelihood ratios.
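
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the values above from the four counts in Carey's example. The function name confusion_metrics and the dictionary keys are my own labels for illustration, not anything from the article or from a particular library.

```python
# Counts from the lie-detector example: 100 people, 10 of them liars.
# The test flags 9 of the 10 liars and 9 of the 90 honest people.
tp, fn, fp, tn = 9, 1, 9, 81


def confusion_metrics(tp, fn, fp, tn):
    """Return common rates derived from a 2x2 confusion matrix.

    Hypothetical helper written for this post; the formulas are the
    ones given in the text above.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall (sensitivity)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (tp + fn),
        "f1": 2 * tp / (2 * tp + fn + fp),
    }


metrics = confusion_metrics(tp, fn, fp, tn)
for name, value in metrics.items():
    # Print each rate as a percentage with one decimal place.
    print(f"{name}: {value:.1%}")
```

Running it reproduces the figures quoted here: 90% accuracy, 50% precision, and an F1 score of 64.3%. Changing the four counts at the top is a quick way to see how each rate responds to a different mix of liars and honest subjects.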

(It is, of course, a coincidence that 10% and 90% keep popping up.  That's just an artifact of the numbers Carey chose.  (Except sensitivity and recall.  Those are two words for the same thing.))
In the same way, experts argue, statistical analysis must find ways to expose and counterbalance all the many factors that can lead to falsely positive results — among them human nature, in its ambitious hope to discover something, and the effects of industry money, which biases researchers to report positive findings for products.
Indeed they do.

And journalists must find ways of not reporting on studies with weak statistical findings, especially those based on little more than press releases. And especially releases which precede review and acceptance.

Journalists must also find ways of explaining the relevant numerical aspects to their readers rather than just printing up headlines like "Eating FOO reduces BAR, study finds."

Of course the numeracy of the reporter here does not fill me with confidence that they will take up that side of the burden.
