2006-12-20

Reporting significance

This is a rather specialized and personal entry on the topic of reporting the p-values resulting primarily from ANOVA on psychological data. It is something that has been bugging me for a couple of decades or so, and so it's time for me to put some thoughts down about it. I'm not going to do much explaining of basic terminology here; the reader is assumed to have gone through the kind of statistics classes required, for example, in undergraduate psych programs.

The "traditional" way that ANOVA results have been reported during the past several decades is to give the F and its degrees of freedom, and one out of four (or five) characterizations of the probability of getting similar results by chance alone:

  1. n.s., not significant, failed to reach significance;
  2. p < .05, significant;
  3. p < .01, significant;
  4. p < .001, [highly, very] significant;
an additional level is sometimes added: "p < .1, marginally significant". Sometimes these characterizations are referred to incorrectly as "alpha levels".

The alpha level, however, is part of a much simpler but harder-to-interpret approach to reporting ANOVA. It is intended to be an all-or-none hypothesis-testing tool. Given a certain null hypothesis, one asks whether the probability that the observed outcome resulted from chance alone (the same probability as above, but conceived differently) falls below a pre-set alpha. When that probability is less than alpha, the null hypothesis is rejected; when it is greater than or equal to alpha, it is not rejected. There are several conventional alpha levels: .05, .01, and .001. Notice the very good correspondence between the conventional alpha levels and the way that results have been reported even when no specific alpha level is being used. Some have called the traditional way of reporting results "variable alpha" or "multiple alpha", because authors who report that way are often trying to stay within the simplistic, cookie-cutter alpha-level, null-hypothesis-rejection framework (beloved of statisticians) while still giving the reader some idea of what else might be going on in the data set.

The problem with the single-alpha approach is that the kinds of hypotheses that interest psychologists are almost never amenable to black-or-white, yes-or-no analysis. On an assembly line operated by a robot (or by a human reacting robotically to numbers on a read-out), this approach makes sense. If you are making batteries and, after a charge test, a certain battery fails to charge to criterion, then fine: it is a defective battery, toss it in the recycling pile; otherwise, it is a good battery, put it in an impossible-to-open plastic package and send it to a drug store for sale. In such a situation, a battery that charged only to .09 V is treated exactly like one that charged to .9 V; they are equally bad.

In the world of psychology, however, experimental hypotheses are virtually never so easily evaluated. As in horseshoes, close can count in psych experiments. For example, "close" in an unexpected direction can indicate new avenues for exploration; close but no cigar is also not as much of a problem for a hypothesis that has been supported by a good deal of other experimentation as it would be for a novel hypothesis with little confirmation elsewhere. In other words, using a single, yes/no alpha level simply doesn't supply enough information about the results of an experiment for readers to get a good understanding of what happened.

And in fact, almost no one uses single alpha in its strongest form (selecting a single, study-wide alpha level, rejecting the null hypothesis when the criterion is met, and simply failing to reject it when it is not). Instead, psychologists tend to act like they are using single alphas, but then they report the standard p < whatever regardless of whether the result met their alpha criterion. For example, if they set a liberal alpha of .05, they still report p < .001 when they get that result; if they set .01, they still report p < .05 for results that fail to meet it.

And this is what bugs me about it: why bother setting an alpha if (1) you aren't going to use it, and (2) it doesn't make sense anyway? Maybe the whole alpha thing is just a source of noise in experimental reports, and it would make more sense to give the reader a little more information than just a yes or a no, or even a p < .05/.01/.001 classification.

Now, the APA manual is somewhat ambiguous on this question. In some places, it seems to suggest that one should go back to the approach advocated by R. A. Fisher, the inventor of ANOVA, of reporting the exact probability to two or three digits of precision (Fisher himself did this, but he was also fond of using common fractions to report probabilities). In other places, it seems comfortable with the variable alpha approach, and even with the Pearsonian single alpha. Furthermore, it fails to deal with what may be the real issue here, which is basically readability.

Strictly adhering to Fisher's exact reporting of probabilities is workable only when the results are not very strong, that is, when they fall into the range from, say, .1 to .001. That would give us results like .12, .034, and .0067 with two digits of precision, and .123, .0345, and .00678 with three. None of those are too hard to read, even compared with n.s., < .05, and < .01. When you get down to numbers less than .001, then you end up with .000876 and the like; this is becoming a bit awkward.

If you really want to have exact results even with very small probabilities, there is only one reasonable possibility: you can use scientific notation. That is, instead of .0000...00456 you can have, say, 4.56e-12. But in terms of readability, this would be a disaster. People simply aren't used to reading probabilities in scientific notation: p = 4.9e-2 and p = .049 both mean about the same thing as p < .05, but psychologists simply aren't used to seeing the former, whereas the latter is closer to what they are used to. Now, you could also do a kind of switch, using ordinary decimals for p >= .001, and scientific notation for smaller probabilities. This would solve the problem in many cases, perhaps even in most of the more common ones. But that kind of switch would make it hard to understand differences between two conditions, where one is, say, p = .00123 and the other is p = 9.87e-4: are these very different or rather close? In other words, while this would indeed give the reader a lot more information, it may be that the readability of the report would actually go down rather than up.
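For concreteness, here is a quick sketch (in Python; the function name is my own invention) of the kind of hybrid switch just described: ordinary decimals for p >= .001, scientific notation for anything smaller.

    def format_p_switch(p, digits=3):
        """Hybrid scheme: `digits` significant digits in ordinary decimal
        notation down to .001, scientific notation for anything smaller."""
        if p >= 0.001:
            return "p = " + f"{p:.{digits}g}".lstrip("0")   # e.g. .123, .0345, .00678
        return f"p = {p:.{digits - 1}e}"                    # e.g. 4.56e-12

    # The readability problem: these two values are nearly equal,
    # but the switch makes them look very different.
    print(format_p_switch(0.00123))    # p = .00123
    print(format_p_switch(0.000987))   # p = 9.87e-04

The last two lines are exactly the comparison discussed above: .00123 and .000987 differ by only about 25 percent, yet once the switch kicks in they no longer look comparable at a glance.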

My current proposal for how to deal with this problem builds on all of the above, but it questions one of the assumptions made by the APA, namely that the same number of digits of precision should be used for all p values.

When I talk about this with psychologists, they often agree with me when I suggest that the use of exact probabilities allows a more correct handling of p values like .056, or pairs of p values like .0101 and .00980: if these were reported simply as n.s., p < .05, and p < .01, respectively, information useful to the reader would be lost. However, when one extends these examples to numbers like .47, or pairs like .0000101 and .00000980, the response becomes, "Who cares?". That is, a probability of .47 is so likely to be the result of chance that calling it n.s. is probably fine (as long as results like .056 are written out), and p values down to the 1/100,000 level are so unlikely to result from chance that even fairly large differences among them (such as between 1/10,000 and 1/1,000,000) are unimportant.

This suggests that there is a critical range of p values where relatively high precision is useful to the reader, but that outside of that range, even very low precision is perfectly adequate. Therefore, it may not be of any particular value to maintain the same degree of precision for all p values. And this is what leads to my proposal.

Many statistical programs print out p values using a fixed, three-decimal format. That is, p values less than .001 are printed as 0.000, in other words with zero digits of precision. Values below .01 are printed with one digit of precision (e.g., 0.004), and values below .05 with two digits of precision (e.g., 0.049), as are marginally significant results below .1 (e.g., 0.087). If one is determined to use a constant number of digits of precision even for very small p, this is frustrating. But as we just saw, it may not be advisable to maintain a constant precision. Therefore, what I suggest is that we report p values to three decimal places within the range where there could be some ambiguity as to the significance of the result, and use other methods for p values outside that range. This automatically makes the precision highest where the ambiguity is highest and lower where there is little doubt. Here it is, all spelled out:

A simple proposal

  • If the error mean square is larger than the treatment mean square, that is, F < 1, then no p value need be reported, for example: F(2, 8) < 1
  • Large p values of little significance will be reported to three decimal places, for example: F(2, 4) = 2.295, p = .217; F(2, 8) = 2.62, p = .133
  • For smaller p values in the range sometimes called "marginally significant", p is still written out using three decimal places, but with only two significant digits: F(2, 8) = 3.85, p = .067
  • For smaller p values greater than or equal to .01, still use two significant digits and three decimal places: F(2, 8) = 5.10, p = .037
  • For smaller p values greater than or equal to .001, use one significant digit (and three decimal places): F(2, 8) = 17.8, p = .002
  • For p less than 0.001, simply write it that way: F(2, 8) = 54.8, p < .001
One might reduce the precision of large p values such as .237, or even not report them at all. However, reporting them to three decimal places, as is done in every case where p is greater than or equal to .001, reduces the potential confusion that could result from changing the format. The end result is that all p's are written using three decimal places unless they are less than .001; the code sketch below spells these rules out.
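To make that concrete, here is a minimal sketch in Python (the function names are mine, not part of any statistics package) of a formatter that follows the rules in the list above; the last two example p values are hypothetical, chosen only to fall in the right ranges.

    def format_p(p):
        """p per the proposal: three decimal places down to .001, '< .001' below that."""
        if p < 0.001:
            return "p < .001"
        return "p = " + f"{p:.3f}".lstrip("0")   # e.g. .217, .067, .002

    def format_f_test(f_value, df1, df2, p):
        """F and p together; F < 1 is reported with no p value at all."""
        if f_value < 1:
            return f"F({df1}, {df2}) < 1"
        return f"F({df1}, {df2}) = {f_value:g}, {format_p(p)}"

    print(format_f_test(2.62, 2, 8, 0.133))     # F(2, 8) = 2.62, p = .133
    print(format_f_test(3.85, 2, 8, 0.067))     # F(2, 8) = 3.85, p = .067
    print(format_f_test(54.8, 2, 8, 0.00042))   # F(2, 8) = 54.8, p < .001
    print(format_f_test(0.73, 2, 8, 0.51))      # F(2, 8) < 1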

I believe that this approach will improve the current situation in psychological research papers. It gives the "right" amount of information to the reader: where the result might be interesting but not reach a traditional "alpha" level, enough information is given for the reader to decide for himself what the results mean. Similarly, when there is little doubt that there is a significant effect, no useless extra information is given, thereby avoiding awkward, less readable reports.

In addition to reporting the exact p values of the result, I recommend that partial η² also be reported as an index of the size of the effect. Note that while effect size and p tend to be negatively correlated, it is quite possible to have large effect sizes with large p values and negligible effect sizes with very significant p values. Jacob Cohen (Statistical Power Analysis for the Behavioral Sciences, 2nd Edition, 1988) suggests that η² can be characterized as follows: 0 < η² < .010, negligible; .010 ≤ η² < .059, small effect; .059 ≤ η² < .139, medium effect; η² ≥ .139, large effect. These should also be reported using three decimal places, and we can use η² < .001 for those rare cases where the effect size is extremely tiny and can't be represented at all in three decimal places.
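As a companion sketch (again Python, function names my own), partial η² can be computed directly from F and its degrees of freedom via the standard identity partial η² = F·df1 / (F·df1 + df2), and then labeled with the Cohen benchmarks just quoted.

    def partial_eta_squared(f_value, df1, df2):
        """Partial eta squared recovered from F and its degrees of freedom:
        eta_p^2 = F*df1 / (F*df1 + df2)."""
        return (f_value * df1) / (f_value * df1 + df2)

    def cohen_label(eta_sq):
        """Verbal label per Cohen's (1988) benchmarks quoted above."""
        if eta_sq < 0.010:
            return "negligible"
        if eta_sq < 0.059:
            return "small"
        if eta_sq < 0.139:
            return "medium"
        return "large"

    eta = partial_eta_squared(5.10, 2, 8)   # the F(2, 8) = 5.10 example above
    print("eta^2 = " + f"{eta:.3f}".lstrip("0"), cohen_label(eta))   # eta^2 = .560 large

The identity holds whenever F is the ratio of the effect mean square to its own error mean square, since partial η² is just SS_effect / (SS_effect + SS_error) for that error term.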

To summarize, results of ANOVA F tests should be reported using the standard format, except that three decimal places are used to report p values. For the F's themselves, use three significant digits (e.g., 12300, 1230, 123, 12.3, 1.23), except that F's less than one are not reported. The effect size should be reported after p as a partial η², also using three decimal places (for example, F(1, 20) = 12.3, p < .001, η² = .347).
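Putting the pieces together, a single self-contained sketch (hypothetical names again; the p value passed in below is made up, chosen only to fall below .001 as in the example just given) might render a complete result line like this.

    from math import floor, log10

    def round_sig(x, sig=3):
        """Round to `sig` significant digits, e.g. 12345 -> 12300, 12.34 -> 12.3."""
        return round(x, sig - 1 - floor(log10(abs(x))))

    def report_anova(f_value, df1, df2, p, eta_sq):
        """One result in the recommended format: F to three significant digits,
        p to three decimal places (or '< .001'), partial eta squared to three decimals."""
        if f_value < 1:
            return f"F({df1}, {df2}) < 1"
        f_str = f"{round_sig(f_value):g}"
        p_str = "p < .001" if p < 0.001 else "p = " + f"{p:.3f}".lstrip("0")
        e_str = "η² < .001" if eta_sq < 0.001 else "η² = " + f"{eta_sq:.3f}".lstrip("0")
        return f"F({df1}, {df2}) = {f_str}, {p_str}, {e_str}"

    print(report_anova(12.3, 1, 20, 0.0004, 0.347))
    # F(1, 20) = 12.3, p < .001, η² = .347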

One final recommendation is stylistic: instead of simply listing results and classifying them as significant or not significant, as is done very often, the results of the ANOVA and possibly of any post-hoc tests should be composed as a paragraph based on how the hypotheses, the elements of the design, and the pattern of results interact. That is, in a complex analysis, the hypotheses will be summarized and statistical tests reported in such a way that the reader can easily assimilate the pattern of results and whether and to what degree they confirm or run counter to the hypotheses. Part of this is the report of the effect sizes. The report of the statistical tests per se is subsumed in a description of the results; since the actual p values and effect sizes are fully reported, it is not necessary to classify them into discrete levels of significance; instead, it is more meaningful to refer to p values as "noteworthy", "reliable" and so on, and to effect sizes as "small" or "large". The conjunction of a large effect size and a small p could be characterized as "robust" or "clear"; other possibilities include things like "small but reliable effect". For example, "The large but only moderately reliable interaction between Group and Lexicality (F(2, 8) = 2.57, p = .021, η² = .210) is compatible with the hypothesis that females generally have disproportionate difficulty pronouncing nonwords".

As is stated in the APA manual, when a large, dense set of statistical results is being reported, such as correlations, or means compared to a standard, p values can be condensed into a handful of levels such as < .05, < .01, and < .001, indicated by *, **, or ***. A similar method could be used for effect sizes, perhaps using plus marks (+, ++, +++).
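A final tiny sketch of what that condensation could look like in code (the star cutoffs are the conventional ones just mentioned; mapping +, ++, and +++ onto Cohen's small, medium, and large cutoffs is my own assumption about how the plus marks might be assigned):

    def p_stars(p):
        """Conventional significance marks for dense tables ('' if not significant)."""
        return "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""

    def eta_plusses(eta_sq):
        """A parallel set of marks for partial eta squared, using Cohen's cutoffs."""
        return "+++" if eta_sq >= 0.139 else "++" if eta_sq >= 0.059 else "+" if eta_sq >= 0.010 else ""

    print(p_stars(0.004), eta_plusses(0.21))   # prints: ** +++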
