A biologist's guide to statistical thinking and analysis^*

David S. Fay¹^§ and Ken Gerow²
¹Department of Molecular Biology, College of Agriculture and Natural Resources, University of Wyoming, Laramie WY 82071, USA; ²Department of Statistics, College of Arts and Sciences, University of Wyoming, Laramie WY 82071, USA

This chapter is in WormBook section:
> WormMethods
>> Basic methods for C. elegans

Table of Contents

1. The basics

1.1. Introduction
1.2. Quantifying variation in population or sample data
1.3. Quantifying statistical uncertainty
1.4. Confidence intervals
1.5. What is the best way to report variation in data?
1.6. A quick guide to interpreting different indicators of variation
1.7. The coefficient of variation
1.8. P-values
1.9. Why 0.05?

2. Comparing two means

2.1. Introduction
2.2. Understanding the t-test: a brief foray into some statistical theory
2.3. One- versus two-sample tests
2.4. One versus two tails
2.5. Equal or non-equal variances
2.6. Are the data normal enough?
2.7. Is there a minimum acceptable sample size?
2.8. Paired versus unpaired tests
2.9. The critical value approach

3. Comparisons of more than two means

3.1. Introduction
3.2. Safety through repetition
3.3. The family-wise error rate
3.4. Bonferroni-type corrections
3.5. False discovery rates
3.6. Analysis of variance
3.7. Summary of multiple comparisons methods
3.8. When are multiple comparison adjustments not required?
3.9. A philosophical argument for making no adjustments for multiple comparisons

4. Probabilities and Proportions

4.1. Introduction
4.2. Calculating simple probabilities
4.3. Calculating more-complex probabilities
4.4. The Poisson distribution
4.5. Intuitive methods for calculating probabilities
4.6. Conditional probability: calculating probabilities when events are not independent
4.7. Binomial proportions
4.8. Calculating confidence intervals for binomial proportions
4.9. Tests for differences between two binomial proportions
4.10. Tests for differences between more than one binomial proportion
4.11. Probability calculations for binomial proportions
4.12. Probability calculations when sample sizes are large relative to the population size
4.13. Tests for differences between multinomial proportions

5. Relative differences, ratios, and correlations

5.1. Comparing relative versus incremental differences
5.2. Ratio of means versus mean of ratios
5.3. Log scales
5.4. Correlation and modeling
5.5. Modeling and regression

6. Additional considerations and guidelines

6.1. When is a sample size too small?
6.2. Statistical power
6.3. Can a sample size be too large?
6.4. Dealing with outliers
6.5. Nonparametric tests
6.6. A brief word about survival
6.7. Fear not the bootstrap

7. Acknowledgments

8. References

9. Appendix A: Microsoft Excel tools

10. Appendix B: Recommended reading

11. Appendix C: Useful programs for statistical calculations

12. Appendix D: Useful websites for statistical calculations

Abstract

The proper understanding and use of statistical tools are essential to the scientific enterprise. This is true both at the level of designing one's own experiments as well as for critically evaluating studies carried out by others. Unfortunately, many researchers who are otherwise rigorous and thoughtful in their scientific approach lack sufficient knowledge of this field. This methods chapter is written with such individuals in mind. Although the majority of examples are drawn from the field of Caenorhabditis elegans biology, the concepts and practical applications are also relevant to those who work in the disciplines of molecular genetics and cell and developmental biology. Our intent has been to limit theoretical considerations to a necessary minimum and to use common examples as illustrations for statistical analysis. Our chapter includes a description of basic terms and central concepts and also contains in-depth discussions on the analysis of means, proportions, ratios, probabilities, and correlations. We also address issues related to sample size, normality, outliers, and non-parametric approaches.

1. The basics

1.1. Introduction

At the first group meeting that I attended as a new worm postdoc (1997, D.S.F.), I heard the following opinion expressed by a senior scientist in the field: “If I need to rely on statistics to prove my point, then I'm not doing the right experiment.” In fact, reading this statement today, many of us might well identify with this point of view. Our field has historically gravitated toward experiments that provide clear-cut “yes” or “no” types of answers. Yes, mutant X has a phenotype. No, mutant Y does not genetically complement mutant Z. We are perhaps even a bit suspicious of other kinds of data, which we perceive as requiring excessive hand waving. However, the realities of biological complexity, the sometimes-necessary intrusion of sophisticated experimental design, and the need for quantifying results may preclude black-and-white conclusions. Oversimplified statements can also be misleading or at least overlook important and interesting subtleties. Finally, more and more of our experimental approaches rely on large multi-faceted datasets. These types of situations may not lend themselves to straightforward interpretations or facile models. Statistics may be required.

The intent of these sections will be to provide C. elegans researchers with a practical guide to the application of statistics using examples that are relevant to our field. Namely, which common situations require statistical approaches and what are some of the appropriate methods (i.e., tests or estimation procedures) to carry out? Our intent is therefore to aid worm researchers in applying statistics to their own work, including considerations that may inform experimental design. In addition, we hope to provide reviewers and critical readers of the worm scientific literature with some criteria by which to interpret and evaluate statistical analyses carried out by others. At various points we suggest some general guidelines, which may lead to somewhat more uniformity in how our field conducts and presents statistical findings. Finally, we provide some suggestions for additional readings for those interested in a more systematic and in-depth coverage of the topics introduced (Appendix B).

1.2. Quantifying variation in population or sample data

There are numerous ways to describe and present the variation that is inherent to most data sets. Range (defined as the largest value minus the smallest) is one common measure and has the advantage of being simple and intuitive. Range, however, can be misleading because of the presence of outliers, and it tends to be larger for larger sample sizes even without unusual data values. Standard deviation (SD) is the most common way to present variation in biological data. It has the advantage that nearly everyone is familiar with the term and that its units are identical to the units of the sample measurement. Its disadvantage is that few people can recall what it actually means.

Figure 1 depicts density curves of brood sizes in two different populations of self-fertilizing hermaphrodites. Both have identical average brood sizes of 300. However, the population in Figure 1B displays considerably more inherent variation than the population in Figure 1A. Looking at the density curves, we would predict that 10 randomly selected values from the population depicted in Figure 1B would tend to show a wider range than an equivalent set from the more tightly distributed population in Figure 1A. We might also note from the shape and symmetry of the density curves that both populations are Normally¹ distributed (this is also referred to as a Gaussian distribution). In reality, most biological data do not conform to a perfect bell-shaped curve, and, in some cases, they may profoundly deviate from this ideal. Nevertheless, in many instances, the distribution of various types of data can be roughly approximated by a normal distribution. Furthermore, the normal distribution is a particularly useful concept in classical statistics (more on this later) and in this example is helpful for illustrative purposes.

Figure 1. Two normal distributions.

The vertical red lines in Figure 1A and 1B indicate one SD to either side of the mean. From this, we can see that the population in Figure 1A has a SD of 20, whereas the population in Figure 1B has a SD of 50. A useful rule of thumb is that roughly 68% of the values within a normally distributed population will reside within one SD to either side of the mean. Correspondingly, about 95% of values reside within two² SDs, and more than 99% reside within three SDs to either side of the mean. Thus, for the population in Figure 1A, we can predict that about 95% of hermaphrodites produce brood sizes between 260 and 340, whereas for the population in Figure 1B, about 95% of hermaphrodites produce brood sizes between 200 and 400.
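For readers who would like to check the rule of thumb for themselves, here is a minimal sketch using only the Python standard library. The helper name `fraction_within` is ours, chosen for illustration; the means and SDs are those given for Figure 1.

```python
from statistics import NormalDist

def fraction_within(k_sd):
    """Fraction of a normally distributed population within k_sd SDs of the mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k_sd) - z.cdf(-k_sd)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(k):.4f}")

# Brood-size ranges spanning two SDs to either side of the mean (Figure 1)
for label, sd in (("Figure 1A", 20), ("Figure 1B", 50)):
    low, high = 300 - 2 * sd, 300 + 2 * sd
    print(f"{label}: about 95% of broods between {low} and {high}")
```

The two-SD figure is itself a rounding; the exact multiplier for 95% of a normal population is closer to 1.96.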

We can rarely know the true mean or SD of a population because we cannot usually observe the entire population. Instead, we must use a sample to make an educated guess. In the case of experimental laboratory science, there is often no limit to the number of animals that we could theoretically test or the number of experimental repeats that we could perform. Admittedly, use of the term “populations” in this context can sound rather forced. It's awkward for us to think of a theoretical collection of bands on a western blot or a series of cycle numbers from a qRT-PCR experiment as a population, but from the standpoint of statistics, that's exactly what they are. Thus, our populations tend to be mythical in nature as well as infinite. Moreover, even the most sadistic advisor can only expect a finite number of biological or technical repeats to be carried out. The data that we ultimately analyze are therefore always just a tiny proportion of the population, real or theoretical, from which they came.

It is important to note that increasing our sample size will not predictably increase or decrease the amount of variation that we are ultimately likely to record. What can be stated is that a larger sample size will tend to give a sample SD that is a more accurate estimate of the population SD. In the same vein, a larger sample size will also provide a more accurate estimation of other parameters, such as the population mean.
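A small simulation can make this point concrete. The sketch below is only an illustration under stated assumptions: it draws from a hypothetical normal population with a mean of 300 and SD of 50 (the values used for Figure 1B), and the function name `sample_sds` and the choice of 300 repeats are ours.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is reproducible

POP_MEAN, POP_SD = 300, 50  # assumed population (as in Figure 1B)

def sample_sds(n, repeats=300):
    """Return the sample SDs from `repeats` independent samples of size n."""
    return [
        stdev([random.gauss(POP_MEAN, POP_SD) for _ in range(n)])
        for _ in range(repeats)
    ]

for n in (6, 60, 600):
    sds = sample_sds(n)
    print(f"n = {n:3d}: typical sample SD = {mean(sds):5.1f}, "
          f"spread among sample SDs = {stdev(sds):4.1f}")
```

In runs like this, the typical sample SD stays in the neighborhood of the population SD at every sample size, whereas the scatter of the individual SD estimates around that value narrows as n grows.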

In some cases, standard numerical summaries (e.g., mean and SD) may not be sufficient to fully or accurately describe the data. In particular, these measures usually³ tell you nothing about the shape of the underlying distribution. Figure 2 illustrates this point; Panels A and B show the duration (in seconds) of vulval muscle cell contractions in two populations of C. elegans. The data from both panels have nearly identical means and SDs, but the data from panel A are clearly bimodal, whereas the data from Panel B conform more to a normal distribution⁴. One way to present this observation would be to show the actual histograms (in a figure or supplemental figure). Alternatively, a somewhat more concise depiction, which still gets the basic point across, is shown by the individual data plot in Panel C. In any case, presenting these data simply as a mean and SD without highlighting the difference in distributions would be potentially quite misleading, as the populations would appear to be identical.

Figure 2. Two distributions with similar means and SDs. Panels A and B show histograms of simulated data of vulval muscle cell contraction durations derived from underlying populations with distributions that are either bimodal (A) or normal (B). Note that both populations have nearly identical means and SDs, despite major differences in the population distributions. Panel C displays the same information shown in the two histograms using individual data plots. Horizontally arrayed sets of dots represent repeat values.
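The situation in Figure 2 is easy to reproduce with simulated data. The following sketch does not use the values behind the figure; the cluster centers (3 and 7), cluster SD (0.5), and sample size are hypothetical choices of ours, picked so that the two populations share a mean and SD while differing in shape.

```python
import random
from statistics import mean, stdev

random.seed(2)
N = 2000

# Hypothetical bimodal population: an equal mix of two narrow normal clusters
bimodal = [random.gauss(3.0 if i % 2 == 0 else 7.0, 0.5) for i in range(N)]

# Hypothetical unimodal population with a matching mean and SD.
# The mixture SD is sqrt(0.5**2 + 2.0**2) because each cluster sits 2.0 from the center.
unimodal = [random.gauss(5.0, (0.5**2 + 2.0**2) ** 0.5) for _ in range(N)]

def text_histogram(values, low=0.0, high=10.0, bins=10):
    """Count values per equal-width bin; values outside [low, high) are ignored."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        if low <= v < high:
            counts[int((v - low) / width)] += 1
    return counts

for name, data in (("bimodal", bimodal), ("unimodal", unimodal)):
    print(f"{name}: mean = {mean(data):.2f}, SD = {stdev(data):.2f}")
    for i, count in enumerate(text_histogram(data)):
        print(f"  {i:2d} to {i + 1:2d}  {'#' * (count // 20)}")
```

The summary lines look nearly interchangeable, but the crude histograms do not, which is the reason for showing the distributions themselves when shape matters.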

1.3. Quantifying statistical uncertainty

Before you become distressed about what the title of this section actually means, let's be clear about something. Statistics, in its broadest sense, effectively does two things for us—more or less simultaneously. (1) Statistics provides us with useful quantitative descriptors for summarizing our data. This includes fairly simple stuff such as means and proportions. It also includes more complex statistics such as the correlation between related measurements, the slope of a linear regression, and the odds ratio for mortality under differing conditions. These can all be useful for interpreting our data, making informed conclusions, and constructing hypotheses for future studies. However, statistics gives us something else, too. (2) Statistics also informs us about the accuracy of the very estimates that we've made. What a deal! Not only can we obtain predictions for the population mean and other parameters, we also estimate how accurate those predictions really are. How this comes about is part of the “magic” of statistics, which as stated shouldn't be taken literally, even if it appears to be that way at times.

In the preceding section we discussed the importance of SD as a measure for describing natural variation within an entire population of worms. We also touched upon the idea that we can calculate statistics, such as SD, from a sample that is drawn from a larger population. Intuition also tells us that these two values, one corresponding to the population, the other to the sample, ought to generally be similar in magnitude, if the sample size is large. Finally, we understand that the larger the sample size, the closer our sample statistic will tend to be to the true population statistic. This is true not only for the SD but also for many other statistics as well.

It is now time to discuss SD in another context that is central to the understanding of statistics. We do this with a thought experiment. Imagine that we determine the brood size for six animals that are randomly selected from a larger population. We could then use these data to calculate a sample mean, as well as a sample SD, which would be based on a sample size of n = 6. Not being satisfied with our efforts, we repeat this approach every day for 10 days, each day obtaining a new mean and new SD (Table 1). At the end of 10 days, having obtained ten different means, we can now use each sample mean as though it were a single data point to calculate a new mean, which we can call the mean of the means. In addition, we can calculate the SD of these ten mean values, which we can refer to for now as the SD of the means. We can then pose the following question: will the SD calculated using the ten means generally turn out to be a larger or smaller value (on average) than the SD calculated from each sample of six random individuals? This is not merely an idiosyncratic question posed for intellectual curiosity. The notion of the SD of the mean is critical to statistical inference. Read on.

Table 1. Ten random samples (trials) of brood sizes.

Trial	Worm 1^a	Worm 2	Worm 3	Worm 4	Worm 5	Worm 6	Sample Mean	Sample SD	SE^b of Mean
1	218	259	271	320	266	392	287.67	60.59	24.73
2	370	237	307	358	295	318	314.17	47.81	19.52
3	324	264	343	269	304	223	287.83	44.13	18.02
4	341	343	277	374	302	308	324.17	34.93	14.26
5	293	362	296	384	270	307	318.67	44.32	18.10
6	366	301	209	336	254	295	293.50	56.25	22.96
7	325	240	304	294	260	310	288.83	32.34	13.20
8	334	327	310	346	320	233	311.67	40.43	16.51
9	339	256	240	329	230	361	292.50	56.89	23.22
10	235	300	271	300	281	253	273.33	25.96	10.60
Mean of values							299.2	44.36	18.11
SD of means							16.66
^aFor each trial, n = 6 worms were assayed for brood size. ^bSE, standard error. When applied to a mean value, also abbreviated as SEM.

Thinking about this, we may realize that the ten mean values, being averages of six worms, will tend to show less total variation than measurements from individual worms. In other words, the variation between means should be less than the variation between individual data values. Moreover, the average of these means will generally be closer to the true population mean than would a mean obtained from just six random individuals. In fact, this idea is borne out in Table 1, which used random sampling from a theoretical population (with a mean of 300 and SD of 50) to generate the sample values. We can therefore conclude that sample means will generally exhibit less variation than that seen among individual values. Furthermore, we can consider what might happen if we were to take daily samples of 20 worms instead of 6. Namely, the larger sample size would result in an even tighter cluster of mean values. This in turn would produce an even smaller SD of the means than from the experiment where only six worms were analyzed each day. Thus, increasing sample size will consistently lead to a smaller SD of the means. Note however, as discussed above, increasing sample size will not predictably lead to a smaller or larger SD for any given sample.
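The thought experiment can also be run on a computer for far more than ten days. The sketch below assumes the same theoretical population used for Table 1 (normal, mean 300, SD 50); the function name `sd_of_means` and the choice of 2,000 simulated days are ours.

```python
import random
from statistics import mean, stdev

random.seed(3)
POP_MEAN, POP_SD = 300, 50  # the theoretical population used for Table 1

def sd_of_means(n, days=2000):
    """SD of `days` sample means, each computed from a fresh sample of n worms."""
    means = [
        mean([random.gauss(POP_MEAN, POP_SD) for _ in range(n)])
        for _ in range(days)
    ]
    return stdev(means)

sd6 = sd_of_means(6)
sd20 = sd_of_means(20)
print(f"SD of individual brood sizes (population): {POP_SD}")
print(f"SD of the means, n = 6 per day:  {sd6:.1f}")
print(f"SD of the means, n = 20 per day: {sd20:.1f}")
```

Both values should come out well below the SD of individual worms, with the 20-worm means clustering more tightly than the 6-worm means.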

It turns out that this concept of calculating the SD of multiple means (or other statistical parameters) is a very important one. The good news is that rather than having to actually collect samples for ten or more days, statistical theory gives us a shortcut that allows us to estimate this value based on only a single day's effort. What a deal! Rather than calling this value the “SD of the means”, as might make sense, the field has historically chosen to call this value the “standard error of the mean” (SEM). In fact, whenever a SD is calculated for a statistic (e.g., the slope from a regression or a proportion), it is called the standard error (SE) of that statistic. SD is a term generally reserved for describing variation within a sample or population only. Although we will largely avoid the use of formulas in this review, it is worth knowing that we can estimate the SEM from a single sample of n animals using the following equation:

$$\mathrm{SEM} = \frac{\mathrm{SD}}{\sqrt{n}}$$

From this relatively simple formula⁵, we can see that the greater the SD of the sample, the greater the SEM will be. Conversely, the larger our sample size, the smaller the SEM will be. Looking back at Table 1, we can also see that the SEM estimate for each daily sample is reasonably⁶ close, on average, to what we obtained by calculating the observed SD of the means from 10 days. This is not an accident. Rather, chalk one up for statistical theory.
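The comparison can be reproduced directly from the brood sizes listed in Table 1. This is a minimal sketch using the Python standard library; it estimates each SEM as the sample SD divided by the square root of n, and the variable names are ours.

```python
from math import sqrt
from statistics import mean, stdev

# Brood sizes from Table 1 (ten trials, n = 6 worms each)
trials = [
    [218, 259, 271, 320, 266, 392],
    [370, 237, 307, 358, 295, 318],
    [324, 264, 343, 269, 304, 223],
    [341, 343, 277, 374, 302, 308],
    [293, 362, 296, 384, 270, 307],
    [366, 301, 209, 336, 254, 295],
    [325, 240, 304, 294, 260, 310],
    [334, 327, 310, 346, 320, 233],
    [339, 256, 240, 329, 230, 361],
    [235, 300, 271, 300, 281, 253],
]

sems = []
for number, broods in enumerate(trials, start=1):
    sd = stdev(broods)             # sample SD (n - 1 in the denominator)
    sem = sd / sqrt(len(broods))   # SEM estimated from this one sample
    sems.append(sem)
    print(f"trial {number:2d}: mean = {mean(broods):6.2f}, SD = {sd:5.2f}, SEM = {sem:5.2f}")

trial_means = [mean(broods) for broods in trials]
print(f"average of the ten single-sample SEM estimates: {mean(sems):.2f}")
print(f"observed SD of the ten sample means:            {stdev(trial_means):.2f}")
```

The last two lines are the numbers being compared in the text: what theory predicts from a single day's sample versus what ten days of sampling actually produced.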

Obviously, having a smaller SEM value reflects more precise estimates of the population mean. For that reason, scientists are typically motivated to keep SEM values as low as possible. In the case of experimental biology, variation within our samples may be due to inherent biological variation or to technical issues related to the methods we use. The former we probably can't control very much. The latter we may be able to control to some extent, but probably not completely. This leaves increasing sample size as a direct route to decreasing SE estimates and thus to improving the precision of the parameter estimates. However, increasing sample size in some instances may not be a practical or efficient use of our time. Furthermore, because the denominator in SE equations typically involves the square root of sample size, increasing sample size will have diminishing returns. In general, a quadrupling of sample size is required to yield a halving of the SEM. Moreover, as discussed elsewhere in this chapter, supporting very small differences with very high sample sizes might lead us to make convincing-sounding statistical statements regarding biological effects of no real importance, which is not something we should aspire to do.
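The diminishing returns follow directly from the square root in the denominator. As a brief worked line, assuming the sample SD stays roughly the same as more animals are added:

```latex
\mathrm{SEM}(4n) = \frac{\mathrm{SD}}{\sqrt{4n}}
                 = \frac{1}{2}\cdot\frac{\mathrm{SD}}{\sqrt{n}}
                 = \frac{1}{2}\,\mathrm{SEM}(n)
```

For instance, with an SD of 50, going from 6 to 24 to 96 worms moves the SEM from about 20.4 to 10.2 to 5.1; each halving costs four times as many animals as the last.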

1.4. Confidence intervals

Although SDs and SEs are all well and good, what we typically want to know is the accuracy of our parameter estimates. It turns out that SEs are the key to calculating a more directly useful measure, a confidence interval (CI). Although the transformation of SEs into CIs isn't necessarily that complex, we will generally want to let computers or calculators perform this conversion for us. That said, for sample means derived from sample sizes greater than about ten, a 95% CI will usually span about two SEMs above and below the mean⁷. When pressed for a definition, most people might say that with a 95% CI, we are 95% certain that the true value of the mean or slope (or whatever parameter we are estimating) is between the upper and lower limits of the given CI. Proper statistical semantics would more accurately state that a 95% CI procedure is such that 95% of properly calculated intervals from appropriately random samples will contain the true value of the parameter. If you can discern the difference, fine. If not, don't worry about it.
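The more formal definition becomes less abstract when simulated. The sketch below is an illustration under assumptions of ours: a normal population with mean 300 and SD 50, samples of six worms, and the standard two-sided 95% t multiplier for 5 degrees of freedom (2.571), which is larger than two because the sample is small.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(4)
POP_MEAN, POP_SD = 300, 50  # assumed population
N = 6                       # worms per sample
T_CRIT = 2.571              # two-sided 95% t critical value for N - 1 = 5 df

def ci95(sample):
    """95% CI for the mean of a sample of size N: mean +/- T_CRIT * SEM."""
    center = mean(sample)
    half_width = T_CRIT * stdev(sample) / sqrt(len(sample))
    return center - half_width, center + half_width

experiments = 10000
hits = 0
for _ in range(experiments):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    low, high = ci95(sample)
    if low <= POP_MEAN <= high:
        hits += 1  # this interval captured the true mean

coverage = hits / experiments
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds over many repeats, which is what the printed fraction approximates.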

One thing to keep in mind about CIs is that, for a given sample, a higher confidence level (e.g., 99%) will invoke intervals that are wider than those created with a lower confidence level (e.g., 90%). Think about it this way. With a given amount of information (i.e., data), if you wish to be more confident that your interval contains the parameter, then you need to make the interval wider. Thus, less confidence corresponds to a narrower interval, whereas higher confidence requires a wider interval. Generally for CIs to be useful, the range shouldn't be too great. Another thing to realize is that there is really only one way to narrow the range associated with a given confidence level, and that is to increase the sample size. As discussed above, however, diminishing returns, as well as basic questions related to biological importance of the data, should figure foremost in any decision regarding sample size.
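To see how the confidence level sets the width, the sketch below computes the number of SEs spanned on each side of the mean for three levels. It relies on the large-sample normal approximation (for small samples the t-based multipliers would be somewhat larger), and the SEM of 18.11 is simply the average value from Table 1, used here for illustration. The function name `z_multiplier` is ours.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """SEs spanned on each side of the mean (large-sample normal approximation)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

sem = 18.11  # illustrative SEM (the average value in Table 1)
for level in (0.90, 0.95, 0.99):
    half_width = z_multiplier(level) * sem
    print(f"{level:.0%} CI: mean +/- {z_multiplier(level):.3f} SEM "
          f"= mean +/- {half_width:.1f}")
```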

1.5. What is the best way to report variation in data?

Of course, the answer to this will depend on what you are trying to show and which measures of variation are most relevant to your experiment. Nevertheless, here is an important news flash: with respect to means, the SEM is often not the most informative parameter to display. This should be pretty obvious by now. SD is a good way to go if we are trying to show how much variation there is within a population or sample. CIs are highly informative if we are trying to make a statement regarding the accuracy of the estimated population mean. In contrast, SEM does neither of these things directly, yet remains very popular and is often used inappropriately (Nagele, 2003). Some statisticians have pointed out that because SEM gives the smallest of the error bars, authors may often choose SEMs for aesthetic reasons. Namely, to make their data appear less variable or to convince readers of a difference between values that might not otherwise appear to be very different. In fairness, SEM is a perfectly legitimate descriptor of variation⁸. In contrast to CIs, the size of the SEM is not an artifact of the chosen confidence level. Furthermore, unlike the CI, the validity of the SEM does not require assumptions that relate to statistical normality⁹. However, because the SEM is often less directly informative to readers, presenting either SDs or CIs can be strongly recommended for most data. Furthermore, if the intent of a figure is to compare means for differences or a lack thereof, CIs are the clear choice.

1.6. A quick guide to interpreting different indicators of variation

Figure 3 shows a bar graph containing identical (artificial) data plotted with the SD, SEM, and CI to indicate variation. Note that the SD is the largest value, followed by the CI and SEM. This will be the case for all but very small sample sizes (in which case the CI could be wider than two SDs). Remember: SD is variation among individuals, SE is the variation for a theoretical collection of sample means (acquired in an identical manner to the real sample), and CI is a rescaling of the SE so as to be able to impute confidence regarding the value of the population mean. With larger sample sizes, the SE and CI will shrink, but there is no such tendency for the SD, which tends to remain the same but can also increase or decrease somewhat in a manner that is not predictable.

Figure 3. Illustration of SD, SE, and CI as measures of variability.

Figure 4 shows two different situations for two artificial means: one in which bars do not overlap (Figure 4A), and one in which they do, albeit slightly (Figure 4B). The following are some general guidelines for interpreting error bars, which are summarized in Table 2. (1) With respect to SD, neither overlapping bars nor an absence of overlapping bars can be used to infer whether or not two sample means are significantly different. This again is because SD reflects individual variation and you simply cannot infer anything about significance of differences for the means. End of story. (2) With respect to SEM, overlapping bars (Figure 4B) can be used to infer that the sample means are not significantly different. However, the absence of overlapping bars (Figure 4A) cannot be used to infer that the sample means are different. (3) With respect to CIs, the absence of overlapping bars (Figure 4A) can be used to infer that the sample means are statistically different⁶. If the CI bars do overlap (Figure 4B), however, the answer is “maybe”. Here is why. The correct measure for comparing two means is in fact the SE of the difference between the means. In the case of equal SEMs, as illustrated in Figure 4, the SE of the difference is ∼1.4 times the SEM. To be significantly different,¹⁰ then, two means need to be separated by about twice the SE of the difference (2.8 SEMs). In contrast, visual separation using the CI bars requires a difference of four times the SEM (remember that CI ∼ 2 × SEM above and below the mean), which is larger than necessary to infer a difference between means. Therefore, a slight overlap can be present even when two means differ significantly. If they overlap a lot (e.g., the CI for Mean 1 includes Mean 2), then the two means are for sure not significantly different. If there is any uncertainty (i.e., there is some slight overlap), determination of significance is not possible; the test needs to be formally carried out.

Figure 4. Comparing means using visual measures of precision.

Table 2. General guidelines for interpreting error bars.

Error bar type	Overlapping error bars	Non-overlapping error bars
SD	no inference	no inference
SEM	sample means are not significantly different	no inference
CI	sample means may or may not be significantly different	sample means are significantly different
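The arithmetic behind the CI guideline can be written out in a few lines. This is a sketch that assumes two independent samples with equal SEMs (as in Figure 4) and uses the approximate multiplier of two for a 95% interval:

```latex
\begin{align*}
\mathrm{SE}_{\text{diff}} &= \sqrt{\mathrm{SEM}_1^2 + \mathrm{SEM}_2^2}
    = \sqrt{2}\,\mathrm{SEM} \approx 1.4\,\mathrm{SEM}
    && (\mathrm{SEM}_1 = \mathrm{SEM}_2 = \mathrm{SEM}) \\
\text{separation needed for significance} &\approx 2\,\mathrm{SE}_{\text{diff}}
    \approx 2.8\,\mathrm{SEM} \\
\text{separation needed for non-overlapping CIs} &\approx 2\,\mathrm{SEM} + 2\,\mathrm{SEM}
    = 4\,\mathrm{SEM}
\end{align*}
```

Any two means separated by between roughly 2.8 and 4 SEMs therefore fall in the "maybe" zone: the CI bars overlap, yet a formal test may still report a significant difference.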

1.7. The coefficient of variation

In some cases, it may be most relevant to describe the relative variation within a sample or population. Put another way, knowing the sample SD is really not very informative unless we also know the sample mean. Thus, a sample with a SD = 50 and mean = 100 shows considerably more relative variation than a sample with SD = 100 but mean = 10,000. To indicate the level of variation relative to the mean, we can report the coefficient of variation (CV). In the case of sample means ($\bar{x}$), this can be calculated as follows:

$$\mathrm{CV} = \frac{\mathrm{SD}}{\bar{x}}$$

Thus, low CVs indicate relatively little variation within the sample, and higher CVs indicate more variation. In addition, because units will cancel out in this equation, CV is a unitless expression. This is actually advantageous when comparing relative variation between parameters that are described using different scales or distinct types of measurements. Note, however, that in situations where the mean value is zero (or very close to zero), the CV could approach infinity and will not provide useful information. A similar warning applies in cases when data can be negative. The CV is most useful and meaningful only for positively valued data. A variation on the CV is its use as applied to a statistic (rather than to individual variation). Then its name has to reflect the statistic in question; so, for example, $\mathrm{CV}_{\bar{x}} = \mathrm{SE}_{\bar{x}} / \bar{x}$. For another example (the role of $\bar{x}$ may be confusing here), suppose one has estimated a proportion (mortality, for instance), and obtained an estimate labeled $\hat{p}$ and its SE, labeled $\mathrm{SE}_{\hat{p}}$. Then

$$\mathrm{CV}_{\hat{p}} = \frac{\mathrm{SE}_{\hat{p}}}{\hat{p}}$$
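As a quick numerical check of the examples in this section, here is a minimal sketch. The function name `coefficient_of_variation` is ours, the refusal to compute a CV for a zero or negative mean reflects the warning above, and the brood sizes are those of Trial 1 in Table 1.

```python
from statistics import mean, stdev

def coefficient_of_variation(sd, center):
    """CV = SD / mean; only meaningful for positively valued data."""
    if center <= 0:
        raise ValueError("CV is not informative when the mean is zero or negative")
    return sd / center

print(coefficient_of_variation(50, 100))      # 0.5
print(coefficient_of_variation(100, 10_000))  # 0.01

# CV for the brood sizes of Trial 1 in Table 1
trial_1 = [218, 259, 271, 320, 266, 392]
cv = coefficient_of_variation(stdev(trial_1), mean(trial_1))
print(f"Trial 1 CV: {cv:.3f}")
```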

1.8. P-values

Most statistical tests culminate in a statement regarding the P-value, without which reviewers or readers may feel shortchanged. The P-value is commonly defined as the probability of obtaining a result (more formally a test statistic) that is at least as extreme as the one observed, assuming that the null hypothesis is true. Here, the specific null hypothesis will depend on the nature of the experiment. In general, the null hypothesis is the statistical equivalent of the “innocent until proven guilty” convention of the judicial system. For example, we may be testing a mutant that we suspect changes the ratio of male-to-hermaphrodite cross-progeny following mating. In this case, the null hypothesis is that the mutant does not differ from wild type, where the sex ratio is established to be 1:1. More directly, the null hypothesis is that the sex ratio in mutants is 1:1. Furthermore, the complement of the null hypothesis, known as the experimental or alternative hypothesis, would be that the sex ratio in mutants is different than that in wild type or is something other than 1:1. For this experiment, showing that the ratio in mutants is significantly different than 1:1 would constitute a finding of interest. Here, use of the term “significantly” is short-hand for a particular technical meaning, namely that the result is statistically significant, which in turn implies only that the observed difference appears to be real and is not due only to random chance in the sample(s). Whether or not a result that is statistically significant is also biologically significant is another question. Moreover, the term significant is not an ideal one, but because of long-standing convention, we are stuck with it. Statistically plausible or statistically supported may in fact be better terms.

Getting back to P-values, let's imagine that in an experiment with mutants, 40% of cross-progeny are observed to be males, whereas 60% are hermaphrodites. A statistical significance test then informs us that for this experiment, P = 0.25. We interpret this to mean that even if there were no actual difference between the mutant and wild type with respect to their sex ratios, we would still expect to see deviations as great as, or greater than, a 6:4 ratio in 25% of our experiments. Put another way, if we were to replicate this experiment 100 times, random chance would lead to ratios at least as extreme as 6:4 in about 25 of those experiments. Of course, you may well wonder how it is possible to extrapolate from one experiment to make conclusions about what (approximately) the next 99 experiments will look like. (Short answer: There is well-established statistical theory behind this extrapolation that is similar in nature to our discussion on the SEM.) In any case, a large P-value, such as 0.25, is a red flag and leaves us unconvinced of a difference. It is, however, possible that a true difference exists but that our experiment failed to detect it (because of a small sample size, for instance). In contrast, suppose we found a sex ratio of 6:4, but with a corresponding P-value of 0.001 (this experiment likely had a much larger sample size than did the first). In this case, the likelihood that pure chance has conspired to produce a deviation from the 1:1 ratio as great or greater than 6:4 is very small, 1 in 1,000 to be exact. Because this is very unlikely, we would conclude that the null hypothesis is not supported and that mutants really do differ in their sex ratio from wild type. Such a finding would therefore be described as statistically significant on the basis of the associated low P-value.
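To illustrate how the same 6:4 ratio can produce very different P-values, the sketch below computes an exact two-sided binomial P-value against the 1:1 null hypothesis. The text does not specify which test or sample sizes lay behind the quoted P-values, so the test choice, the totals of 40, 100, and 270 progeny, and the function name `two_sided_binomial_p` are all assumptions of ours.

```python
from math import comb

def two_sided_binomial_p(males, total):
    """Exact two-sided P-value for the null hypothesis P(male) = 0.5.

    Sums the probabilities of every outcome at least as far from an even
    split as the one observed.
    """
    observed = abs(2 * males - total)  # distance from an even split, doubled
    extreme = sum(
        comb(total, k)
        for k in range(total + 1)
        if abs(2 * k - total) >= observed
    )
    return extreme / 2 ** total

# Hypothetical numbers of cross-progeny scored, each with exactly 40% males
for total in (40, 100, 270):
    males = total * 2 // 5
    p_value = two_sided_binomial_p(males, total)
    print(f"{males} males : {total - males} hermaphrodites (n = {total}): P = {p_value:.4f}")
```

The observed ratio is identical in every row; only the amount of data changes, and with it the plausibility that chance alone produced the deviation.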

1.9. Why 0.05?

There is a long-standing convention in biology that P-values that are ≤0.05 are considered to be significant, whereas P-values that are >0.05 are not significant¹¹. Of course, common sense would dictate that there is no rational reason for anointing any specific number as a universal cutoff, below or above which results must either be celebrated or condemned. Can anyone imagine a convincing argument by someone stating that they will believe a finding if the P-value is 0.04 but not if it is 0.06? Even a P-value of 0.10 suggests a finding for which there is some chance that it is real.

So why impose “cutoffs”, which are often referred to as the chosen α level, of any kind? Well, for one thing, it makes life simpler for reviewers and readers who may not want to agonize over personal judgments regarding every P-value in every experiment. It could also be argued that, much like speed limits, there needs to be an agreed-upon cutoff. Even if driving at 76 mph isn't much more dangerous than driving at 75 mph, one does have to consider public safety. In the case of science, the apparent danger is that too many false-positive findings may enter the literature and become dogma. Noting that the imposition of a reasonable, if arbitrary, cutoff is likely to do little to prevent the publication of dubious findings is probably irrelevant at this point.

The key is not to change the chosen cutoff; we have no better suggestion¹² than 0.05. The key is for readers to understand that there is nothing special about 0.05 and, most importantly, to look beyond P-values to determine whether or not the experiments are well controlled and the results are of biological interest. It is also often more informative to include actual P-values rather than simply stating P ≤ 0.05; a result where P = 0.049 is roughly three times more likely to have occurred by chance than when P = 0.016, yet both are typically reported as P ≤ 0.05. Moreover, reporting the results of statistical tests as P ≤ 0.05 (or any number) is a holdover from the days when computing exact P-values was much more difficult. Finally, if a finding is of interest and the experiment is technically sound, reviewers need not skewer a result or insist on authors discarding the data just because P = 0.07. Judgment and common sense should always take precedence over an arbitrary number.

¹In theory, we could always capitalize “Normal” to emphasize its role as the name of a distribution, not a reference to “normal”, meaning usual or typical. However, most texts don't bother and so we won't either.

²A useful addendum: Four SDs captures the range of most (here, formally 95%) data values; it turns out this is casually true for the distribution for most real-life variables (i.e., not only those that are normally distributed). Most (but not quite all) of the values will span a range of approximately four SDs.

³For example, in many instances, data values are known to be composed of only non-negative values. In that instance, if the coefficient of variation (SD/mean) is greater than ∼0.6, this would indicate that the distribution is skewed right.

⁴Indeed, the data from Panel B were generated from a normal distribution. However, you can see that the distribution of the sample won't necessarily be perfectly symmetric and bell-shaped, though it is close. Also note that just because the distribution in Panel A is bimodal does not imply that classical statistical methods are inapplicable. In fact, a simulation study based on those data showed that the distribution of the sample mean was indeed very close to normal, so a usual t-based confidence interval or test would be valid. This is so because of the large sample size and is a predictable consequence of the Central Limit Theorem (see Section 2 for a more detailed discussion).

⁵We note that the SE formula shown here is for the SE of a mean from a random sample. Changing the sample design (e.g., using stratified sampling) or choosing a different statistic requires the use of a different formula.

⁶Our simulation had only ten random samples of size six. Had we used a much larger number of trials (e.g., 100 instead of 10), these two values would have been much closer to each other.

⁷This calculation (two times the SE) is sometimes called the margin of error for the CI.

⁸Indeed, given the ubiquity of “95%” as a usual choice for confidence level, and applying the concept in Footnote 2, a quick-and-dirty “pretty darn sure” (PDS) CI can be constructed by using 2 times the SE as the margin of error. This will approximately coincide with a 95% CI under many circumstances, as long as the sample size is not small.

⁹The requirement for normality in the context of various tests will be discussed in later sections.

¹⁰Here meaning by a statistical test where the P-value cutoff or “alpha level” (α) is 0.05.

¹¹R.A. Fisher, a giant in the field of statistics, chose this value as being meaningful for the agricultural experiments with which he worked in the 1920s.

¹²Although one of us is in favor of 0.056, as it coincides with his age (modulo a factor of 1000).

¹³The term “statistically significant”, when applied to the results of a statistical test for a difference between two means, implies only that it is plausible that the observed difference (i.e., the difference that arises from the data) likely represents a difference that is real. It does not imply that the difference is “biologically significant” (i.e., important). A better phrase would be “statistically plausible” or perhaps “statistically supported”. Unfortunately, “statistically significant” (in use often shortened to just “significant”) is so heavily entrenched that it is unlikely we can unseat it. It's worth a try, though. Join us, won't you?

¹⁴When William Gosset introduced the test, it was in the context of his work for Guinness Brewery. To prevent the dissemination of trade secrets and/or to hide the fact that they employed statisticians, the company at that time had prohibited the publication of any articles by their employees. Gosset was allowed an exception, but the higher-ups insisted that he use a pseudonym. He chose the unlikely moniker “Student”.

¹⁵These are measured by the number of pixels showing fluorescence in a viewing area of a specified size. We will use “billions of pixels” as our unit of measurement.

¹⁶More accurately, it is the distribution of the underlying populations that we are really concerned with, although this can usually only be inferred from the sample data.

¹⁷For data sets with distributions that are perfectly symmetric, the skewness will be zero. In this case the mean and median of the data set are identical. For left-skewed distributions, the mean is less than the median and the skewness will be a negative number. For right-skewed distributions, the mean is more than the median and the skewness will be a positive number.

¹⁸Kurtosis describes the shape or “peakedness” of the data set. In the case of a normal distribution, this number is zero. Distributions with relatively sharp peaks and long tails will have a positive kurtosis value whereas distributions with relatively flat peaks and short tails will have a negative kurtosis value.

¹⁹A-squared (A²) refers to a numerical value produced by the Anderson-Darling test for normality. The test ultimately generates an approximate P-value where the null hypothesis is that the data are derived from a population that is normal. In the case of the data in Figure 5, the conclusion is that there is a < 0.5% chance that the sample data were derived from a normal population. The conclusion of non-normality can also be reached informally by a visual inspection of the histograms. The Anderson-Darling test does not indicate whether test statistics generated by the sample data will be sufficiently normal.

²⁰The list is long, but it includes coefficients in regression models and estimated binomial proportions (and differences in proportions from two independent samples). For an illustration of this phenomenon for proportions, see Figure 12 and discussion thereof.

²¹There are actually many Central Limit Theorems, each with the same conclusion: normality prevails for the distribution of the statistic under consideration. Why many? This is so mainly because details of the proof of the theorem depend on the particular statistical context.

²²And, as we all know, good judgment comes from experience, and experience comes from bad judgment.

²³Meaning reasons based on prior experience.

²⁴Meaning “after the fact”.

²⁵Also see discussion on sample sizes (Section 2.7) and Section 5 for a more complete discussion of issues related to western blots.

²⁶This is due to a statistical “law of gravity” called the Central Limit Theorem: as the sample size gets larger, the distribution of the sample mean (i.e., the distribution you would get if you repeated the study ad infinitum) becomes more and more like a normal distribution.

²⁷Estimated from the data; again, this is also called the SEDM.

²⁸In contrast, you can, with data from sample sizes that are not too small, ask whether they (the data and, hence, the population from whence they came) are normal enough. Judging this requires experience, but, in essence, the larger the sample size, the less normal the distribution can be without causing much concern.

Comparison	FDR Critical Value	P-values (Data Set #1)	P-values (Data Set #2)	P-values (Data Set #3)
1	0.005	0.001*	0.004*	0.006†
2	0.010	0.003*	0.008*	0.008
3	0.015	0.012*	0.014*	0.011
4	0.020	0.015*	0.048†	0.019
5	0.025	0.019*	0.210	0.020
6	0.030	0.022*	0.346	0.025
7	0.035	0.034*	0.719	0.111
8	0.040	0.056†	0.754	0.577
9	0.045	0.127	0.810	0.636
10	0.050	0.633	0.985	0.731

†The first P-value in each data set that is larger than the significance threshold (i.e., the FDR critical value).

*Comparisons that were declared significant by the method.
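The asterisks in the table can be reproduced with a short sketch. It assumes the stop-at-first-failure reading implied by the table notes: rank the P-values, compare the i-th smallest with the critical value i × 0.05/10, and stop at the first P-value that exceeds its critical value. The function name is hypothetical. Note that the Benjamini-Hochberg procedure as usually defined is a step-up method (it looks for the largest rank that meets its critical value) and can therefore declare more comparisons significant than this reading does.

```python
def count_significant(p_values, alpha=0.05):
    """Count comparisons declared significant by the sequential rule.

    Ranks the P-values from smallest to largest, compares the i-th one with
    the critical value i * alpha / m, and stops at the first P-value that
    exceeds its critical value.
    """
    m = len(p_values)
    significant = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p > rank * alpha / m:
            break
        significant += 1
    return significant

# P-values copied from the table above.
data_sets = {
    "Data Set #1": [0.001, 0.003, 0.012, 0.015, 0.019, 0.022, 0.034, 0.056, 0.127, 0.633],
    "Data Set #2": [0.004, 0.008, 0.014, 0.048, 0.210, 0.346, 0.719, 0.754, 0.810, 0.985],
    "Data Set #3": [0.006, 0.008, 0.011, 0.019, 0.020, 0.025, 0.111, 0.577, 0.636, 0.731],
}

for name, p_values in data_sets.items():
    print(name, count_significant(p_values))
```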

²⁹This discussion assumes that the null hypothesis (of no difference) is true in all cases.

³⁰Notice that this is the Bonferroni critical value against which all P-values would be compared.

³¹If the null hypothesis is true, P-values are random values, uniformly distributed between 0 and 1.

³²The name is a bit unfortunate in that all of statistics is devoted to analyzing variance and ascribing it to random sources or certain modeled effects.

³³These are referred to in the official ANOVA vernacular as treatment groups.

³⁴This is true supposing that none are in fact real.

# of F2s/F1	Likelihood of cloning at least one m/m	# F1 plates	# F2 plates	Total # plates	Expected # m/m isolated (f = 0.01)	Efficiency (# m/m per total), f = 0.01	Expected # m/m isolated (f = 0.001)	Efficiency (# m/m per total), f = 0.001
1	25.0%	1000	1000	2000	2.50	1.25e-3	0.250	1.25e-4
2	43.8%	1000	2000	3000	4.38	1.46e-3	0.438	1.46e-4
3	57.8%	1000	3000	4000	5.78	1.45e-3	0.578	1.45e-4
4	68.4%	1000	4000	5000	6.84	1.37e-3	0.684	1.37e-4
5	76.3%	1000	5000	6000	7.63	1.27e-3	0.763	1.27e-4
6	82.2%	1000	6000	7000	8.22	1.17e-3	0.822	1.17e-4
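The rows of this table can be regenerated with a few lines of arithmetic. The sketch below assumes that each F2 cloned from an m/+ F1 has a 1/4 chance of being m/m (consistent with the 25.0% in the first row), that f is the fraction of F1 plates whose parent carries the mutation, and that efficiency is the expected number of m/m isolates divided by the total number of plates. The function names are hypothetical.

```python
def p_at_least_one_homozygote(f2_per_f1: int) -> float:
    """Chance that at least one cloned F2 is m/m, given an m/+ F1 parent."""
    return 1 - 0.75 ** f2_per_f1

def screen_row(f2_per_f1: int, f: float, f1_plates: int = 1000):
    """Return (total plates, expected # m/m isolated, efficiency) for one row."""
    f2_plates = f1_plates * f2_per_f1
    total_plates = f1_plates + f2_plates
    expected = f1_plates * f * p_at_least_one_homozygote(f2_per_f1)
    return total_plates, expected, expected / total_plates

for n in range(1, 7):
    total, expected, efficiency = screen_row(n, f=0.01)
    print(f"{n}  {p_at_least_one_homozygote(n):.1%}  {total}  {expected:.2f}  {efficiency:.2e}")
```

Under these assumptions the efficiency peaks at two F2s per F1 and falls off slowly afterwards, whichever value of f is used.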

³⁵Proofs showing this abound on the internet.

³⁶Although this may seem intuitive, we can calculate this using some of the formulae discussed above. Namely, this boils down to a combinations problem described by the formula: Pr(combination) = (# permutations) (probability of any single permutation). For six events in 20, the number of permutations = 20!/(6! 14!) = 38,760. The probability of any single permutation = (0.08)⁶ (0.92)¹⁴ = 8.16e-8. Multiplying these together we obtain a value of 0.00316. Thus, there is about a 0.3% chance of observing six events if the events are indeed random and independent of each other. Of course, what we really want to know is the chance of observing at least six events, so we also need to include (by simple addition) the probabilities of observing 7, 8, …20 events. For seven events, the probability is only 0.000549, and this number continues to decrease precipitously with increasing numbers of events. Thus, the chance of observing at least six events is still <0.4%, and thus we would suspect that the Poisson distribution does not accurately model our event.
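The arithmetic in this footnote can be checked with a minimal sketch using only the Python standard library. The numbers (20 trials, a per-trial probability of 0.08, six events) come from the footnote; the function and variable names are illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.08

arrangements = comb(n, 6)                 # 20!/(6! 14!)
single = p ** 6 * (1 - p) ** 14           # probability of any one arrangement
exactly_six = binomial_pmf(6, n, p)       # product of the two values above
at_least_six = sum(binomial_pmf(k, n, p) for k in range(6, n + 1))

print(arrangements, single, exactly_six, at_least_six)
```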

³⁷Given that Table 4 indicates that the optimal number of F2s is between 2 and 3.

³⁸http://shahamlab.rockefeller.edu/cgi-bin/Genetic_screens/screenfrontpage.cgi

³⁹Calculating the probability that a medical patient has a particular (often rare) disease given a positive diagnostic test result is a classic example used to illustrate the utility of Bayes' Theorem. Two complementary examples on the web can be found at: http://vassarstats.net/bayes.html and http://www.tc3.edu/instruct/sbrown/stat/falsepos.htm.

⁴⁰This will often be stated in terms of a margin of error rather than the scientific formalism of a confidence interval.

⁴¹The reasons for this are complex and due in large part to the demonstrated odd behavior of proportions (See Agresti and Coull 1998, Agresti and Caffo 2000; and Brown et al., 2001).

⁴²A more precise description of the A-C method is to add the square of the appropriate z-value to the denominator and half of the square of the z-value to the numerator. Conveniently, for the 95% CI, the z-value is 1.96 and thus we add 1.96² = 3.84 (rounded to 4) to the denominator and 3.84/2 = 1.92 (rounded to 2) to the numerator. For a 99% A-C CI, we would add 6.6 (2.575²) to the denominator and 3.3 (6.6/2) to the numerator. Note that many programs will not accept anything other than integers (whole numbers) for the number of successes and failures and so rounding is necessary.
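As a rough illustration of the adjustment just described, the sketch below computes an Agresti-Coull interval without rounding z² to a whole number. The function name and the example counts (10 successes in 50 trials) are hypothetical and are not taken from the chapter.

```python
from math import sqrt
from statistics import NormalDist

def agresti_coull_ci(successes: int, trials: int, confidence: float = 0.95):
    """Agresti-Coull confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n_adj = trials + z ** 2                  # add z^2 to the denominator
    p_adj = (successes + z ** 2 / 2) / n_adj  # add z^2 / 2 to the numerator
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to the range of valid proportions.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical example: 10 successes in 50 trials.
print(agresti_coull_ci(10, 50))
print(agresti_coull_ci(10, 50, confidence=0.99))
```

Because no rounding is applied here, the results may differ slightly from those of calculators that add exactly 2 successes and 2 failures.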

⁴³Other assumptions for the binomial include random sampling, independence of trials, and a total of two possible outcomes.

⁴⁴Card counters in Las Vegas use this premise to predict the probability of future outcomes to inform their betting strategies, which makes them unpopular with casino owners.

⁴⁵Note that the numbers you will need to enter for each method are slightly different. The binomial calculators will require you to enter the probability of a success (0.00760), the number of trials (1,000), and the number of successes (13). The hyper-geometric calculator will require you to enter the population size (20,000), the number of successes in the population (152), the sample size (1,000), and the number of successes in the sample (13). Also note that because of the computational intensity of the hyper-geometric approach, many websites will not accommodate a population size of >1,000. One website that will handle larger populations (http://keisan.casio.com/has10/SpecExec.cgi?id=system/2006/1180573202) may use an approximation method.
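For readers who prefer not to depend on a website, both calculations can be done exactly with integer arithmetic. The sketch below uses the inputs listed in this footnote and computes the probability of exactly 13 successes; whether the original example concerned exactly 13 or at least 13 is not stated here, so that choice is an assumption, as are the function names.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def hypergeometric_pmf(k: int, sample_size: int, pop_successes: int, pop_size: int) -> float:
    """Probability of exactly k successes when sampling without replacement."""
    ways = comb(pop_successes, k) * comb(pop_size - pop_successes, sample_size - k)
    return ways / comb(pop_size, sample_size)

binom = binomial_pmf(13, 1000, 0.00760)
hyper = hypergeometric_pmf(13, 1000, 152, 20000)
print(binom, hyper)
```

Python integers have arbitrary precision, so a population size of 20,000 poses no difficulty for the exact hyper-geometric calculation.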

	Before	After	Relative Increase
	$100,000	$400,000	4
	$300,000	$600,000	2
Means	$200,000	$500,000	MoR↓
	RoM→	2.5	3
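The two summary values in the table, the ratio of means (RoM) and the mean of ratios (MoR), can be reproduced directly from the four dollar amounts. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean

before = [100_000, 300_000]
after = [400_000, 600_000]

# Ratio of means: average each column first, then divide.
ratio_of_means = mean(after) / mean(before)

# Mean of ratios: divide within each row first, then average.
mean_of_ratios = mean(a / b for a, b in zip(after, before))

print(ratio_of_means, mean_of_ratios)
```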

⁴⁶Admittedly, standard western blots would also contain an additional probe to control for loading variability, but this has been omitted for simplification purposes and would not change the analysis following adjustments for differences in loading.

⁴⁷A similar, although perhaps slightly less stringent argument, can be made against averaging cycle numbers from independent qRT-PCR runs. Admittedly, if cDNA template loading is well controlled, qRT-PCR cycle numbers are not as prone to the same arbitrary and dramatic swings as bands on a western. However, subtle differences in the quality or amount of the template, chemical reagents, enzymes, and cycler runs can conspire to produce substantial differences between experiments.

⁴⁸This Excel tool was developed by KG.

⁴⁹The maximum possible P-values can be inferred from the CIs. For example, if a 99% CI does not encompass the number one, the ratio expected if no difference existed, then you can be sure the P-value from a two-tailed test is <0.01.

⁵⁰Admittedly, there is nothing particularly “natural” sounding about 2.718281828…

⁵¹An example of this is described in Doitsidou et al., 2007.

⁵²In the case of no correlation, the least-squares fit (which you will read about in a moment) will be a straight line with a slope of zero (i.e., a horizontal line). Generally speaking, even when there is no real correlation, however, the slope will always be a non-zero number because of chance sampling effects.

⁵³For example, nations that supplement their water with fluoride have higher cancer rates. The reason is not because fluoride is mutagenic. It is because fluoride supplements are carried out by wealthier countries where health care is better and people live longer. Since cancer is largely a disease of old age, increased cancer rates in this case simply reflect a wealthier long-lived population. There is no meaningful cause and effect. On a separate note, it would not be terribly surprising to learn that people who write chapters on statistics have an increased tendency to become psychologically unhinged (a positive correlation). One possibility is that the very endeavor of writing about statistics results in authors becoming mentally imbalanced. Alternatively, volunteering to write a statistics chapter might be a symptom of some underlying psychosis. In these scenarios cause and effect could be occurring, but we don't know which is the cause and which is the effect.

⁵⁴In truth, SD is affected very slightly by sample size, hence SD is considered to be a “biased” estimator of variation. The effect, however, is small and generally ignored by most introductory texts. The same is true for the correlation coefficient, r.

⁵⁵Rea et al. (2005) Nat. Genet. 37, 894-898. In this case, the investigators did not conclude causation but nevertheless suggested that the reporter levels may reflect a physiological state that leads to greater longevity and robust health. Furthermore, based on the worm-sorting methods used, linear regression was not an applicable outcome of their analysis.

⁵⁶The standard form of simple linear regression equations takes the form y = b₁x + b₀, where y is the predicted value for the response variable, x is the predictor variable, b₁ is the slope coefficient, and b₀ is the y-axis intercept. Thus, because b₁ and b₀ are known constants, by plugging in a value for x, y can be predicted. For simple linear regression where the fit is a straight line, the slope coefficient will be the same as that derived using the least-squares method.

⁵⁷Although seemingly nonsensical, the output of a linear regression equation can be a curved line. The confusion is the result of a difference between the common non-technical and mathematical uses of the term “linear”. To generate a curve, one can introduce an exponent, such as a square, to the predictor variable (e.g., x²). Thus, the equation could look like this: y = b₁x² + b₀.

⁵⁸A multiple regression equation might look something like this: Y = b₁X₁ + b₂X₂ - b₃X₃ + b₀, where X₁ to X₃ represent different predictor variables, b₁ to b₃ represent different slope coefficients determined by the regression analysis, and b₀ is the Y-axis intercept. Plugging in the values for X₁ to X₃, Y could thus be predicted.

⁵⁹Even without the use of logistic regression, I can predict with near 100% certainty that I will never agree to author another chapter on statistics! (DF)

⁶⁰An online document describing this issue is available at: firstclinical.com/journal/2007/0703_Power.pdf. In addition, a recent critical analysis of this issue is provided by Bacchetti (2010).

⁶¹The median is essentially a trimmed mean where the trimming approaches 100%!

⁶²More accurately, the tests assume that the populations are normal enough and that the sample size is large enough such that the distribution of the calculated statistic itself will be normal. This was discussed in Section 1.

⁶³Note that there are subtle variations on this theme, which (depending on the text or source) may go by the same name. These can be used to test for differences in additional statistical parameters such as the median.

⁶⁴More accurately, nonparametric tests will be less powerful than parametric tests if both tests were to be simultaneously carried out on a dataset that was normal. The diminished power of nonparametric tests in these situations is particularly exacerbated if sample sizes are small. Obviously, if the data were indeed normal, one would hopefully be aware of this and would apply a parametric test. Conversely, nonparametric tests can actually be more powerful than parametric tests when applied to data that are truly non-Gaussian. Of course, if the data are far from Gaussian, then the parametric tests likely wouldn't even be valid. Thus, each type of test is actually “better” or “best” when it is used for its intended purpose.

⁶⁵For example, strains A and B might have six and twelve animals remaining on day 5, respectively. If a total of three animals died by the next day (two from strain A and one from strain B) the expected number of deaths for strain B would be twice that of A, since the population on day 5 was twice that of strain A. Namely, two deaths would be expected for strain B and one for strain A. Thus, the difference between expected and observed deaths for strains A and B would be 1 (2−1=1) and −1 (1−2=−1), respectively.
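The bookkeeping in this example is simple enough to sketch in a few lines: deaths in each interval are allocated to the strains in proportion to the number of animals still at risk, and the observed count is compared with that expectation. The function and variable names are hypothetical; the counts are those given in the footnote.

```python
def expected_deaths(at_risk: dict, total_deaths: int) -> dict:
    """Split the total deaths among strains in proportion to animals at risk."""
    total_at_risk = sum(at_risk.values())
    return {strain: total_deaths * n / total_at_risk for strain, n in at_risk.items()}

at_risk = {"A": 6, "B": 12}    # animals remaining on day 5
observed = {"A": 2, "B": 1}    # deaths recorded by the next day

expected = expected_deaths(at_risk, sum(observed.values()))
differences = {strain: observed[strain] - expected[strain] for strain in at_risk}

print(expected)      # expected deaths per strain
print(differences)   # observed minus expected, per strain
```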

⁶⁶The final calculation also takes into account sample size and the variance for each sample.

⁶⁷Note that, although not as often used, there are also parametric bootstrapping methods.

⁶⁸This is a bad idea in practice. For some statistical parameters, such as SE, several hundred repetitions may be sufficient to give reliable results. For others, such as CIs, several thousand or more repetitions may be necessary. Moreover, because it takes a computer only two more seconds to carry out 4,000 repetitions than it takes for 300, there is no particular reason to scrimp.

⁶⁹Note that this version of the procedure, the percentile bootstrap, differs slightly from the standard bootstrapping method, the bias corrected and accelerated bootstrap (BCa). Differences are due to a potential for slight bias in the percentile bootstrapping procedure that are not worth discussing in this context. Also, don't be unduly put off by the term “bias”. SD is also a “biased” statistical parameter, as are many others. The BCa method compensates for this bias and also adjusts for skewness when necessary.
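For orientation, a percentile bootstrap CI for a mean might be sketched as follows. The data values, the function name, and the fixed random seed are all hypothetical, and this is the simple percentile version rather than the BCa method; it is meant only to show the resampling logic.

```python
import random
from statistics import mean

def percentile_bootstrap_ci(data, stat=mean, repetitions=4000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Resamples the data with replacement, recomputes the statistic each time,
    and reads the interval off the sorted bootstrap estimates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(repetitions))
    tail = (1 - confidence) / 2
    lower = estimates[round(tail * repetitions)]
    upper = estimates[round((1 - tail) * repetitions) - 1]
    return lower, upper

# Hypothetical measurements, used only for illustration.
sample = [12, 15, 9, 22, 17, 14, 30, 11, 16, 13]
print(mean(sample), percentile_bootstrap_ci(sample))
```

Passing a different function as `stat` (for example, `statistics.median`) gives the corresponding interval for that statistic with no other changes.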

⁷⁰A brief disclaimer. Like everything else in statistics, there are some caveats to bootstrapping along with limitations and guidelines that one should become familiar with before diving into the deep end.

^*Edited by Oliver Hobert. Last revised January 14, 2013, Published July 9, 2013. This chapter should be cited as: Fay D.S. and Gerow K. A biologist's guide to statistical thinking and analysis (July 9, 2013), WormBook, ed. The C. elegans Research Community, WormBook, doi/10.1895/wormbook.1.159.1, http://www.wormbook.org.

Copyright: © 2013 David S. Fay and Ken Gerow. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

^§To whom correspondence should be addressed. Email: davidfay@uwyo.edu

Creative Commons License All WormBook content, except where otherwise noted, is licensed under a Creative Commons Attribution License.
