# A practical introduction to medical statistics

## Abstract

### Key content

• A few key principles are introduced that need to be understood before inferential statistical procedures can be applied or interpreted.
• The ‘inputs’ and ‘outputs’ of a generalised univariate statistical analysis are outlined.
• The greater value of interval estimation, over significance testing, is emphasised.
• Distinction is made between statistical significance and clinical importance.
• Examples of commonly used two-group analyses for independent samples are explained and discussed.

### Learning objectives

• To understand the principles underpinning the application of inferential statistical methods.
• To be able to interpret and apply a few commonly used statistical procedures.
• To identify when parametric and non-parametric tests are suitable to apply to data.
• To distinguish between an odds ratio and a risk ratio.

### Ethical issues

• A study that is statistically flawed is ethically flawed.

## Introduction

Medical statistics is a vast and ever-growing field of academic endeavour, with direct application to developing the robustness of the evidence base in all areas of medicine. Although the complexity of available statistical techniques has continued to increase, fuelled by the rapid data processing capabilities of even desktop/laptop computers, medical practitioners can go a long way towards creating, critically evaluating and assimilating this evidence with an understanding of just a few key statistical concepts. While the concepts of statistics and ethics are not common bedfellows, it should be emphasised that a statistically flawed study is also an unethical study.[1] This review will outline some of these key concepts and explain how to interpret the output of some commonly used statistical analyses. Examples will be confined to two-group tests on independent samples, using both a continuous and a dichotomous/binary outcome measure.

## Some fundamentals

### Populations and samples

If you have access to the relevant data for all individuals in whom you are interested, then at the simplest level, there is no need for statistical ‘tests’, because you have sampled the entire population of interest. All you need do is summarise the data appropriately. This is rarely a realistic scenario. In almost all medical research, we undertake measurements on a convenient but, we hope, representative sample of patients/research participants; we then analyse these data and seek to make generalisations to the population from which the sample was drawn. Examples could be all pregnant women, all new mothers with a specific postnatal condition of interest, all consultant obstetricians and so on. It is important to be clear about the characteristics of the population of interest. This clarity is achieved by the use of appropriate and justified inclusion and exclusion criteria when recruiting the study sample. The concept of ‘representativeness’ is a tricky one, and there is no space to explore it further here, but it is in the context of this concept that random selection and random allocation assume their importance in statistics. Having said this, statistical analyses can be conducted when samples are not randomly selected,[2] and when interventions are not randomly assigned, although they may require more careful interpretation and/or more refined analysis.

### Levels of data/data types

Statisticians and statistical analysis programs generally require all data for analysis to be coded in numerical form. Some kinds of data will naturally be in numerical form (e.g. systolic blood pressure, fasting plasma glucose concentration, baby's birthweight), while other observations may readily be coded into numerical form (e.g. none, mild, moderate, severe: coded to 0, 1, 2, 3; survived, died: coded to 0, 1). In some cases, however, a long and rigorous development and validation process is required in order to capture complex concepts in numerical form (e.g. anxiety/depression, mobility/function and quality of life). It is important to understand the nature and limitations of the ‘numbers’ you are analysing:

• Ratio data: constant units along the scale, with a meaningful zero. This category will include most physical quantities – mass, length, pressure and so on.
• Interval data: constant units along the scale, but no meaningful zero. The data from many compound assessment tools (e.g. for quality of life, anxiety/depression or mobility/function) are analysed as if they conform to this type. In many situations this is a valid approach, although in reality these tools generate data of the next type.
• Ordinal data: a progressive, directional scale but with no guarantee of consistency of unit size (e.g. pain score, satisfaction scores and some compound measures).
• Nominal data: categorical, with no meaningful directional scale (e.g. type of delivery or ethnicity).

The above data types form a hierarchy, with the analytical options becoming increasingly restricted as we move from ratio to nominal data. Two other data types are worth mentioning, which could be accommodated in this hierarchy but have their own analytical methods: binary data (survived/died, not diseased/diseased; generally coded 0/1) and count data (number of previous pregnancies, frequency of asthma attacks and so on).

### Descriptive/summary statistics

Continuous (ratio/interval) data are commonly summarised using the mean as a measure of location/average/central tendency and the standard deviation (SD) as a measure of spread/dispersion. Strictly speaking, the standard deviation is only readily interpretable as a measure of spread if the data approximate reasonably well to a normal distribution. In this case, approximately 68% of the data values will be within one SD of the mean. If the distribution of the data departs markedly from the ‘bell’ shape of the normal distribution, then this interpretation no longer holds.

For non-normal ratio/interval data and for ordinal data it is more appropriate to use the order statistics: minimum, first quartile, median, third quartile, maximum – expressed in that order. The interquartile range (third quartile – first quartile) will contain the central 50% of the data, whatever the shape of the data distribution, and performs an analogous role to that of the SD in the case of normally distributed ratio/interval data. The median replaces the mean as a measure of ‘average’. In fact, these measures could sensibly be used for all continuous measures, because they are always meaningful and their use will obviate the need to test continuous data for normality when describing characteristics of a sample. The box-and-whisker plot is a graphical representation of these order statistics (introduced further below and illustrated in Figure 3).
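As a minimal sketch of the two styles of summary described above, the following computes the mean and SD alongside the order statistics. The birthweights are hypothetical values chosen for illustration, and the quartile convention (`method="inclusive"`) is an assumption: statistical packages differ slightly in how they interpolate quartiles.

```python
import statistics

# Hypothetical birthweights in grams (illustrative values, not from the source data).
weights = [2100, 2450, 2780, 2900, 3050, 3120, 3300, 3480, 3650, 4200]

# Mean and sample SD (n - 1 denominator): suited to roughly normal data.
mean = statistics.mean(weights)
sd = statistics.stdev(weights)

# Order statistics: meaningful whatever the shape of the distribution.
# Quartile conventions vary between packages; "inclusive" is one common choice.
q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
five_number = (min(weights), q1, median, q3, max(weights))
iqr = q3 - q1

print(f"mean = {mean:.1f}, SD = {sd:.1f}")
print("min, Q1, median, Q3, max =", five_number)
print(f"IQR = {iqr:.1f}")
```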

Categorical data (nominal, binary and some ordinal data) should be summarised using the actual numbers (with percentage in parentheses) in each category.

### Number of groups to compare and relatedness of samples

In an audit, one may wish to compare local performance (e.g. proportion of successful procedures), measured on a sample, with some national ‘standard’ or target value. This would be a ‘one-group’ test, because there is no source of variation or uncertainty in the target. More commonly in research, two groups or more are to be compared. The attention in this review will be focused on the comparison of two independent groups. In addition to the number of groups, it is important to give attention to the relatedness of the samples. In a parallel group randomised controlled trial (RCT), and in many observational studies, the samples will be independent of one another. Each participant has an equal chance of being in either group in an RCT; they independently exist in the exposed or unexposed groups in an observational cohort study; or a particular ‘case’ has no influence on the selection of a specific ‘non-case’ in an unmatched case–control study. If samples are related in any way (e.g. controls matched for age and weeks of gestation with cases, repeated measures on the same individuals pre- and post-intervention), this relatedness needs to be accounted for in the analysis technique chosen.

## Characteristics of a statistical analysis

Figure 1 provides a conceptual overview of the elements of a general univariate statistical analysis. The first input comprises the data type and structure of the data. A clear understanding of this input will lead you to a decision regarding the appropriate analytical method to use. The hypotheses comprise the hypothesis you are directly testing (the null hypothesis) and the hypothesis you will accept (the alternative hypothesis) should your decision criterion lead you to reject the null hypothesis. In most situations the null hypothesis will be an assumption of ‘no effect’, for example ‘the means are equal in the populations from which these two samples are drawn’. The alternative hypothesis will be ‘the means are not equal’. This is an example of a two-tailed test. In some situations a null hypothesis may postulate a non-zero effect, and in other situations a directional, one-tailed hypothesis may be appropriate, where the direction of any interpretable or important effect is pre-specified (e.g. for a non-inferiority trial).

The effect size is the estimate of the size of the effect in the population, which in a simple univariate analysis will be the size of effect actually found between the samples (e.g. the difference in the sample means). However, there will inevitably be some uncertainty in this estimate. Intuitively, we would expect this uncertainty to be smaller the larger our sample size (n), and larger the greater the inherent variation (SD) of the outcome measure in our samples. The mathematical derivation will not be given here, but it should be apparent that the ratio SD/√n, called the standard error of the mean (SE or SEM), has this property. Because they look and sound so similar, the standard deviation and the standard error are often confused. It is important to understand that, despite their superficial similarity, they have completely different meanings. The standard deviation is a measure of the spread of the data; the standard error is a measure of uncertainty in the estimated effect size.
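The behaviour of SD/√n can be seen in a short sketch. The SD of 700 g is a hypothetical value chosen for illustration; the point is only that the SD describes the data and stays fixed, while the SE shrinks as the sample grows.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 700.0  # hypothetical SD of birthweight in grams (assumption for illustration)

# Quadrupling the sample size halves the standard error; the SD itself is unchanged.
for n in (25, 100, 400):
    print(f"n = {n:3d}: SD = {sd:.0f} g, SE = {standard_error(sd, n):.1f} g")
```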

By convention, we generally express the uncertainty in our estimate of the effect size in terms of a 95% confidence interval (95% CI). Again skipping the derivation, this is formed by adding to, and subtracting from, the estimated effect size approximately two (more precisely 1.96) standard errors: 95% CI = effect size ± 1.96 × SE. Strictly speaking, the factor 1.96 is only correct for the normal distribution, which will be an adequate approximation for large sample sizes. For small sample sizes (where we would use the t distribution, rather than the normal) this factor will be inflated slightly. The strict meaning of the confidence interval is that, if the study were to be repeated over and over again, the confidence interval would include the true value in 95% of the repetitions.
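The "repeated study" interpretation can be illustrated by simulation. The sketch below is an illustration under stated assumptions rather than a formal demonstration: the true mean (3000 g), SD (700 g), sample size (50) and number of repetitions are all hypothetical, and the normal factor 1.96 is used throughout, so coverage is expected to sit close to, and very slightly below, 95%.

```python
import random
import statistics

def ci95(sample):
    """Normal-approximation 95% CI for a mean: estimate +/- 1.96 x SE."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return mean - 1.96 * se, mean + 1.96 * se

def coverage(true_mean=3000.0, true_sd=700.0, n=50, repeats=2000, seed=1):
    """Fraction of simulated 'studies' whose 95% CI contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = ci95(sample)
        hits += low <= true_mean <= high
    return hits / repeats

observed_coverage = coverage()
print(f"proportion of intervals containing the true mean: {observed_coverage:.3f}")
```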

An estimate of effect size and its uncertainty is far more informative than knowing whether an effect is statistically significant or not. Nevertheless, identifying whether or not an effect is ‘significant’ is generally useful for decision-making purposes and for some tests (the non-parametric tests) it is sometimes the only information the test gives us. Whether or not the result from a statistical test is ‘statistically significant’ is determined by the P-value. The philosophical interpretation and practical use of P-values have been subject to strong debate since Fisher, on the one hand, and Neyman and Pearson, on the other, developed somewhat differing approaches in the first half of the 20th century.[3] It is doubtful whether many practical researchers understand the distinction between these approaches and in practice researchers often adopt a hybrid of the two, a pragmatic exposition of which follows. It should be noted, however, that there are logical flaws in trying to adopt both approaches simultaneously and there is continuing debate on the appropriateness of the terminology we use in hypothesis testing.

A P-value can only be interpreted in the context of the (null) hypothesis that is being tested. Our decision as to whether to accept or reject the null hypothesis is determined by the size of the P-value relative to some threshold value that we have predetermined. Most commonly, a threshold of P = 0.05 is used and it would usually, though not always, be appropriate to assume this default value. If the P-value is greater than P = 0.05 then we accept the null hypothesis. We have not proven the null hypothesis. All we can say is that the data are at least reasonably consistent with it. However, the data could be equally consistent with a smaller effect size that our study sample is too small to detect. The null hypothesis is therefore not only ‘not proven’, but neither can we say anything about the ‘likelihood’ of it being true. For this reason, ‘acceptance’ of the null hypothesis is deemed by many[4] to be inappropriate terminology – some argue that we should ‘fail to reject’, rather than ‘accept’ a null hypothesis; others suggest the phrase ‘insufficient evidence to reject’.[2] If the P-value is less than P = 0.05, then we reject the null hypothesis and accept the alternative hypothesis. We have not disproved the null hypothesis but we have identified that the data are not consistent with it. The lower (closer to zero) the P-value, the stronger is the evidence against the null hypothesis.
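One way to see how a P-value is tied to the null hypothesis is an exact permutation test, sketched below. This is a swapped-in illustration, not a method used in this review: under the null hypothesis the group labels are arbitrary, so every relabelling of the pooled data is equally likely, and the P-value is the proportion of relabellings giving a difference in means at least as large as the one observed. The data are hypothetical, and the approach is only practical for small samples.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    count = 0
    # Enumerate every way of labelling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        count += 1
    return extreme / count

# Hypothetical scores: the two groups do not overlap at all.
p_extreme = permutation_p_value([1, 2, 3, 4], [5, 6, 7, 8])
print(f"P = {p_extreme:.4f}")
```

With complete separation of two groups of four, only 2 of the 70 possible relabellings are as extreme as the observed one, which shows that very small samples cannot produce very small P-values however striking the data.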

The P-value is a ‘probability’, but it is not the probability of the (null) hypothesis being true. Its correct interpretation is somewhat more obscure, but can be defined as: the probability of observing a value for the test statistic that is as extreme as, or more extreme than, that currently observed, assuming that the null hypothesis is true. A logical consequence of this approach is that even if the rules of the game are followed correctly, there is still a chance of getting it wrong: one might reject the null hypothesis when in reality it is true (a type I error); or one might accept (fail to reject) the null hypothesis when in reality it is false (a type II error). The probability of a type I error (α) is determined by the P-value threshold we use for rejection of the null hypothesis, whereas the probability of a type II error (β) is determined by:

• Sample size – a larger sample size is more likely to identify an effect.
• The true effect size in the population – a small effect is more likely to be missed.
• The degree of inherent variation in the data – large sample standard deviations make a true effect harder to detect.
• The type I error (α) – the smaller α we use, the greater the type II error probability, for a given sample size.

The statistical power of a test is given by 1 – β, usually expressed as a percentage. Therefore a type II error probability (β) of 0.2 corresponds to a test that has 80% power, meaning that the test has 80% probability of finding a statistically significant effect if the true effect size is of the value postulated in the formula used for calculating the estimated required sample size. The above considerations comprise the essential input to a sample size estimation, which is a prior estimate of the sample size considered necessary to have a good chance of meeting the goals of a specific study.
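The way these inputs combine can be sketched with the usual normal-approximation formula for comparing two means with equal group sizes. The formula is a standard textbook approximation and is an assumption here, not something taken from this review; the postulated difference (250 g) and SD (700 g) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two means."""
    se = sd * sqrt(2 / n_per_group)  # SE of the difference in means
    return Z.cdf(abs(delta) / se - Z.inv_cdf(1 - alpha / 2))

def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate number per group needed to detect a true difference delta."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical planning values: detect a 250 g difference when the SD is 700 g.
n_required = sample_size_per_group(delta=250, sd=700)
print("n per group for 80% power:", n_required)
print("power with 50 per group:", round(power_two_means(250, 700, 50), 2))
print("n per group if alpha = 0.01:", sample_size_per_group(250, 700, alpha=0.01))
```

Each of the four determinants listed above appears as an argument: a larger sample, a larger true effect, a smaller SD or a larger α all increase power.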

### Overuse of P-values

There are many problems associated with the use of P-values and hypothesis testing.[5-7] It is always more informative to produce estimates of effect size and confidence intervals, but the limitations of P-values go beyond this. In an ideal study there should be only one primary outcome measure and associated hypothesis, with perhaps a small number of secondary outcome measures. Further, these should be pre-planned and an appropriate sample size estimation conducted prior to study commencement, rather than dredged ‘post hoc’ from a multitude of recorded variables, generated in the hope of finding at least something. The generation of multiple hypothesis tests raises at least two problems:

• The sensitivity of the tests to detect a clinically important effect will vary, and the sample size is likely to be inadequate for some of them.
• The probability of making at least one type I error across the set of tests increases with the number of tests conducted, even though the threshold for each individual test is unchanged. There are ways of controlling for this unwanted effect (for example the Bonferroni correction) but such corrections have the effect of reducing the individual test-wise P-value threshold for rejecting the null hypothesis, which further reduces statistical power.
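The inflation of the type I error with multiple testing, and the effect of the Bonferroni correction, can be sketched as follows. The calculation assumes that the tests are independent and that every null hypothesis is true, which is a simplification; real outcome measures are usually correlated.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one type I error across independent tests (all nulls true)."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P-value threshold that keeps the family-wise error at or below alpha."""
    return alpha / n_tests

for k in (1, 5, 10, 20):
    corrected = bonferroni_threshold(0.05, k)
    print(
        f"{k:2d} tests: uncorrected family-wise error = {familywise_error(0.05, k):.3f}, "
        f"Bonferroni threshold = {corrected:.4f}, "
        f"corrected family-wise error = {familywise_error(corrected, k):.3f}"
    )
```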

The net effect is that as the number of tests performed increases, the strength of the evidence provided by the data diminishes. Some studies by their very nature are exploratory and may involve the generation of multiple hypothesis tests – for example, when trying to identify influential factors from a potentially large number in a relatively underexplored area of research. In these situations it is important not to overinterpret the findings. Interval estimation will be more informative than P-values and any candidate factors need to be further examined in a confirmatory study before their effect can be interpreted with any confidence. There are other circumstances where P-values are used inappropriately, for example in comparing baseline characteristics between groups in a randomised study – if the randomisation worked properly, any ‘significant’ differences are, by definition, random.

### Statistical significance and clinical importance

The two concepts of statistical significance and clinical importance should not be confused. In a small study a clinically important effect, such as a 40% difference in treatment success rates, may have been found in the groups studied, but the small sample size may have meant that this effect was not found to be statistically significant. Conversely, a very large study may find a very small effect, of no clinical importance, to be statistically significant. These scenarios again illustrate the greater usefulness of interval estimation (where a ‘plausibility range’ for the true effect size can be presented) over ‘significance testing’. In other circumstances, a factor with a statistically significant but small effect may be slightly useful for prediction, but of no consequence in medical decision making. In the data set used in Figure 2, for example, babies' birthweight was correlated with the ‘normal’ weight of the mother. The correlation coefficient, r = 0.19, represents a weak positive correlation that is nevertheless statistically significant (P = 0.01). However, it is questionable whether this relationship is of any clinical importance in the context of the identification and management of low birthweight babies.
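A rough check shows why a correlation this weak can still be statistically significant. The sketch assumes that all 189 mother-baby pairs (74 + 115) contributed to the correlation, which is not stated explicitly, and uses the standard t statistic for testing a correlation coefficient against zero.

```python
from math import sqrt

r = 0.19      # correlation coefficient reported in the text
n = 74 + 115  # assumption: all 189 mother-baby pairs were used

# Proportion of the variation in birthweight "explained" by maternal weight.
r_squared = r ** 2

# Test statistic for the null hypothesis of zero correlation (n - 2 degrees of freedom).
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)

print(f"r squared = {r_squared:.3f}")
print(f"t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

Under these assumptions, maternal weight accounts for under 4% of the variation in birthweight, yet the test statistic is comfortably beyond the value of roughly 2 needed for significance at the 0.05 level, because the sample is reasonably large.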

### What test when?

For every combination of data type/distribution, data relatedness and number of groups, there will be at least one appropriate analysis method and there will often be several valid approaches. Introductory statistical texts will often present a flow chart or table, leading one through the decision-making process.[8, 9] The various methods are commonly grouped into two types: parametric methods – used for ratio/interval data where the underpinning assumption of normal distribution of the data (or in the case of regression methods, the residuals) holds; and non-parametric methods – used for some ordinal/categorical data or for ratio/interval data where the assumption of normality does not hold. Parametric methods can also be used for categorical data, for example logistic regression where the outcome measure is expressed as an odds ratio, and in various other situations where the data are distributed according to some other mathematically well-defined distribution (e.g. Poisson, exponential).

## Tests involving comparison of two independent groups

For the purposes of illustrating some of these methods, this review will be restricted to univariate comparisons of two independent groups. As the data structure departs from this simple case, the required analytical techniques rapidly become somewhat more complex, both in their implementation and interpretation, requiring a reasonable degree of statistical proficiency if the extraction of information content from the data is to be optimised.

### Ratio or interval data, normally distributed – the Student's t test

Where we have two independent groups that we believe differ systematically only in the characteristic that we wish to test (e.g. intervention/control in an RCT or exposed/unexposed in a cohort study), and a ratio or interval level outcome measure that is at least approximately normally distributed, then the t test is the most powerful test to compare the two groups.[10] Although, strictly speaking, the t test is only valid for normally distributed data, it is widely considered to be reasonably robust to departures from normality.[11] However, in such situations a non-parametric test (the Wilcoxon rank-sum test) may actually be the more powerful test.[10] For completeness, two additional points should be made here:

• The assumption of normally distributed data should be checked graphically (histogram, box-and-whisker plot or normal probability plot) and/or with an appropriate statistical test (e.g. the Shapiro–Wilk test or Kolmogorov–Smirnov test).
• It should be checked whether the variances (or SDs) in the two groups are approximately equal, again using appropriate graphical and/or statistical methods, because there are two varieties of t test: equal variances assumed and unequal variances assumed.
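The two varieties of t test differ only in how the standard error and degrees of freedom are calculated, as the following sketch shows. The data are hypothetical, the function names are illustrative, and the P-value itself (which requires the t distribution) is left to a statistical package.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Equal-variances t test: returns (difference, SE, t, degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, se, diff / se, n_a + n_b - 2

def welch_t(group_a, group_b):
    """Unequal-variances (Welch) t test with Satterthwaite degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    va, vb = variance(group_a) / n_a, variance(group_b) / n_b
    diff = mean(group_a) - mean(group_b)
    se = sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return diff, se, diff / se, df

# Hypothetical birthweights in grams for two small independent groups.
group_a = [3100, 3300, 2900, 3500, 3200]
group_b = [2800, 2700, 3000, 2600, 2900]

pooled_result = pooled_t(group_a, group_b)
welch_result = welch_t(group_a, group_b)
print("equal variances assumed:   diff=%.0f SE=%.1f t=%.2f df=%.1f" % pooled_result)
print("unequal variances assumed: diff=%.0f SE=%.1f t=%.2f df=%.1f" % welch_result)
```

With equal group sizes the two versions give the same t statistic and differ only in the degrees of freedom; they diverge more when both the group sizes and the variances are unequal.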

The t test can be used to compare two groups in either an RCT or an observational study. However, in an observational study there is a greater risk of there being an imbalance between groups in potentially important confounding factors, thereby resulting in a biased estimate of the effect of the intervention/exposure of interest. A confounder is a variable or factor that is related both to the outcome measure and the predictor of interest and its presence could cause either an overestimate or underestimate of the assumed predictor's effect. The effect of confounders can be controlled for using analysis of covariance (ANCOVA). This approach is also naturally accommodated in statistical (linear regression) modelling procedures.

### Example

The data in Figure 2 are taken from Hosmer and Lemeshow[12] and comprise the birthweight of babies born to smoking (n = 74) and non-smoking (n = 115) mothers. All statistical tests in this review are conducted using Stata, Release 12 [StataCorp. 2011. Stata Statistical Software: Release 12. College Station TX: StataCorp LP].

In the case where there are no outliers, the lower and upper adjacent values represent, respectively, the minimum and maximum. The other three order statistics are as indicated on the graph. However, the whisker is defined to have a maximum length of 1.5 times the interquartile range (box width). This rule specifically allows outliers to be identified as individual points beyond the whisker.
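The whisker rule can be sketched as follows. The data are hypothetical, with one deliberately extreme value, and the quartile convention is an assumption; plotting software may compute quartiles slightly differently.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, adjacent values (whisker ends) and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_adjacent": min(inside),  # most extreme value within the lower fence
        "upper_adjacent": max(inside),  # most extreme value within the upper fence
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical birthweights in grams; 5400 is included to show an outlier.
weights = [2100, 2450, 2780, 2900, 3050, 3120, 3300, 3480, 3650, 5400]
summary = box_plot_summary(weights)
print(summary)
```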

The box-and-whisker plots indicate that birthweight is fairly symmetrically distributed in both groups and the spread of the data within the two groups is very similar, suggesting that there is not a large or obvious difference in the variances (or, equivalently, the standard deviations) of the birthweight between the smoking and non-smoking mothers. There is a suggestion that the birthweight of babies born to the smoking mothers may be systematically smaller, but it is by no means obvious from the box-and-whisker plot whether this is a real difference or whether it could plausibly be explained by chance (sample sizes are not shown on the graph).

The next step is to perform a formal test of the hypothesis that the data for both groups come from a normal distribution. Stata uses the Shapiro–Wilk test. The P-values obtained from this test (P = 0.35 and P = 0.43, for the non-smoking and smoking mothers, respectively) suggest that it is reasonable to assume that the birthweights in both groups are normally distributed. We can now test the hypothesis that the variances (or, equivalently, the standard deviations) in both populations are equal. Stata's test for equality of standard deviations, testing the null hypothesis of equal SDs, gives a two-sided P-value of P = 0.22, so we accept the hypothesis that the variances are equal. Having accepted that the data are normally distributed with equal variances, the independent samples t test (assuming equal variances) can be used to test the hypothesis that the mean birthweights in the two populations (smokers and non-smokers) are equal. The Stata command line and output are given in Box 1. The ‘unpaired’ option specifies an independent samples t test, as opposed to a paired samples test.

### Box 1. Stata output for the independent samples t test

In Box 1 the sample sizes are given as 115 and 74, for the non-smokers and smokers, respectively. The sample means (in grams) for each group are calculated, with the 95% CI for the sample means given in the last two columns. Also calculated is the difference in sample means (283.8 g), which is our estimate of the effect size (babies born to smoking mothers being, on average, of lower birthweight), and its confidence interval (95% CI 72.8–494.8). The latter is interpreted as meaning that the true difference in mean birthweight in the two populations (smoking and non-smoking mothers) from which the samples were drawn, is highly likely to be in this range.

The test of the (null) hypothesis (the line above, commencing Ho:) that the population means are equal produces a two-sided P-value of 0.0087 (P>|t| = 0.0087); the null hypothesis can therefore be rejected (since this P-value is less than 0.05), meaning that we have concluded that the null-hypothesised value is implausible, on the basis of these data. Note, however, that because these data derive from an observational study, we cannot rule out the possibility that this estimated effect is biased because of an imbalance of important confounding factors between the two groups (e.g. age, ethnicity, social class and various lifestyle factors).
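The link between the confidence interval and the test can be checked from the reported figures alone. In the sketch below, the critical value of 1.973 (the approximate 97.5th percentile of the t distribution with 187 degrees of freedom) is supplied as an assumption; the standard error and t statistic are then back-calculated rather than read from the Stata output.

```python
diff = 283.8                   # reported difference in mean birthweight (g)
ci_low, ci_high = 72.8, 494.8  # reported 95% CI for the difference (g)
df = 115 + 74 - 2              # degrees of freedom for the equal-variances t test
t_crit = 1.973                 # assumption: approx. 97.5th percentile of t with 187 df

# The CI is diff +/- t_crit x SE, so its half-width recovers the standard error.
se = (ci_high - ci_low) / (2 * t_crit)
t_stat = diff / se

print(f"df = {df}, SE = {se:.1f} g, t = {t_stat:.2f}")
```

A t statistic of about 2.65 on 187 degrees of freedom is in keeping with the reported two-sided P-value, and the fact that the interval excludes zero carries the same message as P < 0.05.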

### Ordinal data, or non-normally distributed ratio/interval data – The Mann–Whitney U (MWU), also known as the Wilcoxon rank-sum (WRS) test

Whereas the t test produces an estimate of effect size (difference in means), a confidence interval for the effect size and a P-value indicating whether the effect could plausibly be explained by chance, the non-parametric tests generally produce only a P-value. Non-parametric methods can be, but are not often, used to estimate confidence intervals for the difference in medians, but some caution needs to be exercised here, as the robustness of their calculation can be influenced by the distribution of the data.[13] Nevertheless, they perhaps should be more widely used. The null hypothesis tested by the MWU/WRS tests is that the samples come from the same population. Implicit in this assumption is that they have the same distribution. The MWU/WRS tests are sensitive to differences in both location (median values) and distribution (shape and spread of the data).[14] For illustration, applying the WRS test to the data above produces a similar P-value (P = 0.0067) to that of the t test, but provides no estimate of the effect size or its uncertainty and would not be the optimum test for these data.
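The rank-based logic of the test can be sketched directly. The U statistic counts, over all pairs formed by taking one observation from each group, how often the first group's value is the larger. The P-value below uses the large-sample normal approximation without tie or continuity corrections, which is a simplifying assumption; statistical packages typically apply such corrections or use exact methods for small samples. The scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(group_a, group_b):
    """U for group_a: number of pairs with a > b, counting ties as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)

def mwu_p_value(group_a, group_b):
    """Two-sided P-value from the normal approximation (no tie or continuity correction)."""
    n_a, n_b = len(group_a), len(group_b)
    u = mann_whitney_u(group_a, group_b)
    expected = n_a * n_b / 2  # value of U expected under the null hypothesis
    sd = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - expected) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical ordinal pain scores in two independent groups.
scores_a = [2, 3, 3, 4, 5, 6, 7]
scores_b = [4, 5, 6, 6, 7, 8, 9]
u_a = mann_whitney_u(scores_a, scores_b)
u_b = mann_whitney_u(scores_b, scores_a)
print(f"U (group A) = {u_a}, U (group B) = {u_b}, P = {mwu_p_value(scores_a, scores_b):.3f}")
```

Note that the output is a test statistic and a P-value only; no difference in medians or confidence interval is produced.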

### Tests for binary/dichotomous data

As an example of a binary/dichotomous variable, the birthweight data[12] considered above could be given a binary categorisation: low birthweight (defined as <2.5 kg) or not. Some types of variable (e.g. died/survived) are naturally binary in nature but it is not generally a good idea to dichotomise continuous data, because in doing so information is lost and, with it, statistical power. However, where clinical decision making is based on some threshold-based diagnosis, then this may be a sensible thing to do. For binary data, the most sensible first step in the analysis process is the generation of a 2 × 2 contingency table. Smoking status is given the value 1 for smokers and 0 for non-smokers; babies are assigned the value 1 if they are of low birthweight, 0 otherwise. The Stata command and (edited) output are given in Box 2.

### Pearson's chi-squared and Fisher's exact test

Here we test the null hypothesis of no association between the two binary categorical variables. Based on the row/column totals in the contingency table, the chi-squared test computes the expected value for each cell (what you would expect to find, on average, if the null hypothesis were true). The chi-squared statistic, from which the P-value is obtained, is determined by the magnitudes of the differences between the actual and expected values. Fisher's exact test is more computationally intensive, but is recommended for small samples and where any one of the cells has a single digit number.[2]

Inspection of the table shows that there appears to be a considerably higher proportion of babies in the normal birthweight range among the non-smoking mothers than among the smokers. Pearson's chi-squared and Fisher's exact measures of association are given, as requested in the command line. The figures on the right hand side displayed below the table are the P-values (P = 0.026 and P = 0.036) and the conclusions one would draw from each of them are, in this case, consistent. There is a statistically significant association between maternal smoking and the probability of the baby having a low birthweight.
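The chi-squared calculation can be sketched by hand. The cell counts below are an assumption: they are back-calculated from the group sizes and proportions reported in this review (25.2% of 115 non-smokers and 40.5% of 74 smokers having a low birthweight baby) rather than copied from the Stata table. No continuity correction is applied.

```python
from math import erfc, sqrt

# Rows: non-smokers, smokers. Columns: normal birthweight, low birthweight.
# Counts are back-calculated from the reported proportions (assumption).
table = [[86, 29],
         [44, 30]]

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and P-value (1 degree of freedom) for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if smoking and low birthweight were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of the chi-squared distribution with 1 df, via the normal distribution.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = chi_squared_2x2(table)
print(f"chi-squared = {chi2:.2f}, P = {p_value:.4f}")
```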

A simple chi-squared test is not the most informative way of analysing data of this type. As a general principle, it is preferable to obtain an estimate of the effect size of the factor of interest (in this case, smoking) along with a measure of uncertainty of the estimate, which will usually be a 95% confidence interval. Several alternative approaches can be adopted in the current context: estimates of the difference in proportions (risks), the risk ratio and the odds ratio.

### Risk difference

Estimating the risk difference is essentially equivalent to conducting a test of independent proportions: the risk difference and its uncertainty are calculated, in addition to a P-value. The P-value produced will be very close to that generated by the chi-squared test. However, as with the chi-squared test, the P-value for the test of proportions (usually calculated using the asymptotic normal approximation) can become unreliable when sample sizes are small. The term ‘asymptotic’ refers to the limiting behaviour of the distribution of the test statistic as sample size increases. In these circumstances an exact binomial P-value may be preferred, though there are differing views on this issue.[11] For these data, the asymptotic normal approximation is reasonable.
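A minimal sketch of the risk difference and its asymptotic 95% confidence interval follows. The counts (30 of 74 smokers and 29 of 115 non-smokers with a low birthweight baby) are an assumption, back-calculated from the proportions reported in this review.

```python
from math import sqrt

# Counts back-calculated from the reported proportions (assumption).
low_smokers, n_smokers = 30, 74
low_nonsmokers, n_nonsmokers = 29, 115

risk_smokers = low_smokers / n_smokers
risk_nonsmokers = low_nonsmokers / n_nonsmokers
risk_difference = risk_smokers - risk_nonsmokers

# Standard error of a difference between two independent proportions.
se = sqrt(
    risk_smokers * (1 - risk_smokers) / n_smokers
    + risk_nonsmokers * (1 - risk_nonsmokers) / n_nonsmokers
)
rd_low = risk_difference - 1.96 * se
rd_high = risk_difference + 1.96 * se

print(f"risk difference = {risk_difference:.3f}, 95% CI {rd_low:.3f} to {rd_high:.3f}")
```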

### Odds ratio (OR)

If the probability of an event is given by $P_{\text{event}}$, then the odds are $\text{odds} = P_{\text{event}} / (1 - P_{\text{event}})$. We estimate the odds as: odds = number of people having the event/number of people not having the event. The odds ratio is then given by the ratio of the odds in the two groups to be compared.

An OR of 1 represents equal odds in both groups (i.e. no difference). The OR in one direction is between zero and 1; in the other direction it is between 1 and infinity. In practice, we work with the natural logarithm of the odds and when we are interested in comparing two groups (e.g. treatment and control) we model the natural logarithm of the odds ratio and exponentiate this function to get the OR. The reason for this approach is that the log(OR) is symmetric about zero (log(1)), which makes it much easier to handle mathematically.
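Working on the log scale can be sketched as follows. The counts are an assumption, back-calculated from the proportions reported in this review. The interval uses the log-based (Woolf) standard error, which is one common choice; software may default to a different method, so the limits can differ slightly in the last digit from those in printed output.

```python
from math import exp, log, sqrt

# 2 x 2 counts back-calculated from the reported proportions (assumption).
low_smokers, normal_smokers = 30, 44
low_nonsmokers, normal_nonsmokers = 29, 86

odds_smokers = low_smokers / normal_smokers
odds_nonsmokers = low_nonsmokers / normal_nonsmokers
odds_ratio = odds_smokers / odds_nonsmokers

# The CI is built symmetrically on the log scale, then exponentiated.
log_or = log(odds_ratio)
se_log_or = sqrt(1 / low_smokers + 1 / normal_smokers + 1 / low_nonsmokers + 1 / normal_nonsmokers)
or_low = exp(log_or - 1.96 * se_log_or)
or_high = exp(log_or + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI {or_low:.2f} to {or_high:.2f}")

# Symmetry on the log scale: swapping the groups changes only the sign of log(OR).
print(f"log(OR) = {log_or:.3f}, log(1/OR) = {log(1 / odds_ratio):.3f}")
```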

### Risk ratio – relative risk (RR)

The risk ratio is more intuitively meaningful than the odds ratio and is simply the ratio of the probabilities of the event in the two groups, estimated by the ratio of the proportions observed in each group.
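A minimal Python sketch of this estimate is given below. The 95% confidence interval uses the usual log-scale (Katz) approximation, which is an assumption about method rather than something specified in the text. The counts (30 events among 74 exposed, 29 among 115 unexposed) are also assumed for illustration, and the function name is hypothetical.

```python
import math

def risk_ratio_ci(events_exp, n_exp, events_unexp, n_unexp, z_crit=1.96):
    """Risk ratio with a log-scale (Katz) 95% confidence interval."""
    risk_exp = events_exp / n_exp
    risk_unexp = events_unexp / n_unexp
    rr = risk_exp / risk_unexp

    # Standard error of log(RR)
    se_log_rr = math.sqrt(
        1 / events_exp - 1 / n_exp + 1 / events_unexp - 1 / n_unexp
    )
    lower = math.exp(math.log(rr) - z_crit * se_log_rr)
    upper = math.exp(math.log(rr) + z_crit * se_log_rr)
    return rr, lower, upper

# Assumed counts (hypothetical): exposed vs unexposed
rr_est, rr_lower, rr_upper = risk_ratio_ci(30, 74, 29, 115)
print(f"RR = {rr_est:.2f}, 95% CI {rr_lower:.2f} to {rr_upper:.2f}")
```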

Odds ratios are often used instead of risk ratios because the odds ratio lends itself more easily to mathematical (logistic regression) modelling where we wish to assess the simultaneous effect of multiple predictors, or where we wish to control for potential confounding factors. When the prevalence of the condition/event of interest is relatively small (below 10%) the odds ratio is very close to the risk ratio and is often loosely interpreted as if it were the same as a risk ratio. However, as the prevalence increases the odds ratio moves progressively further from 1 than the risk ratio (becoming progressively larger when the ratio exceeds 1), so it needs to be interpreted with caution in these circumstances.[15] In case–control studies it is generally problematic to estimate the risk of disease because of the way in which subjects are sampled.[16] In such studies, it is the odds of ‘exposure’ that are compared between the cases and controls.
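The divergence between the two measures can be illustrated numerically. The sketch below holds the risk ratio fixed at an arbitrary, hypothetical value of 1.5 and shows how the corresponding odds ratio changes as the baseline (unexposed) risk rises; the baseline risks are illustrative values, not data from any study.

```python
def odds_ratio_from_risks(risk_exposed, risk_unexposed):
    """Odds ratio implied by the risks (probabilities) in two groups."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed

FIXED_RR = 1.5  # hypothetical risk ratio, held constant throughout

implied_ors = []
for baseline_risk in (0.01, 0.05, 0.10, 0.25, 0.40):
    exposed_risk = FIXED_RR * baseline_risk
    implied_or = odds_ratio_from_risks(exposed_risk, baseline_risk)
    implied_ors.append(implied_or)
    print(f"baseline risk {baseline_risk:.2f}: RR = {FIXED_RR:.2f}, OR = {implied_or:.2f}")
```

With a baseline risk of 1% the odds ratio is almost indistinguishable from the risk ratio; with a baseline risk of 40% the same risk ratio of 1.5 corresponds to an odds ratio of 2.25.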

In the (edited) Box 3, generated using Stata's ‘csi’ command for a cohort study, the cases are the low birthweight babies, the non-cases the normal birthweight babies, the exposed group are the smokers and the unexposed group are the non-smokers. The proportions in each group (risk) and the risk ratio (both by default) and the odds ratio (as requested) are calculated.

### Box 3. Stata output for the epidemiological table and associated effect size estimates

Estimates for the proportion of low birthweight babies in each group are 0.252 (25.2%) in the non-smoking mothers and 0.405 (40.5%) in the smoking mothers; the difference in proportions (risk difference) is 0.153 (15.3%), 95% CI 0.016–0.290; the risk ratio is 1.61, 95% CI 1.06–2.44; the odds ratio is 2.02, 95% CI 1.08–3.77. The odds of having a low birthweight baby are doubled in the smoking mothers. The chi-squared P-value is again P = 0.027.

Note that because the prevalence of low birthweight is quite high (25% and 40%), the odds ratio is somewhat higher than the risk ratio. The risk ratio of 1.61 can be interpreted as meaning that there is an estimated 61% increase in the risk of having a low birthweight baby if the mother smokes during pregnancy. However, interpreting the 95% CI, the true effect could plausibly be as small as a 6% increase in risk, or as large as 2.44 times the risk in non-smokers (more than a doubling of risk). If the odds ratio were to be interpreted as if it were a risk ratio, then the risk associated with smoking would be overestimated. Finally, as with the t test above, note that this estimate could be biased as a result of uncontrolled confounding factors. Multivariable logistic regression could be used if individual values for likely confounders were available.
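The size of this overestimate can be checked with a short calculation. The sketch below uses the rounded figures reported above (odds ratio 2.02, risks of 0.252 and 0.405) and the algebraic identity RR = OR / (1 − p0 + p0 × OR), where p0 is the risk in the unexposed group. The identity follows directly from the definitions of the odds ratio and risk ratio; the function name is hypothetical.

```python
def risk_ratio_from_odds_ratio(odds_ratio, baseline_risk):
    """Risk ratio implied by an odds ratio at a given unexposed-group risk."""
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)

REPORTED_OR = 2.02             # odds ratio reported in the text
RISK_NON_SMOKERS = 0.252       # reported risk in non-smoking mothers
RISK_SMOKERS_OBSERVED = 0.405  # reported risk in smoking mothers

# Correct conversion: recovers the reported risk ratio (to rounding error)
implied_rr = risk_ratio_from_odds_ratio(REPORTED_OR, RISK_NON_SMOKERS)

# Misreading the OR as a risk ratio implies a higher risk in smokers than was observed
naive_risk_smokers = REPORTED_OR * RISK_NON_SMOKERS

print(f"Risk ratio implied by the OR: {implied_rr:.2f}")
print(f"Risk in smokers if the OR is misread as a risk ratio: {naive_risk_smokers:.3f}")
print(f"Risk in smokers actually observed: {RISK_SMOKERS_OBSERVED:.3f}")
```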

## Conclusion

Some of the key fundamental principles underpinning a statistical analysis have been introduced and a few example analyses have been presented. The outputs from a statistical test comprise the effect size and its 95% confidence interval (mainly for parametric tests) and the P-value (for all tests), which tests the plausibility of the null hypothesis (usually the assumption of ‘no effect’) and determines whether or not the result is ‘statistically significant’. Effect size estimation and its associated uncertainty (confidence interval) are more informative than P-values and should be presented wherever possible. Finally, whatever the ‘statistical significance’ of the result, ultimately what matters is its clinical importance. The judgement regarding clinical importance is in the domain of the clinician and answers cannot be provided by the ‘statistics’.

## Contribution to authorship

A.J. Scally is the sole author of this article and is responsible for it in its entirety.

## Disclosure of interests

The author has no relevant interests of any kind to declare.

## Acknowledgements

The final form of the article was strongly influenced by the very useful feedback from the referees.