Statistical sample size calculation is not an exact, or pure, science. 32 , 172 First, investigators typically make assumptions that are a simplification of the anticipated analysis. For example, the impact of controlling for prognostic factors is very difficult to quantify and, even though the analysis is intended to be adjusted (e.g. when randomisation has been stratified or minimised), 173 the sample size calculation is often based on an unadjusted analysis. Second, the calculated sample size can be very sensitive to the values of the inputs. In some circumstances a relatively small change in the value of one of the inputs (e.g. the control group event proportion for a binary outcome) can lead to a substantial change in the calculated sample size. However, the value used for one of the inputs (e.g. control group event proportion) may not accurately reflect the actual value that will be observed in the study. It is prudent to undertake sensitivity calculations to assess the potential impact of misspecification of key inputs (e.g. SD for a continuous outcome, level of missing data, etc.). This would also help inform decision-making about the continuation of a trial in which accumulating data suggest that the parameter will be substantially different from the one assumed in the main sample size calculation.
The role of the sample size calculation is to determine how many observations are required in order that the planned main analysis of the primary outcome, that is the one chosen to address the primary estimand of interest, is likely to provide a useful result. The sample size may also be chosen with reference to further key analyses (e.g. those focusing on other outcomes and subpopulations that address alternative estimands of interest). Most simply, this can be done by setting the RCT’s sample size to the maximum of the sample sizes required across the various analyses under consideration.
A variety of statistical approaches are available, although, overwhelmingly, current practice is to use the conventional Neyman–Pearson approach. This is so much the case that the specification of ‘effect size’, ‘significance level’ and ‘power’ are common parlance. The Neyman–Pearson approach is explained in Appendix 2 and the rest of this appendix assumes this approach is being used. Alternative approaches to the sample size calculation are briefly considered in Appendix 4 (see Appendix 4, sections Precision; Bayesian; and Value of information approach).
Often a simple formula can be used to calculate the required sample size. 174 The formula varies according to the type of outcome and, somewhat implicitly, the design of the trial and the planned analysis. Some of the simpler formulae are given in Binary outcome sample size calculation for a superiority trial; Continuous outcome sample size calculation for a superiority trial; Dealing with missing data for binary and continuous outcomes; and Time-to-event sample size calculation for a superiority trial, for the standard RCT design (i.e. a two-arm parallel-group RCT) and for the most common outcome types (binary, continuous and time to event).
The most common approach to the sample size calculation for a RCT is based on what can be described as the Neyman–Pearson, or conventional, approach. In essence, this approach involves adopting a statistical hypothesis testing framework and calculating the sample size required, given the specification of two statistical parameters (the power and significance level – see Glossary for definitions). This approach is sometimes referred to as carrying out a ‘power calculation’. This is a frequentist (as opposed to Bayesian) approach to answering the research question (see Appendix 4).
Although it is often not explicitly stated, this approach involves assuming a null hypothesis for which evidence to reject in favour of an alternative hypothesis is assessed. For a superiority trial with a standard design, the null hypothesis is that there is no difference between the interventions, and the alternative hypothesis is that there is a difference between them (i.e. one is superior to the other with respect to the outcome of interest). This leads to four possible scenarios once the trial is conducted and the data have been collected and analysed (Table 7).
TABLE 7 Possible scenarios following the statistical analysis of a superiority trial

Analysis result                               Truth: no difference              Truth: a difference exists
Statistically significant difference          Type I error (probability α)      Correct conclusion (probability 1 – β, the power)
No statistically significant difference       Correct conclusion                Type II error (probability β)
There are two scenarios in which a correct conclusion is made and two scenarios in which an incorrect conclusion is made. The chance of each error is controlled by the statistical parameters: the significance level and the statistical power. Typically, the probability of a type I error (α) is controlled at 0.05 (or 5%) by declaring a result statistically significant only when the p-value is at or below this level (i.e. p ≤ 0.05 is ‘statistically significant’ and p > 0.05 is not). Additionally, this is usually a two-sided significance level, in that it is not prescribed a priori in which direction a difference might be found. In a similar manner, the type II error rate (β) is controlled by ensuring that the statistical power (which is simply 1 minus the type II error rate, i.e. 1 – β) is sufficiently large. Typical values are 0.8 or 0.9 (i.e. 80% or 90% statistical power).
It is worth noting that the presence or absence of a statistically significant result cannot be used to decide whether or not there is an important difference. Often the most that can be concluded from a non-statistically significant result is that there is no statistical evidence of a difference (i.e. a difference cannot be conclusively ruled out). Additionally, it is possible to have a statistically significant result even when the observed difference is smaller than the target difference assumed in a conventional sample size calculation. 175 , 176 This value can be readily calculated for a continuous outcome. Here, this is described as the minimum statistically detectable difference. It should not be confused with the MCID or the minimum clinically detectable difference, which are entirely different concepts (see the Glossary for brief descriptions). Some recommend calculating and reporting the minimum statistically detectable difference, as well as the target difference and the required sample size. 176
Both the 5% significance level and 80% or 90% power are arbitrary values with no theoretical justification, but they are widely used. As excluding the possibility of either error is impossible, and the required sample size increases at an ever greater rate as either error rate is set closer to zero, these values have become the de facto standards. If well chosen, the target difference is a valuable aid to the interpretation of the analysis result, irrespective of whether or not it is statistically significant. It is essential when interpreting the analysis of a trial to consider the uncertainty in the estimate, which is reflected in the CI. A key question of interest is what magnitude of difference can be ruled out. The expected (predicted) width of the CI can be determined for a given target difference and sample size calculation, which is a helpful further aid in making an informed choice about this part of a trial’s design. 98
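As an illustration, the expected width of a two-sided 95% CI for a difference in means can be computed directly under a normal approximation. The sketch below is illustrative only; the function name and the normal approximation are assumptions, not taken from the report:

```python
from scipy import stats

def expected_ci_width(n, sigma, alpha=0.05):
    """Expected full width of the (1 - alpha) CI for the difference in
    means of two groups of size n with common SD sigma (normal approx.)."""
    z = stats.norm.ppf(1 - alpha / 2)
    se = sigma * (2 / n) ** 0.5   # SE of the difference in means
    return 2 * z * se

# e.g. n = 63 per arm, sigma = 1: expected_ci_width(63, 1.0) -> about 0.70
```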
Given the assumed research hypothesis, the design, the statistical parameters and the target difference, the sample size can be calculated. Formulae vary according to the type of outcome (see Binary outcome sample size calculation for a superiority trial; Continuous outcome sample size calculation for a superiority trial; and Time-to-event sample size calculation for a superiority trial), study design (see Appendix 5 for some common alternative designs) and the planned statistical analysis (see Other topics of interest). The general approach is similar across study designs. In more complex situations, the frequentist properties (e.g. the type I and II error rates) can be estimated by simulating trial datasets and applying the planned analysis to them, under scenarios in which there is and is not a genuine difference between interventions. 177
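For instance, a simulation along the following lines can estimate the power of a two-arm comparison of a continuous outcome analysed with a t-test; setting delta = 0 estimates the type I error rate instead. This is a minimal sketch: the outcome model, the t-test analysis and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def simulated_power(n, delta, sigma, alpha=0.05, n_sims=10_000, seed=1):
    """Proportion of simulated trials in which the planned analysis
    (a two-sample t-test) yields p <= alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, sigma, n)      # control arm outcomes
        b = rng.normal(delta, sigma, n)    # intervention arm outcomes
        _, p = stats.ttest_ind(a, b)       # planned analysis: t-test
        rejections += int(p <= alpha)
    return rejections / n_sims

# e.g. simulated_power(63, 0.5, 1.0) is close to the nominal 80% power
```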
The conventional approach to sample size calculations is not without limitations. 168 , 178 , 179 Misinterpretation of findings, related at least in part to the statistical approach (such as what a p-value actually is and what can be inferred from it), has been highlighted, and various proposals to improve practice have been made. 180 Nevertheless, the conventional approach to clinical trial sample sizes has remained remarkably persistent and is by far the most commonly used at present. 13 , 181 This reflects to some degree its ease of implementation and training, as well as the uncertainty about alternatives.
This appendix presumes the conventional approach is to be used for the sample size calculation for a two-arm trial with 1 : 1 allocation. Immediately below, simple formulae for the most common outcome types are provided. For completeness, Appendix 4 briefly summarises alternative approaches to calculating the sample size for a RCT. Statistical issues related to conducting a reassessment of the sample size under a conventional and a Bayesian approach are considered elsewhere. 1 , 182 – 184 Adaptive trial designs (see Appendix 5) seek to formally incorporate potential changes to the design, informed by interim data, into the trial design.
There are a number of commonly used formulae for calculating the sample size for a binary outcome for a superiority trial (i.e. for a study in which two proportions are to be compared). 1 One formula for the required number of participants per arm, n, for a standard trial (assumed equal allocation and therefore group sizes) is presented in Equation 1 and is relatively straightforward to calculate:
\[ n = \frac{\left( Z_{1-\beta} + Z_{1-\alpha/2} \right)^2 \left[ \pi_A (1 - \pi_A) + \pi_B (1 - \pi_B) \right]}{(\pi_B - \pi_A)^2}, \]

where n is the required number of observations in each of the two randomised groups. Z1 – x is the value from the standardised normal distribution for which the probability of exceeding it is x. πA and πB are the anticipated probabilities of an event in groups A and B. α is the statistical significance level (i.e. the type I error rate), and β is the type II error rate and is chosen so that 1 – β is equal to the desired statistical power. The formula assumes even allocation between the treatment arms and a two-sided comparison.
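A direct transcription of this formula as a hedged Python sketch (the function name is illustrative; the result is rounded up to a whole participant):

```python
import math
from scipy import stats

def n_per_arm_binary(pi_a, pi_b, alpha=0.05, power=0.9):
    """Per-arm sample size from Equation 1 (equal allocation,
    two-sided significance level alpha, power = 1 - beta)."""
    z_a = stats.norm.ppf(1 - alpha / 2)   # Z_{1 - alpha/2}
    z_b = stats.norm.ppf(power)           # Z_{1 - beta}
    numerator = (z_b + z_a) ** 2 * (pi_a * (1 - pi_a) + pi_b * (1 - pi_b))
    return math.ceil(numerator / (pi_b - pi_a) ** 2)

# e.g. pi_A = 0.2, pi_B = 0.4, 90% power: n_per_arm_binary(0.2, 0.4) -> 106
```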
The target difference can be expressed in multiple ways. It can be expressed as the absolute risk difference (πB – πA) or as a ratio, typically the RR (πB/πA) or the OR:

\[ \mathrm{OR} = \frac{\pi_B / (1 - \pi_B)}{\pi_A / (1 - \pi_A)}. \]

Different combinations of πA and πB can lead to the same OR or RR, although they may produce very different absolute risk differences. For example, a proportion of 0.4 compared with one of 0.2 represents a RR of 2 and a risk difference of 0.2. Proportions of 0.1 and 0.05 also represent a RR of 2, but the risk difference of 0.05 is far smaller and will require a far larger sample size. Whenever the target difference is expressed as a ratio, the anticipated control (reference) group risk, πA, should also be provided.
The value assumed for πA greatly influences the sample size. 1 In this context the control group proportion can be considered a nuisance parameter, with the target difference, δ, fixed regardless of what the control group proportion is. Estimates of this parameter may come from a pilot trial or existing literature (see Chapter 3, Pilot studies and Review of the evidence base). The observed response needs to be appraised in the light of the design, population and analysis of the study from which it is estimated. The planned analysis, particularly the summary measure used, is important for the calculation, as adjusted and unadjusted analyses can be estimating different estimands. 185
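As a small sensitivity check (illustrative values only, reusing the n_per_arm_binary sketch above), holding the absolute risk difference fixed at 0.2 while varying the assumed control group proportion shows how strongly this nuisance parameter drives the calculation:

```python
# Required n per arm (90% power, 5% two-sided alpha) for a fixed
# absolute risk difference of 0.2 under different control proportions:
for pi_a in (0.1, 0.2, 0.3, 0.4):
    print(pi_a, n_per_arm_binary(pi_a, pi_a + 0.2))
# -> 79, 106, 121 and 127 per arm, respectively
```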
For ease of presentation, a slightly simplified formula 124 to estimate the sample size per arm for a superiority trial with a continuous outcome is:
\[ n = \frac{2 \left( Z_{1-\beta} + Z_{1-\alpha/2} \right)^2 \sigma^2}{\delta^2} + \frac{Z_{1-\alpha/2}^2}{4}, \]

where Z1 – β and Z1 – α/2 are defined as before, σ is the population SD and δ is the target mean difference. As before, the formula presented here assumes even allocation between the treatment arms and a two-sided test comparison.
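In Python, this simplified formula can be sketched as follows (the function name is an illustrative assumption; the result is rounded up):

```python
import math
from scipy import stats

def n_per_arm_continuous(delta, sd, alpha=0.05, power=0.9):
    """Per-arm sample size for a continuous outcome (equal allocation,
    two-sided test), using the simplified formula above."""
    z_a = stats.norm.ppf(1 - alpha / 2)   # Z_{1 - alpha/2}
    z_b = stats.norm.ppf(power)           # Z_{1 - beta}
    n = 2 * (z_b + z_a) ** 2 * sd ** 2 / delta ** 2 + z_a ** 2 / 4
    return math.ceil(n)

# e.g. delta = 0.5, sd = 1, 80% power:
# n_per_arm_continuous(0.5, 1.0, power=0.8) -> 64 per arm
```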
In practice, σ is typically assumed to be known, with an estimate from an existing study, S, used as if it were the population value. The formula can be further simplified by replacing δ by δ/σ, the Cohen’s d standardised effect (dSES):
\[ n = \frac{2 \left( Z_{1-\beta} + Z_{1-\alpha/2} \right)^2}{d_{SES}^2} + \frac{Z_{1-\alpha/2}^2}{4}. \]

Specifying the effect on the standardised scale, dSES, is therefore sufficient to calculate the required n for a given significance level and power. However, it should be noted that different combinations of mean and SD values produce the same SES (Cohen’s d). See Chapter 3, Standardised effect size, for further discussion. Although sufficient for the sample size calculation, specifying the target difference as a standardised effect alone can be viewed as an insufficient specification, as it does not define the target difference on the original scale.
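In terms of the sketch above, the standardised version corresponds simply to supplying dSES as the difference with an SD of 1, for example n_per_arm_continuous(0.5, 1.0) for a standardised effect of 0.5 (this equivalence is a property of the formula, not a separate method).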
A key component in the sample size calculation for a continuous measure is the assumed magnitude of variance. An estimate of this parameter (usually expressed as a SD) may come from a pilot trial or existing literature (see Chapter 3, Pilot studies and Review of the evidence base). It is possible to get into a ‘Gordian knot’ when looking for an estimate of the variance. Ideally, an estimate of the variance taken from a large clinical study in the intended trial population with the same interventions would be available. However, if such a study were available, a new trial would probably not be necessary. If a new trial is truly needed, that need implies some limitations in the existing evidence. To decide on the relative utility of the available variance estimates, various aspects of the source study need to be considered (e.g. study design, population, outcome, analysis conducted, etc.), in a similar manner to the control group proportion and any estimate of a realistic target difference (see Chapter 3, General considerations, Pilot studies and Review of the evidence base). 1 , 124 The accuracy of the variance estimate will obviously influence how sensitive the trial is to the assumptions made about the variance, and should inform the strategy adopted for an individual clinical trial.
A more accurate, although computationally more demanding, calculation will give a slightly different result from the formula above (see Equation 5) and is used in various sample size software. 186 The difference between the simple and more complicated formulae is that the simple calculation assumes that the population variance, σ², is known for the design and analysis of the trial. The more complicated calculation recognises that, in practice, the sample variance estimate, s², will be used when analysing the trial. The more accurate formula can be found elsewhere. 1
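One common refinement along these lines, shown below as a sketch rather than the report’s exact formula (which is given in its reference 1), replaces the normal quantiles with t quantiles and iterates until the degrees of freedom settle:

```python
import math
from scipy import stats

def n_per_arm_t(delta, sd, alpha=0.05, power=0.9):
    """Iterative per-arm sample size using t rather than normal quantiles,
    reflecting that the SD will be estimated at the analysis stage.
    A sketch of one common refinement, not the report's exact formula."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    n = 2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2   # normal-theory start
    for _ in range(25):                               # iterate to convergence
        df = max(2 * math.ceil(n) - 2, 2)             # two-sample t-test df
        t_a = stats.t.ppf(1 - alpha / 2, df)
        t_b = stats.t.ppf(power, df)
        n_new = 2 * (t_a + t_b) ** 2 * sd ** 2 / delta ** 2
        if abs(n_new - n) < 1e-6:
            break
        n = n_new
    return math.ceil(n_new)
```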
In most studies involving humans, it is likely that withdrawals, losses to follow-up and missing data will occur during the trial. 187 Individuals in a trial could decide that they no longer want to take part and completely withdraw from the trial, they could move during the study and not update the study team, and/or they could decide that they do not want to answer a particular question on a questionnaire. Even in the most well-designed and well-executed trial, some losses to follow-up are inevitable. Additionally, intercurrent events (e.g. death or change in treatment) may preclude the possibility of an outcome under the conditions implied by the trial’s aim and corresponding estimand of interest.
Irrespective of the reasons for missing data, sample sizes are frequently inflated to account for a degree of missing data during the study. The estimate of the extent of missing data is often gathered from a pilot trial, previous studies of the intervention, or trials in a similar population. In the presence of missing data, the power of a trial to detect the same target difference is reduced, hence the need for inflation of the sample size. To inflate the sample size to account for missing data, the overall sample size required, 2n, is divided by the proportion of data anticipated to be available for analysis, pob:

\[ \frac{2n}{p_{ob}}. \]

For example, if 20% attrition is anticipated, then the target sample size is divided by 0.8. A more complex and accurate approach can be used to deal with loss to follow-up over time, which is particularly pertinent for time-to-event outcomes.
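In code, this adjustment is a one-liner (the function name is illustrative):

```python
import math

def inflate_for_missing(n_per_arm, p_observed):
    """Total sample size 2n inflated for missing data, where p_observed
    is the anticipated proportion of participants with analysable data."""
    return math.ceil(2 * n_per_arm / p_observed)

# e.g. 63 per arm with 20% anticipated attrition:
# inflate_for_missing(63, 0.8) -> 158 participants in total
```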
It should be noted that adjustments such as the above deal only with the impact of missing data on precision; a substantial amount of missing data may also put the study results at risk of bias (e.g. if the reasons for attrition are related to eventual outcomes).
Owing to varying time of follow-up across study participants, it is not appropriate to analyse the proportion of participants who experience an event using logistic regression or a similar method. The analysis, and therefore the calculation of the sample size, for time-to-event data is also complicated by the fact that not all individuals will experience the event of interest. As a consequence, it is not appropriate to simply compare mean observation times directly between groups. There are three main approaches to the sample size calculation for this type of outcome:
1. compare Kaplan–Meier survival curves, using the log-rank test or one of several other similar methods
2. assume a particular model form without specifying the survival distribution [e.g. the Cox (proportional hazards) regression approach]
3. use a mathematical model for the survival times and hence for the survival curve, such as the exponential or the Weibull distributions.
For ease of discussion, the term ‘survival’ is used to refer to the non-occurrence of the event by a specific time point and does not imply restriction of the methods to looking at mortality. The first two sample size methods are much more common than the third. Neither a log-rank- nor a Cox regression-based analysis implies a specific distribution for the survival curve, and the proportion surviving at any time point during the follow-up can be estimated without having to assume one; for the purpose of the sample size calculation, however, some such assumption is typically made (see below). A target difference is inferred, explicitly or implicitly, for all of the methods. It is commonly expressed as a HR. 188 Similarly to a binary outcome, adjusted and unadjusted analyses can estimate different estimands. 189
The difference between the two groups can be expressed as a difference between the survival probabilities at a specified time point. The data can be analysed accordingly, using the Greenwood standard errors to compare survival proportions. 190 However, this is statistically not a good way to compare groups, as it depends on the chosen time point and does not use the data on survival beyond that point. A method that takes all of the observed survival times into account, such as the log-rank test, is more convenient and statistically efficient. This is a test of statistical significance that has no explicit associated estimate of the treatment effect. Despite this, a power calculation can be performed by characterising the two survival curves by their median survival time, the time when half of the population in the group is estimated to have experienced an event.
To infer information about the survival curve from the median survival time, it must be assumed that the survival curve follows a known mathematical pattern, even though this assumption may not be used in the analysis. For example, the survival curve can be (and commonly is) assumed to be an exponential decay curve. The survival proportion (πA) for treatment A at some time t can then be used to estimate the median survival time m, as follows:
\[ m = t \left( \frac{\log_e(1/2)}{\log_e(\pi_A)} \right). \]

Instead of the difference between mean times and the SD of times (which would be used if we were comparing average survival times with all participants having reached the event), we have two median survival times or, equivalently, the median survival time in one group and the difference between medians, which can be considered the target difference. This is an implicit treatment effect size, although no such estimate is produced by the log-rank test.
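As a worked sketch of this relationship under the exponential assumption (the function name is illustrative):

```python
import math

def median_survival(t, surv_prop):
    """Median survival time implied by an exponential survival curve
    passing through (t, surv_prop): m = t * ln(1/2) / ln(surv_prop)."""
    return t * math.log(0.5) / math.log(surv_prop)

# e.g. 70% survival at 1 year implies a median of about 1.94 years:
# median_survival(1.0, 0.70)
```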
Alternatively, an assumption about the difference between the survival curves, the proportional hazards assumption, can be made. This is the assumption that the ratio of the risk of an event in one group over a given short time interval, to the risk of an event in the other group over the same time interval, is constant over the follow-up period. This ratio is the HR and is the parameter estimated in Cox proportional hazards regression. The HR can be considered to represent the target difference (albeit on a relative scale). However, another parameter is still needed to characterise the survival curve, such as the median survival time in one group.
It is possible to characterise the target difference either as the difference between median survival times or the HR, or by comparing events as an absolute difference in the event rate at a specific time point. Whichever approach is taken, the median survival in the control group or some similar parameter is needed to fully and uniquely specify the target difference. The statistical power of the comparison will depend on the total number of events rather than the total number of participants. A large number of events will imply high power. Participants who do not experience an event contribute little to the power. The median survival time and the planned follow-up time enable the number of events that will occur to be estimated.
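One widely used expression of this event-driven logic is Schoenfeld’s approximation for the total number of events under 1 : 1 allocation; it is not the formula quoted at the end of this appendix, but a sketch of the same idea:

```python
import math
from scipy import stats

def required_events(hr, alpha=0.05, power=0.9):
    """Total events needed to detect a hazard ratio hr with a two-sided
    log-rank test under 1:1 allocation (Schoenfeld's approximation)."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return math.ceil(4 * (z_a + z_b) ** 2 / math.log(hr) ** 2)

# e.g. required_events(0.7) -> 331 events in total for 90% power
```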
Things become more complex if participants are recruited over a time period and then all followed up to the same calendar date. This results in widely varying follow-up times for censored cases. To allow for this, the recruitment period needs to be accounted for in the sample size calculation. If each participant will be followed for the same length of time, such as 1 year, the calculation is as if all were recruited simultaneously.
Methods for estimating the sample size usually rely on the number of events that need to be observed. The additional assumption of an exponential survival curve is typically made. Under these circumstances, the hazard, the instantaneous risk of an event, is a constant over time. The proportional hazards assumption is thus automatically satisfied. The HR can then be calculated as:
\[ \frac{\log_e(\pi_B)}{\log_e(\pi_A)} = \mathrm{HR} = \frac{m_A}{m_B}. \]

Again, under the assumption of an exponential survival distribution for both interventions, we can estimate the required number of events, eA, in one group by: