Print: Chapter 11. Statistics, Placebo Response, and Clinical Trial Desi… http://www.psychiatryonline.com/popup.aspx?aID=409252&print=yes…
1 of 20
10/05/2009 15:55
Helena C. Kraemer, Alan F. Schatzberg: Chapter 11. Statistics, Placebo Response, and Clinical Trial Design in
Psychopharmacology, in The American Psychiatric Publishing Textbook of Psychopharmacology, 4th Edition. Edited by
Alan F. Schatzberg, Charles B. Nemeroff. Copyright ©2009 American Psychiatric Publishing, Inc. DOI:
10.1176/appi.books.9781585623860.409248. Printed 5/10/2009 from www.psychiatryonline.com
Chapter 11. Statistics, Placebo Response, and Clinical Trial Design in
Psychopharmacology
STATISTICS, PLACEBO RESPONSE, AND CLINICAL TRIAL DESIGN IN
PSYCHOPHARMACOLOGY: INTRODUCTION
Drug development is a highly complex process that involves multiple steps of preclinical and clinical
pharmacological refinement and testing. Preclinical studies include assessing drug bioavailability,
metabolism, and toxicity; effects on known biological targets (e.g., receptor binding); and
performance in various animal models of pathology. After sufficient data are obtained in animal
studies, drug testing in humans can begin. In the United States, human clinical trials are divided
into four phases. Phase I involves testing multiple doses of a drug for bioavailability,
pharmacokinetics, and side effects. Phase II studies are dose-finding studies in patients with a
given disorder. They can be open-label or double-blind trials. Phase III generally includes pivotal
double-blind trials for demonstrating efficacy and safety/tolerability. Phase IV trials, which take
place after a drug has received U.S. Food and Drug Administration (FDA) approval and is on the
market, are conducted to help clarify potential uses of the drug.
Generally, efficacy is established via a randomized, controlled trial (RCT) in which a test drug is
compared with a so-called placebo and/or an active compound. An RCT is an experiment designed
to establish the efficacy of a treatment by comparing the responses in two or more groups of
patients sampled from a relevant clinical population, with one group randomly assigned to receive
the treatment of interest and a second group randomly assigned to receive a control treatment, in
which all subjects are enrolled, treated, and uniformly followed over the same time period (Meinert
1986). In contrast, experiments that are performed under any of the following circumstances are
not RCTs:
- Using human tissues or animals
- For purposes other than establishing efficacy of a treatment
- Comparing groups of patients who were selected rather than randomly assigned to receive a given
treatment (observational, or quasi-randomized, clinical trials)
- Assessing response to a treatment in absence of a control group (e.g., pre–post study designs)
- Comparing a group of patients given a treatment versus another group given another treatment
observed at different times or places (e.g., historical control subjects)
Although the basic premise underlying the RCT has remained the same over the past half century,
RCTs today differ from RCTs of the 1950s. Advances have occurred not only in statistical
methodology but also in research methods; studies are now better designed to promote the
replicability of the results and thus to protect the validity of inferences drawn from RCTs and
applied to patient populations. The basic premise underlying the RCT defines the “causal effect of
treatment (T) on an individual patient” as a comparison between a patient’s response to T with
what the outcome in that same patient would have been if that patient had not been given T (Rubin
1974, 2004). The condition proposed to represent what would have happened if that patient had
not been given T is generally called the control or comparison group (hereafter referred to as C).
Unfortunately, there is no way to assess response to any treatment for an individual patient under
two conditions, T and C, simultaneously. If a treatment were given to the same patient at different
times, the condition of the patient might change between the first and second time (secular
trends), or the response to treatment the first time may influence the response to treatment the
second time (carryover effects). Consequently, how much of any difference seen between the two
responses is attributable to T per se and how much is attributable to other, extraneous influences
cannot be ascertained. Thus, the causal effect of T in an individual patient cannot be assessed.
However, what cannot be done with an individual patient can be done with a clinical population.
Patients in a representative sample from the clinical population of interest can be randomly
assigned to receive either T or C. Under optimal conditions in a rigorously controlled study, the
average responses in the two groups are estimates of what the average responses to receiving T
and to not receiving T would be within the entire population, and the comparison between them is
an estimate of the average causal effect of T on the population sampled.
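The argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the outcome scale, the normal distributions, the 50/50 assignment, and the function name simulate_rct are all hypothetical and are not taken from the chapter. Each simulated patient has two potential outcomes, only one of which is ever observed, yet the difference between the randomized group means comes close to the average causal effect in the population.

```python
import random
import statistics


def simulate_rct(n_patients=20000, true_effect=5.0, seed=0):
    """Return (average causal effect, estimate from randomized groups).

    All numeric choices here are illustrative assumptions.
    """
    rng = random.Random(seed)
    treated, control, individual_effects = [], [], []
    for _ in range(n_patients):
        # Potential outcome if the patient does NOT receive T.
        y_control = rng.gauss(50.0, 10.0)
        # Individual causal effect of T; it varies from patient to patient.
        effect = rng.gauss(true_effect, 2.0)
        # Potential outcome if the patient DOES receive T.
        y_treated = y_control + effect
        individual_effects.append(effect)
        # Random assignment: only one potential outcome is observed.
        if rng.random() < 0.5:
            treated.append(y_treated)
        else:
            control.append(y_control)
    average_causal_effect = statistics.fmean(individual_effects)
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    return average_causal_effect, estimate


if __name__ == "__main__":
    ace, est = simulate_rct()
    print(f"average causal effect: {ace:.2f}, RCT estimate: {est:.2f}")
```

No individual effect is ever observed in the simulated trial, just as in a real one; only the group comparison recovers the average.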
In designing an optimal RCT the following issues need to be considered, all of which we address at
least briefly in this chapter:
- Specific indications and populations to be studied
- Drug formulation and doses to be used
- Route and time of administration
- Instruments to be used as outcome measures
- What the comparison groups should be
- Power analysis and statistical analysis tools
The question of what the comparison groups should be is especially important. The issue of whether
to compare the treatment with a placebo or an active control has become a contentious social and
ethical issue as well as a scientific issue, particularly regarding psychiatric patients with severe
disorders such as schizophrenia, and we focus strongly on that issue in this chapter.
INDICATIONS
An indication for a drug is really an administrative approval by a regulatory authority (in the United
States, the FDA) for a company to market and sell a product for a specific clinical need. To establish
efficacy in treating psychiatric illness, the FDA has generally required statistically significant results
in trials. Generally the requirement has been for more than one statistically significant study
outcome, although there have been exceptions to this rule. Historically, indications have revolved
around syndromes (e.g., treatment of major depression or schizophrenia). More recently, the FDA
has shown greater flexibility with approvals for drugs to treat symptom dimensions—e.g.,
agitation—that transcend underlying syndromes.
The population to be sampled is a matter of some importance because one cannot generalize the
result beyond the population the sample actually represents. Thus, for example, a study that
samples alcoholic patients but excludes those with comorbidities will generate a sample that
represents a small fraction of the population with that disorder (Humphreys and Weisner 2000;
Humphreys et al. 2005). Whether or not some advantage is found for T over C in such a
nonrepresentative subpopulation gives little insight as to whether that same advantage pertains in
the majority of people with the disorder.
The choice of population to be sampled is also crucial to differentiating efficacy from effectiveness.
Efficacy refers to the extent to which T produces a beneficial result under ideal conditions;
effectiveness refers to the extent to which T produces a beneficial result when used as it would be
in actual clinical practice (Hoagwood et al. 1995). Both issues are best addressed with RCTs. One
crucial difference between RCTs geared toward establishing efficacy and those geared toward
establishing effectiveness is the population that the RCT researchers choose to sample. RCTs
directed more toward establishing efficacy usually set very narrow inclusion/exclusion criteria,
focusing on the subpopulation of patients with the disorder who are most likely to comply and to
respond well. In contrast, RCTs directed more toward effectiveness try to use patients who are
representative of all those with the disorder who might benefit from that treatment for their
disorder. The primary reason for exclusion from effectiveness RCTs is usually that there is some
reason to believe that either T or C is contraindicated (e.g., pregnancy excludes women in many
drug studies) or the patient is unwilling to sign an informed consent form. Because the sample in an
efficacy study is chosen to be more compliant and responsive, and is typically more homogeneous
than that in an effectiveness study, sample sizes are likely to be much smaller than those in an
effectiveness study, and effect sizes much larger.
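The link between effect size and sample size can be made concrete with a standard normal-approximation formula for comparing two group means, n per group ≈ 2(z_alpha + z_beta)^2 / d^2, where d is the standardized mean difference. The sketch below assumes a two-sided test with equal group sizes; the function name n_per_group and the default values (alpha = 0.05, power = 0.80) are illustrative assumptions, not values prescribed by the chapter.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate patients needed per group (normal approximation).

    effect_size_d is the standardized mean difference between T and C.
    """
    if effect_size_d <= 0:
        raise ValueError("effect size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size_d ** 2)


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        print(f"d = {d}: about {n_per_group(d)} patients per group")
```

Under these assumptions, a large effect (d = 0.8) needs about 25 patients per group, whereas a small effect (d = 0.2) needs about 393, which is why a narrowly selected efficacy sample can be much smaller than an effectiveness sample.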
The earliest studies of a particular drug are likely to be efficacy studies in order to establish that
the drug works at least under optimal conditions. For pivotal studies, hypotheses are stated a
priori. Generally, these take the form that a tested treatment will be superior to a comparator in
reducing a given set of symptoms over a specific duration of treatment. A statistical analysis plan is
developed to test the a priori hypothesis. The most influential studies on clinical practice, however,
are likely to be effectiveness studies, for they inform clinicians about which treatments might
benefit the patients they are likely to see.
Clinical development of psychiatric agents is such that approvals for indications are usually for
adults (i.e., over the age of 18 years) and for both sexes. Specific additional approvals can be
sought for children and the elderly. Obviously, in disorders that occur primarily in childhood (e.g.,
attention-deficit/hyperactivity disorder), studies may begin with youths.
Establishing Diagnoses
Establishing an accurate diagnosis is generally done according to criteria from an established
classification system—e.g., the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition,
Text Revision (DSM-IV-TR; American Psychiatric Association 2000). Criteria can be checked off
using an unstructured clinical interview or by using a structured instrument such as the Structured Clinical
Interview for DSM-IV (SCID). The latter is more time-consuming but can offer greater information
on diagnoses, particularly in cases of psychiatric comorbidity. Obviously, the use of the SCID
requires training and assessment for interrater reliability, as well as monitoring over the course of
the trial to prevent “slippage” of diagnostic standards.
Rating Instruments
Because in psychiatry we do not yet have reliable biological measures (e.g., a serum lipid panel) to
assess efficacy, we rely on objective rating instruments applied by an observer or on self-reporting.
Before being used in a study on drug efficacy, the instruments need to be assessed for both validity
and clinical relevancy, and reliability and consistency must be established in the population to be
sampled and in the hands of the RCT research staff. This is particularly true in multisite RCTs,
where special efforts are necessary to ensure that the primary outcome measure is the same across
the sites. Lack of reliability, lack of consistency between raters, or lack of consistency across sites
in a multisite RCT is likely to lead to a failed RCT, one unable to establish statistically significant
differences between T and C.
An RCT should have only one primary outcome being measured—or very few—and all the decisions
in study design, measurement, and analysis are directed toward generating valid and powerful
tests for that primary outcome. Other outcomes can be assessed in the study, as long as they do not
compromise assessment of the primary outcome and do not impose undue burden on patients and
research staff that would lead to dropouts and diminished reliability of measures. Outcome
assessments that amplify or elucidate the primary outcome results are generally listed as
secondary outcomes. Finally, it is necessary at baseline to collect enough information to well
characterize the sample, both sociodemographically and clinically, to check the adequacy of
randomization in producing two groups assigned to T and C that are comparable at baseline.
Such baseline data also are valuable in post hoc exploratory analyses to assess the possibility of
moderators of treatment response (identification of subpopulations that have different effect sizes)
(Kraemer 2008; Kraemer et al. 2006, 2008).
During treatment, there may be repeated assessments of the primary and secondary outcomes,
which can be used in analysis of results using methods such as hierarchical modeling that generally
deal much more effectively with dropouts and missing data; such methods usually increase the
power to detect T versus C effects without requiring an increase in the sample size. Events or
changes that happen during the treatment may also be used in post hoc analysis to identify
mediators (possible mechanisms) that indicate how and why T works better than C (or does not),
which can then lead to improved treatments for evaluation in future studies.
Such descriptive statistics or the results of post hoc analyses lead not to conclusions but rather to
hypotheses that would be tested in future RCTs, but they are essential to the advancement of
treatments.
In seeking approval for an indication, observer-rated instruments are generally preferred over
self-reports. However, some studies employ self-report measures to test secondary hypotheses.
Over time, a commonly used measure may be supplanted by another that is believed to more
accurately reflect specific symptoms to be studied. Some of these scales are quite different from
each other. For example, the Montgomery-Åsberg Depression Rating Scale (MADRS) has been used
in antidepressant trials as an alternative to the Hamilton Rating Scale for Depression (Ham-D)
because the former focuses less on anxiety symptoms and more on core depression. Other scales
may be expansions on existing ones, adding items to focus on specific symptoms in a particular
disorder. An example is the use of the Positive and Negative Syndrome Scale (PANSS) for
schizophrenia. The scale is an expansion of the classic Brief Psychiatric Rating Scale (BPRS).
Detailed descriptions of rating scales can be found in a source textbook edited by Rush et al.
(2008).
Application of a validated and reliable instrument still requires that raters be trained and that a
high degree of agreement (interrater reliability) be found among raters, both cross-sectionally and
over time. This generally requires formally training raters and then testing multiple raters on
patient vignettes to establish interrater reliability. In many cases, patient interviews might be
videotaped and randomly selected videotapes submitted to independent experts to assess the
reliability and consistency of ratings. Again, this is particularly important with multisite RCTs.
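Agreement between two raters on the same set of vignettes can be summarized with a chance-corrected statistic. The chapter does not name a specific statistic; Cohen's kappa is used here as an assumed, commonly used choice for two raters assigning categorical ratings. The function name and the example ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same patients."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of patients on whom the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical ratings of ten vignettes: R = responder, N = nonresponder.
    rater_a = list("RRRRRNNNNN")
    rater_b = list("RRRRNNNNNR")
    print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

In this example the raters agree on 8 of 10 vignettes, but half of that agreement would be expected by chance, so kappa is about 0.6 rather than 0.8.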
Symptom Severity Criteria
Historically, minimal levels of symptom severity on rating scales have been incorporated into study
inclusion criteria. These criteria are aimed at ensuring that patients have at least a minimum level
of symptom severity at baseline to allow for assessing relative effects of a therapy. Issues arise in
implementing these inclusion criteria, particularly a possible inflation of rating scale scores to
ensure inclusion. This can become an important issue in studies with relatively few patients. To
counter possible severity rating inflation, several tactics have been employed, including audio- or
videotaping of entry interviews with subsequent quality review; use of self-reporting or an
interview rating done outside the study for inclusion; not specifying minimal entry symptom
severity criteria to the investigator; and ensuring that there is funding to cover the costs of
screening failures.
Whenever entry criteria specify a threshold for inclusion, there is the risk of regression to the mean in
both the T and C groups. Although regression to the mean (Campbell and Kenny 1999) in an RCT
does not bias the comparison of T versus C effects, the use of an entry criterion may focus attention
on the very subgroup in which T is least, rather than most, effective, which might lead to a failed
RCT. The alternative is to include all those in whom clinicians might use T rather than C
(effectiveness criteria) and then use post hoc moderator analyses to determine whether those
within a certain range of baseline severity are likely to have the greatest effect sizes.
Trial Duration, Drug Dosage, and Route of Administration
Protocol development requires specifying the duration of the trial a priori. The optimal duration
varies across disorders. For example, for acute treatment of major depression, commonly 6–8
weeks of active treatment are employed; for mania, 3 weeks’ duration is common. The shorter trial
length for acute mania is used for convenience to allow sufficient time to assess a novel therapy
without an unduly long duration of placebo treatment that would discourage enrollment.
Route of administration and specific dosage regimens are determined for each agent and vary
according to the agent’s physical properties (e.g., half-life, absorption) as well as the indication.
(These are frequently determined in Phase II trials.) Drugs that can be prescribed once per day are
obviously more convenient to administer than are agents that require multiple daily dosing, so
manufacturers strive for once-daily dosing.
STATISTICAL ISSUES IN PLANNING
Sampling, Randomization, and Blinding
The basic strategies of the RCT methodology are simple: appropriate patient sampling, choice of
control group, randomization, and blinded assessment. However, each basic requirement can be
satisfied in a myriad of ways, only some of which are valid in a particular RCT. What is ideal in one
RCT may be either precluded or the worst choice in another. However, if all is done appropriately in
the particular context of the RCT, nonrandom differences in response between the T and C groups
reported as a statistically significant effect of T versus C are attributable to the treatment (thus the
effect of T) rather than to biases in sampling, group assignment, measurement, or analysis.
Given that randomization is done to generate two random samples from the same population, any
postrandomization removal of patients from the sample has the potential of biasing those samples
relative to each other, and relative to the parent population. Consequently, RCT methodology also
requires analysis of results by intention to treat. Every patient randomly assigned to a group (i.e.,
with the intention to treat) must be considered in the analysis comparing responses in the T and
the C groups. In absence of an intention-to-treat analysis, when there is any postrandomization
removal of patients from the randomized sample, the trial is no longer an RCT, because it is not
based on comparison with randomized treatment and control groups. It is in this context that
repeated measures of the primary outcome over the duration of treatment are particularly useful,
for partial response data can be used to impute values for missing data or postdropout data, thus
minimizing the associated loss of power.
To ensure that response is measured exactly the same in T and C groups, either objective outcome
measures should be used or the assessor should be unaware of the treatment group to which a
subject was assigned (i.e., “blinded”).
Statistical Significance and Clinical Significance (Effect Sizes)
It must be emphasized that a statistically significant effect of T, with P <0.05, merely means that
the data satisfy minimal requirements to show a nonrandom difference between T and C. If P <0.01
or P <0.001, etc., the data satisfy more than minimal requirements, but the conclusion remains only
that there is a nonrandom difference between T and C. Thus, a statistically significant effect of T
may or may not be large, important, or of any clinical significance.
To show that a statistically significant effect of T is of clinical significance as well, descriptive
statistics and an effect size should be reported (as required by the Consolidated Standards of
Reporting Trials [CONSORT] guidelines) (Altman et al. 2001; Rennie 1996). For example, how likely
is a patient given T to have a response that is clinically preferable to the response of a patient given
C (an effect size called area under the curve [AUC]) (Acion et al. 2006; Grissom 1994; Kraemer and
Kupfer 2006)? Or how many patients would have to be given T to have one more “success” than if
they had all been given C (another effect size, called number needed to treat) (Altman et al. 2001;
Grissom and Kim 2005; Kraemer and Kupfer 2006; Wen et al. 2005)?
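Both effect sizes mentioned above can be computed directly from trial data. The sketch below makes several assumptions that are not stated in the chapter: higher scores are taken to be clinically preferable, ties are counted as one-half in the AUC, and the number needed to treat is taken as the reciprocal of the difference in success rates. The function names and example data are hypothetical.

```python
import math


def auc(treated, control):
    """Probability that a T patient has a better outcome than a C patient.

    Ties count as one-half (an assumption); higher scores are better.
    """
    wins = ties = 0
    for t in treated:
        for c in control:
            if t > c:
                wins += 1
            elif t == c:
                ties += 1
    return (wins + 0.5 * ties) / (len(treated) * len(control))


def number_needed_to_treat(success_rate_t, success_rate_c):
    """Patients to treat with T, rather than C, for one additional success."""
    difference = success_rate_t - success_rate_c
    if difference == 0:
        return math.inf  # no advantage: no finite number of patients suffices
    return 1 / difference


if __name__ == "__main__":
    print(auc([3, 5, 7, 9], [2, 4, 5, 6]))
    print(number_needed_to_treat(0.50, 0.25))
```

An AUC of 0.5 corresponds to no difference between T and C, and an AUC of 1.0 means every treated patient does better than every control patient. With success rates of 50% on T and 25% on C, four patients must be treated with T to obtain one more success than with C.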
By the time there is rationale and justification to propose an RCT to compare T versus C, it is very
unlikely that the null hypothesis of randomness is exactly true (Jones and Tukey 2000; Meehl
1967). In any case, even with the best possible RCT, there is still a 5% chance of a false-positive result
when the null hypothesis of randomness is exactly true. Thus, given a large enough sample size,
and/or given enough RCTs comparing T versus C, every T could eventually be declared statistically
significantly better than any C. Accordingly, in recent years the costs of exclusive emphasis on
statistical significance have been recognized. There is now growing emphasis on reporting effect
sizes that are clinically interpretable (Altman et al. 2001; Grissom and Kim 2005; Kraemer and
Kupfer 2006).
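Two pieces of arithmetic underlie this argument, and both can be sketched briefly. The first function assumes independent trials, each with a 5% false-positive rate under a true null hypothesis. The second assumes a two-sided z-test with known unit variance and equal group sizes, and asks what P value results if the observed standardized difference is exactly d. The function names and example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist


def prob_any_false_positive(n_trials, alpha=0.05):
    """Chance that at least one of n independent null trials is 'significant'."""
    return 1 - (1 - alpha) ** n_trials


def p_value_for_observed_difference(effect_size_d, n_per_group):
    """Two-sided P value when the observed standardized difference equals d."""
    z = effect_size_d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    for k in (1, 5, 14):
        print(f"{k} trials: {prob_any_false_positive(k):.3f}")
    for n in (100, 1000, 10000):
        p = p_value_for_observed_difference(0.05, n)
        print(f"n = {n} per group: P = {p:.4f}")
```

Under these assumptions, 14 independent trials of a completely ineffective T give a better than even chance of at least one statistically significant result. A trivially small difference (d = 0.05) is far from significant with 100 patients per group but reaches P < 0.001 with 10,000 per group, although its clinical importance has not changed.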
CHOICE OF CONTROL GROUP
Among all the myriad sampling, design, and analysis questions that need to be addressed in
proposing an RCT, the most contentious continues to be: What is C? That is, what is the appropriate
control group (or set of groups) in a particular RCT? More specifically, when is a placebo control
group the appropriate choice? If not a placebo control group, then what? These are the questions to
which we here turn our attention, questions that continue to be the focus of much disagreement
and argument, even among experts in RCT methodology or application, ranging from many who
propose that a placebo control group always be included in an RCT to some others who propose that
it never be included.
In what follows we consider a variety of options for a control group, beginning with a few options
that are, by strict definition, not RCTs. It is important to realize why these options are considered
unacceptable, for those are the same considerations that apply when considering other possible
choices for control groups, particularly that of a placebo control group. Then we discuss the choice
of the placebo control group in some detail, ending with consideration of the treatment-as-usual
(TAU) control group and the standard-of-treatment control group, the most common alternatives to
the placebo control group.
Each Patient as His or Her Own Control (Pre–Post Study Design)
The very simplest, but unacceptable, proposal is to compare patients’ conditions prior to and after
treatment with T, with no control group and no randomization (hence not an RCT). Any change in
condition beyond that consistent with random variation is taken to indicate a treatment effect.
Researchers unable to detect a statistically significant difference between T and C groups in an RCT
often report such pre–post comparisons as indicating an equally beneficial “effect” in the groups, a
misinterpretation of statistical significance.
With any treatment, even a completely inert placebo, it is very unusual, given a large enough
sample size, to not see a statistically significant change in patients’ conditions pre- and
posttreatment. Such a change is attributable not to the effect of that treatment but rather to
artifacts of various kinds that a control group is meant to “control” for, as discussed below:
The first artifact is statistical regression to the mean (Blomqvist 1986; Campbell and Kenny 1999; Davis
1976; Senn 1997; Stigler 1997). Patients are selected for participation because they have symptoms of
a disorder severe enough to require treatment (especially if there is an explicit cutoff point on severity
required for inclusion), and frequently the outcome measures evaluate severity of the same symptoms.
However, assessment of symptom severity at entry is never completely reliable. With error of
measurement due to day-to-day inconsistencies in patients’ conditions, coupled with assessor bias and
random error, patients with a false-positive assessment at entry are included in the study, but patients
in the population with a false-negative assessment at entry are excluded. In absence of any effect of
treatment on their clinical condition, those patients with a false-positive assessment at study entry are
likely to later have their symptoms be assessed as closer to their true level, creating the false
appearance of overall improvement.
Second are expectation effects. Both patients and assessors hope and expect to see improvement after
onset of treatment. As Yogi Berra is reported to have said: “If I hadn’t believed it, I wouldn’t have seen
it!” The power of suggestion is very strong, particularly in the process of recruitment into an RCT. The
informed consent process, as well as patients’ awareness of physicians’ sworn obligation to do no harm,
combined with the patients’ pain, needs, and hopes, tends to bolster expectation effects.
Third are possible secular trends both in the condition of the patients and in the measurements of
response. In some cases, patients with a disorder, receiving only the close and caring attention given in
both T and C groups, spontaneously improve. At the same time, how assessors view a patient’s response
can change over time. With blinding, these effects are the same for patients in the T and C groups.
The cumulative effect of these and other artifactual influences constitutes the placebo response,
which is experienced by those in both the T and C groups, whatever C might be, but which is
expressed in its pure form in a placebo control group. With randomization to the T and C groups and
blinded assessment, the effect attributable to T is the differential response of the T and C groups,
not the responses to T or to C separately. Thus, pre–post comparisons within a single treatment
group may be of some clinical interest, and worth reporting, but are not adequate to establish the
efficacy or effectiveness of any treatment.
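The regression-to-the-mean artifact described above can be reproduced in a simulation in which neither treatment has any effect at all. This is a sketch under assumed values: the severity scale, the entry threshold of 20, the measurement error, and the function name are all hypothetical. Patients are included only if their measured severity at entry meets the threshold, and are then randomly assigned to T or C.

```python
import random
import statistics


def simulate_entry_threshold(n_screened=40000, threshold=20.0, seed=1):
    """Mean pre-post 'improvement' per arm when no treatment has any effect."""
    rng = random.Random(seed)
    improvement = {"T": [], "C": []}
    for _ in range(n_screened):
        true_severity = rng.gauss(18.0, 4.0)
        # Entry assessment = true severity plus measurement error.
        measured_at_entry = true_severity + rng.gauss(0.0, 3.0)
        if measured_at_entry < threshold:
            continue  # excluded by the severity criterion
        arm = rng.choice(["T", "C"])  # random assignment
        # Neither arm changes true severity; only the error is redrawn.
        measured_later = true_severity + rng.gauss(0.0, 3.0)
        improvement[arm].append(measured_at_entry - measured_later)
    return {arm: statistics.fmean(values) for arm, values in improvement.items()}


if __name__ == "__main__":
    for arm, mean_change in simulate_entry_threshold().items():
        print(f"arm {arm}: mean apparent improvement = {mean_change:.2f}")
```

Both arms show an apparent improvement of similar size even though nothing was done to either, so a pre–post comparison within one arm would be misleading, while the difference between the arms stays close to zero.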
Historical Control Groups
The pre–post treatment comparison as an indicator of treatment effect does away with the RCT
requirements of a control group, of randomization, and of evaluator blinding. Another unacceptable
alternative acknowledges the necessity of a control group but tries to do away with randomization
and blinding: the historical control group.
Here, as in pre–post studies, all patients recruited into the study receive T. However, information is
obtained from clinical records and from past RCTs and other research studies on what happened to
patients who received C—a historical rather than a concurrent control group. The treatment effect is
then inferred from comparisons of the T group with the historical control group.
It has long been known that when one recruits patients from different sources (e.g., different sites,
different recruitment pools, or at different times), what is seen, even in response to the same
treatment in samples satisfying the same inclusion/exclusion criteria, is different. It is usually
hoped that the effects of T versus C within each source are much the same (generalizability) in
patients recruited from different sources, but the responses within the individual T and C groups
are expected to differ sharply in patients recruited from one source versus another. With a
historical control group, any difference seen may be due wholly or in part to site/time/context
effects rather than being due to the effect of T versus C, and the effect size that is estimated and
tested is not valid for establishing the efficacy or effectiveness of a treatment.
Withhold Treatment: Waiting List Control Groups
Yet another proposal, one consistent with RCT methodology but still questionable, is to randomly
assign patients either to the T group or to a waiting list control group. Those in the waiting list
control group are asked to refrain from seeking treatment for their disorder for the duration of the
RCT, at the end of which they are given T if they then wish to receive it.
There is an ethical problem here. Patients being recruited into an RCT suffer from some disorder for
which they seek help. It is difficult in many cases for clinical researchers to maintain clinical
equipoise—defined as “a state of genuine uncertainty on the part of the clinical investigator
regarding the comparative therapeutic merits of each arm in a trial” (Freedman 1987)—when the
choices are between using a treatment that has a rationale, and for which there is justification for a
belief that it might be helpful, and denying treatment for the duration of the study. Clinical
equipoise is not solely an issue related to ethics. Lack of clinical equipoise often has repercussions
for the scientific validity of the RCT, because when researchers are sure that T is better than the
selected C, it is difficult to ensure the scientific objectivity necessary to the design, conduct, and
analysis of the trial in order to produce a balanced and fair comparison of T versus C.
Clearly, if the disorder that the patients have is not serious, disabling, or painful; if it is unlikely to
worsen with a delay in treatment; and particularly if no treatments are available for the condition
outside the RCT that the patients would have had access to had they not been in the RCT, there is
no convincing ethical argument against withholding treatment. However, it is not the ethical
considerations but rather practical considerations that make the choice of withholding treatment
untenable.
When patients seeking help for a disorder are able to obtain treatment from clinicians in their
community, patients are less likely to volunteer for an RCT in which they take a chance of having
such help withheld. Thus, the sample that is recruited into an RCT with a waiting list control group
is likely to be nonrepresentative of the population with the disorder, often missing adequate
representation of those most seriously affected by the disorder and those most anxious to be
relieved of the effects of the disorder.
Moreover, after randomization, patients in the control group are more likely to drop out to seek
outside help, introducing bias to the RCT and loss of power. Indeed, clinicians
observing patients in the waiting list control group often relent and transfer patients from the
control group to the treatment group prior to the end of the RCT, because of clinical ethical concerns.
Technically, this is a valid RCT proposal, because there is a C (in this case, temporary withholding of
treatment) and randomization to T and C groups. However, in waiting list–controlled RCTs, it is very
difficult to blind assessment of the outcomes. Patients are often very aware that they are receiving
no treatment for their disorder and such awareness can color all measures of response.
Consequently, the problems of sample bias, measurement bias, missing data, dropouts, loss of
power, and bias in resulting effect sizes militate against choosing a waiting list control group in
most cases.
Placebo Control Groups
A placebo is defined as an inert medication or procedure, one that is selected specifically because it
cannot affect the underlying process of the disorder: a saline injection, a sugar pill. When a patient
given placebo appears to respond, such a response is often mistakenly labeled a placebo effect.
However, since this response cannot be attributed to the placebo, the more appropriate label is
placebo response. As noted above, the placebo response is experienced both by those receiving T
and those receiving C, but it is experienced in its pure form only by those in a placebo control group
(see “Each Patient as His or Her Own Control [Pre–Post Study Design]”).
Because of the artifacts associated with pre–post study designs (regression to the mean,
expectation effects, secular trends in the clinical condition of the patient or in the measurement
process, etc.), it is very common for patients randomly assigned to placebo treatment to appear to
respond favorably. In fact, it is reasonable to expect that given a large enough sample size, one will
always find a statistically significant placebo response.
There has long been discussion of how one might 1) understand the placebo response and 2)
reduce the placebo response in RCTs to make it easier to detect the efficacy or effectiveness of
treatment versus placebo. That portion of placebo response related to artifacts, however, is already
well understood. Such artifacts are controlled in well-designed RCTs, to the extent possible, by use of
reliable measures and consistent application of measurement protocols over the course of the RCT.
Clinically, the most interesting aspect of placebo response is the expectation effect in both patients
and assessors of response. Why would we want to reduce such expectation effects? Such
expectations have a great deal to do with compliance and cooperation of patients during RCTs, as
well as with their potential response to treatment. It would certainly be worthwhile to understand
on whom one might expect strong expectation effects, how and why they exist, and exactly how
they influence clinical response to treatment. But the goal would be to consider how to increase
expectation effects to maximize the efficacy or effectiveness of any treatment, rather than to
reduce them.
Like the waiting list control group, the placebo control group amounts to a withholding of treatment
for the duration of the study. However, having a placebo control group repairs many, though not all,
of the technical problems associated with a waiting list control group. One clear advantage is that
use of a placebo control group facilitates blinding of outcome measurement. Thus, the problem of
measurement bias expected in waiting list–controlled RCTs can be obviated in a placebo-controlled
trial.
In the past, many patients did not actually understand what a placebo was, and when U.S. drug
studies are conducted in Third World countries, this problem is still of concern. With the growing
sophistication of medical consumers and medical advocates, however, and with the increasing
scrutiny of how the placebo group is described in informed consent forms, this is less of a problem
today. As is the case for use of waiting list control groups, patients with the greatest impairment
from their disorder—who are most anxious to get relief, most in need of effective treatment, and
most likely to cooperate and comply with treatment—are less likely to be willing to agree to
randomization in a placebo-controlled trial. Moreover, their primary care physicians often have the
same reaction and are reluctant to recommend participation in an RCT for those most in need of
treatment. Thus, placebo-controlled trials run a strong risk of having a nonrepresentative sample
from the relevant clinical population. Accordingly, the CONSORT guidelines for reporting the results
of an RCT (Altman et al. 2001) ask that researchers provide a patient flowchart that includes how
many patients were eligible for the RCT and how many refused participation. This information sheds
light on the possible sample bias in any RCT but particularly in a placebo-controlled one.
Moreover, the problems that waiting list–controlled RCTs have with dropouts can be exacerbated
(Kemmler et al. 2005) by substituting a placebo control group. Now patients in both the active
treatment and placebo control groups are equally aware that they might be receiving a placebo.
Patients in either group who do not experience the relief from their disorder that they hope for are
likely to suspect that they are in the placebo control group, and thus are likely to drop out of the
RCT. Again, their physicians, observing no amelioration of symptoms, may add to the problem by
recommending dropping out and initiation of what they would consider effective treatment. Once
again, such concerns are reflected in the CONSORT guidelines (Altman et al. 2001) with the
requirement that researchers report the number of patients randomly assigned to each group (the
intention-to-treat samples) and the number that drop out in each group, with reasons for such
dropouts. This information is intended to help assess the possible biases that might result from
such dropouts.
The RCT requirement that all randomly assigned patients be considered in assessing the treatment
effect (an intention-to-treat analysis) in the face of any appreciable dropout is difficult to comply
with. Although imputation methods, both simple and complex, are often used, no imputation
procedure completely corrects for the impact on an RCT of dropouts or missing data that are
directly related to patients' experiences with the treatment to which they were assigned.
The fact that a placebo control often increases the risk for dropouts and missing data is
thus an important consideration.
Although missing data and dropouts remain major issues in considering the use of placebo control
groups, the major source of contention arises not from the theory of RCTs, or from the
implementation problems of RCTs, but rather from the ethics of proposing a placebo-controlled RCT
in the first place and from the clinical value of the results of a placebo-controlled RCT. To
understand these arguments, let us consider a few specific contexts in which placebo control
groups might be considered.
When Is Use of a Placebo Control Group Unarguable?
When Usual Treatment Is No Treatment
Little or no argument about the choice of placebo as the control group in an RCT occurs when the
usual treatment by clinicians in the community is no treatment at all. Then, using a placebo in the
control group mimics clinicians’ recommendations exactly in providing no treatment that can
influence the disorder.
In such cases, patients may be better off volunteering for participation in an RCT, where they have
at least a chance of being assigned to receive an effective treatment. Patients might actually
respond better to a placebo in an RCT than to absence of treatment in the community because of
one of the artifactual responses that the control group controls for. For example, the attention,
care, and scrutiny provided to patients in an RCT are usually far beyond those provided in usual
clinical practice, and that may benefit patients. For all these reasons, when usual care in the
community is no treatment, patients have strong inducement to volunteer to participate in a
placebo-controlled RCT.
Moreover, in this situation patients have little inducement to drop out of the RCT, for nothing is
available as an alternative in the community that can be better than what is offered in the placebo
control group, and what is offered in the T group may improve their condition. Thus, when usual
treatment in the community is no treatment, placebo control groups are the only choice, and an
excellent choice at that.
In Seeking FDA Approval
Even when there are treatments thought to be effective in the community for a disorder (often
other drug companies’ products), drug companies generally prefer placebo control groups in studies
directed to the FDA for drug approval.
The FDA requires that the drug company provide “proof that a drug has a therapeutic effect in at
least some patients,” and “no requirements are imposed regarding the ‘representativeness’ of a
study sample vis-à-vis the population of patients from which it is drawn” (Leber and Davis 1998, p.
179).
Moreover, the FDA does not set any criteria for a clinically significant effect, although it
generally—as noted above in the “Indications” section—requires demonstration in more than one
RCT of a statistically significant effect. Thus, drug companies prefer to use a control group in which
response is minimized to that associated with unavoidable artifacts of participating in an RCT,
because having such a control group maximizes the chance of finding a statistically significant
effect. Given that the sample size necessary to demonstrate a statistically significant effect is
smallest when the effect size is largest, the easiest way to achieve drug approval is to use very
stringent inclusion/exclusion criteria that favor recruitment into the RCT of only those most likely
to respond well to the drug (an efficacy trial) and a placebo control group that minimizes the
benefit in the control group. Pharmaceutical companies are then doing exactly what they need to do
to satisfy FDA requirements to market their drugs.
What the FDA requires or accepts depends on how it interprets the mandate from the U.S.
Congress, under which it operates, and what drug companies do is determined by what the FDA
requires and accepts. For these reasons, issues related to the use of placebo control groups in RCTs
that are performed to establish the efficacy or effectiveness of treatments in clinical populations
might well be considered separately from issues related to FDA requirements for drug approval and
drug companies’ efforts to satisfy those requirements.
However, as described above, the results of studies funded by drug companies to gain FDA approval
of an agent may not generalize to the type of patients clinicians are treating. Even within the
limited populations actually enrolled in some such studies, the clinical benefits of the drug may well
be statistically significantly better, but not clinically significantly better, than the effects of
withholding treatment (i.e., as in placebo control groups), or the drug’s effects may be both
statistically and clinically worse than the effects of whatever treatments clinicians are already
using for their patients.
Such considerations relate closely to present-day concerns about conflict-of-interest issues when
academic researchers are associated with studies funded by drug companies and explain the
requirements of many research journals and professional organizations that such conflicts of
interest be reported when results of such studies are presented.
In Medical Experiments
Another situation is a medical experiment that is not a clinical trial. Basic researchers in
pharmacology, for example, might be interested in the pharmacokinetics of a particular drug in
human subjects, not in the efficacy or effectiveness of that drug in any patient population. To test
their hypotheses, a control group, randomization, and blinding may also be necessary. However, in
many cases, the individuals recruited into such a study are not patients with a disorder but simply
volunteers, and the outcomes are focused not on the clinical benefits or harm to patients but rather
on aspects of the drug, such as rates of drug absorption and elimination. In many such situations, a
placebo control group is the only scientifically valid choice to test hypotheses. Because the focus
here is not on establishing the efficacy or effectiveness of a treatment, this is a medical experiment
but is not that specific type of medical experiment called the clinical trial. Such medical
experiments are fundamental to progress in basic science and translational research.
The issues about the use of human subjects in medical experiments, including that of using placebo
control subjects, are specific to the particular research question being addressed in a study and are
generally different from those in an RCT. These are issues that institutional review boards must
assess for each individual research project. Informed consent forms for such studies should not
describe the study as an RCT and should spell out clearly that the focus of the study is to further
biochemical or pharmacological knowledge about the drug, not (at least not in the near future) to
improve treatment for any patients with a disorder.
In Testing a Supplemental Treatment
In many cases, an RCT is performed to test whether a new treatment that is used as a supplement
to an established treatment is more effective or efficacious than the standard treatment alone. This
is a case in which there are few, if any, reservations about the use of a placebo to control for the
artifactual effects of offering two treatments rather than one. All patients would receive the
standard treatment and then be randomly assigned to supplementation with either the new
treatment or a placebo. If researchers find a statistically and clinically significant difference
between the two groups, it would represent not an effect of the new treatment alone but rather an
additive or interactive effect of the old and new treatments together. Here few people would have
ethical or scientific concerns about the use of a placebo control group.
Placebo Control Groups: Arguments Pro and Con
As we have seen, the arguments against the use of placebo control groups are seldom absolute.
Few would argue against using placebo control groups when the usual treatment is no treatment.
Few would argue against using placebo control groups when a new treatment is tested as a
supplement to an old treatment. The general argument against the use of placebo control groups
can be summarized as: There are often better alternatives for a control group than placebo control
subjects, alternatives that are better clinically and ethically, and often the scientific quality and
clinical impact of RCTs using alternative control groups are preferable.
On the other hand, Leber (2000, p. 699), for example, has argued that “in the evaluation of drugs
intended for the management of psychiatric illness, placebo control groups are indispensable, a fact
that has been recognized for almost half a century by literally generations of investigators working
in the fields of clinical pharmacology and psychopharmacology.” Yet A. Bradford Hill (1963), often
recognized as the father of RCTs, stated the following: “Is it ethical to use a placebo? The answer to
this question will depend, I suggest, upon whether there is already available an orthodox treatment
of proved or accepted use.” The arguments about placebo control groups among those most
knowledgeable about RCTs have been around as long as RCTs have been around (Rothman and
Michels 1994).
Arguments for the necessity of placebo control groups are often made on philosophical grounds
(Leber 2000) or in terms of convenience to researchers because placebo-controlled RCTs take less
time and effort and require less funding (with their smaller sample sizes). However, RCTs are
clinical trials proposed to address clinical issues, not philosophical or logistical ones. To benefit
clinical decision making, hence medical consumers and clinicians, the crucial question is this: Is T
clinically preferable to C in the population studied? If physicians in the community are using a
range of treatments that they believe to be effective, there is little value in demonstrating that a
new treatment is better than withholding treatment altogether (as in a placebo control group).
Others argue, not for the necessity of placebo control groups but for their acceptability, that
placebos do no lasting or irrevocable harm. For example, numerous studies have demonstrated that
the suicide risk among psychiatric patients receiving a placebo is not such as to preclude the ethical
use of placebos in that context (Khan et al. 2000, 2001, 2006; Storosum et al. 2005). However, the
Hippocratic oath’s requirement to “do no harm” is not restricted to lasting or irrevocable harm. It is
not the prerogative of the medical researcher to stipulate how much potential harm to a patient is
acceptable. Rather, it is the duty of the medical researcher to assure patients that, to the best of
current medical knowledge, the harm that may occur if they participate in the RCT would not be
greater than if they chose not to participate in the RCT, and may well be less.
Alternative Control Groups
Treatment-as-Usual Control Groups
For a variety of reasons, from the perspective of a patient or a clinician, the most attractive choice
for a comparison group is one that uses what clinicians are currently using: the TAU
(treatment-as-usual) control group. Here patients are recruited into the study and randomly
assigned either to T, the new treatment under study, or to the treatment they would have received
if they had not entered the RCT. The responses in both groups are monitored using the same
protocol, with evaluators blinded to group membership. The research question is whether it can be
shown, beyond reasonable doubt, that T produces better results than what community clinicians
generally achieve.
Because patients who enter such an RCT receive at least the treatment that they would have
received had they not entered the RCT (TAU), or perhaps receive a treatment for which there is a
rationale and justification for a belief that it might be better (T), such a control group is an
inducement to participation. Moreover, since dropping out of the study generally means getting
what is now the control treatment, there is little inducement to drop out. Clearly, patients and
treatment providers cannot be blinded to the treatment choice, although assessors of treatment
response can be.
Objections to this choice of control group generally come from researchers, who correctly point out
that community clinicians do not all use the same treatment, and many use treatments that are
known to be far from optimal (or use optimal treatment in a less-than-optimal fashion). Far greater
variability in response is likely in a TAU control group than there would be if one particular
treatment were specified for a control group, or if placebo were used, and greater variability than in
the T group. Consequently, the sample size has to be larger to achieve adequate power than with a
specific choice of comparator treatment or with a placebo as the control intervention.
Clearly, with TAU as the control condition, greater detail is needed in reporting results. It is
important to report which treatments were being used by community clinicians and to explore
whether some of those treatments were likely to be better or worse than T (hypotheses to perhaps
be tested in future RCTs). However, all in all, TAU is a viable choice for a control condition in an RCT
to document whether T is a better choice in general than what is already available in the
community.
Standard-of-Treatment Control Groups
Another viable option is that of selecting and delivering what, on the basis of past, independently
replicated RCTs, appears to be the most effective of the treatments in use in the community. This
allays many of the problems with TAU control groups, because now the one control treatment can
be optimally delivered by research staff, decreasing the heterogeneity within the control group.
Also, blinding can now be more complete and effective. Moreover, patients can be assured that
current medical knowledge indicates that the treatment in the control group is as good as or better
than what they would receive if they were not in the study, which again is an inducement both to
participate in the RCT and to not drop out. If there is persuasive evidence that the new treatment T
is indeed superior (or at least equivalent) to others currently available, significant findings in such
a study might well convince clinicians to use T rather than whatever treatment they had been
using.
EQUIVALENCE AND NONINFERIORITY RCTS
The RCT methodology was originally developed to establish the efficacy or effectiveness of a
treatment, that is, to determine whether T is better than C. In recent years attention has also been paid to
issues of clinical significance, as well as to noninferiority and equivalence. Misconceptions
concerning statistical significance, clinical significance, equivalence, and noninferiority have
unnecessarily confused the issue of the placebo control.
The basic requirements of an RCT still apply to equivalence and noninferiority trials: the necessity
of an appropriate control group, randomization to T versus C groups, blinded assessment of
response, and analysis by intention to treat. In the study population, there is an unknown effect
size, delta (δ), that is zero when there is absolutely no differential response between T and C, is
positive when T tends to be better than C, and is negative when C tends to be better than T. One
way of stating the purpose of an RCT is that it is meant to estimate that unknown effect size
(Borenstein 1994, 1997, 1998).
The usual two-tailed null hypothesis significance test seeks to prove beyond reasonable doubt that
δ is not zero. In designing such a study, a value of δ, the threshold of clinical significance, is
designated (say, δ*). The test is then structured so that when δ = 0, the probability of a significant
result is less than, say, 5% (the significance level). When the magnitude of δ is greater than δ*,
the probability of a significant result is greater than, say, 80% (adequate power). Given the fact
that when there is sufficient rationale and justification for proposing an RCT there is almost no
realistic chance that the effect size is exactly zero, achieving statistical significance in an RCT is
generally a matter of having a large enough sample size in a well-designed study with reliable
outcome measures. To indicate the possible clinical significance of a statistically significant finding,
the effect size and its confidence interval should be reported (as per CONSORT guidelines).
Another way of saying the same thing: with the study design described above, there is a better than
80% probability that a two-tailed 95% confidence interval for the unknown true effect size δ will
not include δ = 0, whenever the true effect size is greater than the critical value δ*.
In contrast, to show that T is clinically superior to C, one needs to show that the entire confidence
interval for that effect size is greater than δ*. To show that T and C are clinically equivalent, one
needs to show that the confidence interval lies completely between −δ* and δ*. To show the
noninferiority of T to C, one needs to show that the entire confidence interval is greater than −δ*.
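The interval rules in this paragraph can be sketched in a few lines of Python. This is an illustration only, not a procedure given in the chapter: the function name, the labels, and the example intervals are hypothetical, and noninferiority is taken in its conventional sense, namely that the whole interval lies above −δ*.

```python
def classify_interval(lower, upper, delta_star):
    """Label a confidence interval (lower, upper) for the effect size delta.

    delta_star is the threshold of clinical significance (must be positive).
    Returns a list of every label that the interval supports.
    """
    if delta_star <= 0:
        raise ValueError("delta_star must be positive")
    if lower > upper:
        raise ValueError("lower must not exceed upper")

    labels = []
    if lower > 0 or upper < 0:            # interval excludes delta = 0
        labels.append("statistically significant")
    if lower > delta_star:                # whole interval above delta*
        labels.append("clinically superior")
    if -delta_star < lower and upper < delta_star:
        labels.append("clinically equivalent")
    if lower > -delta_star:               # whole interval above -delta*
        labels.append("noninferior")
    return labels


# Hypothetical intervals, all judged against delta* = 0.5
print(classify_interval(0.6, 1.0, 0.5))    # narrow interval, far from zero
print(classify_interval(0.1, 0.3, 0.5))    # significant, yet within the margins
print(classify_interval(-0.9, 0.9, 0.5))   # wide interval: supports no claim
```

The last call shows why a wide interval that includes zero supports neither a difference nor equivalence: it fails every test at once.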
One can always demonstrate either noninferiority or equivalence, simply by using unreliable
outcome measures, allowing deviations from treatment and measurement protocols, etc. That is, a
badly conducted trial will result in an attenuated effect size between any two drugs, an effect size
closer to zero, that can almost always be labeled as a noninferior or equivalent result.
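One way to see the attenuation is through classical measurement theory, which is an assumption brought in here rather than a model stated in the chapter. If the outcome measure has reliability ρ (the share of within-group variance that is true-score variance), the mean difference is unchanged but the observed standard deviation is inflated, so the observed standardized effect is the true one multiplied by √ρ. The function name and the numbers below are hypothetical.

```python
import math


def attenuated_effect(true_d, reliability):
    """Observed standardized effect under classical measurement error.

    Assumes observed score = true score + independent error, so that
    observed SD = true SD / sqrt(reliability) and the mean difference
    is unaffected.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * math.sqrt(reliability)


# A hypothetical true effect of 0.6 measured with decreasing reliability
for reliability in (1.0, 0.8, 0.5, 0.25):
    observed = attenuated_effect(0.6, reliability)
    print(f"reliability={reliability:.2f}  observed d={observed:.3f}")
```

Under these assumptions, a real difference between two drugs shrinks toward zero as measurement quality falls, which is how a poorly run trial can drift inside an equivalence margin.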
This issue is highly relevant to the valid interpretation of study results because of a common
confusion between a result being non–statistically significant and two drugs showing equivalence.
To report a non–statistically significant result is only to admit that the sample size was not large
enough, the design not powerful enough, or the measures not reliable enough to demonstrate
beyond reasonable doubt that δ ≠ 0. That is nowhere near the same thing as reporting a
demonstration beyond reasonable doubt that −δ* < δ < δ* (i.e., equivalence). As the old saying goes:
“Absence of proof is not proof of absence.”
In particular, if one randomly assigned subjects in an RCT to one of two treatment groups (T1 or
T2) or to a placebo control group (C) and found no statistically significant difference between T1
and T2 but found statistically significant differences both between T1 and C and between T2 and C,
that tells nothing about the possible clinical equivalence of T1 and T2. All one would know is that
both T1 and T2 were shown to be better than (not necessarily even clinically superior to) placebo;
the sample was not large enough to detect whatever difference there might be between T1 and T2.
Any conclusion comparing T1 and T2 would be exactly the same whether or not the placebo control
group were included in the design. Yet many arguments for the use of a placebo control group
inappropriately reflect an effort to use the placebo control as a decoy to interpret results comparing
T1 and T2.
But finally, why would it be important to establish beyond reasonable doubt the clinical equivalence
of two treatments, particularly when such a result can be obtained through poor study design (e.g.,
choice of measurement) and execution of the RCT? Often when this question is asked, whether of a
drug company representative or of an academic researcher, a complete answer contains an
implicit reinterpretation of equivalence as superiority. For example: Two treatments might have
equivalent effects in reducing symptoms, but one might have a better side-effect profile. Or the two
treatments might have equivalent effects in terms of both symptom reduction and side effects, but
one may be far less costly or have greater ease of use than the other. In all such cases, the goal of
the RCT should be to establish the clinical preference for one treatment or the other over the
control condition by using an outcome measure sensitive to the specific ways one drug might be
clinically preferable to another. But that would then be not an equivalence study but rather the
usual type of RCT, with a primary outcome reflecting the particular way in which T is hypothesized
to be superior to C.
STATISTICAL ANALYSIS AND POWER
Threaded through all the above issues are implicit considerations of statistical analysis and power.
Every single decision about research design has some impact on the appropriate choice of analysis
and on the power to detect clinically significant effects. In fact, the major difference between
well-designed and well-executed present-day RCTs and those done 50 years ago stems from
advances in methods of statistical analysis of results and better understanding of the concept and
application of power in designing RCTs.
With the simplest possible design, randomly assigning a representative sample of patients to a
treatment (T) or control (C) condition, with a binary primary outcome (“success” versus
“failure”), the analytic method would be a 2 × 2 (treatment × outcome) χ² test. This is the least
powerful design (Cohen 1983; Kraemer 1991; Kraemer and Thiemann 1987, 1989; MacCallum et al.
2002), and the study thus requires perhaps twice, perhaps 10 times, as many patients for adequate
power compared with studies with other designs. A valid choice? Yes. A wise choice? No.
Suppose we merely substituted that binary primary outcome with a dimensional one—for example,
symptom level at the end of treatment—and proposed to use the most common RCT analysis
method: the two-sample t test. Immediately, there would be an increase of power (thus requiring a
smaller sample size for adequate power). If the goal is to detect a moderate effect (number needed
to treat [NNT] ≈ 4), for a 5% two-tailed test one would require 63 patients per group or a total of
126 patients. Detecting a small effect (NNT ≈ 9) would require 389 patients per group, and
detecting a large effect (NNT ≈ 2) would require only 26 patients per group.
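The relation between effect size and sample size can be explored with a rough calculation. The sketch below makes several assumptions that the text does not state: that "small," "moderate," and "large" correspond to standardized mean differences of 0.2, 0.5, and 0.8; that power is 80%; that a normal approximation to the t test is adequate; and that NNT is obtained from the standardized difference d as 1 / (2Φ(d/√2) − 1), one common conversion for normally distributed outcomes with equal variances. The function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-tailed two-sample test.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


def nnt_from_d(d):
    """Number needed to treat implied by a standardized mean difference d.

    Assumes normal outcomes with equal variances in the two groups.
    """
    success_rate_difference = 2 * _STD_NORMAL.cdf(d / math.sqrt(2)) - 1
    return 1 / success_rate_difference


for label, d in (("small", 0.2), ("moderate", 0.5), ("large", 0.8)):
    print(f"{label:8s} d={d:.1f}  n per group={n_per_group(d)}  "
          f"NNT={nnt_from_d(d):.1f}")
```

Under these assumptions the sketch gives per-group sizes close to the figures quoted above, though not always identical to them. Exact values depend on the power chosen and on whether the t distribution is used in place of the normal approximation.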
The two-sample t test is valid when the outcomes being measured are approximately normally
distributed with equal variances in the two groups, but many clinically meaningful outcomes have
asymmetrical distributions, have long tails, or occur with unequal variances in the two groups. Then
one might use instead the nonparametric Wilcoxon rank sum test (Mann-Whitney test). When the t
test gives valid results, the Mann-Whitney test is also valid and has quite similar power. However,
the nonparametric test is valid in many circumstances when the t test is not. This illustrates two
general principles: the choice of the outcome measure must be in accordance with the choice of the
analytic procedure, and better selection of the outcome measure has a major impact on the study’s
power and thus the necessary sample size.
If one uses any of the above for an RCT of depressed patients, most of the sample would be female;
for an RCT of schizophrenic patients, most of the sample would be male. In many cases, it is
proposed to stratify such sample populations in order to equalize the representation of males and
females. Is this wise?
If it is decided that stratification is warranted, to be valid the analytic procedure must acknowledge
that stratification. For a binary outcome, one might use a logistic regression analysis and for a
dimensional outcome, a linear regression analysis, with treatment group, stratum (here, gender),
and their interaction as independent variables in each case.
One common analytic error is to assume that the interaction does not exist and to use analysis of
covariance with gender as the covariate. If that interaction does exist in the study population and is
ignored in the analysis, it often compromises the significance level, and thus the validity of the test,
and almost inevitably reduces the power. But if the interaction is included in the analysis, care must
be taken to properly center all the independent variables to produce clinically interpretable results
(Kraemer and Blasey 2004).
If this is done correctly, the interaction test assesses whether the treatment effect in women is
different from the effect in men, and the main effect of treatment assesses whether the average
treatment effect across men and women is nonrandom. When there is an interaction effect in the
study population, the main effect of treatment in this analysis is not the same as the effect of
treatment assessed in an unstratified sample. The crucial issue then is which treatment effect is of
interest—the effect in the total population or the average effect across the subpopulations defined
by the strata.
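The distinction can be made concrete with a small numerical sketch. It assumes a saturated model with treatment and gender each coded +1/2 and -1/2 (one common way to center binary variables); under that coding the treatment coefficient equals the unweighted average of the two within-stratum effects, and the interaction coefficient equals their difference. The cell means and function names below are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: saturated 2 x 2 model with centered (+1/2, -1/2) coding.
# With that coding, the treatment coefficient is the unweighted average of the
# stratum-specific effects and the interaction coefficient is their difference.

def centered_effects(mean_t_women, mean_c_women, mean_t_men, mean_c_men):
    """Return (main effect of treatment, treatment-by-gender interaction)."""
    effect_women = mean_t_women - mean_c_women
    effect_men = mean_t_men - mean_c_men
    main_effect = (effect_women + effect_men) / 2
    interaction = effect_women - effect_men
    return main_effect, interaction


def population_effect(effect_women, effect_men, proportion_women):
    """Treatment effect in an unstratified sample with the given gender mix."""
    return proportion_women * effect_women + (1 - proportion_women) * effect_men


if __name__ == "__main__":
    # Illustrative cell means only: effect of 6 in women, 2 in men.
    main_effect, interaction = centered_effects(10, 4, 5, 3)
    print("average effect across strata:", main_effect)
    print("difference between strata:", interaction)
    print("effect in a population that is 80% women:",
          population_effect(6, 2, 0.8))
```

In this illustration the average effect across strata and the effect in the total population differ, which is exactly the choice of hypothesis described above.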
If the decision is that the sample should be stratified, the sample size needed for adequate power is
likely to increase, and the logistical difficulty of accumulating a stratified sample is likely to be
much greater. If, for example, 80% of those with the disorder of interest are women, but it is
decided that 50% of the sample in the RCT should be women, one will have to work much harder to
recruit that oversampling of men into the study. Thus, careful thought should be given to whether
the rationale and justification for stratification are strong enough to necessitate larger samples,
more complex analyses, and a shift in the hypothesis being tested.
The difficulty of such decisions is exacerbated when researchers (or reviewers) seek to control for
multiple covariates (e.g., gender, age, ethnicity, initial severity of the disorder). To truly control the
study for the effects of such variables, one stratifies the sample. However, with gender (two
possibilities), age (say, five age groups), ethnicity (say, five ethnic groups), and initial severity
(say, three levels), one has 2 x 5 x 5 x 3 = 150 strata, and one would have to recruit adequate
numbers into each stratum (for optimal power, an equal number into each stratum). If even a
minimal number of patients per stratum were specified, say, 10 per stratum (5 randomized to T and
5 to C), the minimal sample size would be 1,500!
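The arithmetic above can be checked with a few lines of code. This is only a sketch; the dictionary keys are illustrative labels for the four stratification variables named in the text.

```python
from math import prod

# Number of levels for each stratification variable, as in the example above.
levels = {"gender": 2, "age_group": 5, "ethnicity": 5, "initial_severity": 3}


def number_of_strata(levels):
    """Fully crossed strata: the product of the number of levels."""
    return prod(levels.values())


def minimal_sample_size(levels, per_stratum):
    """Total N when every stratum receives the same number of patients."""
    return number_of_strata(levels) * per_stratum


print(number_of_strata(levels))         # 150
print(minimal_sample_size(levels, 10))  # 1500
```

Adding one more five-level variable would multiply both figures by five, which shows how quickly full stratification becomes impractical.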
If these stratification variables are not very strongly associated with treatment effect, the result is
a study with less power than would be achieved with a simple design. If there are collinearities
among these variables (say, women and older patients tend to have higher initial disease
severity), the power to detect treatment effects might also be reduced. One of the least wise
decisions in RCT design is to try to control for the effects of too many variables, and many
experienced biostatisticians argue against any stratification of the sample unless the primary
hypotheses concern moderators of treatment outcome.
Researchers and review committees, however, often propose another tactic: Instead of controlling
for the effects of these baseline variables through stratification, adjust for them in a mathematical
model. Now the sample would continue to be 80% women, but the analysis would include
consideration of both treatment and gender. What then often happens in analysis is exclusion of all
interactions. Without adjustment, the two-sample t test has N – 2 degrees of freedom (the larger
the number of degrees of freedom, other things being equal, the greater the power), but with a
single covariate and its interaction with treatment, that becomes N – 4 (N – 2^2), and with four
covariates and all their interactions, it becomes N – 32 (N – 2^5).
As noted above, if such interactions exist in the study population and are excluded in the model, the
significance level may be compromised and power is almost inevitably lost. Thus, if covariates are
to be included, their interactions must be as well. Unless inclusion of those variables has a major
strengthening effect on effect size, this inevitably means a loss of power. Finally and perhaps most
important, collinearity effects resulting from associations between the variables included cost even
more power. Again, most experienced biostatisticians argue against adjusting for the effects of
baseline variables in the absence of a strong rationale and empirical justification for doing so.
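The loss of degrees of freedom described above follows a simple pattern, sketched here under two assumptions: each covariate enters the model as a single term (binary or dimensional), and every interaction among treatment and the covariates is included. The function name and the sample size of 100 are hypothetical.

```python
def residual_degrees_of_freedom(n, n_covariates):
    """Residual df for a model with treatment, n_covariates single-term
    covariates, and all of their interactions (2 ** (k + 1) parameters,
    counting the intercept)."""
    n_parameters = 2 ** (n_covariates + 1)
    return n - n_parameters


n = 100  # illustrative total sample size
for k in (0, 1, 4):
    print(k, "covariates:", residual_degrees_of_freedom(n, k))
# 0 covariates: 98   (N - 2, the two-sample t test)
# 1 covariates: 96   (N - 4)
# 4 covariates: 68   (N - 32)
```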
On the other hand, in a multisite RCT, stratification by site is built into the design and must be
included in the analysis, and even then, many researchers and reviewers choose to ignore it.
Multisite RCTs often show that site differences are a major source of variance in the outcome
measurements (MTA Cooperative Group 1999). The most convincing demonstration of the
almost-ubiquitous nature of site differences is not from an RCT but from a study of inbred strains of
mice in a genetics study (Crabbe et al. 1999) under controlled laboratory conditions. Even then, site
differences occurred. In an RCT, if samples are drawn from different sites, or in different time
spans, or at the same site at the same time but using different recruitment strategies (e.g.,
referrals from doctors versus responses to advertisement), one should always expect that these
differences will affect the primary outcome. Thus, randomization must be done within each such
stratum, and comparison of T versus C must be a pooled comparison of the within-stratum
comparisons of T versus C (Kraemer and Robinson 2005).
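A pooled within-stratum comparison might be sketched as follows. The weighting used here (the product of the two group sizes divided by their sum, within each site) is one common choice and is an assumption of this sketch, not a method prescribed in the text; the site names and outcome values are invented for illustration.

```python
from statistics import mean


def pooled_within_stratum_difference(strata):
    """Pool the T-minus-C mean differences computed separately in each stratum.

    strata maps a stratum label to a pair (t_outcomes, c_outcomes).
    Each stratum is weighted by n_T * n_C / (n_T + n_C), an assumed weighting.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for t_outcomes, c_outcomes in strata.values():
        n_t, n_c = len(t_outcomes), len(c_outcomes)
        weight = n_t * n_c / (n_t + n_c)
        weighted_sum += weight * (mean(t_outcomes) - mean(c_outcomes))
        total_weight += weight
    return weighted_sum / total_weight


# Two hypothetical sites whose outcome levels differ markedly.
sites = {
    "site_A": ([10, 12], [8, 10]),
    "site_B": ([20, 22, 24, 26], [19, 21, 23, 25]),
}
print(pooled_within_stratum_difference(sites))
```

Because T is compared with C only within each site, the large difference in outcome level between the two sites does not enter the estimate of the treatment effect.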
Thus far we have focused on assessing a single primary outcome at the end of treatment, whether a
binary success/failure or a dimensional outcome, and have recommended against using a binary
outcome. But, some would argue, some outcomes are by their nature binary: either the patient dies
or not, the patient recovers or not, the disease remits or not. Are we not then obliged to use a
binary outcome, and “take the hit” by increasing the sample size manyfold?
Outcomes such as these occur over the course of time, and at different times for different patients.
By simply reorienting the analysis to examination of the time to the event, one moves from a binary
outcome to a dimensional one. This is the situation in which survival analyses become the analytic
procedure of choice: Kaplan-Meier estimation of survival curves within each group (Kaplan and
Meier 1958), comparison of these survival curves in the T versus C groups, and use of the Cox
proportional hazards model (Andersen et al. 1985), for example, when there are strata or covariates to be
considered. Although the sample sizes for adequate power will be somewhat greater than with
other dimensional outcomes (because some patients will be censored, i.e., they will not have had
the outcome occur before the end of study), the sample size here will be much smaller than when
using a binary outcome, and more useful clinical information will be obtained.
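As a minimal sketch of the first of these tools, the Kaplan-Meier product-limit estimate can be computed directly from each patient's follow-up time and an indicator of whether the event was observed or the patient was censored. The function name and the five-patient data set are hypothetical; a real analysis would normally rely on an established statistical package.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times  : follow-up time for each patient
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, estimated survival probability) pairs, one for
    each distinct time at which at least one event occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at this time.
        while i < len(data) and data[i][0] == current_time:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((current_time, survival))
        at_risk -= n_leaving
    return curve


# Five hypothetical patients; the second patient at time 3 and the patient
# at time 8 are censored.
for time, survival in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
    print(time, round(survival, 3))
```

Censored patients contribute to the number at risk up to the time they leave the study, which is how the method uses partial information that a binary outcome would discard.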
Also, with dimensional outcome measurement, modern analytic tools can lead to increased power
without increasing sample size. For example, instead of assessing the outcome using only the
endpoint of treatment, one could assess that outcome measure at baseline and at fixed times
during the treatment period. Random regression models (also known as hierarchical models, or
growth curves; Berger 1986; deLeeuw and Kreft 1986; Ware 1985) basically model the trajectory of
response within each patient and then test whether the trajectories of response in the T group are
clinically preferable to those in the C group. Because multiple measures per patient are used to
characterize each patient’s response, reliability is increased, and thus power is increased.
Moreover, in the case of missing data or dropouts, partial data on the trajectory per patient enable
stronger imputation methods to facilitate intention-to-treat analyses. Quite aside from the multiple
statistical advantages of designing studies with repeated measures of outcomes over time, such
information is often clinically informative in guiding clinicians to recognize early those patients who
are unlikely to ever respond to a given treatment.
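The idea of characterizing each patient by a trajectory can be illustrated with a deliberately simplified two-stage sketch: fit an ordinary least-squares slope to each patient's repeated measures, then compare the mean slopes in the T and C groups. This is not a random regression (mixed-effects) fit, which would estimate all trajectories jointly and weight patients by the information they contribute; it is only an approximation, and the data and function names are hypothetical.

```python
from statistics import mean


def ols_slope(times, scores):
    """Least-squares slope of scores on times for one patient
    (requires at least two distinct measurement times)."""
    time_mean = mean(times)
    score_mean = mean(scores)
    sxx = sum((t - time_mean) ** 2 for t in times)
    sxy = sum((t - time_mean) * (y - score_mean) for t, y in zip(times, scores))
    return sxy / sxx


def mean_slope(patients):
    """Average per-patient slope; patients is a list of (times, scores)."""
    return mean([ols_slope(times, scores) for times, scores in patients])


# Hypothetical symptom scores by week; the second T patient dropped out
# after week 2 but still contributes a slope from the data available.
t_group = [([0, 1, 2, 3], [20, 17, 14, 11]), ([0, 1, 2], [22, 20, 18])]
c_group = [([0, 1, 2, 3], [21, 20, 19, 18]), ([0, 1, 2, 3], [19, 18, 17, 16])]

print("mean slope, T:", mean_slope(t_group))
print("mean slope, C:", mean_slope(c_group))
```

A steeper negative mean slope in T than in C would indicate faster symptom reduction under treatment in this illustration.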
This discussion barely scratches the surface of analytic methods available, but illustrates two
general principles:
For adequate power and to best inform clinical decision making, characterize the response of each
individual patient as precisely and concisely as possible (using reliable measures, preferably
dimensional, with repeated measures over time). That might sometimes complicate the analysis, but
analytic methods are generally available to take advantage of such precision.
Design the study to answer the primary research question, not to answer all possible questions that
might arise. Leave those to secondary or exploratory post hoc analyses. Do not stratify the study
population unless the design requires multiple sites or recruitment sources or the primary research
question is about the strata. Do not try to control or adjust for all possible influences on treatment
effect; instead, focus on controlling those factors empirically shown to strongly influence treatment
effect.
SUMMARY
The sine qua non of an RCT is that there be a control or comparison group, with an appropriate
sample of patients randomly assigned to the treatment and control groups, with blinded
assessment of outcome, and with analysis by intention to treat. Conducting an RCT is like
juggling many balls and trying to keep them all in the air at the same time: choice of control
group, randomization, blinding, sampling, treatment protocol, measurement protocol, fidelity to
the protocols during the study, analysis, and interpretation of results.
The more research questions that are to be addressed in a particular RCT, and the more complex
each research question is, the more balls are being juggled and the more slippery they are.
Moreover, as soon as one ball drops, the others are likely also to follow. Any mistake in sampling,
for example, is likely to have repercussions in analysis and interpretation of results. Poor
measurement (e.g., use of a binary primary outcome) will have an effect on design (sample size).
Improved design (e.g., repeated measures of the primary outcome) will have an effect on analysis
and interpretation of results. For this reason, the best studies result when there is a focus on the
primary research question, when all research decisions are made to protect the integrity of and
amplify the answers to that primary research question. When the effort is made to answer as many
questions as possible in one study, e.g., addressing multiple outcomes, or controlling or adjusting
for multiple variables, RCTs tend not to answer any research questions well at all.
The choice of an appropriate control group is context specific but should take into consideration
ethical and clinical as well as scientific issues. On the issue of using placebo control groups, we
distinguished among three kinds of studies: RCTs to establish the efficacy or effectiveness of a
drug, studies done for the purpose of gaining FDA approval of a drug, and randomized medical
experiments to explicate basic science questions, which are often performed with subjects who are
not patients and for purposes that have little or nothing to do with efficacy or effectiveness of
treatments.
We have discussed in detail when it is best and least controversial to use a placebo control group
(i.e., when there is no better alternative to placebo) and when it may be more appropriate to use a
TAU control group or a standard-of-care control group, particularly in cases where withholding
treatment, as would be done with placebo control subjects, raises ethical questions as well as
logistical questions.
Using placebo control groups as a foil to understand differences not seen between two active
treatments is often based on a misinterpretation of statistical significance. Moreover, use of
placebo control groups as an aid in establishing equivalence or noninferiority is questionable,
because what may appear to be equivalence is often related to poor study execution rather than to
actual equivalence between treatments. Also, it is not clear why equivalence is important to clinical
decision making.
In short, in some circumstances using a placebo control group is the only choice, and in others
using a placebo control group is the best choice. There are also circumstances in which ethical,
clinical, and scientific interests are best served by using other types of control groups. Finally, there
have been circumstances in which inclusion of placebo control groups has misled thinking about the
effects of other drugs.
REFERENCES
Acion L, Peterson JJ, Temple S, et al: Probabilistic index: an intuitive non-parametric approach to measuring
the size of treatment effects. Stat Med 25:591–602, 2006
Altman DG, Schulz KF, Moher D, et al: The revised CONSORT statement for reporting randomized trials:
explanation and elaboration. CONSORT Group (Consolidated Standards of Reporting Trials). Ann Intern Med
134:663–694, 2001
American Psychiatric Association: Diagnostic and Statistical Manual of Mental Disorders, 4th Edition, Text
Revision. Washington, DC, American Psychiatric Association, 2000
Andersen PK, Borch-Johnsen K, Deckert T, et al: A Cox regression model for the relative mortality and its
application to diabetes mellitus survival data. Biometrics 41:921–932, 1985
Berger MPF: A comparison of efficiencies of longitudinal, mixed longitudinal, and cross-sectional designs.
Journal of Educational Statistics 11:171–181, 1986
Blomqvist N: On the bias caused by regression toward the mean in studying the relation between change and
the initial value. J Clin Periodontol 13:34–37, 1986
Borenstein M: The case for confidence intervals in controlled clinical trials. Control Clin Trials 15:411–428,
1994
Borenstein M: Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol
78:5–16, 1997
Borenstein M: The shift from significance testing to effect size estimation, in Comprehensive Clinical Psychology
(Bellak AS, Hersen M, editors-in-chief), Vol 3: Research Methods (Schooler NR, volume editor). Burlington, MA,
Elsevier Science, 1998, pp 319–349
Campbell DT, Kenny DA: A Primer on Regression Artifacts. New York, Guilford, 1999
Cohen J: The cost of dichotomization. Applied Psychological Measurement 7:249–253, 1983
Crabbe JC, Wahlsten D, Dudek BC: Genetics of mouse behavior: interactions with laboratory environment.
Science 284:1670–1672, 1999
Davis CE: The effect of regression to the mean in epidemiologic and clinical studies. Am J Epidemiol
104:493–498, 1976
deLeeuw J, Kreft I: Random coefficient models for multilevel analysis. Journal of Educational Statistics
11:57–85, 1986
Freedman B: Equipoise and the ethics of clinical research. N Engl J Med 317:141–145, 1987
Grissom RJ: Probability of the superior outcome of one treatment over another. J Appl Psychol 79:314–316,
1994
Grissom RJ, Kim JJ: Effect Sizes for Research: A Broad Practical Approach. Mahwah, NJ, Lawrence Erlbaum,
2005
Hoagwood K, Hibbs E, Brent D, et al: Introduction to the special section: efficacy and effectiveness in studies
of child and adolescent psychotherapy. J Consult Clin Psychol 63:683–687, 1995
Hill AB: Medical ethics and controlled trials. BMJ 1(5337):1043–1049, 1963
Humphreys K, Weisner C: Use of exclusion criteria in selecting research patients and its effect on the
generalizability of alcohol treatment outcome studies. Am J Psychiatry 157:588–594, 2000
Humphreys K, Weingardt KR, Horst D, et al: Prevalence and predictors of research participant eligibility criteria
in alcohol treatment outcome studies, 1970–98. Addiction 100:1249–1257, 2005
Jones LV, Tukey JW: A sensible formulation of the significance test. Psychol Methods 5:411–414, 2000
Kaplan EL, Meier P: Nonparametric estimation from incomplete observations. J Am Stat Assoc 53:457–481,
562–563, 1958
Kemmler G, Hummer M, Widschwendter C, et al: Dropout rates in placebo-controlled and active-control clinical
trials of antipsychotic drugs. Arch Gen Psychiatry 62:1305–1312, 2005
Khan A, Warner HA, Brown WA: Symptom reduction and suicide risk in patients treated with placebo in
antidepressant clinical trials: an analysis of the Food and Drug Administration database. Arch Gen Psychiatry
57:311–317, 2000
Khan A, Khan SR, Leventhal RM, et al: Symptom reduction and suicide risk among patients treated with
placebo in antipsychotic clinical trials: an analysis of the Food and Drug Administration database. Am J
Psychiatry 158:1449–1454, 2001
Khan A, Kolts RL, Brodhead AE, et al: Suicide risk analysis among patients assigned to psychotropics and
placebo. Psychopharmacol Bull 39:6–14, 2006
Kraemer HC: To increase power in randomized clinical trials without increasing sample size. Psychopharmacol
Bull 27:217–224, 1991
Kraemer HC: Toward non-parametric and clinically meaningful moderators and mediators. Stat Med
27:1679–1692, 2008
Kraemer HC, Blasey C: Centring in regression analysis: a strategy to prevent errors in statistical inference. Int
J Methods Psychiatr Res 13:141–151, 2004
Kraemer HC, Kupfer DJ: Size of treatment effects and their importance to clinical research and practice. Biol
Psychiatry 59:990–996, 2006
Kraemer HC, Robinson TN: Are certain multicenter randomized clinical trial structures misleading clinical and
policy decisions? Contemp Clin Trials 26:518–529, 2005
Kraemer HC, Thiemann S: How Many Patients? Statistical Power Analysis in Research. Newbury Park, CA, Sage,
1987
Kraemer HC, Thiemann SA: A strategy to use “soft” data effectively in randomized clinical trials. J Consult Clin
Psychol 57:148–154, 1989
Kraemer HC, Frank E, Kupfer DJ: Moderators of treatment outcomes: clinical, research, and policy importance.
JAMA 296:1286–1289, 2006
Kraemer HC, Kiernan M, Essex MJ, et al: How and why criteria defining moderators and mediators differ
between the Baron and Kenny and MacArthur approaches. Health Psychol 27:S101–S108, 2008
Leber P: The use of placebo control groups in the assessment of psychiatric drugs: an historical context. Biol
Psychiatry 47:699–706, 2000
Leber PD, Davis CS: Threats to the validity of clinical trials employing enrichment strategies for sample
selection. Control Clin Trials 19:178–187, 1998
MacCallum RC, Zhang S, Preacher KJ, et al: On the practice of dichotomization of quantitative variables.
Psychol Methods 7:19–40, 2002
Meehl PE: Theory testing in psychology and physics: a methodological paradox. Philos Sci 34:103–115, 1967
Meinert CL: Clinical Trials: Design, Conduct, and Analysis. New York, Oxford University Press, 1986
MTA Cooperative Group: A 14-month randomized clinical trial of treatment strategies for
attention-deficit/hyperactivity disorder. Arch Gen Psychiatry 56:1073–1086, 1999
Rennie D: How to report randomized controlled trials: the CONSORT Statement. JAMA 276:649, 1996
Rothman KJ, Michels KB: The continuing unethical use of placebo controls. N Engl J Med 331:394–398, 1994
Rubin DB: Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol
66:688–701, 1974
Rubin DB: Teaching statistical inference for causal effects in experiments and observational studies. Journal of
Educational and Behavioral Statistics 29:343–367, 2004
Rush AJ Jr, First MB, Blacker D (eds): Handbook of Psychiatric Measures, 2nd Edition. Washington, DC,
American Psychiatric Publishing, 2008
Senn S: Regression to the mean. Stat Methods Med Res 6:99–102, 1997
Stigler SM: Regression towards the mean, historically considered. Stat Methods Med Res 6:103–114, 1997
Storosum JG, Wohlfarth T, Gispen-de Wied CC, et al: Suicide risk in placebo-controlled trials of treatment for
acute manic episode and prevention of manic-depressive episode. Am J Psychiatry 162:799–802, 2005
(Comment in: Am J Psychiatry 163:329, 2006)
Ware JH: Linear models for the analysis of longitudinal studies. Am Stat 39:95–101, 1985
Wen L, Badgett R, Cornell J: Number needed to treat: a descriptor for weighing therapeutic options. Am J
Health Syst Pharm 62:2031–2036, 2005
Copyright © 2009 American Psychiatric Publishing, Inc. All Rights Reserved.