Chapter 11. Statistics, Placebo Response, and Clinical Trial Design in Psychopharmacology

By Saif Categories: neuropsychopharmacology


Print: Chapter 11. Statistics, Placebo Response, and Clinical Trial Desi… http://www.psychiatryonline.com/popup.aspx?aID=409252&print=yes…

1 of 20

10/05/2009 15:55


Helena C. Kraemer, Alan F. Schatzberg: Chapter 11. Statistics, Placebo Response, and Clinical Trial Design in

Psychopharmacology, in The American Psychiatric Publishing Textbook of Psychopharmacology, 4th Edition. Edited by

10.1176/appi.books.9781585623860.409248. Printed 5/10/2009 from www.psychiatryonline.com


STATISTICS, PLACEBO RESPONSE, AND CLINICAL TRIAL DESIGN IN

PSYCHOPHARMACOLOGY: INTRODUCTION

Drug development is a highly complex process that involves multiple steps of preclinical and clinical

pharmacological refinement and testing. Preclinical studies include assessing drug bioavailability,

metabolism, and toxicity; effects on known biological targets (e.g., receptor binding); and

performance in various animal models of pathology. After sufficient data are obtained in animal

studies, drug testing in humans can begin. In the United States, human clinical trials are divided

into four phases. Phase I involves testing multiple doses of a drug for bioavailability,

pharmacokinetics, and side effects. Phase II studies are dose-finding studies in patients with a

given disorder. They can be open-label or double-blind trials. Phase III generally includes pivotal

double-blind trials for demonstrating efficacy and safety/tolerability. Phase IV trials, which take

place after a drug has received U.S. Food and Drug Administration (FDA) approval and is on the

market, are conducted to help clarify potential uses of the drug.

Generally, efficacy is established via a randomized, controlled trial (RCT) in which a test drug is

compared with a so-called placebo and/or an active compound. An RCT is an experiment designed

to establish the efficacy of a treatment by comparing the responses in two or more groups of

patients sampled from a relevant clinical population, with one group randomly assigned to receive

the treatment of interest and a second group randomly assigned to receive a control treatment, in

which all subjects are enrolled, treated, and uniformly followed over the same time period (Meinert

1986). In contrast, experiments that are performed under any of the following circumstances are

not RCTs:

Using human tissues or animals

For purposes other than establishing efficacy of a treatment

Comparing groups of patients who were selected rather than randomly assigned to receive a given

treatment (observational, or quasi-randomized, clinical trials)

Assessing response to a treatment in absence of a control group (e.g., pre–post study designs)

Comparing a group of patients given a treatment versus another group given another treatment

observed at different times or places (e.g., historical control subjects)

Although the basic premise underlying the RCT has remained the same over the past half century,

RCTs today differ from RCTs of the 1950s. Advances have occurred not only in statistical

methodology but also in research methods; studies are now better designed to promote the

replicability of the results and thus to protect the validity of inferences drawn from RCTs and

applied to patient populations. The basic premise underlying the RCT defines the “causal effect of

treatment (T) on an individual patient” as a comparison of a patient’s response to T with

what the outcome in that same patient would have been if that patient had not been given T (Rubin

1974, 2004). The condition proposed to represent what would have happened if that patient had

not been given T is generally called the control or comparison group (hereafter referred to as C).

Unfortunately, there is no way to assess response to any treatment for an individual patient under

two conditions, T and C, simultaneously. If a treatment were given to the same patient at different

times, the condition of the patient might change between the first and second time (secular

trends), or the response to treatment the first time may influence the response to treatment the

second time (carryover effects). Consequently, how much of any difference seen between the two

responses is attributable to T per se and how much is attributable to other, extraneous influences

cannot be ascertained. Thus, the causal effect of T in an individual patient cannot be assessed.

However, what cannot be done with an individual patient can be done with a clinical population.

Patients in a representative sample from the clinical population of interest can be randomly

assigned to receive either T or C. Under optimal conditions in a rigorously controlled study, the

average responses in the two groups are estimates of what the average responses to receiving T

and to not receiving T would be within the entire population, and the comparison between them is

an estimate of the average causal effect of T on the population sampled.
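The argument above can be illustrated with a small simulation. The sketch below is not from the chapter: the outcome scale, the effect of 5 points, and the function name `simulate_rct` are all hypothetical. It gives every simulated patient two potential outcomes (one under C, one under T), reveals only one of them according to a random assignment, and compares the difference in group means with the true average causal effect, which is knowable only because the data are simulated.

```python
import random
import statistics


def simulate_rct(n_patients=2000, true_effect=5.0, seed=0):
    """Simulate potential outcomes and a randomized comparison (hypothetical data)."""
    rng = random.Random(seed)

    # Each patient has two potential outcomes; in a real trial only one is observed.
    y_control = [rng.gauss(50.0, 10.0) for _ in range(n_patients)]
    y_treated = [y + true_effect + rng.gauss(0.0, 3.0) for y in y_control]

    # Individual causal effects are visible only inside a simulation.
    true_average_effect = statistics.mean(
        t - c for t, c in zip(y_treated, y_control)
    )

    # Randomize: each patient is assigned to T or C by a fair coin flip.
    assigned_t = [rng.random() < 0.5 for _ in range(n_patients)]
    observed_t = [t for t, a in zip(y_treated, assigned_t) if a]
    observed_c = [c for c, a in zip(y_control, assigned_t) if not a]

    estimated_effect = statistics.mean(observed_t) - statistics.mean(observed_c)
    return true_average_effect, estimated_effect


true_ace, estimate = simulate_rct()
print(f"true average causal effect: {true_ace:.2f}")
print(f"RCT estimate (mean T - mean C): {estimate:.2f}")
```

The two printed values should be close but not identical; the gap is sampling error, which shrinks as the number of randomized patients grows.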

In designing an optimal RCT the following issues need to be considered, all of which we address at

least briefly in this chapter:

Specific indications and populations to be studied

Drug formulation and doses to be used

Route and time of administration

Instruments to be used as outcome measures

What the comparison groups should be

Power analysis and statistical analysis tools

The question of what the comparison groups should be is especially important. The issue of whether

to compare the treatment with a placebo or an active control has become a contentious social and

ethical issue as well as a scientific issue, particularly regarding psychiatric patients with severe

disorders such as schizophrenia, and we focus strongly on that issue in this chapter.

INDICATIONS

An indication for a drug is really an administrative approval by a regulatory authority (in the United

States, the FDA) for a company to market and sell a product for a specific clinical need. To establish

efficacy in treating psychiatric illness, the FDA has generally required statistically significant results

in trials. Generally the requirement has been for more than one statistically significant study

outcome, although there have been exceptions to this rule. Historically, indications have revolved

around syndromes (e.g., treatment of major depression or schizophrenia). More recently, the FDA

has shown greater flexibility with approvals for drugs to treat symptom dimensions—e.g.,

agitation—that transcend underlying syndromes.

The population to be sampled is a matter of some importance because one cannot generalize the

result beyond the population the sample actually represents. Thus, for example, a study that

samples alcoholic patients but excludes those with comorbidities will generate a sample that

represents a small fraction of the population with that disorder (Humphreys and Weisner 2000;

Humphreys et al. 2005). Whether or not some advantage is found for T over C in such a

nonrepresentative subpopulation gives little insight as to whether that same advantage pertains in

the majority of people with the disorder.

The choice of population to be sampled is also crucial to differentiating efficacy from effectiveness.

Efficacy refers to the extent to which T produces a beneficial result under ideal conditions;

effectiveness refers to the extent to which T produces a beneficial result when used as it would be

in actual clinical practice (Hoagwood et al. 1995). Both issues are best addressed with RCTs. One

crucial difference between RCTs geared toward establishing efficacy and those geared toward

establishing effectiveness is the population that the RCT researchers choose to sample. RCTs

directed more toward establishing efficacy usually set very narrow inclusion/exclusion criteria,

focusing on the subpopulation of patients with the disorder who are most likely to comply and to

respond well. In contrast, RCTs directed more toward effectiveness try to use patients who are

representative of all those with the disorder who might benefit from that treatment for their

disorder. The primary reason for exclusion from effectiveness RCTs is usually that there is some

reason to believe that either T or C is contraindicated (e.g., pregnancy excludes women in many

drug studies) or the patient is unwilling to sign an informed consent form. Because the sample in an

efficacy study is chosen to be more compliant and responsive, and is typically more homogeneous

than that in an effectiveness study, sample sizes are likely to be much smaller than those in an

effectiveness study, and effect sizes much larger.
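The link between effect size and sample size can be made concrete with a standard normal-approximation formula for comparing two group means. This is a minimal sketch under stated assumptions, not the chapter's own method: it assumes a two-sided test, equal group sizes, and a standardized mean difference d; the function name `n_per_group` and the example values of d are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect a standardized difference d."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


# Hypothetical effect sizes: larger (efficacy-like) versus smaller (effectiveness-like).
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} patients per group")
```

Because the required sample size grows with 1/d², halving the expected effect size roughly quadruples the number of patients needed.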

The earliest studies of a particular drug are likely to be efficacy studies in order to establish that

the drug works at least under optimal conditions. For pivotal studies, hypotheses are stated a

priori. Generally, these take the form that a tested treatment will be superior to a comparator in

reducing a given set of symptoms over a specific duration of treatment. A statistical analysis plan is

developed to test the a priori hypothesis. The most influential studies on clinical practice, however,

are likely to be effectiveness studies, for they inform clinicians about which treatments might

benefit the patients they are likely to see.

Clinical development of psychiatric agents is such that approvals for indications are usually for

adults (i.e., over the age of 18 years) and for both sexes. Specific additional approvals can be

sought for children and the elderly. Obviously, in disorders that occur primarily in childhood (e.g.,

attention-deficit/hyperactivity disorder), studies may begin with youths.

Establishing Diagnoses

Establishing an accurate diagnosis is generally done according to criteria from an established

classification system—e.g., the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition,

Text Revision (DSM-IV-TR; American Psychiatric Association 2000). Criteria can be checked off

using an unstructured clinical interview or by using a structured instrument such as the Structured Clinical

Interview for DSM-IV (SCID). The latter is more time-consuming but can offer greater information

on diagnoses, particularly in cases of psychiatric comorbidity. Obviously, the use of the SCID

requires training and assessment for interrater reliability, as well as monitoring over the course of

the trial to prevent “slippage” of diagnostic standards.

Rating Instruments

Because in psychiatry we do not yet have reliable biological measures (comparable to, e.g., a serum lipid panel) to

assess efficacy, we rely on objective rating instruments applied by an observer or on self-reporting.

Before being used in a study on drug efficacy, the instruments need to be assessed for both validity

and clinical relevancy, and reliability and consistency must be established in the population to be

sampled and in the hands of the RCT research staff. This is particularly true in multisite RCTs,

where special efforts are necessary to ensure that the primary outcome measure is the same across

the sites. Lack of reliability, lack of consistency between raters, or lack of consistency across sites

in a multisite RCT is likely to lead to a failed RCT, one unable to establish statistically significant

differences between T and C.

An RCT should have only one primary outcome being measured—or very few—and all the decisions

in study design, measurement, and analysis are directed toward generating valid and powerful

tests for that primary outcome. Other outcomes can be assessed in the study, as long as they do not

compromise assessment of the primary outcome and do not impose undue burden on patients and

research staff that would lead to dropouts and diminished reliability of measures. Outcome

assessments that amplify or elucidate the primary outcome results are generally listed as

secondary outcomes. Finally, it is necessary at baseline to collect enough information to

characterize the sample well, both sociodemographically and clinically, and to check the adequacy of

randomization in producing two groups assigned to T and C that are comparable at baseline.

Such baseline data also are valuable in post hoc exploratory analyses to assess the possibility of

moderators of treatment response (identification of subpopulations that have different effect sizes)

(Kraemer 2008; Kraemer et al. 2006, 2008).
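One common way to check whether randomization produced comparable groups is to compute a standardized mean difference for each baseline variable. The chapter does not prescribe this particular diagnostic; the sketch below is an illustration, and the function name and the baseline ages are hypothetical.

```python
import math
import statistics


def standardized_mean_difference(t_values, c_values):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(t_values) + statistics.variance(c_values)) / 2.0
    )
    return (statistics.mean(t_values) - statistics.mean(c_values)) / pooled_sd


# Hypothetical baseline ages for the two randomized groups.
age_t = [34, 41, 29, 52, 47, 38, 45, 31]
age_c = [36, 40, 33, 50, 44, 39, 48, 30]
print(f"baseline SMD for age: {standardized_mean_difference(age_t, age_c):+.2f}")
```

A value near zero suggests the groups are similar on that variable; the same function can be applied to each baseline characteristic in turn.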

During treatment, there may be repeated assessments of the primary and secondary outcomes,

which can be used in analysis of results using methods such as hierarchical modeling that generally

deal much more effectively with dropouts and missing data; such methods usually increase the

power to detect T versus C effects without requiring an increase in the sample size. Events or

changes that happen during the treatment may also be used in post hoc analysis to identify

mediators (possible mechanisms) that indicate how and why T works better than C (or does not),

which can then lead to improved treatments for evaluation in future studies.

Such descriptive statistics or the results of post hoc analyses lead not to conclusions but rather to

hypotheses that would be tested in future RCTs, but they are essential to the advancement of

treatments.

In seeking approval for an indication, observer-rated instruments are generally preferred over

self-reports. However, some studies employ self-report measures to test secondary hypotheses.

Over time, a commonly used measure may be supplanted by another that is believed to more

accurately reflect specific symptoms to be studied. Some of these scales are quite different from

each other. For example, the Montgomery-Åsberg Depression Rating Scale (MADRS) has been used

in antidepressant trials as an alternative to the Hamilton Rating Scale for Depression (Ham-D)

because the former focuses less on anxiety symptoms and more on core depression. Other scales

may be expansions on existing ones, adding items to focus on specific symptoms in a particular

disorder. An example is the use of the Positive and Negative Syndrome Scale (PANSS) for

schizophrenia. The scale is an expansion of the classic Brief Psychiatric Rating Scale (BPRS).

Detailed descriptions of rating scales can be found in a source textbook edited by Rush et al.

(2008).

Application of a validated and reliable instrument still requires that raters be trained and that a

high degree of agreement (interrater reliability) be found among raters, both cross-sectionally and

over time. This generally requires formally training raters and then testing multiple raters on

patient vignettes to establish interrater reliability. In many cases, patient interviews might be

videotaped and randomly selected videotapes submitted to independent experts to assess the

reliability and consistency of ratings. Again, this is particularly important with multisite RCTs.
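Agreement between two raters on a categorical judgment is often summarized with Cohen's kappa, which corrects observed agreement for the agreement expected by chance. The chapter does not name a specific statistic, so the following is only one possible sketch; the vignette ratings and the function name are hypothetical, and continuous scale scores would usually call for a different index (such as an intraclass correlation).

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal rate, summed over categories.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        (counts_a[cat] / n) * (counts_b[cat] / n)
        for cat in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four patient vignettes by two trained raters.
rater_a = ["responder", "responder", "nonresponder", "nonresponder"]
rater_b = ["responder", "nonresponder", "nonresponder", "nonresponder"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Here the raters agree on three of four vignettes, but half of that agreement would be expected by chance, so kappa is well below the raw agreement rate.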

Symptom Severity Criteria

Historically, minimal levels of symptom severity on rating scales have been incorporated into study

inclusion criteria. These criteria are aimed at ensuring that patients have at least a minimum level

of symptom severity at baseline to allow for assessing relative effects of a therapy. Issues arise in

implementing these inclusion criteria, particularly a possible inflation of rating scale scores to

ensure inclusion. This can become an important issue in studies with relatively few patients. To

counter possible severity rating inflation, several tactics have been employed, including audio- or

videotaping of entry interviews with subsequent quality review; use of self-reporting or an

interview rating done outside the study for inclusion; not specifying minimal entry symptom

severity criteria to the investigator; and ensuring that there is funding to cover the costs of

screening failures.

Whenever entry criteria specify a threshold inclusion, there is the risk of regression to the mean in

both the T and C groups. Although regression to the mean (Campbell and Kenny 1999) in an RCT

does not bias the comparison of T versus C effects, the use of an entry criterion may focus attention

on the very subgroup in which T is least, rather than most, effective, which might lead to a failed

RCT. The alternative is to include all those in whom clinicians might use T rather than C

(effectiveness criteria) and then use post hoc moderator analyses to determine whether those

within a certain range of baseline severity are likely to have the greatest effect sizes.

Trial Duration, Drug Dosage, and Route of Administration

Protocol development requires specifying the duration of the trial a priori. The optimal duration

varies across disorders. For example, for acute treatment of major depression, commonly 6–8

weeks of active treatment are employed; for mania, 3 weeks’ duration is common. The shorter trial

length for acute mania is used for convenience to allow sufficient time to assess a novel therapy

without an unduly long duration of placebo treatment that would discourage enrollment.

Route of administration and specific dosage regimens are determined for each agent and vary

according to the agent’s physical properties (e.g., half-life, absorption) as well as the indication.

(These are frequently determined in Phase II trials.) Drugs that can be prescribed once per day are

obviously more convenient to administer than are agents that require multiple daily dosing, so

manufacturers strive for once-daily dosing.

STATISTICAL ISSUES IN PLANNING

Sampling, Randomization, and Blinding

The basic strategies of the RCT methodology are simple: appropriate patient sampling, choice of

control group, randomization, and blinded assessment. However, each basic requirement can be

satisfied in a myriad of ways, only some of which are valid in a particular RCT. What is ideal in one

RCT may be either precluded or the worst choice in another. However, if all is done appropriately in

the particular context of the RCT, nonrandom differences in response between the T and C groups

reported as a statistically significant effect of T versus C are attributable to the treatment (thus the

effect of T) rather than to biases in sampling, group assignment, measurement, or analysis.

Given that randomization is done to generate two random samples from the same population, any

postrandomization removal of patients from the sample has the potential of biasing those samples

relative to each other, and relative to the parent population. Consequently, RCT methodology also

requires analysis of results by intention to treat. Every patient randomly assigned to a group (i.e.,

with the intention to treat) must be considered in the analysis comparing responses in the T and

the C groups. In absence of an intention-to-treat analysis, when there is any postrandomization

removal of patients from the randomized sample, the trial is no longer an RCT, because it is not

based on comparison with randomized treatment and control groups. It is in this context that

repeated measures of the primary outcome over the duration of treatment are particularly useful,

for partial response data can be used to impute values for missing data or postdropout data, thus

minimizing the associated loss of power.
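The bias introduced by postrandomization removal can be seen in a small simulation. The sketch below is built entirely on assumptions that are not in the chapter: there is no true treatment effect, patients in the T group who are doing poorly drop out 70% of the time, and the outcomes of dropouts are still available for the intention-to-treat comparison (in practice they would have to be obtained by follow-up or imputed).

```python
import random
import statistics


def simulate_dropout_bias(n_per_group=5000, seed=1):
    """Compare intention-to-treat and completers-only estimates (hypothetical data)."""
    rng = random.Random(seed)

    # Improvement scores with NO true treatment effect: same distribution in T and C.
    t_all = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    c_all = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]

    # Assumed dropout mechanism: T patients who are doing poorly usually leave.
    t_completers = [x for x in t_all if not (x < -0.5 and rng.random() < 0.7)]

    itt_difference = statistics.mean(t_all) - statistics.mean(c_all)
    completer_difference = statistics.mean(t_completers) - statistics.mean(c_all)
    return itt_difference, completer_difference


itt, completers = simulate_dropout_bias()
print(f"intention-to-treat difference: {itt:+.2f}")
print(f"completers-only difference:    {completers:+.2f}")
```

The intention-to-treat difference stays near zero, as it should when T has no effect, whereas the completers-only comparison shows an apparent benefit created purely by who was removed from the sample.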

To ensure that response is measured exactly the same in T and C groups, either objective outcome

measures should be used or the assessor should be unaware of the treatment group to which a

subject was assigned (i.e., “blinded”).

Statistical Significance and Clinical Significance (Effect Sizes)

It must be emphasized that a statistically significant effect of T, with P <0.05, merely means that

the data satisfy minimal requirements to show a nonrandom difference between T and C. If P <0.01

or P <0.001, etc., the data satisfy more than minimal requirements, but the conclusion remains only

that there is a nonrandom difference between T and C. Thus, a statistically significant effect of T

may or may not be large, important, or of any clinical significance.
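A short calculation shows how a fixed, clinically trivial difference becomes statistically significant once the sample is large enough. The sketch assumes a two-sided z test on two group means with equal group sizes and a standardized difference of exactly 0.05; both the function name and that value are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p_value(standardized_difference, n_per_group):
    """P value of a two-sample z test for a given standardized mean difference."""
    z = standardized_difference * math.sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same tiny effect, evaluated at increasing sample sizes.
for n in (100, 1000, 10000):
    print(f"n per group = {n:>5}: P = {two_sided_p_value(0.05, n):.4f}")
```

The effect size never changes across the three rows; only the P value does, which is why the P value alone says nothing about clinical importance.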

To show that a statistically significant effect of T is of clinical significance as well, descriptive

statistics and an effect size should be reported (as required by the Consolidated Standards of

Reporting Trials [CONSORT] guidelines) (Altman et al. 2001; Rennie 1996). For example, how likely

is a patient given T to have a response that is clinically preferable to the response of a patient given

C (an effect size called area under the curve [AUC]) (Acion et al. 2006; Grissom 1994; Kraemer and

Kupfer 2006)? Or how many patients would have to be given T to have one more “success” than if

they had all been given C (another effect size, called number needed to treat) (Altman et al. 2001;

Grissom and Kim 2005; Kraemer and Kupfer 2006; Wen et al. 2005)?
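Both effect sizes follow directly from their verbal definitions. The sketch below is illustrative only: it assumes that higher scores are clinically preferable, counts ties as half a "win" when computing AUC, and takes the "success" rates in each group as given; the data and function names are hypothetical.

```python
def auc_effect_size(t_scores, c_scores):
    """Probability that a random T patient does better than a random C patient."""
    wins = sum(
        1.0 if t > c else 0.5 if t == c else 0.0
        for t in t_scores
        for c in c_scores
    )
    return wins / (len(t_scores) * len(c_scores))


def number_needed_to_treat(success_rate_t, success_rate_c):
    """Patients who must receive T rather than C to gain one additional success."""
    difference = success_rate_t - success_rate_c
    if difference <= 0:
        raise ValueError("NNT as defined here requires T to have the higher success rate")
    return 1.0 / difference


# Hypothetical improvement scores and success rates.
print(auc_effect_size([3, 5, 7, 9], [2, 4, 6, 8]))   # 10 of 16 pairs favor T
print(number_needed_to_treat(0.50, 0.25))            # one extra success per 4 treated
```

An AUC of 0.5 means T and C are indistinguishable, and values approaching 1 mean nearly every T patient fares better than nearly every C patient. In reports, NNT is usually rounded up to a whole number of patients.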

By the time there is rationale and justification to propose an RCT to compare T versus C, it is very

unlikely that the null hypothesis of randomness is exactly true (Jones and Tukey 2000; Meehl

1967). In any case, even with the best possible RCT tested at the 5% significance level, there is still a 5% chance of a false-positive result

when the null hypothesis of randomness is exactly true. Thus, given a large enough sample size,

and/or given enough RCTs comparing T versus C, every T could eventually be declared statistically

significantly better than any C. Accordingly, in recent years the costs of exclusive emphasis on

statistical significance have been recognized. There is now growing emphasis on reporting effect

sizes that are clinically interpretable (Altman et al. 2001; Grissom and Kim 2005; Kraemer and

Kupfer 2006).
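The point about repeated trials can be quantified. Assuming, purely for illustration, that the null hypothesis is exactly true, that each RCT is tested at the 5% level, and that the trials are independent, the chance that at least one of k trials is a false positive is 1 − (1 − 0.05)^k. The function name below is hypothetical.

```python
def prob_at_least_one_false_positive(n_trials, alpha=0.05):
    """Chance that one or more of n independent null trials reaches P < alpha."""
    return 1.0 - (1.0 - alpha) ** n_trials


for k in (1, 5, 14, 20):
    print(f"{k:>2} trials: {prob_at_least_one_false_positive(k):.3f}")
```

Under these assumptions the probability passes one-half at 14 trials, so a single "significant" result among many attempts is weak evidence by itself.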


CHOICE OF CONTROL GROUP

Among all the myriad sampling, design, and analysis questions that need to be addressed in

proposing an RCT, the most contentious continues to be: What is C?; that is, what is the appropriate

control group (or set of groups) in a particular RCT? More specifically, when is a placebo control

group the appropriate choice? If not a placebo control group, then what? These are the questions to

which we here turn our attention, questions that continue to be the focus of much disagreement

and argument, even among experts in RCT methodology or application, ranging from many who

propose that a placebo control group always be included in an RCT to some others who propose that

it never be included.

In what follows we consider a variety of options for a control group, beginning with a few options

that are, by strict definition, not RCTs. It is important to realize why these options are considered

unacceptable, for those are the same considerations that apply when considering other possible

choices for control groups, particularly that of a placebo control group. Then we discuss the choice

of the placebo control group in some detail, ending with consideration of the treatment-as-usual

(TAU) control group and the standard-of-treatment control group, the most common alternatives to

the placebo control group.

Each Patient as His or Her Own Control (Pre–Post Study Design)

The very simplest, but unacceptable, proposal is to compare patients’ conditions prior to and after

treatment with T, with no control group and no randomization (hence not an RCT). Any change in

condition beyond that consistent with random variation is taken to indicate a treatment effect.

Researchers unable to detect a statistically significant difference between T and C groups in an RCT

often report such pre–post comparisons as indicating an equally beneficial “effect” in the groups, a

misinterpretation of statistical significance.

With any treatment, even a completely inert placebo, it is very unusual, given a large enough

sample size, to not see a statistically significant change in patients’ conditions pre- and

posttreatment. Such a change is attributable not to the effect of that treatment but rather to

artifacts of various kinds that a control group is meant to “control” for, as discussed below:

The first artifact is statistical regression to the mean (Blomqvist 1986; Campbell and Kenny 1999; Davis

1976; Senn 1997; Stigler 1997). Patients are selected for participation because they have symptoms of

a disorder severe enough to require treatment (especially if there is an explicit cutoff point on severity

required for inclusion), and frequently the outcome measures evaluate severity of the same symptoms.

However, assessment of symptom severity at entry is never completely reliable. With error of

measurement due to day-to-day inconsistencies in patients’ conditions, coupled with assessor bias and

random error, patients with a false-positive assessment at entry are included in the study, but patients

in the population with a false-negative assessment at entry are excluded. In absence of any effect of

treatment on their clinical condition, those patients with a false-positive assessment at study entry are

likely to later have their symptoms be assessed as closer to their true level, creating the false

appearance of overall improvement.

Second are expectation effects. Both patients and assessors hope and expect to see improvement after

onset of treatment. As Yogi Berra is reported to have said: “If I hadn’t believed it, I wouldn’t have seen

it!” The power of suggestion is very strong, particularly in the process of recruitment into an RCT. The

informed consent process, as well as patients’ awareness of physicians’ sworn obligation to do no harm,

combined with the patients’ pain, needs, and hopes, tends to bolster expectation effects.

Third are possible secular trends both in the condition of the patients and in the measurements of

response. In some cases, patients with a disorder, receiving only the close and caring attention given in

both T and C groups, spontaneously improve. At the same time, how assessors view a patient’s response

can change over time. With blinding, these effects are the same for patients in the T and C groups.

The cumulative effect of these and other artifactual influences constitutes the placebo response,

which is experienced by those in both the T and C groups, whatever C might be, but which is

expressed in its pure form in a placebo control group. With randomization to the T and C groups and

blinded assessment, the effect attributable to T is the differential response of the T and C groups,

not the responses to T or to C separately. Thus, pre–post comparisons within a single treatment

group may be of some clinical interest, and worth reporting, but are not adequate to establish the

efficacy or effectiveness of any treatment.
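Regression to the mean, and the reason a randomized comparison is protected from it, can be demonstrated by simulation. The following sketch rests on hypothetical numbers throughout: a severity scale with mean 20, measurement error with a standard deviation of 3 points, an entry cutoff of 22, and no treatment effect of any kind in either group.

```python
import random
import statistics


def simulate_regression_to_mean(n_screened=20000, cutoff=22.0, seed=2):
    """Pre-post change and T-versus-C difference when treatment does nothing."""
    rng = random.Random(seed)
    changes_t, changes_c = [], []
    for _ in range(n_screened):
        true_severity = rng.gauss(20.0, 4.0)
        baseline = true_severity + rng.gauss(0.0, 3.0)   # unreliable entry rating
        if baseline < cutoff:
            continue                                     # fails the severity criterion
        followup = true_severity + rng.gauss(0.0, 3.0)   # true severity is unchanged
        improvement = baseline - followup
        # Randomize included patients to T or C; neither has any real effect.
        (changes_t if rng.random() < 0.5 else changes_c).append(improvement)

    pre_post_change = statistics.mean(changes_t + changes_c)
    t_vs_c_difference = statistics.mean(changes_t) - statistics.mean(changes_c)
    return pre_post_change, t_vs_c_difference


pre_post, t_vs_c = simulate_regression_to_mean()
print(f"mean pre-post 'improvement' with no treatment effect: {pre_post:.2f}")
print(f"T minus C difference in improvement:                  {t_vs_c:+.2f}")
```

Every included patient appears to improve on average even though nothing changed, yet the artifact is the same in both arms, so the difference between T and C remains near zero.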

Historical Control Groups

The pre–post treatment comparison as an indicator of treatment effect does away with the RCT

requirements of a control group, of randomization, and of evaluator blinding. Another unacceptable

alternative acknowledges the necessity of a control group but tries to do away with randomization

and blinding: the historical control group.

Here, as in pre–post studies, all patients recruited into the study receive T. However, information is

obtained from clinical records and from past RCTs and other research studies on what happened to

patients who received C—a historical rather than a concurrent control group. The treatment effect is

then inferred from comparisons of the T group with the historical control group.

It has long been known that when one recruits patients from different sources (e.g., different sites,

different recruitment pools, or at different times), what is seen, even in response to the same

treatment in samples satisfying the same inclusion/exclusion criteria, is different. It is usually

hoped that the effects of T versus C within each source are much the same (generalizability) in

patients recruited from different sources, but the responses within the individual T and C groups

are expected to differ sharply in patients recruited from one source versus another. With a

historical control group, any difference seen may be due wholly or in part to site/time/context

effects rather than being due to the effect of T versus C, and the effect size that is estimated and

tested is not valid for establishing the efficacy or effectiveness of a treatment.

Withhold Treatment: Waiting List Control Groups

Yet another proposal, one consistent with RCT methodology but still questionable, is to randomly

assign patients either to the T group or to a waiting list control group. Those in the waiting list

control group are asked to refrain from seeking treatment for their disorder for the duration of the

RCT, at the end of which they are given T if they then wish to receive it.

There is an ethical problem here. Patients being recruited into an RCT suffer from some disorder for

which they seek help. It is difficult in many cases for clinical researchers to maintain clinical

equipoise—defined as “a state of genuine uncertainty on the part of the clinical investigator

regarding the comparative therapeutic merits of each arm in a trial” (Freedman 1987)—when the

choices are between using a treatment that has a rationale, and for which there is justification for a

belief that it might be helpful, and denying treatment for the duration of the study. Clinical

equipoise is not solely an issue related to ethics. Lack of clinical equipoise often has repercussions

for the scientific validity of the RCT, because when researchers are sure that T is better than the

selected C, it is difficult to ensure the scientific objectivity necessary to the design, conduct, and

analysis of the trial in order to produce a balanced and fair comparison of T versus C.

Clearly, if the disorder that the patients have is not serious, disabling, or painful; if it is unlikely to

worsen with a delay in treatment; and particularly if no treatments are available for the condition

outside the RCT that the patients would have had access to had they not been in the RCT, there is

no convincing ethical argument against withholding treatment. However, it is not the ethical

considerations but rather practical considerations that make the choice of withholding treatment

untenable.

When patients seeking help for a disorder are able to obtain treatment from clinicians in their

community, patients are less likely to volunteer for an RCT in which they take a chance of having

such help withheld. Thus, the sample that is recruited into an RCT with a waiting list control group

is likely to be nonrepresentative of the population with the disorder, often missing adequate

representation of those most seriously affected by the disorder and those most anxious to be

relieved of the effects of the disorder.

Moreover, after randomization, patients in the control group are more likely to drop out to seek

outside help, introducing bias to the RCT and loss of power. Indeed, in many cases, clinicians

observing patients in the waiting list control group often relent and transfer patients from the

control group to treatment group prior to the end of the RCT, because of clinical ethical concerns.

Technically, this is a valid RCT proposal, because there is a C (in this case, temporary withholding of

treatment) and randomization to T and C groups. However, in waiting list–controlled RCTs, it is very

difficult to blind assessment of the outcomes. Patients are often very aware that they are receiving

no treatment for their disorder and such awareness can color all measures of response.

Consequently, the problems of sample bias, measurement bias, missing data, dropouts, loss of

power, and bias in resulting effect sizes militate against choosing a waiting list control group in

most cases.

Placebo Control Groups

A placebo is defined as an inert medication or procedure, one that is selected specifically because it

cannot affect the underlying process of the disorder: a saline injection, a sugar pill. When a patient

given placebo appears to respond, such a response is often mistakenly labeled a placebo effect.

However, since this response cannot be attributed to the placebo, the more appropriate label is

placebo response. As noted above, the placebo response is experienced both by those receiving T

and those receiving C, but it is experienced in its pure form only by those in a placebo control group

(see “Each Patient as His or Her Own Control [Pre–Post Study Design]”).

Because of the artifacts associated with pre–post study designs (regression to the mean,

expectation effects, secular trends in the clinical condition of the patient or in the measurement

process, etc.), it is very common for patients randomly assigned to placebo treatment to appear to

respond favorably. In fact, it is reasonable to expect that given a large enough sample size, one will

always find a statistically significant placebo response.
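
The contribution of regression to the mean alone can be illustrated with a small simulation. The sketch below is not drawn from any trial: the severity scale, the entry cutoff, and the noise levels are hypothetical values chosen for illustration. Patients are enrolled only if a noisy baseline score reaches the cutoff and are then remeasured with no treatment effect of any kind.

```python
import random
from statistics import mean


def simulate_placebo_arm(n_screened=5000, cutoff=25.0, seed=0):
    """Simulate pre-post change in an untreated (placebo) arm.

    Hypothetical model: each patient has a stable true severity, and every
    measurement adds independent noise. Patients are enrolled only if the
    baseline score is at or above the entry cutoff.
    """
    rng = random.Random(seed)
    changes = []
    for _ in range(n_screened):
        true_severity = rng.gauss(20.0, 5.0)
        baseline = true_severity + rng.gauss(0.0, 4.0)
        if baseline < cutoff:
            continue  # screened out: not symptomatic enough at entry
        # Follow-up is measured with fresh noise and no treatment effect.
        followup = true_severity + rng.gauss(0.0, 4.0)
        changes.append(followup - baseline)
    return changes


changes = simulate_placebo_arm()
print(f"enrolled: {len(changes)}, mean change in severity: {mean(changes):.2f}")
```

Under these assumptions the enrolled patients improve on average although nothing was done for them, because selection on a high baseline score also selects for unusually unfavorable measurement error at baseline. With enough enrolled patients, such a mean change will reach statistical significance.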

There has long been discussion of how one might 1) understand the placebo response and 2)

reduce the placebo response in RCTs to make it easier to detect the efficacy or effectiveness of

treatment versus placebo. That portion of placebo response related to artifacts, however, is already

well understood. These artifacts are controlled in well-designed RCTs, to the extent possible, by use of

reliable measures and consistent application of measurement protocols over the course of the RCT.

Clinically, the most interesting aspect of placebo response is the expectation effect in both patients

and assessors of response. Why would we want to reduce such expectation effects? Such

expectations have a great deal to do with compliance and cooperation of patients during RCTs, as

well as with their potential response to treatment. It would certainly be worthwhile to understand

on whom one might expect strong expectation effects, how and why they exist, and exactly how

they influence clinical response to treatment. But the goal would be to consider how to increase

expectation effects to maximize the efficacy or effectiveness of any treatment, rather than to

reduce them.

Like the waiting list control group, the placebo control group amounts to a withholding of treatment

for the duration of the study. However, having a placebo control group repairs many, though not all,

of the technical problems associated with a waiting list control group. One clear advantage is that

use of a placebo control group facilitates blinding of outcome measurement. Thus, the problem of

measurement bias expected in waiting list–controlled RCTs can be obviated in a placebo-controlled

trial.

In the past, many patients did not actually understand what a placebo was, and when U.S. drug

studies are conducted in Third World countries, this problem is still of concern. With the growing

sophistication of medical consumers and medical advocates, however, and with the increasing

scrutiny of how the placebo group is described in informed consent forms, this is less of a problem

today. As is the case for use of waiting list control groups, patients with the greatest impairment

from their disorder (who are most anxious to get relief, most in need of effective treatment, and

most likely to cooperate and comply with treatment) are less likely to be willing to agree to

randomization in a placebo-controlled trial. Moreover, their primary care physicians often have the

same reaction and are reluctant to recommend participation in an RCT for those most in need of

treatment. Thus, placebo-controlled trials run a strong risk of having a nonrepresentative sample

from the relevant clinical population. Accordingly, the CONSORT guidelines for reporting the results

of an RCT (Altman et al. 2001) ask that researchers provide a patient flowchart that includes how

many patients were eligible for the RCT and how many refused participation. This information sheds

light on the possible sample bias in any RCT but particularly in a placebo-controlled one.

Moreover, the problems that waiting list–controlled RCTs have with dropouts can be exacerbated

(Kemmler et al. 2005) by substituting a placebo control group. Now patients in both the active

treatment and placebo control groups are equally aware that they might be receiving a placebo.

Patients in either group who do not experience the relief from their disorder that they hope for are

likely to suspect that they are in the placebo control group, and thus are likely to drop out of the

RCT. Again, their physicians, observing no amelioration of symptoms, may add to the problem by

recommending dropping out and initiation of what they would consider effective treatment. Once

again, such concerns are reflected in the CONSORT guidelines (Altman et al. 2001) with the

requirement that researchers report the number of patients randomly assigned to each group (the

intention-to-treat samples) and the number that drop out in each group, with reasons for such

dropouts. This information is intended to help assess the possible biases that might result from

such dropouts.

The RCT requirement that all randomly assigned patients be considered in assessing the treatment

effect (an intention-to-treat analysis) in the face of any appreciable dropout is difficult to comply

with. Although imputation methods, both simple and complex, are often used, there is no

imputation procedure that completely corrects for the impact of dropouts or missing data on an RCT

that are directly related to the experiences of patients with the treatment that they were assigned

to receive. The fact that a placebo control often increases the risk for dropouts and missing data is

thus an important consideration.

Although missing data and dropouts remain major issues in considering the use of placebo control

groups, the major source of contention arises not from the theory of RCTs, or from the

implementation problems of RCTs, but rather from the ethics of proposing a placebo-controlled RCT

in the first place and from the clinical value of the results of a placebo-controlled RCT. To

understand these arguments, let us consider a few specific contexts in which placebo control

groups might be considered.

When Is Use of a Placebo Control Group Unarguable?

When Usual Treatment Is No Treatment

Little or no argument about the choice of placebo as the control group in an RCT occurs when the

usual treatment by clinicians in the community is no treatment at all. Then, using a placebo in the

control group mimics clinicians’ recommendations exactly in providing no treatment that can

influence the disorder.

In such cases, patients may be better off volunteering for participation in an RCT, where they have

at least a chance of being assigned to receive an effective treatment. Patients might actually

respond better to a placebo in an RCT than to absence of treatment in the community because of

one of the artifactual responses that the control group controls for. For example, the attention,

care, and scrutiny provided to patients in an RCT are usually far beyond those provided in usual

clinical practice, and that may benefit patients. For all these reasons, when usual care in the

community is no treatment, patients have strong inducement to volunteer to participate in a

placebo-controlled RCT.

Moreover, in this situation patients have little inducement to drop out of the RCT, for nothing is

available as an alternative in the community that can be better than what is offered in the placebo

control group, and what is offered in the T group may improve their condition. Thus, when usual

treatment in the community is no treatment, placebo control groups are the only choice, and an

excellent choice at that.

In Seeking FDA Approval

Even when there are treatments thought to be effective in the community for a disorder (often

other drug companies’ products), drug companies generally prefer placebo control groups in studies

directed to the FDA for drug approval.

The FDA requires that the drug company provide “proof that a drug has a therapeutic effect in at

least some patients,” and “no requirements are imposed regarding the ‘representativeness’ of a

study sample vis-à-vis the population of patients from which it is drawn” (Leber and Davis 1998, p.

179).

Moreover, the FDA does not set any criteria for a clinically significant effect, although it

generally—as noted above in the “Indications” section—requires demonstration in more than one

RCT of a statistically significant effect. Thus, drug companies prefer to use a control group in which

response is minimized to that associated with unavoidable artifacts of participating in an RCT,

because having such a control group maximizes the chance of finding a statistically significant

effect. Given that the sample size necessary to demonstrate a statistically significant effect is

smallest when the effect size is largest, the easiest way to achieve drug approval is to use very

stringent inclusion/exclusion criteria that favor recruitment into the RCT of only those most likely

to respond well to the drug (an efficacy trial) and a placebo control group that minimizes the

benefit in the control group. Pharmaceutical companies are then doing exactly what they need to do

to satisfy FDA requirements to market their drugs.

What the FDA requires or accepts depends on how it interprets the mandate from the U.S.

Congress, under which it operates, and what drug companies do is determined by what the FDA

requires and accepts. For these reasons, issues related to the use of placebo control groups in RCTs

that are performed to establish the efficacy or effectiveness of treatments in clinical populations

might well be considered separately from issues related to FDA requirements for drug approval and

drug companies’ efforts to satisfy those requirements.

However, as described above, the results of studies funded by drug companies to gain FDA approval

of an agent may not generalize to the type of patients clinicians are treating. Even within the

limited populations actually studied in some such studies, the clinical benefits of the drug may well

be statistically significantly better, but not clinically significantly better, than the effects of

withholding treatment (i.e., as in placebo control groups), or the drug’s effects may be both

statistically and clinically worse than the effects of whatever treatments clinicians are already

using for their patients.

Such considerations relate closely to present-day concerns about conflict-of-interest issues when

academic researchers are associated with studies funded by drug companies and explain the

requirements of many research journals and professional organizations that such conflicts of

interests be reported when presenting results of such studies.

In Medical Experiments

Another situation is a medical experiment that is not a clinical trial. Basic researchers in

pharmacology, for example, might be interested in the pharmacokinetics of a particular drug in

human subjects, not in the efficacy or effectiveness of that drug in any patient population. To test

their hypotheses, a control group, randomization, and blinding may also be necessary. However, in

many cases, the individuals recruited into such a study are not patients with a disorder but simply

volunteers, and the outcomes are focused not on the clinical benefits or harm to patients but rather

on aspects of the drug, such as rates of drug absorption and elimination. In many such situations, a

placebo control group is the only scientifically valid choice to test hypotheses. Because the focus

here is not on establishing the efficacy or effectiveness of a treatment, this is a medical experiment

but is not that specific type of medical experiment called the clinical trial. Such medical

experiments are fundamental to progress in basic science and translational research.

The issues about the use of human subjects in medical experiments, including that of using placebo

control subjects, are specific to the particular research question being addressed in a study and are

generally different from those in an RCT. These are issues that institutional review boards must

assess for each individual research project. Informed consent forms for such studies should not

describe the study as an RCT and should spell out clearly that the focus of the study is to further

biochemical or pharmacological knowledge about the drug, not (at least not in the near future) to

improve treatment for any patients with a disorder.

In Testing a Supplemental Treatment

In many cases, an RCT is performed to test whether a new treatment that is used as a supplement

to an established treatment is more effective or efficacious than the standard treatment alone. This

is a case in which there are few, if any, reservations to the use of a placebo to control the

artifactual effects of offering two treatments rather than one. All patients would receive the

standard treatment and then be randomly assigned to supplementation with either the new

treatment or a placebo. If researchers find a statistically and clinically significant difference

between the two groups, it would represent not an effect of the new treatment but rather an

additive or interactive effect of the old and new treatments together. Here few people would have

ethical or scientific concerns about the use of a placebo control group.

Placebo Control Groups: Arguments Pro and Con

As we have seen, the arguments against the use of placebo control groups are seldom absolute.

Few would argue against using placebo control groups when the usual treatment is no treatment.

Few would argue against using placebo control groups when a new treatment is tested as a

supplement to an old treatment. The general argument against the use of placebo control groups

can be summarized as: There are often better alternatives for a control group than placebo control

subjects, alternatives that are better clinically and ethically, and often the scientific quality and

clinical impact of RCTs using alternative control groups are preferable.

On the other hand, Leber (2000, p. 699), for example, has argued that “in the evaluation of drugs

intended for the management of psychiatric illness, placebo control groups are indispensable, a fact

that has been recognized for almost half a century by literally generations of investigators working

in the fields of clinical pharmacology and psychopharmacology.” Yet A. Bradford Hill (1963), often

recognized as the father of RCTs, stated the following: “Is it ethical to use a placebo? The answer to

this question will depend, I suggest, upon whether there is already available an orthodox treatment

of proved or accepted use.” The arguments about placebo control groups among those most

knowledgeable about RCTs have been around as long as RCTs have been around (Rothman and

Michels 1994).

Arguments for the necessity of placebo control groups are often made on philosophical grounds

(Leber 2000) or in terms of convenience to researchers because placebo-controlled RCTs take less

time and effort and require less funding (with their smaller sample sizes). However, RCTs are

clinical trials proposed to address clinical issues, not philosophical or logistical ones. To benefit

clinical decision making, hence medical consumers and clinicians, the crucial question is this: Is T

clinically preferable to C in the population studied? If physicians in the community are using a

range of treatments that they believe to be effective, there is little value in demonstrating that a

new treatment is better than withholding treatment altogether (as in a placebo control group).

Others argue, not for the necessity of placebo control groups but for their acceptability, that

placebos do no lasting or irrevocable harm. For example, numerous studies have demonstrated that

the suicide risk among psychiatric patients receiving a placebo is not such as to preclude the ethical

use of placebos in that context (Khan et al. 2000, 2001, 2006; Storosum et al. 2005). However, the

Hippocratic oath’s requirement to “do no harm” is not restricted to lasting or irrevocable harm. It is

not the prerogative of the medical researcher to stipulate how much potential harm to a patient is

acceptable. Rather, it is the duty of the medical researcher to assure patients that, to the best of

current medical knowledge, the harm that may occur if they participate in the RCT would not be

greater than if they chose not to participate in the RCT, and may well be less.

Alternative Control Groups

Treatment-as-Usual Control Groups

For a variety of reasons, from the perspective of a patient or a clinician, the most attractive choice

for a comparison group is one that uses what clinicians are currently using: the TAU

(treatment-as-usual) control group. Here patients are recruited into the study and randomly

assigned either to T, the new treatment under study, or to the treatment they would have received

if they had not entered the RCT. The responses in both groups are monitored using the same

protocol, with evaluators blinded to group membership. The research question is whether it can be

shown, beyond reasonable doubt, that T produces better results than what community clinicians

generally achieve.

Because patients who enter such an RCT receive at least the treatment that they would have

received had they not entered the RCT (TAU), or perhaps receive a treatment for which there is a

rationale and justification for a belief that it might be better (T), such a control group is an

inducement to participation. Moreover, since dropping out of the study generally means getting

what is now the control treatment, there is little inducement to drop out. Clearly, patients and

treatment providers cannot be blinded to the treatment choice, although assessors of treatment

response can be.

Objections to this choice of control group generally come from researchers, who correctly point out

that community clinicians do not all use the same treatment, and many use treatments that are

known to be far from optimal (or use optimal treatment in a less-than-optimal fashion). Far greater

variability in response is likely in a TAU control group than there would be if one particular

treatment were specified for a control group, or if placebo were used, and greater variability than in

the T group. Consequently, the sample size has to be larger to achieve adequate power than with a

specific choice of comparator treatment or with a placebo as the control intervention.

Clearly, with TAU as the control condition, greater detail is needed in reporting results. It is

important to report which treatments were being used by community clinicians and to explore

whether some of those treatments were likely to be better or worse than T (hypotheses to perhaps

be tested in future RCTs). However, all in all, TAU is a viable choice for a control condition in an RCT

to document whether T is a better choice in general than what is already available in the

community.

Standard-of-Treatment Control Groups

Another viable option is that of selecting and delivering what, on the basis of past, independently

replicated RCTs, appears to be the most effective of the treatments in use in the community. This

allays many of the problems with TAU control groups, because now the one control treatment can

be optimally delivered by research staff, decreasing the heterogeneity within the control group.

Also, blinding can now be more complete and effective. Moreover, patients can be assured that

current medical knowledge indicates that the treatment in the control group is as good as or better

than what they would receive if they were not in the study, which again is an inducement both to

participate in the RCT and to not drop out. If there is persuasive evidence that the new treatment T

is indeed superior (or at least equivalent) to others currently available, significant findings in such

a study might well convince clinicians to use T rather than whatever treatment they had been

using.

EQUIVALENCE AND NONINFERIORITY RCTS

The RCT methodology was originally developed to demonstrate the efficacy or effectiveness of a

treatment, to establish whether T is better than C. In recent years attention has also been paid to

issues of clinical significance, as well as to noninferiority and equivalence. Misconceptions

concerning statistical significance, clinical significance, equivalence, and noninferiority have

unnecessarily confused the issue of the placebo control.

The basic requirements of an RCT still apply to equivalence and noninferiority trials: the necessity

of an appropriate control group, randomization to T versus C groups, blinded assessment of

response, and analysis by intention to treat. In the study population, there is an unknown effect

size, delta (δ), that is zero when there is absolutely no differential response between T and C, is

positive when T tends to be better than C, and is negative when C tends to be better than T. One

way of stating the purpose of an RCT is that it is meant to estimate that unknown effect size

(Borenstein 1994, 1997, 1998).

The usual two-tailed null hypothesis significance test seeks to prove beyond reasonable doubt that

δ is not zero. In designing such a study, a value of δ, the threshold of clinical significance, is

designated, say δ*. The test is then structured so that when δ = 0, the probability of a significant

result is less than, say, 5% (the significance level). When the magnitude of δ is greater than δ*,

the probability of a significant result is greater than, say, 80% (adequate power). Given the fact

that when there is sufficient rationale and justification for proposing an RCT there is almost no

realistic chance that the effect size is exactly zero, achieving statistical significance in an RCT is

generally a matter of having a large enough sample size in a well-designed study with reliable

outcome measures. To indicate the possible clinical significance of a statistically significant finding,

the effect size and its confidence interval should be reported (as per CONSORT guidelines).

Another way of saying the same thing: with the study design described above, there is a better than

80% probability that a two-tailed 95% confidence interval for the unknown true effect size δ will

not include δ = 0, whenever the true effect size is greater than the critical value δ*.

In contrast, to show that T is clinically superior to C, one needs to show that the entire confidence

interval for that effect size is greater than δ*. To show that T and C are clinically equivalent, one

needs to show that the confidence interval lies completely between −δ* and δ*. To show the

noninferiority of T to C, one needs to show that the entire confidence interval is greater than −δ*.

One can always demonstrate either noninferiority or equivalence, simply by using unreliable

outcome measures, allowing deviations from treatment and measurement protocols, etc. That is, a

badly conducted trial will result in an attenuated effect size between any two drugs, an effect size

closer to zero, that can almost always be labeled as a noninferior or equivalent result.
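
The confidence-interval criteria discussed above can be written as a short decision function. This is a minimal sketch, not a procedure taken from the chapter: the function name and the labels are hypothetical. Noninferiority is taken in its conventional sense, with the whole interval lying above −δ*, and δ is oriented so that positive values favor T.

```python
def classify_interval(ci_low, ci_high, delta_star):
    """Return the conclusions a confidence interval for delta supports.

    delta > 0 means T is better than C; delta_star > 0 is the prespecified
    threshold of clinical significance.
    """
    if delta_star <= 0 or ci_low > ci_high:
        raise ValueError("need delta_star > 0 and ci_low <= ci_high")
    conclusions = []
    if ci_low > 0 or ci_high < 0:
        conclusions.append("statistically significant")  # interval excludes 0
    if ci_low > delta_star:
        conclusions.append("T clinically superior")  # whole interval above delta*
    if ci_low > -delta_star:
        conclusions.append("T noninferior")  # whole interval above -delta*
    if -delta_star < ci_low and ci_high < delta_star:
        conclusions.append("clinically equivalent")  # whole interval inside (-delta*, delta*)
    return conclusions


# A narrow interval near zero: statistically significant, yet clinically equivalent.
print(classify_interval(0.05, 0.15, delta_star=0.3))
# A wide interval: not significant, but equivalence has not been shown either.
print(classify_interval(-0.40, 0.50, delta_star=0.3))
```

The second call returns an empty list, which is the situation in which nothing has been demonstrated. The first call shows why attenuation matters: any flaw that pulls the estimated effect toward zero moves the interval into the region labeled noninferior or equivalent.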

This issue is highly relevant to the valid interpretation of study results because of a common

confusion between a result being non–statistically significant and two drugs showing equivalence.

To report a non–statistically significant result is only to admit that the sample size was not large

enough, the design not powerful enough, or the measures not reliable enough to demonstrate

beyond reasonable doubt that δ ≠ 0. That is nowhere near the same thing as reporting a

demonstration beyond reasonable doubt that −δ* < δ < δ* (i.e., equivalence). As the old saying goes:

“Absence of proof is not proof of absence.”

In particular, if one randomly assigned subjects in an RCT to one of two treatment groups (T1 or

T2) or to a placebo control group (C) and found no statistically significant difference between T1

and T2 but found statistically significant differences both between T1 and C and between T2 and C,

that tells nothing about the possible clinical equivalence of T1 and T2. All one would know is that

both T1 and T2 were shown to be better than (not necessarily even clinically superior to) placebo;

the sample was not large enough to detect whatever difference there might be between T1 and T2.

Any conclusion comparing T1 and T2 would be exactly the same whether or not the placebo control

group were included in the design. Yet many arguments for the use of a placebo control group

inappropriately reflect an effort to use the placebo control as a decoy to interpret results comparing

T1 and T2.

But finally, why would it be important to establish beyond reasonable doubt the clinical equivalence

of two treatments, particularly when such a result can be obtained through poor study design (e.g.,

choice of measurement) and execution of the RCT? Often when this question is asked, whether of a

drug company representative or of an academic researcher, a complete answer contains an

implicit reinterpretation of equivalence as superiority. For example: Two treatments might have

equivalent effects in reducing symptoms, but one might have a better side-effect profile. Or the two

treatments might have equivalent effects in terms of both symptom reduction and side effects, but

one may be far less costly or have greater ease of use than the other. In all such cases, the goal of

the RCT should be to establish the clinical preference for one treatment or the other over the

control condition by using an outcome measure sensitive to the specific ways one drug might be

clinically preferable to another. But that would then be not an equivalence study but rather the

usual type of RCT, with a primary outcome reflecting the particular way in which T is hypothesized

to be superior to C.

STATISTICAL ANALYSIS AND POWER

Threaded through all the above issues are implicit considerations of statistical analysis and power.

Every single decision about research design has some impact on the appropriate choice of analysis

and on the power to detect clinically significant effects. In fact, the major difference between

well-designed and well-executed present-day RCTs and those done 50 years ago stems from

advances in methods of statistical analysis of results and better understanding of the concept and

application of power in designing RCTs.

With the simplest possible design, randomly assigning a representative sample of patients to a

treatment (T) or control (C) condition, with a binary primary outcome (“success” versus

“failure”), the analytic method would be a 2 × 2 (treatment × outcome) χ² test. This is the least

powerful design (Cohen 1983; Kraemer 1991; Kraemer and Thiemann 1987, 1989; MacCallum et al.

2002), and the study thus requires perhaps twice, perhaps 10 times, as many patients for adequate

power compared with studies with other designs. A valid choice? Yes. A wise choice? No.

Suppose we merely substituted that binary primary outcome with a dimensional one—for example,

symptom level at the end of treatment—and proposed to use the most common RCT analysis

method: the two-sample t test. Immediately, there would be an increase of power (thus requiring a

smaller sample size for adequate power). If the goal is to detect a moderate effect (number needed

to treat [NNT] ≈ 4), for a 5% two-tailed test one would require 63 patients per group or a total of

126 patients. Detecting a small effect (NNT ≈ 9) would require 389 patients per group, and

detecting a large effect (NNT ≈ 2) would require only 26 patients per group.
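
Sample sizes of this order can be approximated with a few lines of code. The sketch below rests on assumptions that this section does not state: 80% power; standardized mean differences of 0.2, 0.5, and 0.8 standing in for small, moderate, and large effects; the normal approximation to the two-sample t test; the conversion NNT = 1/(2Φ(d/√2) − 1); and, for the binary comparison, a cut point midway between the two group means. All function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_per_group_means(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-tailed comparison of two means."""
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * z * z / (d * d))


def n_per_group_binary(d, alpha=0.05, power=0.80):
    """Approximate n per group if the same outcome is split into success/failure.

    Assumes a normal outcome with unit variance, dichotomized at the midpoint
    between the two group means, and compares the two success proportions.
    """
    p_t, p_c = Z.cdf(d / 2), Z.cdf(-d / 2)
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    variance_sum = p_t * (1 - p_t) + p_c * (1 - p_c)
    return ceil(z * z * variance_sum / (p_t - p_c) ** 2)


def nnt_from_d(d):
    """Number needed to treat implied by a standardized mean difference d."""
    return 1 / (2 * Z.cdf(d / sqrt(2)) - 1)


for d in (0.2, 0.5, 0.8):
    print(f"d={d}: NNT={nnt_from_d(d):.1f}, "
          f"n/group (means)={n_per_group_means(d)}, "
          f"n/group (binary)={n_per_group_binary(d)}")
```

Under these assumptions the approximation gives 393, 63, and 25 patients per group for the comparison of means. That is close to, but not identical with, the figures quoted above, which presumably come from exact tables. In every case the dichotomized version of the same outcome needs more patients, which is the loss of power attributed to the binary design.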

The two-sample t test is valid when the outcomes being measured are approximately normally

distributed with equal variances in the two groups, but many clinically meaningful outcomes have

asymmetrical distributions, have long tails, or occur with unequal variances in the two groups. Then

one might use instead the nonparametric Wilcoxon rank sum test (Mann-Whitney test). When the t

test gives valid results, the Mann-Whitney test is also valid and has quite similar power. However,

the nonparametric test is valid in many circumstances when the t test is not. This illustrates two

general principles: the choice of the outcome measure must be in accordance with the choice of the

analytic procedure, and better selection of the outcome measure has a major impact on the study’s

power and thus the necessary sample size.

If one uses any of the above for an RCT of depressed patients, most of the sample would be female;

for an RCT of schizophrenic patients, most of the sample would be male. In many cases, it is

proposed to stratify such sample populations in order to equalize the representation of males and

females. Is this wise?

If it is decided that stratification is warranted, to be valid the analytic procedure must acknowledge

that stratification. For a binary outcome, one might use a logistic regression analysis and for a

dimensional outcome, a linear regression analysis, with treatment group, stratum (here, gender),

and their interaction as independent variables in each case.

One common analytic error is to assume that the interaction does not exist and to use analysis of

covariance with gender as the covariate. If that interaction does exist in the study population and is

ignored in the analysis, it often compromises the significance level, and thus the validity of the test,

and almost inevitably reduces the power. But if the interaction is included in the analysis, care must

be taken to properly center all the independent variables to produce clinically interpretable results

(Kraemer and Blasey 2004).

If this is done correctly, the interaction test assesses whether the treatment effect in women is

different from the effect in men, and the main effect of treatment assesses whether the average

treatment effect across men and women is nonrandom. When there is an interaction effect in the

study population, the main effect of treatment in this analysis is not the same as the effect of

treatment assessed in an unstratified sample. The crucial issue then is which treatment effect is of

interest—the effect in the total population or the average effect across the subpopulations defined

by the strata.
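
To make the distinction concrete, here is a minimal sketch that works directly from the four cell means of a treatment-by-gender design. It assumes both factors are centered by coding them +1/2 and -1/2 (one common centering choice, not necessarily the one a given analysis would use) and that the interaction is in the model, so the fitted values equal the cell means. The function name and all numbers are hypothetical.

```python
def stratified_effects(mean_t_women, mean_c_women, mean_t_men, mean_c_men,
                       prop_women):
    """Summarize a 2 x 2 (treatment x gender) design from its cell means.

    Assumes +1/2 and -1/2 coding of both factors and a model that
    includes their interaction.
    """
    effect_women = mean_t_women - mean_c_women
    effect_men = mean_t_men - mean_c_men
    return {
        # Interaction coefficient: does the effect differ between strata?
        "interaction": effect_women - effect_men,
        # Main effect under centered coding: unweighted average of the
        # two within-stratum effects.
        "main_effect": (effect_women + effect_men) / 2,
        # Effect expected in a population with the stated gender mix.
        "population_effect": (prop_women * effect_women
                              + (1 - prop_women) * effect_men),
    }


# Hypothetical mean improvement scores; 80% of the population is female.
result = stratified_effects(mean_t_women=14.0, mean_c_women=8.0,
                            mean_t_men=10.0, mean_c_men=8.0,
                            prop_women=0.8)
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

With these made-up values the treatment effect is 6 points in women and 2 points in men, so the centered main effect (4.0) differs from the effect expected in an unstratified sample that is 80% female (5.2).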

If the decision is that the sample should be stratified, the sample size needed for adequate power is

likely to increase, and the logistical difficulty of accumulating a stratified sample is likely to be

much greater. If, for example, 80% of those with the disorder of interest are women, but it is

decided that 50% of the sample in the RCT should be women, one will have to work much harder to

recruit that oversampling of men into the study. Thus, careful thought should be given to whether

the rationale and justification for stratification are strong enough to necessitate larger samples,

more complex analyses, and a shift in the hypothesis being tested.
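
The recruitment burden can be estimated with simple arithmetic. The sketch below assumes consecutive recruitment in which every eligible patient consents, so it gives only a lower bound on the real effort; the function name and the total sample size of 100 are hypothetical. Exact fractions are used so that the rounding up is not disturbed by floating-point error.

```python
import math
from fractions import Fraction


def eligible_patients_needed(target_n, sample_share, population_share):
    """Patients who must be seen to fill one stratum's quota.

    Assumes consecutive recruitment with full consent (a lower bound).
    """
    quota = math.ceil(target_n * sample_share)
    return math.ceil(quota / population_share)


target_n = 100  # hypothetical total sample size

# Stratified 50/50 although only 20% of patients are men.
men = eligible_patients_needed(target_n, Fraction(1, 2), Fraction(1, 5))
women = eligible_patients_needed(target_n, Fraction(1, 2), Fraction(4, 5))
stratified = max(men, women)

# Unstratified: the sample simply mirrors the population.
unstratified = target_n

print(men, women, stratified, unstratified)  # 250 63 250 100
```

Under these assumptions, filling the male half of the sample requires seeing at least 250 eligible patients, 2.5 times as many as an unstratified sample of the same size.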

The difficulty of such decisions is exacerbated when researchers (or reviewers) seek to control for

multiple covariates (e.g., gender, age, ethnicity, initial severity of the disorder). To truly control the

study for the effects of such variables, one stratifies the sample. However, with gender (two

possibilities), age (say, five age groups), ethnicity (say, five ethnic groups), and initial severity

(say, three levels), one has 2 x 5 x 5 x 3 = 150 strata, and one would have to recruit adequate

numbers into each stratum (for optimal power, an equal number into each stratum). If even a

minimal number of patients per stratum were specified, say, 10 per stratum (5 randomized to T and

5 to C), the minimal sample size would be 1,500!
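
The arithmetic generalizes to any set of stratification variables. A short sketch, using the level counts from the example above (the dictionary keys are only labels):

```python
from math import prod

# Number of levels for each stratification variable, as in the text.
levels = {"gender": 2, "age group": 5, "ethnic group": 5, "initial severity": 3}

n_strata = prod(levels.values())
per_stratum = 10  # 5 randomized to T and 5 to C
minimal_n = n_strata * per_stratum

print(n_strata, minimal_n)  # 150 1500
```

Adding one more three-level variable to the dictionary would triple both numbers, which is why the number of strata becomes unmanageable so quickly.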

If these stratification variables are not very strongly associated with treatment effect, the result is

a study with less power than would be achieved with a simple design. If there are collinearities

among these variables (say, women and older patients tend to have higher initial disease

severity), the power to detect treatment effects might also be reduced. One of the least wise

decisions in RCT design is to try to control for the effects of too many variables, and many

experienced biostatisticians argue against any stratification of the sample unless the primary

hypotheses concern moderators of treatment outcome.

Researchers and review committees, however, often propose another tactic: Instead of controlling

for the effects of these baseline variables through stratification, adjust for them in a mathematical

model. Now the sample would continue to be 80% women, but the analysis would include

consideration of both treatment and gender. What then often happens in analysis is exclusion of all

interactions. Without adjustment, the two-sample t test has N – 2 degrees of freedom (the larger

the number of degrees of freedom, other things being equal, the greater the power), but with a

single covariate, that becomes N – 4 (N – 2²), and with four covariates, it becomes N – 32 (N – 2⁵).

As noted above, if such interactions exist in the study population and are excluded in the model, the

significance level may be compromised and power is almost inevitably lost. Thus, if covariates are

to be included, their interactions must be as well. Unless inclusion of those variables has a major

strengthening effect on effect size, this inevitably means a loss of power. Finally and perhaps most

important, collinearity effects resulting from associations between the variables included cost even

more power. Again, most experienced biostatisticians argue against adjusting for the effects of

baseline variables in the absence of a strong rationale and empirical justification for doing so.
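
The degrees-of-freedom counts quoted above (N – 2, N – 4, N – 32) follow a simple pattern if one assumes that every covariate is binary and that all interactions among treatment and the covariates are in the model: treatment plus k covariates define 2 to the power (k + 1) cells, each needing one parameter. A sketch under that assumption (the function name and N = 100 are hypothetical):

```python
def residual_df(n, n_binary_covariates):
    """Residual degrees of freedom for a model with treatment, k binary
    covariates, and all of their interactions (2 ** (k + 1) parameters)."""
    return n - 2 ** (n_binary_covariates + 1)


n = 100  # hypothetical total sample size
for k in (0, 1, 4):
    print(k, residual_df(n, k))
# 0 98
# 1 96
# 4 68
```

With continuous or multi-level covariates the parameter count differs, but the cost in degrees of freedom still grows rapidly once interactions are included.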

On the other hand, in a multisite RCT, stratification by site is built into the design and must be

included in the analysis, and even then, many researchers and reviewers choose to ignore it.

Multisite RCTs often show that site differences are a major source of variance in the outcome

measurements (MTA Cooperative Group 1999). The most convincing demonstration of the

almost-ubiquitous nature of site differences is not from an RCT but from a study of inbred strains of

mice in a genetics study (Crabbe et al. 1999) under controlled laboratory conditions. Even then, site

differences occurred. In an RCT, if samples are drawn from different sites, or in different time

spans, or at the same site at the same time but using different recruitment strategies (e.g.,

referrals from doctors versus responses to advertisement), one should always expect that these

differences will affect the primary outcome. Thus, randomization must be done within each such

stratum, and comparison of T versus C must be a pooled comparison of the within-stratum

comparisons of T versus C (Kraemer and Robinson 2005).
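
The two requirements in the preceding sentence can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any cited paper: randomization uses shuffled blocks of two within each stratum, and the pooled estimate weights each within-stratum difference by n_T x n_C / (n_T + n_C), a conventional choice among several. All names and data are hypothetical, and the example arms are deliberately unbalanced across sites to show what an unpooled comparison would pick up.

```python
import random
from statistics import mean


def randomize_within_strata(patients_by_stratum, seed=0):
    """Assign T or C in shuffled blocks of two within each stratum."""
    rng = random.Random(seed)
    assignment = {}
    for stratum, patients in patients_by_stratum.items():
        for start in range(0, len(patients), 2):
            block = ["T", "C"]
            rng.shuffle(block)
            for patient, arm in zip(patients[start:start + 2], block):
                assignment[patient] = (stratum, arm)
    return assignment


def pooled_within_stratum_difference(outcomes_by_stratum):
    """Weighted average of the within-stratum T minus C mean differences.

    Assumption: weights n_T * n_C / (n_T + n_C); other weightings exist.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for t_values, c_values in outcomes_by_stratum.values():
        n_t, n_c = len(t_values), len(c_values)
        weight = n_t * n_c / (n_t + n_c)
        weighted_sum += weight * (mean(t_values) - mean(c_values))
        total_weight += weight
    return weighted_sum / total_weight


# Hypothetical outcomes: site B scores about 10 points higher than site A,
# and the arms are unbalanced across sites.
outcomes = {
    "site A": ([11, 13], [9, 11, 10, 10]),
    "site B": ([21, 23, 22, 22], [19, 21]),
}
all_t = [v for t, _ in outcomes.values() for v in t]
all_c = [v for _, c in outcomes.values() for v in c]
naive = mean(all_t) - mean(all_c)
pooled = pooled_within_stratum_difference(outcomes)
print(round(naive, 2), round(pooled, 2))  # 5.33 2.0
```

In this made-up example the treatment effect is 2 points at each site, which the pooled within-stratum estimate recovers, while the comparison that ignores site mixes the site difference into the apparent treatment effect.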

Thus far we have focused on assessing a single primary outcome at the end of treatment, whether a

binary success/failure or a dimensional outcome, and have recommended against using a binary

outcome. But, some would argue, some outcomes are by their nature binary: either the patient dies

or not, the patient recovers or not, the disease remits or not. Are we not then obliged to use a

binary outcome, and “take the hit” by increasing the sample size manyfold?

Outcomes such as these occur over the course of time, and at different times for different patients.

By simply reorienting the analysis to examination of the time to the event, one moves from a binary

outcome to a dimensional one. This is the situation in which survival analyses become the analytic

procedure of choice: Kaplan-Meier estimation of survival curves within each group (Kaplan and

Meier 1958), comparison of these survival curves in the T versus C groups, and use of the Cox

proportional hazards model (Andersen et al. 1985), for example, when there are strata or covariates to be

considered. Although the sample sizes for adequate power will be somewhat greater than with

other dimensional outcomes (because some patients will be censored, i.e., they will not have had

the outcome occur before the end of study), the sample size here will be much smaller than when

using a binary outcome, and more useful clinical information will be obtained.
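
As a minimal sketch of how time-to-event data with censoring are summarized, the function below computes the Kaplan-Meier product-limit estimate for one group in plain Python. The function name and the data are hypothetical; a real analysis would use an established statistical package and would add confidence intervals and a between-group test.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time.

    times  -- follow-up time for each patient
    events -- 1 if the outcome occurred at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events)
                       if ti == t and ei == 1)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events > 0:
            # Multiply by the fraction of at-risk patients without the event.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Patients with an event or censored at t leave the risk set.
        at_risk -= n_leaving
    return curve


# Hypothetical weeks to remission; 0 marks a patient censored at that week.
weeks = [2, 3, 3, 5, 8]
remitted = [1, 1, 0, 1, 0]
for week, s in kaplan_meier(weeks, remitted):
    print(week, round(s, 3))
# 2 0.8
# 3 0.6
# 5 0.3
```

Here the "survival" probability is the estimated proportion not yet remitted. The two censored patients still contribute to the risk sets up to the time they leave the study, information that a simple remitted/not remitted tally at the end of the study would discard.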

Also, with dimensional outcome measurement, modern analytic tools can lead to increased power

without increasing sample size. For example, instead of assessing the outcome using only the

endpoint of treatment, one could assess that outcome measure at baseline and at fixed times

during the treatment period. Random regression models (also known as hierarchical models, or

growth curves; Berger 1986; deLeeuw and Kreft 1986; Ware 1985) basically model the trajectory of

response within each patient and then test whether the trajectories of response in the T group are

clinically preferable to those in the C group. Because multiple measures per patient are used to

characterize each patient’s response, reliability is increased, and thus power is increased.

Moreover, in the case of missing data or dropouts, partial data on the trajectory per patient enable

stronger imputation methods to facilitate intention-to-treat analyses. Quite aside from the multiple

statistical advantages of designing studies with repeated measures of outcomes over time, such

information is often clinically informative in guiding clinicians to recognize early those patients who

are unlikely to ever respond to a given treatment.
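
The idea of characterizing each patient by a trajectory can be sketched with a deliberately simplified two-stage approach: fit a least-squares slope to each patient's repeated scores, then compare the mean slopes of the two groups. This is not a random regression model, which estimates all trajectories jointly and weights patients by the amount and precision of their data; it is only meant to show how partial data from dropouts still contribute. All names and scores are hypothetical.

```python
from statistics import mean


def ols_slope(times, scores):
    """Least-squares slope of scores on times for one patient."""
    t_bar, y_bar = mean(times), mean(scores)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, scores))
    sxx = sum((t - t_bar) ** 2 for t in times)
    return sxy / sxx


def mean_slope(patients):
    """Average per-patient slope; patients need at least two distinct visits."""
    slopes = [ols_slope(t, y) for t, y in patients if len(set(t)) >= 2]
    return mean(slopes)


# Hypothetical severity scores (lower is better) by week of visit.
# The second patient in each group dropped out early.
group_t = [((0, 2, 4, 6), (24, 20, 16, 12)),
           ((0, 2, 4), (22, 19, 16))]
group_c = [((0, 2, 4, 6), (24, 23, 22, 21)),
           ((0, 2), (20, 19))]

slope_t = mean_slope(group_t)
slope_c = mean_slope(group_c)
print(slope_t, slope_c, slope_t - slope_c)  # -1.75 -0.5 -1.25
```

In this made-up example severity falls by 1.75 points per week in the T group and 0.5 points per week in the C group, and the patients who dropped out are still represented by the slope of the visits they completed.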

This discussion barely scratches the surface of analytic methods available, but illustrates two

general principles:

1. For adequate power and to best inform clinical decision making, characterize the response of each

individual patient as precisely and concisely as possible (using reliable measures, preferably

dimensional, with repeated measures over time). That might sometimes complicate the analysis, but

analytic methods are generally available to take advantage of such precision.

2. Design the study to answer the primary research question, not to answer all possible questions that

might arise. Leave those to secondary or exploratory post hoc analyses. Do not stratify the study

population unless the design requires multiple sites or recruitment sources or the primary research

question is about the strata. Do not try to control or adjust for all possible influences on treatment

effect; instead, focus on controlling those factors empirically shown to strongly influence treatment

effect.

SUMMARY

The sine qua non of an RCT is that there be a control or comparison group, with an appropriate

sample of patients randomly assigned to the treatment and control groups, with blinded

assessment of outcome, and with analysis by intention to treat. A reasonable analogy of conducting

an RCT is that it is like juggling many balls, trying to keep them all in the air at the same time:

choice of control group, randomization, blinding, sampling, treatment protocol, measurement

protocol, and fidelity to the protocols during the study, analysis, and interpretation of results, etc.

The more research questions that are to be addressed in a particular RCT, and the more complex

each research question is, the more balls are being juggled and the more slippery they are.

Moreover, as soon as one ball drops, the others are likely also to follow. Any mistake in sampling,

for example, is likely to have repercussions in analysis and interpretation of results. Poor

measurement (e.g., use of a binary primary outcome) will have an effect on design (sample size).

Improved design (e.g., repeated measures of the primary outcome) will have an effect on analysis

and interpretation of results. For this reason, the best studies result when there is a focus on the

primary research question, when all research decisions are made to protect the integrity of and

amplify the answers to that primary research question. When the effort is made to answer as many

questions as possible in one study, e.g., addressing multiple outcomes, or controlling or adjusting

for multiple variables, RCTs tend not to answer any research questions well at all.

The choice of an appropriate control group is context specific but should take into consideration

ethical and clinical as well as scientific issues. On the issue of using placebo control groups, we

distinguished among RCTs to establish the efficacy or effectiveness of a drug, studies done for the

purpose of gaining FDA approval of a drug, and randomized medical experiments to explicate

basic science questions, which are often performed with subjects who are not patients and for

purposes that have little or nothing to do with efficacy or effectiveness of treatments.

We have discussed in detail when it is best and least controversial to use a placebo control group

(i.e., when there is no better alternative to placebo) and when it may be more appropriate to use a

TAU control group or a standard-of-care control group, particularly in cases where withholding

treatment, as would be done with placebo control subjects, raises ethical questions as well as

logistical questions.

Using placebo control groups as a foil to understand differences not seen between two active

treatments is often based on a misinterpretation of statistical significance. Moreover, use of

placebo control groups as an aid in establishing equivalence or noninferiority is questionable,

because what may appear to be equivalence is often related to poor study execution rather than to

actual equivalence between treatments. Also, it is not clear why equivalence is important to clinical

decision making.

In short, in some circumstances using a placebo control group is the only choice, and in others

using a placebo control group is the best choice. There are also circumstances in which ethical,

clinical, and scientific interests are best served by using other types of control groups. Finally, there

have been circumstances in which inclusion of placebo control groups has misled thinking about the

effects of other drugs.

REFERENCES

Acion L, Peterson JJ, Temple S, et al: Probabilistic index: an intuitive non-parametric approach to measuring

the size of treatment effects. Stat Med 25:591–602, 2006

Altman DG, Schulz KF, Moher D, et al: The revised CONSORT statement for reporting randomized trials:

explanation and elaboration. CONSORT Group (Consolidated Standards of Reporting Trials). Ann Intern Med

134:663–694, 2001

American Psychiatric Association: Diagnostic and Statistical Manual of Mental Disorders, 4th Edition, Text

Revision. Washington, DC, American Psychiatric Association, 2000

Andersen PK, Borch-Johnsen K, Deckert T, et al: A Cox regression model for the relative mortality and its

application to diabetes mellitus survival data. Biometrics 41:921–932, 1985

Berger MPF: A comparison of efficiencies of longitudinal, mixed longitudinal, and cross-sectional designs.

Journal of Educational Statistics 11:171–181, 1986

Blomqvist N: On the bias caused by regression toward the mean in studying the relation between change and

the initial value. J Clin Periodontol 13:34–37, 1986

Borenstein M: The case for confidence intervals in controlled clinical trials. Control Clin Trials 15:411–428,

1994

Borenstein M: Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol

78:5–16, 1997

Borenstein M: The shift from significance testing to effect size estimation, in Comprehensive Clinical Psychology

(Bellack AS, Hersen M, editors-in-chief), Vol 3: Research Methods (Schooler NR, volume editor). Burlington, MA,

Elsevier Science, 1998, pp 319–349

Campbell DT, Kenny DA: A Primer on Regression Artifacts. New York, Guilford, 1999

Cohen J: The cost of dichotomization. Applied Psychological Measurement 7:249–253, 1983

Crabbe JC, Wahlsten D, Dudek BC: Genetics of mouse behavior: interactions with laboratory environment.

Science 284:1670–1672, 1999

Davis CE: The effect of regression to the mean in epidemiologic and clinical studies. Am J Epidemiol

104:493–498, 1976

deLeeuw J, Kreft I: Random coefficient models for multilevel analysis. Journal of Educational Statistics

11:57–85, 1986

Freedman B: Equipoise and the ethics of clinical research. N Engl J Med 317:141–145, 1987

Grissom RJ: Probability of the superior outcome of one treatment over another. J Appl Psychol 79:314–316,

1994

Grissom RJ, Kim JJ: Effect Sizes for Research: A Broad Practical Approach. Mahwah, NJ, Lawrence Erlbaum,

2005

Hoagwood K, Hibbs E, Brent D, et al: Introduction to the special section: efficacy and effectiveness in studies

of child and adolescent psychotherapy. J Consult Clin Psychol 63:683–687, 1995

Hill AB: Medical ethics and controlled trials. BMJ 1(5337):1043–1049, 1963

Humphreys K, Weisner C: Use of exclusion criteria in selecting research patients and its effect on the

generalizability of alcohol treatment outcome studies. Am J Psychiatry 157:588–594, 2000

Humphreys K, Weingardt KR, Horst D, et al: Prevalence and predictors of research participant eligibility criteria

in alcohol treatment outcome studies, 1970–98. Addiction 100:1249–1257, 2005

Jones LV, Tukey JW: A sensible formulation of the significance test. Psychol Methods 5:411–414, 2000

Kaplan EL, Meier P: Nonparametric estimation from incomplete observations. J Am Stat Assoc 53:457–481,

562–563, 1958

Kemmler G, Hummer M, Widschwendter C, et al: Dropout rates in placebo-controlled and active-control clinical

trials of antipsychotic drugs. Arch Gen Psychiatry 62:1305–1312, 2005

Khan A, Warner HA, Brown WA: Symptom reduction and suicide risk in patients treated with placebo in

antidepressant clinical trials: an analysis of the Food and Drug Administration database. Arch Gen Psychiatry

57:311–317, 2000

Khan A, Khan SR, Leventhal RM, et al: Symptom reduction and suicide risk among patients treated with

placebo in antipsychotic clinical trials: an analysis of the Food and Drug Administration database. Am J

Psychiatry 158:1449–1454, 2001

Khan A, Kolts RL, Brodhead AE, et al: Suicide risk analysis among patients assigned to psychotropics and

placebo. Psychopharmacol Bull 39:6–14, 2006

Kraemer HC: To increase power in randomized clinical trials without increasing sample size. Psychopharmacol

Bull 27:217–224, 1991

Kraemer HC: Toward non-parametric and clinically meaningful moderators and mediators. Stat Med

27:1679–1692, 2008

Kraemer HC, Blasey C: Centring in regression analysis: a strategy to prevent errors in statistical inference. Int

J Methods Psychiatr Res 13:141–151, 2004

Kraemer HC, Kupfer DJ: Size of treatment effects and their importance to clinical research and practice. Biol

Psychiatry 59:990–996, 2006

Kraemer HC, Robinson TN: Are certain multicenter randomized clinical trial structures misleading clinical and

policy decisions? Contemp Clin Trials 26:518–529, 2005

Kraemer HC, Thiemann S: How Many Patients? Statistical Power Analysis in Research. Newbury Park, CA, Sage,

1987

Kraemer HC, Thiemann SA: A strategy to use “soft” data effectively in randomized clinical trials. J Consult Clin

Psychol 57:148–154, 1989

Kraemer HC, Frank E, Kupfer DJ: Moderators of treatment outcomes: clinical, research, and policy importance.

JAMA 296:1286–1289, 2006

Kraemer HC, Kiernan M, Essex MJ, et al: How and why criteria defining moderators and mediators differ

between the Baron and Kenny and MacArthur approaches. Health Psychol 27:S101–S108, 2008

Leber P: The use of placebo control groups in the assessment of psychiatric drugs: an historical context. Biol

Psychiatry 47:699–706, 2000

Leber PD, Davis CS: Threats to the validity of clinical trials employing enrichment strategies for sample

selection. Control Clin Trials 19:178–187, 1998

MacCallum RC, Zhang S, Preacher KJ, et al: On the practice of dichotomization of quantitative variables.

Psychol Methods 7:19–40, 2002

Meehl PE: Theory testing in psychology and physics: a methodological paradox. Philos Sci 34:103–115, 1967

Meinert CL: Clinical Trials: Design, Conduct, and Analysis. New York, Oxford University Press, 1986

MTA Cooperative Group: A 14-month randomized clinical trial of treatment strategies for

attention-deficit/hyperactivity disorder. Arch Gen Psychiatry 56:1073–1086, 1999

Rennie D: How to report randomized controlled trials: the CONSORT Statement. JAMA 276:649, 1996

Rothman KJ, Michels KB: The continuing unethical use of placebo controls. N Engl J Med 331:394–398, 1994

Rubin DB: Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol

66:688–701, 1974

Rubin DB: Teaching statistical inference for causal effects in experiments and observational studies. Journal of

Educational and Behavioral Statistics 29:343–367, 2004

Rush AJ Jr, First MB, Blacker D (eds): Handbook of Psychiatric Measures, 2nd Edition. Washington, DC,

American Psychiatric Publishing, 2008

Senn S: Regression to the mean. Stat Methods Med Res 6:99–102, 1997

Stigler SM: Regression towards the mean, historically considered. Stat Methods Med Res 6:103–114, 1997

Storosum JG, Wohlfarth T, Gispen-de Wied CC, et al: Suicide risk in placebo-controlled trials of treatment for

acute manic episode and prevention of manic-depressive episode. Am J Psychiatry 162:799–802, 2005

(Comment in: Am J Psychiatry 163:329, 2006)

Ware JH: Linear models for the analysis of longitudinal studies. Am Stat 39:95–101, 1985

Wen L, Badgett R, Cornell J: Number needed to treat: a descriptor for weighing therapeutic options. Am J

Health Syst Pharm 62:2031–2036, 2005
