Randomized controlled clinical trials in reproductive health

J. Villar and *G. Carolli
Special Programme of Research, Development and Research Training in Human Reproduction,
World Health Organization, 1211 Geneva 27, Switzerland
*Centro Rosarino de Estudios Perinatales (CREP)
Rosario, Argentina

Introduction

Randomized controlled trials (RCTs) are now well accepted as the preferred means to evaluate clinical treatments, preventive-screening manoeuvres and health and educational interventions. In recent years, systematic reviews of randomized trials, generally presented as meta-analyses, have been published in many areas of medicine, including obstetrics (5). This is a major development for our specialty, until recently known as one of the areas of medicine with less rigorous evaluation (12). The discussion here covers relevant practical and methodological issues related to RCTs, using published examples from the obstetrical literature. A comprehensive discussion of RCTs can be found in the textbook of Meinert (17) or, specifically oriented to reproductive health, in Wingo et al. (25).

Confounding variables

A recent large prospective study of the effect of physical demand and work during gestation on pregnancy outcome demonstrated a linear increase in the incidence of small-for-gestational-age (SGA) infants as the physical demand on the mother increased (16). However, it is well known that work (at home or outside) during pregnancy is also independently associated with other maternal characteristics, such as family income, woman’s age, nutritional status and obstetrical history. Furthermore, these maternal characteristics are also known to be independently associated with several pregnancy outcomes, including the rate of SGA.

Therefore, there are characteristics (variables) of these mothers that are related to the two factors of interest in this study: the level of their work and the outcome of the pregnancies. In general, these variables are associated with the disease under study and with the exposure (of the subjects) to some agent or condition. In the evaluation of medical treatment, these variables are associated independently with the treatment under study and the disease to be treated or prevented. These characteristics are considered confounding variables and they falsely obscure or accentuate the relationship between the treatment or exposure and the disease under study (4,15,17). The effect produced by confounding variables should be prevented in the study design or controlled for during the analysis.

In the previous example, family income, maternal age, nutritional status and obstetrical history are possible confounding variables in the relationship between work and pregnancy outcome. For example, women who worked in offices had a lower rate of small-for-gestational-age infants than those who worked in "manual" activities (16.8% versus 20.8%), but they also had higher income, were taller and had fewer previous low birth weight infants (Table 1). All these factors are associated with both the maternal working conditions (office/"manual") and the incidence of SGA; they are confounding variables.

A distinction should be made between two terms that can have similar meaning: confounding variables and selection bias. The most important source of bias in a study evaluating a treatment modality is introduced in the selection of participants for alternative forms of treatment. The distinction between selection bias and a confounding variable lies in whether the selection bias can be controlled in the analysis. If the "cause" or the source of selection bias is known and can be controlled in the analysis, the problem is similar to a confounding variable, because it is at least theoretically removable. Unfortunately, in most cases, the selection bias cannot be measured or is very difficult to identify.

The control of confounding variables

The effect of confounding variables can be: (a) prevented, (b) assessed or (c) controlled during different stages of a research project (19). The objective of preventing the confounding effect is to obtain study groups that are similar at the beginning of the treatments in the distribution of the possible confounding variables. Randomization during experimental studies and matching and/or restriction to those individuals without the confounding factors in experimental and non-experimental studies are the available strategies for prevention.

The assessment of the confounding effect in experimental and non-experimental studies is performed by comparing the crude estimate of the association between the risk factor or the treatment and the outcome with the estimate after the confounding effect is removed. The discrepancy between these two estimates gives the magnitude of the confounding effect. "P-value" analysis does not have a place in this process.

An example of the comparison between the crude estimate and the effect after removing the confounding factor is presented in Table 2. The association between the ponderal index of newborns (weight/length^3) and two indicators of neonatal morbidity was studied in a large prospective study (23). The crude analysis demonstrates that newborns with low ponderal index (underweight for length) have 5.6 times the risk of perinatal distress and 3.0 times the risk of low Apgar score at five minutes of newborns with normal ponderal index (normal weight for length).

However, birth weight is known to be associated with both neonatal outcomes as well as with ponderal index, for it is part of the calculation. Therefore, an adjusted statistical analysis removing the effect of birth weight was also conducted. As can be seen in the right column of Table 2, after removing the effect of birth weight, the risk of perinatal distress and low Apgar score is reduced for newborns with low ponderal index. The discrepancy between these two estimates gives the magnitude of the confounding effect of birth weight in the association between ponderal index and neonatal morbidity.
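The comparison of a crude and an adjusted estimate can be illustrated with a minimal sketch. The counts below are invented for illustration and are not the data of Table 2; the adjustment shown is a Mantel-Haenszel stratified risk ratio, which is one common option and not necessarily the method used in the cited study. The function and variable names are hypothetical.

```python
# Hypothetical counts (NOT the data of Table 2).
# Each stratum of the confounder holds:
# (exposed_cases, exposed_total, unexposed_cases, unexposed_total)
strata = {
    "low birth weight": (40, 200, 10, 100),
    "normal birth weight": (4, 100, 18, 900),
}


def crude_risk_ratio(strata):
    """Risk ratio after pooling all strata, i.e. ignoring the confounder."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel risk ratio, a weighted summary across strata."""
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata.values():
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator


crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude RR = {crude:.2f}, adjusted RR = {adjusted:.2f}")
```

With these invented counts the crude risk ratio is about 5.2, while the risk ratio within each stratum, and therefore the adjusted summary, is 2.0. The gap between the two figures is the magnitude of the confounding effect in this artificial example.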

Finally, the control of confounding variables can be done in experimental and non-experimental studies using analytic and statistical strategies such as stratified analysis and multivariable analysis. A detailed description of these methods can be found in most standard epidemiological and statistical textbooks and will not be reviewed here.

It should be remembered that all these processes must be considered well in advance of the implementation of the study, in the context of a well-defined hypothesis. In this paper we will review the randomization process and the implementation of randomized controlled trials as a "preventive" measure for confounding effects and selection bias.

Randomization

This is a process in experimental studies in which study subjects are assigned to treatment or control groups randomly. The purpose of this method is to reduce selection bias and prevent confounding. Subjects should have the same probability of being included in the treatment(s) or control(s) groups.

Randomization is a powerful method for preventing confounding, as it is the only one which can control for known and unknown risk factors or confounding variables; if the process of subject allocation is performed correctly, randomization is resistant to external manipulation. Thus, it is expected that known and unknown prognostic variables are randomly distributed between study groups and that randomization will create study groups having similar incidence of the disease or outcome when the treatment under evaluation has no effect. Table 3 summarizes the main properties of randomization.

TABLE 3. Properties of randomization.

Reduces selection bias;
Provides study groups with known statistical properties regarding baseline composition;
Provides statistical basis for tests of significance.

Randomization is less effective in achieving these objectives in studies of small sample size; in most of these cases, variables known to be confounders should also be controlled for in the analysis of these trials. For example, a small clinical trial was conducted by our group to evaluate the effect of calcium supplementation during pregnancy on the rate of blood pressure increase during the third trimester of gestation (21). Twenty-seven patients were randomized to the placebo group and 25 to the calcium-supplemented group. Despite the randomization, the diastolic blood pressure at the beginning of supplementation (24th week of gestation) was 53.4 ± 11.2 mm Hg in the placebo group, lower than the value in the calcium group (56.5 ± 8.8 mm Hg).

Blood pressure during early pregnancy is independently associated with blood pressure at term, as well as possibly being associated with the response of blood pressure to the supplementation. A decision was made that initial diastolic blood pressure should be controlled for in all statistical analyses of this randomized controlled trial (21).

Randomized controlled trials

When are RCTs necessary and appropriate? This type of study design is needed to evaluate treatments or interventions for important clinical diseases or for those cases in which there are uncertainties regarding the effectiveness of available treatments or forms of care. The anticipated effect of the new intervention on the selected outcomes is generally moderate. Thus, most of the new treatments for chronic disease, obstetrical and gynecological morbidities or screening methods during pregnancy qualify for this study design. RCTs are not limited to medical or pharmacological treatments, as psychological and social interventions should also be rigorously evaluated (6,14).

Randomized controlled trials are not indicated in the initial stages of the evaluation of the etiology of a disease or when the prognosis or the short- or long-term consequences of an exposure or condition are under investigation. Nevertheless, RCTs can play an important role in the investigation of a disease’s etiology when clinical research or observational studies are not sufficient. For example, after more than a decade of an iatrogenic epidemic of blindness among premature infants, the role of high oxygen therapy was tested in a large multicentre randomized trial, and the unequivocal results of this trial ended the epidemic (10).

As randomization is a method to prevent confounding, RCTs could be considered unnecessary: (a) in those cases in which all confounding variables are known; (b) when the prognosis of the condition under treatment is also known with certainty; and (c) when the expected treatment effect is very large. However, caution should be exercised when deciding against an RCT to evaluate an intervention or treatment, even when the theory behind the effect of such intervention appears to be logically sound.

The randomization process

In most RCTs, the basic subjects of randomization are individuals or patients who are allocated to a placebo or a new or established form of care. The research unit in charge of the randomization process can be centrally located (clinicians communicate with it by fax, telex or telephone) or at the clinic/hospital level. Once an eligible subject agrees to participate in the study, she is randomized and treatment starts. It is very important to reduce the time between randomization and entry of patients to the intervention/control regimen.

The selection of the randomization method and who will implement it depends on the circumstances of the research project; however, there are important issues to consider when selecting the method: the process must (a) be formal, (b) be unpredictable, (c) be reproducible, (d) be secure and (e) have mathematical properties. As tossing a coin is not reproducible, it is to be discouraged. The most important characteristic is that the treatment allocation of the next patient is unpredictable. For example, systematic schedules such as every other patient or the days of the week do not fulfill this requirement and should be avoided. Furthermore, the treatment schedule should be unknown to all members of the research team until needed for the initiation of the treatment (in unblinded studies), and should be completely masked until the completion of the trial in blinded studies. Thus, birthdate, social security number, and odd-even schemes should also be avoided.
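As an illustration of a formal and reproducible process, the following minimal sketch generates an allocation schedule with randomly permuted blocks from a seeded pseudo-random generator. Permuted blocks are only one of several possible methods and are an assumption of this example, as are the function name, the block size, the arm labels and the seed. In practice the seed and the schedule would have to be held securely by the randomization unit, since anyone who knows them can predict every allocation; Python's `random` module is not designed for security purposes.

```python
import random


def blocked_allocation(n_subjects, block_size=4,
                       arms=("treatment", "control"), seed=12345):
    """Return an allocation list built from randomly permuted blocks.

    The same seed always reproduces the same schedule (reproducibility);
    within each complete block every arm appears equally often.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # private generator, independent of global state
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # random order within the block
        schedule.extend(block)
    return schedule[:n_subjects]


schedule = blocked_allocation(12)
print(schedule)
```

Small fixed blocks make the last allocation of each block predictable to anyone who knows the block size and the earlier allocations, so the choice of block size is itself part of keeping the schedule unpredictable.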

Table 4 lists several randomization methods used in the literature from the less rigorous (*) to the most bias-safe alternatives (*****). Researchers should select methods based on the above-listed characteristics and adopt the one most suitable to their needs, without compromising the quality of the process. It should be kept in mind that this is the cornerstone of the study and even when tedious, the best methodological option should be selected.

The statistical properties of several mechanisms of randomization have been discussed extensively; the conclusions and recommendations were recently published (13). Other recommended reading includes a very useful summary that was prepared for the British Medical Journal (8).

It is often asked whether or not the randomization process has worked. The process should be evaluated primarily by continuously monitoring its underlying methods during the study and being satisfied that no bias has been introduced. Calculating a "p-value" during the analysis does not answer the question. For example, a large difference between treatment groups in the distribution of a baseline variable with a p-value >0.05 does not necessarily mean that the assignment was random. Conversely, small differences in baseline variables with a p-value <0.05 do not mean that the randomization process did not work or that the variable is a confounding factor.

Study design

The study design of the randomized clinical trial is an important point in the avoidance of moderate bias and random error. The statement of the objective should be clearly specified, including the anticipated effect of the principal measure of outcome. The expected treatment effect is crucial to the calculation of the sample size.

Fig. 1 presents the flow chart of a large RCT recently completed by our group, evaluating an intervention of psychosocial support during pregnancy on birth weight, gestational age and maternal health (14,24). Possible candidates for enrollment in the study were screened using a list of inclusion and exclusion criteria. Complete baseline information was obtained from those who were eligible for the study, but basic descriptive information was also obtained from those women excluded before randomization. These data were used for the evaluation of the representativeness of the study population.

The time of individual randomization (about the 22nd week of gestation in the example) represents the crucial point of the study. Randomized subjects cannot be excluded from the final analysis and should be part of the study population regardless of any follow-up experience.

Subjects excluded before randomization affect the composition of the study population (external validity) and the generalizability of the study results. Homogeneous study populations, selected using restricted entry criteria, will have fewer confounding variables (these were excluded), but the results will be less generalizable. Subjects lost after randomization affect the comparability of the groups or treatments, the main objective of the study (internal validity).

The following points should also be considered:

The source of participants and eligibility criteria should be considered, as well as the participants included and those who met the eligibility criteria but did not enter into the study. All this information is collected to ensure the representativeness of the study population and to observe how the participants differed from the non-participants.
Detailed description of the alternative forms of care is needed to ensure the reproducibility of the treatments in any new trial or in clinical practice.
Randomization: As stated above, this is the crucial point in the design to avoid selection bias of the patients for the different forms of care, which is the most important bias in the study. But because randomization in itself is not enough to avoid bias, the method should be clearly specified. It is essential to establish the blindness of the randomization process in all stages and to confirm the patient registration before entry to the trial; in other words, the randomization should be safe, unpredictable and confirmed before trial entry.
Administration of the alternative forms of care. When possible, the blindness of the administration of the alternative forms of care should be kept to avoid bias at this stage of the study.
Assessment of the principal and secondary outcome measures. During the assessment of the outcome measures, the treatment status should be blinded to those responsible for this evaluation. In cases where this approach is not possible, the avoidance of bias should be ensured by using strong endpoints such as "death", and not soft ones such as interpretation of x-rays, which are more prone to bias.
Compliance with the protocol. Any crossover to the alternative treatment and cointervention must be monitored and reported.
Groups should be compared according to the random treatment allocation at entry, in an "intention-to-treat" analysis. This approach is free of selection bias and is a pragmatic view of how an intervention works in a clinical setting, as opposed to the "explanatory trial", which describes how an intervention should work in a "laboratory setting".
Possible side-effects and complications should be described.

Table 5 presents a practical checklist for items to be included in the description of the study design (usually included in the Materials and Methods of published articles).

Baseline comparisons

Randomization does not necessarily produce comparable groups. There can be minor and sometimes major differences in the baseline comparisons between groups (1). These baseline differences between groups should always be evaluated, but not using a statistical test (19). Descriptive statistics such as standard deviation, range and selected centiles, as well as the mean or median, should always be used to evaluate the distribution of baseline variables. Standard errors and confidence intervals are not descriptive measures and should not be used for this purpose. Unfortunately, a recent evaluation of 80 reports of randomized clinical trials published in four leading general medical journals indicates that in almost half of the trials, the presentation of baseline data was unsatisfactory (1).

Table 6 presents the baseline characteristics of the two groups in the psychological support during pregnancy study (24). Overall, the two groups had similar demographic, obstetrical and psychological characteristics at baseline. Note that no statistical comparisons are presented. Although the baseline distribution of psychological and social support characteristics was very similar between groups (see last five variables in Table 6), stratified analysis was conducted by the last two summary variables because of their possible important role in the effect of the intervention (see Table 5 of [24]).

Table 7 presents descriptive information for patients enrolled in a randomized controlled trial designed to evaluate the effectiveness of 200 mg of doxycycline given orally at the time of IUD insertion in reducing the incidence of pelvic inflammatory disease. Groups had very similar characteristics, although the mean number of previous live births and the years of education had p values <0.05. The differences, however, do not have any biological importance (25) and should not be considered as imbalance among groups, despite the low p-value.

If the randomization was correctly conducted (unbiased), it is expected that by chance 5% of the baseline comparisons performed using a statistical test will have a p-value <0.05. For example, in 600 hypothesis tests performed in 46 published trials for baseline comparisons, 24 (4%) were statistically significant at the 5% level. Statistical tests were performed for 17 comparisons between baseline characteristics of two study groups in a randomized controlled trial of the effect of calcium supplementation on preterm delivery (23). Among these comparisons, maternal weight was lower in the calcium group (63.0 ± 10.6 kg) than in the placebo group (67.4 ± 13.5 kg) (p<0.01) (Table 8). This statistical difference can be due to chance; at a significance level of 5% (p<0.05), we expect approximately one of the 17 comparisons to be significant. The decision which must be made by the investigators is whether this difference is biologically important enough to independently influence the effect of calcium supplementation and the rate of preterm delivery. In this example, the authors considered that maternal weight differences of this magnitude should not be controlled for in the analysis, and that if any effect was present, it would be in the direction of increasing prematurity in the calcium group. Thus, it is clearly a biological decision and not a statistical one.
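The arithmetic behind the expectation of chance findings can be sketched as follows. The calculation assumes that the baseline tests are independent of one another, which is a simplification introduced here and rarely holds exactly for correlated maternal characteristics; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' baseline tests under true randomization."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance that one or more tests are significant (assumes independent tests)."""
    return 1 - (1 - alpha) ** n_tests


# 17 baseline comparisons, as in the calcium supplementation example
print(f"expected significant tests: {expected_false_positives(17):.2f}")
print(f"P(at least one significant): {prob_at_least_one(17):.2f}")

# 600 baseline tests, as in the survey of 46 published trials
print(f"expected among 600 tests: {expected_false_positives(600):.0f}")
```

Under these assumptions about 0.85 of the 17 comparisons are expected to be significant and the chance of at least one is about 0.58, so a single low p-value among many baseline comparisons is unremarkable.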

Evaluation of the impact of the treatment

As previously stated, if there is no treatment effect (the new drug and placebo/old drugs are not different in improving the main outcome of interest) and if the randomization is well-conducted with an adequate sample size, the incidence of the main outcome should be similar in both groups.

A simple comparison of the incidence rates in both groups, the rate ratio (the ratio of the two incidence rates) and the confidence interval of the rate ratio should be sufficient to present the results. This comparison should include all subjects who were originally randomized, regardless of whether they completed the treatment or any follow-up experience. For the example of the psychosocial intervention during pregnancy study, Table 9 presents the standard format for the presentation of the main crude results.
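A minimal sketch of such a crude comparison is given below. It computes the ratio of two cumulative incidences and an approximate 95% confidence interval on the logarithmic scale, a standard large-sample method chosen here for illustration. The counts are invented and are not those of Table 9, and the function name is hypothetical.

```python
import math


def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk ratio (treatment / control) with an approximate confidence interval.

    Uses the large-sample standard error of log(RR); z=1.96 gives a 95% interval.
    All randomized subjects should be counted in n_t and n_c.
    """
    rr = (events_t / n_t) / (events_c / n_c)
    se_log_rr = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical trial: 90 events among 1000 randomized to the intervention,
# 100 events among 1000 randomized to control.
rr, lower, upper = risk_ratio_ci(90, 1000, 100, 1000)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that includes 1, as in this invented example, is compatible with no treatment effect; its width shows how precisely the effect has been estimated.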

However, it could be possible that baseline differences between groups are detected and that these characteristics are considered as possible confounding variables. Thus, an unconfounded evaluation of the response to the treatment is desirable. Four analytic strategies can be used to obtain this unconfounded evaluation:

Subgroup analysis or stratified analysis.
Change in the effect, e.g. on blood pressure, from baseline values.
Percentage of change, e.g. blood pressure, from baseline values.
Multiple regression analysis.

Methodological considerations of these techniques can be found in standard textbooks of epidemiology or, specifically for randomized controlled trials, in references (11,17). Examples of their use in the obstetrics field were published by our group for stratified analysis (3,24) and for changes from baseline values and multiple regression analysis (2).

Finally, it is important to keep the number of analyses to a minimum, specifically to those that were originally planned. False positive results are often due to multiple analyses. Secondary analyses should also be limited and considered only hypothesis-generating, not hypothesis testing.

The problem of small trials

Small trials with an insufficient number of cases are very common in clinical research. Two recent reviews demonstrated that up to two-thirds of all published reports in leading medical journals did not provide power calculations or did not give reasons for terminating the recruitment of subjects (1,18). True differences observed between treatment groups are more likely to be attributed to chance in small trials (false negative results). Furthermore, statistically significant differences observed in small trials tend to overestimate true biological differences. In other words, when applied to other populations, the treatment effect should be expected to be less dramatic than the one observed in the original small trial.
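The dependence of the required sample size on the expected treatment effect can be illustrated with a minimal sketch. It uses the usual normal-approximation formula for comparing two proportions with equal group sizes and a two-sided test; the incidences, significance level and power below are hypothetical values chosen for illustration, and the function name is an assumption of this example.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to compare two proportions.

    Normal approximation, two-sided test, equal allocation; losses to
    follow-up are not taken into account.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)


# A moderate effect: outcome incidence reduced from 10% to 8%
n_moderate = sample_size_per_group(0.10, 0.08)
# A large effect: outcome incidence reduced from 10% to 5%
n_large = sample_size_per_group(0.10, 0.05)
print(n_moderate, n_large)
```

Under these assumptions a moderate reduction from 10% to 8% requires several thousand subjects per group, whereas halving the incidence requires a few hundred, which is why trials sized for optimistic effects are often too small to detect realistic ones.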

Conclusion

We think that all these considerations should be carefully reviewed during the planning or evaluation of a randomized controlled trial. As a more general recommendation, we offer this quote from J. Cornfield: "On being asked to talk on the principles of research, my first thought was to rise after the chairman’s introduction, to say ‘be careful’ and to sit down." (7).

References

1. Altman, D., and Doré, C. (1990): Lancet, 335:149-153.
2. Belizán, J.M., Villar, J., Pineda, O., Gonzales, A.E., Sainz, E., Garrera, G., and Sibrian, R. (1983): JAMA, 249:1161-1165.
3. Belizán, J.M., Villar, J., Gonzales, L., Campodonico, L., and Bergel, E. (1991): N. Eng. J. Med., 325:1399-1405.
4. Breslow, N.E., and Day, N.E. (1980): Statistical Methods in Cancer Research, Vol. 1, pp. 93. Internat’l. Agency for Research on Cancer, Lyon.
5. Chalmers, I., Enkin, M., and Keirse, M., editors (1989): Effective Care in Pregnancy and Childbirth, Vol. 1: Pregnancy, Part 1. Oxford Univ. Press.
6. Clarke, M., Clarke, S., and Jagger, C. (1992): Am. J. Epidemiol., 136:1517-1523.
7. Cornfield, J. (1959): Am. J. Ment. Defic., 64:240-252.
8. Gore, S. (1981): Br. Med. J., 282:1958-1960.
9. Grant, A. (1989): Br. J. Obstet. Gynaecol., 96:397-400.
10. Jacobson, R., and Feinstein, A. (1992): J. Clin. Epidemiol., 45:1265-1287.
11. Kaiser, L. (1989): Stat. Med., 8:1183-1190.
12. Lancet editorial (1992): Lancet, 340:1131-1132.
13. Lachin, J., Matts, J., and Wei, L. (1988): Controlled Clin. Trials, 9:365-374.
14. Langer, A., Victora, C., Victora, M., Barros, F., Farnot, U., Belizán, J., and Villar, J. (1993): Soc. Sci. Med., 36:495-507.
15. Last, J.M., editor (1983): A Dictionary of Epidemiology. Oxford Univ. Press, New York.
16. Launer, L., Villar, J., Kestler, E., and de Onis, M. (1990): Br. J. Obstet. Gynaecol., 97:62-70.
17. Meinert, C. (1986): Clinical Trials. Oxford Univ. Press.
18. Pocock, S.J., Hughes, M.D., and Lee, R.J. (1987): N. Eng. J. Med., 317:426-432.
19. Rothman, K. (1986): Modern Epidemiology. Little Brown, Boston.
20. Sinei, S.K.A., Schulz, K.F., Lamptey, P.R. et al. (1990): Br. J. Obstet. Gynaecol., 97:412-419.
21. Villar, J., Repke, J., Belizán, J.M., and Pareja, G. (1987): Obstet. Gynecol., 70:317-322.
22. Villar, J., de Onis, M., Kestler, E., Bolanos, F., Cerezo, R., and Bernedes, H. (1990): Am. J. Obstet. Gynecol., 163:151-157.
23. Villar, J., and Repke, J. (1990): Am. J. Obstet. Gynecol., 163:1124-1131.
24. Villar, J., Farnot, U., Barros, F., Victora, C., Langer, A., and Belizán, J.M. (1992): N. Eng. J. Med., 327:1266-1271.
25. Wingo, P., Higgins, J., Rubin, G., and Zahniser, C.S., editors (1991): An Epidemiologic Approach to Reproductive Health. Centers for Disease Control/Family Health Internat’l./World Health Org, Atlanta GA.
