Reproductive health
BASIC STATISTICAL METHODS IN REPRODUCTIVE MEDICINE
T.M.M. Farley
Special Programme of Research, Development and Research Training in Human
Reproduction,
World Health Organization, 1211 Geneva 27, Switzerland
Research methodologies
In reviewing different research methodologies used in reproductive medicine
we shall summarize the main types of observational studies and experimental
designs and discuss their strengths and weaknesses. These will range from the
simplest types of observational study starting with informal reports, to more
structured formal comparative studies. The highly structured randomized
controlled trial, which allows careful control over sources of bias in the
comparison, is presented as the most valid methodology for comparing two or
more groups of subjects.
Informal descriptive studies
Case reports.
Case reports are isolated reports of rare or unexpected conditions occurring
among a small number of patients or subjects. These case reports are often
published in the specialized medical literature as letters or preliminary
observations. They form an important first warning system that there may be
dangers or unexpected events associated with the use of a drug or compound. The
occurrence of rare or unexpected events during a formal study with a
new medicinal compound or drug has to be notified to national drug regulatory
authorities as part of the monitoring of research under Good Clinical
Practice procedures. Reports of this type are seldom published more widely.
Example: Soon after the introduction of oral contraceptives (OCs) in
the 1960s, reports of serious side effects associated with their use appeared
in the medical literature. In particular, an English general practitioner
reported the occurrence of pulmonary embolism in a woman taking one of the
first OCs (4). This report was followed by many similar reports in the
medical literature (2).
These reports may by themselves provide sufficient evidence to change medical
practice, but only for very rare conditions or rare exposures. The reports are
by their very nature unusual and are prone to biased reporting and are thus
difficult to interpret. They tend to be ‘contagious’ in the sense that a single
isolated report may encourage other physicians to report similar observations.
Equally there is a tendency for only the first few such events to be reported,
as there is more glory for being the first to report an association. A major
problem of interpretation of case reports is that there is seldom any measure of
the number of people exposed to the new agent or condition, so it is not
possible to estimate the incidence with which the event is occurring, nor to
compare it with a known or expected incidence rate.
Case reports play an important part in monitoring the safety of medical
procedures, but are suggestive rather than conclusive. Their findings must be
further investigated by formal observational or experimental studies.
Case series.
Case series are a descriptive analysis or report of a small number of
subjects receiving a new therapy or having a particular disease or condition.
They are rather more formal than the case report since the case series is
usually deliberately assembled for the specific purpose of describing or
summarizing a group of similar patients. The case report is even less planned or
structured.
They are however subject to the same limitations and constraints as the case
report—they are difficult to interpret, not least because it is not known how
the subjects relate to typical subjects nor how the selection was made.
Informal comparative assessment.
Example A new therapy for endometriosis is given to a group of women
identified in an infertility practice, and the degree of improvement reported,
together with the numbers achieving pregnancy within 1 year. This is compared to
the pregnancy rate from previous women with endometriosis treated as part of the
infertility practice.
Such a report is only a slight improvement over the case series by virtue of
its comparison with a group of previously treated patients, but it is very
difficult to separate the true therapeutic effect from other factors that may be
responsible for the noted improvement, such as closer monitoring of the
patients, the selection of the subjects to be included in the study and the use
of historical data for comparison. The dangers of using ‘historical controls’
rather than ‘concurrent controls’ are illustrated in the following example.
Example A review of the efficacy of anticoagulant therapy following
acute myocardial infarction based on 18 published studies using only historical
controls (i.e. comparing with previous, untreated patients) showed a 54%
reduction in mortality. This was based on over 9000 patients. Subsequent
randomized trials using concurrent control groups showed an average reduction of
about 21% in mortality (3).
Historical controls
Many factors may change between the time that the control subjects were
accrued and the time that the subjects receiving the new therapy were treated.
There may be
differences in referral patterns with the characteristics of the patients
changing slightly, or there may have been an overall improvement in hospital
care over the study period. Such factors are almost impossible to control for
adequately, unless much additional data are available. The increased monitoring
and surveillance of subjects receiving the new therapy may have a strong
beneficial effect (the ‘placebo’ effect). However the main criticism of the use
of historical controls is that there are usually restrictions on who is selected
for the new therapy. Only the less seriously ill subjects would be expected to
benefit from the new therapy and thus the most seriously ill patients, or those
with poor prognosis, would be excluded. Such an exclusion clearly cannot be
applied retrospectively to the control group, and thus comparison between the
control and new therapy groups is seldom valid. There may be other restrictions
for inclusion in
the group receiving the new therapy, such as age or other factors related to
prognosis, that invalidate the comparison. The difficulty is that it is not
possible to estimate the magnitude of the selection bias.
Formal studies
Descriptive, non-comparative studies.
In general descriptive, non-comparative studies are straightforward in their
interpretation, but the design of such studies is extremely complex and the
generalization of the results to a wider population is full of pitfalls. Very
careful planning and implementation are required to ensure that valid estimates
of the main factors of interest are obtained. With sample surveys it is
essential that every potential respondent has an equal chance of being selected
for inclusion in the study, or that an adjustment is made to allow for different
selection probabilities. A good, reliable and up to date ‘sampling frame’ is
required which is used for the selection of the respondents.
Descriptive studies of well-defined groups of subjects (such as the
description of all new acceptors in a family planning clinic over a given
period, or of all women delivering in a maternity hospital) are very simple, but
it is important to recognize that the subjects described may not be typical of
the general population. Those subjects who do not deliver in hospital may form a
substantial part of all deliveries and may also have higher morbidity and
mortality rates. It is impossible to estimate from the data the extent of the
bias since there is no measure of coverage or the characteristics of those not
included. The over-enthusiastic generalization of hospital based statistics to a
wider population is a common error.
Retrospective case-control studies.
The main purpose of the formal epidemiological studies is to allow valid
comparisons between groups by means of observations on an index group (cases)
and a control group. The case-control study is appropriate if the outcome of
interest is rare, while the cohort or prospective study is appropriate when the
outcome is expected to occur in a large proportion of subjects. In the
case-control study an index group is identified (cases) and compared with a
similar group of subjects who do not have the disease or endpoint (controls).
The frequency with which the exposure of interest occurs in the two groups is
compared (Table 1).
The basic analysis concentrates on the question whether the proportion of
exposed subjects is similar in the two groups. Adequate adjustment for other
factors both related to the outcome and the exposure (confounding variables)
must be made, but the basic analysis follows the conceptual framework above.
This straightforward design hides major difficulties in the implementation
and interpretation of the case-control study. The selection of controls is
probably the most controversial aspect of the design, for it is essential that
the controls are similar to the cases and are just as likely to have been
exposed to the factor of interest. Any differences in the rates of exposure
would therefore be attributable to the association between the exposure and the
outcome.
TABLE 1. Case-control study: basic model.
| Outcome | Exposed | Not exposed |
| Case    | x %     | 100 - x %   |
| Control | y %     | 100 - y %   |
Example The WHO Study of Neoplasia and Steroid Contraception was conducted
in 8 developing and 3 developed countries and recruited cases of cervical,
breast, endometrial, ovarian, liver and gallbladder cancer from participating
hospitals. For each case, controls of a similar age were selected from the same
hospital, but admitted for other diseases not thought to be related to the use
of hormonal contraception. All cases and controls were interviewed using a
standard questionnaire while still in hospital, with particular emphasis on
their use of contraceptive methods. Since women in developing countries have
limited access to health care, hospitalization in case of serious illness and
the use of contraception are probably correlated. Thus the use of hospitalized
controls ensures comparability between the index and control groups. Data on a
range of other variables related to both the exposure and endpoint, such as
demographic and behavioural variables, were collected. The main confounding
variables were age at diagnosis or interview and parity—both of these are
strongly related to the women’s patterns of contraceptive use, as well as their
likelihood of developing any of the study diseases. Certain analyses had to
control for additional factors, such as sexual behaviour in the analysis of the
risk of cervical cancer associated with the use of depot-medroxyprogesterone
acetate (DMPA) (8,9).
In addition to the avoidance of referral bias, other important sources of
bias must be addressed in the design and implementation of case-control studies.
The two groups of subjects may not be truly comparable (selection bias), the
classification of disease may be influenced by knowledge of the exposure
(classification bias), the subjects’ recall of contraceptive use may differ
(recall bias), or the rates of refusal may differ according to the disease
and/or the exposure (response bias).
The advantage of the case-control design is that such studies are relatively
cheap and quick to perform, particularly for rare diseases, but the control of
bias is a major difficulty. The studies are very difficult to perform properly.
Prospective non-randomized studies
Cross-sectional studies.
The basic method of the cross-sectional study is to select a group of
subjects and classify according to two or more variables of interest.
Example To study a possible association between the incidence of
cardiovascular disease and vasectomy, a sample was taken of all men resident in
certain communes in rural China. Vasectomy status was determined from
administrative records and a questionnaire on general health status together
with a detailed cardiovascular assessment was made on each man. The prevalence
of cardiovascular disease was compared between the vasectomized and
non-vasectomized men (7).
Example The association between anaemia and the use of different
contraceptive methods was assessed by measuring haemoglobin levels in a randomly
selected sample of women attending a family planning clinic.
The cross-sectional design is conceptually very straightforward and simple to
implement. However, it is seldom possible to establish any causality as a result
of the observed association. Adequate control of potential confounding factors
is also difficult.
Prospective non-randomized studies: the cohort study.
In a cohort study a group of subjects, some with the exposure of interest
(index subjects) and others without (control subjects), are assembled and
followed over time until the occurrence of the study endpoint. The incidences of
the endpoint in the two groups are then compared (Table 2).
TABLE 2. Cohort study: basic model.
| Exposure    | Outcome: Yes | Outcome: No |
| Exposed     | x %          | 100 - x %   |
| Not exposed | y %          | 100 - y %   |
Example Groups of women were enrolled in a long term follow-up study to
assess the morbidity associated with the use of different contraceptive methods.
The incidence of deep vein thrombosis was 0.82 and 0.20 per 1000 years of use in
current users of oral contraceptives and non-users, respectively, giving an
incidence ratio of 4.2 (95% confidence interval 2.1 - 10.9) (6).
Example By means of administrative records and the 1964, 1974 and 1984
census returns, a complete enumeration was made of all men resident in certain
communes in 1964 together with their migration, marriage, and vital status.
Linking this list with vasectomy status allowed an assessment to be made of
the death rates in the vasectomized and non-vasectomized groups (7).
The cohort study, by collecting data prospectively, allows the temporal
pattern between the exposure and the endpoint to be identified. Adjustment can
be made for various factors that may confound the relationship of interest (e.g.
smoking) provided adequate data are collected during the period of follow-up.
However it is difficult to be certain that complete adjustment has been made for
the differences between subjects that may have an influence on their
contraceptive choice as well as the incidence of the endpoint. The results of
cohort studies tend to gain wider acceptance than case-control studies since
they are less likely to be subject to hidden and unquantifiable biases, but they
are by no means immune from all bias. For example users of oral contraceptives
may be more likely than non-users to have definitive investigations leading to
the diagnosis of deep vein thrombosis (ascertainment or classification bias). A
further advantage of the cohort design is that more than one endpoint can be
studied and there is the potential to detect unexpected associations as well as
beneficial effects. The expense of assembling and following a cohort over an
extended period is usually a major factor in the implementation of these
designs.
General comment.
The main difficulty with epidemiological studies is the control of bias, or
the assessment of its magnitude. The case-control design is particularly
susceptible to hidden sources of bias, some of which may increase and others
decrease the strength of the observed association. Seldom is a definitive
conclusion reached on the basis of a single epidemiological study. Consistent
results from a range of studies conducted in different settings, using different
methods of case ascertainment or interviewing are required before any consensus
is reached. These studies may use the cross-sectional, case-control and/or
cohort design as well as information obtained from case reports and require a
plausible biological hypothesis to explain the observed association. The main
methodological tool in the observational studies is REPLICATION, i.e. the
repetition of the design in a variety of settings. The apparent simplicity and
ease of implementation of the case-control study must be seen in this context.
It is not necessarily a cheap and simple way to obtain definitive answers to
research questions.
Randomized prospective designs
The randomized controlled trial.
The basic flaw in all other designs is the problem of bias, which is the root of
considerable controversy in the interpretation of study results. The
randomized controlled trial is the ideal study design in that, when properly
conducted, all potential sources of bias are eliminated for the comparison of
interest. In many situations a randomized trial is impossible to conduct for
practical and/or ethical reasons, in which case evidence from some of the other
research designs must suffice. But where randomization is possible, then it is
the simplest way in which to reach a definitive conclusion on the efficacy of a
new product, procedure or treatment.
Example Infertile couples where the male partner was diagnosed as
having poor semen quality of unknown aetiology and the female partner was
‘normal’ were randomized to receive clomiphene citrate or placebo daily for 6
months. The endpoint was pregnancy (11).
Example Women attending family planning clinics requesting an
intrauterine device were randomized to receive the standard or experimental
device, and followed for up to 7 years or until removal of the device (10).
The strength of the randomized controlled trial resides in three main
features: the use of a concurrent control group that is followed and managed in
exactly the same way as the experimental group; the blinding of the patient,
the assessor and the physician to the group to which the subject belongs;
and the use of randomization to ensure balance. Careful adherence to these
principles allows a strong inference to be drawn from the study:
Everything is identical between the two groups of subjects (experimental and
control) apart from the different treatments under study. Hence any observed
differences must be due to the treatment effect.
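The randomization step alone can be illustrated with a minimal Python sketch. It assumes the simplest possible scheme, in which a fixed list of subjects is shuffled and split into two equal groups. The function name `randomize`, the seed and the subject numbers are hypothetical and are not taken from any of the trials cited here, and blinding of the allocation is not modelled.

```python
import random


def randomize(subject_ids, seed=None):
    """Allocate subjects at random to two groups of (near) equal size.

    Hypothetical helper for illustration only: the subjects are shuffled
    and the first half is assigned to the experimental group.
    """
    rng = random.Random(seed)          # a seed makes the list reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    allocation = {}
    for position, subject in enumerate(shuffled):
        allocation[subject] = "experimental" if position < half else "control"
    return allocation


# Twenty hypothetical subjects, numbered 1 to 20.
allocation = randomize(range(1, 21), seed=2024)
n_experimental = sum(1 for g in allocation.values() if g == "experimental")
n_control = sum(1 for g in allocation.values() if g == "control")
print(n_experimental, n_control)
```

Because the assignment depends only on the random shuffle, neither the physician nor the patient can influence which group a subject joins.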
The maximum degree of blindness given the experimental material and design
should be used wherever possible in a controlled clinical trial. At the time of
enrollment the physician should not know to which treatment the patient will be
assigned; this will avoid any deliberate or unintentional bias in selecting
patients with better prognosis to one or other treatment group. The patient
should not know which treatment is received; this ensures that the patient is not
reacting to the presumed benefit of the new therapy, or to any potential adverse
effects. The study personnel who record or evaluate the patient’s response or
take decisions about the termination of participation in the trial should be
unaware of the treatment group assigned; this ensures that all groups are
equally and fairly evaluated.
The use of concurrent controls is essential to permit strict adherence to
the randomization and blindness that are such central features of the design.
Often a placebo control group is used where this is possible, for example if
inert tablets can be packaged in an identical manner to the active compound.
Placebo controls are seldom possible in the comparison between highly effective
treatments (such as in contraceptive studies), but in situations where there is
doubt as to the efficacy of treatment the placebo controls can and should be
used. For contraceptive studies, the randomization may be between the new and
the standard method.
Further details on the design and implementation of the randomized controlled
trial can be found in any of a number of excellent textbooks, such as the one by
Pocock (5) that also provides an interesting historical overview of early
clinical trials.
The normal distribution
The Normal distribution plays a central role in statistical methods. It
frequently arises in practical examples as the distribution of observed data and
also provides the basis for estimation and the construction of confidence
intervals. It has a characteristic bell-shaped, symmetric curve and is
summarized by two parameters: the mean (a measure of location, denoted by the
symbol $\mu$) and the standard deviation (a measure of spread, denoted by the
symbol $\sigma$). Figure 1 shows the percentage
distribution of the haemoglobin levels among 196 healthy men participating in a
clinical study. Their mean is 145.0 g/L with standard deviation 9.8 g/L and the
distribution is approximately symmetric.
The Normal distribution, illustrated in Figure 2,
is completely described by the two parameters $\mu$ and $\sigma$.
A property of the distribution is that exactly 95% of the distribution, or area
under the curve, lies between $\mu - 1.96\sigma$ and
$\mu + 1.96\sigma$, and 99% between the limits $\mu - 2.58\sigma$
and $\mu + 2.58\sigma$. The critical values of the standard
Normal distribution are widely tabulated and are denoted by $z_{\alpha/2}$
(see for example Altman (1)). These are the values within which exactly
$100(1-\alpha)\%$ of the population lie and some commonly used values are given
in Table 3.
It is not possible to observe the population distribution of haemoglobin
levels without taking measurements on all men and in practice we estimate the
distribution by taking a random sample from the population and computing the
mean ($\bar{x}$) and standard deviation
(SD($x$) or $s$) from the sample. If the sample data are approximately
Normally distributed then approximately 95% of the sample will lie between the
limits $\bar{x} - 1.96s$ and $\bar{x} + 1.96s$.
Such limits may be used to establish a reference range for diagnostic purposes
within which 95% of the population of healthy men would be expected to lie.
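As a worked illustration, the reference range for the haemoglobin sample described above (mean 145.0 g/L, standard deviation 9.8 g/L) can be computed with a short Python sketch. The function name `reference_range` is an assumption made for this example.

```python
def reference_range(mean, sd, z=1.96):
    """Return the (lower, upper) limits mean - z*sd and mean + z*sd."""
    return mean - z * sd, mean + z * sd


# Sample mean and standard deviation of haemoglobin quoted in the text.
lower, upper = reference_range(145.0, 9.8)
print(f"95% reference range: {lower:.1f} to {upper:.1f} g/L")
```

The limits are roughly 125.8 and 164.2 g/L, a span of almost 40 g/L.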
TABLE 3. Selected (two-sided) percentage points of the Normal distribution.
| $\alpha$       | 0.40   | 0.20   | 0.10   | 0.05   | 0.01   | 0.005  | 0.001  |
| $z_{\alpha/2}$ | 0.8416 | 1.2816 | 1.6449 | 1.9600 | 2.5758 | 2.8070 | 3.2905 |
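The tabulated percentage points can also be obtained numerically. The sketch below uses `NormalDist` from the Python standard library; the helper name `critical_value` is an assumption made for this example.

```python
from statistics import NormalDist


def critical_value(alpha):
    """Two-sided critical value: the point cutting off alpha/2 in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)


# The significance levels listed in Table 3.
for alpha in (0.40, 0.20, 0.10, 0.05, 0.01, 0.005, 0.001):
    print(f"alpha = {alpha:<5}  z = {critical_value(alpha):.4f}")
```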
The standard error
If we had taken measurements on a different sample of men from the same
population we would expect to obtain slightly different values for the sample
mean and standard deviation. The precision with which the sample mean is
estimated is measured by the standard error of the mean, SE($\bar{x}$).
This is simply related to the standard deviation $\sigma$ of the population from
which the sample is taken: SE($\bar{x}$) = $\sigma/\sqrt{n}$, where $n$ is the
number of observations in the sample. Since the population standard deviation
is not known, it is estimated by the sample standard deviation. Thus
SE($\bar{x}$) is estimated by SD($x$)/$\sqrt{n}$, or $s/\sqrt{n}$.
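For the haemoglobin data, taking the sample standard deviation 9.8 g/L and the sample size of 196 used in the worked confidence interval, the estimated standard error can be sketched in Python as follows. The function name `standard_error` is an assumption made for this example.

```python
import math


def standard_error(sd, n):
    """Estimated standard error of the mean: sample SD divided by sqrt(n)."""
    return sd / math.sqrt(n)


se = standard_error(9.8, 196)
print(f"SE = {se:.2f} g/L")
```

The standard error (0.70 g/L) is much smaller than the standard deviation because it describes the uncertainty in the mean, not the spread of individual values.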
Confidence intervals
Confidence intervals define the range within which the true, unobservable
value of the population mean is likely to lie, taking into account the person to
person variability and also the variability introduced due to the sampling
process. If the population distribution is exactly Normal then the $100(1-\alpha)\%$
confidence interval for the population mean, $\mu$, is
$(\bar{x} - z_{\alpha/2}\, s/\sqrt{n},\ \bar{x} + z_{\alpha/2}\, s/\sqrt{n})$.
Thus the lower limit of the 95% confidence interval for the mean
haemoglobin value in the population of men from which our sample was drawn is
$145.0 - 1.96 \times 9.8/\sqrt{196} = 145.0 - 1.96 \times 0.70 = 145.0 - 1.37 = 143.6$ g/L.
The upper limit is $145.0 + 1.37 = 146.4$ g/L. If the
number of observations in the sample is small (less than about 100) then the
variability involved in using the sample estimate $s$ instead of the
population standard deviation needs to be taken into account. In this case we
use the critical values from Student's t distribution instead of the Normal
distribution.
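The calculation above can be reproduced with a minimal Python sketch based on the Normal critical value; it does not implement the Student's t adjustment for small samples. The function name `mean_confidence_interval` is an assumption made for this example.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, level=0.95):
    """Normal-based confidence interval for a population mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Haemoglobin example: mean 145.0 g/L, SD 9.8 g/L, n = 196.
lower, upper = mean_confidence_interval(145.0, 9.8, 196)
print(f"95% CI: {lower:.1f} to {upper:.1f} g/L")
```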
The confidence interval is much narrower than the reference interval. It
defines a range within which the true but unknown population mean value
is likely to lie, and is based on the concept of repeated sampling. If a random
sample of men were to be taken independently from the same population 100 times
then about 95 of the 100 resulting 95% confidence intervals would be expected
to contain the true population mean. We have no means of knowing whether a
reported confidence interval
constructed from one sample does or does not contain the true population mean.
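The repeated-sampling interpretation can be checked by simulation. The Python sketch below assumes, purely for illustration, that the population is exactly Normal with mean 145.0 and standard deviation 9.8 g/L. The function name `coverage`, the number of repetitions and the seed are hypothetical choices.

```python
import math
import random


def coverage(mu=145.0, sigma=9.8, n=196, repeats=2000, seed=1):
    """Proportion of simulated 95% confidence intervals that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = math.fsum(sample) / n
        # Sample standard deviation with the usual n - 1 divisor.
        s = math.sqrt(math.fsum((x - xbar) ** 2 for x in sample) / (n - 1))
        half_width = 1.96 * s / math.sqrt(n)
        if abs(xbar - mu) <= half_width:
            hits += 1
    return hits / repeats


proportion = coverage()
print(f"Proportion of intervals containing the true mean: {proportion:.3f}")
```

The proportion should be close to 0.95. In the simulation the true mean is known, which is never the case for a single real sample.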
Comparison of two means
A frequent question that arises in medical research is whether two sets of
observations can plausibly come from the same population distribution, or
whether there is evidence that the distributions are different. Because of the
variability introduced by the sampling process we would not expect the mean and
standard deviation of the two samples to be identical even if they did come from
the same distribution. The question is how far apart would we allow the means
from the two samples to be before we would consider them as different. Suppose
that the first sample contains $n_1$ observations with
mean $\bar{x}_1$ and standard
deviation $s_1$ and the second sample contains $n_2$
observations with mean $\bar{x}_2$
and standard deviation $s_2$. These samples come from
populations with means $\mu_1$ and $\mu_2$
and standard deviations $\sigma_1$
and $\sigma_2$, respectively, as
shown in Figure 3. It is clear that unless $\mu_1$
and $\mu_2$ are well separated there will be overlap
between the two samples, and in particular if $\mu_1 = \mu_2$
(i.e. there is no difference in mean levels between the two groups) then the two
samples would be indistinguishable, apart from the random variation introduced
by the sampling process.
The natural statistic to consider to test the hypothesis that $\mu_1 = \mu_2$
is the difference between the sample means, $d = \bar{x}_1 - \bar{x}_2$.
This is an estimate of $\delta = \mu_1 - \mu_2$. Large positive values of $d$
would lead to the conclusion that $\mu_1 > \mu_2$ and large negative values to
the conclusion that $\mu_1 < \mu_2$.
In order to assess whether $\delta$ differs from 0 we need to know the standard
deviation of $d$. This is given by

$$\mathrm{SD}(d) = \sqrt{\mathrm{SD}(\bar{x}_1)^2 + \mathrm{SD}(\bar{x}_2)^2} = \sqrt{s_1^2/n_1 + s_2^2/n_2}.$$

An approximate 95% confidence interval for $\delta$
is thus $(d - 1.96 \times \mathrm{SD}(d),\ d + 1.96 \times \mathrm{SD}(d))$. If this
confidence interval does not include the value 0 then we would conclude that the
two samples came from different populations.
Example In order to study whether the use of an intrauterine device
(IUD) was associated with anaemia, haemoglobin levels were measured in 90 IUD
users and compared with those of a group of 80 oral contraceptive (OC) users.
The sample means were 124.3 and 128.2 g/L, respectively, with standard
deviations 11.3 and 8.4 g/L. The difference between the mean values (OC users
minus IUD users) is $d = 3.9$ g/L and
$\mathrm{SD}(d) = \sqrt{11.3^2/90 + 8.4^2/80} = 1.52$ g/L. Thus the 95%
confidence interval for the difference $\delta$ is
$(3.9 - 1.96 \times 1.52,\ 3.9 + 1.96 \times 1.52)$, or (0.9, 6.9) g/L. This excludes
the value 0 corresponding to no difference, and thus we conclude that the IUD
users have lower haemoglobin levels than the OC users. The difference however is
not large.
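The arithmetic of this example can be reproduced with a short Python sketch. The function name `difference_ci` is an assumption made for this example, and the OC group is passed first so that the difference is positive, as in the text.

```python
import math


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Difference of two sample means, its SD, and an approximate 95% CI."""
    d = mean1 - mean2
    sd_d = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return d, sd_d, (d - z * sd_d, d + z * sd_d)


# OC users (mean 128.2, SD 8.4, n = 80) minus IUD users (124.3, 11.3, 90).
d, sd_d, (lower, upper) = difference_ci(128.2, 8.4, 80, 124.3, 11.3, 90)
print(f"d = {d:.1f} g/L, SD(d) = {sd_d:.2f} g/L, "
      f"95% CI = ({lower:.1f}, {upper:.1f}) g/L")
```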
Significance tests
A closely related procedure to the construction of a confidence interval for
the difference between the two means is to perform a significance test of the
hypothesis that the two population means are in fact the same. This is the
null hypothesis, sometimes written $H_0: \mu_1 = \mu_2$. The significance test
is performed by computing
the p-value associated with the null hypothesis. This is the probability
that a difference of the magnitude observed or larger would have arisen
just by chance if the null hypothesis were true. If the p-value is
sufficiently small (conventionally less than 5% or 1%), then we reject
the null hypothesis, since the observed data would be unlikely if it were true.
If the p-value is not sufficiently small, then we regard there as being
insufficient evidence against the null
hypothesis. It is a common misunderstanding to interpret the p-value as
the probability that the null hypothesis is true. The null hypothesis is either
true or false and all we can estimate from the data available is the strength of
evidence against the null hypothesis.
The p-value is computed from the same summary statistics as before,
but instead of constructing the confidence interval, we compute the z-score
$z = d/\mathrm{SD}(d)$. If the null hypothesis is true then $z$
would have mean 0 and standard deviation 1. If $\mu_1 > \mu_2$
then $z$ would have mean greater than 0, though its standard deviation
would still be 1. We thus refer $z$ to tables of the standard Normal
distribution.
Example In the example above we found that $d = 3.9$ g/L and
$\mathrm{SD}(d) = 1.52$ g/L. Thus $z = 3.9/1.52 = 2.57$. Reference to tables of
the standard Normal distribution gives a two-sided p = 0.0103, or approximately
0.01. This is smaller than 5% and
thus we would formally reject the null hypothesis that the two groups have the
same haemoglobin levels.
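In place of printed tables, the z-score and the two-sided p-value can be computed directly. The Python sketch below uses `NormalDist` from the standard library; the function name `two_sided_p` is an assumption made for this example.

```python
from statistics import NormalDist


def two_sided_p(d, sd_d):
    """z-score for a difference d and its two-sided Normal p-value."""
    z = d / sd_d
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # area in both tails beyond |z|
    return z, p


# Difference and standard deviation from the haemoglobin example.
z, p = two_sided_p(3.9, 1.52)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The p-value is doubled because a difference in either direction would count as evidence against the null hypothesis.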
There is a tendency to report p-values as being less than 0.05, 0.01
or 0.001 and assign one, two or three asterisks (*, ** or ***) accordingly. It
is much more informative to give the exact p-value to a reasonable number
of significant digits than the grouping into these arbitrary categories.
Similarly, confidence intervals are becoming more widely used as summaries of
data since these not only allow an assessment of whether the null hypothesis
can be rejected (rejecting the null hypothesis at the 5% level is equivalent to
the 95% confidence interval excluding the value 0), but also give the likely
range in which the true difference lies.
References
1. Altman, D.G. (1991): Practical Statistics for Medical Research.
Chapman and Hall, London.
2. Ask-Upmark, E. (1966): Acta Med. Scand., 179:463-473.
3. Chalmers, T.C., Matta, R.J., Smith, H., and Kunzler, A.M. (1977): N.
Engl. J. Med., 297:1091-1096.
4. Jordan, W.M. (1961): Lancet, 2:1146-1147.
5. Pocock, S.J. (1983): Clinical Trials: A Practical Approach. John
Wiley, Chichester, UK.
6. Royal College of General Practitioners (1978): J. R. Coll. Gen. Pract.,
28:393-399.
7. Tang Guang-hua, Zhong Yu-hui, Ma Yue-min, Luo Lin, Cui Kai, Luo Jian, Zhang
Guo-hui, An I-min, Luo Dechun, Qiu Shu-hua, Farley, T.M.M., Rosenberg, M.J. and
Strasser, T. (1988): Int. J. Epidemiol., 17:608-617.
8. WHO Collaborative Study of Neoplasia and Steroid Contraceptives (1991):
Lancet, 338:833-838.
9. WHO Collaborative Study of Neoplasia and Steroid Contraceptives (1992):
Contraception, 45:299-312.
10. World Health Organization Special Programme of Research, Development and
Research Training in Human Reproduction Task Force on the Safety and Efficacy of
Fertility Regulating Methods (1990): Contraception, 42:141-158.
11. World Health Organization Task Force on the Diagnosis and Management of
Infertility (1992): Int. J. Androl., 15:299-307.