Basic statistical methods in reproductive medicine


T.M.M. Farley

Special Programme of Research, Development and Research Training in Human Reproduction,
World Health Organization, 1211 Geneva 27, Switzerland

Research methodologies

In reviewing the different research methodologies used in reproductive medicine we shall summarize the main types of observational studies and experimental designs and discuss their strengths and weaknesses. These range from the simplest types of observational study, starting with informal reports, to more structured formal comparative studies. The highly structured randomized controlled trial, which allows careful control over sources of bias in the comparison, is presented as the most valid methodology for comparing two or more groups of subjects.

Informal descriptive studies

Case reports.

Case reports are isolated reports of rare or unexpected conditions occurring among a small number of patients or subjects. These case reports are often published in the specialized medical literature as letters or preliminary observations. They form an important first warning system that there may be dangers or unexpected events associated with the use of a drug or compound. The occurrence of rare or unexpected events during a formal study with a new medicinal compound or drug has to be notified to national drug regulatory authorities as part of the monitoring of research under the Good Clinical Practice procedures. These types of report are seldom more widely reported.
Example: Soon after the introduction of oral contraceptives (OCs) in the 1960s, reports appeared in the medical literature of serious side effects associated with their use. In particular, a case of pulmonary embolism in a woman taking one of the first OCs was reported by an English general practitioner (4). This report was followed by many similar reports in the medical literature (2).
These reports may by themselves provide sufficient evidence to change medical practice, but only for very rare conditions or rare exposures. The reports are by their very nature unusual, are prone to biased reporting and are thus difficult to interpret. They tend to be ‘contagious’ in the sense that a single isolated report may encourage other physicians to report similar observations. Equally there is a tendency for only the first few such events to be reported, as there is more glory in being the first to report an association. A major problem in the interpretation of case reports is that there is seldom any measure of the number of people exposed to the new agent or condition, so it is not possible to estimate the incidence with which the event is occurring, nor to compare it with a known or expected incidence rate.
Case reports play an important part in monitoring the safety of medical procedures, but are suggestive rather than conclusive. Their findings must be further investigated by formal observational or experimental studies.

Case series.

Case series are a descriptive analysis or report of a small number of subjects receiving a new therapy or having a particular disease or condition. They are rather more formal than the case report since the case series is usually deliberately assembled for the specific purpose of describing or summarizing a group of similar patients. The case report is even less planned or structured. They are however subject to the same limitations and constraints as the case report—they are difficult to interpret, not least because it is not known how the subjects relate to typical subjects nor how the selection was made.

Informal comparative assessment.

Example: A new therapy for endometriosis is given to a group of women identified in an infertility practice, and the degree of improvement reported, together with the numbers achieving pregnancy within 1 year. This is compared to the pregnancy rate from previous women with endometriosis treated as part of the infertility practice.
Such a report is only a slight improvement over the case series by virtue of its comparison with a group of previously treated patients, but it is very difficult to separate the true therapeutic effect from other factors that may be responsible for the noted improvement, such as closer monitoring of the patients, the selection of the subjects to be included in the study and the use of historical data for comparison. The dangers of using ‘historical controls’ rather than ‘concurrent controls’ are illustrated in the following example.
Example: A review of the efficacy of anticoagulant therapy following acute myocardial infarction based on 18 published studies using only historical controls (i.e. comparing with previous, untreated patients) showed a 54% reduction in mortality. This was based on over 9000 patients. Subsequent randomized trials using concurrent control groups showed an average reduction of about 21% in mortality (3).

Historical controls

Many factors may change between the time that the control subjects were accrued and the time that the subjects receiving the new therapy were treated. There may be differences in referral patterns with the characteristics of the patients changing slightly, or there may have been an overall improvement in hospital care over the study period. Such factors are almost impossible to control for adequately, unless much additional data are available. The increased monitoring and surveillance of subjects receiving the new therapy may have a strong beneficial effect (the ‘placebo’ effect). However the main criticism of the use of historical controls is that there are usually restrictions on who is selected for the new therapy. Only the less seriously ill subjects would be expected to benefit from the new therapy and thus the most seriously ill patients, or those with poor prognosis, would be excluded. Such an exclusion could clearly not be applied to the control group and thus comparison between the control and new therapy groups is seldom valid. There may be other restrictions for inclusion in the group receiving the new therapy, such as age or other factors related to prognosis, that invalidate the comparison. The difficulty is that it is not possible to estimate the magnitude of the selection bias.

Formal studies

Descriptive, non-comparative studies.

In general descriptive, non-comparative studies are straightforward in their interpretation, but the design of such studies is extremely complex and the generalization of the results to a wider population is full of pitfalls. Very careful planning and implementation are required to ensure that valid estimates of the main factors of interest are obtained. With sample surveys it is essential that every potential respondent has an equal chance of being selected for inclusion in the study, or that an adjustment is made to allow for different selection probabilities. A good, reliable and up to date ‘sampling frame’ is required which is used for the selection of the respondents. Descriptive studies of well-defined groups of subjects (such as the description of all new acceptors in a family planning clinic over a given period, or of all women delivering in a maternity hospital) are very simple, but it is important to recognize that the subjects described may not be typical of the general population. Those subjects who do not deliver in hospital may form a substantial part of all deliveries and may also have higher morbidity and mortality rates. It is impossible to estimate from the data the extent of the bias since there is no measure of coverage or the characteristics of those not included. The over-enthusiastic generalization of hospital based statistics to a wider population is a common error.

Retrospective case-control studies.

The main purpose of the formal epidemiological studies is to allow valid comparisons between groups by means of observations on an index group (cases) and a control group. The case-control study is appropriate if the outcome of interest is rare, while the cohort or prospective study is appropriate when the outcome is expected to occur in a large proportion of subjects. In the case-control study an index group is identified (cases) and compared with a similar group of subjects who do not have the disease or endpoint (controls). The frequency with which the exposure of interest occurs in the two groups is compared (Table 1). The basic analysis concentrates on the question of whether the proportion of exposed subjects is similar in the two groups. Adequate adjustment must be made for other factors related to both the outcome and the exposure (confounding variables), but the basic analysis follows the conceptual framework above.
This straightforward design hides major difficulties in the implementation and interpretation of the case-control study. The selection of controls is probably the most controversial aspect of the design, for it is essential that the controls are similar to the cases and are just as likely to have been exposed to the factor of interest. Any differences in the rates of exposure would therefore be attributable to the association between the exposure and the outcome.

TABLE 1. Case-control study: basic model.

Outcome	Exposed (%)	Not exposed (%)
Case	x	100 - x
Control	y	100 - y
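
The comparison laid out in Table 1 can be sketched in a few lines of Python. The counts below are invented purely for illustration and are not taken from any study cited here, and the function name `exposure_summary` is hypothetical. The odds ratio is included as one conventional summary of such a table, although the text does not name it; this is a crude calculation that makes no adjustment for confounding variables.

```python
def exposure_summary(cases_exposed, cases_total, controls_exposed, controls_total):
    """Return the percentage exposed among cases (x) and controls (y),
    together with the crude odds ratio of exposure."""
    x = 100.0 * cases_exposed / cases_total
    y = 100.0 * controls_exposed / controls_total
    # Odds of exposure in each group; assumes no cell of the table is empty.
    odds_cases = cases_exposed / (cases_total - cases_exposed)
    odds_controls = controls_exposed / (controls_total - controls_exposed)
    return x, y, odds_cases / odds_controls


# Hypothetical counts: 40 of 100 cases and 50 of 200 controls were exposed.
x, y, odds_ratio = exposure_summary(40, 100, 50, 200)
print(f"cases exposed: {x:.1f}%, controls exposed: {y:.1f}%, odds ratio: {odds_ratio:.2f}")
```

An odds ratio near 1 corresponds to similar proportions exposed in the two groups.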

Example: The WHO Study of Neoplasia and Steroid Contraception was conducted in 8 developing and 3 developed countries and recruited cases of cervical, breast, endometrial, ovarian, liver and gallbladder cancer from participating hospitals. For each case, controls of a similar age were selected from the same hospital, but admitted for other diseases not thought to be related to the use of hormonal contraception. All cases and controls were interviewed using a standard questionnaire while still in hospital, with particular emphasis on their use of contraceptive methods. Since women in developing countries have limited access to health care, hospitalization in case of serious illness and the use of contraception are probably correlated. Thus the use of hospitalized controls ensures comparability between the index and control groups. Data on a range of other variables related to both the exposure and endpoint, such as demographic and behavioural variables, were collected. The main confounding variables were age at diagnosis or interview and parity; both of these are strongly related to the women’s patterns of contraceptive use, as well as their likelihood of developing any of the study diseases. Certain analyses had to control for additional factors, such as sexual behaviour in the analysis of the risk of cervical cancer associated with the use of depot-medroxyprogesterone acetate (DMPA) (8,9).
In addition to the avoidance of referral bias, other important sources of bias must be addressed in the design and implementation of case-control studies. The two groups of subjects may not be truly comparable (selection bias), the classification of disease may be influenced by knowledge of the exposure (classification bias), the subjects’ recall of contraceptive use may differ (recall bias), or the rates of refusal may differ according to the disease and/or the exposure (response bias).
The advantage of the case-control design is that such studies are relatively cheap and quick to perform, particularly for rare diseases, but the control of bias is a major difficulty. The studies are very difficult to perform properly.

Prospective non-randomized studies

Cross-sectional studies.

The basic method of the cross-sectional study is to select a group of subjects and classify them according to two or more variables of interest.
Example: To study a possible association between the incidence of cardiovascular disease and vasectomy, a sample was taken of all men resident in certain communes in rural China. Vasectomy status was determined from administrative records and a questionnaire on general health status together with a detailed cardiovascular assessment was made on each man. The prevalence of cardiovascular disease was compared between the vasectomized and non-vasectomized men (7).
Example: The association between anaemia and the use of different contraceptive methods was assessed by measuring haemoglobin levels in a randomly selected sample of women attending a family planning clinic.
The cross-sectional design is conceptually very straightforward and simple to implement. However it is seldom possible to establish any causality as a result of the observed association. Adequate control of potential confounding factors is also difficult.

Prospective non-randomized studies: the cohort study.

In a cohort study a group of subjects, some with the exposure of interest (index subjects) and others without (control subjects), is assembled and followed over time until the occurrence of the study endpoint. The incidence of the endpoint in the two groups is then compared (Table 2).

TABLE 2. Cohort study: basic model.

Exposure	Outcome: yes (%)	Outcome: no (%)
Exposed	x	100 - x
Not exposed	y	100 - y

Example: Groups of women were enrolled in a long term follow-up study to assess the morbidity associated with the use of different contraceptive methods. The incidence of deep vein thrombosis was 0.82 and 0.20 per 1000 years of use in current users of oral contraceptives and non-users, respectively, giving an incidence ratio of 4.2 (95% confidence interval 2.1 - 10.9) (6).
Example: By means of administrative records and the 1964, 1974 and 1984 census returns, a complete enumeration was made of all men resident in certain communes in 1964 together with their migration, marriage, and vital status. Interleaving this list with vasectomy status allowed an assessment to be made of the death rates in the vasectomized and non-vasectomized groups (7).
The cohort study, by collecting data prospectively, allows the temporal pattern between the exposure and the endpoint to be identified. Adjustment can be made for various factors that may confound the relationship of interest (e.g. smoking) provided adequate data are collected during the period of follow-up. However it is difficult to be certain that complete adjustment has been made for the differences between subjects that may have an influence on their contraceptive choice as well as the incidence of the endpoint. The results of cohort studies tend to gain wider acceptance than case-control studies since they are less likely to be subject to hidden and unquantifiable biases, but they are by no means immune from all bias. For example users of oral contraceptives may be more likely than non-users to have definitive investigations leading to the diagnosis of deep vein thrombosis (ascertainment or classification bias). A further advantage of the cohort design is that more than one endpoint can be studied and there is the potential to detect unexpected associations as well as beneficial effects. The expense of assembling and following a cohort over an extended period is usually a major factor in the implementation of these designs.
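
The incidence ratio quoted in the deep vein thrombosis example can be sketched as follows. Only the two rates (0.82 and 0.20 per 1000 years of use) come from the text; the event count and person-years used to illustrate `incidence_rate` are hypothetical, as are the function names. The rounded rates give a ratio of about 4.1, so the published figure of 4.2 presumably reflects unrounded data, which is an assumption here.

```python
def incidence_rate(events, person_years, per=1000):
    """Number of events per `per` person-years of observation."""
    return per * events / person_years


def rate_ratio(rate_exposed, rate_unexposed):
    """Ratio of the incidence rate in the exposed to that in the unexposed."""
    return rate_exposed / rate_unexposed


# Hypothetical counts that happen to give the quoted rate for OC users.
example_rate = incidence_rate(events=41, person_years=50_000)
print(round(example_rate, 2))

# Rates taken from the text (per 1000 years of use).
ratio = rate_ratio(0.82, 0.20)
print(round(ratio, 1))
```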

General comment.

The main difficulty with epidemiological studies is the control of bias, or the assessment of its magnitude. The case-control design is particularly susceptible to hidden sources of bias, some of which may increase and others decrease the strength of the observed association. Seldom is a definitive conclusion reached on the basis of a single epidemiological study. Consistent results from a range of studies conducted in different settings, using different methods of case ascertainment or interviewing are required before any consensus is reached. These studies may use the cross-sectional, case-control and/or cohort design as well as information obtained from case reports and require a plausible biological hypothesis to explain the observed association. The main methodological tool in the observational studies is REPLICATION, i.e. the repetition of the design in a variety of settings. The apparent simplicity and ease of implementation of the case-control study must be seen in this context. It is not necessarily a cheap and simple way to obtain definitive answers to research questions.

Randomized prospective designs

The randomized controlled trial.

The basic flaw in all other designs is the problem of bias, which is the root of considerable controversy in the interpretation of study results. The randomized controlled trial is the ideal study design in that, when properly conducted, all potential sources of bias are eliminated for the comparison of interest. In many situations a randomized trial is impossible to conduct for practical and/or ethical reasons, in which case evidence from some of the other research designs must suffice. But where randomization is possible, it is the simplest way in which to reach a definitive conclusion on the efficacy of a new product, procedure or treatment.
Example: Infertile couples where the male partner was diagnosed as having poor semen quality of unknown aetiology and the female partner was ‘normal’ were randomized to receive clomiphene citrate or placebo daily for 6 months. The endpoint was pregnancy (11).
Example: Women attending family planning clinics requesting an intrauterine device were randomized to receive the standard or experimental device, and followed for up to 7 years or until removal of the device (10).
The strength of the randomized controlled trial resides in three main features: the use of a concurrent control group that is followed and managed in exactly the same way as the experimental group; the lack of knowledge on the part of the patient, the assessor and the physician of the group to which the subject belongs; and the use of randomization to ensure balance. Careful adherence to these principles allows the following strong inference to be drawn from the study:
Everything is identical between the two groups of subjects (experimental and control) apart from the different treatments under study. Hence any observed differences must be due to the treatment effect.
The maximum degree of blindness compatible with the experimental material and design should be used wherever possible in a controlled clinical trial. At the time of enrollment the physician should not know to which treatment the patient will be assigned; this will avoid any deliberate or unintentional bias in selecting patients with better prognosis to one or other treatment group. The patient should not know which treatment is received; this ensures that the patient is not reacting to the presumed benefit of the new therapy, or to any potential adverse effects. The study personnel who record or evaluate the patient’s response or take decisions about the termination of participation in the trial should be unaware of the treatment group assigned; this ensures that all groups are equally and fairly evaluated. The use of concurrent controls is essential to permit strict adherence to the randomization and blindness that are such central features of the design. Often a placebo control group is used where this is possible, for example if inert tablets can be packaged in an identical manner to the active compound. Placebo controls are seldom possible in the comparison between highly effective treatments (such as in contraceptive studies), but in situations where there is doubt as to the efficacy of treatment placebo controls can and should be used. For contraceptive studies, the randomization may be between the new and the standard method.
Further details on the design and implementation of the randomized controlled trial can be found in any of a number of excellent textbooks, such as the one by Pocock (5), which also provides an interesting historical overview of early clinical trials.
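
As a minimal sketch of how a randomization list might be generated, the following Python code assigns subjects to two arms in randomly permuted blocks. The use of permuted blocks, the block size of 4, the arm labels and the function name `randomize` are all assumptions made for illustration; the text does not prescribe a particular randomization scheme.

```python
import random


def randomize(subject_ids, arms=("experimental", "control"), block_size=4, seed=None):
    """Assign subjects to arms in randomly permuted blocks of `block_size`.

    Within each complete block every arm appears equally often, which keeps
    the group sizes balanced as enrollment proceeds.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # a fixed seed makes the list reproducible
    allocation = {}
    block = []
    for subject in subject_ids:
        if not block:
            # Start a new block containing each arm equally often, then shuffle it.
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)
        allocation[subject] = block.pop()
    return allocation


allocation = randomize(range(1, 13), seed=2024)
counts = {arm: list(allocation.values()).count(arm) for arm in ("experimental", "control")}
print(counts)
```

In keeping with the blindness requirements described above, such a list would be prepared in advance and kept concealed from the physician enrolling the patients.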

The normal distribution

The Normal distribution plays a central role in statistical methods. It frequently arises in practical examples as the distribution of observed data and also provides the basis for estimation and the construction of confidence intervals. It has a characteristic bell-shaped, symmetric curve and is summarized by two parameters: the mean (a measure of location, denoted by the symbol $\mu$) and the standard deviation (a measure of spread, denoted by the symbol $\sigma$). Figure 1 shows the percentage distribution of the haemoglobin levels among 196 healthy men participating in a clinical study. Their mean is 145.0 g/L with standard deviation 9.8 g/L and the distribution is approximately symmetric.
The Normal distribution, illustrated in Figure 2, is completely described by the two parameters $\mu$ and $\sigma$. A property of the distribution is that 95% of the distribution, or area under the curve, lies between $\mu - 1.96\sigma$ and $\mu + 1.96\sigma$, and 99% between the limits $\mu - 2.58\sigma$ and $\mu + 2.58\sigma$. The critical values of the standard Normal distribution are widely tabulated and are denoted by $z_{\alpha/2}$ (see for example Altman (1)). These are the values within which exactly $100(1-\alpha)\%$ of the population lie, and some commonly used values are given in Table 3.
It is not possible to observe the population distribution of haemoglobin levels without taking measurements on all men, and in practice we estimate the distribution by taking a random sample from the population and computing the mean ($\bar{x}$) and standard deviation (SD($x$) or $s$) from the sample. If the sample data are approximately Normally distributed then approximately 95% of the sample will lie between the limits $\bar{x} - 1.96s$ and $\bar{x} + 1.96s$. Such limits may be used to establish a reference range for diagnostic purposes within which 95% of the population of healthy men would be expected to lie.

TABLE 3. Selected (two-sided) percentage points of the Normal distribution.

$\alpha$	0.40	0.20	0.10	0.05	0.01	0.005	0.001
$z_{\alpha/2}$	0.8416	1.2816	1.6449	1.9600	2.5758	2.8070	3.2905
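
The critical values in Table 3 and the 95% reference range for the haemoglobin example can be reproduced with the Python standard library, as sketched below. The function names `critical_value` and `reference_range` are illustrative only; the mean and standard deviation are those quoted in the text.

```python
from statistics import NormalDist


def critical_value(alpha):
    """Two-sided critical value z_{alpha/2} of the standard Normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def reference_range(mean, sd, alpha=0.05):
    """Limits expected to contain 100(1 - alpha)% of a Normal population."""
    z = critical_value(alpha)
    return mean - z * sd, mean + z * sd


# Reproduce the percentage points listed in Table 3.
alphas = [0.40, 0.20, 0.10, 0.05, 0.01, 0.005, 0.001]
table3 = {a: round(critical_value(a), 4) for a in alphas}
print(table3)

# 95% reference range for haemoglobin (mean 145.0 g/L, SD 9.8 g/L).
low, high = reference_range(145.0, 9.8)
print(f"95% reference range: {low:.1f} to {high:.1f} g/L")
```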

The standard error

If we had taken measurements on a different sample of men from the same population we would expect to obtain slightly different values for the sample mean and standard deviation. The precision with which the sample mean is estimated is measured by the standard error of the mean, $\mathrm{SE}(\bar{x})$. This is simply related to the standard deviation $\sigma$ of the population from which the sample is taken: $\mathrm{SE}(\bar{x}) = \sigma/\sqrt{n}$, where $n$ is the number of observations in the sample. Since the population standard deviation is not known, it is estimated by the sample standard deviation. Thus $\mathrm{SE}(\bar{x})$ is estimated by $\mathrm{SD}(x)/\sqrt{n}$, or $s/\sqrt{n}$.
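
A short Python sketch of the standard error calculation follows. It uses the sample standard deviation of 9.8 g/L from the haemoglobin example and takes n = 196, the sample size used in the worked confidence interval calculation; the function name `standard_error` is illustrative.

```python
import math


def standard_error(sd, n):
    """Estimated standard error of the mean: sample SD divided by sqrt(n)."""
    return sd / math.sqrt(n)


se = standard_error(9.8, 196)
print(round(se, 2))

# Quadrupling the sample size halves the standard error.
se_larger_sample = standard_error(9.8, 4 * 196)
print(round(se_larger_sample, 2))
```

The second call illustrates that precision improves only with the square root of the sample size.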

Confidence intervals

Confidence intervals define the range within which the true, unobservable value of the population mean is likely to lie, taking into account the person-to-person variability and also the variability introduced by the sampling process. If the population distribution is exactly Normal then the $100(1-\alpha)\%$ confidence interval for the population mean, $\mu$, is $(\bar{x} - z_{\alpha/2}\, s/\sqrt{n},\ \bar{x} + z_{\alpha/2}\, s/\sqrt{n})$. Thus the lower limit of the 95% confidence interval for the mean haemoglobin value in the population of men from which our sample was drawn is $145.0 - 1.96 \times 9.8/\sqrt{196} = 145.0 - 1.96 \times 0.70 = 145.0 - 1.37 = 143.6$ g/L. The upper limit is $145.0 + 1.37 = 146.4$ g/L. If the number of observations in the sample is small (less than about 100) then the variability involved in using the sample estimate $s$ instead of the population standard deviation needs to be taken into account. In this case we use the critical values from the Student’s t distribution instead of the Normal distribution.
The confidence interval is much narrower than the reference interval. It defines a range within which the true but unknown population mean is expected to lie, and is based on the concept of repeated sampling. If a random sample of men were to be taken independently from the same population 100 times then about 95 of the 100 resulting 95% confidence intervals would be expected to contain the true population mean. We have no means of knowing whether a reported confidence interval constructed from one sample does or does not contain the true population mean.
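
The worked calculation for the haemoglobin data can be checked with the following sketch, which uses the Normal critical value and is therefore appropriate only for reasonably large samples, as noted above. The function name `mean_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, level=0.95):
    """Normal-theory confidence interval for a population mean."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for 95%
    half_width = z * sd / math.sqrt(n)       # z times the standard error
    return mean - half_width, mean + half_width


# Haemoglobin example: mean 145.0 g/L, SD 9.8 g/L, n = 196.
lower, upper = mean_confidence_interval(145.0, 9.8, 196)
print(f"95% CI: {lower:.1f} to {upper:.1f} g/L")
```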

Comparison of two means

A frequent question that arises in medical research is whether two sets of observations can plausibly come from the same population distribution, or whether there is evidence that the distributions are different. Because of the variability introduced by the sampling process we would not expect the mean and standard deviation of the two samples to be identical even if they did come from the same distribution. The question is how far apart we would allow the means from the two samples to be before we would consider them as different. Suppose that the first sample contains $n_1$ observations with mean $\bar{x}_1$ and standard deviation $s_1$ and the second sample contains $n_2$ observations with mean $\bar{x}_2$ and standard deviation $s_2$. These samples come from populations with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, respectively, as shown in Figure 3. It is clear that unless $\mu_1$ and $\mu_2$ are well separated there will be overlap between the two samples, and in particular if $\mu_1 = \mu_2$ (i.e. there is no difference in mean levels between the two groups) then the two samples would be indistinguishable, apart from the random variation introduced by the sampling process.
The natural statistic to consider to test the hypothesis that $\mu_1 = \mu_2$ is the difference between the sample means, $d = \bar{x}_1 - \bar{x}_2$. This is an estimate of $\delta = \mu_1 - \mu_2$. Large positive values of $d$ would lead to the conclusion that $\mu_1 > \mu_2$ and large negative values to the conclusion that $\mu_1 < \mu_2$. In order to assess whether $\delta$ differs from 0 we need to know the standard deviation of $d$. This is given by $\mathrm{SD}(d) = \sqrt{\mathrm{SD}(\bar{x}_1)^2 + \mathrm{SD}(\bar{x}_2)^2} = \sqrt{s_1^2/n_1 + s_2^2/n_2}$. An approximate 95% confidence interval for $\delta$ is thus $(d - 1.96 \times \mathrm{SD}(d),\ d + 1.96 \times \mathrm{SD}(d))$. If this confidence interval does not include the value 0 then we would conclude that the two samples came from different populations.
Example: In order to study whether the use of an intrauterine device (IUD) was associated with anaemia, haemoglobin levels were measured on 90 IUD users and compared to a group of 80 oral contraceptive (OC) users. The sample means were 124.3 and 128.2 g/L, respectively, with standard deviations 11.3 and 8.4 g/L. The difference between the mean values (OC users minus IUD users) is $d = 128.2 - 124.3 = 3.9$ g/L and $\mathrm{SD}(d) = \sqrt{11.3^2/90 + 8.4^2/80} = 1.52$ g/L. Thus the 95% confidence interval for the difference is $(3.9 - 1.96 \times 1.52,\ 3.9 + 1.96 \times 1.52)$, or (0.9, 6.9) g/L. This excludes the value 0 corresponding to no difference, and thus we conclude that the IUD users have lower haemoglobin levels than the OC users. The difference however is not large.
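
The figures in this example can be reproduced with the sketch below. The function name `difference_ci` is illustrative, and the difference is taken as OC users minus IUD users so that it is positive, as in the text.

```python
import math


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Difference between two sample means, its standard deviation,
    and an approximate 95% confidence interval (for z = 1.96)."""
    d = mean1 - mean2
    sd_d = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return d, sd_d, (d - z * sd_d, d + z * sd_d)


# OC users: mean 128.2, SD 8.4, n = 80; IUD users: mean 124.3, SD 11.3, n = 90.
d, sd_d, (low, high) = difference_ci(128.2, 8.4, 80, 124.3, 11.3, 90)
print(f"d = {d:.1f} g/L, SD(d) = {sd_d:.2f} g/L, 95% CI = ({low:.1f}, {high:.1f}) g/L")
```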

Significance tests

A closely related procedure to the construction of a confidence interval for the difference between the two means is to perform a significance test of the hypothesis that the two population means are in fact the same. This is the null hypothesis, sometimes written $H_0$: $\mu_1 = \mu_2$. The significance test is performed by computing the p-value associated with the null hypothesis. This is the probability that a difference of the magnitude observed or larger would have arisen just by chance if the null hypothesis were true. If the p-value is sufficiently small (conventionally less than 5% or 1%), then we conclude that the null hypothesis is most likely not to be true. If the p-value is not sufficiently small, then we regard there as being no evidence against the null hypothesis. It is a common misunderstanding to interpret the p-value as the probability that the null hypothesis is true. The null hypothesis is either true or false and all we can estimate from the data available is the strength of evidence against the null hypothesis.
The p-value is computed from the same summary statistics as before, but instead of constructing the confidence interval, we compute the z-score $z = d/\mathrm{SD}(d)$. If the null hypothesis is true then $z$ would have mean 0 and standard deviation 1. If $\mu_1 > \mu_2$ then $z$ would have mean greater than 0, though its standard deviation would still be 1. We thus refer $z$ to tables of the standard Normal distribution.
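
In place of printed tables, the z-score and its p-value can be computed directly, as in the following sketch. It assumes a two-sided test and uses the difference (3.9 g/L) and standard deviation (1.52 g/L) from the comparison of haemoglobin levels in IUD and OC users; the function name `two_sided_p_value` is illustrative.

```python
from statistics import NormalDist


def two_sided_p_value(d, sd_d):
    """z-score for an observed difference and its two-sided p-value
    under the null hypothesis of no difference."""
    z = d / sd_d
    # Probability of a standard Normal value at least as extreme as |z|, in either tail.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


z, p = two_sided_p_value(3.9, 1.52)
print(f"z = {z:.2f}, p = {p:.4f}")
```
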
Example: In the example above we found that $d = 3.9$ g/L and $\mathrm{SD}(d) = 1.52$ g/L. Thus $z = 3.9/1.52 = 2.57$. Reference to tables of the standard Normal distribution gives p = 0.0103, or approximately 0.01. This is smaller than 5% and thus we would formally reject the null hypothesis that the two groups have the same haemoglobin levels.
There is a tendency to report p-values as being less than 0.05, 0.01 or 0.001 and assign one, two or three asterisks (*, ** or ***) accordingly. It is much more informative to give the exact p-value to a reasonable number of significant digits than to group it into these arbitrary categories. Similarly, confidence intervals are becoming more widely used as summaries of data since these not only allow an assessment of whether the data are compatible with the null hypothesis (rejecting the null hypothesis at the 5% level is equivalent to the 95% confidence interval excluding the value 0), but also give the likely range in which the true difference lies.

References

1. Altman, D.G. (1991): Practical Statistics for Medical Research. Chapman and Hall, London.
2. Ask-Upmark, E. (1966): Acta Med. Scand., 179:463-473.
3. Chalmers, T.C., Matta, R.J., Smith, H., and Kunzler, A.M. (1977): N. Engl. J. Med., 297:1091-1096.
4. Jordan, W.M. (1961): Lancet, 2:1146-1147.
5. Pocock, S.J. (1983): Clinical Trials: A Practical Approach. John Wiley, Chichester, UK.
6. Royal College of General Practitioners (1978): J. R. Coll. Gen. Pract., 28:393-399.
7. Tang Guang-hua, Zhong Yu-hui, Ma Yue-min, Luo Lin, Cui Kai, Luo Jian, Zhang Guo-hui, An I-min, Luo Dechun, Qiu Shu-hua, Farley, T.M.M., Rosenberg, M.J. and Strasser, T. (1988): Int. J. Epidemiol., 17:608-617.
8. WHO Collaborative Study of Neoplasia and Steroid Contraceptives (1991): Lancet, 338:833-838.
9. WHO Collaborative Study of Neoplasia and Steroid Contraceptives (1992): Contraception, 45:299-312.
10. World Health Organization Special Programme of Research, Development and Research Training in Human Reproduction Task Force on the Safety and Efficacy of Fertility Regulating Methods (1990): Contraception, 42:141-158.
11. World Health Organization Task Force on the Diagnosis and Management of Infertility (1992): Int. J. Androl., 15:299-307.
