A primary goal for clinicians is to evaluate the benefits and harms from the treatment options before them. *Primum non nocere* is a basic principle of ethical practice. “Evidence-based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence-based medicine means integrating individual clinical expertise with the best available external clinical evidence from systematic research.”^{1} Since the 1990s, evidence-based medicine (EBM) has been gaining a strong foothold in most disciplines of medicine. In order to practice EBM, however, it is paramount that clinicians be capable of assessing and synthesizing medical literature. Effective physicians understand the balance between personal clinical experience and evidence from appropriately conducted research. Clinicians need to proficiently assess the quality of published studies with respect to design, conduct, and analysis, and then determine their relevance and external validity for their own patients. Not all evidence is of equal quality, nor does it translate directly into a definitive plan of care. One needs to understand the basic threats to the validity of studies, namely bias, confounding, and chance. The randomized controlled trial (RCT) and meta-analysis are regarded as the gold standards of EBM. However, RCTs are difficult to perform in critical care medicine.

Critical care medicine is a rapidly evolving discipline that deals with a complex disease spectrum. Sepsis and ARDS, for example, are still poorly understood disease processes with multiple etiologies and phenotypes. Sepsis syndrome may arise from a number of sources, each presenting differently in different patients. Consequently, there are many challenges in the conduct of sound clinical research in this field: critical illness is difficult to define precisely; patient populations may differ drastically with respect to severity of disease, treatment modalities, and other characteristics; and subject recruitment may be difficult.

A sound knowledge of fundamental statistics will lay the basis for critical appraisal of the literature, an imperative clinical tool in every practitioner’s armamentarium. Statistical tools can be used to describe the performance of diagnostic tests, characterize relationships between clinical decisions and outcomes, and estimate survival probabilities, just to name a few of their many uses. Ultimately, one would like to pair a critical appraisal of evidence with comprehensive clinical judgment at the bedside, where *n* = 1.

Perhaps as important as an understanding of the methods of analysis used in clinical studies is an understanding of the various types of study designs, their defining characteristics, and the strengths and weaknesses of each. The most commonly used designs that involve collection and analysis of original data (“primary studies”) can be broken into 2 subtypes: experimental or intervention studies and observational studies. Yet another class of studies (“secondary studies”) seeks to synthesize information from multiple existing studies of the same research question. For experimental designs, we describe the RCT and other clinical trial designs. For observational studies, we consider cohort studies, case-control studies, cross-sectional studies, case series, and case reports. Finally, we discuss meta-analyses and systematic reviews as types of secondary studies.

Partly because of its desirable properties, the RCT is widely used in critical care medicine. In this design, patients who meet eligibility criteria are randomly assigned to 1 of 2 or more groups, called the arms of the trial. The groups are often defined by the treatment regimen to be received by the subjects in the group. This is an experimental design, meaning that the investigator actively chooses to give, for example, 1 treatment to 1 group of patients and another to a second group. Thus, the RCT is not suited to research questions where it is not practical or ethical to experimentally assign subjects to the exposure (treatment) of interest.

A key feature of the RCT is the *randomization* of subjects to the arms of the trial. The purpose of this randomization is to create groups of subjects that *tend* to be comparable on other prognostic factors related to outcome, thus ensuring a “fair” comparison. We stress that the groups tend to be comparable, because randomization does not guarantee this. For this reason, it is imperative to compare the subjects in each treatment arm with respect to other factors known to be related to outcome in order to either assure that the groups are similar or to account for any dissimilarities in the analysis of the trial data.

An important issue in the design of an RCT is *blinding* or *masking*. Blinding of a study keeps knowledge of the random treatment assignment of each subject hidden from 1 or more parties. In a so-called double-blind study, both the investigators and the study subjects are thus blinded. The purpose of blinding is to minimize the potential for bias in the study results. Bias can often occur in subtle, unintended ways from well-meaning individuals, and so blinding attempts to remove the potential for many types of bias. For some types of interventions (eg, surgery vs medical management), blinding is not feasible. When this occurs, to the extent possible it is advisable to blind the investigators who assess outcomes. This can be achieved by having outcomes assessed by members of the research team who are not involved in clinical care of the patient or administration of the intervention.

Another important factor, specific to experimental studies, and thus RCTs, is *crossover*, and related to that, the *intention-to-treat principle*. In some studies, patients who are randomly assigned to 1 treatment group may, at some point during the trial, end up not receiving that treatment and possibly receiving a treatment associated with one of the other arms of the trial. These subjects are said to have crossed over, and the existence of such crossover complicates the analysis of the trial results. The intention-to-treat principle, which is the accepted method of dealing with crossover when it occurs, dictates that data collected from subjects should be analyzed with the group to which the subject was randomized, irrespective of what treatment the subject actually received. Although this may initially seem counterintuitive, this protects the validity of any observed differences between groups. Crossover cannot be assumed to occur at random, and thus to remove crossover subjects from the group to which they were randomly assigned would compromise the randomization, potentially leading to invalid results. Under the intention-to-treat principle, the crossover may dilute the degree of difference between the groups on whatever outcome measure is employed. But the integrity of any observed difference (that is found despite any such dilution) is intact. At the design phase of a trial, if a fair amount of crossover is expected, investigators may be wise to plan for somewhat larger sample sizes to compensate for the dilution of effect sizes due to crossover.

In observational studies, unlike in studies with an experimental design, the investigator does not decide who has which treatment or exposure but merely observes what happens within each exposure group. As such, extra care must be taken, usually in the analysis, to account for imbalances between the groups on factors related to outcome other than the exposure(s) of interest.

In a cohort design, a group of subjects (the “cohort”) is identified and followed forward over time to observe the rate of occurrence of the outcome(s) of interest. The cohort is typically divided into 2 or more groups based on exposure to a particular risk factor for the outcome (eg, treatment), with those exposure groups being compared at the end of the study with respect to the outcome rate. Most cohort studies are prospective and proceed as described earlier. Occasionally, if reliable exposure data are available from an earlier time, an investigator can assemble the cohort retrospectively, by determining who had exposure at some specific earlier time and comparing outcomes over the time subsequent to that exposure. These latter studies are alternately referred to as retrospective cohort or historical cohort studies.

In the case-control design, groups of subjects are chosen after the outcome of interest has already occurred. A case group of subjects with a particular disease (or other outcome) are selected, and a comparison group of subjects (the controls) without the disease are chosen. Information about past exposures and risk factors of interest is then collected and compared between the cases and controls. If the exposure data rely on self-report from the study subjects, and if case subjects recall their exposures differently from control subjects, these studies can suffer from a resulting bias known as recall bias. Nonetheless, case-control studies can often be a useful way of addressing a question that needs an observational design.

Another type of observational design is the cross-sectional study. This is sometimes simply called a survey. In a cross-sectional design, no time has passed between the assessment of exposure(s) and outcome(s). Rather, both are assessed or measured at the same point in time. As such, a major limitation of such studies is that the temporal relationship between one factor and the other cannot be established. Cross-sectional designs are often employed for large-scale, population-based surveys to look at the health characteristics of a population.

As its name implies, a case report is a description of the characteristics of a single patient. A case series describes a usually small group (series) of patients. These types of studies do not typically involve any data analysis or comparison of subgroups, but merely report on characteristics of interest in the case or series of cases. As such, they are especially useful when something out of the ordinary has occurred that may shed new light on existing knowledge or perhaps even identify a new disease type or syndrome.

In secondary studies, new data are not presented. Rather, existing data from other studies are collected and perhaps reanalyzed to synthesize the body of evidence on a particular topic. All such studies need to consider the possible role of what is known as publication bias, which stems from the belief that studies that find an effect (“positive studies”) are more likely to be published than studies that do not (“negative studies”). If this belief is warranted, a synthesis of only published studies will be biased toward finding an effect.

There are 3 main types of secondary studies: review articles, systematic reviews, and meta-analyses. In a review article, the authors identify as many studies as can be found related to a particular research question (and perhaps related topics). They then present in an organized way the key findings of those studies, and draw conclusions based on critical evaluation of the full body of evidence on the question.

Systematic reviews have largely taken the place of the basic review article. In a systematic review, the investigators must spell out the systematic approach used in identifying the research to be included in the review. Typically, a very wide net is cast and a very large number of studies need to be vetted for inclusion or exclusion from the review.

In a meta-analysis, investigators gather all the studies relevant to a particular question in a similar manner as one would do for a systematic review. They then do a new analysis of the combined set of data from those studies, in order to present an overall estimate of the effect of interest that is more precise due to the much larger number of subjects from all studies combined.

Most clinical studies involve one or more statistical hypothesis tests to address the research question(s) at hand. There are many types of such hypothesis tests. For complex questions or designs, a biostatistician should be consulted, but some of the basic tests used, and issues related to their proper use, are presented here. The types of tests can be grouped as those related to continuous outcomes, categorical or discrete outcomes, and time-to-event (survival) data.

**Continuous outcomes—**Many of the outcomes we measure are quantitative in nature and lend themselves to inference based on continuous distributions. Many of the tests of this type involve comparing the mean (or other measure of central tendency) of 2 or more populations. When only 2 groups are to be compared, a *t-test for 2 independent samples* is often employed. If a comparison group internal to the study is not available, sometimes the mean of a single group is compared to a known population value using a *one-sample t-test*. If the 2 samples of data to be compared are not from 2 independently selected groups, but are either from the same subjects (paired data) or a comparison group especially selected to match the first group on important characteristics (matched data), then a *paired t-test* is often employed. When more than 2 groups are to be compared to test for equality of means, a method called *analysis of variance* (ANOVA) can be used.
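These comparisons of group means can be sketched with SciPy (assuming the `scipy` package is available); the sample values below are invented purely for illustration:

```python
# Comparing group means: two-sample t-test and one-way ANOVA (SciPy).
from scipy import stats

group_a = [5, 6, 7, 8, 9]       # hypothetical measurements, mean 7
group_b = [10, 11, 12, 13, 14]  # hypothetical measurements, mean 12

# Two independent samples: pooled-variance t-test (equal_var=True is the default)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # t = -5.0 for these data; p well below 0.05

# More than 2 groups: one-way ANOVA tests equality of all group means at once
group_c = [9, 10, 11, 12, 13]
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_anova)
```

A paired *t*-test would instead use `stats.ttest_rel` on before/after measurements from the same subjects.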

All of these types of *t*-tests and the ANOVA method make an assumption that the data within each population are (approximately) normally distributed. Minor deviations from this assumption can be tolerated, but when this assumption is not realistic, as is often the case, the analyst has 2 options: (1) the data can be transformed in a way that results in a more symmetric distribution (eg, positively skewed data may benefit from a logarithmic transformation) or (2) a *nonparametric* test may be employed instead of the *t*-test or ANOVA. These nonparametric tests, as their name implies, do not make distributional assumptions about the data. They are insensitive to extreme outliers because they make use of the rank ordering of the observations, not their actual values. So, for example, if the 5 oldest patients in a sample were 102, 64, 63, 59, and 56 years of age, in a nonparametric analysis these values would be converted to ranks of 1, 2, 3, 4, and 5, thus diminishing the influence of the extreme value of 102. Examples of nonparametric tests (not an exhaustive list) include the Mann-Whitney U test or the Wilcoxon rank-sum test in place of the 2-sample *t*-test, the Wilcoxon signed rank test in place of the paired *t*-test, and the Kruskal-Wallis test in place of one-way ANOVA (with Friedman’s test serving as the analog for repeated-measures ANOVA).
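The robustness of rank-based tests to outliers, such as the 102-year-old patient in the example above, can be seen in a short SciPy sketch (the ages are invented):

```python
# Rank-based tests are robust to outliers such as an extreme age of 102.
from scipy.stats import mannwhitneyu

older = [56, 59, 63, 64, 102]   # hypothetical ages, with one extreme value
younger = [40, 42, 45, 47, 50]  # hypothetical comparison group

# The Mann-Whitney U test (equivalent to the Wilcoxon rank-sum test) compares
# the rank distributions of 2 independent samples; the value 102 carries no
# more weight than any other top-ranked observation.
u_stat, p_value = mannwhitneyu(older, younger, alternative='two-sided')
print(u_stat, p_value)
```

Because every value in the first group exceeds every value in the second, the U statistic takes its maximum value (25 here) and the exact two-sided *P* value is the smallest achievable for samples of this size.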

**Categorical outcomes—**When the outcome of interest is a qualitative factor, we say it is a *categorical* variable, meaning that it can only take on a discrete set of possible values. Categorical variables may further be classified as either *nominal* (named categories, such as blood type) or *ordinal* (ordered categories, such as severity of pain if measured as none, mild, moderate, or severe).

Analysis of categorical outcomes often involves using one of a set of tests known collectively as *chi-square tests*. In the *Pearson chi-square test*, 2 or more groups are compared with respect to the proportion exhibiting the outcome. The data are often summarized in a *contingency table or cross-tabulation*, in which subjects are cross-classified by the combination of their exposure group and whether or not they have the outcome. The cross-tabulation presents counts and percentages for each possible combination of the 2 factors, with one factor listed on the rows and the other on the columns. The Pearson chi-square uses a continuous distribution to approximate the behavior of the categorical data; when the number of subjects in 1 or more of the groups (defined by either exposure or outcome) is small, however, the validity of that approximation is questionable, and in that case a *Fisher’s exact test* should be used instead. Most statistical software is programmed to warn users when use of a Fisher’s exact test is preferred. Just as is the case with continuous outcomes, if the groups to be compared are not independently selected but are instead paired or matched, then the method of analysis needs to account for the interdependence within the pairs. In this case, one could use a *McNemar’s test* for paired data. The contingency table is set up differently, however, as the unit of observation becomes the matched pairs instead of the individual subjects. Each pair is cross-classified with respect to the outcome for each of the 2 members of the pair.
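Both tests are available in SciPy; the contingency-table counts below are invented for illustration:

```python
# Chi-square and Fisher's exact tests on 2x2 contingency tables (SciPy).
from scipy.stats import chi2_contingency, fisher_exact

# Rows: exposure groups; columns: outcome present / absent (hypothetical counts)
table = [[30, 10],
         [10, 30]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # large chi-square, small p: the 2 groups differ
print(expected)  # expected counts under the null hypothesis (all 20 here)

# With small expected counts, Fisher's exact test is preferred
sparse = [[2, 8],
          [6, 4]]
odds_ratio_est, p_exact = fisher_exact(sparse)
print(odds_ratio_est, p_exact)
```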

**Time-to-event outcomes (survival analysis)—**When the outcome is an event that occurs over time, one is often interested in both whether subjects experience the event and how soon they experience the event. If the event is one for which all subjects can be observed completely with respect to both the fact and the timing of the event, then the methods described earlier can be employed. Many studies, however, follow subjects for the occurrence of an event where, at the end of the study, some subjects have not experienced the event but have been at risk for that event for a period of time. The time to event for such subjects is said to be *censored*, in that we only partially observe it (ie, we know the time is at least as long as the amount of time they were at risk, but do not know how much longer). This arises often in studies where the event of interest is mortality, as such studies will often not be continued until all subjects have died. As such, the analytic methods for dealing with censored time-to-event data are often referred to as *survival analysis*. The survival (or event-free time) experience of 1 or more groups can be estimated using a variety of methods, including the *Kaplan–Meier* or *product-limit* method. This can be illustrated graphically with a Kaplan–Meier curve (actually a step function) that plots study time on the horizontal axis against survival (event-free) probability on the vertical axis. The survival experience of 2 or more groups can be compared with various methods, including the *log-rank test*.
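The product-limit calculation itself is simple: at each event time, the current survival estimate is multiplied by the fraction of at-risk subjects who survive that time. A minimal pure-Python sketch, with 5 invented subjects, 2 of whom are censored:

```python
# Minimal Kaplan-Meier (product-limit) estimator; a pure-Python sketch.
def kaplan_meier(times, events):
    """times: follow-up time for each subject; events: 1 = event, 0 = censored.
    Returns (time, survival probability) pairs at each event time."""
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)   # still being followed at t
        deaths = sum(1 for ti, ei in zip(times, events)
                     if ti == t and ei == 1)          # events occurring at t
        if deaths > 0:
            survival *= 1 - deaths / at_risk          # product-limit step
            curve.append((t, survival))
    return curve

# Five hypothetical subjects; subjects 3 and 5 are censored (event-free at last follow-up)
curve = kaplan_meier(times=[1, 2, 3, 4, 5], events=[1, 1, 0, 1, 0])
print(curve)  # survival steps down to ~0.8 at t=1, ~0.6 at t=2, ~0.3 at t=4
```

Note how the censored subject at time 3 leaves the risk set without forcing the curve down, which is exactly how censoring is handled: partial follow-up still contributes to the denominators at earlier times.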

**P-values—**Regardless of the type of statistical test used, a hypothesis test is performed by comparing the observed data to what would be expected to be observed under what is termed the *null hypothesis* which, generally speaking, represents the absence of a difference between groups, or an effect, association, etc. If what was observed would be unlikely under the null hypothesis, then that null hypothesis is rejected in favor of the alternative (eg, that there is a difference or an effect). Specifically, a *test statistic* is calculated (it may be a t statistic, a chi-square statistic, or some other) whose probability distribution under the null hypothesis is known. When an “unlikely” value of the test statistic occurs, we reject the null hypothesis. How we define unlikely is determined by the *significance level* of the test, denoted α, which is typically set by convention at 0.05. So, with α = 0.05, if the test statistic falls in the range of values that are the 5% *least* likely to occur under the null hypothesis, we reject that hypothesis. But, in addition to this reject or do not reject dichotomy, our statistical tests produce a measure of the likelihood of our test statistic known as the *P value*. Specifically, the *P* value is the chance of observing a test statistic at least as large (in absolute value) as the one observed *if* the null hypothesis were true. Because the test statistic is a function of the observed data, one can think of the *P* value as the chances of observing as great a degree of difference as was observed if, in fact, the populations represented by the groups do not differ at all. One thing that the *P* value is *not*, though it is often misinterpreted as such, is the probability that a conclusion to reject or not reject is due to chance alone.
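This definition can be made concrete with a permutation test, which simulates the null hypothesis directly by shuffling the group labels; the data are invented for illustration:

```python
# The P value made concrete: a permutation test shuffles group labels to
# simulate the null hypothesis of "no difference," then asks how often a
# difference at least as large as the observed one arises by chance alone.
import random

def permutation_pvalue(x, y, n_perm=5000, seed=1):
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    combined = list(x) + list(y)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(combined)                # break any real group effect
        diff = abs(sum(combined[:len(x)]) / len(x)
                   - sum(combined[len(x):]) / len(y))
        if diff >= observed:                 # "as extreme as what we observed"
            hits += 1
    return hits / n_perm

p = permutation_pvalue([5, 6, 7, 8, 9], [10, 11, 12, 13, 14])
print(p)  # small: so large a difference rarely arises under the null
```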

The aims of a study often include a desire to quantify the frequency with which, or the rate at which, a disease (or other outcome) occurs and/or the association of 1 or more risk factors with the occurrence of disease. For this, we need to have metrics to describe both disease occurrence and the associations between risk factors and outcomes.

The proportion of a particular population that develops a particular disease is called the *risk* of disease. The risk is estimated simply as the number with the disease over the total number at risk of the disease. The proportion developing a disease per unit time is called the *rate* of disease. The numerator for the rate is still the same, but the denominator is no longer a count of the at-risk population, but the *person-time* of observation. To calculate person-time, say in years, the number of years that each subject in the at-risk group was followed is summed to get the total person-years for the rate calculation. The distinction between risks and rates can sometimes be subtle, as a risk is often for a fixed period of time, for example, the 5-year risk of cancer recurrence. We do not often think of the *odds* of disease, but we sometimes need to calculate odds for measures of association (described later). If the probability of an outcome is *P*, then the odds of that outcome are *P*/(1 – *P*), that is, the chance of it happening divided by the chance of it not happening. Note that if an outcome is rare (*P* close to zero), then the chance of it not happening is close to one, and the odds and the risk are almost the same.
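These 3 quantities are simple arithmetic; a sketch with invented numbers:

```python
# Risk, rate, and odds; a sketch with invented counts.
def risk(events, n_at_risk):
    return events / n_at_risk

def rate(events, person_years):
    return events / person_years       # events per person-year

def odds(p):
    return p / (1 - p)                 # chance of happening / chance of not

# 5 events among 100 people, each followed for 2 years (200 person-years)
print(risk(5, 100))        # 0.05 (5% two-year risk)
print(rate(5, 100 * 2))    # 0.025 events per person-year
# Rare outcome: odds and risk nearly coincide
print(odds(0.01))          # ~0.0101, close to the risk of 0.01
# Common outcome: they diverge
print(odds(0.5))           # 1.0, far from 0.5
```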

If 2 groups are being compared with respect to the occurrence of an outcome, they can be compared with either ratio measures or difference measures. We first consider ratio measures. The *relative risk* is simply the risk in the group of interest divided by the risk in the comparison group. If rates rather than risks are being compared, one may instead compute a *rate ratio*. In the context of survival analysis, the rate of the event in one group divided by the rate in the other is called the *hazard (rate) ratio*. For the case-control study design, risk in each comparison group cannot be observed directly, and we instead calculate an *odds ratio*, defined as the odds of exposure in the cases over the odds of exposure in the controls. If the outcome is rare, one can exploit the fact that odds and risk are almost the same and employ the so-called rare-disease assumption to interpret an odds ratio as an estimate of relative risk. When outcomes are not rare, however, the odds ratio tends to overestimate the relative risk (ie, be further away from a null ratio of one).
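The divergence between the odds ratio and the relative risk for common outcomes, and their agreement under the rare-disease assumption, can be seen from 2 invented 2 × 2 tables:

```python
# Relative risk vs odds ratio from a 2x2 table; invented counts.
#                 outcome   no outcome
#   exposed          a           b
#   unexposed        c           d
def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)           # the cross-product ratio

# Common outcome (40% vs 20%): the OR overstates the RR
print(relative_risk(40, 60, 20, 80))   # 2.0
print(odds_ratio(40, 60, 20, 80))      # ~2.67, further from the null of 1
# Rare outcome (2% vs 1%): the OR approximates the RR (rare-disease assumption)
print(relative_risk(2, 98, 1, 99))     # 2.0
print(odds_ratio(2, 98, 1, 99))        # ~2.02, nearly identical
```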

Sometimes the absolute difference in risk or rate between 2 groups is of greater interest than the relative difference. In such cases, a difference measure would be preferred over a ratio measure. The risk difference is simply the risk in one group minus the risk in the other. In clinical studies, where the goal is often to test an experimental treatment to see if it *reduces* risk of an adverse outcome relative to a control treatment, these measures sometimes go by more specific names. The *control event rate* and the *experimental event rate* are simply the rates (or risks) of the outcome in the control group and experimental group, respectively. The difference between these is known as the *absolute risk reduction* (ARR), and is a measure of how much of the event (disease, mortality, etc) could be prevented (or cured) by replacement of the control treatment with the experimental treatment. Another measure of association, related to the ARR and often important in clinical studies, is the *number needed to treat* (NNT). It is defined as NNT = 1/ARR. It is an estimate of how many patients would need to be treated with the experimental treatment in order to prevent 1 adverse outcome (eg, death). So, for example, if a new treatment reduced mortality in a certain patient population from 10% to 4%, the relative risk (of death) would be 0.40 (4%/10%), the ARR would be 6% (10% – 4%), and NNT = 1/0.06 = 16.67, suggesting that 17 patients would need to be treated to prevent 1 death. The *number needed to harm* is an analogous measure in studies where the groups are being compared with respect to a risk factor or exposure that increases, rather than decreases, the risk of the outcome. It is still calculated as 1 divided by the absolute difference in risk, but has an opposite interpretation because the group with the factor of interest is harmed rather than helped.
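The worked example above (mortality falling from 10% to 4%) can be verified in a few lines:

```python
# The worked example from the text: mortality falls from 10% to 4%.
import math

control_event_rate = 0.10       # CER
experimental_event_rate = 0.04  # EER

relative_risk = experimental_event_rate / control_event_rate
arr = control_event_rate - experimental_event_rate  # absolute risk reduction
nnt = 1 / arr                                       # number needed to treat

print(round(relative_risk, 2))  # 0.4
print(round(arr, 2))            # 0.06
print(round(nnt, 2))            # 16.67 -> treat 17 patients to prevent 1 death
print(math.ceil(nnt))           # 17
```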

Any of the measures mentioned earlier can be estimated from a given set of data. Such an estimate is called a *point estimate*. Because it is based on limited sample data, the point estimate will differ from the actual value it is intended to estimate due to sampling variability. To get a better idea of what the true value of a population parameter may be, one may calculate and present a *confidence interval* (CI) around the point estimate. A CI has a confidence *level* associated with it, denoted (1 – α) and usually expressed as a percent, corresponding to a significance level α for a related hypothesis test. Typically, α = 0.05, corresponding to 95% CIs. Even though a CI gives us a better idea of what the true population value is than does the point estimate, any given CI from a single study may or may not contain that true value. But the formulas for calculating 95% CIs are constructed in such a way that, over repeated studies, they will contain the correct value 95% of the time. It is from this that CIs get their name, as we claim 95% *confidence* that the interval will contain the true value.
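As one concrete example among many interval formulas, a 95% CI for a single proportion can be sketched with the simple Wald (normal-approximation) formula; note that this approximation is only reasonable when the sample is not too small and the proportion is not too close to 0 or 1:

```python
# A 95% confidence interval for a proportion using the simple Wald
# (normal-approximation) formula: p +/- 1.96 * sqrt(p(1-p)/n).
import math

def wald_ci(events, n, z=1.96):
    p = events / n
    se = math.sqrt(p * (1 - p) / n)    # standard error of the proportion
    return p - z * se, p + z * se

# 40 of 100 patients had the outcome: point estimate 0.40
low, high = wald_ci(40, 100)
print(round(low, 3), round(high, 3))   # roughly (0.304, 0.496)
```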

Diagnostic and screening tests are used to classify patients with respect to the presence or absence of a disease, syndrome, or other condition. These tests are not always accurate, and their performance is described by various metrics that relate to the correct classification of those with and without disease. To define these measures, consider a population of *N* patients represented in the following “truth table.” The columns of the table represent the patients’ true presence or absence of disease while the rows represent their test result. In practice, when evaluating new tests, we may not know with absolute certainty the true disease status of a set of patients, but instead compare a new test to a “gold standard” that is assumed to represent true disease status (**Table 116–1**).

**Table 116–1.** Cross-classification of test results by true disease status.

| Test Result | Disease | No Disease | Total |
|---|---|---|---|
| Positive | *a* | *b* | *a* + *b* |
| Negative | *c* | *d* | *c* + *d* |
| Total | *a* + *c* | *b* + *d* | *N* |

The *sensitivity* of a test refers to how well it correctly classifies those *with* disease, and is equal to [*a*/(*a* + *c*)], the proportion of those with disease who test positive. The *specificity* of a test refers to how well it correctly classifies those *without* disease, and is equal to [*d*/(*b* + *d*)], the proportion of those without disease who test negative. The *positive predictive value* (PPV) of a test refers to how likely it is that a positive test result indicates the presence of disease, and is equal to [*a*/(*a* + *b*)], the proportion of all those with a positive test result who actually have disease. The *negative predictive value* (NPV) of a test refers to how likely it is that a negative test result indicates the absence of disease, and is equal to [*d*/(*c* + *d*)], the proportion of all those with a negative test result who actually do not have disease.

The sensitivity and specificity of a test are characteristics of the test itself and thus do not change unless something about the test itself is changed (eg, by altering a cutoff for what is considered a positive test result). The PPV and NPV, however, depend not only on how accurate the test is but also on how prevalent the disease is in the population to whom the test is applied. The *prevalence* of a disease is the proportion of the population that has the disease. In our truth table discussed earlier, the prevalence is [(*a* + *c*)/*N*]. In particular, if the prevalence is low in the population being tested, a test with fairly high sensitivity and specificity can still have very poor PPV. In such instances, more specific confirmatory tests may be needed in those who initially test positive.
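The dependence of PPV on prevalence follows from Bayes’ theorem; the sketch below shows the same hypothetical test applied at 2 different prevalences:

```python
# PPV as a function of prevalence (Bayes' theorem): the same test looks
# much worse when the disease is rare in the tested population.
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A fairly accurate hypothetical test: sensitivity 0.90, specificity 0.95
print(round(ppv(0.90, 0.95, 0.20), 2))  # ~0.82 at 20% prevalence
print(round(ppv(0.90, 0.95, 0.01), 2))  # ~0.15 at 1% prevalence
```

At 1% prevalence, roughly 5 of every 6 positive results are false positives despite the test’s high sensitivity and specificity, which is why confirmatory testing is needed in low-prevalence screening.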

The development of comprehensive expertise in statistical methods is beyond the scope of study for most physicians, and only basic methods have been presented here. But the clinician who is equipped with these tools is better prepared to critically appraise the medical literature, where innovations that have the potential to affect clinical practice may appear. The clinical investigator who is equipped with these tools is better prepared to design sound, unbiased studies, and to effectively communicate with coinvestigators (including biostatisticians) in order to make valid inferences from their data. Finally, the physician who can appreciate when the claims of published studies are validly supported by the available data, and when they are not, is better prepared to treat patients in an optimal, evidence-based manner.

1. Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn’t. *BMJ*. 1996;312(7023):71–72. [PubMed: 8555924]