Introduction

In the Western elderly population acute colonic diverticulitis (ACD) is a common disease of the gastro-intestinal tract. The prevalence of diverticulosis, the underlying pathological condition, ranges from 10% in people under 40 years to as high as 60% in people older than 80 years. Approximately 10% to 20% of affected people will develop one or more episodes of ACD [1, 2].

A widely shared view nowadays is that imaging is mandatory in the initial assessment of patients suspected of ACD [3–5] to cope with clinical misdiagnosis rates, the discrepancy between clinical presentation and the extent of ACD, and the possibility that other diseases mimicking ACD are missed.

Ultrasonography (US) and computed tomography (CT) are used in daily practice to complement clinical assessment and physical examination in diagnosing ACD. Those in favour of US stress its lower costs, wider availability, and the lack of radiation exposure and use of contrast material. CT imaging is embraced by others because they claim CT is less operator dependent than US in obtaining high diagnostic accuracy, generates fewer inconclusive results, and is able to assist in surgical planning when intervention is needed [2, 5, 6].

Reported sensitivities and specificities vary, both for US and CT [5, 7]. A systematic review of prospective studies may be able to summarise the diagnostic accuracy of both investigations, providing estimates with greater precision. Ideally, this analysis would include only studies investigating the diagnostic accuracy of US and CT in the same population (head-to-head comparison). Since such comparative studies are scarce, we performed a systematic review and meta-analysis of prospective comparative studies, as well as prospective studies investigating US or CT separately.

Methods

Search strategy and study eligibility

We performed a literature search to identify studies investigating the diagnostic accuracy of US and CT in human subjects suspected of ACD. We searched MEDLINE and EMBASE databases for papers published between January 1966 and January 2007, using the following keywords: [“Diverticulitis”(MeSH) OR “Diverticulitis, Colonic”(MeSH)] AND [“Radiography”(MeSH) OR “radiography”(Subheading) OR “Radiography, Thoracic”(MeSH) OR “Radiography, Abdominal”(MeSH) OR “Tomography, X-Ray Computed”(MeSH) OR “Tomography Scanners, X-Ray Computed”(MeSH) OR “Tomography, Spiral Computed”(MeSH) OR “Ultrasonography”(MeSH) OR “ultrasonography”(Subheading)].

The CINAHL database was also searched for relevant studies with the following keywords: [diverticulitis (MeSH) and (Ultrasonography (MeSH) or Echography (MeSH) or Radiography (MeSH) or Computed tomography (MeSH) or Computer-Assisted Tomography (MeSH))]. The Cochrane Database of Systematic Reviews was searched with the following words: Diverticulitis AND (ultrasonography OR computed tomography).

Studies were eligible if they addressed the diagnostic accuracy of US, CT, or both, in patients with suspected ACD. No age, date or language restrictions were applied. If studies were judged potentially eligible, full-text versions of the papers in which they had been reported were retrieved. We crosschecked the references.

Study selection

Two reviewers (WL and AvR) independently evaluated the obtained literature for relevance. Studies were included if they met the following criteria: (1) prospective (data collection) study design; (2) CT and/or US criteria for the presence of diverticulitis were given; (3) graded compression US was performed; (4) reference standard was defined; (5) diverticulitis was located in the large bowel; (6) the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) was reported or could be extracted from the study report.

Study and patient characteristics

Two reviewers independently evaluated the included studies and extracted the data for each included study. Disagreement between the reviewers was solved by discussion among all authors. Data on study design and patient group, technical specifications, and diagnostic accuracy of CT and US were collected using a standardised case record form for each included study.

Study design characteristics

The ‘Quality assessment for diagnostic accuracy studies’ (QUADAS) tool was used for evaluation of study quality [8]. In addition to the inclusion criteria the following characteristics were recorded: (1) department of the first author; (2) design of the study (single- or multicentre); (3) description of patient population, including sample size, age, male-female distribution, the prevalence of ACD and complicated ACD, and study setting (hospitalised patients, outpatients or in-hospital referrals to the radiology department); (4) if US and CT results were independently obtained (head-to-head comparative studies); (5) if US and CT were interpreted independently from clinical information; (6) if a description of US and CT criteria for the presence of acute diverticulitis was given; (7) experience of observers; (8) time interval between US and CT (head-to-head comparative studies); (9) time interval between imaging and reference standard; (10) if interpretation of reference standard was done without information on US and CT findings; (11) if the execution of the reference standard was described. If multiple reference standards had been used, we tried to extract data on the number of patients undergoing each standard and the selection criteria.

US characteristics

Recorded were, if available: (1) type of probe; (2) frequency of probe; (3) type of scanning (conventional grey scale, pulsed, colour or power Doppler, graded compression); (4) criteria for the presence of ACD.

CT characteristics

The following CT features, if available, were recorded: (1) type of scanner [non-helical, helical (single- or multislice CT)]; (2) slice thickness or collimation used; (3) use of contrast agents (oral, intravenous and/or rectal contrast) and, if so, the amount; (4) criteria for the presence of ACD.

Data synthesis and analysis

We constructed a 2 × 2 contingency table for US and CT compared to the reference standard. From these raw data we calculated sensitivity as TP/(FN + TP) and specificity as TN/(FP + TN) for each modality in every included study. Individual study sensitivity and specificity results were plotted in a forest plot and in receiver-operating characteristic (ROC) space to explore inter-study heterogeneity in test performance. The Cochran Q-test and the I² statistic were used to detect and quantify heterogeneity. The Q-test examines the null hypothesis that the results of the investigated studies are homogeneous. A statistically significant result of the Q-test, with a p-value less than 0.05, was assumed to indicate substantial heterogeneity. For quantification of heterogeneity the I² statistic with 95% confidence intervals was used. I² is a measure of inconsistency describing the percentage of total variation across studies that is due to heterogeneity rather than chance. This statistic is a percentage, with larger percentages indicating more heterogeneity [9, 10].
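The per-study calculations described above can be sketched as follows. This is a minimal illustration, not the analysis code used for this review: the function names, the example counts, the 0.5 continuity correction and the choice to compute Q on the logit scale with inverse-variance weights are all assumptions made for the example.

```python
import math

def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP/(FN + TP), specificity = TN/(FP + TN)."""
    return tp / (fn + tp), tn / (fp + tn)

def logit_with_variance(events, total):
    """Logit of a proportion and its approximate variance.
    Assumption: a 0.5 continuity correction guards against zero cells."""
    a = events + 0.5
    b = total - events + 0.5
    return math.log(a / b), 1.0 / a + 1.0 / b

def cochran_q_i2(estimates, variances):
    """Cochran's Q, its degrees of freedom, and I² (in percent).
    Assumption: inverse-variance (fixed effect) weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2

# Hypothetical (TP, FN) counts for three studies, for illustration only.
studies = [(45, 5), (30, 10), (58, 2)]
logits, variances = zip(*(logit_with_variance(tp, tp + fn) for tp, fn in studies))
q, df, i2 = cochran_q_i2(logits, variances)
print(f"Q = {q:.2f} on {df} df, I2 = {i2:.0f}%")
```

Q would then be referred to a chi-square distribution with k − 1 degrees of freedom (k being the number of studies) to obtain the p-value; that step is omitted here because it needs a statistics library.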

Several statistical models (random, fixed, or mixed effects models) are available when performing a meta-analysis. The Akaike Information Criterion value [11], a global measure of goodness of fit of a statistical model, was used to compare the fit of each available model. It showed that the bivariate random effects model had the best fit and this model was therefore used for meta-analysis. The bivariate random effects model [12, 13] produces a weighted average of sensitivity and specificity (also called the mean summary estimates of sensitivity and specificity) with corresponding confidence intervals based on the individual study results. In the bivariate random effects model, the logit-transformed sensitivities and logit-transformed specificities are assumed to follow a bivariate normal distribution across studies around a mean logit-sensitivity and mean logit-specificity. The mean logit-sensitivity and mean logit-specificity and the corresponding standard errors were used to obtain the summary estimates of sensitivity and specificity with corresponding confidence intervals after antilogit transformation. The summary estimates of sensitivity and specificity were used to calculate the positive summary likelihood ratio (LR+) as sensitivity/(1 − specificity) and the negative summary likelihood ratio (LR-) as (1 − sensitivity)/specificity. Summary likelihood ratios were calculated with corresponding confidence intervals for each imaging modality. The LR+ is the ratio of the proportion of patients with ACD who have a positive test result to the proportion of patients without ACD who have a positive test result. A diagnostic test with a LR+ of 10 and a LR- of 0.01 is generally considered as a test with good diagnostic performance.
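The back-transformation and likelihood ratio formulas above can be illustrated with a short sketch. It does not fit the bivariate model itself (that was done in SAS); it only shows the arithmetic applied to the model output. The function names are hypothetical, and the 1.96 multiplier assumes a normal approximation on the logit scale.

```python
import math

def antilogit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))

def summary_from_logit(mean_logit, se, z=1.96):
    """Point estimate and 95% CI on the proportion scale,
    given a mean logit and its standard error."""
    return (antilogit(mean_logit),
            antilogit(mean_logit - z * se),
            antilogit(mean_logit + z * se))

def likelihood_ratios(sensitivity, specificity):
    """LR+ = sensitivity/(1 - specificity), LR- = (1 - sensitivity)/specificity.
    Undefined when specificity is exactly 0 or 1."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

# Rounded summary estimates, used here only to show the arithmetic.
lr_pos, lr_neg = likelihood_ratios(0.92, 0.90)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Rounded inputs give likelihood ratios that differ slightly from values computed from unrounded model estimates, so small discrepancies with published figures are to be expected.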

Identification of alternative diagnoses

The number of patients with an alternative diagnosis in each study was recorded. The number of these alternative diagnoses that were detected by US and/or CT was used to calculate and compare the sensitivity of both investigations for the identification of alternative diagnoses.

Head-to-head comparative studies

Because of their superior methodology, we highlight the results of the head-to-head comparative studies in our results. From these studies, data regarding the ability of both investigations to detect the same radiological abnormalities were extracted. The agreement between US and CT findings, for example for the detection of peri-colic fat inflammation, was expressed as the kappa statistic. According to Landis and Koch [14] kappa (κ) values can be divided into the following levels of agreement: κ < 0.20 poor agreement, κ = 0.21–0.40 fair agreement, κ = 0.41–0.60 moderate agreement, κ = 0.61–0.80 good agreement, and κ = 0.81–1.00 excellent agreement.
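As an illustration of the agreement measure described above, the sketch below computes Cohen's kappa for a binary finding from a 2 × 2 cross-tabulation of US against CT and maps it to the agreement levels listed in the text. The function names and the example counts are hypothetical, and treating a value of exactly 0.20 as "poor" is an assumption, since the bands in the text leave that value unassigned.

```python
def cohen_kappa(both_pos, us_only, ct_only, both_neg):
    """Cohen's kappa for two binary ratings of the same patients.
    Undefined (division by zero) when expected agreement equals 1."""
    n = both_pos + us_only + ct_only + both_neg
    observed = (both_pos + both_neg) / n
    us_pos = (both_pos + us_only) / n
    ct_pos = (both_pos + ct_only) / n
    expected = us_pos * ct_pos + (1 - us_pos) * (1 - ct_pos)
    return (observed - expected) / (1 - expected)

def agreement_level(kappa):
    """Agreement labels as listed in the text (after Landis and Koch)."""
    if kappa <= 0.20:  # assumption: 0.20 itself is counted as poor
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"

# Hypothetical counts: finding seen on both, US only, CT only, neither.
k = cohen_kappa(40, 5, 5, 50)
print(f"kappa = {k:.2f} ({agreement_level(k)} agreement)")
```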

Analyses were performed in SAS 9.1 (SAS Institute Inc., Cary, NC) and Microsoft Excel 5.0 (Microsoft Office, Bellevue, WA). The estimates of US and CT were compared by means of a Z-test for unpaired data. In all statistical tests a p-value lower than 0.05 was assumed to indicate statistically significant differences.
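A Z-test for two independent estimates can be sketched as below. The scale on which the comparison was made is not stated in the text; comparing on the logit scale and recovering standard errors from published 95% confidence intervals are both assumptions of this example, and the function names are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def se_from_ci(lower, upper, z=1.96):
    """Approximate standard error on the logit scale from a 95% CI
    (assumes the interval is symmetric on that scale)."""
    return (logit(upper) - logit(lower)) / (2.0 * z)

def z_test_unpaired(est1, se1, est2, se2):
    """Two-sided Z-test for the difference between independent estimates."""
    z = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Summary sensitivities as reported in the Results: US 92% (80% to 97%),
# CT 94% (87% to 97%).
z, p = z_test_unpaired(logit(0.94), se_from_ci(0.87, 0.97),
                       logit(0.92), se_from_ci(0.80, 0.97))
print(f"z = {z:.2f}, p = {p:.2f}")
```

Because the inputs are rounded and the scale is assumed, the resulting p-value should be read as a rough check rather than a reproduction of the reported one.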

Results

Search strategy and study selection

The initial search resulted in 1,689 articles, of which 26 were judged potentially eligible. Full-text versions of these articles were used for further selection. Fourteen of the 26 potentially relevant studies had to be excluded because they either did not acquire data prospectively (n = 6) [15–20], the study report did not allow for the extraction of data to calculate test accuracy (n = 4) [21–24], they were earlier publications (double publications) on the same clinical trial (n = 2) [25, 26], they had performed an additional transrectal US, which prohibited extraction of data on the diagnostic accuracy of abdominal US as a separate entity (n = 1) [27], or they did not perform graded compression US (n = 1) [28]. Twelve studies met all inclusion criteria (Fig. 1).

Fig. 1 Flow chart of search strategy and study selection

Study characteristics

Study quality assessment using the QUADAS tool revealed a number of methodological shortcomings. Although all studies investigated patients with suspected ACD, specific inclusion criteria were defined in only 50% of studies (QUADAS question 2). The time interval between execution of the reference standard and the index tests was unclear in nearly all studies (QUADAS question 4). A vague description of the execution of the reference standard was given in slightly more than half of the studies (QUADAS question 8b). Some studies, for example, did not report on the length of follow-up in conservatively treated patients. Three US studies [29–31] defined the reference standard only for the patients with ACD as reference diagnosis (QUADAS question 5). The methods of two of these three studies report that for the verification of the test results all available clinical data, laboratory and radiological investigations, and operative and histology reports were used. However, in their results they do not provide the type of verification for patients with a final diagnosis other than diverticulitis. In all studies the index test results were incorporated in the reference standard (QUADAS question 9b). Only one study [31] reported its inconclusive test results (QUADAS question 11). Severe abdominal pain, too much bowel gas, and too much abdominal fat were the reasons given for six inconclusive US results. For QUADAS questions 3, 4 and 10 the underreporting of methodological details resulted in a high percentage of “unclear” responses. Figure 2 shows the responses to each question of the QUADAS tool. Based on this assessment we conclude that the overall study quality is moderate (but not poor) and that methodological study details were often underreported. In the appendix the QUADAS items are reported in detail.

Fig. 2 The QUADAS scores of the included studies are summed up per item and presented in a bar chart

Other study design characteristics

All studies were single-centre studies initiated by the departments of radiology (n = 8), surgery (n = 3), or internal medicine (n = 1). Two studies were methodologically superior to the other studies since they had performed a head-to-head comparative study of US and CT [32, 33], while the other ten studies investigated the diagnostic accuracy of one investigation only (4 US, 6 CT) [28–31, 34–40]. Study characteristics are presented in Table 1.

Table 1 Study characteristics

The mean age of the patients in the included studies was 61 years for US and 63 years for CT. The total number of included patients was 630 for US and 684 for CT. Prevalence of ACD varied in both US and CT studies, ranging from 36% to 68%. The mean prevalence of complicated ACD was not significantly different, 22% in US studies and 24% in CT studies. Other patient population characteristics are presented in Table 2.

Table 2 Patient characteristics

Different types of reference standards were applied, including surgery combined with histopathology, clinical follow-up, and other diagnostic investigations, such as barium enema and endoscopy. The number of patients undergoing each type of reference standard is summarised in Table 1. Most patients (n = 555) had been treated conservatively, and clinical follow-up was applied as the reference standard in these patients. A smaller number of patients (n = 358) underwent surgery or colonoscopy, either during the acute phase or electively, and had a histopathologically confirmed reference diagnosis. In the study by Verbanck et al. the reference standard was a barium enema in 74% (n = 43) of patients with the final diagnosis of ACD.

We found differences in the positivity threshold of US and CT for the presence of ACD between studies. For example, bowel wall thickening with peri-colic fat inflammation was considered diagnostic for ACD in five studies, but was judged as a negative test result in four other studies if the presence of diverticula was not additionally visualised. The diagnostic criteria for the presence of ACD as well as other characteristics of US and CT are presented in Tables 3, 4 and 5. Multi-slice helical CT was used in four CT studies, single-slice helical CT in one and conventional CT in three. The use of contrast agents, slice thickness and interval differed among studies. Intravenous and rectal contrast was administered in the majority of CT studies.

Table 3 US features in the included studies
Table 4 CT features in the included studies
Table 5 Test accuracy of US studies

The two head-to-head comparative studies in our meta-analysis performed US and CT blinded to each other's results and within 24 h of each other. Similar US and CT criteria for the presence of ACD were used.

Sensitivity and specificity

Inter-study heterogeneity in diagnostic performance is shown in the ROC plot (Fig. 3). Individual sensitivities and specificities of US and CT with corresponding confidence intervals, together with the results of the Q-test and the I² statistic, are presented in Figs. 4 and 5. Mean summary sensitivity estimates of US and CT were not significantly different: 92% (95% CI: 80% to 97%) for US versus 94% (95% CI: 87% to 97%) for CT (p = 0.65). Mean summary specificity estimates were 90% (95% CI: 82% to 95%) for US and 99% (95% CI: 90% to 100%) for CT and not significantly different (p = 0.07).

Fig. 3 Sensitivity and 1 − specificity for US results (open square) and for CT results (closed triangle) per included study are shown in a ROC plot. For the two head-to-head comparative studies the open squares and closed triangles represent the individual results of the head-to-head comparative studies and are connected by a line. This is visible for only one pair; the other paired data points are located in the upper left corner in the ROC plot

Fig. 4 Individual sensitivities and the summary sensitivity estimates of US and CT studies are shown with their corresponding 95% confidence intervals. Heterogeneity between study results is presented in the footnotes

Fig. 5 Individual specificities and the summary specificity estimates of US and CT studies are shown with their corresponding 95% confidence intervals. Heterogeneity between study results is presented in the footnotes

Using the Q-test, we found significant heterogeneity in both the sensitivities and the specificities of US and CT studies. The I² for the sensitivities of US was 57% (95% CI: 0% to 83%) and for the specificities 64% (95% CI: 13% to 85%). In Figs. 4 and 5 one outlying sensitivity value of CT [37] and two outlying specificity values [32, 36] responsible for the heterogeneity are easy to identify. The low sensitivity of CT reported by Stefansson was possibly due to a diagnostic laparoscopy rate of 38% in the patients with diverticulitis, as these laparoscopies revealed false-negative CT results. These diagnostic laparoscopies were performed routinely as part of the study in the second half of the study period. Removal of the outliers reduced the I² percentage from 76% (95% CI: 52% to 88%) to 0% (95% CI: 0% to 79%) for the sensitivities of the CT studies and from 73% (95% CI: 44% to 87%) to 0% (95% CI: 0% to 79%) for the specificities of the CT studies. Disregarding the outliers resulted in a summary sensitivity estimate for CT of 96% (95% CI: 92% to 98%) and a summary specificity estimate of 99% (95% CI: 97% to 100%). In other words, the observed heterogeneity in CT results had no significant influence on the summary estimates of CT accuracy.

Although the Verbanck study (ref) used barium enema as a reference standard, it did not result in an outlying sensitivity and specificity of US. Excluding this study from the meta-analysis would not significantly change the summary estimates and would result in a summary sensitivity estimate for US of 93% (95% CI: 79% to 98%) and summary specificity for US of 92% (95% CI: 88% to 95%).

Likelihood ratio

Calculated summary LR+ were 9.63 (95% CI: 4.98 to 18.62) for US and 78.41 (95% CI: 8.70 to 706.58) for CT (p = 0.07). Calculated summary LR- were 0.09 (95% CI: 0.04 to 0.23) for US and 0.06 (95% CI: 0.03 to 0.13) for CT (p = 0.53).

Identification of alternative diseases

Eight of the 12 studies reported on the sensitivity for the identification of alternative diseases. This sensitivity ranged between 33% and 78% for the US studies and between 50% and 100% for the CT studies. Table 6 presents the sensitivities for alternative diseases for the US and CT studies.

Table 6 The sensitivity for the detection of alternative disease for the US and CT studies

Head-to-head comparative studies

Although the head-to-head comparative studies did not report a significant difference between the accuracy of US and CT, there was a difference in their individual accuracy results. Farag Soliman et al. [33] reported higher sensitivities (100% for US and 98% for CT) and specificities (100% for both US and CT) compared to the sensitivities (85% for US and 91% for CT) and specificities (84% for US and 77% for CT) of Pradel et al. [32]. The study by Farag Soliman et al. included only hospitalised patients, in contrast to the study by Pradel et al. in which all patients with suspected ACD referred for US or CT were included. The percentage of complicated ACD was 47% in the study by Farag Soliman et al. compared to 27% in the study by Pradel et al. The difference in clinical setting and spectrum of disease could be the cause of differences in reported accuracy values. Pradel et al. report good kappa agreement between US and CT findings. Kappa agreement was good for depicting peri-colic fat inflammation (κ = 0.78), good for depicting bowel wall thickening (κ = 0.69), and good for depicting peri-colic abscesses (κ = 0.69).

Discussion

In this systematic review we found that diagnostic studies of US and CT in patients suspected of ACD are of moderate quality. No significant differences in the diagnostic accuracy of US and CT in diagnosing ACD were found. Calculated sensitivities, specificities, positive and negative likelihood ratios were all higher for CT, but none of these differences were significant. The range of the sensitivities for the identification of alternative diagnoses was higher for CT than US, suggesting that CT is more accurate for detecting alternative diagnoses.

Although the two head-to-head comparative studies found different accuracy values for US and CT, they both concluded that the accuracy of US and CT was not significantly different. These two studies with the best methodological design, providing approximately 20% of our study population, support the result of our overall meta-analysis [32, 33].

Heterogeneous results, reported by studies investigating the same effect, can lead to inaccurate and irrelevant summary point estimates when pooled for meta-analysis. For this reason we explored heterogeneity in the US and CT study results. In the US studies the heterogeneity was slightly above 50%, which can be considered as moderate heterogeneity, allowing the pooling of these results [9]. Although we found heterogeneity in CT study results [32, 36, 37], resulting in high I² values, exploring this heterogeneity showed that it did not influence the summary estimates of CT significantly.

The studies investigating the diagnostic accuracy of US and CT in ACD were susceptible to bias because they applied differential verification. With US and CT results being part of the reference standards, an incorporation bias could lead to over-estimation of the diagnostic accuracy. Unlike histopathology, clinical follow-up is a reference standard open to subjective interpretation and is less likely to identify the correct reference diagnosis. Using a reference standard that is open to subjective interpretation can enhance the effect of over-estimation [41]. For example, a patient with a positive US or CT for ACD who is successfully treated conservatively will most likely receive the reference diagnosis acute diverticulitis. However, the underlying illness may not be ACD, since other illnesses may present with similar symptoms and also resolve completely when treated conservatively. This can lead to an underestimation of false-positive test results and therefore give an over-estimation of the accuracy. Since this is a problem related to the reference standard, it will likely have a similar effect on US and CT.

The diagnostic accuracy of US is thought to be more dependent on observer experience than CT. This is sometimes used as an argument to discourage the usage of US. Surprisingly, the study by Zielke [31] shows that 11 surgeons in training with at least 3 months of US experience achieved equal diagnostic accuracy compared to studies using experienced observers. This does not prove that the accuracy of US is not observer dependent for abdominal pathology, but an intensive training of several months in abdominal US could be sufficient to make an accurate diagnosis of ACD on US.

A limitation of our study is that the meta-analysis is mostly based on unpaired data. Ideally the diagnostic accuracy of two competing tests is investigated in the same patient population. Meta-analysis of such head-to-head studies would be able to estimate and compare the diagnostic accuracy of tests with even greater validity and precision. Only two head-to-head comparative studies were performed. In the unpaired data, between-study heterogeneity, i.e., between US and CT studies, cannot be avoided. In our results the between-study heterogeneity and methodological shortcomings in regard to patient selection, reference standard, experience of observer, imaging technique, and test interpretation are clearly presented. Facing heterogeneity and methodological shortcomings is almost inevitable when performing a meta-analysis of diagnostic accuracy studies. We reported and explored this in our results. For example, we reported on heterogeneity in patient selection, but exploration showed that the prevalence of ACD and complicated ACD nevertheless was comparable between US and CT populations. Complete reporting of all study characteristics and methodological shortcomings facilitates valid interpretation of the result of our meta-analysis.

Reporting proper methodological details in diagnostic studies is a known problem. Without these details on methodology, results of studies reporting on diagnostic performance are hard to interpret. This error in reporting was also detected in some of the included studies during quality assessment using the QUADAS tool. Attempts are made to improve methodological reporting of diagnostic test accuracy studies with the Standards for Reporting of Diagnostic Accuracy (STARD) [42]. The STARD initiative provides a checklist with items that should be included in the report of a study of diagnostic accuracy.

We tried to minimise bias in our meta-analysis by using two independent reviewers for data extraction, using specified inclusion criteria, and exploring heterogeneity between and within studies.

Two surveys conducted among surgeons from the UK and the USA [3, 4] showed that the daily use of diagnostics in patients suspected of ACD varied significantly. Of the questioned surgeons from the UK who found imaging necessary at initial assessment, 42% favoured CT and 33% favoured US. In contrast, two-thirds of the questioned surgeons from the USA favoured CT and less than 7% favoured US. So where US is used as a competitive initial diagnostic test in the UK, it seems that in the USA less value is attributed to the diagnostic potential of US. This is illustrated by the appropriateness criteria for imaging in patients with left lower quadrant pain of the American College of Radiology [5], which state that CT is more appropriate than US, especially in older patients with a typical presentation of ACD. This preference for CT in the USA is reflected in this meta-analysis. All US studies in this meta-analysis were European, while CT studies originated both from the USA and Europe. In countries with a high prevalence of obesity physicians will favour CT, since the use of US is impractical in obese patients. With US being less frequently used in the USA, the performance of US by radiologists from the USA for the diagnosis of ACD might be lower. The preference for CT of many physicians is also based on the fact that CT is often regarded as a more credible test than US for the exclusion and identification of alternative diagnoses. The range of sensitivity for the identification of alternative diagnoses for CT and US shows that this is probably true. Unfortunately, the included studies did not provide data that made it possible to compare the ability of US and CT to exclude alternative diagnoses.

The use of magnetic resonance colonography for the diagnosis of ACD was investigated by Ajaj et al. [43]. This feasibility study reported a promising sensitivity of 86% and specificity of 92%. Although magnetic resonance colonography is not yet routinely applied in the acute setting in patients suspected of ACD in daily practice, the accuracy results seem promising and feasibility of this modality in the diagnostic work-up of these patients deserves attention.

The practice parameters from 2006 by the American Society of Colon and Rectal Surgery [44] also advocate the use of CT in diagnosing ACD. They state that US can sometimes be useful to differentiate between a phlegmon and an abscess in ACD, but that US findings are often obscured by overlying bowel loops. Our study recorded only few inconclusive US investigations. Graded compression US possibly reduces the number of inconclusive findings due to overlying bowel loops. The two head-to-head comparative studies we included both used the same US and CT criteria for the presence of ACD. Their results show that next to differentiating between a phlegmon and an abscess US can accurately measure bowel wall thickness, show peri-colic fat inflammation and detect complications. Kappa agreement between US and CT findings was good for the above mentioned imaging features [32].

This analysis of 16 years of published literature comprises roughly the same amount of data for US and CT and provides detailed information on between-study heterogeneity in both US and CT studies. In conclusion, diagnostic accuracy studies of US and CT in patients with suspected ACD are of moderate quality and there is a need for new methodologically solid studies. Our meta-analysis found no significant difference between the diagnostic accuracy of US and CT in diagnosing ACD. The best available evidence shows that both US and CT can be used as an initial diagnostic tool in the assessment of patients suspected of having ACD. However, in severely ill patients presenting with abdominal pain the use of CT is probably more suitable, as CT images are better able than US to assist in planning of a radiological or surgical intervention, and CT images, in contrast to US, can be re-read at any time by any specialist involved in the treatment of severely ill patients. Moreover, reviewed data indicate that CT is more accurate for detecting alternative diagnoses than US.