The voiding cystourethrogram (VCUG) is a commonly employed radiographic test in the management of vesicoureteral reflux (VUR). Recently, the reliability of VCUG to accurately grade VUR has been questioned. The purpose of this study is to examine the reliability of the VCUG for the grading of VUR in a setting mimicking daily practice in a busy pediatric hospital.
Materials and methods
Two-hundred consecutive VCUGs were independently graded by two pediatric urologists and two pediatric radiologists according to the International Classification of Vesicoureteral Reflux. A weighted kappa coefficient was calculated to determine inter-rater agreement and a modified McNemar test was performed to assess rater bias. Further assessment for impact on clinical and research decision-making was made for disagreement between grades II and III.
Weighted kappa values, ranging from 0.95 to 0.97, reflect strong reliability of VCUG for grading VUR between and among urologists and radiologists. There was statistically significant bias, with radiologists reporting higher grades. Despite high kappa values, disagreement between raters was not infrequent and was most common for grades II–IV.
VCUG is reliable for grading VUR, but small differences in grading between raters were detected and may play an important role in clinical decision-making and research outcomes.
The voiding cystourethrogram (VCUG) is commonly used in the evaluation of urinary tract infections and in the setting of hydronephrosis to detect vesicoureteral reflux (VUR). The International Classification of Vesicoureteral Reflux (ICVUR) has been the most widely used grading system. Previous work evaluating the use of VCUG for the detection of VUR has focused on the timing of the examination []. However, the reliability of VCUG for accurate diagnosis of VUR grade has been called into question by disagreement between raters in the early stages of the Randomized Intervention for Children with Vesicoureteral Reflux (RIVUR) trial []. Each of the 13 raters graded 28 VCUGs in an artificial research setting, with an average inter-rater reliability of only 0.53.
Understanding the reliability of VCUG is important both for appropriately interpreting research results and for providing the best clinical care. For example, pediatricians depend heavily on radiologists' reports for management and referral decisions, whereas pediatric urologists tend to base treatment on their own interpretation. The purpose of this study is to examine the reliability of the VCUG for the grading of VUR in a setting mimicking daily practice in a busy pediatric hospital. We hypothesized that the ICVUR system for grading VUR has acceptable inter-rater reliability in this setting.
Materials and methods
Two-hundred consecutive VCUGs were independently graded by two fellowship-trained pediatric urologists and two fellowship-trained pediatric radiologists after approval by our Institutional Review Board. Raters were provided with a list of two unique de-identified numbers to query images from our hospital imaging archive in a manner blinded to the patients' clinical history. VUR was graded from 0 to V according to the ICVUR. Raters briefly reviewed this grading system together for 10 min prior to initiation of the study review period. VCUGs were graded in an environment replicating daily imaging review practice. That is, pediatric radiologists reviewed studies on dedicated radiology PACS imaging stations and the pediatric urologists reviewed studies on a standard computer monitor in a clinic setting using the hospital's archival and imaging software.
Because of the higher expected incidence of negative VCUGs, and to provide adequate variability, only one out of four VCUGs with no VUR in either renal unit (based on the grading of the original radiologist) was included. VCUGs ordered primarily for neurogenic bladder, exstrophy, or posterior urethral valve were excluded; patients with other anatomic abnormalities, including ectopic ureterocele and ureteral duplication, were not. In total, 394 VCUGs performed at our institution between November 1, 2011 and March 1, 2012 were screened, and applying these criteria yielded a total of 200 studies. These 200 VCUGs were then evaluated independently by the four raters at their own pace over a six-week period.
Inter-rater reliability among grades 0–V and grades I–V was assessed between a) the two pediatric radiologists, b) the two pediatric urologists, c) the radiologists and urologists, d) the averaged urologists' scores and the original radiologist rater, and e) the averaged radiologists' scores and the original radiologist rater, for a total of ten evaluations. For each comparison, a quadratic weighted kappa coefficient was calculated to determine inter-rater agreement, where values between 0.81 and 1 were considered near-perfect agreement []. The median quadratic weighted kappa coefficient and its 95% CI were estimated from 2000 bootstrap replicates, where the sampling unit was the VCUG in order to account for the correlation between renal images within a VCUG (renal units within a patient are likely more correlated on average because of the degree of reflux and/or the image quality). To compute the averaged scores, averaging was followed by rounding in R using the round() function, which implements the IEC 60559 standard ('go to the even digit') to prevent the rounded scores from exhibiting an upward bias.
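As a concrete illustration of the statistic described above, a minimal from-scratch sketch of the quadratic weighted kappa follows. This is Python rather than the R actually used in the study, and the function name and implementation are ours, shown only to make the weighting scheme explicit.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, k=6):
    """Quadratic weighted kappa for two raters scoring the same renal
    units on an ordinal 0..k-1 scale (here, VUR grades 0-V)."""
    a, b = np.asarray(a), np.asarray(b)
    # Observed joint distribution of the two raters' grades.
    obs = np.zeros((k, k))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    # Expected joint distribution under chance (independent marginals).
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Quadratic disagreement weights: a large gap between grades is
    # penalized far more heavily than a small one.
    idx = np.arange(k)
    w = ((idx[:, None] - idx[None, :]) / (k - 1)) ** 2
    return 1 - (w * obs).sum() / (w * exp).sum()
```

Note that Python's built-in round(), like R's round(), rounds halves to the even digit (round(2.5) == 2), which is the behavior invoked above to keep averaged scores from drifting upward.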
Intra-rater agreement is generally higher than inter-rater agreement, and some authors argue it is unnecessary to assess intra-rater agreement if inter-rater agreement is high []. The study design called for assessment of intra-rater agreement only if inter-rater agreement fell below 0.81.
For the grades I–V analysis, VCUGs were selected that had VUR > 0 according to the original rater. A modified McNemar test was performed to assess for systematic rater bias, where values near or at 0.5 indicated low bias. Finally, we evaluated agreement at the cutoff between grades II and III, as this has been suggested as an important clinical cutoff [].
All statistical analyses were performed in R v. 2.15.0 (http://www.R-project.org/); tests were two-tailed and an alpha level of 0.05 was used to assess significance.
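The clustered bootstrap described in the methods can be sketched as follows. This is an illustrative Python version (the study used R); the data layout, in which each VCUG is a list of per-renal-unit score pairs, and all names are our assumptions.

```python
import random

def cluster_bootstrap_ci(vcugs, statistic, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI in which the resampling unit is the whole
    VCUG, so the correlation between the two renal units of one study is
    preserved.  `vcugs` is a list of studies; each study is a list of
    (rater_a_grade, rater_b_grade) pairs, one per renal unit."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample whole studies with replacement, then flatten to units.
        sample = [rng.choice(vcugs) for _ in vcugs]
        pairs = [pair for study in sample for pair in study]
        stats.append(statistic(pairs))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]
```

With statistic set to a kappa function, this mirrors how the 95% CIs were estimated; resampling renal units individually instead would understate the uncertainty, since the two units of one patient tend to agree.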
Results
Seventy-five percent of VCUGs were in female children, with a median age of 47.9 months (range 0.3–182.7). Original interpretation was performed by one of 14 pediatric radiologists at our institution, with a median of 17 (range 8–33) images available per study. Table 1 shows that almost half of renal units evaluated did not have any VUR according to the original rater. The remaining units had relatively equal proportions of grades I–III, with grades IV and V being less common. Fig. 1 shows that the urologists tended to grade studies more commonly as II, whereas radiologists were more likely to score studies as III–IV. This is further substantiated by a significant McNemar test, indicating bias towards higher grading by radiologists compared to urologists (Table 2).
Table 1. Baseline study characteristics based on the original radiologists' interpretation.
Weighted kappa values reflect strong reliability of VCUG for grading VUR between and among urologists and radiologists (Table 2). When excluding renal units scored as a 0 by the original interpreting radiologist, weighted kappa values remained high (0.93, 95% CI 0.91–0.95). Both the urologists and radiologists exhibited high agreement with the original interpreting radiologist with kappa values of 0.93 for both comparisons. As inter-rater agreement was found to be >0.81, intra-rater agreement was not assessed.
Visual representation of agreement between rater scores is shown in Fig. 2. Closer inspection of scoring reveals important discrepancies in grading clustered around grades II–IV. For example, Plot A represents agreement between the two pediatric urologists. Urologist #1 rated 50 renal units as grade III (13 + 30 + 6 + 1) and Urologist #2 scored 39 renal units as grade III (3 + 30 + 6); they agreed on 30 of the same renal units (highlighted in green). This indicates that in up to 49% of renal units assigned grade III by one of the two urologists, the other disagreed.
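The 49% figure follows directly from the counts above; as a quick check (illustrative Python arithmetic):

```python
# Counts from Fig. 2, Plot A (the two pediatric urologists).
rated_iii_by_1 = 13 + 30 + 6 + 1   # 50 renal units graded III by Urologist #1
rated_iii_by_2 = 3 + 30 + 6        # 39 renal units graded III by Urologist #2
both_agree = 30                    # units both urologists called grade III

# Units assigned grade III by at least one of the two urologists.
either = rated_iii_by_1 + rated_iii_by_2 - both_agree   # 59
print(round(100 * (1 - both_agree / either)))           # prints 49
```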
As differential treatment is suggested by some for grades II or less compared with III or greater [], a more clinically relevant distinction may be how often a VCUG is under-graded by the other reviewer; that is, how often one rater scored a unit as III while the other assigned a lower score. For the urologists, 16 units were graded as II by one rater out of 46 (35%) units graded as III, and 15 out of 51 (29%) for the radiologists. Possible over-grading (how often one rater scored a unit as II while the other assigned a higher score) is also pertinent: for the urologists, 16/71 (23%), and for the radiologists, 15/40 (38%).
When disagreement did occur, the grades assigned generally differed by only one (yellow highlighting, Fig. 2). However, radiologists were more likely to differ by more than one grade (9 units) compared with urologists (1 unit; OR 0.11, p = 0.02, red highlighting).
Discussion
Although no study evaluating reliability can completely replicate daily practice, this evaluation of VCUG, which attempted to reproduce daily clinical practice as closely as possible, shows it to be very reliable for the overall grading of VUR based on high kappa values. This was previously reported by some authors []. In contrast, Metcalfe et al. showed high agreement for grade I (0.98) and acceptable agreement for grade V (0.72), but poor agreement for the other grades. Further, agreement when assessed as an aggregate for all grades ranged between 0.56 and 0.61 [].
There are clear methodological differences between this work and the study by Metcalfe et al. that likely explain the differences in results. First, raters in the present study briefly reviewed the ICVUR schema in a group setting prior to beginning the study, whereas, in the earlier work, raters reviewed a single PowerPoint slide depicting and explaining the grading system. It is possible that this simple intervention might increase rater reliability, and regular collective review of this schema may help maintain reliability for research protocols that depend heavily on reliable VCUG grading. Next, the sheer volume of VCUGs reviewed in this work (200 studies per rater, compared with 28 in the previous work) may have an impact on rater reliability. Finally, we sought to replicate daily practice for reviewing VCUGs. Metcalfe et al. displayed images copied into PowerPoint, which has the potential to limit display resolution, operator manipulation of images, and the number of images available for review. This work utilized the hospital's imaging software and archival system with standard imaging display equipment, allowing each interpreter to review the entire series of images captured during the VCUG procedure and manipulate the images as needed to arrive at the appropriate grade.
Despite high kappa values, disagreement between raters in this study was not an uncommon event with differences being generally by only a single grade (see Fig. 2). However, scoring discrepancies were more common with certain grades (II–IV). It is in this critical range of grades that variation in grading may result in important alterations to clinical decision-making and to interpretation of research results that rely on the VCUG grade for patient stratification.
One potential explanation for the high kappa values despite some disagreement in grades II–IV is that half of the renal units assessed did not have VUR, and agreement among all raters on grade 0 VUR was extremely high. Yet, when we repeated the weighted kappa measures excluding studies where the original radiologist scored the renal unit as 0, kappa remained high (see Table 2).
Understanding the nature of the quadratic weighted kappa coefficient compared with absolute agreement is also important in understanding this apparent discrepancy. The quadratic weighted kappa coefficient is more strongly negatively affected by large discrepancies in grading than by small ones. That is, kappa is much lower (closer to 0) when one rater assigns grade I and another assigns grade V compared with the situation when one assigns I and the other II. In this study, wide discrepancies that would bring the kappa down were very uncommon. Differences in grading were not uncommon, but differences were generally small. As a result, these instances of disagreement did not have a large impact on kappa.
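The effect described above can be made concrete. With six categories (grades 0–V), the weight attached to a disagreement of d grades is (d/5)^2, so a one-grade gap carries one sixteenth the penalty of a four-grade gap. A sketch (the helper name is ours):

```python
def disagreement_weight(d, k=6):
    """Quadratic penalty for a gap of d grades on a k-point ordinal scale."""
    return (d / (k - 1)) ** 2

# One-grade gap (e.g., grade I vs II): penalty ~0.04.
# Four-grade gap (e.g., grade I vs V): penalty ~0.64, i.e. 16 times larger.
```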
Recently, some have argued to initially limit treatment and follow-up imaging for those diagnosed with grade I or II VUR []. If this approach is followed, differences in diagnosis at this cutoff are critical for determining treatment. When treatment is based on the grading of a single rater, under-grading and possible under-treatment would occur in 29%–35% and over-grading and possible over-treatment might occur in 23%–38%. If this cutoff is truly of clinical importance, one possible strategy to reduce inappropriately assigned treatment would be to have a second rater review any VCUG scored as grade II or grade III and agree on the appropriate grade. This is similar to the process used to assign grades at entry in the RIVUR trial [], but limits review to those studies at highest risk of clinically meaningful disagreement rather than requiring independent review of every VCUG.
Finally, we found statistically significant bias with both the study and the original radiologists tending to grade VUR higher than the urologists. As primary care providers likely depend on the radiologists' written report, we speculate that this higher grading tendency may have the added benefit of a more prompt referral to a specialist.
Conclusions
This work confirmed that VCUG is reliable for grading VUR when evaluated in a setting parallel to clinical practice. However, disagreement when grading VUR based on VCUG by experienced raters does occur and has the potential to meaningfully affect treatment outcomes. Combined review of images by multiple reviewers, as occurs for study entry in the RIVUR trial, has the potential to reduce this impact and may play a role in both research protocols and clinical practice.
This investigation was supported by the University of Utah Study Design and Biostatistics Center, with funding in part from the National Center for Research Resources and the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant 8UL1TR000105 (formerly UL1RR025764).
Conflict of interest