Description of diagnostic rule and impact of misclassification error
Authors
Javier Mancilla Galindo, junior researcher
Dr. Lützen Portengen, supervisor
Dr. Susan Peters, supervisor
Published
October 25, 2024
Summary
Introduction: A diagnostic prediction model to rule out pneumoconiosis in Dutch construction workers (surveyed in 1998) was developed and published in 2007. The diagnostic rule identifies workers at high risk of pneumoconiosis, who are referred for medical examination and diagnostic imaging with chest X-ray (CXR). Recently, concerns have been raised about the poor diagnostic performance of CXR compared with high-resolution computed tomography (HRCT) for the diagnosis of silicosis, especially for detecting early cases. With the ultimate aim of informing whether the diagnostic prediction rule should be incorporated into a health surveillance program for silicosis, this work provides an overview of the diagnostic prediction rule and estimates the extent, impact, and potential implications of outcome misclassification arising from its use.
Methods: Data were simulated to replicate the summary characteristics and outcome probability of the original diagnostic rule development study. A total of 5000 samples were used to estimate the potential impact of outcome misclassification on the diagnostic rule's accuracy, using combinations of false positive and false negative rates (FPR and FNR) of CXR (index test) against HRCT (reference test) to estimate the adjusted area under the curve (AUC) under the assumption of non-differential outcome misclassification. Simulated observations were categorized into low (<5 points) and high (>=5 points) risk categories according to the pneumoconiosis diagnostic rule scoring system. The true outcome that would have been observed had HRCT been performed was obtained with a reverse-misclassification function using diagnostic performance estimates. Two scenarios are presented, using combinations of sensitivity and specificity of CXR against HRCT drawn within the uncertainty boundaries of their 95% confidence intervals (95%CI), with a negative correlation between them of -0.7 or -0.8.
Results: Incorporating the uncertainty in the diagnostic performance of CXR against HRCT, and assuming that outcome misclassification is non-differential, reveals that the AUC of the diagnostic prediction rule is underestimated when CXR is used (absolute difference: 0.039 to 0.04). Using a cut-off score of 5 points, 17.35% (224 out of 1291) of workers are identified as being in the high-risk category. On average, outcome misclassification leads to one fewer case detected with HRCT than with CXR in the high-risk group (median prevalence 7.73% with CXR vs 6.98% with HRCT), whereas its impact is greater in the low-risk category, with 19 cases detected with CXR and 40-41 with HRCT (median prevalence 1.76% vs 3.78%). However, scenarios vary widely, reflecting the uncertainty in the diagnostic performance of CXR against HRCT.
Discussion: The diagnostic prediction rule for silicosis has been used in practice with a higher threshold (5 points) than the optimal one (3.75 points in the original study), likely to minimize the number of workers who undergo diagnostic imaging studies. With this threshold, 17.35% of participants are identified as high-risk individuals who should undergo screening. Incorporating current knowledge of, and uncertainty in, the diagnostic performance of CXR against HRCT reveals that outcome misclassification has a greater impact on workers in the low-risk category than on those classified as high-risk. Scenarios with differential misclassification remain to be explored, as do scenarios with different disease prevalences or varying cut-off points.
Background
A diagnostic prediction model to rule out pneumoconiosis in construction workers was developed and published in 2007.1 The study population consisted of Dutch natural stone construction workers aged 30 years and older. Lexces partners are currently designing a health surveillance program (HSP) for respiratory occupational diseases, including silicosis. The diagnostic prediction rule could be incorporated into the HSP to determine which workers exposed to silica dust should undergo further diagnostic workup for silicosis. However, concerns have been raised that the prediction rule may not detect early cases of silicosis. Thus, the objective of this work is to provide an overview of the diagnostic prediction rule and to estimate the extent, impact, and implications of outcome misclassification with its use.
Outcome
To develop the prediction rule, the diagnosis of pneumoconiosis was defined as a chest X-ray (CXR) indicative of pneumoconiosis (ILO profusion category >=1/1), for which the 2000 version of the ILO International Classification of Radiographs of Pneumoconioses was used; the most up-to-date version of this guideline is the 2022 revised edition.2 The ILO score is assigned upon examination of small opacities on CXR, in comparison with standardized CXR images. Possible values are integers between 0 and 3, assigned first as a major category and then as a subcategory (see Box 1 for a simple example). For instance, a score of 1/0 means that 1 was assigned as the major category, while 0 (the subcategory) was strongly considered as the alternative. Conversely, a score of 0/1 means that the radiologist assigned 0 as the major category, but strongly considered 1 as suitable. A score of 1/1 means that the CXR is consistent with the standard CXR graded as 1 in the ILO classification.
As mentioned earlier, an ILO score >=1/1 was considered the reference standard for pneumoconiosis when developing the diagnostic prediction rule.1 This contrasts with standard recommendations at the time, which stated that an ILO category 1/0 or higher should be considered consistent with the presence of pneumoconiosis.3 This decision was made under the rationale that a 1/0 cut-off could lead to greater misclassification, resulting in more unnecessary chest X-rays. Out of the 1291 workers included for analysis, a total of 37 (2.9%) had a score >=1/1, whereas 131 (10.1%) were graded >=1/0.
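Because the major categories and subcategories form an ordered scale, thresholds such as >=1/1 or >=1/0 can be encoded programmatically. A minimal R sketch, assuming the standard 12-point ordering of ILO profusion scores (object and function names are illustrative):

```r
# Standard 12-point ordering of ILO profusion scores (lowest to highest)
ilo_scale <- c("0/-", "0/0", "0/1", "1/0", "1/1", "1/2",
               "2/1", "2/2", "2/3", "3/2", "3/3", "3/+")

# TRUE when a score is at or above the chosen threshold on that ordering
ilo_at_or_above <- function(score, threshold = "1/1") {
  match(score, ilo_scale) >= match(threshold, ilo_scale)
}

ilo_at_or_above(c("0/1", "1/0", "1/1", "2/1"))       # FALSE FALSE TRUE TRUE
ilo_at_or_above(c("0/1", "1/0"), threshold = "1/0")  # FALSE TRUE
```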
Of note, three different radiologists examined each CXR and provided a score. Radiologists were blinded to patient characteristics, except for the fact that all participants worked in the construction industry. The median score was used for analysis.
Predictors
Lung function, measured with a pneumotachometer on the same day as the CXR, and worker questionnaire variables were assessed as potential predictors of pneumoconiosis. Seven candidate predictors were identified in univariable analysis:
Age
Smoking status
Job title
Time working in the construction industry
Feeling unhealthy
Cumulative exposure to silica index
Standardized residual FEV1
Continuous variables were dichotomized and modeled separately, as continuous and as binary. Since there was no difference in AUC between a prediction model with continuous versus binary predictors, the binary versions were kept to simplify use of the diagnostic rule.
The final model included six predictors:
| Predictor | Value | Score | Beta |
|---|---|---|---|
| Age | >=40 years | 1.0 | 0.72 |
| Smoking habit | Current smoker | 1.0 | 0.70 |
| Job title | High-exposure job title | 1.5 | 1.14 |
| Work duration in construction industry | >=15 years | 1.5 | 1.00 |
| Self-rated health | Feeling unhealthy | 1.25 | 0.84 |
| Standardized residual FEV1 | <= -1.0 | 1.25 | 0.91 |
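Expressed in code, the scoring system is simply a weighted sum of the six binary predictors. A minimal R sketch (function and argument names are illustrative, not taken from the original study):

```r
# Diagnostic rule score: weighted sum of six binary predictors.
# Each argument is TRUE/FALSE indicating whether the predictor applies.
rule_score <- function(age_40plus, current_smoker, high_exposure_job,
                       work_15plus_years, feels_unhealthy, low_fev1) {
  1.00 * age_40plus +
  1.00 * current_smoker +
  1.50 * high_exposure_job +
  1.50 * work_15plus_years +
  1.25 * feels_unhealthy +
  1.25 * low_fev1  # standardized residual FEV1 <= -1.0
}

# Example: a current smoker aged >=40 with >=15 years in a high-exposure job
rule_score(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)  # 5 points -> high risk
```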
The uncorrected AUC of the model was 0.81 (95%CI: 0.75 to 0.86). The corrected AUC was 0.76.
Model Validation
In the original Suarthana study,1 the prediction model was only internally validated. A formal external validation procedure, as currently recommended by the TRIPOD+AI guidelines,4 was not performed.
To identify studies reporting use of the diagnostic prediction rule, as well as any subsequent external validation studies, the citations of the diagnostic rule development study were retrieved from Google Scholar on 10/09/2024 and screened by title and abstract. Google Scholar was chosen due to its wide coverage of literature sources. A total of 59 records citing the paper were found. In comparison, other databases retrieved fewer results: PubMed-MEDLINE (n = 11), Web of Science (n = 22), Scopus (n = 32), Semantic Scholar (n = 34), and Dimensions (n = 26). All documents were reviewed, including those in other languages, for which automatic translations were obtained, to screen for any calculations of the probability of silicosis according to the diagnostic prediction rule. Out of the 59 records citing the paper, 5 studies5–9 reported having used the diagnostic prediction rule to calculate workers' risk of pneumoconiosis. These studies are summarized in the following subheadings:
In a case series of 6 young stonemasons from the UK who were diagnosed with silicosis after undergoing high-resolution computed tomography (HRCT) (three of them with progressive massive fibrosis), the diagnostic rule was applied and all 6 cases had a predicted probability of silicosis of 0%.5 None of these 6 cases would have been referred for further chest X-ray investigation based solely on the diagnostic prediction rule score.
A subset of 180 participants enrolled in the study used for the development of the diagnostic prediction rule was invited for further examination with chest HRCT, of whom 79 ultimately underwent HRCT.6 Invited participants were intended to be representative of the different risk score categories of the diagnostic prediction rule. A definite diagnosis of silicosis was not made. The study reports HRCT findings for different ILO thresholds (0/0, 1/0, and >=1/1), agreement on individual HRCT features between radiologists, and associations between the cumulative silica exposure index and HRCT findings, controlling for smoking.
In participants with a normal CXR (ILO 0/0), only 34.9% had a normal HRCT. In these participants, findings suggestive of silicosis, such as well-defined round opacities (8%) and parietal pleural abnormalities (24%), were frequent on HRCT. Emphysema was also frequent (41%), as were irregular and/or linear opacities (22%).
This was a case-control study in which construction workers at high risk of silicosis based on the diagnostic prediction rule (score of 5 or higher) were invited to undergo diagnostic workup, including chest CT, pulmonary function testing, and medical examination by a pulmonologist. A total of 398 workers out of 42,150 (0.9%) were in the high-risk category and invited to participate. The proportion of high-risk participants was lower than in the original Suarthana paper, possibly because the ARBOUW database includes a large fraction of administrative workers rather than only construction workers. Ultimately, 54 participated as cases (high-risk), whereas controls were patients from a cancer screening cohort. The study reports micronodules found on chest CT.
This is a congress abstract which also used the ARBOUW database to identify high-risk participants, with a threshold of 5 points in the diagnostic prediction rule. Out of 75,000 employees, 1123 (1.5%) were high-risk participants. A total of 295 workers ultimately participated and underwent chest CT. Silicosis was found in 64 workers (22%), of whom 37 (13%) were in an early stage.
This is a congress abstract reporting the use of the diagnostic prediction rule to identify high-risk workers (cut-off: 5 points) in a ceramic tile production plant. Out of 353 employees, 52 (15%) were in the high-risk category and underwent chest CT. Silicosis was found in 8 workers (17%).
Cut-off points of the diagnostic prediction rule
In the original study, a cut-off point of 3.75 was suggested as optimal, with the following classification measures:
|        | CXR + | CXR - | Total |
|--------|-------|-------|-------|
| Rule + | 33    | 534   | 567   |
| Rule - | 4     | 720   | 724   |
| Total  | 37    | 1254  | 1291  |

Sensitivity: 89.2%
Specificity: 57.4%
Negative Predictive Value: 99.4%
Positive Predictive Value: 5.8%
Nonetheless, a higher cut-off point of 5 has been used in practice.7–9 The summary data for this exact cut-off point are not provided in the original diagnostic rule paper, so the cut-off point of 5.25 is used here to give an impression of the classification properties reported in the original study (note that these may differ from the actual diagnostic performance characteristics):
|        | CXR + | CXR - | Total |
|--------|-------|-------|-------|
| Rule + | 13    | 106   | 119   |
| Rule - | 24    | 1148  | 1172  |
| Total  | 37    | 1254  | 1291  |

Sensitivity: 35.1%
Specificity: 91.5%
Negative Predictive Value: 98.0%
Positive Predictive Value: 10.9%
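The classification measures for both cut-off points follow directly from the 2x2 cells and can be reproduced with a few lines of R (cell labels are ours):

```r
# Classification measures from a 2x2 table of rule result vs CXR result:
# tp = rule+/CXR+, fp = rule+/CXR-, fn = rule-/CXR+, tn = rule-/CXR-
classification_measures <- function(tp, fp, fn, tn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn))
}

# Cut-off 3.75:
round(100 * classification_measures(tp = 33, fp = 534, fn = 4, tn = 720), 1)
#> sensitivity specificity         ppv         npv
#>        89.2        57.4         5.8        99.4

# Cut-off 5.25:
round(100 * classification_measures(tp = 13, fp = 106, fn = 24, tn = 1148), 1)
#> sensitivity specificity         ppv         npv
#>        35.1        91.5        10.9        98.0
```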
The decision to use a higher cut-off point than the optimal one is likely due to the large proportion of individuals who would need to undergo CXR with a cut-off of 3.75 (43.9%) versus 5.25 (9.2%).
Using the summary data reported by Hoy et al.10 for different ILO scores, a 2x2 table can be recreated for the 1/1 ILO threshold:
|       | HRCT + | HRCT - | Total |
|-------|--------|--------|-------|
| CXR + | 23     | 2      | 25    |
| CXR - | 17     | 68     | 85    |
| Total | 40     | 70     | 110   |

Sensitivity (%): 57.5 (95%CI: 41.0 - 72.6)
Specificity (%): 97.1 (95%CI: 89.1 - 99.5)
False Positive Rate (%): 2.9 (95%CI: 0.5 - 10.9)
False Negative Rate (%): 42.5 (95%CI: 27.4 - 59.0)
Positive Predictive Value (%): 92
Negative Predictive Value (%): 80
Likelihood Ratio (+): 20.12
Likelihood Ratio (-): 0.44
Accuracy (%): 82.7
Diagnostic Odds Ratio: 46
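The point estimates above follow directly from the table cells; the confidence intervals can be approximated with exact binomial intervals, although these need not match the reported intervals exactly, since the CI method used for the values above is not restated here:

```r
# Exact (Clopper-Pearson) 95%CIs for proportions from the 2x2 table
ci_pct <- function(x, n) round(100 * as.numeric(binom.test(x, n)$conf.int), 1)

ci_pct(23, 40)  # sensitivity 23/40; compare with the reported 41.0-72.6
ci_pct(68, 70)  # specificity 68/70; compare with the reported 89.1-99.5
```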
Accounting for misclassification error
ROC curve analysis of prediction models can be corrected for misclassification error in binary outcomes, provided that the disease prevalence and misclassification rates are known.11 Zawistowski et al.11 simulate the value of the true outcome and then introduce different misclassification rates to understand the impact of misclassification on the prediction model's AUC.
Non-differential misclassification
In the case of the diagnostic prediction rule, we do not know the value of the true outcome, which would have been determined with HRCT. Instead, the diagnostic prediction rule used CXR as the reference test, which means that only the value of the misclassified outcome is known. Zawistowski's11 procedure can be adapted to obtain the reverse-misclassified outcome instead, using the information from Hoy et al.10 to estimate what the diagnostic rule AUC would have been had HRCT been used instead of CXR. The original functions, as well as the adapted reverse-misclassification function, are found in a separate script that is sourced into this document.
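That script is sourced rather than displayed. As a rough illustration only (not the sourced functions), a reverse-misclassification step under non-differential error can draw the unobserved HRCT outcome from its conditional probability given the observed CXR outcome, with the true prevalence recovered via the Rogan-Gladen correction:

```r
# Illustrative sketch of reverse misclassification (non-differential error).
# y_obs: observed CXR-based outcome (0/1); se, sp: sensitivity and
# specificity of CXR against HRCT.
reverse_misclassify <- function(y_obs, se, sp) {
  # Rogan-Gladen estimate of the true (HRCT) prevalence
  p_obs  <- mean(y_obs)
  p_true <- (p_obs + sp - 1) / (se + sp - 1)
  p_true <- min(max(p_true, 0), 1)  # keep within [0, 1]
  # Bayes: P(true case | observed positive), P(true case | observed negative)
  p_pos <- se * p_true / (se * p_true + (1 - sp) * (1 - p_true))
  p_neg <- (1 - se) * p_true / ((1 - se) * p_true + sp * (1 - p_true))
  # Draw the unobserved true outcome for each worker
  rbinom(length(y_obs), 1, ifelse(y_obs == 1, p_pos, p_neg))
}

# e.g., y_hrct <- reverse_misclassify(y_cxr, se = 0.575, sp = 0.971)
```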
Data are simulated with a sample size of 1291 participants to replicate samples of the same size as the original diagnostic rule development study, using the summary data reported in the paper and assigning the outcome based on the outcome probability from the diagnostic rule equation. A total of 5000 samples are drawn to estimate the potential impact of misclassification from the diagnostic prediction rule. Furthermore, scores for every simulated participant are calculated with the diagnostic prediction rule scoring system, and a cut-off value of 5 is used to classify them as high risk (>=5 points) or low risk (<5 points) of silicosis, since this is the cut-off value that has been used in practice.8,9
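A sketch of this data-generating step is shown below; the predictor prevalences and the model intercept are illustrative placeholders, not the values used in the actual analysis:

```r
# One simulated sample of n = 1291 workers (placeholder parameters)
set.seed(2024)
n <- 1291
d <- data.frame(
  age_40plus  = rbinom(n, 1, 0.50),  # placeholder predictor prevalences
  smoker      = rbinom(n, 1, 0.40),
  high_exp    = rbinom(n, 1, 0.30),
  work_15plus = rbinom(n, 1, 0.50),
  unhealthy   = rbinom(n, 1, 0.20),
  low_fev1    = rbinom(n, 1, 0.15)
)

# Outcome probability from the rule equation: published betas, placeholder intercept
lp <- -5.0 + 0.72 * d$age_40plus + 0.70 * d$smoker + 1.14 * d$high_exp +
  1.00 * d$work_15plus + 0.84 * d$unhealthy + 0.91 * d$low_fev1
d$silicosis_cxr <- rbinom(n, 1, plogis(lp))

# Diagnostic rule score and the cut-off used in practice
d$score <- 1.0 * d$age_40plus + 1.0 * d$smoker + 1.5 * d$high_exp +
  1.5 * d$work_15plus + 1.25 * d$unhealthy + 1.25 * d$low_fev1
d$risk <- ifelse(d$score >= 5, "high", "low")
```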
Scenario 1
Sensitivity and specificity confidence intervals from Hoy et al.10 are used, simulating combinations of Sn and Sp from a bivariate normal distribution with a correlation of -0.8 between them.
Sensitivity 95%CI: 41.0 - 72.6
Specificity 95%CI: 89.1 - 99.5
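One way to draw such correlated (Sn, Sp) pairs is from a bivariate normal distribution whose means are the point estimates and whose standard deviations are backed out of the 95%CI widths; this symmetric-normal parameterization is an approximation (the exact intervals are asymmetric) and may differ from the one used in the analysis:

```r
library(MASS)  # for mvrnorm (attached in this session)

# Means and approximate SDs from the reported point estimates and 95%CIs
sn_mu <- 0.575; sn_sd <- (0.726 - 0.410) / (2 * qnorm(0.975))
sp_mu <- 0.971; sp_sd <- (0.995 - 0.891) / (2 * qnorm(0.975))
rho   <- -0.8  # assumed Sn-Sp correlation (-0.7 in Scenario 2)

Sigma <- matrix(c(sn_sd^2,             rho * sn_sd * sp_sd,
                  rho * sn_sd * sp_sd, sp_sd^2), nrow = 2)
snsp  <- mvrnorm(n = 5000, mu = c(sn_mu, sp_mu), Sigma = Sigma)
snsp  <- pmin(pmax(snsp, 0), 1)  # truncate occasional draws outside [0, 1]
colnames(snsp) <- c("sn", "sp")  # one (Sn, Sp) pair per simulated sample
```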
The distribution of low-risk and high-risk participants across the simulated samples is as follows:

| Risk | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| Low (<5 points) | 1067 | 1058 | 1076 | 1019 | 1115 |
| High (>=5 points) | 224 | 215 | 233 | 176 | 272 |
The following table shows the distribution of outcome occurrence:
| Characteristic | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| Silicosis (CXR) | 36 | 32 | 40 | 18 | 61 |
| Silicosis (HRCT) | 56 | 32 | 81 | 6 | 213 |
| Silicosis (CXR), high-risk | 17 | 15 | 20 | 4 | 36 |
| Silicosis (CXR), low-risk | 19 | 16 | 22 | 5 | 35 |
| Silicosis (HRCT), high-risk | 16 | 11 | 21 | 2 | 47 |
| Silicosis (HRCT), low-risk | 40 | 20 | 61 | 1 | 168 |
Prevalence (%) of silicosis in the high-risk and low-risk groups:

| Characteristic | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| Prevalence (CXR), high-risk | 7.73 | 6.57 | 8.93 | 1.87 | 14.81 |
| Prevalence (CXR), low-risk | 1.76 | 1.50 | 2.04 | 0.47 | 3.30 |
| Prevalence (HRCT), high-risk | 6.98 | 4.82 | 9.29 | 0.81 | 21.13 |
| Prevalence (HRCT), low-risk | 3.78 | 1.89 | 5.74 | 0.09 | 16.05 |
[Figures: silicosis prevalence distributions in the high-risk and low-risk groups]
ROC curve analysis
AUC of the diagnostic rule across the simulated samples:

| Outcome | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| CXR | 0.742 | 0.717 | 0.768 | 0.578 | 0.871 |
| CXR-corrected | 0.782 | 0.752 | 0.807 | 0.603 | 0.894 |
Absolute difference in median AUC (CXR-corrected vs CXR): 0.04
Scenario 2
Sensitivity and specificity confidence intervals from Hoy et al.10 are used, simulating combinations of Sn and Sp from a bivariate normal distribution with a correlation of -0.7 between them.
Sensitivity 95%CI: 41.0 - 72.6
Specificity 95%CI: 89.1 - 99.5
The distribution of low-risk and high-risk participants across the simulated samples is as follows:

| Risk | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| Low (<5 points) | 1067 | 1058 | 1076 | 1019 | 1112 |
| High (>=5 points) | 224 | 215 | 233 | 179 | 272 |
The following table shows the distribution of outcome occurrence:
| Characteristic | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| Silicosis (CXR) | 36 | 32 | 40 | 16 | 60 |
| Silicosis (HRCT) | 57 | 33 | 82 | 6 | 219 |
| Silicosis (CXR), high-risk | 17 | 15 | 20 | 6 | 34 |
| Silicosis (CXR), low-risk | 19 | 16 | 22 | 6 | 35 |
| Silicosis (HRCT), high-risk | 16 | 11 | 21 | 0 | 57 |
| Silicosis (HRCT), low-risk | 41 | 21 | 61 | 1 | 178 |
Prevalence (%) of silicosis in the high-risk and low-risk groups:

| Characteristic | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| Prevalence (CXR), high-risk | 7.76 | 6.54 | 8.98 | 2.65 | 15.38 |
| Prevalence (CXR), low-risk | 1.77 | 1.50 | 2.04 | 0.56 | 3.31 |
| Prevalence (HRCT), high-risk | 7.04 | 4.98 | 9.38 | 0.00 | 24.36 |
| Prevalence (HRCT), low-risk | 3.86 | 1.98 | 5.75 | 0.10 | 16.75 |
[Figures: silicosis prevalence distributions in the high-risk and low-risk groups]
ROC curve analysis
AUC of the diagnostic rule across the simulated samples:

| Outcome | Median | P25 | P75 | Min | Max |
|---|---|---|---|---|---|
| CXR | 0.742 | 0.716 | 0.768 | 0.578 | 0.873 |
| CXR-corrected | 0.781 | 0.752 | 0.807 | 0.607 | 0.895 |
Absolute difference in median AUC (CXR-corrected vs CXR): 0.039
Differential outcome misclassification
The prior analyses assumed that outcome misclassification is non-differential. However, differential outcome misclassification is conceivable. The sources and mechanisms of differential misclassification are summarized in a mind-map (link to resource - in progress). Here, the focus is on how the main candidate predictors of the diagnostic prediction model could have led to differential outcome misclassification through a mechanism that systematically increases the FPR with the probability of being a case and/or increases the FNR with the probability of being a control, as these are the two mechanisms that could have led to AUC overestimation in the original diagnostic rule development study. Only age and smoking are thought to potentially lead to differential outcome misclassification through plausible mechanisms, because radiologists were blinded to participant characteristics, thereby blocking the sources of differential outcome misclassification for the other predictors.
… Work in progress …
Extended Data
Box 1. Understanding the ILO chest X-ray classification scheme
The ILO CXR classification scheme may be unintuitive at first. An analogy with a daily-life situation can simplify its understanding. Suppose that a radiologist goes to the supermarket to buy chocolate and finds 4 options on the shelf:
Sweet chocolate (30% cocoa) = ILO 0
Semi-sweet chocolate (50% cocoa) = ILO 1
Semi-dark chocolate (70% cocoa) = ILO 2
Dark chocolate (95% cocoa) = ILO 3
Radiologist number 1 (R1) has a hard time deciding between 30% (ILO 0) and 50% (ILO 1) cocoa, but does not even consider buying a 70% (ILO 2) or 95% (ILO 3) cocoa bar. In the end, R1 picks the 50% cocoa bar. Thus, the final score is 1/0 because they paid for semi-sweet chocolate (ILO 1), but strongly considered sweet chocolate (ILO 0) as the alternative.
On the contrary, radiologist number 2 (R2) is convinced that semi-sweet chocolate (ILO 1) is the right choice as soon as they see the shelf, and does not even consider other options. Thus, the final score for R2 is 1/1.
Session and package dependencies
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Netherlands.utf8 LC_CTYPE=Dutch_Netherlands.utf8
[3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.utf8
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] report_0.5.9 gt_0.11.0 MASS_7.3-60.2 lubridate_1.9.3
[5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0 pacman_0.5.1
Package References
Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.
Iannone R, Cheng J, Schloerke B, Hughes E, Lauer A, Seo J, Brevoort K, Roy O (2024). gt: Easily Create Presentation-Ready Display Tables. R package version 0.11.0, https://CRAN.R-project.org/package=gt.
Makowski D, Lüdecke D, Patil I, Thériault R, Ben-Shachar M, Wiernik B (2023). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://easystats.github.io/report/.
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York. ISBN 0-387-95457-0, https://www.stats.ox.ac.uk/pub/MASS4/.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.
1. Suarthana E, Moons KGM, Heederik D, Meijer E. A simple diagnostic model for ruling out pneumoconiosis among construction workers. Occupational and Environmental Medicine. 2007;64(9):595-601. doi:10.1136/oem.2006.027904
2. International Labour Organization. Guidelines for the use of the ILO International Classification of Radiographs of Pneumoconioses, revised edition 2022. Geneva: International Labour Office; 2022.
3. Wagner GR, World Health Organization. Screening and surveillance of workers exposed to mineral dust. Geneva: World Health Organization; 1996.
4. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi:10.1136/bmj-2023-078378
5. Nicol LM, McFarlane PA, Hirani N, Reid PT. Six cases of silicosis: implications for health surveillance of stonemasons. Occupational Medicine. 2015;65(3):220-225. doi:10.1093/occmed/kqu209
6. Meijer E, Tjoe Nij E, Kraus T, et al. Pneumoconiosis and emphysema in construction workers: results of HRCT and lung function findings. Occupational and Environmental Medicine. 2011;68(7):542-546. doi:10.1136/oem.2010.055616
7. Mets OM, Rooyackers J, van Amelsvoort-van de Vorst S, Mali WP, de Jong PA, Prokop M. Increased micronodule counts are more common in occupationally silica dust-exposed smokers than in control smokers. Journal of Occupational and Environmental Medicine. Published online 2012. doi:10.1097/JOM.0b013e31824e6784
8. Rooijackers JM, Stigter E, Niederer M, et al. Silicosis in Dutch construction workers. European Respiratory Journal. 2016;48(suppl 60):OA458. doi:10.1183/13993003.congress-2016.OA458
10. Hoy RF, Jones C, Newbigin K, et al. Chest x-ray has low sensitivity to detect silicosis in artificial stone benchtop industry workers. Respirology. 2024;29(9):785-794. doi:10.1111/resp.14755
11. Zawistowski M, Sussman JB, Hofer TP, Bentley D, Hayward RA, Wiitala WL. Corrected ROC analysis for misclassified binary outcomes. Statistics in Medicine. 2017;36(13):2148-2160. doi:10.1002/sim.7260