Silicosis Diagnostic Rule

Description of diagnostic rule and impact of misclassification error

Authors

Javier Mancilla Galindo, junior researcher

Dr. Lützen Portengen, supervisor

Dr. Susan Peters, supervisor

Published

February 14, 2025

Summary

Introduction: A diagnostic prediction model was developed to rule out pneumoconiosis in Dutch construction workers (surveyed in 1998) and published in 2007. The diagnostic rule identifies workers at high risk of pneumoconiosis who are referred for medical examination and diagnostic imaging with chest X-ray (CXR). Recently, concerns have been raised about the poor diagnostic performance of CXR compared to high resolution computed tomography (HRCT) for the diagnosis of silicosis, especially for detecting early cases. With the ultimate intention of recommending whether the diagnostic prediction rule should be incorporated into a health surveillance program for silicosis, this work provides an overview of the diagnostic prediction rule and estimates the extent, impact, and potential implications of outcome misclassification from its use.

Methods: A literature review was conducted to identify studies applying the diagnostic prediction rule. A systematic review of the diagnostic performance of CXR (ILO 1/0) and HRCT was identified and the studies included were reviewed to reconstruct 2x2 tables for the ILO 1/1 cut-off used in the diagnostic prediction rule. The diagnostic performance of the CXR ILO 1/1 cut-off (index test) against HRCT (reference test) was estimated with a bivariate generalized linear mixed meta-analysis. Subsequently, data were simulated to replicate the summary characteristics and outcome probability of the original diagnostic rule development study. A total of 5000 samples were used to estimate the potential impact of outcome misclassification on the diagnostic rule’s accuracy, by using combinations of false positive and false negative rates (FPR and FNR) of CXR compared to HRCT to estimate the adjusted area under the curve (AUC), assuming non-differential outcome misclassification. Simulated observations were categorized into low (<5 points) and high (>=5 points) risk categories according to the pneumoconiosis diagnostic rule scoring system in scenario 1, adding a medium-risk category (3.75 to 4.99 points) in scenario 2. The true outcome that would have been observed had HRCT been performed was obtained with a reverse-misclassification function using pooled diagnostic performance estimates, by sampling sensitivity from a beta distribution and calculating the specificity trade-off from the sampled sensitivity and the pooled diagnostic odds ratio (DOR).

Results: Four studies were included in the meta-analysis, resulting in a pooled sensitivity of CXR (ILO >=1/1) against HRCT of 53.7% (95%CI: 30.1-75.8) and specificity 98.6% (95%CI: 94.9-99.6). The expected true prevalence of silicosis in the original diagnostic rule development study was 4.0% (n=51/1291) after correcting for non-differential misclassification (2.7%, n=36/1291 with CXR). Using a cut-off of 5 points in the prediction rule resulted in 17.35% (224 out of 1291) of workers identified as being in the high-risk category. Simulation scenarios show that the average impact of non-differential misclassification is minimal in participants categorized as high-risk by the diagnostic prediction rule, but substantial in participants categorized as low-risk. The prevalence in low-risk participants with CXR is 1.76%. In comparison, the true prevalence observed with HRCT can be twice as large in the average scenario (3.4%) and more than ten times as large (19.7%) in scenarios with high misclassification rates. Using the current cut-off of 5 points for low and high risk stratification translates into detecting less than one third of cases (n=14), while 36 cases in the low-risk group (70% of all cases) remain undetected. The introduction of an additional cut-off of 3.75 points (recommended as optimal in the original diagnostic rule development study) to define an additional medium-risk category would reduce the number of undetected cases to 22 (43%) at the expense of performing a significantly larger number of worker health examinations potentially including HRCT (567 compared to 224). Lastly, the AUC of the diagnostic prediction rule is underestimated when using CXR (absolute difference: 0.04), assuming that misclassification is truly non-differential. However, scenarios are widely variable, reflecting the uncertainties in the diagnostic performance of CXR against HRCT.

Discussion: The diagnostic prediction rule for silicosis has been used with a higher threshold (5 points) than optimal (3.75 points in the original study), likely due to the need to minimize the number of workers who undergo diagnostic imaging studies. Under that threshold, 17.35% of participants are identified as high-risk individuals to undergo screening. Incorporating current knowledge and uncertainties of the diagnostic performance of CXR against HRCT reveals that outcome misclassification has a greater impact in workers classified as low-risk than in those classified as high-risk. Scenarios in which differential misclassification exists remain to be explored, as well as those with different disease prevalence or further varying cut-off points.

Background

A diagnostic prediction model to rule out pneumoconiosis in construction workers was developed and published in 2007.¹ The study population consisted of Dutch natural stone construction workers age 30 years and older surveyed in 1998. Lexces partners are currently designing a health surveillance program (HSP) for respiratory occupational diseases, including silicosis. The diagnostic prediction rule could be incorporated into the HSP to determine which workers exposed to silica dust should undergo further diagnostic workup for silicosis. However, concerns have been raised about the prediction rule not detecting early cases of silicosis. Thus, the objective of this work is to provide an overview of the diagnostic prediction rule and to estimate the extent, impact, and implications of outcome misclassification with its use.

Outcome

The diagnosis of pneumoconiosis was used to develop the diagnostic rule, defined as a chest X-ray (CXR) indicative of pneumoconiosis (ILO profusion category >=1/1), for which the 2000 version of the ILO international classification of radiographs of pneumoconioses was used. The most up-to-date version of this guideline is the 2022 revised edition.² The ILO score is assigned upon examination of small opacities on CXR, in comparison to standardized CXR images. Possible values are integers from 0 to 3, which are assigned as a major category, followed by a subcategory (see Box 1 for a simple example). For instance, a score of 1/0 means that 1 was assigned as the major category, while 0 (subcategory) was strongly considered as the alternative. Conversely, a score of 0/1 means that the radiologist assigned 0 as the major category, but strongly considered 1 as a suitable option. A score of 1/1 means that the CXR is consistent with the standard CXR graded as 1 in the ILO classification report.

As mentioned earlier, an ILO score >=1/1 was considered as the reference standard for pneumoconiosis to develop the diagnostic prediction rule.¹ This contrasts with standard recommendations at the time, which stated that an ILO category of 1/0 or higher should be considered consistent with the presence of pneumoconiosis.³ This decision was made under the rationale that a 1/0 cut-off could lead to greater misclassification, resulting in more unnecessary chest X-rays. Out of the 1291 workers included for analysis, a total of 37 (2.9%) had a score >=1/1, whereas 131 (10.1%) were graded >=1/0.

Notably, three different radiologists examined each CXR and provided a score. Radiologists were blinded to participant characteristics, except for the fact that all participants worked in the construction industry. The median score was used for analysis.
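As an illustration of how a median of three readings can be mapped to the outcome definition, the following is a minimal R sketch. It assumes that the median is taken on the ordinal 12-point ILO profusion scale; the function names (`median_ilo`, `is_pneumoconiosis`) and the example readings are hypothetical and are not taken from the original study.

```r
# The 12-point ILO profusion scale, ordered from lowest to highest.
ilo_levels <- c("0/-", "0/0", "0/1", "1/0", "1/1", "1/2",
                "2/1", "2/2", "2/3", "3/2", "3/3", "3/+")

# Median reading of an odd number of readers, taken on the ordinal scale.
# ASSUMPTION: the original study may have handled the median differently.
median_ilo <- function(readings) {
  stopifnot(all(readings %in% ilo_levels), length(readings) %% 2 == 1)
  positions <- match(readings, ilo_levels)
  ilo_levels[median(positions)]
}

# Outcome definition: profusion at or above a cut-off (default >= 1/1).
is_pneumoconiosis <- function(ilo_score, cutoff = "1/1") {
  match(ilo_score, ilo_levels) >= match(cutoff, ilo_levels)
}

# Hypothetical readings by three radiologists for one worker.
readings <- c("0/1", "1/0", "1/1")
consensus <- median_ilo(readings)
consensus
is_pneumoconiosis(consensus)                  # outcome at the >= 1/1 cut-off
is_pneumoconiosis(consensus, cutoff = "1/0")  # outcome at the >= 1/0 cut-off
```

In this hypothetical example the same worker is a non-case under the >=1/1 definition but a case under >=1/0, which is the distinction behind the 37 versus 131 workers reported above.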

Predictors

Lung function, measured with a pneumotachometer on the same day as the CXR, and worker questionnaire variables were assessed as potential predictors of pneumoconiosis. Seven candidate predictors were identified in univariable analysis:

Age
Smoking status
Job title
Time working in the construction industry
Feeling unhealthy
Cumulative exposure to silica index
Standardized residual FEV1

Continuous variables were modeled in two ways: in their continuous form and dichotomized (binary). Since there were no differences in the AUC of a prediction model with continuous vs binary predictors, the latter were kept to simplify use of the diagnostic rule.

The final model included six predictors:

Predictor	Value	Score	Beta
Age	>=40 years	1.0	0.72
Smoking habit	Current smoker	1.0	0.70
Job title	High exposure job title	1.5	1.14
Work duration in construction industry	>=15 years	1.5	1.00
Self-rated health	Feeling unhealthy	1.25	0.84
Standardized residual FEV1	<=-1.0	1.25	0.91

The uncorrected AUC of the model was 0.81 (95%CI: 0.75 to 0.86). The corrected AUC was 0.76.
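The scoring system in the table above can be applied with a few lines of R. This is a minimal sketch: the column names and the `rule_score` helper are hypothetical, and only the points (not the model intercept, which is not reported here) are used, so it returns scores rather than predicted probabilities.

```r
# Points per predictor, copied from the scoring table.
rule_points <- c(age_40_plus = 1.0, current_smoker = 1.0,
                 high_exposure_job = 1.5, work_15y_plus = 1.5,
                 feeling_unhealthy = 1.25, fev1_sr_le_minus1 = 1.25)

# Sum the points of the predictors that are present (TRUE) for each worker.
# `workers` is a data frame with one logical column per predictor.
rule_score <- function(workers) {
  stopifnot(all(names(rule_points) %in% names(workers)))
  x <- as.matrix(workers[names(rule_points)]) + 0  # logical -> numeric 0/1
  as.vector(x %*% rule_points)
}

# Two hypothetical workers.
workers <- data.frame(
  age_40_plus = c(TRUE, TRUE), current_smoker = c(FALSE, TRUE),
  high_exposure_job = c(FALSE, TRUE), work_15y_plus = c(TRUE, TRUE),
  feeling_unhealthy = c(FALSE, FALSE), fev1_sr_le_minus1 = c(FALSE, TRUE)
)
rule_score(workers)
```

The maximum attainable score is the sum of all points (7.5), so a threshold of 5 points requires at least four of the six predictors to be present.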

Model Validation

In the original Suarthana study,¹ the prediction model was only internally validated. A formal external validation procedure was not performed as currently recommended in TRIPOD+AI guidelines.⁴

To scope for studies reporting the use of the diagnostic prediction rule and any subsequent external validation studies, the citations of the diagnostic rule development paper were retrieved from Google Scholar on 10/09/2024 and screened by title and abstract. Google Scholar was chosen due to its wide coverage of literature sources. A total of 59 records citing the paper were found. In comparison, other databases retrieved fewer results: PubMed-MEDLINE (n = 11), Web of Science (n = 22), Scopus (n = 32), Semantic Scholar (n = 34), and Dimensions (n = 26). All documents were reviewed, including those in other languages, for which automatic translations were obtained to screen for any calculations of the probability of silicosis according to the diagnostic prediction rule. Out of 59 records citing the paper, 5 studies^5–9 reported having used the diagnostic prediction rule to calculate workers’ risk of pneumoconiosis. These studies are summarized under the following subheadings:

Nicol, et al.⁵

In a case series of 6 young stonemasons from the UK who were diagnosed with silicosis after undergoing high-resolution computed tomography (HRCT) (three of them with progressive massive fibrosis), the diagnostic rule was applied and all 6 cases had a predicted probability of silicosis of 0%.⁵ None of these 6 cases would have been referred for further chest X-ray investigation based solely on the diagnostic prediction rule score.

Meijer, et al.⁶

A subset of 180 participants enrolled in the study used for the development of the diagnostic prediction rule were invited for further examination with chest HRCT, of whom n=79 ultimately underwent HRCT.⁶ Invited participants were intended to be representative of the different risk score categories of the diagnostic prediction rule. A definite diagnosis of silicosis was not made. The study reports HRCT findings for different ILO thresholds (0/0, 1/0, and >=1/1), agreement between radiologists on individual HRCT features, and associations between the cumulative exposure index to silica and HRCT findings, controlling for smoking.

In participants with a normal CXR (ILO 0/0), only 34.9% had a normal HRCT. In these patients, findings suggestive of silicosis such as well-defined round opacities (8%) and parietal pleural abnormalities (24%) were frequent on HRCT. Emphysema was also frequent (41%), as well as irregular and/or linear opacities (22%). A total of 3 participants had an ILO >=1/1 in CXR and all three participants had positive HRCT findings, thus suggesting a high specificity (100%), but low sensitivity (25%) as there were 9 false negatives (considering only well defined round opacities as confirmatory findings for silicosis).

Mets, et al.⁷

This was a case-control study in which workers in the construction industry at high risk of silicosis based on the diagnostic prediction rule (score 5 or higher) were invited to undergo diagnostic workup, including chest CT, pulmonary function test, and medical examination by a pulmonologist. A total of 398 workers out of 42,150 (0.9%) were in the high-risk category and invited to participate. The proportion of high-risk participants was lower than in the original Suarthana paper, possibly because the ARBOUW database includes a large fraction of administrative workers and not only construction workers. Ultimately, 54 participated as cases (high-risk), whereas controls were patients from a cancer screening cohort. The study reports micronodules found on chest CT.

Stigter, et al.⁸

This is a congress abstract reporting the use of the diagnostic prediction rule to identify high-risk workers (cut-off: 5 points) in a ceramic tile production plant. Out of 353 employees, 52 (15%) were in the high-risk category and underwent chest CT. Silicosis was found in 8 workers (17%).

Rooijackers, et al.⁹

This is a congress abstract which also used the ARBOUW database to identify high-risk participants with a threshold of 5 points in the diagnostic prediction rule. Out of 75,000 employees, 1123 (1.5%) were high-risk participants. A total of 295 workers ultimately participated and underwent chest CT. Silicosis was found in 64 workers (22%), 37 (13%) in an early stage.

The latter three studies^7–9 concluded a good performance of the diagnostic prediction rule in high-risk participants, but did not include participants classified as being low-risk, a common situation that leads to verification bias and underestimation of the number of false negatives.¹⁰

Cut-off points of the diagnostic prediction rule

A cutoff point of 3.75 is suggested as optimal in the original model development study, with the following classification measures:

	CXR +	CXR -
Rule +	33	534	567
Rule -	4	720	724
	37	1254	1291

Sensitivity: 89.2%,
Specificity: 57.4%,
Negative Predictive Value: 99.4%,
Positive Predictive Value: 5.8%

Nonetheless, a higher cut-off point of 5 has been used in practice.^7–9 The summary data for this exact cut-off point are not provided in the original diagnostic rule paper, so the data reported there for a cut-off point of 5.25 are used here to give an impression of its classification properties (note that these may differ from the actual diagnostic performance characteristics at 5 points):

	CXR +	CXR -
Rule +	13	106	119
Rule -	24	1148	1172
	37	1254	1291

Sensitivity: 35.1%,
Specificity: 91.5%,
Negative Predictive Value: 98.0%,
Positive Predictive Value: 10.9%

The decision to use a higher cut-off point than the optimal one is likely due to the large proportion of individuals who would need to undergo CXR with a 3.75 cut-off (43.9%) vs 5.25 (9.2%).
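The classification measures for both cut-off points can be recomputed directly from the four inner cells of each 2x2 table. The following R sketch uses a hypothetical helper, `diagnostic_measures`, with the cell counts reported above as input.

```r
# Classification measures from the four cells of a 2x2 table
# (rule result in rows, CXR result in columns).
diagnostic_measures <- function(tp, fp, fn, tn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv = tp / (tp + fp),
    npv = tn / (tn + fn),
    fraction_referred = (tp + fp) / (tp + fp + fn + tn))
}

# Cell counts from the tables for the 3.75 and 5.25 cut-off points.
rule_3_75 <- diagnostic_measures(tp = 33, fp = 534, fn = 4, tn = 720)
rule_5_25 <- diagnostic_measures(tp = 13, fp = 106, fn = 24, tn = 1148)

# Percentages, one row per cut-off.
round(100 * rbind(rule_3_75, rule_5_25), 1)
```

Because the outcome is rare (37 of 1291 workers), the negative predictive value stays high at both cut-offs while the positive predictive value stays low; the main difference between the cut-offs is the trade-off between sensitivity and the fraction of workers referred.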

Misclassification of chest X-Ray vs HRCT

The results of a systematic review with meta-analysis of the diagnostic accuracy of CXR for silicosis are available as a conference abstract.¹¹ In twelve studies reporting CXR and HRCT for the ILO 1/0 threshold, sensitivity was 77.3% (95%CI: 64.4–86.5%, I² = 84%) and specificity was 95.7% (95%CI: 82.2–99.1%, I² = 27%). The datasets are available through a GitHub repository and can be used for further analysis.

All studies included in that meta-analysis were re-examined to reconstruct 2x2 tables for the ILO 1/1 threshold from every study that reported enough information to do so. A total of 4 studies^6,12–14 were included in a new meta-analysis to obtain sensitivity and specificity values for this ILO cut-off point. The diagnostic odds ratio (DOR) was also calculated to model their trade-off as discussed below. The following estimates were obtained:

Pooled Sensitivity (%): 53.7 (95%CI: 30.1-75.8)
Pooled Specificity (%): 98.6 (95%CI: 94.9-99.6)
Pooled Diagnostic Odds Ratio (DOR): 43.29 (90%CI: 18.56-100.98)

The uncertainty in sensitivity and specificity values can be modeled by first sampling sensitivity from a beta distribution (instead of using the confidence intervals that assume normality), followed by calculating specificity from the diagnostic odds ratio (DOR) with the following formula:

\[Sp = 1 - \left( \frac{Sn}{Sn + (1 - Sn) \cdot DOR} \right)\]

This approach accounts for the trade-off between sensitivity and specificity when modelling uncertainty. The following figure shows how this approach compares to all possible combinations between sensitivity and specificity within the 95%CI of the diagnostic performance parameters.

Black points represent sensitivity-specificity pairs from individual studies in the meta-analysis. The Overlap region represents all possible combinations within 95% CIs, without assuming a trade-off.
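The sampling approach described above can be sketched in R as follows. This is an illustration under stated assumptions, not the script used for this report: the beta parameters are derived by the method of moments from the pooled sensitivity and its 95%CI (the parameterization actually used is not stated here), and the helper names `spec_from_dor` and `beta_from_ci` are hypothetical.

```r
set.seed(2025)

# Specificity implied by a given sensitivity and a fixed DOR,
# where DOR = [Sn / (1 - Sn)] / [(1 - Sp) / Sp].
spec_from_dor <- function(sn, dor) {
  1 - sn / (sn + (1 - sn) * dor)
}

# ASSUMPTION: beta parameters by the method of moments, using the pooled
# estimate as the mean and a standard error approximated from the CI width.
beta_from_ci <- function(est, lower, upper) {
  se <- (upper - lower) / (2 * qnorm(0.975))
  k <- est * (1 - est) / se^2 - 1
  c(shape1 = est * k, shape2 = (1 - est) * k)
}

shapes <- beta_from_ci(est = 0.537, lower = 0.301, upper = 0.758)

# One sensitivity draw per simulated sample, each paired with the
# specificity implied by the pooled DOR.
sn_draws <- rbeta(5000, shapes[["shape1"]], shapes[["shape2"]])
sp_draws <- spec_from_dor(sn_draws, dor = 43.29)

summary(sn_draws)
summary(sp_draws)
```

With a fixed DOR, higher sampled sensitivities are always paired with lower specificities, which is the trade-off that independent sampling within the two confidence intervals would ignore.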

Accounting for misclassification

Corrected ROC curve analysis of prediction models can be done by taking into account misclassification error for binary outcomes, provided that disease prevalence and misclassification rates are known.¹⁵ Zawistowski, et al. simulate the value of the true outcome and then introduce different misclassification rates to understand the impact of misclassification on the prediction models’ AUC.
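To make the forward direction of this procedure concrete, the following R sketch simulates a true outcome, applies non-differential misclassification, and compares the AUC against both outcomes. It is a simplified illustration and not Zawistowski's code: the data-generating model, sample size, and function names (`auc_rank`, `misclassify`) are hypothetical, and only the sensitivity and specificity values come from the pooled estimates above.

```r
set.seed(1)

# AUC from the rank-sum (Mann-Whitney) statistic; ties get average ranks.
auc_rank <- function(score, outcome) {
  n1 <- as.numeric(sum(outcome == 1))
  n0 <- as.numeric(sum(outcome == 0))
  (sum(rank(score)[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Non-differential misclassification: error rates depend only on the true
# outcome. Cases are detected with probability `sens`; non-cases are
# falsely labelled positive with probability 1 - `spec`.
misclassify <- function(true_outcome, sens, spec) {
  p_positive <- ifelse(true_outcome == 1, sens, 1 - spec)
  rbinom(length(true_outcome), size = 1, prob = p_positive)
}

# HYPOTHETICAL data: a risk score that is informative for the true outcome.
n <- 20000
risk_score <- rnorm(n)
true_outcome <- rbinom(n, 1, plogis(-3.5 + 1.2 * risk_score))
observed_outcome <- misclassify(true_outcome, sens = 0.537, spec = 0.986)

auc_true <- auc_rank(risk_score, true_outcome)
auc_observed <- auc_rank(risk_score, observed_outcome)
c(true = auc_true, observed = auc_observed)
```

Under these assumptions the AUC computed against the misclassified outcome is attenuated towards 0.5 relative to the AUC against the true outcome, which is the direction of bias expected when misclassification is non-differential.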

Non-differential misclassification

In the case of the diagnostic prediction rule, we do not know the value of the true outcome, which would have been determined with HRCT. Instead, the diagnostic prediction rule used CXR as the reference test, which means that only the value of the misclassified outcome is known. Zawistowski’s¹⁵ procedure can be adapted to obtain the reverse-misclassified outcome instead, by using the parameters obtained from the meta-analysis to estimate what the diagnostic rule AUC would have been had HRCT been used instead of CXR. The original functions, as well as the adapted reverse-misclassification function, are found in the following scripts, which are sourced into this document:

source("scripts/Zawistowski_misclassification_functions.R")

source("scripts/sample_characteristics_simulation.R")

source("scripts/Misclassification_non-differential_low_high.R")

source("scripts/Misclassification_non-differential_low_medium_high.R")

Simulated data with a sample size of 1291 participants are used to replicate samples of the same size as the original diagnostic rule development study, by using the summary data reported in the paper and assigning the outcome based on the outcome probability from the diagnostic rule equation. A total of 5000 different samples are drawn to estimate the potential impact of misclassification on the diagnostic prediction rule. Furthermore, scores for every fictitious participant are calculated based on the diagnostic prediction rule scoring system and a cut-off value of 5 is used to classify participants as being at high risk (>=5 points) or low risk (<5 points) of silicosis in Scenario 1, since this is the cut-off value that has been used in practice.^8,9 Scenario 2 includes an additional medium-risk category, defined by scores of 3.75 to 4.99 points.
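The risk stratification used in the two scenarios can be expressed with `cut()`. The following R sketch uses a hypothetical helper, `risk_category`, and hypothetical example scores; only the cut-off values come from the text above.

```r
# Cut-offs from the two scenarios; intervals are closed on the left so that
# a score of exactly 5 is high-risk and exactly 3.75 is medium-risk.
risk_category <- function(score, scenario = 1) {
  if (scenario == 1) {
    cut(score, breaks = c(-Inf, 5, Inf), right = FALSE,
        labels = c("low", "high"))
  } else {
    cut(score, breaks = c(-Inf, 3.75, 5, Inf), right = FALSE,
        labels = c("low", "medium", "high"))
  }
}

scores <- c(0, 2.5, 3.75, 4.75, 5, 6.25)  # hypothetical rule scores
data.frame(score = scores,
           scenario_1 = risk_category(scores, 1),
           scenario_2 = risk_category(scores, 2))
```

Setting `right = FALSE` matters here: with the default right-closed intervals, a worker scoring exactly 5 points would be placed in the lower category.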

Results from simulations are shown in Appendix 2.

Conclusions from non-differential misclassification simulations

Simulation scenarios show that the average impact of non-differential misclassification is minimal in participants categorized as high-risk by the diagnostic prediction rule, but substantial in participants categorized as low-risk. The prevalence in low-risk participants with CXR is 1.76%. In comparison, the true prevalence observed with HRCT can be twice as large in the average scenario (3.4%) and more than ten times as large (19.7%) in scenarios with high misclassification rates. This implies that the fraction of false negatives in the low-risk group can be important, causing a missed opportunity to identify cases with early signs of the disease. This suggests that prior studies evaluating the performance of the diagnostic prediction rule only in workers classified as high-risk^7–9 provided overoptimistic performance measures due to verification bias,¹⁰ since low-risk participants were not included in such studies.

The prevalences of silicosis assessed with HRCT in the studies by Rooijackers, et al.⁹ and Stigter, et al.⁸ were 22% and 17%, respectively. In comparison, the true prevalence of silicosis with HRCT in high-risk participants across simulations is lower under the majority of the possible sensitivity and specificity combinations (IQR: 4.8-8.5%), but compatible with scenarios of high non-differential misclassification (maximum prevalence: 22.8%). Other explanations are also possible, such as differences in the study populations, although these are difficult to compare due to the limited summary data reported in the studies. The table below provides a comparison of available characteristics. Older age in the studies by Rooijackers, et al. and Stigter, et al. could partly explain the higher prevalence observed. Differences due to random sampling could be another explanation. One alternative explanation would be differential misclassification due to covariates included in the model, which could have caused the blinded radiologists to misclassify participants as having a positive CXR and thus inflated the associations between covariates and outcome, resulting in overestimation of the AUC of the model. In the current simulations, the true AUC of the model would be higher than reported by 0.04 units, resulting in an adjusted AUC of ~0.78 in the average scenario. This would imply that the prediction rule would have performed better than initially estimated had HRCT been used. However, this statement only holds if misclassification is truly non-differential. Scenarios of differential misclassification could imply that the model is worse than initially estimated. Scenarios of differential misclassification are further explored in the next section.

Characteristic	Suarthana, et al. ^b	Rooijackers, et al.	Stigter, et al.
Population	Construction workers	Construction workers	Ceramic tile production plant
High-risk group, n (%) ^a	224 (17%)	295 (1.5%)	48 (15%)
Age, mean (SD)	43.6 ^b	50 (6)	52 (5)
Pack-years, mean (SD)	Not reported	26 (16)	28 (21)
COPD, n (%)	NA (20%) ^c	62 (21%)	6 (13%)

^a The absolute frequency is the total number of participants who underwent diagnostic imaging, whereas the percentage was calculated as the relative frequency of workers classified as high-risk out of the total source population.
^b The mean age of high-risk workers was calculated from simulated data reproducing the original sample in the diagnostic rule development study.
^c This is the overall proportion with a history of lung diseases (defined as emphysema, pleuritis, or tuberculosis) in the total sample.

Lastly, when translating this into numbers of cases, the simulated samples replicating the diagnostic rule development study contained a median of 36 cases of silicosis according to ILO >=1/1 in CXR in a total sample of 1291 workers. Correcting for the sensitivity and specificity estimates from the meta-analysis yields a median of 51 true cases of silicosis had HRCT been used. Using the current cut-off of 5 points for low and high risk stratification translates into detecting less than one third of cases (n=14), while 36 cases in the low-risk group (70% of all cases) remain undetected. The introduction of an additional cut-off of 3.75 points (recommended as optimal in the original diagnostic rule development study) to define an additional medium-risk category would reduce the number of undetected cases to 22 (43%) at the expense of performing a significantly larger number of worker health examinations potentially including HRCT (567 compared to 224).

Differential outcome misclassification

Prior analyses assumed that outcome misclassification is non-differential. However, differential outcome misclassification is conceivable. The sources and mechanisms of differential misclassification are summarized in a mind-map (link to resource - in progress). Here, the focus is on how the main candidate predictors of the diagnostic prediction model could have led to differential outcome misclassification through a mechanism that systematically increases the FPR with the probability of being a case and/or increases the FNR with the probability of being a control, as these are the two mechanisms that could have led to AUC overestimation in the original diagnostic rule development study. Only age and smoking are thought to potentially lead to differential outcome misclassification through plausible mechanisms, because radiologists were blinded to participant characteristics, thereby blocking the sources of differential outcome misclassification for the other predictors.

… Work in progress …

Preliminary Recommendations

Recommendation 1: It is not possible to directly recommend the incorporation of the diagnostic prediction rule for the screening of silicosis due to important sources of uncertainty, including the lack of high-quality external validation studies. One study from the Netherlands⁶ included a representative sample from different score categories to undergo HRCT, but does not allow for the reconstruction of a 2x2 table of risk categories against HRCT. Other available studies from the Dutch context assessing the tool^7–9 are at high risk of verification bias due to non-inclusion of low-risk participants. Furthermore, outcome misclassification is an important limitation of the diagnostic prediction rule, as it was developed using CXR as the reference test. The prediction rule has been used at a cut-off threshold (5 points) higher than optimal (3.75), which reduces the number of necessary imaging studies, whilst likely resulting in a high number of false negatives. Nonetheless, the use of the diagnostic prediction rule could be justified under circumstances of resource constraints, as discussed below.

Recommendation 2: Prediction model updating procedures are likely needed as the tool was developed 25 years ago and may no longer be representative of current worker populations and conditions. Other predictors could be more relevant nowadays. Even if the predictors remain relevant, their magnitude may be under- or overestimated due to outcome misclassification in the original model development study. Thus, model updating through a number of different strategies (recalibration, revision, and extension)¹⁶ is highly recommended. More generalizable prediction rules could be attained by incorporating multiple industries.

Recommendation 3: It is important that any future studies assessing the performance of the diagnostic prediction rule include at least a small random sample of low-risk participants to undergo reference testing (i.e., HRCT). In such cases, sample weights can be obtained and used for modelling of diagnostic performance estimates. For instance, in the study by Rooijackers, et al., 1123 (1.5%) were high-risk participants out of 75000 workers. A fraction as small as 0.5% of low-risk participants could have likewise been invited to participate in the study to obtain robust diagnostic performance estimates of the prediction rule.
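The use of sample weights can be illustrated with a minimal R sketch of inverse-probability-of-verification weighting. The source population and high-risk counts are taken from Rooijackers, et al. as summarized above; the low-risk sample (369 workers, about 0.5%) and its 8 cases are entirely HYPOTHETICAL and serve only to show the mechanics. The sketch also assumes that verified workers are a random sample of their risk stratum.

```r
# One row per risk stratum of the prediction rule.
strata <- data.frame(
  stratum      = c("high", "low"),
  n_population = c(1123, 75000 - 1123),  # source population per stratum
  n_verified   = c(295, 369),            # 369 low-risk verified: HYPOTHETICAL
  n_cases      = c(64, 8)                # 8 low-risk cases: HYPOTHETICAL
)

# Each verified worker represents n_population / n_verified workers.
strata$weight <- strata$n_population / strata$n_verified

# Weighted numbers of cases and non-cases in the source population.
strata$est_cases    <- strata$weight * strata$n_cases
strata$est_noncases <- strata$weight * (strata$n_verified - strata$n_cases)

# Rule positive = high-risk stratum; reference = HRCT.
is_high <- strata$stratum == "high"
sensitivity <- strata$est_cases[is_high] / sum(strata$est_cases)
specificity <- strata$est_noncases[!is_high] / sum(strata$est_noncases)
c(sensitivity = sensitivity, specificity = specificity)
```

Without the low-risk sample, neither false negatives nor true negatives are observed, so sensitivity and specificity of the rule cannot be estimated at all; the weights are what allow a small verified sample to stand in for the much larger low-risk population.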

Recommendation 4: The use of the diagnostic prediction rule to stratify participants at risk of silicosis for further diagnostic workup can be justified under resource constraints, provided that no other low-cost alternatives are available. Lowering the current threshold for risk stratification (from 5 points to 3.75) could help to reduce the number of false negatives. Alternatively, adding a second cut-off to define a medium-risk category for more intensive follow-up (i.e., 3.75 and 5-point cut-offs to define low, medium, and high risk groups) could be considered.

Recommendation 5: Approaches involving parallel testing¹⁷ with the diagnostic prediction rule (test 1) and the predicted lifetime cumulative exposure to silica (test 2) to screen for participants at risk of silicosis could help to minimize false negatives in the first stage of the HSP. Such a strategy has previously been applied in the ceramic industry (Mosa) between 2019 and 2020 through simultaneous screening with the diagnostic prediction rule and the individual cumulative exposure to silica index from workplace exposure measurement data. Whenever exposure measurement data are not available, the use of a job-exposure matrix (i.e., SYN-JEM) could be considered as an alternative. Nonetheless, thresholds for individual exposure rely on many assumptions, which is why it would be necessary to derive optimal thresholds through diagnostic accuracy studies to further minimize false negative cases whilst achieving a reasonable use of resources.
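The expected effect of parallel testing on accuracy can be sketched in R with the standard formulas for referring a worker when either test is positive. The sketch assumes that the two tests err independently given true disease status, which may not hold here because job title and work duration feed into both the rule and the exposure estimate. The input values are HYPOTHETICAL and are not estimates from any study.

```r
# Parallel testing: refer the worker if either test is positive.
# ASSUMPTION: test errors are conditionally independent given disease status.
parallel_accuracy <- function(sens1, spec1, sens2, spec2) {
  c(sensitivity = 1 - (1 - sens1) * (1 - sens2),
    specificity = spec1 * spec2)
}

# HYPOTHETICAL accuracy of the rule (test 1) and of an exposure
# threshold (test 2).
combined <- parallel_accuracy(sens1 = 0.35, spec1 = 0.90,
                              sens2 = 0.60, spec2 = 0.70)
combined
```

The combined sensitivity is never lower than that of either test alone, while the combined specificity is never higher, so the gain in detected cases is paid for with more referrals.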

Recommendation 6: Different screening strategies could be compared through medical decision modelling (health technology assessment). Efficacy of screening strategies can be measured in terms of (minimization of) false positives and false negatives, quality-adjusted life years (QALYs), life expectancy, or working life expectancy measures. Trade-offs of efficacy against monetary costs or other undesired events can be included in such models. Because the occurrence of silicosis is highly related to the cumulative exposure to silica, microsimulation could be used to keep track of individual exposure and characteristics through time as participants age and differentially accumulate exposure according to their job title, whilst allowing them to cross the prediction rule threshold. This would also be useful to establish the optimal frequency of screening according to risk categories from the silicosis diagnostic prediction rule.

Appendix 1. Extended explanation of the ILO classification scheme

Box 1. Understanding the ILO chest X-ray classification scheme

The ILO CXR classification scheme may be unintuitive at first. An analogy can be made with a daily life situation to simplify its understanding. Suppose that a radiologist goes to the supermarket to buy chocolate. The radiologist finds 4 options on the shelf:

Sweet chocolate (30% cocoa) = ILO 0
Semi-sweet chocolate (50% cocoa) = ILO 1
Semi-dark chocolate (70% cocoa) = ILO 2
Dark chocolate (95% cocoa) = ILO 3

Radiologist number 1 (R1) has a hard time deciding between 30% (ILO 0) and 50% (ILO 1) cocoa, but does not even consider buying a 70% (ILO 2) or 95% (ILO 3) cocoa bar. In the end, R1 picks the 50% cocoa bar. Thus, the final score is 1/0 because they paid for semi-sweet chocolate (ILO 1), but strongly considered sweet chocolate (ILO 0) as the alternative.

In contrast, radiologist number 2 (R2) is convinced that semi-sweet chocolate (ILO 1) is the right choice as soon as they see the shelf and does not even consider other options. Thus, the final score for R2 is 1/1.

Appendix 2. Non-differential misclassification simulation results

Scenario 1

The distribution of low risk and high risk participants is as follows:

Risk	Median	P25	P75	Min	Max
Low (<5 points)	1067	1057	1076	1011	1116
High (>=5 points)	224	215	234	175	280

The following table shows the number of silicosis cases per simulated sample, overall and by risk group:

Characteristic	Median	P25	P75	Min	Max
Silicosis (CXR)	36	32	40	12	59
Silicosis (HRCT)	51	38	68	8	258
Silicosis (CXR) \| high-risk	17	15	20	4	35
Silicosis (CXR) \| low-risk	19	16	22	3	35
Silicosis (HRCT) \| high-risk	14	11	19	0	50
Silicosis (HRCT) \| low-risk	36	27	50	6	210
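The HRCT counts above vary much more widely than the CXR counts, which is consistent with the summary's description of sampling sensitivity from a beta distribution and deriving specificity from the diagnostic odds ratio (DOR). The sketch below shows only that trade-off, using the standard definition DOR = [sens / (1 - sens)] × [spec / (1 - spec)]. The beta parameters and the DOR value are hypothetical placeholders, not the pooled estimates of this report, and the function names are invented for illustration.

```python
import random


def specificity_from_dor(sens: float, dor: float) -> float:
    """Solve DOR = [sens / (1 - sens)] * [spec / (1 - spec)] for spec."""
    if not 0.0 < sens < 1.0 or dor <= 0.0:
        raise ValueError("need 0 < sens < 1 and dor > 0")
    spec_odds = dor * (1.0 - sens) / sens
    return spec_odds / (1.0 + spec_odds)


def draw_sens_spec(alpha: float, beta: float, dor: float, rng: random.Random):
    """Draw one sensitivity from Beta(alpha, beta) and the matching specificity."""
    sens = rng.betavariate(alpha, beta)
    return sens, specificity_from_dor(sens, dor)


# Hypothetical inputs, chosen only to make the example run.
HYPOTHETICAL_ALPHA, HYPOTHETICAL_BETA, HYPOTHETICAL_DOR = 8.0, 12.0, 20.0

rng = random.Random(2025)  # fixed seed so repeated runs give the same draws
for _ in range(3):
    sens, spec = draw_sens_spec(HYPOTHETICAL_ALPHA, HYPOTHETICAL_BETA,
                                HYPOTHETICAL_DOR, rng)
    print(f"sensitivity={sens:.3f}  specificity={spec:.3f}")
```

With the DOR held fixed, a higher sampled sensitivity always implies a lower specificity, so each draw represents a different point on the same sensitivity-specificity trade-off.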

Prevalence of silicosis (%) in high and low risk groups

Characteristic	Median	P25	P75	Min	Max
Prevalence (CXR) \| high-risk	7.66	6.51	8.93	2.02	15.77
Prevalence (CXR) \| low-risk	1.76	1.50	2.04	0.28	3.28
Prevalence (HRCT) \| high-risk	6.47	4.78	8.48	0.00	22.83
Prevalence (HRCT) \| low-risk	3.42	2.51	4.66	0.56	19.72

[Figure panels: High risk group; Low risk group]
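As a quick arithmetic check, the scenario 1 prevalences can be approximated from the median case counts and median group sizes reported above. This is a sketch only: the variable names are illustrative, and dividing medians is an approximation rather than the calculation used to produce the tables.

```python
# Median group sizes and median case counts copied from the scenario 1 tables.
group_size = {"high": 224, "low": 1067}
cases = {
    ("CXR", "high"): 17,
    ("CXR", "low"): 19,
    ("HRCT", "high"): 14,
    ("HRCT", "low"): 36,
}


def prevalence_pct(n_cases: int, n_group: int) -> float:
    """Prevalence as a percentage of the group."""
    return 100.0 * n_cases / n_group


approx = {
    key: round(prevalence_pct(n, group_size[key[1]]), 2)
    for key, n in cases.items()
}

for (outcome, risk), value in approx.items():
    print(f"{outcome} | {risk}-risk: {value}%")
```

The results (about 7.59, 1.78, 6.25, and 3.37) are close to the tabulated medians (7.66, 1.76, 6.47, and 3.42) but not identical, because the median of a ratio across simulated samples is generally not the ratio of the medians.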

ROC curve analysis

Outcome	Median	P25	P75	Min	Max
CXR	0.743	0.716	0.767	0.580	0.874
CXR-corrected	0.786	0.760	0.810	0.621	0.894

Absolute difference in AUC between the CXR-corrected and CXR outcomes: 0.043
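For intuition about why the corrected AUC is higher, consider non-differential misclassification that is independent of the risk score. The scores of observed cases and non-cases are then mixtures of the true case and non-case distributions, which gives the identity AUC_observed - 0.5 = (AUC_true - 0.5) × (PPV + NPV - 1), where PPV and NPV are the predictive values of the error-prone outcome for the true outcome. The Python sketch below inverts that identity. It is an illustration under these assumptions; whether it matches the exact correction implemented for this report is itself an assumption, and the function names and numeric inputs in the example call are hypothetical.

```python
def predictive_values(true_prev: float, fpr: float, fnr: float):
    """PPV and NPV of the error-prone outcome label, via Bayes' rule.

    true_prev: prevalence of the true (reference-test) outcome
    fpr, fnr:  false positive / false negative rates of the index test
    """
    sens, spec = 1.0 - fnr, 1.0 - fpr
    ppv = sens * true_prev / (sens * true_prev + fpr * (1.0 - true_prev))
    npv = spec * (1.0 - true_prev) / (spec * (1.0 - true_prev) + fnr * true_prev)
    return ppv, npv


def corrected_auc(observed_auc: float, ppv: float, npv: float) -> float:
    """Invert AUC_obs - 0.5 = (AUC_true - 0.5) * (PPV + NPV - 1)."""
    shrink = ppv + npv - 1.0
    if shrink <= 0.0:
        raise ValueError("PPV + NPV must exceed 1 for the correction to be defined")
    return 0.5 + (observed_auc - 0.5) / shrink


# Hypothetical error rates and prevalence, for illustration only.
ppv, npv = predictive_values(true_prev=0.04, fpr=0.01, fnr=0.40)
print(round(corrected_auc(0.743, ppv, npv), 3))
```

Read against the medians in the table above, an observed AUC of 0.743 and a corrected AUC of 0.786 would correspond to PPV + NPV - 1 ≈ 0.243 / 0.286 ≈ 0.85, if the identity applied exactly to the medians.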

Scenario 2

The same parameters as in scenario 1 were used, but workers were classified ordinally as low, medium, or high risk.

The number of participants in each risk category per simulated sample is summarized as follows:

Risk	Median	P25	P75	Min	Max
Low Risk (<3.75)	723.5	712	735	659	799
Medium Risk (3.75-4.99)	344.0	333	354	266	398
High Risk (>=5)	224.0	215	234	175	280
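The two classification schemes can be written as a single rule. The Python sketch below uses the thresholds stated in this report (5 points for high risk, and 3.75 points for medium risk in scenario 2); the function name `risk_category` and the example scores are hypothetical.

```python
def risk_category(score: float, scenario: int = 1) -> str:
    """Classify a diagnostic rule score into a risk category.

    Scenario 1: low (<5 points) or high (>=5 points).
    Scenario 2: low (<3.75), medium (3.75 to 4.99), or high (>=5).
    """
    if scenario not in (1, 2):
        raise ValueError("scenario must be 1 or 2")
    if score >= 5:
        return "high"
    if scenario == 2 and score >= 3.75:
        return "medium"
    return "low"


# Example scores (hypothetical) classified under both scenarios.
for score in (2.0, 3.75, 4.99, 5.0):
    print(score, risk_category(score, 1), risk_category(score, 2))
```

Because scenario 2 only splits the scenario 1 low-risk group in two, the low and medium counts above (medians 723.5 and 344.0) add up to roughly the scenario 1 low-risk count (median 1067), while the high-risk group is unchanged.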

The following table shows the number of silicosis cases per simulated sample, overall and by risk group:

Characteristic	Median	P25	P75	Min	Max
Silicosis (CXR)	36	32	40	12	59
Silicosis (HRCT)	51	38	68	8	258
Silicosis (CXR) \| high-risk	17	15	20	4	35
Silicosis (CXR) \| medium-risk	11	9	13	1	24
Silicosis (CXR) \| low-risk	8	6	10	0	21
Silicosis (HRCT) \| high-risk	14	11	19	0	50
Silicosis (HRCT) \| medium-risk	14	10	19	1	79
Silicosis (HRCT) \| low-risk	22	16	31	1	131

Prevalence of silicosis (%) in high, medium, and low risk groups

Characteristic	Median	P25	P75	Min	Max
Prevalence (CXR) \| high-risk	7.66	6.51	8.93	2.02	15.77
Prevalence (CXR) \| medium-risk	3.10	2.50	3.75	0.29	7.12
Prevalence (CXR) \| low-risk	1.10	0.83	1.38	0.00	2.76
Prevalence (HRCT) \| high-risk	6.47	4.78	8.48	0.00	22.83
Prevalence (HRCT) \| medium-risk	4.12	2.90	5.62	0.29	22.32
Prevalence (HRCT) \| low-risk	3.11	2.21	4.29	0.14	18.42

Age distribution in high, medium, and low risk groups

Characteristic	Mean
Mean age (HRCT) \| high-risk	43.6
Mean age (HRCT) \| medium-risk	41.4
Mean age (HRCT) \| low-risk	42.2
