DE GRUYTER
Current Directions in Biomedical Engineering 2020;6(3): 20203137

Britta König*, Nika Guberina, Hilmar Kühl, and Waldemar Zylka

Validation of iterative CT reconstruction by inter and intra observer performance assessment of artificial lung foci

https://doi.org/10.1515/cdbme-2020-3137

Abstract: We investigate the suitability of statistical and model-based iterative reconstruction (IR) algorithm strengths and their influence on image quality and diagnostic performance in low-dose computed tomography (CT) protocols for lung-cancer screening procedures. We evaluate the inter- and intra-observer performance for the assessment of iterative CT reconstruction. Artificial lung foci shaped as spheres and spicules, made from material with calibrated Hounsfield units, were pressed within layered granules in the lung lobes of an anthropomorphic phantom. In addition, a soft-tissue and a fat extension ring were attached. The phantom with foci was scanned using standard high contrast, low-dose and ultra-low-dose protocols. For reconstruction, the IR algorithm ADMIRE was used at four different strength levels. Two ranking tests and Friedman statistics were performed. Fleiss $\kappa$ and a modified Cohen's $\kappa_{ney}$ were used to quantify inter- and intra-observer performance. In conjunction with the standard lung kernel BL57, radiologists evaluated medium to high IR strength, with a preference for $S_4$, as suitable for lung foci detection. When the reconstruction kernel was varied, the ranking became more random than when the phantom diameter was varied. The inter-observer reliability shows poor to slight agreement, expressed by $\kappa < 0$ and $\kappa = 0$ to $0.20$. For the intra-observer reliability, non-agreement with $\kappa_{ney} = 0$ to $0.20$ and moderate agreement with $\kappa_{ney} = 0.60$ to $0.79$ were observed for the first ranking test, and almost perfect agreement with $\kappa_{ney} > 0.90$ for the second ranking test. In conclusion, our validation suggests a radiological preference for medium to high iteration strengths, especially $S_4$, for lung foci detection. The correlation between diagnostic experience and the subjective perception of IR reconstructed CT images still needs to be investigated.

Keywords: CT, iterative reconstruction, lung nodule detection, inter- and intra-observer reliability, low-dose, image quality, phantom

*Corresponding author: Britta König, Westphalian University, Campus Gelsenkirchen, Germany, and University of Duisburg-Essen
Nika Guberina, University of Duisburg-Essen, University Hospital Essen, Germany
Hilmar Kühl, St. Bernhard-Hospital Kamp-Lintfort GmbH, University of Duisburg-Essen, Germany
Waldemar Zylka, Westphalian University, Campus Gelsenkirchen, Germany

Open Access. © 2020 Britta König et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 License.

1 Introduction and Background

The Belgian-Dutch randomized-controlled NELSON Trial in 2017 demonstrated a reduction of lung cancer mortality with low-dose computed tomography (LDCT) for high-risk patients over a 10-year period: by 26% in men and by 61% in women [1]. Previously, the US Lung Cancer Screening Trial NLST in 2011 showed a possible relative risk reduction of dying of lung cancer in the risk group of 20% through LDCT screening, which corresponds to an absolute risk reduction of 0.3% [2]. High contrast LDCT protocols can be used for lung foci detection. Established filtered back projection (FBP), an analytical reconstruction method, generates reduced image quality with LDCT. To meet high standards in terms of dose reduction and image quality, manufacturers created multiple solutions: reduction of tube voltage, automatic tube current modulation and iterative reconstruction (IR) as the most advanced techniques. In this investigation the statistical, model-based IR algorithm ADMIRE (Advanced Modeled Iterative Reconstruction) was used [3, 4].

IR may outperform traditional analytical methods, but the image impression alters with increasing algorithm strength. In fact, it has been reported that radiologists have reservations with regard to IR reconstructed images and their possible influence on diagnostics. Furthermore, radiologists evaluated IR images rather differently, which may be due to professional experience and could mirror their degree of familiarization. We address these issues in this paper and report on a statistical analysis of an inter- and intra-observer performance assessment of artificial lung foci in IR reconstructed CT images.

2 Materials and methods

Anthropomorphic Phantom. The commercially available anthropomorphic QRM Lung-Nodule Phantom Set including extension rings (QRM GmbH, Möhrendorf, Germany) was used to simulate an adult human chest, displayed in Fig. 1(a). The adjustable anatomical model consists of components with calibrated Hounsfield values (HU) of human tissue: two lung lobes that can be filled with lung granules, mediastinum and spine, and the soft-tissue section of a chest wall. For material replacement and lung nodule positioning, a front and a back cover can be unscrewed. Furthermore, varying chest diameters can be simulated by fitting soft-tissue and fat extension rings (effective diameters 25/30 cm). Replicated lung foci shaped as spheres and spicules were inserted into the phantom, see Fig. 1(b-d). We performed a total of three setups, one with the phantom body alone and one each with the soft-tissue ring and the fat ring fitted.

[Figure 1, panels: (a) QRM Phantom Set inside CT; (b) Top view without front cover; (c) Artificial lung foci, spheres; (d) Artificial lung foci, spicules]
Fig. 1: Phantom within the CT gantry (a); targets made from material with calibrated (at 120 kV) HU values equal to -690/-50/+100 with color assignment beige/brown/opaque inside the phantom on 25 mm layered granules: left lung spicules, right lung spheres (b); and top view of the spherical targets (diameters 3/5/8/10/12 mm) (c) and the spiculated targets (diameters 16/20/24 mm) (d).

CT protocol and reconstruction. The entire raw data acquisition was performed on a Somatom Force CT. For each setup three CT dose protocols were selected: (i) standard high contrast (SHC; 120 kV/51 mAs), (ii) low-dose (LD; 120 kV/40 mAs) and (iii) ultra-low-dose (ULD; 120 kV/20 mAs). Raw data were acquired using FOV 430 mm, rotation time 0.5 s, delay 2 s, beam collimation 192 x 0.6 mm, deactivated CAREDose4D, slice thickness 1 mm, increment 1 mm. Preparation scans and reconstruction were performed as in [5]. The acquired raw data of each protocol were reconstructed with three kernels: BL57 is a standard kernel for lung nodule detection, BR32 is a soft kernel, and BR69 is a hard kernel. The images were reconstructed using ADMIRE $S_1$/$S_3$/$S_4$/$S_5$ as axial slices of 5 mm thickness and 5 mm reconstruction increment with a parenchymal lung window (-600/1200 HU). Sixty CT image series of twenty-five slices were acquired. Subsequent test procedures used the seventeenth slice of all image series. Each CT image was anonymized. Additionally, lung sections were selected on the thirty-six images to hide the varying phantom diameters.

Evaluation methods. In order to rank CT images reconstructed by ADMIRE $S_1$/$S_3$/$S_4$/$S_5$ according to their suitability for lung nodule detection, the subjective differential tests frequently used in sensor technology were implemented. Specifically, two ranking tests were carried out. The first test (ranktest1) contains images of the phantom body with attached fat ring, varying kernels and LDCT protocols. Following a ranking test with four investigation attributes, four (out of the thirty-six) randomly selected images were randomly placed in a four-field graphical user interface (4GUI). This arrangement was presented to a radiologist, who was asked to rank the images in descending order from rank 1 to rank 4 according to his perception. One test set contained nine 4GUIs. Six radiologists evaluated three test sets. To analyze the rankings, we add up the ranks for each investigation attribute over all 4GUIs of a set and obtain the rank sums ($rs$). Calculated Friedman values (F-values) were compared to the approximate critical value 7.81 (significance level $\alpha = 5\%$ for $k = 4$ attributes) to check whether a ranking was random. To assess inter- and intra-observer reliability we computed Fleiss $\kappa$ and a modification of Cohen's $\kappa$, denoted $\kappa_{ney}$. The second ranking test (ranktest2) followed the same procedure but contained images of the phantom body alone and with each ring attached, with fixed BL57 and varying LDCT protocols.

3 Results

Averaged rank sums ($rs$) over all test sets are displayed in Fig. 2. The collective result of ranktest1 demonstrates that medium to high $S$ were generally rated with higher $rs$ than lower $S$ ($\leq S_3$). $S_5$ yields the highest collective rank sum with $rs = 145$ (followed by $S_3$ with 143 and $S_4$ with 139), and $S_1$ the lowest with $rs = 112$. The assessment of the radiologist with 26 years of professional practice (ypp) does not follow the collective result, whereas the most (32 ypp) and least experienced observers (1 ypp) rated in accordance with the overall result. The collective $rs$ of ranktest2 is highest for $S_4$ with $rs = 162$ (followed by $S_3$ with 137 and $S_5$ with 132) and lowest for $S_1$ with $rs = 116$. The averaged rank sums of ranktest2 indicate that medium to high IR strength is considered more suitable for the detection of lung foci. Furthermore, the ranking of the radiologist with 2.5 ypp is entirely contrary to the collective result, while the assessments of the observers with 1 and 1.5 ypp are in line with the collective $rs$ result. The most experienced observers (6 and 31 ypp) rated similarly to the collective $rs$, with a preference for the $S_4$ reconstruction strength.

[Figure 2, panels: (a) Average of rank sums of the first test (ranktest1); (b) Average of rank sums of the second test (ranktest2)]
Fig. 2: Averaged rank sums $rs$ based on the rank position of the images for (a) ranktest1 and (b) ranktest2. Radiologists are represented by ypp and color.

Tab. 1: Friedman values (F-values) calculated from the $rs$ of all rankings per set for each radiologist.

| Radiologist's experience [ypp] | ranktest1 Set 1 | ranktest1 Set 2 | ranktest1 Set 3 | ranktest2 Set 1 | ranktest2 Set 2 | ranktest2 Set 3 |
|---|---|---|---|---|---|---|
| 1 | 16.33 | 14.20 | 14.73 | 5.93 | 14.47 | 13.40 |
| 1.5 | 10.74 | 13.93 | 4.60 | 11.40 | 23.13 | 21.67 |
| 2.5 | 7.27 | 3.27 | 8.87 | 24.60 | 18.73 | 16.60 |
| 6 | 3.93 | 9.40 | 5.40 | 24.33 | 13.67 | 16.60 |
| 26 | 3.80 | 8.60 | 13.67 | 12.60 | 12.47 | 15.80 |
| 36 | 21.93 | 16.20 | 23.27 | 10.33 | 18.73 | 21.40 |

The F-values listed in Tab. 1 were calculated based on the $rs$ of each 4GUI ranking per set for every radiologist. Concerning ranktest1, the F-values for the radiologists with 1 and 36 ypp are higher than the approximate critical value 7.81 ($\alpha = 5\%$; $k = 4$). All other radiologists ranked at random at least once; in total seven of eleven F-values are less than 7.81. Fewer random rankings were found in ranktest2, the only exception being set 1 of the radiologist with 1 ypp. In summary, the rankings of ranktest1 were more random than those of ranktest2.

Tab. 2: Inter-observer agreement with Fleiss $\kappa$ for each ranking position and set.

| Ranktest | Set | Pos. 1 | Pos. 2 | Pos. 3 | Pos. 4 |
|---|---|---|---|---|---|
| 1 | 1 | -0.014 | -0.067 | 0.002 | -0.019 |
| 1 | 2 | -0.073 | -0.057 | -0.035 | 0.038 |
| 1 | 3 | -0.067 | -0.032 | -0.089 | -0.052 |
| 2 | 1 | -0.089 | -0.083 | -0.021 | -0.019 |
| 2 | 2 | -0.119 | -0.092 | -0.056 | -0.105 |
| 2 | 3 | -0.102 | -0.089 | -0.092 | -0.137 |

Fleiss $\kappa$, as a statistical measure of inter-observer reliability, is listed in Tab. 2 for each test set and ranking position over all radiologists. For ranktest1, $\kappa < 0$ except for set 1 (Pos. 3) and set 2 (Pos. 4). For ranktest2 we get $\kappa < 0$ throughout. According to [6], $\kappa < 0$ indicates poor and $\kappa = 0$ to $0.20$ slight agreement. The results reflect differing observer perceptions of the suitability of $S$ for lung foci detection.

In order to measure the intra-individual variability of the assessed image positions over all sets per observer at different times, we calculated a modified Cohen's $\kappa_{ney}$. It is well known that the usual interpretation of Cohen's $\kappa$ does not take problems with this measure into consideration, which may lead to misleading conclusions. Usually Cohen's $\kappa$ is calculated from the randomly expected agreement $E$ and the observed agreement $B$ according to

$$\kappa = \frac{B - E}{1 - E}.$$

Instead of $E$ we used its modification $C$ and obtain the modified $\kappa_{ney}$. In $C$, only those parts of the marginal distributions of Cohen's agreement matrix that also contribute to $B$ are taken into consideration. In Fig. 3, $B$ is plotted as a function of $\kappa_{ney}$, i.e. $B(\kappa_{ney})$, with $n = 72$ comparisons per test. In ranktest1, $\kappa_{ney}$ ranges between 0 and 0.65 and $B$ between 0 and 0.78, with a maximum incidence of 10. In ranktest2, $\kappa_{ney}$ ranges between 0 and 1.00 and $B$ between 0 and 0.89, with a maximum incidence of 7. Following the interpretation in [9], the agreement expressed by $\kappa_{ney}$ can be read for both tests as follows: values of 0.60 to 0.79, indicating moderate agreement, were the best assessment in ranktest1; values of 0 to 0.20, indicating no agreement, occurred less often in ranktest2 than in ranktest1; and values of 0.80 to 0.90 (strong) and above 0.90 (almost perfect) were the highest ratings in ranktest2. In total, intra-observer reliability was higher for ranktest2 than for ranktest1.

[Figure 3, panels: (a) Observed agreement $B(\kappa_{ney})$ for ranktest1; (b) Observed agreement $B(\kappa_{ney})$ for ranktest2]
Fig. 3: The observed agreement $B(\kappa_{ney})$ for (a) ranktest1 and (b) ranktest2, including color assignment for incidence and $n = 72$ comparisons.

4 Discussion and Conclusion

Ranking tests of iterative CT reconstructions show that radiologists consider medium to high IR strength more suitable for lung foci detection. When clinically applied parameters were used (BL57, varying phantom diameter), the experiment reveals a noticeable preference for $S_4$. A correlation between years of professional experience and the subjective evaluation of IR reconstructed images of various strengths cannot be verified accurately. The most and the least experienced observers assessed similarly; differences in fields of expertise and diagnostic familiarization give further reason to test more observers in order to make a confident statement. The influence of varying kernels (fixed phantom diameter) with IR could be the reason why more random rankings were obtained, whereas the more clinical conditions generated only one random ranking. It was found in [5] that radiologists rate BR69 and BR32 worse than BL57 when evaluating anonymized and randomized single images, so we expected a reliable ranking even with varying kernels. Potentially, the ranking test method is more challenging for radiologists to answer than a dichotomous categorical assessment or a multiple-stage categorical assessment.

As suspected, the inter-observer reliability shows poor to slight agreement with regard to the ranking procedure. This is an indicator of individual subjective judgments. Negative values of Fleiss $\kappa$ may indicate systematic errors. For instance, the ranking tasks were related only to the detection of lung foci and no strict assessment criteria were applied. Owing to their fields of expertise, two radiologists evaluated according to their own criteria, which could have a systematic impact on Fleiss $\kappa$. Other potential errors may be due to the use of a non-standardized monitor. For the intra-observer reliability, the observed agreement $B(\kappa_{ney})$ has wide variability. All observers confidently selected rank positions 1 and 4, i.e. the best and the worst image. The intra-observer reliability for the intermediate rank positions 2 and 3 was significantly lower.

In conclusion, our ranking tests suggest a radiological preference for medium to high iteration strengths, especially $S_4$, for lung foci detection. We hypothesise that in subjective image quality analysis the assessment habits of radiologists are dichotomous or multiple-stage categorical. An investigation of the correlation between diagnostic experience and the subjective perception of iteratively reconstructed CT images is mandatory.

Author Statement
Research funding: The authors state no funding involved. Conflict of interest: Authors state no conflict of interest.

References
[1] De Koning H, Van Der Aalst C, Ten Haaf K, et al. Effects of volume CT lung cancer screening: Mortality results of the NELSON randomized-controlled population based trial. Journal of Thoracic Oncology 2018;13:185.
[2] Center for Statistical Sciences, Brown University, Providence, United States of America. The National Lung Screening Trial: Overview and Study Design. Radiology 2011;258:243-253.
[3] Siemens Healthineers, Erlangen, Germany.
[4] Beister M, Kolditz D, Kalender W A. Iterative reconstruction methods in X-ray CT. Physica Medica 2012;28:94-108.
[5] König B, Guberina N, Kühl H, Zylka W. Design and first results of a phantom study on the suitability of iterative reconstruction for lung-cancer screening with low-dose computer tomography. Biomedical Engineering 2018;5:593-596.
[6] Landis J R, Koch G G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977;33:159-174.
[7] Kutschmann M. Private communication 2020.
[8] Wirtz M, Kutschmann M. Analyse der Beurteilerübereinstimmung für Kategoriale Daten mittels Cohens Kappa und alternativer Maße. Rehabilitation 2007;46:1-8.
[9] McHugh M L. Interrater reliability: the kappa statistic. Biochemia Medica 2012;22:276-282.
Current Directions in Biomedical Engineering – de Gruyter
Published: Sep 1, 2020
Keywords: CT; iterative reconstruction; lung nodule detection; inter- and intra-observer reliability; low-dose; image quality; phantom
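
The rank-sum and Friedman comparison described in the evaluation methods of the paper can be illustrated with a short sketch. It is not the authors' implementation: the function names, the input layout and the example rankings are hypothetical, and it assumes the standard tie-free Friedman statistic $F = \frac{12}{n k (k+1)} \sum_j R_j^2 - 3 n (k+1)$, where $R_j$ are the rank sums. Only the nine 4GUIs per set, the $k = 4$ attributes and the critical value 7.81 come from the text.

```python
# Sketch only: rankings below are made-up examples, not the study's data.

CRITICAL_VALUE = 7.81  # from the text: alpha = 5 %, k = 4 attributes


def friedman_statistic(rankings):
    """Return (rank_sums, f_value) for a list of complete rankings.

    rankings: one list per 4GUI holding the rank (1..k) given to each
    attribute, e.g. [rank of S1, rank of S3, rank of S4, rank of S5].
    Assumes every list is a tie-free permutation of 1..k.
    """
    n = len(rankings)        # number of 4GUIs in the set
    k = len(rankings[0])     # number of ranked attributes
    rank_sums = [sum(block[j] for block in rankings) for j in range(k)]
    f_value = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    return rank_sums, f_value


def is_random(f_value, critical=CRITICAL_VALUE):
    """A ranking counts as random if F stays below the critical value."""
    return f_value < critical


# Hypothetical observer who ranks all nine 4GUIs identically.
consistent = [[4, 3, 1, 2]] * 9
rs_c, f_c = friedman_statistic(consistent)

# Hypothetical observer whose ranks rotate from one 4GUI to the next.
scattered = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3]] * 2 \
    + [[1, 2, 3, 4]]
rs_s, f_s = friedman_statistic(scattered)

print(rs_c, round(f_c, 2), is_random(f_c))
print(rs_s, round(f_s, 2), is_random(f_s))
```

With $n = 9$ and $k = 4$ this statistic cannot exceed $n(k-1) = 27$, the value reached when all nine rankings are identical; the rotating example stays far below 7.81 and would be classified as random.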