Stochastic tissue window normalization of deep learning on computed tomography

Stochastic Tissue Window Normalization of Deep Learning on CT

Yuankai Huo (1*), Yucheng Tang (1), Yunqiang Chen (2), Dashan Gao (2), Shizhong Han (2), Shunxing Bao (1), Smita De (3), James G. Terry (4), J. Jeffery Carr (4), Richard G. Abramson (5), and Bennett A. Landman (1,4)

Vanderbilt University, Department of Electrical Engineering and Computer Science
12 Sigma Technologies
Cleveland Clinic
Vanderbilt University Medical Center, Department of Radiology

Abstract: Tissue window filtering has been widely used in deep learning for computed tomography (CT) image analyses to improve training performance (e.g., soft tissue windows for abdominal CT). However, the effectiveness of tissue window normalization is questionable, since it might harm the generalizability of the trained model, especially when such models are applied to new cohorts with different CT reconstruction kernels, contrast mechanisms, dynamic variations in the acquisition, and physiological changes. In this paper, we evaluate training both with and without soft tissue window normalization on multi-site CT cohorts. Moreover, we propose a new stochastic tissue window normalization (SWN) method to improve the generalizability of tissue window normalization. Different from naïve random sampling, the SWN method centers the randomization around the soft tissue window to maintain the specificity for abdominal organs. To evaluate the performance of different strategies, 80 training and 453 validation and testing scans from six datasets are employed to perform multi-organ segmentation using a standard 2D U-Net. The six datasets cover the scenarios where the training and testing scans are from (1) same scanner and same population, (2) same CT contrast but different pathology, and (3) different CT contrast and pathology. The traditional soft tissue window and non-windowed approaches achieved better performance on (1). The proposed SWN achieved generally superior performance on (2) and (3) with statistical analyses, which offers better generalizability for a trained model.

Index Terms: Tissue Window, CT, Deep Learning, Segmentation

I. INTRODUCTION

Computed tomography (CT) is a quantitative imaging technique that produces imaging intensities normalized in Hounsfield Units (HU) (e.g., air as -1000 HU, water as 0 HU). The quantitative meaning of intensity units allows clinical practitioners to define typical window ranges (e.g., the range of intensities to display) to enhance the visual contrasts for particular tissues or organs by applying tissue windows [1]. A tissue window is an intensity band-pass filter, which only keeps the intensities within the band and censors the intensities beyond the maximal/minimal values. The band is commonly decided according to the HU of the target organ. For instance, a lung window (-1150<HU<350) is typically applied to investigate lung images [2], and a soft tissue window (-160<HU<240) is commonly employed to enhance the image contrast for abdominal organs [1]. Tissue windows not only improve the image contrast for human visualization [3] but also filter out texture/noise in unrelated tissues, organs, and background.

In recent years, the tissue window filtering process has been widely adopted in deep learning methods for CT image analyses [4-7]. The rationale of using tissue window normalization (preprocessing) is to get rid of the unnecessary information before the machine learning stage, which enhances the specificity of the trained deep learning model. The “specificity” in this study refers to the performance of deploying a trained deep learning network on testing data with the same imaging acquisition as the training data. The underlying hypothesis is that the HU values are standardized and homogeneous across different cohorts. However, this hypothesis might not always be valid for some imaging scenarios, including but not limited to (1) different CT hardware, (2) potential confounds of CT reconstruction kernels, (3) different contrast-enhanced CT imaging, (4) dynamic variations in acquisition, and (5) physiological changes. As a result, the generalizability of a model trained with a fixed tissue window might be degraded when it is applied to heterogeneous clinical CT scans (Figure 1). The “generalizability” in this study is defined as the performance of deploying a trained deep learning network on testing data with a different imaging acquisition from the training data.

Fig. 1. The soft tissue window normalization works well when the distribution of the testing scan (Testing A) matches the training scan. However, the performance might be degraded on the testing scan (Testing B) with different CT contrast. The mechanism of modifying the contrast is to apply a soft tissue window (-160<HU<240) on the raw CT scans.

In this paper, we investigate the effectiveness of standard soft tissue window normalization (STN) for the canonical multi-organ segmentation task compared with the whole intensity range (WIR, without using tissue windows). Moreover, we propose a new stochastic tissue window normalization (SWN) method to improve the generalizability upon STN. Different from naively using random windows, we limit the window variations to be centered around the soft tissue window to improve specificity.

Eighty portal venous phase CT scans with healthy organs are used to train a standard 2D U-Net from [8]. Then, 20 scans from the same cohort and 433 scans from different cohorts are used to evaluate the effectiveness of STN, WIR and SWN, which covers the scenarios where the training and testing scans are from (1) same scanner and same population, (2) same CT contrast but different pathology, and (3) different CT contrast and pathology.

II. METHOD

A. Stochastic Tissue Window Normalization

Fig. 2. The workflow of deploying the proposed stochastic tissue window normalization (SWN) to train a standard 2D U-Net segmentation network.

Figure 2 demonstrates the principle of training an organ segmentation network using SWN, which randomly samples the window size and location beyond the STN. A tissue window is determined by two parameters: window level (center) and window size [1]. Instead of only pursuing generalizability by naively sampling random windows, we force the randomly sampled windows to be centered around the soft tissue window to maintain the specificity. To achieve that, we used the soft tissue window (window level L = 40, half window size W = 200) as the center of the random sampling. The pseudo code of the proposed SWN method is provided in Figure 3. Briefly, we employed two Gaussian distributions to add variability upon the soft tissue window. The new windows are randomly sampled from the following two Gaussian distributions,

$$L_i \sim \mathrm{Gaussian}(\mu = 40, \sigma = x) \tag{1}$$

$$W_i \sim \mathrm{Gaussian}(\mu = 200, \sigma = y) \tag{2}$$

where x and y are the two coefficients that control the variability of the random windows. In the paper, we use the format “[x, y]” to show the values of x and y for any experiments performed by SWN. During training, a 2D training image slice I_i is normalized by the sampled window with the following steps:

$$I_i(I_i > (L_i + W_i)) = (L_i + W_i)$$

$$I_i(I_i < (L_i - W_i)) = (L_i - W_i)$$

$$I_i = \frac{I_i - (L_i - W_i)}{2 W_i} \tag{3}$$

Note that L_i and W_i are different for each input during training, since they are randomly sampled from the aforementioned two Gaussian distributions. For WIR, the intensities within the whole major intensity range (-1000<HU<1000) are normalized for training without applying any tissue windows. In the testing stage, we preprocess every testing scan using the standard soft tissue window for STN and SWN, while not using such a window for WIR.

Fig. 3. Pseudo-Code of the Stochastic Tissue Window Normalization (SWN). The terms are defined based on Eq (1), (2), and (3).

B. Multi-organ Segmentation Network

To evaluate the effectiveness of using tissue window normalization, we keep the training network and processing standardized. The canonical 2D U-Net [8] is employed as the base network. The same data augmentation stages (random cropping, padding, rotation, translation) are performed to enhance the spatial generalizability. First, all input CT image voxels are converted to floating point numbers with 32 bits (“float”). Then all the input 2D CT images (after windowing and preprocessing) are further normalized to 0 to 255 (“float”) with resolution 512 × 512. The number of input channels is one, while the number of output channels is eight (including background, spleen, right kidney, left kidney, liver, stomach, pancreas, body mask). The Adam optimizer [9] with learning rate 0.00001 is used with a batch size of six. Weighted cross-entropy is used as the loss function, whose weights for the eight channels are [1, 10, 10, 10, 5, 10, 10, 1]. The models are trained for a maximum of 100 epochs. When training each epoch, every image is windowed once, across different windowing methods. The level and window size are randomly decided each time when using the proposed SWN. Therefore, the windows are different, even for the same image across different epochs. During the testing stage, the standard soft tissue window (without randomness) is used for SWN to have a fair comparison with the STN method. The learning rate, epoch number, and the weights were optimized from internal validation and were applied to all testing cohorts consistently. Notably, the same hyper-parameters are used across all experiments, except the tissue window normalization.

C. Training and Validation Data (Same Scanner and Population)

MLBCV (Multi-organ): 100 abdominal CT scans were obtained from the MICCAI 2015 Multi-atlas Labeling Beyond the Cranial Vault (MLBCV) challenge [10]. The data were acquired from portal venous phase CT modality with variable volume sizes (512 × 512 × 33 to 512 × 512 × 158) and fields of view (approx. 300 × 300 × 250 mm³ to 500 × 500 × 700 mm³). The in-plane resolution varies from 0.54 × 0.54 mm² to 0.98 × 0.98 mm². Among 100 scans, 80 were used as training while the remaining 20 were used as validation. Six organs (spleen, right kidney, left kidney, liver, stomach, pancreas) from MLBCV are used as training targets.

III. DATA

Figure 4 summarizes the six datasets used in this study. All datasets were acquired in deidentified form under institutional review board approval.

Fig. 4. Summary of training, validation, and testing cohorts

A. Testing Data (Same CT Contrast but Different Pathology)

Decathlon (Pancreas): 282 abdominal CT scans with manual pancreas segmentation were obtained from the MICCAI 2018 Medical Segmentation Decathlon (Pancreas Tumor) dataset. The data were acquired from portal venous phase CT modality. The details of the data can be found at http://medicaldecathlon.com.

LiTS (Liver): 131 abdominal CT scans with manual liver segmentation were obtained from the Liver Tumor Segmentation (LiTS) Challenge. The data were acquired from portal venous phase CT modality. The details of the data can be found at https://competitions.codalab.org/competitions/17094.

FNH (Liver): 8 abdominal CT scans with manual liver segmentation were internally acquired from patients with a Focal Nodular Hyperplasia (FNH) lesion. The data were acquired from contrast-enhanced portal venous phase CT modality with in-plane image size 512 × 512 and resolution from 0.5 mm to 0.8 mm. The slice thickness is 5 mm.

B. Testing Data (Different CT Contrast and Pathology)

AADHS (Liver): 5 abdominal CT scans with fatty liver diagnosis and manual liver segmentations were obtained from the African American-Diabetes Heart Study (AADHS) dataset. The data were acquired from non-contrast CT modality with in-plane image size 512 × 512. The details of the data can be found at [11].

Delayed (Kidneys): 5 abdominal CT scans with manual left and right kidney segmentation were acquired internally with excretory phase sequences. The scans were performed in the prone position at an 8 min delay per institutional protocol with 3 mm axial reconstructions.

IV. SIMULATION

Specificity and Generalizability Analysis. The 20 validation CT scans were used to evaluate the specificity and generalizability of STN, WIR, and SWN. To test the specificity and generalizability, we performed a simulation, which adds or subtracts constant values on the 20 validation scans (from -300 to +300 in steps of 25 HU). That experiment simulates the intensity variations in testing data when applying the trained model. The 20 validation CT scans were used since the data were acquired from the same scanner as the training data. Therefore, the spatial effects are minimized and the difference in performance comes solely from the global variations in intensities. Figure 5 shows the variations of segmentation performance on six organs with the changes in raw intensities.

Fig. 5. This figure shows the specificity and generalizability of STN, WIR, and SWN. To test the different tissue window normalization strategies, constant values have been added to or subtracted from the testing scans, which were then fed into the same network. The color indicates the mean Dice values across 20 validation scans for each organ. The width of the yellow color range in each row shows the generalizability, while the brightness indicates the specificity. The proposed SWN has better generalizability compared with STN, and better specificity compared with WIR.

V. EMPIRICAL VALIDATION

The 20 MLBCV scans are used to evaluate the performance of different window normalization strategies for the scenario in which the training and testing scans are from the “same scanner and population”. The Dice similarity coefficient (DSC) has been used as the metric to show the segmentation accuracy. The Decathlon, LiTS, and FNH cohorts are employed to evaluate the performance of different window normalization strategies for the scenario in which the training and testing scans are from “same CT contrast but different pathology”. The AADHS and Delayed cohorts are employed to evaluate the performance of different window normalization strategies for the scenario in which the training and testing scans are from “different CT contrast and pathology”.

A. Internal Validation (MLBCV)

The qualitative and quantitative results of 20 MLBCV validation scans are shown in Figure 6 and 7 respectively. The detailed measurements of six labels are presented in Table 1. As the training and validation datasets are from the same cohort and the same scanner, the intensities of training scans and testing scans are homogeneous. Therefore, the canonical STN or WIR methods achieved superior performance in either median DSC or mean DSC for all six organs. In Table 1, the best DSC results are marked as bold. Briefly, greater median and mean DSC indicate better segmentation performance relative to the manual segmentations. A smaller standard deviation (STD) of DSC means the segmentation performance varies less and is more consistent across the cases. The symbol “―” indicates that the difference between the corresponding method and the reference method (“Ref.”) is not significant. The symbols “↑” and “↓” mean significantly higher and lower respectively using the Wilcoxon signed rank test with p<0.05. The symbol “*” means the false discovery rate (FDR) corrected p value within the corresponding abdominal organ is < 0.05, with number of comparisons = 12 for each organ.

B. External Validation on Same Imaging Protocol

We group the results of Decathlon, LiTS, and FNH as the external validation results on the same CT modality, since these datasets were acquired with the same imaging protocol (portal venous phase) as the training datasets but from different sites. The qualitative and quantitative results of different methods are presented in Figure 8 and 9. The corresponding detailed measurements are provided in Table 2. When applying the trained model to external validation datasets with the same imaging protocol but different sites and pathologies, the proposed SWN method achieved superior performance compared with the canonical STN and WIR methods.

C. External Validation on Different Imaging Protocol

The model trained on portal venous phase CT scans is evaluated using the non-contrast CT scans (AADHS) and delayed phase CT scans (Delayed). In this scenario, the HU intensities of livers in AADHS are systematically different from the training data. Meanwhile, the HU intensities of kidneys in Delayed are systematically different from the training data. Therefore, the intensities of target organs in the training and testing datasets are heterogeneous. The qualitative and quantitative results are presented in Figure 10 and 11. The corresponding detailed measurements are provided in Table 3. From the results, the proposed SWN method achieved superior performance compared with the canonical STN and WIR methods.

VI. CONCLUSION AND DISCUSSION

We evaluate the effectiveness of both tissue window normalization and non-windowed methods for deep learning on CT organ segmentation tasks. The soft tissue window typically yields superior performance on segmenting smaller and more challenging organs (pancreas and stomach). Meanwhile, training without tissue window techniques achieved superior performance on larger and easier organs (liver and spleen).

From internal validation (training and testing data are from the same scanner and population), the STN and WIR achieved overall better segmentation performance (Figure 7 and Table 1). We propose a new stochastic tissue window normalization method and evaluate the STN, WIR and SWN methods using simulation (Figure 5) and different external testing cohorts. According to the absolute differences in Dice values (highlighted in bold), the proposed SWN method achieved generally better Dice scores when evaluated on the testing scans acquired from a different scanner but the same contrast (Figure 9 and Table 2). When evaluated on the testing scans acquired from different modalities and different pathologies (Figure 11 and Table 3), the proposed SWN method also achieved generally superior Dice values compared with STN and WIR. The proposed SWN provided better generalizability of a trained model while preserving the specificity compared with STN and WIR.

The standard Wilcoxon signed-rank test (highlighted with colors) is used for the statistical analyses in the study. When the training and testing scans are restricted to the same scanner, protocol, and patient population (Table 1), the proposed method does not demonstrate improved benchmarks compared with the standard methods. This means the simple standard intensity normalization methods are more suitable for internal validation. But in the real world, we typically would like to train a more generalizable deep learning model, which can be applied directly to different cohorts and populations (Table 2 and 3). Under such external validation scenarios, the generalizability of the trained model is essential, especially since the number of available training cases is typically small for medical imaging applications. The proposed method achieves overall superior performance when the testing and training cohorts are more heterogeneous, which improves the segmentation performance of the trained models on different testing imaging protocols. As a more restrictive analysis, FDR correction is applied to correct the original p-values for multiple comparisons (highlighted with “*”). After FDR correction, the differences for MLBCV-spleen, MLBCV-stomach (Table 1), FNH-liver (Table 2), AADHS-liver, Delayed-left kidney and Delayed-right kidney (Table 3) are not significant. The non-significant comparisons in Table 2 and 3 are likely due to the relatively small sizes of the available cohorts (i.e., 5 to 8 patients).

The standard 2D U-Net is employed as the segmentation network to evaluate the performance of using tissue windows. While this combination is successful, we do not claim optimality of using 2D U-Net. Achieving the best possible segmentation network is not the major aim of this work. In the future, it would also be interesting to have the organs from different contrasts labeled by different human experts. In that case, the inter-rater reliability could be calculated and used to evaluate the automatic segmentation against human variability.

The proposed method is validated on the soft tissue window. However, other types of tissue windows (e.g., lung, cardiac, liver window etc.) have also been widely used in different applications. Theoretically, the stochastic tissue window would also improve the generalizability of deep networks for such applications. Therefore, it would be useful to extend and validate the proposed method on such applications in the future. Another limitation of the proposed window-based normalization is that it sacrifices the physical information behind the HU standardization.

The authors of the paper are directly employed by the institutes or companies provided in this paper. This research was supported by NSF CAREER 1452485, NIH grants 5R21EY024036, R01EB017230, 1R21NS064534, 1R03EB012461, R01 DK113980, 6R01 DK112262. Yuankai Huo, Yucheng Tang, Richard G. Abramson, and Bennett A. Landman are supported by the Vanderbilt-12 Sigma Research Grant (Huo/Abramson/Landman). Richard G. Abramson is also receiving partial support from 2U01CA142565 and P30 CA068485. J. Jeffery Carr and James G. Terry are in part supported by R01 DK113980, DK Locke (PI) (09/01/17-08/31/22) “CKD risk prediction among obese living kidney donors”. This project will evaluate novel biomarkers of risk as relates to obese living kidney donors. J. Jeffery Carr and James G. Terry are co-investigators on 6R01 DK112262, NIDDK Koethe (PI) (02/01/17-01/31/22) “The role of adipose-resident T cells in HIV-associated glucose intolerance”. No conflicts of interest, financial or otherwise, are declared by Yunqiang Chen, Dashan Gao, Shizhong Han, Shunxing Bao, and Smita De. We thank Naiyun Zhou for helping organize part of the data.

Corresponding Author: Yuankai Huo, E-mail: yuankai.huo@vanderbilt.edu

SPIE Author Biography

First Author is a research assistant professor at the Vanderbilt University. He received his BS degree in Telecommunication Engineering from Nanjing University of Posts and Telecommunications in 2008, his MS degrees in Information and Telecommunication Engineering and Computer Science from the Southeast University and the Columbia University in 2011 and 2014 respectively, and his PhD degree in Electrical Engineering from the Vanderbilt University in 2018. He is the author of more than 50 journal and conference papers in medical image analysis. He is a member of SPIE.

REFERENCES

[1] K. Sahi, S. Jackson, E. Wiebe, G. Armstrong, S. Winters, R. Moore, et al., "The value of “liver windows” settings in the detection of small renal cell carcinomas on unenhanced computed tomography," Canadian Association of Radiologists Journal, vol. 65, pp. 71-76, 2014.
[2] radiantviewer. (2019). Changing brightness or contrast. Available: https://www.radiantviewer.com/dicom-viewer-manual/change_brightness_contrast.htm
[3] S. M. Pomerantz, C. S. White, T. L. Krebs, B. Daly, S. A. Sukumar, F. Hooper, et al., "Liver and bone window settings for soft-copy interpretation of chest and abdominal CT," American Journal of Roentgenology, vol. 174, pp. 311-314, 2000.
[4] K. Yan, L. Lu, and R. M. Summers, "Unsupervised body part regression using convolutional neural network with self-organization," arXiv preprint arXiv:1707.03891, 2017.
[5] H. Lee, F. M. Troschel, S. Tajmir, G. Fuchs, J. Mario, F. J. Fintelmann, et al., "Pixel-level deep segmentation: artificial intelligence quantifies muscle on computed tomography for body morphometric analysis," Journal of digital imaging, vol. 30, pp. 487-498, 2017.
[6] S. Dorn, S. Chen, S. Sawall, D. Simons, M. May, J. Maier, et al., "Organ-specific context-sensitive CT image reconstruction and display," in Medical Imaging 2018: Physics of Medical Imaging, 2018, p. 1057326.
[7] Y. Huo, Z. Xu, S. Bao, C. Bermudez, H. Moon, P. Parvathaneni, et al., "Splenomegaly Segmentation on Multi-modal MRI using Deep Convolutional Networks," IEEE transactions on medical imaging, 2018.
[8] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical image computing and computer-assisted intervention, 2015, pp. 234-241.
[9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[10] B. Landman, Z. Xu, J. Igelsias, M. Styner, T. Langerak, and A. Klein, "Multi-atlas labeling beyond the cranial vault," ed, 2015.
[11] D. W. Bowden, A. J. Cox, B. I. Freedman, C. E. Hugenschimdt, L. E. Wagenknecht, D. Herrington, et al., "Review of the Diabetes Heart Study (DHS) family of studies: a comprehensively examined sample for genetic and epidemiological studies of type 2 diabetes and its complications," The review of diabetic studies: RDS, vol. 7, p. 188, 2010.

Table 1. Segmentation performance on MLBCV

| | STN | WIR | SWN [10,10] | SWN [10,100] | SWN [50,50] | SWN [100,10] | SWN [100,100] |
|---|---|---|---|---|---|---|---|
| MLBCV - Spleen | | | | | | | |
| Median | 0.9438 | 0.9469 | 0.9444 | 0.9456 | 0.9437 | 0.9442 | 0.9469 |
| Mean | 0.9380 | 0.9407 | 0.9359 | 0.9391 | 0.9368 | 0.9399 | 0.9413 |
| Std | 0.0317 | 0.0294 | 0.0333 | 0.0247 | 0.0283 | 0.0194 | 0.0279 |
| p<0.05 | Ref. | ― | p=0.048 ↓ | ― | ― | ― | ― |
| p<0.05 | ― | Ref. | ― | ― | ― | ― | ― |
| MLBCV - Right Kidney | | | | | | | |
| Median | 0.9307 | 0.9274 | 0.9194 | 0.9252 | 0.9240 | 0.9288 | 0.9187 |
| Mean | 0.8898 | 0.8942 | 0.8894 | 0.8948 | 0.8995 | 0.8999 | 0.8936 |
| Std | 0.1231 | 0.1046 | 0.0966 | 0.0894 | 0.0801 | 0.0887 | 0.0859 |
| p<0.05 | Ref. | ― | * p=0.004 ↓ | ― | ― | ― | ― |
| p<0.05 | ― | Ref. | * p=0.006 ↓ | ― | ― | ― | ― |
| MLBCV - Left Kidney | | | | | | | |
| Median | 0.9360 | 0.9400 | 0.9364 | 0.9337 | 0.9251 | 0.9346 | 0.9132 |
| Mean | 0.8840 | 0.8859 | 0.8855 | 0.8801 | 0.8762 | 0.8835 | 0.8571 |
| Std | 0.2087 | 0.2096 | 0.2093 | 0.2083 | 0.2074 | 0.2089 | 0.2040 |
| p<0.05 | Ref. | ― | ― | ― | * p=0.003 ↓ | ― | * p<0.001 ↓ |
| p<0.05 | ― | Ref. | ― | ― | * p=0.003 ↓ | ― | * p<0.001 ↓ |
| MLBCV - Liver | | | | | | | |
| Median | 0.9633 | 0.9659 | 0.9662 | 0.9633 | 0.9648 | 0.9577 | 0.9634 |
| Mean | 0.9622 | 0.9646 | 0.9639 | 0.9613 | 0.9640 | 0.9575 | 0.9611 |
| Std | 0.0096 | 0.0086 | 0.0089 | 0.0110 | 0.0086 | 0.0127 | 0.0103 |
| p<0.05 | Ref. | * p=0.010 ↑ | * p=0.025 ↑ | ― | ― | * p<0.001 ↓ | ― |
| p<0.05 | * p=0.010 ↓ | Ref. | ― | * p=0.011 ↓ | ― | * p<0.001 ↓ | * p=0.010 ↓ |
| MLBCV - Stomach | | | | | | | |
| Median | 0.8528 | 0.8102 | 0.8412 | 0.8418 | 0.8348 | 0.8306 | 0.8325 |
| Mean | 0.8377 | 0.8029 | 0.8380 | 0.8327 | 0.8305 | 0.8234 | 0.8307 |
| Std | 0.0805 | 0.1052 | 0.0777 | 0.0995 | 0.0891 | 0.0944 | 0.0845 |
| p<0.05 | Ref. | p=0.019 ↓ | ― | ― | ― | ― | ― |
| p<0.05 | p=0.019 ↑ | Ref. | p=0.014 ↑ | p=0.019 ↑ | ― | ― | ― |
| MLBCV - Pancreas | | | | | | | |
| Median | 0.7620 | 0.7196 | 0.7453 | 0.7407 | 0.7234 | 0.7294 | 0.7336 |
| Mean | 0.7483 | 0.7030 | 0.7357 | 0.7344 | 0.7313 | 0.7215 | 0.7167 |
| Std | 0.1149 | 0.1140 | 0.1038 | 0.0886 | 0.1091 | 0.1279 | 0.1046 |
| p<0.05 | Ref. | * p=0.007 ↓ | ― | * p=0.007 ↓ | ― | * p=0.003 ↓ | * p=0.005 ↓ |
| p<0.05 | * p=0.007 ↑ | Ref. | * p=0.012 ↑ | * p=0.017 ↑ | * p=0.033 ↑ | ― | ― |

The best DSC results are marked as bold. The symbol “―” indicates that the difference between the corresponding method and the reference method (“Ref.”) is not significant. The symbols “↑” and “↓” mean significantly higher and lower respectively using the Wilcoxon signed rank test with p<0.05. “*” means the FDR corrected p value is also < 0.05.

Table 2. Performance on Testing Data (Same CT Contrast, Different Pathology)

| | STN | WIR | SWN [10,10] | SWN [10,100] | SWN [50,50] | SWN [100,10] | SWN [100,100] |
|---|---|---|---|---|---|---|---|
| Decathlon - Pancreas | | | | | | | |
| Median | 0.6908 | 0.6407 | 0.6996 | 0.6880 | 0.6933 | 0.6870 | 0.6972 |
| Mean | 0.6480 | 0.6009 | 0.6714 | 0.6612 | 0.6607 | 0.6590 | 0.6665 |
| Std | 0.1639 | 0.1645 | 0.1432 | 0.1403 | 0.1507 | 0.1467 | 0.1416 |
| p<0.05 | Ref. | * p<0.001 ↓ | * p<0.001 ↑ | ― | * p=0.034 ↑ | ― | * p=0.011 ↑ |
| p<0.05 | * p<0.001 ↑ | Ref. | * p<0.001 ↑ | * p<0.001 ↑ | * p<0.001 ↑ | * p<0.001 ↑ | * p<0.001 ↑ |
| LiTS - Liver | | | | | | | |
| Median | 0.9396 | 0.9425 | 0.9414 | 0.9420 | 0.9439 | 0.9389 | 0.9425 |
| Mean | 0.9321 | 0.9294 | 0.9315 | 0.9351 | 0.9335 | 0.9288 | 0.9346 |
| Std | 0.0307 | 0.0472 | 0.0405 | 0.0300 | 0.0376 | 0.0398 | 0.0300 |
| p<0.05 | Ref. | ― | ― | * p=0.015 ↑ | * p=0.009 ↑ | * p=0.008 ↑ | * p=0.015 ↑ |
| p<0.05 | ― | Ref. | ― | ― | ― | * p=0.011 ↓ | ― |
| FNH - Liver | | | | | | | |
| Median | 0.9317 | 0.9395 | 0.9389 | 0.9430 | 0.9422 | 0.9408 | 0.9443 |
| Mean | 0.9295 | 0.9367 | 0.9386 | 0.9408 | 0.9423 | 0.9361 | 0.9399 |
| Std | 0.0264 | 0.0181 | 0.0138 | 0.0119 | 0.0139 | 0.0203 | 0.0166 |
| p<0.05 | Ref. | ― | ― | ― | p=0.008 ↑ | ― | p=0.016 ↑ |
| p<0.05 | ― | Ref. | ― | ― | ― | ― | ― |

Symbols and bold marking follow the notes of Table 1, including “*” for FDR corrected p < 0.05.

Table 3. Performance on Testing Data (Different CT Contrast and Pathology)

| | STN | WIR | SWN [10,10] | SWN [10,100] | SWN [50,50] | SWN [100,10] | SWN [100,100] |
|---|---|---|---|---|---|---|---|
| AADHS - Liver | | | | | | | |
| Median | 0.8290 | 0.8752 | 0.8983 | 0.9017 | 0.8924 | 0.8433 | 0.8892 |
| Mean | 0.7799 | 0.8214 | 0.8458 | 0.8811 | 0.8797 | 0.8084 | 0.8589 |
| Std | 0.1661 | 0.1248 | 0.1266 | 0.0559 | 0.0449 | 0.1433 | 0.1039 |
| p<0.05 | Ref. | ― | p<0.05 ↑ | ― | p<0.05 ↑ | p<0.05 ↑ | p<0.05 ↑ |
| p<0.05 | ― | Ref. | p<0.05 ↑ | ― | p<0.05 ↑ | ― | p<0.05 ↑ |
| Delayed - Right Kidney | | | | | | | |
| Median | 0.8847 | 0.8875 | 0.8678 | 0.9048 | 0.9084 | 0.8954 | 0.8905 |
| Mean | 0.8652 | 0.8673 | 0.8690 | 0.9035 | 0.9031 | 0.8921 | 0.8995 |
| Std | 0.0352 | 0.0719 | 0.0526 | 0.0341 | 0.0256 | 0.0281 | 0.0245 |
| p<0.05 | Ref. | ― | ― | p<0.05 ↑ | p<0.05 ↑ | ― | ― |
| p<0.05 | ― | Ref. | ― | ― | ― | ― | ― |
| Delayed - Left Kidney | | | | | | | |
| Median | 0.8755 | 0.8328 | 0.8491 | 0.8994 | 0.8841 | 0.8853 | 0.8936 |
| Mean | 0.8580 | 0.7898 | 0.8359 | 0.8987 | 0.8910 | 0.8818 | 0.8913 |
| Std | 0.0605 | 0.1010 | 0.0427 | 0.0334 | 0.0298 | 0.0272 | 0.0293 |
| p<0.05 | Ref. | p<0.05 ↓ | ― | ― | ― | ― | ― |
| p<0.05 | p<0.05 ↑ | Ref. | ― | ― | ― | ― | p<0.05 ↑ |

Symbols and bold marking follow the notes of Table 1. No FDR-corrected marks are reported in this table.

Fig. 6. The qualitative results of applying different intensity normalization strategies. The segmentation results of three scans with the lowest, median, and highest DSC (in SWN [50,50]) are presented for each experiment.

Fig. 7. The quantitative results of applying different intensity normalization strategies to the MLBCV dataset, which is from the “same scanner and same population” as training.

Fig. 8. The qualitative results of applying different intensity normalization strategies. The segmentation results of three scans with the lowest, median, and highest DSC (in SWN [50,50]) are presented for each experiment.

Fig. 9. The quantitative results of applying different intensity normalization strategies to Decathlon, LiTS, and FNH, which are from “same CT contrast, different pathology”.

Fig. 10. The qualitative results of applying different intensity normalization strategies on the AADHS and Delayed datasets. The segmentation results of three scans with the lowest, median, and highest DSC (in SWN [50,50]) are presented. The yellow and blue arrows indicate the key observations among different methods.

Fig. 11. The quantitative results of applying different intensity normalization strategies on the testing scans, which are from “different CT contrast and pathology” compared with training.

Electrical Engineering and Systems Science, arXiv (Cornell University)
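
As an illustration of the window sampling and normalization described by Eqs. (1) to (3), the following is a minimal Python sketch, not the authors' implementation. The function names, the use of plain nested lists for a 2D slice, and the lower bound placed on the sampled half window size are assumptions made here for a dependency-free example; the paper does not state how non-positive sampled window sizes are handled.

```python
import random


def sample_window(x, y, level_mean=40.0, half_width_mean=200.0, rng=random):
    """Draw one window (L_i, W_i) from the two Gaussians of Eqs. (1) and (2)."""
    level = rng.gauss(level_mean, x)
    half_width = rng.gauss(half_width_mean, y)
    # Assumption (not from the paper): keep the half width positive so the
    # division in Eq. (3) is always well defined.
    half_width = max(half_width, 1.0)
    return level, half_width


def window_normalize(image, level, half_width):
    """Apply Eq. (3) to a 2D slice given as a list of rows of HU values.

    Rescaling first and then clamping to [0, 1] is equivalent to clipping
    the intensities to [L - W, L + W] and rescaling afterwards.
    """
    low = level - half_width
    scale = 2.0 * half_width
    return [[min(max((v - low) / scale, 0.0), 1.0) for v in row]
            for row in image]


def swn_normalize(image, x, y, training=True, rng=random):
    """Random window during training, fixed soft tissue window during testing."""
    if training:
        level, half_width = sample_window(x, y, rng=rng)
    else:
        level, half_width = 40.0, 200.0  # standard soft tissue window
    return window_normalize(image, level, half_width)
```

Calling `swn_normalize(slice_hu, 50, 50)` would correspond to the setting written as SWN [50,50]. The output lies in [0, 1]; the paper additionally scales network inputs to the range 0 to 255, which would be a multiplication by 255 under this sketch.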
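
The intensity-shift simulation (constant offsets from -300 to +300 HU in steps of 25) and the Dice similarity coefficient used for evaluation can be sketched in the same spirit. This is only an illustrative outline under stated assumptions: `segment` is a hypothetical callable standing in for the trained network (it takes a 2D slice and returns a flat list of labels), and treating two empty masks as perfect agreement is a convention chosen here, not one stated in the paper.

```python
def dice(pred, truth, label):
    """Dice similarity coefficient for one label over two flat label lists."""
    n_pred = sum(1 for a in pred if a == label)
    n_truth = sum(1 for b in truth if b == label)
    overlap = sum(1 for a, b in zip(pred, truth) if a == label and b == label)
    if n_pred + n_truth == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * overlap / (n_pred + n_truth)


def shift_offsets(low=-300, high=300, step=25):
    """Constant HU offsets used in the simulation, both endpoints included."""
    return list(range(low, high + step, step))


def shift_scan(image, offset):
    """Add a constant HU offset to every voxel of a 2D slice."""
    return [[v + offset for v in row] for row in image]


def dice_under_shifts(image, truth, label, segment, offsets=None):
    """Dice of one label for each global intensity offset.

    `segment` is a hypothetical stand-in for the trained segmentation model.
    """
    results = {}
    for offset in (offsets if offsets is not None else shift_offsets()):
        pred = segment(shift_scan(image, offset))
        results[offset] = dice(pred, truth, label)
    return results
```

Averaging such per-offset Dice values over the validation scans for each organ would give one row of a map like the one described for Figure 5.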

ISSN
2329-4302
eISSN
ARCH-3348
DOI
10.1117/1.JMI.6.4.044005

Abstract

Stochastic Tissue Window Normalization of Deep Learning on CT

Yuankai Huo (1*), Yucheng Tang (1), Yunqiang Chen (2), Dashan Gao (2), Shizhong Han (2), Shunxing Bao (1), Smita De (3), James G. Terry (4), J. Jeffery Carr (4), Richard G. Abramson (5), and Bennett A. Landman (1,4)

Vanderbilt University, Department of Electrical Engineering and Computer Science; 12 Sigma Technologies; Cleveland Clinic; Vanderbilt University Medical Center, Department of Radiology

Abstract: Tissue window filtering has been widely used in deep learning for computed tomography (CT) image analyses to improve training performance (e.g., soft tissue windows for abdominal CT). However, the effectiveness of tissue window normalization is questionable since the generalizability of the trained model might be further harmed, especially when such models are applied to new cohorts with different CT reconstruction kernels, contrast mechanisms, dynamic variations in the acquisition, and physiological changes. In this paper, we evaluate training both with and without soft tissue window normalization on multi-site CT cohorts. Moreover, we propose a new stochastic tissue window normalization (SWN) method to improve the generalizability of tissue window normalization. Different from naïve random sampling, the SWN method centers the randomization around the soft tissue window to maintain the specificity for abdominal organs. To evaluate the performance of different strategies, 80 training and 453 validation and testing scans from six datasets are employed to perform multi-organ segmentation using a standard 2D U-Net. The six datasets cover the scenarios where the training and testing scans are from (1) same scanner and same population, (2) same CT contrast but different pathology, and (3) different CT contrast and pathology. The traditional soft tissue window and non-windowed approaches achieved better performance on (1). The proposed SWN achieved generally superior performance on (2) and (3) with statistical analyses, which offers better generalizability for a trained model.

Index Terms: Tissue Window, CT, Deep Learning, Segmentation

The authors of the paper are directly employed by the institutes or companies provided in this paper. This research was supported by NSF CAREER 1452485, NIH grants 5R21EY024036, R01EB017230, 1R21NS064534, 1R03EB012461, R01 DK113980, 6R01 DK112262. Yuankai Huo, Yucheng Tang, Richard G. Abramson, and Bennett A. Landman are supported by the Vanderbilt-12 Sigma Research Grant (Huo/Abramson/Landman). Richard G. Abramson is also receiving partial support from 2U01CA142565 and P30 CA068485. J. Jeffery Carr and James G. Terry are in part supported by R01 DK113980, DK Locke (PI) (09/01/17-08/31/22) “CKD risk prediction among obese living kidney donors”. This project will evaluate novel biomarkers of risk as relates to obese living kidney donors. J. Jeffery Carr and James G. Terry are co-Investigators on 6R01 DK112262, NIDDK Koethe (PI) (02/01/17-01/31/22) “The role of adipose-resident T cells in HIV-associated glucose intolerance”. No conflicts of interest, financial or otherwise, are declared by Yunqiang Chen, Dashan Gao, Shizhong Han, Shunxing Bao, and Smita De. We thank Naiyun Zhou for helping organize part of the data. Corresponding Author: Yuankai Huo, E-mail: yuankai.huo@vanderbilt.edu

I. INTRODUCTION

Computed tomography (CT) is a quantitative imaging technique that produces imaging intensities normalized in Hounsfield Units (HU) (e.g., air as -1000 HU, water as 0 HU). The quantitative meaning of intensity units allows clinical practitioners to define typical window ranges (e.g., the range of intensities to display) to enhance the visual contrasts for particular tissues or organs by applying tissue windows [1]. A tissue window is an intensity band-pass filter, which only keeps the intensities within the band and censors the intensities beyond the maximal/minimal values. The band is commonly decided according to the HU of the target organ. For instance, a lung window (-1150<HU<350) is typically applied to investigate lung images [2], and a soft tissue window (-160<HU<240) is commonly employed to enhance the image contrast for abdominal organs [1]. Tissue windows not only improve the image contrast for human visualization [3] but also filter out texture/noise in unrelated tissues, organs, and background.

Fig. 1. The soft tissue window normalization works well when the distribution of the testing scan (Testing A) matches the training scan. However, the performance might be degraded on the testing scan (Testing B) with different CT contrast. The mechanism of modifying the contrast is to apply a soft tissue window (-160<HU<240) on the raw CT scans.

In recent years, the tissue window filtering process has been widely adapted to deep learning methods on CT image analyses [4-7]. The rationale of using tissue window normalization (preprocessing) is to get rid of the unnecessary information before the machine learning stage, which enhances the specificity of the trained deep learning model. The “specificity” in this study refers to the performance of deploying a trained deep learning network on testing data with the same imaging acquisition as the training data. The underlying hypothesis is that the HU values are standardized and homogeneous across different cohorts. However, this hypothesis might not always be valid for some imaging scenarios, including but not limited to (1) different CT hardware, (2) potential confounds of CT reconstruction kernels, (3) different contrast-enhanced CT imaging, (4) dynamic variations in acquisition, (5) physiological changes, et cetera. As a result, the generalizability of a model trained using a fixed tissue window might be degraded when it is applied to heterogeneous clinical CT scans (Figure 1). The “generalizability” in this study is defined as the performance of deploying a trained deep learning network on testing data with different imaging acquisition from the training data.

In this paper, we investigate the effectiveness of standard soft tissue window normalization (STN) for the canonical multi-organ segmentation task compared with the whole intensity range (WIR, without using tissue windows). Moreover, we propose a new stochastic tissue window normalization (SWN) method to improve generalizability over STN. Different from naively using random windows, we limit the window variations to be centralized around the soft tissue window to improve specificity.

Eighty non-contrast CT scans with healthy organs are used to train a standard 2D U-Net from [8]. Then, 20 scans from the same cohort and 433 scans from different cohorts are used to evaluate the effectiveness of STN, WIR and SWN, which covers the scenarios where the training and testing scans are from (1) same scanner and same population, (2) same CT contrast but different pathology, and (3) different CT contrast and pathology.

II. METHOD

A. Stochastic Tissue Window Normalization

Figure 2 demonstrates the principle of training an organ segmentation network using SWN, which randomly samples the window size and location beyond the STN. A tissue window is determined by two parameters: window level (center) and window size [1]. Instead of only pursuing generalizability by naively sampling random windows, we force the randomly sampled windows to be centered around the soft tissue window to maintain the specificity. To achieve that, we used the soft tissue window (window level $L = 40$, half window size $W = 200$) as the center of the random sampling. The pseudo code of the proposed SWN method is provided in Figure 3. Briefly, we employed two Gaussian distributions to add variability upon the soft tissue window. The new windows are randomly sampled from the following two Gaussian distributions,

$L \sim \mathrm{Gaussian}(\mu = 40, \sigma = x)$   (1)
$W \sim \mathrm{Gaussian}(\mu = 200, \sigma = y)$   (2)

where $x$ and $y$ are the two coefficients that control the variabilities of the random windows. In the paper, we used the format “[x, y]” to show the values of $x$ and $y$ for any experiments performed by SWN. During the training, a 2D training image slice $I_i$ is normalized by the sampled window with the following steps:

$I_i(I_i > (L_i + W_i)) = (L_i + W_i)$
$I_i(I_i < (L_i - W_i)) = (L_i - W_i)$
$I_i = \frac{I_i - (L_i - W_i)}{2 W_i}$   (3)

Note that $L_i$ and $W_i$ are different for each input during training, and are randomly sampled from the aforementioned two Gaussian distributions. For WIR, the intensities within the whole major intensity range (-1000<HU<1000) are normalized for training without applying any tissue window. In the testing stage, we preprocess every testing scan using the standard soft tissue window for STN and SWN, while not using such a window for WIR.

Fig. 2. The workflow of deploying the proposed stochastic tissue window normalization (SWN) to train a standard 2D U-Net segmentation network.

Fig. 3. Pseudo-Code of the Stochastic Tissue Window Normalization (SWN). The terms are defined based on Eq (1), (2), and (3).

B. Multi-organ Segmentation Network

To evaluate the effectiveness of using tissue window normalization, we keep the training network and processing standardized. The canonical 2D U-Net [8] is employed as the base network. The same data augmentation stages (random cropping, padding, rotation, translation) are performed to enhance the spatial generalizability. First, all input CT image voxels are converted to floating point numbers with 32 bits (“float”). Then all the input 2D CT images (after windowing and preprocessing) are further normalized to 0 to 255 (“float”) with resolution 512 × 512. The number of input channels is one, while the number of output channels is eight (including background, spleen, right kidney, left kidney, liver, stomach, pancreas, body mask). The Adam optimizer [9] with learning rate 0.00001 is used with a batch size of six. Weighted cross-entropy is used as the loss function, whose weights of eight channels are [1, 10, 10, 10, 5, 10, 10, 1]. The models are trained with a maximum of 100 epochs. In each training epoch, every image is windowed once, for every windowing method. The level and window size are randomly decided each time when using the proposed SWN. Therefore, the windows are different, even for the same image across different epochs. During the testing stage, the standard soft-tissue window (without randomness) is used for SWN to have a fair comparison with the STN method. The learning rate, epoch number, and the weights were optimized from internal validation and were applied to all testing cohorts consistently. Notably, the same hyper-parameters are used across all experiments, except the tissue window normalization.

C. Training and Validation Data (Same Scanner and Population)

MLBCV (Multi-organ): 100 abdominal CT scans were obtained from the MICCAI 2015 Multi-atlas Labeling Beyond the Cranial Vault (MLBCV) challenge [10]. The data were acquired from portal venous phase CT modality with variable volume sizes (512 × 512 × 33 to 512 × 512 × 158) and field of views (approx. 300 × 300 × 250 mm³ to 500 × 500 × 700 mm³). The in-plane resolution varies from 0.54 × 0.54 mm² to 0.98 × 0.98 mm². Among 100 scans, 80 were used as training while the remaining 20 were used as validation. Six organs (spleen, right kidney, left kidney, liver, stomach, pancreas) from MLBCV are used as training targets.
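The window sampling of Eqs. (1) and (2) and the clip-and-rescale step of Eq. (3) in Section II.A can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' pseudo code from Figure 3: the function names are hypothetical, images are represented as nested lists of HU values, and the lower bound on the sampled half window size is an added assumption to keep the division in Eq. (3) well defined.

```python
import random

SOFT_TISSUE_LEVEL = 40.0        # window level L in HU, as stated in the text
SOFT_TISSUE_HALF_WIDTH = 200.0  # half window size W in HU, as stated in the text


def sample_window(x, y, rng=random):
    """Draw (L_i, W_i) from the two Gaussians of Eqs. (1) and (2)."""
    level = rng.gauss(SOFT_TISSUE_LEVEL, x)
    half_width = rng.gauss(SOFT_TISSUE_HALF_WIDTH, y)
    # Assumption (not in the source): keep the half width positive so the
    # division in Eq. (3) cannot fail or flip sign.
    half_width = max(half_width, 1.0)
    return level, half_width


def apply_window(image, level, half_width):
    """Clip HU values to [L - W, L + W], then rescale to [0, 1] as in Eq. (3)."""
    low, high = level - half_width, level + half_width
    return [[(min(max(v, low), high) - low) / (2.0 * half_width) for v in row]
            for row in image]


def swn_normalize(image, x, y, rng=random):
    """Training-time SWN: a fresh random window for every call."""
    level, half_width = sample_window(x, y, rng)
    return apply_window(image, level, half_width)


def stn_normalize(image):
    """Fixed soft tissue window, used for STN and for SWN at test time."""
    return apply_window(image, SOFT_TISSUE_LEVEL, SOFT_TISSUE_HALF_WIDTH)
```

For example, `swn_normalize(slice_hu, 50, 50)` would correspond to the setting written as SWN [50,50], while `stn_normalize(slice_hu)` applies the deterministic window. Multiplying the result by 255 would give the 0 to 255 range described in Section II.B.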
Fig. 4. Summary of training, validation, and testing cohorts.

III. DATA

A. Testing Data (Same CT Contrast but Different Pathology)

Figure 4 summarizes the six datasets used in this study. All datasets were acquired in deidentified form under institutional review board approval.

Decathlon (Pancreas): 282 abdominal CT scans with manual pancreas segmentation were obtained from the MICCAI 2018 Medical Segmentation Decathlon (Pancreas Tumor) dataset. The data were acquired from portal venous phase CT modality. The details of the data can be found at http://medicaldecathlon.com.

LiTS (Liver): 131 abdominal CT scans with manual liver segmentation were obtained from the Liver Tumor Segmentation (LiTS) Challenge. The data were acquired from portal venous phase CT modality. The details of the data can be found at https://competitions.codalab.org/competitions/17094.

FNH (Liver): 8 abdominal CT scans with manual liver segmentation were internally acquired from patients with Focal Nodular Hyperplasia (FNH) lesions. The data were acquired from contrast-enhanced portal venous phase CT modality with in-plane image size 512 × 512 and resolution from 0.5 mm to 0.8 mm. The slice thickness is 5 mm.

B. Testing Data (Different CT Contrast and Pathology)

AADHS (Liver): 5 abdominal CT scans with fatty liver diagnosis and manual liver segmentations were obtained from the African American-Diabetes Heart Study (AADHS) dataset. The data were acquired from non-contrast CT modality with in-plane resolution 512 × 512. The details of the data can be found at [11].

Delayed (Kidneys): 5 abdominal CT scans with manual left and right kidney segmentation were acquired internally with excretory phase sequences. The scans were performed in the prone position at an 8 min delay per institutional protocol with 3 mm axial reconstructions.

IV. SIMULATION

Specificity and Generalizability Analysis. The 20 validation CT scans were used to evaluate the specificity and generalizability of STN, WIR, and SWN. To test the specificity and generalizability, we performed a simulation, which adds or subtracts constant values on the 20 validation scans (from -300 to +300 in steps of 25 HU). That experiment simulates the intensity variations in testing data when applying the trained model. The 20 validation CT scans were used since the data were acquired from the same scanner as the training data. Therefore, the spatial effects are minimized and the difference in performance comes solely from the global variations in intensities. Figure 5 shows the variations of segmentation performance on six organs with the changes in raw intensities.

Fig. 5. This figure shows the specificity and generalizability of STN, WIR, and SWN. To test the different tissue window normalization strategies, constant values have been added to or subtracted from the testing scans, which are then fed into the same network. The color indicates the mean Dice values across 20 validation scans for each organ. The width of the yellow color range in each row shows the generalizability, while the brightness indicates the specificity. The proposed SWN has better generalizability compared with STN, and better specificity compared with WIR.

V. EMPIRICAL VALIDATION

The 20 MLBCV scans are used to evaluate the performance of different window normalization strategies for the scenario in which the training and testing scans are from the “same scanner and population”. The Dice similarity coefficient (DSC) has been used as the metric to show the segmentation accuracy. The Decathlon, LiTS, and FNH cohorts are employed to evaluate the performance of different window normalization strategies for the scenario in which the training and testing scans are from “same CT contrast but different pathology”. The AADHS and Delayed cohorts are employed to evaluate the performance of different window normalization strategies for the scenario in which the training and testing scans are from “different CT contrast and pathology”.

A. Internal Validation (MLBCV)

The qualitative and quantitative results of the 20 MLBCV validation scans are shown in Figures 6 and 7 respectively. The detailed measurements of six labels are presented in Table 1. As the training and validation datasets are from the same cohort and the same scanner, the intensities of training scans and testing scans are homogeneous. Therefore, the canonical STN or WIR methods achieved superior performance in either median DSC or mean DSC for all six organs. In Table 1, the best DSC results are marked as bold. Briefly, greater median and mean DSC indicate better segmentation performance with respect to the manual segmentations. A smaller standard deviation (STD) of DSC means the segmentation performance varies less and is more consistent across the cases. The symbol “―” indicates that the difference between the corresponding method and the reference method (“Ref.”) is not significant. The symbols “↑” and “↓” mean significantly higher and lower respectively using the Wilcoxon signed rank test with p<0.05. The symbol “*” means the false discovery rate (FDR) corrected p value within the corresponding abdominal organ is < 0.05, with number of comparisons = 12 for each organ.

B. External Validation on Same Imaging Protocol

We group the results of Decathlon, LiTS, and FNH as the external validation results on the same CT modality, since these datasets were acquired with the same imaging protocol (portal venous phase) as the training datasets but from different sites. The qualitative and quantitative results of different methods are presented in Figures 8 and 9. The corresponding detailed measurements are provided in Table 2. When applying the trained model to external validation datasets with the same imaging protocol but different sites and pathologies, the proposed SWN method achieved superior performance compared with the canonical STN and WIR methods.

C. External Validation on Different Imaging Protocol

The model trained on portal venous phase CT scans is evaluated using the non-contrast CT scans (AADHS) and delayed phase CT scans (Delayed). In this scenario, the HU intensities of livers in AADHS are systematically different from the training data. Meanwhile, the HU intensities of kidneys in Delayed are systematically different from the training data. Therefore, the intensities of the target organs in training and testing datasets are heterogeneous. The qualitative and quantitative results are presented in Figures 10 and 11. The corresponding detailed measurements are provided in Table 3. From the results, the proposed SWN method achieved superior performance compared with the canonical STN and WIR methods.

VI. CONCLUSION AND DISCUSSION

We evaluate the effectiveness of both tissue window normalization and non-windowed methods for deep learning on CT organ segmentation tasks. The soft tissue window typically yields superior performance on segmenting smaller and more challenging organs (pancreas and stomach). Meanwhile, training without tissue window techniques achieved superior performance on larger and easier organs (liver and spleen).

From internal validation (training and testing data are from the same scanner and population), the STN and WIR achieved overall better segmentation performance (Figure 7 and Table 1). We propose a new stochastic tissue window normalization method and evaluate the STN, WIR and SWN methods using simulation (Figure 5) and different external testing cohorts. According to the absolute differences in Dice values (highlighted in bold), the proposed SWN method achieved generally better Dice scores when evaluated on the testing scans acquired from a different scanner but the same contrast (Figure 9 and Table 2). When evaluated on the testing scans acquired from different modalities and different pathologies (Figure 11 and Table 3), the proposed SWN method also achieved generally superior Dice values compared with STN and WIR. The proposed SWN provided better generalizability of a trained model while preserving the specificity compared with STN and WIR.

The standard Wilcoxon signed-rank test (highlighted with colors) is used for the statistical analyses in the study. When the training and testing scans are restricted to the same scanner, protocol, and patient population (Table 1), the standard methods demonstrate improved benchmarks compared with the proposed method, which means that the simple standard intensity normalization methods are more appropriate for internal validation. But in the real world, we typically would like to train a more generalizable deep learning model, which can be applied directly to different cohorts and populations (Tables 2 and 3). Under such external validation scenarios, the generalizability of the trained model is essential, especially since the number of available training cases is typically small for medical imaging applications. The proposed method achieves overall superior performance when the testing and training cohorts are more heterogeneous, which improves the segmentation performance of the trained models on the different testing imaging protocols. Under the more restrictive criterion, FDR correction is applied to correct the original p-values for multiple comparisons (highlighted with “*”). After FDR correction, the differences for MLBCV-spleen, MLBCV-stomach (Table 1), FNH-liver (Table 2), AADHS-liver, Delayed-left kidney and Delayed-right kidney (Table 3) are not significant. The non-significant comparisons in Tables 2 and 3 are due to the relatively small sizes of the available cohorts (i.e., 5 to 8 patients).

The standard 2D U-Net is employed as the segmentation network to evaluate the performance of using tissue windows. While this combination is successful, we do not claim optimality of using 2D U-Net. Achieving a superior segmentation network is not the major aim of this work. In the future, it would also be interesting to have the organs from different contrasts labeled by different human experts. In that case, the inter-rater reliability could be calculated, which can be used to evaluate the automatic detection against human variability.

The proposed method is validated on the soft tissue window. However, other types of tissue windows (e.g., lung, cardiac, liver window etc.) have also been widely used in different applications. Theoretically, the stochastic tissue window would also improve the generalizability of deep networks for such applications. Therefore, it would be useful to extend and validate the proposed method on such applications in the future. Another limitation of the proposed window based normalization is that it sacrifices the physical information behind the HU standardization.

SPIE Author Biography

First Author is a research assistant professor at the Vanderbilt University. He received his BS degree in Telecommunication Engineering from Nanjing University of Posts and Telecommunications in 2008, his MS degrees in Information and Telecommunication Engineering and Computer Science from the Southeast University and the Columbia University in 2011 and 2014 respectively, and his PhD degree in Electrical Engineering from the Vanderbilt University in 2018. He is the author of more than 50 journal and conference papers in medical image analysis. He is a member of SPIE.

REFERENCES

[1] K. Sahi, S. Jackson, E. Wiebe, G. Armstrong, S. Winters, R. Moore, et al., "The value of “liver windows” settings in the detection of small renal cell carcinomas on unenhanced computed tomography," Canadian Association of Radiologists Journal, vol. 65, pp. 71-76, 2014.
[2] radiantviewer. (2019). Changing brightness or contrast. Available: https://www.radiantviewer.com/dicom-viewer-manual/change_brightness_contrast.htm
[3] S. M. Pomerantz, C. S. White, T. L. Krebs, B. Daly, S. A. Sukumar, F. Hooper, et al., "Liver and bone window settings for soft-copy interpretation of chest and abdominal CT," American Journal of Roentgenology, vol. 174, pp. 311-314, 2000.
[4] K. Yan, L. Lu, and R. M. Summers, "Unsupervised body part regression using convolutional neural network with self-organization," arXiv preprint arXiv:1707.03891, 2017.
[5] H. Lee, F. M. Troschel, S. Tajmir, G. Fuchs, J. Mario, F. J. Fintelmann, et al., "Pixel-level deep segmentation: artificial intelligence quantifies muscle on computed tomography for body morphometric analysis," Journal of digital imaging, vol. 30, pp. 487-498, 2017.
[6] S. Dorn, S. Chen, S. Sawall, D. Simons, M. May, J. Maier, et al., "Organ-specific context-sensitive CT image reconstruction and display," in Medical Imaging 2018: Physics of Medical Imaging, 2018, p. 1057326.
[7] Y. Huo, Z. Xu, S. Bao, C. Bermudez, H. Moon, P. Parvathaneni, et al., "Splenomegaly Segmentation on Multi-modal MRI using Deep Convolutional Networks," IEEE transactions on medical imaging, 2018.
[8] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical image computing and computer-assisted intervention, 2015, pp. 234-241.
[9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[10] B. Landman, Z. Xu, J. Igelsias, M. Styner, T. Langerak, and A. Klein, "Multi-atlas labeling beyond the cranial vault," ed, 2015.
[11] D. W. Bowden, A. J. Cox, B. I. Freedman, C. E. Hugenschimdt, L. E. Wagenknecht, D. Herrington, et al., "Review of the Diabetes Heart Study (DHS) family of studies: a comprehensively examined sample for genetic and epidemiological studies of type 2 diabetes and its complications," The review of diabetic studies: RDS, vol. 7, p. 188, 2010.

Table 1. Segmentation performance on MLBCV

        STN           WIR           SWN [10,10]   SWN [10,100]  SWN [50,50]   SWN [100,10]  SWN [100,100]
MLBCV - Spleen
Median  0.9438        0.9469        0.9444        0.9456        0.9437        0.9442        0.9469
Mean    0.9380        0.9407        0.9359        0.9391        0.9368        0.9399        0.9413
Std     0.0317        0.0294        0.0333        0.0247        0.0283        0.0194        0.0279
p<0.05  Ref.          ―             p=0.048 ↓     ―             ―             ―             ―
p<0.05  ―             Ref.          ―             ―             ―             ―             ―
MLBCV - Right Kidney
Median  0.9307        0.9274        0.9194        0.9252        0.9240        0.9288        0.9187
Mean    0.8898        0.8942        0.8894        0.8948        0.8995        0.8999        0.8936
Std     0.1231        0.1046        0.0966        0.0894        0.0801        0.0887        0.0859
p<0.05  Ref.          ―             * p=0.004 ↓   ―             ―             ―             ―
p<0.05  ―             Ref.          * p=0.006 ↓   ―             ―             ―             ―
MLBCV - Left Kidney
Median  0.9360        0.9400        0.9364        0.9337        0.9251        0.9346        0.9132
Mean    0.8840        0.8859        0.8855        0.8801        0.8762        0.8835        0.8571
Std     0.2087        0.2096        0.2093        0.2083        0.2074        0.2089        0.2040
p<0.05  Ref.          ―             ―             ―             * p=0.003 ↓   ―             * p<0.001 ↓
p<0.05  ―             Ref.          ―             ―             * p=0.003 ↓   ―             * p<0.001 ↓
MLBCV - Liver
Median  0.9633        0.9659        0.9662        0.9633        0.9648        0.9577        0.9634
Mean    0.9622        0.9646        0.9639        0.9613        0.9640        0.9575        0.9611
Std     0.0096        0.0086        0.0089        0.0110        0.0086        0.0127        0.0103
p<0.05  Ref.          * p=0.010 ↑   * p=0.025 ↑   ―             ―             * p<0.001 ↓   ―
p<0.05  * p=0.010 ↓   Ref.          ―             * p=0.011 ↓   ―             * p<0.001 ↓   * p=0.010 ↓
MLBCV - Stomach
Median  0.8528        0.8102        0.8412        0.8418        0.8348        0.8306        0.8325
Mean    0.8377        0.8029        0.8380        0.8327        0.8305        0.8234        0.8307
Std     0.0805        0.1052        0.0777        0.0995        0.0891        0.0944        0.0845
p<0.05  Ref.          p=0.019 ↓     ―             ―             ―             ―             ―
p<0.05  p=0.019 ↑     Ref.          p=0.014 ↑     p=0.019 ↑     ―             ―             ―
MLBCV - Pancreas
Median  0.7620        0.7196        0.7453        0.7407        0.7234        0.7294        0.7336
Mean    0.7483        0.7030        0.7357        0.7344        0.7313        0.7215        0.7167
Std     0.1149        0.1140        0.1038        0.0886        0.1091        0.1279        0.1046
p<0.05  Ref.          * p=0.007 ↓   ―             * p=0.007 ↓   ―             * p=0.003 ↓   * p=0.005 ↓
p<0.05  * p=0.007 ↑   Ref.          * p=0.012 ↑   * p=0.017 ↑   * p=0.033 ↑   ―             ―

The best DSC results are marked as bold. The symbol “―” indicates that the difference between the corresponding method and the reference method (“Ref.”) is not significant. The symbols “↑” and “↓” mean significantly higher and lower respectively using the Wilcoxon signed rank test with p<0.05. “*” means the FDR corrected p value is also < 0.05.

Table 2. Performance on Testing Data (Same CT Contrast, Different Pathology)

        STN           WIR           SWN [10,10]   SWN [10,100]  SWN [50,50]   SWN [100,10]  SWN [100,100]
Decathlon - Pancreas
Median  0.6908        0.6407        0.6996        0.6880        0.6933        0.6870        0.6972
Mean    0.6480        0.6009        0.6714        0.6612        0.6607        0.6590        0.6665
Std     0.1639        0.1645        0.1432        0.1403        0.1507        0.1467        0.1416
p<0.05  Ref.          * p<0.001 ↓   * p<0.001 ↑   ―             * p=0.034 ↑   ―             * p=0.011 ↑
p<0.05  * p<0.001 ↑   Ref.          * p<0.001 ↑   * p<0.001 ↑   * p<0.001 ↑   * p<0.001 ↑   * p<0.001 ↑
LiTS - Liver
Median  0.9396        0.9425        0.9414        0.9420        0.9439        0.9389        0.9425
Mean    0.9321        0.9294        0.9315        0.9351        0.9335        0.9288        0.9346
Std     0.0307        0.0472        0.0405        0.0300        0.0376        0.0398        0.0300
p<0.05  Ref.          ―             ―             * p=0.015 ↑   * p=0.009 ↑   * p=0.008 ↑   * p=0.015 ↑
p<0.05  ―             Ref.          ―             ―             ―             * p=0.011 ↓   ―
FNH - Liver
Median  0.9317        0.9395        0.9389        0.9430        0.9422        0.9408        0.9443
Mean    0.9295        0.9367        0.9386        0.9408        0.9423        0.9361        0.9399
Std     0.0264        0.0181        0.0138        0.0119        0.0139        0.0203        0.0166
p<0.05  Ref.          ―             ―             ―             p=0.008 ↑     ―             p=0.016 ↑
p<0.05  ―             Ref.          ―             ―             ―             ―             ―

The best DSC results are marked as bold. The symbol “―” indicates that the difference between the corresponding method and the reference method (“Ref.”) is not significant. The symbols “↑” and “↓” mean significantly higher and lower respectively using the Wilcoxon signed rank test with p<0.05. “*” means the FDR corrected p value is also < 0.05.

Table 3. Performance on Testing Data (Different CT Contrast and Pathology)

        STN           WIR           SWN [10,10]   SWN [10,100]  SWN [50,50]   SWN [100,10]  SWN [100,100]
AADHS - Liver
Median  0.8290        0.8752        0.8983        0.9017        0.8924        0.8433        0.8892
Mean    0.7799        0.8214        0.8458        0.8811        0.8797        0.8084        0.8589
Std     0.1661        0.1248        0.1266        0.0559        0.0449        0.1433        0.1039
p<0.05  Ref.          ―             p<0.05 ↑      ―             p<0.05 ↑      p<0.05 ↑      p<0.05 ↑
p<0.05  ―             Ref.          p<0.05 ↑      ―             p<0.05 ↑      ―             p<0.05 ↑
Delayed - Right Kidney
Median  0.8847        0.8875        0.8678        0.9048        0.9084        0.8954        0.8905
Mean    0.8652        0.8673        0.8690        0.9035        0.9031        0.8921        0.8995
Std     0.0352        0.0719        0.0526        0.0341        0.0256        0.0281        0.0245
p<0.05  Ref.          ―             ―             p<0.05 ↑      p<0.05 ↑      ―             ―
p<0.05  ―             Ref.          ―             ―             ―             ―             ―
Delayed - Left Kidney
Median  0.8755        0.8328        0.8491        0.8994        0.8841        0.8853        0.8936
Mean    0.8580        0.7898        0.8359        0.8987        0.8910        0.8818        0.8913
Std     0.0605        0.1010        0.0427        0.0334        0.0298        0.0272        0.0293
p<0.05  Ref.          p<0.05 ↓      ―             ―             ―             ―             ―
p<0.05  p<0.05 ↑      Ref.          ―             ―             ―             ―             p<0.05 ↑

The best DSC results are marked as bold. The symbol “―” indicates that the difference between the corresponding method and the reference method (“Ref.”) is not significant. The symbols “↑” and “↓” mean significantly higher and lower respectively using the Wilcoxon signed rank test with p<0.05.

Fig. 6. The qualitative results of applying different intensity normalization strategies. The segmentation results of three scans with the lowest, median, and highest DSC (in SWN [50,50]) are presented for each experiment.

Fig. 7. The quantitative results of applying different intensity normalization strategies to the MLBCV dataset, which is from the “same scanner and same population” as training.

Fig. 8. The qualitative results of applying different intensity normalization strategies. The segmentation results of three scans with the lowest, median, and highest DSC (in SWN [50,50]) are presented for each experiment.

Fig. 9. The quantitative results of applying different intensity normalization strategies to Decathlon, LiTS, and FNH, which are from “same CT contrast, different pathology”.

Fig. 10. The qualitative results of applying different intensity normalization strategies on the AADHS and Delayed datasets. The segmentation results of three scans with the lowest, median, and highest DSC (in SWN [50,50]) are presented. The yellow and blue arrows indicate the key observations among different methods.

Fig. 11. The quantitative results of applying different intensity normalization strategies on the testing scans, which are from “different CT contrast and pathology” compared with training.
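The intensity-shift simulation of Section IV and the Dice similarity coefficient used throughout Section V can be illustrated with a small plain-Python sketch. The helper names are hypothetical, label maps are assumed to be flat sequences of integer labels, and returning 1.0 when a label is absent from both maps is a convention chosen here, not something stated in the paper.

```python
def hu_offsets(lowest=-300, highest=300, step=25):
    """Constant HU shifts of the simulation: -300 to +300 in steps of 25."""
    return list(range(lowest, highest + step, step))


def shift_scan(scan, offset):
    """Add one constant HU offset to every voxel of a 2D slice (nested lists)."""
    return [[value + offset for value in row] for row in scan]


def dice(prediction, truth, label):
    """Dice similarity coefficient of one label between two flat label maps."""
    pred_mask = [v == label for v in prediction]
    true_mask = [v == label for v in truth]
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(pred_mask) + sum(true_mask)
    # Assumed convention: a label missing from both maps counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total
```

In such a setup, each validation slice would be shifted by every value in `hu_offsets()`, windowed, segmented by the trained network, and scored with `dice` per organ label, which gives one row of mean Dice values per method as visualized in Figure 5.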

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Dec 1, 2019
