Visualization and Interpretation of Convolutional Neural Network Predictions in Detecting Pneumonia in Pediatric Chest Radiographs

Sivaramakrishnan Rajaraman *, Sema Candemir, Incheol Kim, George Thoma and Sameer Antani

Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD 20894, USA; sema.candemir@nih.gov (S.C.); ickim@mail.nih.gov (I.K.); gthoma@mail.nih.gov (G.T.); santani@mail.nih.gov (S.A.)
* Correspondence: sivaramakrishnan.rajaraman@nih.gov; Tel.: +1-301-827-2383

Received: 25 August 2018; Accepted: 18 September 2018; Published: 20 September 2018
Appl. Sci. 2018, 8, 1715; doi:10.3390/app8101715

Abstract: Pneumonia affects 7% of the global population and results in 2 million pediatric deaths every year. Chest X-ray (CXR) analysis is routinely performed to diagnose the disease. Computer-aided diagnostic (CADx) tools aim to supplement decision-making. These tools process handcrafted and/or convolutional neural network (CNN) extracted image features for visual recognition. However, CNNs are perceived as black boxes, since their performance lacks explanation. This is a serious bottleneck in applications involving medical screening/diagnosis, since poorly interpreted model behavior could adversely affect clinical decisions. In this study, we evaluate, visualize, and explain the performance of customized CNNs in detecting pneumonia and further differentiating between bacterial and viral types in pediatric CXRs. We present a novel visualization strategy to localize the region of interest (ROI) that is considered relevant for model predictions across all the inputs that belong to an expected class. We statistically validate the models' performance on the underlying tasks. We observe that the customized VGG16 model achieves 96.2% and 93.6% accuracy in detecting the disease and distinguishing between bacterial and viral pneumonia respectively. The model outperforms the state-of-the-art in all performance metrics and demonstrates reduced bias and improved generalization.

Keywords: computer vision; computer-aided diagnosis; convolutional neural networks; pediatric; pneumonia; visualization; explanation; chest X-rays; clinical decision

1. Introduction

Pneumonia is a significant cause of mortality in children across the world. According to the World Health Organization (WHO), around 2 million pneumonia-related deaths are reported every year in children under 5 years of age, making it the most significant cause of pediatric death [1]. Bacterial and viral pathogens are the two leading causes of pneumonia and require different forms of management [2]. Bacterial pneumonia is immediately treated with antibiotics, while viral pneumonia requires supportive care, making timely and accurate diagnosis important. Chest X-ray (CXR) analysis is the most commonly performed radiographic examination for diagnosing and differentiating the types of pneumonia [3]. However, rapid radiographic diagnosis and treatment are adversely impacted by the lack of expert radiologists in resource-constrained regions where pediatric pneumonia is highly endemic with alarming mortality rates. Figure 1 shows sample instances of normal and infected pediatric CXRs.
Figure 1. Pediatric CXRs: (a) Normal CXR showing clear lungs with no abnormal opacification; (b) Bacterial pneumonia exhibiting focal lobar consolidation in the right upper lobe; (c) Viral pneumonia manifesting with diffuse interstitial patterns in both lungs.

Computer-aided diagnostic (CADx) tools aim to supplement clinical decision-making. They combine elements of computer vision and artificial intelligence with radiological image processing for recognizing patterns [4]. Much of the published literature describes machine learning (ML) algorithms that use handcrafted feature descriptors [5] that are optimized for individual datasets and trained for specific variability in the size, orientation, and position of the region of interest (ROI) [6]. In recent years, data-driven deep learning (DL) methods have been shown to avoid the issues with handcrafted features through end-to-end feature extraction and classification.

Convolutional neural networks (CNNs) belong to a class of DL models that are prominently used in computer vision [7]. These models have multiple processing layers that learn hierarchical feature representations from the input pixel data. The features in the early network layers are abstracted through the mechanisms of local receptive fields, weight sharing, and pooling to form rich feature representations toward learning and classifying the inputs to their respective classes. Due to the lack of sufficiently extensive medical image data, CNNs trained on large-scale data collections such as ImageNet [8] are used to transfer the knowledge of learned representations, in the form of generic image features, to the current task. CNNs have also been shown to deliver promising results in object detection and localization tasks [9].
The astounding success of deep CNNs, coupled with their lack of explainable decision-making, has resulted in a perception of doubt. This poorly understood model behavior has limited their use in routine clinical practice [10]. There are not enough studies pertaining to the visualization and interpretation of CNNs in medical image analysis/understanding applications. In this article, we (i) detect and distinguish pneumonia types in pediatric CXRs, and (ii) explain the internal operations and predictions of CNNs applied to this challenge.

In this study, we evaluate, visualize, and explain the predictions of CNN models in classifying pediatric CXRs to detect pneumonia and, furthermore, to differentiate between bacterial and viral pneumonia to facilitate swift referrals that require urgent medical intervention. We propose a novel method to visualize the class-specific ROI that is considered significant for correct predictions across all the inputs that belong to an expected class. We evaluate and statistically validate the performance of different customized CNNs that are trained end-to-end on the dataset under study to provide an accurate and timely diagnosis of the pathology. The work is organized as follows: Section 2 discusses the related work, Section 3 elaborates on the materials and methods, Section 4 discusses the results, and Section 5 concludes the study.

2. Related Work

A study of the literature reveals several works pertaining to the use of handcrafted features for detecting pneumonia in chest radiographs [11–14]. However, few studies have reported the performance of DL methods applied to pneumonia detection in pediatric CXRs. Relatively few researchers have attempted to offer a qualitative explanation of their model's learned behavior, internal computations, and predictions.
The authors of [15] used a pretrained InceptionV3 model as a fixed feature extractor to classify normal and pneumonia-infected pediatric CXRs and further distinguish between bacterial and viral pneumonia, with an area under the curve (AUC) of 0.968 and 0.940 respectively. In another study [4], the authors used a gradient-based ROI localization algorithm to detect and spatially locate pneumonia in CXRs. They released the largest collection of the National Institutes of Health (NIH) CXR dataset, which contains 112,120 frontal CXRs; the associated labels were text-mined from radiological reports using natural language processing tools. The authors reported an AUC of 0.633 toward detecting the disease. The authors of [16] used a gradient-based visualization method to localize the ROI with heat maps toward pneumonia detection. They used a 121-layer densely connected neural network toward estimating the disease probability and obtained an AUC of 0.768 toward detecting pneumonia. The authors of [17] used an attention-guided mask inference algorithm to locate salient image regions that are indicative of pneumonia. The features of the local and global network branches in the proposed model are concatenated to estimate the probability of the disease. An AUC of 0.776 is reported for pneumonia detection.

3. Materials and Methods

3.1. Data Collection and Preprocessing

We used a set of pediatric CXRs that have been made publicly available by the authors of [15]. The authors obtained approvals from the Institutional Review Board (IRB) and Ethics Committee for data collection and experimentation. The dataset includes anteroposterior CXRs of children from 1 to 5 years of age collected from the Guangzhou Women and Children's Medical Center in Guangzhou, China. The characteristics of the data and its distribution are shown in Table 1. The dataset is screened for quality control to remove unreadable and low-quality radiographs and curated by experts to avoid grading errors.

Table 1. Dataset and its characteristics.

Category     Training Samples    Test Samples    File Type
Normal       1349                234             JPG
Bacterial    2538                242             JPG
Viral        1345                148             JPG

The CXRs contain regions other than the lungs that do not contribute to diagnosing pneumonia. Under these circumstances, the model may learn irrelevant feature representations from the underlying data. Using an algorithm based on anatomical atlases [18] to automatically detect the lung ROI can avoid this. A reference set of patient CXRs with expert-delineated lung masks is used as models [19] to register with the objective pediatric CXR. When presented with an objective chest radiograph, the algorithm uses the Bhattacharyya distance measure to select the most similar model CXRs. The correspondence between the model CXRs and the objective CXR is computed by modeling the objective CXR with local image feature representations and identifying similar locations by applying the SIFT-flow algorithm [20]. This map is the transformation applied to the model lung masks to transform them into the approximate lung model for the objective chest radiograph. The lung boundaries are cropped to the size of a bounding box that includes all the lung pixels constituting the ROI for the current task. The baseline data (whole CXRs) and the cropped bounding boxes are resampled to 1024 × 1024 pixel dimensions and mean normalized to assist the models in faster convergence.
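The atlas registration and SIFT-flow steps are beyond a short example, but the final crop-resample-normalize step is straightforward to reproduce. The following is a minimal sketch, assuming a grayscale CXR file on disk and an optional binary lung mask produced by the atlas-based detector; the function and argument names are illustrative, not the authors' code.

```python
import numpy as np
from PIL import Image

def preprocess_cxr(path, lung_mask=None, size=(1024, 1024)):
    """Resample a CXR (optionally cropped to the detected lung bounding box)
    to 1024 x 1024 and mean-normalize it, as described in Section 3.1."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    if lung_mask is not None:
        # Bounding box that includes all lung pixels flagged by the detector.
        ys, xs = np.nonzero(lung_mask)
        img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    img = np.asarray(Image.fromarray(img).resize(size, Image.BILINEAR),
                     dtype=np.float32)
    return img - img.mean()  # mean normalization to assist faster convergence
```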
The detected lung boundaries for the sample pediatric CXRs are shown in Figure 2.

Figure 2. Detected boundaries in sample pediatric CXRs.

3.2. Configuring CNNs for Pneumonia Detection

We evaluated the performance of different customized CNNs and a VGG16 model in detecting pneumonia and, furthermore, distinguishing between bacterial and viral types to facilitate timely and accurate disease diagnosis. We evaluated three different customized CNN architectures: (i) a sequential CNN; (ii) a CNN with residual connections (residual CNN); and, (iii) a CNN with Inception modules (Inception CNN).

3.2.1. Sequential CNN

A sequential CNN model belongs to the class of deep, feed-forward artificial neural networks that are commonly applied to visual recognition [7]. It is a linear stack of convolutional, nonlinear, pooling, and dense layers. We optimized the sequential CNN architecture and its hyperparameters for the datasets under study through Bayesian learning [21,22]. The procedure uses a Gaussian process model of an objective function and its evaluation to optimize the network depth, learning rate, momentum, and L2-regularization. These parameters are passed as arguments in the form of optimization variables to evaluate the objective function. We initialized the search ranges to [1 10], [1 × 10^-7 1 × 10^-1], [0.7 0.99], and [1 × 10^-10 1 × 10^-2] for the network depth, learning rate, momentum, and L2-regularization respectively. The objective function takes these variables as input, then trains, validates, and saves the optimal network that gives the minimum classification error on the test data. Figure 3 illustrates the steps involved in optimization.
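The authors ran this search with Matlab R2017b's tooling; a functionally similar Gaussian-process search can be sketched in Python with scikit-optimize (an assumption, not the authors' implementation). Here `build_sequential_cnn` and `train_and_evaluate` are hypothetical helpers standing in for the training pipeline.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Search ranges from Section 3.2.1: depth, learning rate, momentum, L2.
space = [
    Integer(1, 10),                          # network depth
    Real(1e-7, 1e-1, prior="log-uniform"),   # learning rate
    Real(0.7, 0.99),                         # momentum
    Real(1e-10, 1e-2, prior="log-uniform"),  # L2-regularization
]

def objective(params):
    depth, lr, momentum, l2 = params
    model = build_sequential_cnn(depth=depth, l2=l2)    # hypothetical builder
    test_acc = train_and_evaluate(model, lr, momentum)  # hypothetical trainer
    return 1.0 - test_acc                               # minimize test error

# 100 objective function evaluations, matching the count in Section 4.1.
result = gp_minimize(objective, space, n_calls=100, random_state=0)
print("Optimized depth, lr, momentum, L2:", result.x)
```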
3.2.2. Residual CNN

In a sequential CNN, the succeeding network layer learns feature representations from only the preceding layer. These networks are constrained by the level of information they can process. Residual networks, proposed by [23], won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. These networks tackle the issue of representational bottlenecks by injecting the information from the earlier network layers downstream to prevent loss of information. They also prevent the gradients from vanishing by introducing a linear information carry track to propagate gradients through deep network layers. In this study, we propose a customized CNN that is made up of six residual blocks, as shown in Figure 4.

3.2.3. Inception CNN

The Inception architecture, proposed by [24], consists of independent modules with parallel branches that are concatenated to form the resultant feature map, which is fed into the succeeding modules. Unlike a sequential CNN, this method of stacking modules helps in separately learning the spatial and channel-wise feature representations. The 1 × 1 convolution filters used in these modules factor out the channel-wise and spatial feature learning by computing features from the channels without mixing spatial information, looking at one input tile at a given point in time. We construct a customized Inception CNN by stacking six InceptionV3 modules [24], as shown in Figure 5.

Figure 3. Flowchart describing the optimization procedure.

Figure 4. The architecture of the customized residual CNN: (a) Residual block; (b) Customized residual CNN stacked with six residual blocks.
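As an illustration of the carry track described in Section 3.2.2, here is one way to write a residual block in Keras. This is a sketch in the spirit of Figure 4(a); the exact convolution sizes inside the authors' block are an assumption.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """A residual block in the spirit of Figure 4(a): convolutions on the main
    path plus an identity shortcut, the linear carry track for gradients."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if int(shortcut.shape[-1]) != filters:
        # 1x1 projection so the shortcut matches the main path's channel count.
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])  # inject earlier-layer information downstream
    return layers.Activation("relu")(y)
```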
Figure 5. The architecture of the customized Inception CNN: (a) InceptionV3 module; (b) Customized Inception CNN stacked with six InceptionV3 modules.

3.2.4. Customized VGG16

VGG16 was proposed and trained by Oxford's Visual Geometry Group (VGG) [25] for object recognition. The model scored first in the ILSVRC image localization task and second in the image classification task. We customized the architecture of the VGG16 model and evaluated its performance on the tasks of interest. The model is truncated at the deepest convolutional layer, and a global average pooling (GAP) layer and a dense layer are added, as shown in Figure 6. We refer to this model as the customized VGG16 in this study.

The hyperparameters of the customized residual, Inception, and VGG16 models are optimized through a randomized grid search [26] that searches and optimizes the values of hyperparameters including the learning rate, momentum, and L2-regularization. The search ranges are initialized to [1 × 10^-6 1 × 10^-1], [0.7 0.99], and [1 × 10^-10 1 × 10^-1] for the learning rate, momentum, and L2-regularization respectively. Callbacks are used to view the internal states during training and retain the best performing model for analysis. We performed hold-out testing with the test data after every step. The performance of the customized CNNs is evaluated in terms of the following performance metrics: (i) accuracy; (ii) AUC; (iii) precision; (iv) recall; (v) specificity; (vi) F-score; and, (vii) Matthews Correlation Coefficient (MCC). We used the NIH Biowulf Linux cluster (https://hpc.nih.gov/) and the high performance computing facility at the National Library of Medicine (NLM) for computational analyses. Software frameworks included with Matlab R2017b are used to configure and evaluate the sequential CNN, along with Keras with a TensorFlow backend for the other customized models used in this study.
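The customized VGG16 of Section 3.2.4 maps naturally onto the Keras applications API. The following is a minimal sketch assuming ImageNet initialization and a softmax dense head; the layer names `gap` and `predictions` are choices made here for the later visualization sketches, not the authors' identifiers.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_customized_vgg16(input_shape=(1024, 1024, 3), num_classes=2):
    """VGG16 truncated at its deepest convolutional layer, with a GAP and a
    dense classification layer appended, following Figure 6."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    x = layers.GlobalAveragePooling2D(name="gap")(base.output)
    outputs = layers.Dense(num_classes, activation="softmax", name="predictions")(x)
    return models.Model(base.input, outputs, name="customized_vgg16")

model = build_customized_vgg16()  # then trained end-to-end on the CXR data
```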
Figure 6. VGG16 model truncated at the deepest convolutional layer, with a GAP and dense layer added.

3.3. Visualization Studies

The interpretation and understanding of CNNs is a hotly debated topic in ML, particularly in the context of clinical decision-making [4]. CNNs are perceived as black boxes, and it is imperative to explain their working to build trust in their predictions [9]. This helps to understand their working principles, assists in hyperparameter tuning and optimization, offers an intuition of the reasons behind model failures, and explains the predictions to the end-user prior to possible deployment. The methods of visualizing CNNs are broadly categorized into (i) preliminary methods that help to visualize the overall structure of the model; and, (ii) gradient-based methods that manipulate the gradients from the forward and backward passes during training [27]. We demonstrated the overall structure of the CNNs, as shown in Figures 4–6.

3.3.1. Visual Explanation through Discriminative Localization

The trained model focuses on discriminative parts of the image to arrive at its predictions. Class Activation Maps (CAM) help in visualizing and debugging model predictions, particularly in the case of a prediction error when the model predicts based on the surrounding context [27]. The output of the GAP layer is fed to the dense layer to identify the discriminative ROI localized to classify the inputs to their respective classes.
Let G denote the GAP operation that spatially averages the m-th feature map from the deepest convolutional layer, and let w_m^p denote the weight connecting the m-th feature map to the output neuron corresponding to the expected class p. A prediction score S_p at the output neuron is expressed as a weighted sum of GAP outputs, as shown in Equation (1).

S_p = \sum_{m} w_m^p \sum_{x,y} g_m(x, y) = \sum_{x,y} \sum_{m} w_m^p g_m(x, y)    (1)

The value g_m(x, y) denotes the m-th feature map activation at the spatial location (x, y). The CAM for the class p, denoted by CAM_p, is expressed as the weighted sum of the activations from all the feature maps with respect to the expected class p at the spatial location (x, y), as shown in Equation (2).

CAM_p(x, y) = \sum_{m} w_m^p g_m(x, y)    (2)

CAM gives information pertaining to the importance of the activations at each spatial grid (x, y) in classifying an input image to its expected class p. It is rescaled to the size of the input image to locate the discriminative ROI used to classify the image to its expected class. This helps to answer queries pertaining to the ability of the model to predict and localize the ROI specific to its category. We propose a novel visualization method called average-CAM to represent the class-level ROI that is most commonly considered significant for correct prediction across all the inputs that belong to a given class. The average-CAM for the class p is computed by averaging the CAM outputs over the A images of that class, as shown in Equation (3).

\text{average-CAM}_p(x, y) = \frac{1}{A} \sum_{a} CAM_p^a(x, y)    (3)

CAM_p^a(x, y) denotes the CAM for the a-th image in the expected class p. This helps to identify the ROI specific to the expected class, improve the interpretability of the internal representations, and improve the explainability of the model predictions.
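Equations (1)-(3) translate directly into a few lines of TensorFlow. A minimal sketch follows, reusing the hypothetical layer names from the customized VGG16 sketch above (`block5_conv3` is the deepest convolutional layer in Keras' VGG16; `predictions` is our assumed dense layer).

```python
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, class_idx, conv_layer="block5_conv3"):
    """Equation (2): weight the deepest convolutional feature maps g_m by the
    dense-layer weights w_m^p for the expected class p."""
    conv_model = tf.keras.Model(model.input, model.get_layer(conv_layer).output)
    fmaps = conv_model.predict(image[np.newaxis], verbose=0)[0]  # (h, w, m)
    weights = model.get_layer("predictions").get_weights()[0]    # (m, classes)
    cam = fmaps @ weights[:, class_idx]                          # sum over maps
    return cam / (np.abs(cam).max() + 1e-8)  # normalize; upsample for display

# Equation (3): average-CAM over the A images belonging to class p.
# avg_cam = np.mean([class_activation_map(model, im, p)
#                    for im in class_p_images], axis=0)
```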
CAM visualization can only be applied to networks with a GAP layer. Gradient-weighted CAM (grad-CAM) is a strict generalization of CAM that can be applied to all existing CNNs [28]. It uses the gradient information of the expected class, flowing back into the deepest convolutional layer, to generate explanations. Grad-CAM produces the weighted sum of all the feature maps in the deepest convolutional layer for the expected class p, as shown in Equation (4). A ReLU nonlinearity is applied to prevent negative weights from influencing the class p, based on the consideration that pixels with negative weights are likely to belong to other classes.

\text{grad-CAM}_p(x, y) = \mathrm{ReLU}\left( \sum_{m} \beta_m^p g_m(x, y) \right)    (4)
The value β_m^p is obtained by computing the gradient of the prediction score S_p with respect to the m-th feature map, as shown in Equation (5).

\beta_m^p = \sum_{x,y} \frac{\partial S_p}{\partial g_m(x, y)}    (5)

According to Equations (1) and (4), β_m^p is precisely the same as w_m^p for networks with a CAM-compatible architecture. The difference lies in applying the ReLU non-linearity to exclude the influence of negative weights that are likely to belong to other classes. The average-grad-CAM for the class p is computed by averaging the grad-CAM outputs, as shown in Equation (6). The value grad-CAM_p^a(x, y) denotes the grad-CAM for the a-th image in the expected class p.

\text{average-grad-CAM}_p(x, y) = \frac{1}{A} \sum_{a} \text{grad-CAM}_p^a(x, y)    (6)
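A matching grad-CAM sketch, again under the assumed layer names from the earlier VGG16 sketch. Note that many grad-CAM implementations average rather than sum the gradients over space, which only changes the map by a constant factor.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_idx, conv_layer="block5_conv3"):
    """Equations (4)-(5): pool the gradients of S_p over (x, y) to obtain
    beta_m^p, weight the feature maps, and apply the ReLU."""
    grad_model = tf.keras.Model(
        model.input, [model.get_layer(conv_layer).output, model.output])
    x = tf.convert_to_tensor(image[np.newaxis], dtype=tf.float32)
    with tf.GradientTape() as tape:
        fmaps, preds = grad_model(x)
        score = preds[:, class_idx]              # prediction score S_p
    grads = tape.gradient(score, fmaps)          # dS_p / dg_m(x, y)
    beta = tf.reduce_sum(grads, axis=(1, 2))     # Equation (5): sum over (x, y)
    cam = tf.nn.relu(
        tf.reduce_sum(fmaps * beta[:, tf.newaxis, tf.newaxis, :], axis=-1))[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```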
3.3.2. Model-Agnostic Visual Explanations

Local interpretable model-agnostic explanations (LIME) is a visualization tool proposed by [29]. It helps to provide a qualitative interpretation of the relationship between perturbed input instances and the model predictions. The input image is divided into contiguous superpixels, and a dataset of perturbed input instances is constructed by turning these interpretable components on and off. The perturbed instances are weighted by their similarity to the explained instance. The algorithm approximates the CNN with a sparse, linear model that is weighted only in the neighborhood of the explained predictions. An explanation is generated in the form of the superpixels with the highest positive weights, which demonstrate the discriminative ROI localized by the model to classify the image to its expected class. Let k ∈ R^d be the explained instance, and k′ ∈ {0, 1}^d′ the binary vector that denotes the presence/absence of a superpixel. Let g ∈ G denote the explanation, where G is a class of interpretable linear models, and let γ(g) denote the complexity measure associated with the explanation g ∈ G. The value γ(g) is the number of non-zero coefficients of the linear model.
Let m: R^d → R denote the explained model, and m(k) the probability that k belongs to a given class. Let Π_k(x) denote the measure of proximity between an instance x and k, and let P(m, g, Π_k) denote the loss of g toward approximating m in the neighborhood defined by Π_k. The value P(m, g, Π_k) is minimized while the value of γ(g) remains low enough for interpretability. Equation (7) gives the explanations produced by LIME.

\beta(k) = \underset{g \in G}{\operatorname{argmin}} \; P(m, g, \Pi_k) + \gamma(g)    (7)

The value P(m, g, Π_k) is approximated by drawing samples weighted by Π_k. Equation (8) shows an exponential kernel defined on the L2-distance function J with width ε.

\Pi_k(x) = \exp\left( -\frac{J(k, x)^2}{\varepsilon^2} \right)    (8)

For a given input perturbed sample b′ ∈ {0, 1}^d′ containing a fraction of the non-zero elements, the label for the explanation model m(b) is obtained by recovering the sample b ∈ R^d in the original representation, as shown in Equation (9).

P(m, g, \Pi_k) = \sum_{b, b' \in B} \Pi_k(b) \left( m(b) - g(b') \right)^2    (9)

LIME provides explanations that help to make an informed decision about the trustworthiness of the predictions and to gain crucial insights into the model behavior.
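In practice, explanations like Equations (7)-(9) are typically produced with the reference `lime` package rather than re-implemented. The following sketch assumes a trained Keras classifier `model` and an RGB-formatted test CXR `cxr_rgb`; both names are hypothetical stand-ins for this study's artifacts.

```python
import numpy as np
from lime import lime_image                       # pip install lime
from skimage.segmentation import mark_boundaries

def predict_fn(batch):
    """Adapter so LIME receives class probabilities per perturbed sample."""
    return model.predict(np.asarray(batch), verbose=0)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    cxr_rgb,              # hypothetical (H, W, 3) test radiograph
    predict_fn,
    top_labels=2,
    num_samples=1000,     # perturbed instances b', weighted by the kernel Pi_k
)
image, mask = explanation.get_image_and_mask(
    explanation.top_labels[0],  # the predicted class
    positive_only=True,         # superpixels with the highest positive weights
    num_features=5,
    hide_rest=False,
)
overlay = mark_boundaries(image / 255.0, mask)
```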
4. Results and Discussion

4.1. Performance Evaluation of Customized CNNs

Figure 7 shows the optimized architecture and parameters of the sequential CNN, obtained through Bayesian learning. We performed 100 objective function evaluations toward optimizing the model parameters. The optimized values are found to be 6, 1 × 10^-3, 0.9, and 1 × 10^-6 for the network depth, learning rate, momentum, and L2-regularization parameters respectively. The number of convolutional layer filters is increased by a factor of 2 each time a max-pooling layer is used, in order to ensure roughly the same number of computations in the network layers. Rectified Linear Unit (ReLU) layers are added to introduce non-linearity and prevent vanishing gradients during backpropagation [7].
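For concreteness, here is how a sequential CNN with the doubling rule could be assembled in Keras. This is one plausible reading of Figure 7, with the starting filter count and the GAP/softmax head chosen here as assumptions; it also serves as the `build_sequential_cnn` helper hypothesized in the Bayesian-search sketch of Section 3.2.1.

```python
from tensorflow.keras import layers, models, regularizers

def build_sequential_cnn(depth=6, l2=1e-6,
                         input_shape=(1024, 1024, 1), num_classes=2):
    """One plausible reading of Figure 7: `depth` conv/ReLU/max-pool blocks
    whose filter counts double after every max-pooling layer."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    filters = 16                                   # starting width: assumption
    for _ in range(depth):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu",
                                kernel_regularizer=regularizers.l2(l2)))
        model.add(layers.MaxPooling2D())
        filters *= 2                               # doubled after each pooling
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```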
Figure 7. The optimized architecture of the customized sequential CNN.

Our analysis shows an increase in the performance of the residual and Inception CNNs when the number of filters in the convolutional layers of the succeeding blocks is increased by a factor of 2. We found the optimal hyperparameter values for the residual, Inception, and VGG16 models through a randomized grid search. The values are tabulated in Table 2.

Table 2. Optimal values for the hyperparameters of the customized residual, Inception, and VGG16 CNNs obtained through a randomized grid search.

Model               Learning Rate    Momentum    L2-Regularization
Residual CNN        1 × 10^-3        0.9         1 × 10^-6
Inception CNN       1 × 10^-2        0.95        1 × 10^-4
Customized VGG16    1 × 10^-4        0.99        1 × 10^-6

The customized CNNs are evaluated with the baseline and the cropped ROI data. The results are tabulated in Table 3. We observed that the performance of the models with the cropped ROI is relatively promising in comparison to the baseline in classifying normal and pneumonia-infected CXRs. This is expected, because the models trained with the cropped ROI learn feature representations relevant to the classification task of interest.

Table 3. Performance of customized CNNs with baseline and cropped ROI data.

Task: Normal vs. Pneumonia
Data         Model               Accuracy  AUC    Precision  Recall  Specificity  F-Score  MCC
Baseline     Customized VGG16    0.957     0.990  0.951      0.983   0.915        0.967    0.908
Baseline     Sequential          0.943     0.983  0.920      0.980   0.855        0.957    0.878
Baseline     Residual            0.910     0.967  0.908      0.954   0.838        0.931    0.806
Baseline     Inception           0.886     0.922  0.887      0.939   0.800        0.913    0.755
Cropped ROI  Customized VGG16    0.962     0.993  0.977      0.962   0.962        0.970    0.918
Cropped ROI  Sequential          0.941     0.984  0.930      0.995   0.877        0.955    0.873
Cropped ROI  Residual            0.917     0.971  0.913      0.959   0.847        0.936    0.820
Cropped ROI  Inception           0.897     0.932  0.896      0.947   0.817        0.921    0.778

Task: Bacterial vs. Viral Pneumonia
Data         Model               Accuracy  AUC    Precision  Recall  Specificity  F-Score  MCC
Baseline     Customized VGG16    0.936     0.962  0.920      0.984   0.860        0.951    0.862
Baseline     Sequential          0.928     0.954  0.909      0.984   0.838        0.946    0.848
Baseline     Residual            0.897     0.921  0.880      0.967   0.784        0.922    0.780
Baseline     Inception           0.854     0.901  0.841      0.934   0.714        0.886    0.675
Cropped ROI  Customized VGG16    0.936     0.962  0.920      0.984   0.860        0.951    0.862
Cropped ROI  Sequential          0.928     0.956  0.909      0.984   0.838        0.946    0.848
Cropped ROI  Residual            0.908     0.933  0.888      0.976   0.798        0.930    0.802
Cropped ROI  Inception           0.872     0.919  0.853      0.959   0.730        0.903    0.725

Task: Normal vs. Bacterial vs. Viral Pneumonia
Data         Model               Accuracy  AUC    Precision  Recall  Specificity  F-Score  MCC
Baseline     Customized VGG16    0.917     0.938  0.917      0.905   0.958        0.911    0.873
Baseline     Sequential          0.896     0.922  0.888      0.885   0.948        0.887    0.841
Baseline     Residual            0.861     0.887  0.868      0.882   0.933        0.875    0.809
Baseline     Inception           0.809     0.846  0.753      0.848   0.861        0.798    0.688
Cropped ROI  Customized VGG16    0.918     0.939  0.920      0.900   0.960        0.910    0.876
Cropped ROI  Sequential          0.897     0.923  0.898      0.898   0.949        0.898    0.844
Cropped ROI  Residual            0.879     0.909  0.883      0.890   0.941        0.887    0.825
Cropped ROI  Inception           0.821     0.865  0.778      0.855   0.878        0.815    0.714

The customized VGG16 model demonstrates more promising performance than the other CNNs under study. The model learned generic image features from ImageNet that served as a better initialization than random weights, and it was trained end-to-end on the current tasks to learn task-specific features. This results in faster convergence with reduced bias and overfitting, and improved generalization.
In classifying bacterial and viral pneumonia, no significant difference in performance is observed for the customized VGG16 model between the baseline and the cropped ROI. In the multi-class classification task, the cropped ROI gave better results than the baseline data. However, we observed that the differences in performance are not significant. This may be because the dataset under study already appeared cropped, and the boundary detection algorithm resulted in a few under-segmented regions near the costophrenic angle.

The customized sequential, residual, and Inception CNNs with random weight initializations did not have the opportunity to learn discriminative features, owing to the sparse availability and imbalanced distribution of training data across the expected classes. We observed that the sequential CNN outperformed the residual and Inception counterparts across the classification tasks. The usage of residual connections is beneficial in resolving the issue of representational bottlenecks and vanishing gradients in deep models. The CNNs used in this study have a shallow architecture, and the residual connections did not introduce significant gains in performance for the tasks of interest. Unlike ImageNet, the variability in the pediatric CXR data is several orders of magnitude smaller. The architectures of the residual and Inception CNNs are progressively more complex and did not seem to be fitting tools for the tasks of interest.
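The binary metrics reported in Tables 3 and 4 follow standard definitions and can be reproduced from predicted probabilities; below is a minimal scikit-learn sketch for illustration (an assumption; the authors computed their metrics within their Matlab/Keras pipelines).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute the Table 3 metrics for a binary task from probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),   # also called sensitivity
        "specificity": tn / (tn + fp),
        "f_score": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```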
The confusion matrices and AUC achieved with the customized VGG16 model are shown in Figures 8–10. We observed that the training metrics are poor compared to the test accuracy. This is due to the fact that noisy images are included in the training data to reduce bias and overfitting, and to improve model generalization.

We compared the performance of the customized VGG16 model trained with the cropped ROI to the state-of-the-art. The results are tabulated in Table 4. We observed that our model outperforms the current literature in all performance metrics across the classification tasks. The customized sequential CNN demonstrates a higher recall in classifying normal and pneumonia CXRs, and a recall identical to that of the customized VGG16 model in classifying bacterial and viral pneumonia. However, considering the balance between precision and recall, as demonstrated by the F-score and MCC, the customized VGG16 model outperforms the other CNNs and the state-of-the-art across the classification tasks.

Figure 8. Confusion matrices for the performance of the customized VGG16 model: (a) Normal v. Pneumonia; (b) Bacterial v. Viral Pneumonia.
Figure 8. Confusion matrices for the performance of the customized VGG16 model: (a) Normal v. Pneumonia; (b) Bacterial v. Viral Pneumonia.

Figure 9. ROC curves demonstrating the performance of the customized VGG16 model: (a) Normal v. Pneumonia; (b) Bacterial v. Viral Pneumonia.

Table 4. Comparing the performance of the customized VGG16 model with the state-of-the-art. *

Task                            Model             Accuracy  AUC    Precision  Recall  Specificity  F-Score  MCC
Normal v. Pneumonia             Customized VGG16  0.962     0.993  0.977      0.962   0.962        0.970    0.918
                                Kermany et al.    0.928     0.968  -          0.932   0.901        -        -
Bacterial v. Viral Pneumonia    Customized VGG16  0.936     0.962  0.920      0.984   0.860        0.951    0.862
                                Kermany et al.    0.907     0.940  -          0.886   0.909        -        -
Normal v. Bacterial v. Viral    Customized VGG16  0.918     0.939  0.920      0.900   0.960        0.910    0.876
Pneumonia                       Kermany et al.    -         -      -          -       -            -        -

* Bold numbers indicate superior performance.

Figure 10. Performance of customized VGG16 model in multiclass classification: (a) Confusion matrix; (b) ROC curves.

4.2. Visualization through Discriminative Localization

The customized VGG16 model has a CAM-compatible architecture owing to the presence of the GAP layer. This helps in visualizing the model predictions using both CAM and grad-CAM visualization tools. Figures 11 and 12 demonstrate the results of applying these visualizations to localize the discriminative ROI in pneumonia-infected CXRs.

Figure 11. Visual explanations through gradient-based localization using CAM: (a) Input CXRs; (b) Bounding boxes localizing regions of activations; (c) CAM showing heat maps superimposed on the original CXRs; (d) Automatically segmented lung masks; (e) CAM showing heat maps superimposed on the cropped lungs.
Figure 12. Visual explanations through gradient-based localization using grad-CAM: (a) Input CXRs; (b) Bounding boxes localizing regions of activations; (c) Grad-CAM showing heat maps superimposed on the original CXRs; (d) Automatically segmented lung masks; (e) Grad-CAM showing heat maps superimposed on the cropped lungs.

CXRs are fed to the trained model and the predictions are decoded. The heat maps are generated as a two-dimensional score grid, computed for each input pixel location. Pixels carrying high importance with respect to the expected class appear bright red, with distinct color transitions for the varying ranges. The generated heat maps are superimposed on the original input to localize the image-specific ROI. The lung masks generated with the boundary detection algorithm are applied to extract the localized ROI relevant to the lung regions. We observed that the CAM and grad-CAM visualizations generate heat maps for the pneumonia class that highlight the visual differences in the "pneumonia-like" regions of the image.
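As an illustration of the superimposition step described above, the following is a minimal OpenCV sketch; cxr (a uint8 grayscale CXR), heatmap (the small 2-D score grid), and lung_mask (a binary mask) are hypothetical arrays produced earlier in the pipeline, not the exact rendering code used in the study.

```python
# Minimal sketch: rescale a CAM/grad-CAM score grid, color it, mask it to the
# lung field, and blend it onto the original radiograph.
import cv2
import numpy as np

def overlay_heatmap(cxr, heatmap, lung_mask=None, alpha=0.4):
    h, w = cxr.shape[:2]
    hm = cv2.resize(heatmap.astype(np.float32), (w, h))        # grid -> image size
    hm = np.uint8(255 * (hm - hm.min()) / (np.ptp(hm) + 1e-8)) # normalize to [0, 255]
    hm_color = cv2.applyColorMap(hm, cv2.COLORMAP_JET)         # high scores -> red
    if lung_mask is not None:
        hm_color[lung_mask == 0] = 0                           # keep only the lung ROI
    cxr_bgr = cv2.cvtColor(cxr, cv2.COLOR_GRAY2BGR)
    return cv2.addWeighted(hm_color, alpha, cxr_bgr, 1.0 - alpha, 0)
```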
We applied our novel method of average-CAM and average-grad-CAM to visualize the class-specific ROI, as shown in Figures 13 and 14. Lung masks are applied to the generated heat maps to localize only the ROI specific to the lung regions. We observed that the class-specific ROI localized by the average-CAM and average-grad-CAM for the viral pneumonia class follows a diffuse pattern. This is expected, since viral pneumonia manifests with diffuse interstitial patterns in both lungs [30]. For the bacterial pneumonia class, we observed that the model layers are activated on both sides of the lungs, predominantly on the upper and middle right lung lobes. This is because bacterial pneumonia manifests as lobar consolidations [30], and the pneumonia dataset under study has more pediatric patients with right lobar consolidations.

Figure 13. Visual explanations through average-CAM: (a) Bacterial and viral CXR (top and bottom); (b) Average-CAM localizing class-specific ROI with bounding boxes highlighting the regions of maximum activation; (c) Automatically segmented lung masks; (d) Average-CAM localizing class-specific ROI with the extracted lung regions.

Figure 14. Visual explanations through average-grad-CAM: (a) Bacterial and viral CXR (top and bottom); (b) Average-grad-CAM localizing class-specific ROI with bounding boxes highlighting the regions of maximum activation; (c) Automatically segmented lung masks; (d) Average-grad-CAM localizing class-specific ROI with the extracted lung regions.

4.3. Visual Explanations with LIME

Figure 15 shows the explanations generated with LIME for sample instances of pediatric chest radiographs. Lung masks are applied to the explanations to localize only the ROI specific to the lung regions. The explanations are shown as follows: (i) superpixels with the highest positive weights with the rest greyed out; and (ii) superpixels superimposed on the extracted lung regions. We observed that the explainer focused on the regions with high opacity. The model differentiates bacterial and viral pneumonia by showing superpixels with the highest positive activations (i) in the regions of lobar consolidations for bacterial pneumonia; and (ii) in the diffuse interstitial patterns across the lungs for viral pneumonia. We also observed that a number of false positive superpixels are reported. The reason is that the current LIME implementation uses a sparse linear model to approximate the model behavior in the neighborhood of the explained predictions. However, these explanations result from a random sampling process and are not faithful if the underlying model is highly non-linear in the locality of the predictions.
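For reference, such an explanation can be generated with the publicly available lime package [29]; in the minimal sketch below, model and image are hypothetical placeholders for the trained Keras classifier and a preprocessed RGB CXR array.

```python
# Minimal sketch: LIME explanation for a single CXR using the `lime` package.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image.astype(np.double),   # image to explain (RGB array)
    model.predict,             # classifier_fn returning class probabilities
    top_labels=2,              # explain the two highest-scoring classes
    hide_color=0,              # fill value for "off" superpixels
    num_samples=1000)          # size of the perturbed neighborhood

# Superpixels with the highest positive weights; the rest are greyed out.
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True,
    num_features=5, hide_rest=True)
outlined = mark_boundaries(temp / 255.0, mask)  # outline the kept superpixels
```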
Figure 15. Visual explanations through LIME: (a) Input CXRs; (b) Automatically segmented lung masks; (c) Cropped lung regions; (d) Superpixels with the highest positive weights with the others greyed out; (e) Superpixels with the highest positive weights superimposed on the cropped lungs.

5. Conclusions

We proposed a CNN-based decision support system to detect pneumonia in pediatric CXRs to expedite accurate diagnosis of the pathology. We applied novel and state-of-the-art visualization strategies to explain model predictions, which is highly significant for clinical decision-making. The study presents a universal approach that applies to an extensive range of visual recognition tasks.
Classifying pneumonia in chest radiographs is a demanding task due to the high degree of variability in the input data. The promising performance of the customized VGG16 model trained on the current tasks suggests that it effectively learns from a sparse collection of complex data with reduced bias and improved generalization. We hope that our results are useful for developing clinically useful solutions to detect and distinguish pneumonia types in chest radiographs.

Author Contributions: Conceptualization, S.R. and S.A.; Methodology, S.R.; Software, S.R., S.C., and I.K.; Validation, S.R., S.C., and I.K.; Formal Analysis, S.R.; Investigation, S.R. and S.A.; Resources, S.R. and S.C.; Data Curation, S.R.; Writing-Original Draft Preparation, S.R.; Writing-Review & Editing, S.A. and G.T.; Visualization, S.R.; Supervision, S.A. and G.T.; Project Administration, S.A. and G.T.; Funding Acquisition, S.A. and G.T.

Funding: This work was supported by the Intramural Research Program of the Lister Hill National Center for Biomedical Communications (LHNCBC), the National Library of Medicine (NLM), and the U.S. National Institutes of Health (NIH).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Le Roux, D.M.; Myer, L.; Nicol, M.P. Incidence and severity of childhood pneumonia in the first year of life in a South African birth cohort: The Drakenstein Child Health Study. Lancet Glob. Health 2015, 3, e95–e103. [CrossRef]
2. Mcluckie, A. Respiratory Disease and Its Management, 1st ed.; Springer: London, UK, 2009; pp. 51–59. ISBN 978-1-84882-094-4.
3. Cherian, T.; Mulholland, E.K.; Carlin, J.B.; Ostensen, H.; Amin, R.; De Campo, M.; Greenberg, D.; Lagos, R.; Lucero, M.; Madhi, S.A.; et al. Standardized interpretation of paediatric chest radiographs for the diagnosis of pneumonia in epidemiological studies. Bull. World Health Organ. 2005, 83, 353–359. [PubMed]
4. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3462–3471.
5. Karargyris, A.; Siegelman, J.; Tzortzis, D.; Jaeger, S.; Candemir, S.; Xue, Z.; KC, S.; Vajda, S.; Antani, S.K.; Folio, L.; et al. Combination of texture and shape features to detect pulmonary abnormalities in digital chest X-rays. Int. J. Comput. Assist. Radiol. Surg. 2016, 11, 99–106. [CrossRef] [PubMed]
6. Neuman, M.I.; Lee, E.Y.; Bixby, S.; Diperna, S.; Hellinger, J.; Markowitz, R.; Servaes, S.; Monuteaux, M.C.; Shah, S.S. Variability in the Interpretation of Chest Radiographs for the Diagnosis of Pneumonia in Children. J. Hosp. Med. 2012, 7, 294–298. [CrossRef] [PubMed]
7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
8. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
9. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 818–833.
10. Bar, Y.; Diamant, I.; Wolf, L.; Lieberman, S.; Konen, E.; Greenspan, H. Chest Pathology Detection Using Deep Learning with Non-Medical Training. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Brooklyn, NY, USA, 16–19 April 2015; pp. 294–297.
11. Oliveira, L.L.G.; Silva, S.A.E.; Ribeiro, L.H.V.; De Oliveira, R.M.; Coelho, C.J.; Ana Lúcia, A.L.S. Computer-Aided Diagnosis in Chest Radiography for Detection of Childhood Pneumonia. Int. J. Med. Inform. 2008, 77, 555–564. [CrossRef] [PubMed]
12. Abe, H.; Macmahon, H.; Shiraishi, J.; Li, Q.; Engelmann, R.; Doi, K. Computer-aided diagnosis in chest radiology. Semin. Ultrasound CT MR 2004, 25, 432–437. [CrossRef] [PubMed]
13. Giger, M.; MacMahon, H. Image processing and computer-aided diagnosis. Radiol. Clin. N. Am. 1996, 34, 565–596. [PubMed]
14. Monnier-Cholley, L.; MacMahon, H.; Katsuragawa, S.; Morishita, J.; Ishida, T.; Doi, K. Computer-aided diagnosis for detection of interstitial opacities on chest radiographs. AJR Am. J. Roentgenol. 1998, 171, 1651–1656. [CrossRef] [PubMed]
15. Kermany, D.S.; Goldbaum, M.; Cai, W.; Valentim, C.C.S.; Liang, H.; Baxter, S.L.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 2018, 172, 1122–1131. [CrossRef] [PubMed]
16. Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.; Shpanskaya, K.; et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv 2018. Available online: https://arxiv.org/abs/1711.05225 (accessed on 23 January 2018).
17. Guan, Q.; Huang, Y.; Zhong, Z.; Zheng, Z.; Zheng, L.; Yang, Y. Diagnose like a Radiologist: Attention Guided Convolutional Neural Network for Thorax Disease Classification. arXiv 2018. Available online: https://arxiv.org/abs/1801.09927v1 (accessed on 17 June 2018).
18. Candemir, S.; Jaeger, S.; Palaniappan, K.; Musco, J.P.; Singh, R.K.; Xue, Z.; Karargyris, A.; Antani, S.; Thoma, G.; McDonald, C.J. Lung Segmentation in Chest Radiographs Using Anatomical Atlases with Nonrigid Registration. IEEE Trans. Med. Imaging 2014, 33, 577–590. [CrossRef] [PubMed]
19. Candemir, S.; Antani, S.; Jaeger, S.; Browning, R.; Thoma, G. Lung Boundary Detection in Pediatric Chest X-Rays. In Proceedings of the SPIE Medical Imaging, Orlando, FL, USA, 21–26 February 2015; Volume 9418, p. 94180Q.
20. Liu, C.; Yuen, J.; Torralba, A. SIFT Flow: Dense Correspondence across Scenes and Its Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 978–994. [CrossRef] [PubMed]
21. Snoek, J.; Rippel, O.; Adams, R.P. Scalable Bayesian Optimization Using Deep Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 2171–2180.
22. Deep Learning Using Bayesian Optimization. Available online: https://www.mathworks.com/help/nnet/examples/deep-learning-using-bayesian-optimization.html (accessed on 14 January 2018).
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
24. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–32.
26. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. JMLR 2012, 13, 281–305.
27. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
28. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
29. Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
30. Sharma, S.; Maycher, B.; Eschun, G. Radiological imaging in pneumonia: Recent innovations. Curr. Opin. Pulm. Med. 2007, 13, 159–169. [CrossRef] [PubMed]

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Figure 1. Pediatric CXRs: (a) Normal CXR showing clear lungs with no abnormal opacification; (b) Bacterial pneumonia exhibiting focal lobar consolidation in the right upper lobe; (c) Viral pneumonia manifesting with diffuse interstitial patterns in both lungs.

Computer-aided diagnostic (CADx) tools aim to supplement clinical decision-making. They combine elements of computer vision and artificial intelligence with radiological image processing for recognizing patterns [4]. Much of the published literature describes machine learning (ML) algorithms that use handcrafted feature descriptors [5] that are optimized for individual datasets and trained for specific variability in size, orientation, and position of the region of interest (ROI) [6]. In recent years, data-driven deep learning (DL) methods have been shown to avoid the issues with handcrafted features through end-to-end feature extraction and classification.

Convolutional neural networks (CNNs) belong to a class of DL models that are prominently used in computer vision [7]. These models have multiple processing layers to learn hierarchical feature representations from the input pixel data. The features in the early network layers are abstracted through the mechanisms of local receptive fields, weight sharing, and pooling to form rich feature representations toward learning and classifying the inputs to their respective classes. Due to the lack of sufficiently extensive medical image data, CNNs trained on large-scale data collections such as ImageNet [8] are used to transfer the knowledge of learned representations, in the form of generic image features, to the current task. CNNs are also shown to deliver promising results in object detection and localization tasks [9].

The astounding success of deep CNNs coupled with the lack of explainable decision-making has resulted in a perception of doubt.
This poorly understood model behavior has limited their use in routine clinical practice [10]. There are not enough studies pertaining to the visualization and interpretation of CNNs in medical image analysis/understanding applications. In this article, we (i) detect and distinguish pneumonia types in pediatric CXRs, and (ii) explain the internal operations and predictions of CNNs applied to this challenge.

In this study, we evaluate, visualize, and explain the predictions of CNN models in classifying pediatric CXRs to detect pneumonia, and furthermore to differentiate between bacterial and viral pneumonia to facilitate swift referrals that require urgent medical intervention. We propose a novel method to visualize the class-specific ROI that is considered significant for correct predictions across all the inputs that belong to an expected class. We evaluate and statistically validate the performance of different customized CNNs that are trained end-to-end on the dataset under study to provide an accurate and timely diagnosis of the pathology. The work is organized as follows: Section 2 discusses the related work, Section 3 elaborates on the materials and methods, Section 4 discusses the results, and Section 5 concludes the study.

2. Related Work

A study of the literature reveals several works pertaining to the use of handcrafted features for detecting pneumonia in chest radiographs [11–14]. However, few studies reported the performance of DL methods applied to pneumonia detection in pediatric CXRs. Relatively few researchers attempted to offer a qualitative explanation of their model's learned behavior, internal computations, and predictions.
The authors of [15] used a pretrained InceptionV3 model as a fixed feature extractor to classify normal and pneumonia-infected pediatric CXRs and further distinguish between bacterial and viral pneumonia, with an area under the curve (AUC) of 0.968 and 0.940 respectively. In another study [4], the authors used a gradient-based ROI localization algorithm to detect and spatially locate pneumonia in CXRs. They released the largest collection of the National Institutes of Health (NIH) CXR dataset, which contains 112,120 frontal CXRs; the associated labels are text-mined from radiological reports using natural language processing tools. The authors reported an AUC of 0.633 toward detecting the disease. The authors of [16] used a gradient-based visualization method to localize the ROI with heat maps toward pneumonia detection. They used a 121-layer densely connected neural network toward estimating the disease probability and obtained an AUC of 0.768 toward detecting pneumonia. The authors of [17] used an attention-guided mask inference algorithm to locate salient image regions that stand indicative of pneumonia. The features of local and global network branches in the proposed model are concatenated to estimate the probability of the disease. An AUC of 0.776 is reported for pneumonia detection.

3. Materials and Methods

3.1. Data Collection and Preprocessing

We used a set of pediatric CXRs that have been made publicly available by the authors of [15]. The authors obtained approvals from the Institutional Review Board (IRB) and Ethics Committee toward data collection and experimentation. The dataset includes anteroposterior CXRs of children from 1 to 5 years of age, collected from Guangzhou Women and Children's Medical Center in Guangzhou, China. The characteristics of the data and its distribution are shown in Table 1. The dataset is screened for quality control to remove unreadable and low-quality radiographs and curated by experts to avoid grading errors.

Table 1. Dataset and its characteristics.

Category    Training Samples    Test Samples    File Type
Normal      1349                234             JPG
Bacterial   2538                242             JPG
Viral       1345                148             JPG

The CXRs contain regions other than the lungs that do not contribute to diagnosing pneumonia. Under these circumstances, the model may learn irrelevant feature representations from the underlying data. Using an algorithm based on anatomical atlases [18] to automatically detect the lung ROI can avoid this. A reference set of patient CXRs with expert-delineated lung masks is used as models [19] to register with the objective pediatric CXR. When presented with an objective chest radiograph, the algorithm uses the Bhattacharyya distance measure to select the most similar model CXRs. The correspondence between the model CXRs and the objective CXR is computed by modeling the objective CXR with local image feature representations and identifying similar locations by applying the SIFT-flow algorithm [20]. This map is the transformation applied to the model lung masks to transform them into the approximate lung model for the objective chest radiograph. The lung boundaries are cropped to the size of a bounding box to include all the lung pixels that constitute the ROI for the current task. The baseline data (whole CXRs) and the cropped bounding box are resampled to 1024 × 1024 pixel dimensions and mean-normalized to assist the models in faster convergence. The detected lung boundaries for the sample pediatric CXRs are shown in Figure 2.
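A minimal sketch of this resampling and mean-normalization step is shown below, assuming OpenCV for image I/O; path is a hypothetical CXR file path, and the exact preprocessing code used in the study may differ.

```python
# Minimal sketch: resample a CXR to 1024 x 1024 and zero-center its intensities.
import cv2
import numpy as np

def load_and_preprocess(path, size=(1024, 1024)):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, size).astype(np.float32)
    return img - img.mean()  # mean normalization for faster convergence

# Hypothetical usage: stack a list of file paths into a model-ready batch.
# batch = np.stack([load_and_preprocess(p) for p in paths])[..., np.newaxis]
```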
Figure 2. Detected boundaries in sample pediatric CXRs.

3.2. Configuring CNNs for Pneumonia Detection

We evaluated the performance of different customized CNNs and a VGG16 model in detecting pneumonia and furthermore distinguishing between bacterial and viral types to facilitate timely and accurate disease diagnosis. We evaluated three different customized CNN architectures: (i) a sequential CNN; (ii) a CNN with residual connections (residual CNN); and (iii) a CNN with Inception modules (Inception CNN).

3.2.1. Sequential CNN

A sequential CNN model belongs to the class of deep, feed-forward artificial neural networks that are commonly applied to visual recognition [7]. It is a linear stack of convolutional, nonlinear, pooling, and dense layers. We optimized the sequential CNN architecture and its hyperparameters for the datasets under study through Bayesian learning [21,22]. The procedure uses a Gaussian process model of an objective function and its evaluation to optimize the network depth, learning rate, momentum, and L2-regularization. These parameters are passed as arguments in the form of optimization variables to evaluate the objective function. We initialized the search ranges to [1, 10], [1 × 10⁻⁷, 1 × 10⁻¹], [0.7, 0.99], and [1 × 10⁻¹⁰, 1 × 10⁻²] for the network depth, learning rate, momentum, and L2-regularization respectively. The objective function takes these variables as input, then trains, validates, and saves the optimal network that gives the minimum classification error on the test data. Figure 3 illustrates the steps involved in optimization.
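The study performs this search with the Matlab Bayesian optimization tooling [21,22]; purely as an illustration, an equivalent Gaussian-process search can be sketched in Python with scikit-optimize, where train_and_evaluate is a hypothetical function that trains a candidate sequential CNN and returns its test classification error.

```python
# Hedged sketch: Gaussian-process search over the four hyperparameters named
# above, using the search ranges stated in the text. `train_and_evaluate` is
# a hypothetical user-supplied function, not part of the study's code.
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [
    Integer(1, 10, name="depth"),                       # network depth
    Real(1e-7, 1e-1, prior="log-uniform", name="lr"),   # learning rate
    Real(0.7, 0.99, name="momentum"),
    Real(1e-10, 1e-2, prior="log-uniform", name="l2"),  # L2-regularization
]

def objective(params):
    depth, lr, momentum, l2 = params
    return train_and_evaluate(depth, lr, momentum, l2)  # test classification error

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("Best hyperparameters:", result.x)
```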
3.2.2. Residual CNN

In a sequential CNN, the succeeding network layer learns feature representations from only the preceding layer. These networks are constrained by the level of information they can process. Residual networks, proposed by [23], won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. These networks tackle the issue of representational bottlenecks by injecting the information from the earlier network layers downstream to prevent loss of information. They also prevent the gradients from vanishing by introducing a linear information carry track to propagate gradients through deep network layers. In this study, we propose a customized CNN that is made up of six residual blocks, as shown in Figure 4.

3.2.3. Inception CNN

The Inception architecture, proposed by [24], consists of independent modules having parallel branches that are concatenated to form the resultant feature map that is fed into the succeeding modules. Unlike a sequential CNN, this method of stacking modules helps in separately learning the spatial and channel-wise feature representations. The 1 × 1 convolution filters used in these modules factor out the channel-wise and spatial feature learning by computing features from the channels without mixing spatial information, looking at one input tile at a given point in time. We construct a customized Inception CNN by stacking six InceptionV3 modules [24], as shown in Figure 5.

Figure 3. Flowchart describing the optimization procedure.

Figure 4. The architecture of customized residual CNN: (a) Residual block; (b) Customized residual CNN stacked with six residual blocks.

Figure 5. The architecture of customized InceptionV3 CNN: (a) InceptionV3 module; (b) Customized Inception CNN stacked with six InceptionV3 modules.
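To make the two building blocks concrete, the following Keras sketch shows a generic residual block and an Inception-style module; the filter counts and kernel sizes are illustrative assumptions, not the exact configuration of the customized models.

```python
# Minimal sketch of the two building blocks described above (Keras functional API).
from tensorflow.keras import layers

def residual_block(x, filters):
    # Two convolutions plus a linear shortcut that carries the input forward.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:  # match channel depth when needed
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

def inception_module(x, filters):
    # Parallel branches concatenated channel-wise; 1x1 convolutions factor out
    # channel-wise feature learning from spatial feature learning.
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(filters, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])
```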
3.2.4. Customized VGG16

VGG16 is proposed and trained by Oxford's Visual Geometry Group (VGG) [25] for object recognition. The model scored first in the ILSVRC image localization task and second in the image classification task. We customized the architecture of the VGG16 model and evaluated its performance toward the tasks of interest. The model is truncated at the deepest convolutional layer, and a global average pooling (GAP) and dense layer are added, as shown in Figure 6. We refer to this model as customized VGG16 in this study.
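A minimal Keras sketch of this customization follows; the ImageNet weight initialization and the layer name block5_conv3 are assumptions based on the standard Keras VGG16 implementation, not details confirmed by the study.

```python
# Minimal sketch: VGG16 truncated at its deepest convolutional layer, followed
# by a GAP and a dense softmax layer, as described in the text.
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

def build_customized_vgg16(num_classes, input_shape=(1024, 1024, 3)):
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    conv = base.get_layer("block5_conv3").output  # deepest convolutional layer
    x = layers.GlobalAveragePooling2D()(conv)     # GAP makes the model CAM-compatible
    out = layers.Dense(num_classes, activation="softmax")(x)
    return Model(base.input, out)

model = build_customized_vgg16(num_classes=2)  # 2 for binary, 3 for multi-class
```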
The hyperparameters of the customized residual, Inception, and VGG16 models are optimized through a randomized grid search [26] that searches and optimizes the values of hyperparameters including the learning rate, momentum, and L2-regularization. The search ranges are initialized to [1 × 10⁻⁶, 1 × 10⁻¹], [0.7, 0.99], and [1 × 10⁻¹⁰, 1 × 10⁻¹] for the learning rate, momentum, and L2-regularization respectively. Callbacks are used to view the internal states during training and retain the best performing model for analysis. We performed hold-out testing with the test data after every step. The performance of the customized CNNs is evaluated in terms of the following performance metrics: (i) accuracy; (ii) AUC; (iii) precision; (iv) recall; (v) specificity; (vi) F-score; and (vii) Matthews correlation coefficient (MCC). We used the NIH Biowulf Linux cluster (https://hpc.nih.gov/) and the high performance computing facility at the National Library of Medicine (NLM) for computational analyses. Software frameworks included with Matlab R2017b are used to configure and evaluate the sequential CNN, along with Keras with a TensorFlow backend for the other customized models used in this study.

Figure 6. VGG16 model truncated at the deepest convolutional layer and added with a GAP and dense layer.

3.3. Visualization Studies

The interpretation and understanding of CNNs is a hotly debated topic in ML, particularly in the context of clinical decision-making [4]. CNNs are perceived as black boxes, and it is imperative to explain their working to build trust in their predictions [9]. This helps to understand their working principles, assist in hyperparameter tuning and optimization, identify and get an intuition of the reasons behind model failures, and explain the predictions to the end-user prior to possible deployment. The methods of visualizing CNNs are broadly categorized into (i) preliminary methods that help to visualize the overall structure of the model; and (ii) gradient-based methods that manipulate the gradients from the forward and backward pass during training [27]. We demonstrated the overall structure of the CNNs, as shown in Figures 4–6.

3.3.1. Visual Explanation through Discriminative Localization

The trained model focuses on discriminative parts of the image to arrive at the predictions. Class activation maps (CAM) help in visualizing and debugging model predictions, particularly in the case of a prediction error when the model predicts based on the surrounding context [27]. The output of the GAP layer is fed to the dense layer to identify the discriminative ROI localized to classify the inputs to their respective classes.
Let G denote the GAP that spatially averages the m-th feature map from the deepest convolutional layer, and let w_m^p denote the weight connecting the m-th feature map to the output neuron corresponding to the expected class p. A prediction score S_p at the output neuron is expressed as a weighted sum of the GAP, as shown in Equation (1):

S_p = \sum_{m} w_m^p \sum_{x,y} g_m(x, y) = \sum_{x,y} \sum_{m} w_m^p g_m(x, y)    (1)

The value g_m(x, y) denotes the m-th feature map activation at the spatial location (x, y). The CAM for the class p, denoted by CAM_p, is expressed as the weighted sum of the activations from all the feature maps with respect to the expected class p at the spatial location (x, y), as shown in Equation (2):

CAM_p(x, y) = \sum_{m} w_m^p g_m(x, y)    (2)

CAM gives information pertaining to the importance of the activations at each spatial grid (x, y) for classifying an input image to its expected class p. It is rescaled to the size of the input image to locate the discriminative ROI used to classify the image to its expected class. This helps to answer queries pertaining to the ability of the model to predict and localize the ROI specific to its category. We propose a novel visualization method called average-CAM to represent the class-level ROI that is most commonly considered significant for correct prediction across all the inputs that belong to a given class. The average-CAM for the class p is computed by averaging the CAM outputs over all A images of that class, as shown in Equation (3):

\text{average-}CAM_p(x, y) = \frac{1}{A} \sum_{a=1}^{A} CAM_p^a(x, y)    (3)

CAM_p^a(x, y) denotes the CAM for the a-th image in the expected class p. This helps to identify the ROI specific to the expected class, improve the interpretability of the internal representations, and explain the model predictions.
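Equations (2) and (3) translate directly into array operations; the following NumPy sketch assumes hypothetical inputs extracted from the trained model: feature_maps (the H × W × M activations of the deepest convolutional layer for one image) and class_weights (the M weights w_m^p into the output neuron for class p).

```python
# Minimal sketch of Equations (2) and (3): CAM as the class-weighted sum of
# the deepest feature maps, and average-CAM as the mean CAM over a class.
import numpy as np

def compute_cam(feature_maps, class_weights):
    # CAM_p(x, y) = sum_m w_m^p * g_m(x, y)   -- Equation (2)
    return np.tensordot(feature_maps, class_weights, axes=([-1], [0]))

def average_cam(cams):
    # average-CAM_p(x, y) = (1/A) * sum_a CAM_p^a(x, y)   -- Equation (3)
    return np.mean(np.stack(cams), axis=0)
```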
CAM visualization can only be applied to networks with a GAP layer. Gradient-weighted CAM (grad-CAM) is a strict generalization of CAM that can be applied to all existing CNNs [28]. It uses the gradient information of the expected class, flowing back into the deepest convolutional layer, to generate explanations. Grad-CAM produces the weighted sum of all the feature maps in the deepest convolutional layer for the expected class p, as shown in Equation (4). A ReLU nonlinearity is applied to prevent negative weights from influencing the class p, based on the consideration that pixels with negative weights are likely to belong to other classes:

\text{grad-}CAM_p(x, y) = \mathrm{ReLU}\left( \sum_{m} \beta_m^p \, g_m(x, y) \right)    (4)

The value \beta_m^p is obtained by computing the gradient of the prediction score S_p with respect to the m-th feature map, as shown in Equation (5):
\beta_m^p = \sum_{x,y} \frac{\partial S_p}{\partial g_m(x, y)}    (5)

According to Equations (1) and (4), \beta_m^p is precisely the same as w_m^p for networks with a CAM-compatible architecture. The difference lies in applying the ReLU non-linearity to exclude the influence of negative weights that are likely to belong to other classes. The average-grad-CAM for the class p is computed by averaging the grad-CAM outputs, as shown in Equation (6). The value grad-CAM_p^a(x, y) denotes the grad-CAM for the a-th image in the expected class p.

average\text{-}grad\text{-}CAM_p(x, y) = \frac{1}{A} \sum_{a=1}^{A} grad\text{-}CAM_p^a(x, y)    (6)
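In the same spirit, a hedged sketch of Equations (4) and (5) in TensorFlow/Keras follows; it assumes a trained Keras `model` and the name of its deepest convolutional layer, and it is an illustration rather than the authors' implementation. Averaging its output over all the images of a class gives the average-grad-CAM of Equation (6).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name):
    """Grad-CAM, Equations (4) and (5), for one image of shape (H, W, C)."""
    # Map the input to (deepest conv activations, class predictions).
    grad_model = tf.keras.Model(
        inputs=model.input,
        outputs=[model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_index]            # prediction score S_p
    grads = tape.gradient(score, conv_out)       # dS_p / dg_m(x, y)
    beta = tf.reduce_sum(grads, axis=(1, 2))     # Equation (5): sum over (x, y)
    cam = tf.einsum('bhwm,bm->bhw', conv_out, beta)
    return tf.nn.relu(cam)[0].numpy()            # Equation (4): ReLU

# average-grad-CAM (Equation (6)): np.mean over the maps of one class.
```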
3.3.2. Model-Agnostic Visual Explanations

Local interpretable model-agnostic explanations (LIME) is a visualization tool proposed by [29]. It helps to provide a qualitative interpretation of the relationship between perturbed input instances and the model predictions. The input image is divided into contiguous superpixels, and a dataset of perturbed input instances is constructed by turning on/off these interpretable components. The perturbed instances are weighted by their similarity to the explained instance. The algorithm approximates the CNN by a sparse, linear model that is weighted only in the neighborhood of the explained predictions. An explanation is generated in the form of the superpixels with the highest positive weights, which demonstrate the discriminative ROI localized by the model to classify the image to its expected class. Let k ∈ R^d be the explained instance and k' ∈ {0, 1}^{d'} the binary vector that denotes the presence/absence of a superpixel. Let g ∈ G denote the explanation, where G is a class of interpretable linear models, and let γ(g) denote the complexity measure associated with the explanation g ∈ G; here, γ(g) is the number of non-zero coefficients of the linear model.
Let m: R^d → R denote the explained model and m(k) the probability that k belongs to a given class. Let Π_k(x) denote the measure of proximity between an instance x and k, and let P(m, g, Π_k) denote the loss of g toward approximating m in the neighborhood defined by Π_k. The value P(m, g, Π_k) is minimized while the value of γ(g) remains low enough for interpretability. Equation (7) gives the explanations produced by LIME.

\beta(k) = \arg\min_{g \in G} P(m, g, \Pi_k) + \gamma(g)    (7)

The value P(m, g, Π_k) is approximated by drawing samples weighted by Π_k. Equation (8) shows an exponential kernel defined on the L2-distance function J with width ε. For a given perturbed sample b' ∈ {0, 1}^{d'} containing a fraction of the non-zero elements of k', the label for the explanation model, m(b), is obtained by recovering the sample in the original representation b ∈ R^d, as shown in Equation (9).

\Pi_k(b) = \exp\left( -J(k, b)^2 / \varepsilon^2 \right)    (8)

P(m, g, \Pi_k) = \sum_{b, b' \in B} \Pi_k(b) \left( m(b) - g(b') \right)^2    (9)

LIME provides explanations that help to make an informed decision about the trustworthiness of the predictions and to gain crucial insights into the model behavior.
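The following sketch makes Equations (7)–(9) concrete: binary superpixel masks are sampled, kernel-weighted by their distance to the unperturbed instance, and fit with a weighted linear model. All names are illustrative, the ridge penalty stands in for the sparsity term γ(g), and the released `lime` package should be preferred in practice.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_fn, image, segments, num_samples=1000, width=0.25):
    """Minimal LIME-style sketch of Equations (7)-(9).

    predict_fn: maps a batch of images to class probabilities, m(b)
    segments  : (H, W) integer superpixel labels (e.g., from skimage SLIC)
    Returns one weight per superpixel; high positive weights mark the ROI.
    """
    d = segments.max() + 1
    rng = np.random.default_rng(0)
    masks = rng.integers(0, 2, size=(num_samples, d))  # samples b' in {0, 1}^d'
    masks[0] = 1                                       # the unperturbed instance
    batch = np.stack([image * m[segments][..., None] for m in masks])
    labels = predict_fn(batch)[:, 0]                   # m(b) for the explained class
    dists = np.linalg.norm(masks - masks[0], axis=1)
    pi = np.exp(-(dists ** 2) / width ** 2)            # kernel, Equation (8)
    g = Ridge(alpha=1.0).fit(masks, labels, sample_weight=pi)  # weighted fit, Eq. (9)
    return g.coef_
```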
4. Results and Discussion

4.1. Performance Evaluation of Customized CNNs

Figure 7 shows the optimized architecture and parameters of the sequential CNN, obtained through Bayesian learning. We performed 100 objective function evaluations toward optimizing the model parameters. The optimized values are found to be 6, 1 × 10⁻³, 0.9, and 1 × 10⁻⁶ for the network depth, learning rate, momentum, and L2-regularization parameters, respectively. The number of convolutional layer filters is increased by a factor of 2 each time a max-pooling layer is used, to ensure roughly the same number of computations in each network layer. Rectified Linear Unit (ReLU) layers are added to introduce non-linearity and prevent vanishing gradients during backpropagation [7].

Figure 7. The optimized architecture of customized sequential CNN.
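For readers who want to reproduce this step, the sketch below shows one way to run such a search with scikit-optimize. The search ranges and the `build_and_evaluate` helper are our assumptions for illustration (the paper cites a MATLAB Bayesian optimization workflow [22]); only the budget of 100 evaluations is taken from the text.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Assumed ranges covering the four optimized quantities reported above.
space = [
    Integer(3, 8, name='depth'),
    Real(1e-5, 1e-1, prior='log-uniform', name='learning_rate'),
    Real(0.8, 0.99, name='momentum'),
    Real(1e-8, 1e-3, prior='log-uniform', name='l2'),
]

def objective(params):
    depth, lr, momentum, l2 = params
    # Placeholder: train the sequential CNN with these hyperparameters
    # and return the validation error to be minimized.
    return build_and_evaluate(depth, lr, momentum, l2)

result = gp_minimize(objective, space, n_calls=100, random_state=0)
print(result.x)  # e.g., [6, 1e-3, 0.9, 1e-6]
```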
Our analysis shows an increase in the performance of the residual and inception CNNs when the number of filters in the convolutional layers of the succeeding blocks is increased by a factor of 2. We found the optimal hyperparameter values for the residual, inception, and VGG16 models through a randomized grid search. The values are tabulated in Table 2.

The customized CNNs are evaluated with the baseline and cropped ROI data. The results are tabulated in Table 3. We observed that the performance of the models with the cropped ROI is relatively promising in comparison to the baseline in classifying normal and pneumonia-infected CXRs. This is expected because the models trained with the cropped ROI learn feature representations relevant to the task of interest.

Table 2. Optimal values for the hyperparameters of the customized CNNs obtained through a randomized grid search.

Model            | Learning Rate | Momentum | L2 Regularization
Residual CNN     | 1 × 10⁻³      | 0.9      | 1 × 10⁻⁶
Inception CNN    | 1 × 10⁻²      | 0.95     | 1 × 10⁻⁴
Customized VGG16 | 1 × 10⁻⁴      | 0.99     | 1 × 10⁻⁶

Table 3. Performance of customized CNNs with baseline and cropped ROI data.

Task | Data | Model | Accuracy | AUC | Precision | Recall | Specificity | F-Score | MCC
Normal vs. Pneumonia | Baseline | Customized VGG16 | 0.957 | 0.990 | 0.951 | 0.983 | 0.915 | 0.967 | 0.908
Normal vs. Pneumonia | Baseline | Sequential | 0.943 | 0.983 | 0.920 | 0.980 | 0.855 | 0.957 | 0.878
Normal vs. Pneumonia | Baseline | Residual | 0.910 | 0.967 | 0.908 | 0.954 | 0.838 | 0.931 | 0.806
Normal vs. Pneumonia | Baseline | Inception | 0.886 | 0.922 | 0.887 | 0.939 | 0.800 | 0.913 | 0.755
Normal vs. Pneumonia | Cropped ROI | Customized VGG16 | 0.962 | 0.993 | 0.977 | 0.962 | 0.962 | 0.970 | 0.918
Normal vs. Pneumonia | Cropped ROI | Sequential | 0.941 | 0.984 | 0.930 | 0.995 | 0.877 | 0.955 | 0.873
Normal vs. Pneumonia | Cropped ROI | Residual | 0.917 | 0.971 | 0.913 | 0.959 | 0.847 | 0.936 | 0.820
Normal vs. Pneumonia | Cropped ROI | Inception | 0.897 | 0.932 | 0.896 | 0.947 | 0.817 | 0.921 | 0.778
Bacterial vs. Viral Pneumonia | Baseline | Customized VGG16 | 0.936 | 0.962 | 0.920 | 0.984 | 0.860 | 0.951 | 0.862
Bacterial vs. Viral Pneumonia | Baseline | Sequential | 0.928 | 0.954 | 0.909 | 0.984 | 0.838 | 0.946 | 0.848
Bacterial vs. Viral Pneumonia | Baseline | Residual | 0.897 | 0.921 | 0.880 | 0.967 | 0.784 | 0.922 | 0.780
Bacterial vs. Viral Pneumonia | Baseline | Inception | 0.854 | 0.901 | 0.841 | 0.934 | 0.714 | 0.886 | 0.675
Bacterial vs. Viral Pneumonia | Cropped ROI | Customized VGG16 | 0.936 | 0.962 | 0.920 | 0.984 | 0.860 | 0.951 | 0.862
Bacterial vs. Viral Pneumonia | Cropped ROI | Sequential | 0.928 | 0.956 | 0.909 | 0.984 | 0.838 | 0.946 | 0.848
Bacterial vs. Viral Pneumonia | Cropped ROI | Residual | 0.908 | 0.933 | 0.888 | 0.976 | 0.798 | 0.930 | 0.802
Bacterial vs. Viral Pneumonia | Cropped ROI | Inception | 0.872 | 0.919 | 0.853 | 0.959 | 0.730 | 0.903 | 0.725
Normal vs. Bacterial vs. Viral Pneumonia | Baseline | Customized VGG16 | 0.917 | 0.938 | 0.917 | 0.905 | 0.958 | 0.911 | 0.873
Normal vs. Bacterial vs. Viral Pneumonia | Baseline | Sequential | 0.896 | 0.922 | 0.888 | 0.885 | 0.948 | 0.887 | 0.841
Normal vs. Bacterial vs. Viral Pneumonia | Baseline | Residual | 0.861 | 0.887 | 0.868 | 0.882 | 0.933 | 0.875 | 0.809
Normal vs. Bacterial vs. Viral Pneumonia | Baseline | Inception | 0.809 | 0.846 | 0.753 | 0.848 | 0.861 | 0.798 | 0.688
Normal vs. Bacterial vs. Viral Pneumonia | Cropped ROI | Customized VGG16 | 0.918 | 0.939 | 0.920 | 0.900 | 0.960 | 0.910 | 0.876
Normal vs. Bacterial vs. Viral Pneumonia | Cropped ROI | Sequential | 0.897 | 0.923 | 0.898 | 0.898 | 0.949 | 0.898 | 0.844
Normal vs. Bacterial vs. Viral Pneumonia | Cropped ROI | Residual | 0.879 | 0.909 | 0.883 | 0.890 | 0.941 | 0.887 | 0.825
Normal vs. Bacterial vs. Viral Pneumonia | Cropped ROI | Inception | 0.821 | 0.865 | 0.778 | 0.855 | 0.878 | 0.815 | 0.714
* Bold numbers indicate superior performance.

The customized VGG16 model demonstrates more promising performance than the other CNNs under study. The model learned generic image features from ImageNet that served as a better initialization than random weights, and was trained end-to-end on the current tasks to learn task-specific features. This results in faster convergence with reduced bias and overfitting, and improved generalization.
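A minimal sketch of such a customized, CAM-compatible VGG16 in Keras follows: an ImageNet-pretrained convolutional base topped with a GAP layer and a softmax output, fine-tuned end-to-end. The input size and the omission of the L2 penalty from Table 2 are simplifications of ours; the learning rate and momentum mirror Table 2.

```python
import tensorflow as tf

def build_customized_vgg16(num_classes, input_shape=(224, 224, 3)):
    """CAM-compatible VGG16: pretrained conv base + GAP + softmax head."""
    base = tf.keras.applications.VGG16(
        weights='imagenet', include_top=False, input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)   # GAP layer
    out = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    model = tf.keras.Model(base.input, out)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.99),
        loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```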
In classifying bacterial and viral pneumonia, no significant difference in performance is observed for the customized VGG16 model between the baseline and cropped ROI data. In the multi-class classification task, the cropped ROI gave better results than the baseline data; however, the differences in performance are not significant. This may be because the dataset under study already appeared cropped, and the boundary detection algorithm produced a few under-segmented regions near the costophrenic angle. The customized sequential, residual, and inception CNNs with random weight initializations did not have the opportunity to learn discriminative features, owing to the sparse availability and imbalanced distribution of training data across the expected classes. We observed that the sequential CNN outperformed its residual and inception counterparts across the classification tasks. The usage of residual connections is beneficial in resolving the issue of representational bottlenecks and vanishing gradients in deep models, but the CNNs used in this study have a shallow architecture, so the residual connections did not introduce significant performance gains for the tasks of interest. Unlike ImageNet, the variability in the pediatric CXR data is several orders of magnitude smaller. The architectures of the residual and inception CNNs are progressively more complex and did not seem to be a fitting tool for the tasks of interest.
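The metrics reported in Tables 3 and 4 can be reproduced from the test-set predictions with scikit-learn; the sketch below covers the binary tasks, and the function and variable names are our assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Binary-task metrics as reported in Tables 3 and 4.

    y_true: ground-truth labels in {0, 1}; y_prob: predicted probabilities.
    """
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        'accuracy':    accuracy_score(y_true, y_pred),
        'auc':         roc_auc_score(y_true, y_prob),
        'precision':   precision_score(y_true, y_pred),
        'recall':      recall_score(y_true, y_pred),
        'specificity': tn / (tn + fp),
        'f_score':     f1_score(y_true, y_pred),
        'mcc':         matthews_corrcoef(y_true, y_pred),
    }
```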
The confusion matrices and AUC achieved with the customized VGG16 model are shown in Figures 8–10. We observed that the training metrics are poor compared to the test accuracy. This is because noisy images are included in the training data to reduce bias and overfitting, and to improve model generalization.

We compared the performance of the customized VGG16 model trained with the cropped ROI to the state-of-the-art. The results are tabulated in Table 4. We observed that our model outperforms the current literature in all performance metrics across the classification tasks. The customized sequential CNN demonstrates: (i) higher recall in classifying normal and pneumonia CXRs; and (ii) recall identical to the customized VGG16 model in classifying bacterial and viral pneumonia. However, considering the balance between precision and recall as demonstrated by the F-score and MCC, the customized VGG16 model outperforms the other CNNs and the state-of-the-art across the classification tasks.

Figure 8. Confusion matrices for the performance of the customized VGG16 model: (a) Normal v. Pneumonia; (b) Bacterial v. Viral Pneumonia.
Figure 9. ROC curves demonstrating the performance of the customized VGG16 model: (a) Normal v. Pneumonia; (b) Bacterial v. Viral Pneumonia.

Table 4. Comparing the performance of the customized VGG16 model with the state-of-the-art.

Task | Model | Accuracy | AUC | Precision | Recall | Specificity | F-Score | MCC
Normal v. Pneumonia | Customized VGG16 | 0.962 | 0.993 | 0.977 | 0.962 | 0.962 | 0.970 | 0.918
Normal v. Pneumonia | Kermany et al. | 0.928 | 0.968 | - | 0.932 | 0.901 | - | -
Bacterial v. Viral Pneumonia | Customized VGG16 | 0.936 | 0.962 | 0.920 | 0.984 | 0.860 | 0.951 | 0.862
Bacterial v. Viral Pneumonia | Kermany et al. | 0.907 | 0.940 | - | 0.886 | 0.909 | - | -
Normal v. Bacterial v. Viral Pneumonia | Customized VGG16 | 0.918 | 0.939 | 0.920 | 0.900 | 0.960 | 0.910 | 0.876
Normal v. Bacterial v. Viral Pneumonia | Kermany et al. | - | - | - | - | - | - | -
* Bold numbers indicate superior performance.

Figure 10. Performance of customized VGG16 model in multiclass classification: (a) Confusion matrix; (b) ROC curves.

4.2. Visualization through Discriminative Localization

The customized VGG16 model has a CAM-compatible architecture owing to the presence of the GAP layer. This helps in visualizing the model predictions using both the CAM and grad-CAM visualization tools. Figures 11 and 12 demonstrate the results of applying these visualizations to localize the discriminative ROI in pneumonia-infected CXRs.

Figure 11. Visual explanations through gradient-based localization using CAM: (a) Input CXRs; (b) Bounding boxes localizing regions of activations; (c) CAM showing heat maps superimposed on the original CXRs; (d) Automatically segmented lung masks; (e) CAM showing heat maps superimposed on the cropped lungs.
Figure 12. Visual explanations through gradient-based localization using grad-CAM: (a) Input CXRs; (b) Bounding boxes localizing regions of activations; (c) Grad-CAM showing heat maps superimposed on the original CXRs; (d) Automatically segmented lung masks; (e) Grad-CAM showing heat maps superimposed on the cropped lungs.

CXRs are fed to the trained model and the predictions are decoded. The heat maps are generated as a two-dimensional score grid, computed for each input pixel location. Pixels carrying high importance with respect to the expected class appear bright red, with distinct color transitions for the varying ranges. The generated heat maps are superimposed on the original input to localize the image-specific ROI. The lung masks generated with the boundary detection algorithm are applied to extract the localized ROI relevant to the lung regions. We observed that the CAM and grad-CAM visualizations generated heat maps for the pneumonia class that highlight the visual differences in the "pneumonia-like" regions of the image.
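The overlay step just described reduces to a colormap and a weighted blend. Below is a minimal OpenCV sketch; the function name, blending weight, and mask handling are our assumptions, not the authors' pipeline.

```python
import cv2
import numpy as np

def overlay_heatmap(cxr_gray, cam, lung_mask=None, alpha=0.4):
    """Superimpose a CAM/grad-CAM heat map on a chest radiograph.

    cxr_gray : (H, W) uint8 radiograph
    cam      : low-resolution activation map from the deepest conv layer
    lung_mask: optional (H, W) binary mask restricting the ROI to the lungs
    """
    h, w = cxr_gray.shape
    cam = cv2.resize(cam.astype(np.float32), (w, h))    # rescale to input size
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    if lung_mask is not None:
        cam = cam * (lung_mask > 0)                     # keep lung regions only
    heat = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
    base = cv2.cvtColor(cxr_gray, cv2.COLOR_GRAY2BGR)
    return cv2.addWeighted(heat, alpha, base, 1 - alpha, 0)
```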
We applied our novel average-CAM and average-grad-CAM methods to visualize the class-specific ROI, as shown in Figures 13 and 14. Lung masks are applied to the generated heat maps to localize only the ROI specific to the lung regions. We observed that the class-specific ROI localized by average-CAM and average-grad-CAM for the viral pneumonia class follows a diffuse pattern. This is expected, since viral pneumonia manifests with diffuse interstitial patterns in both lungs [30]. For the bacterial pneumonia class, we observed that the model layers are activated on both sides of the lungs, predominantly in the upper and middle right lung lobes. This is because bacterial pneumonia manifests as lobar consolidations [30], and the pneumonia dataset under study has more pediatric patients with right lobar consolidations.

Figure 13. Visual explanations through average-CAM: (a) Bacterial and viral CXR (top and bottom); (b) Average-CAM localizing class-specific ROI with bounding boxes highlighting the regions of maximum activation; (c) Automatically segmented lung masks; (d) Average-CAM localizing class-specific ROI with the extracted lung regions.

Figure 14. Visual explanations through average-grad-CAM: (a) Bacterial and viral CXR (top and bottom); (b) Average-grad-CAM localizing class-specific ROI with bounding boxes highlighting the regions of maximum activation; (c) Automatically segmented lung masks; (d) Average-grad-CAM localizing class-specific ROI with the extracted lung regions.

4.3. Visual Explanations with LIME

Figure 15 shows the explanations generated with LIME for sample instances of pediatric chest radiographs. Lung masks are applied to the explanations to localize only the ROI specific to the lung regions. The explanations are shown as follows: (i) superpixels with the highest positive weights, with the rest greyed out; and (ii) superpixels superimposed on the extracted lung regions. We observed that the explainer focused on the regions with high opacity. The model differentiates bacterial and viral pneumonia by showing superpixels with the highest positive activations (i) in the regions of lobar consolidation for bacterial pneumonia; and (ii) in diffuse interstitial patterns across the lungs for viral pneumonia. We also observed that a number of false-positive superpixels are reported. The reason is that the current LIME implementation uses a sparse linear model to approximate the model behavior in the neighborhood of the explained predictions. These explanations result from a random sampling process and are not faithful if the underlying model is highly non-linear in the locality of the predictions.

Figure 15. Visual explanations through LIME: (a) Input CXRs; (b) Automatically segmented lung masks; (c) Cropped lung regions; (d) Superpixels with the highest positive weights with the others greyed out; (e) Superpixels with the highest positive weights superimposed on the cropped lungs.

5. Conclusions

We proposed a CNN-based decision support system to detect pneumonia in pediatric CXRs to expedite accurate diagnosis of the pathology. We applied novel and state-of-the-art visualization strategies to explain model predictions, which is highly significant for clinical decision-making. The study presents a universal approach that applies to an extensive range of visual recognition tasks.
Classifying pneumonia in chest radiographs is a demanding task due to the presence of a high degree of variability in the input data. The promising performance of the customized VGG16 model trained on the current tasks suggests that it effectively learns from a sparse collection of complex data with reduced bias and improved generalization. We hope that our results are useful for developing clinically useful solutions to detect and distinguish pneumonia types in chest radiographs.

Author Contributions: Conceptualization, S.R. and S.A.; Methodology, S.R.; Software, S.R., S.C., and I.K.; Validation, S.R., S.C., and I.K.; Formal Analysis, S.R.; Investigation, S.R. and S.A.; Resources, S.R. and S.C.; Data Curation, S.R.; Writing-Original Draft Preparation, S.R.; Writing-Review & Editing, S.A. and G.T.; Visualization, S.R.; Supervision, S.A. and G.T.; Project Administration, S.A. and G.T.; Funding Acquisition, S.A. and G.T.

Funding: This work was supported by the Intramural Research Program of the Lister Hill National Center for Biomedical Communications (LHNCBC), the National Library of Medicine (NLM), and the U.S. National Institutes of Health (NIH).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Le Roux, D.M.; Myer, L.; Nicol, M.P. Incidence and severity of childhood pneumonia in the first year of life in a South African birth cohort: The Drakenstein Child Health Study. Lancet Glob. Health 2015, 3, e95–e103. [CrossRef]
2. Mcluckie, A. Respiratory Disease and Its Management, 1st ed.; Springer: London, UK, 2009; pp. 51–59. ISBN 978-1-84882-094-4.
3. Cherian, T.; Mulholland, E.K.; Carlin, J.B.; Ostensen, H.; Amin, R.; De Campo, M.; Greenberg, D.; Lagos, R.; Lucero, M.; Madhi, S.A.; et al. Standardized interpretation of paediatric chest radiographs for the diagnosis of pneumonia in epidemiological studies. Bull. World Health Organ. 2005, 83, 353–359. [PubMed]
4. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3462–3471.
5. Karargyris, A.; Siegelman, J.; Tzortzis, D.; Jaeger, S.; Candemir, S.; Xue, Z.; KC, S.; Vajda, S.; Antani, S.K.; Folio, L.; et al. Combination of texture and shape features to detect pulmonary abnormalities in digital chest X-rays. Int. J. Comput. Assist. Radiol. Surg. 2016, 11, 99–106. [CrossRef] [PubMed]
6. Neuman, M.I.; Lee, E.Y.; Bixby, S.; Diperna, S.; Hellinger, J.; Markowitz, R.; Servaes, S.; Monuteaux, M.C.; Shah, S.S. Variability in the Interpretation of Chest Radiographs for the Diagnosis of Pneumonia in Children. J. Hosp. Med. 2012, 7, 294–298. [CrossRef] [PubMed]
7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
8. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
9. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 818–833.
10. Bar, Y.; Diamant, I.; Wolf, L.; Lieberman, S.; Konen, E.; Greenspan, H. Chest Pathology Detection Using Deep Learning with Non-Medical Training. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Brooklyn, NY, USA, 16–19 April 2015; pp. 294–297.
11. Oliveira, L.L.G.; Silva, S.A.E.; Ribeiro, L.H.V.; De Oliveira, R.M.; Coelho, C.J.; Ana Lúcia, A.L.S. Computer-Aided Diagnosis in Chest Radiography for Detection of Childhood Pneumonia. Int. J. Med. Inform. 2008, 77, 555–564. [CrossRef] [PubMed]
12. Abe, H.; Macmahon, H.; Shiraishi, J.; Li, Q.; Engelmann, R.; Doi, K. Computer-aided diagnosis in chest radiology. Semin. Ultrasound CT MR 2004, 25, 432–437. [CrossRef] [PubMed]
13. Giger, M.; MacMahon, H. Image processing and computer-aided diagnosis. Radiol. Clin. N. Am. 1996, 34, 565–596. [PubMed]
14. Monnier-Cholley, L.; MacMahon, H.; Katsuragawa, S.; Morishita, J.; Ishida, T.; Doi, K. Computer-aided diagnosis for detection of interstitial opacities on chest radiographs. AJR Am. J. Roentgenol. 1998, 171, 1651–1656. [CrossRef] [PubMed]
15. Kermany, D.S.; Goldbaum, M.; Cai, W.; Valentim, C.C.S.; Liang, H.; Baxter, S.L.; McKeown, A.; Yang, G.; Wu, X.; Yan, F.; et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 2018, 172, 1122–1131. [CrossRef] [PubMed]
16. Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.; Shpanskaya, K.; et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv 2018. Available online: https://arxiv.org/abs/1711.05225 (accessed on 23 January 2018).
17. Guan, Q.; Huang, Y.; Zhong, Z.; Zheng, Z.; Zheng, L.; Yang, Y. Diagnose like a Radiologist: Attention Guided Convolutional Neural Network for Thorax Disease Classification. arXiv 2018. Available online: https://arxiv.org/abs/1801.09927v1 (accessed on 17 June 2018).
18. Candemir, S.; Jaeger, S.; Palaniappan, K.; Musco, J.P.; Singh, R.K.; Xue, Z.; Karargyris, A.; Antani, S.; Thoma, G.; McDonald, C.J. Lung Segmentation in Chest Radiographs Using Anatomical Atlases with Nonrigid Registration. IEEE Trans. Med. Imaging 2014, 33, 577–590. [CrossRef] [PubMed]
19. Candemir, S.; Antani, S.; Jaeger, S.; Browning, R.; Thoma, G. Lung Boundary Detection in Pediatric Chest X-Rays. In Proceedings of the SPIE Medical Imaging, Orlando, FL, USA, 21–26 February 2015; Volume 9418, p. 94180Q.
20. Liu, C.; Yuen, J.; Torralba, A. SIFT Flow: Dense Correspondence across Scenes and Its Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 978–994. [CrossRef] [PubMed]
21. Snoek, J.; Rippel, O.; Adams, R.P. Scalable Bayesian Optimization Using Deep Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 2171–2180.
22. Deep Learning Using Bayesian Optimization. Available online: https://www.mathworks.com/help/nnet/examples/deep-learning-using-bayesian-optimization.html (accessed on 14 January 2018).
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
24. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–32.
26. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. JMLR 2012, 13, 281–305.
27. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
28. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
29. Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
30. Sharma, S.; Maycher, B.; Eschun, G. Radiological imaging in pneumonia: Recent innovations. Curr. Opin. Pulm. Med. 2007, 13, 159–169. [CrossRef] [PubMed]

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
