DE GRUYTER Current Directions in Biomedical Engineering 2022;8(1): 34-37

Finn Behrendt*, Debayan Bhattacharya, Julia Krüger, Roland Opfer, and Alexander Schlaefer

Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs

https://doi.org/10.1515/cdbme-2022-0009

Abstract: Radiographs are a versatile diagnostic tool for the detection and assessment of pathologies, for treatment planning, and for navigation and localization purposes in clinical interventions. However, their interpretation and assessment by radiologists can be tedious and error-prone. Thus, a wide variety of deep learning methods have been proposed to support radiologists in interpreting radiographs. Mostly, these approaches rely on convolutional neural networks (CNN) to extract features from images. Especially for the multi-label classification of pathologies on chest radiographs (Chest X-Rays, CXR), CNNs have proven to be well suited. On the contrary, Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images and interpretable local saliency maps which could add value to clinical interventions. ViTs do not rely on convolutions but on patch-based self-attention, and, in contrast to CNNs, no prior knowledge of local connectivity is present. While this leads to increased capacity, ViTs typically require an excessive amount of training data, which represents a hurdle in the medical domain, as high costs are associated with collecting large medical data sets. In this work, we systematically compare the classification performance of ViTs and CNNs for different data set sizes and evaluate more data-efficient ViT variants (DeiT). Our results show that while the performance between ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.

Keywords: Deep Learning, Chest Radiograph, Vision Transformer, Convolutional Neural Network, CheXpert

*Corresponding author: Finn Behrendt, Institute of Medical Technology and Intelligent Systems, Hamburg University of Technology, e-mail: finn.behrendt@tuhh.de
Debayan Bhattacharya, Alexander Schlaefer, Institute of Medical Technology and Intelligent Systems, Hamburg University of Technology
Julia Krüger, Roland Opfer, Jung Diagnostics GmbH, Hamburg, Germany

Open Access. © 2022 The Author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.

1 Introduction

Chest radiographs (CXR) are commonly used for the identification, assessment and localization of pathologies. CXRs enable a cost- and time-effective examination with low radiation dose and allow clinicians to detect a wide range of diseases, plan treatments and localize specific anatomic structures. Therefore, CXRs are the most performed imaging study, with an annually increasing number of examinations [2, 9, 10, 13]. A direct consequence of the increasing number of CXR examinations is a significantly increased workload for radiologists, who need to assess a large number of CXRs manually in their daily routine, which can lead to an increased number of human errors [1, 2]. Thus, a well-integrated computer-assisted tool that gives cues to radiologists on what pathology might be present and where to look could accelerate clinical workflows and reduce the number of human errors. Furthermore, such systems could be especially helpful for inexperienced radiologists and help to prioritize assessments of CXRs [13]. Various computer-assisted tools, including feature engineering and later statistical models that learn from training data, have been proposed in the past for this task. Finally, the publication of large-scale data sets such as CheXpert or MIMIC-CXR [6, 7] paved the way towards human-level classification performance on CXRs with deep-learning-based CNNs [6, 8, 13]. Furthermore, CNNs have been proposed for a wide variety of tasks such as classification, localization, segmentation or automated report generation and have emerged as the de-facto standard for the processing of radiographs.
However, a recent publication challenges CNNs and proposes Vision Transformers (ViT), which use multi-headed self-attention between image patches instead of convolutions to learn meaningful feature representations from images [4]. Originally, transformer networks have shown strong performance for modeling and interpreting sequence data such as sentences and outperform traditional recurrent neural networks in many sequence-related tasks [12]. Applying the core principles of transformers to the image domain, as done in ViTs, has been shown to outperform plain CNNs on large-scale databases of generic images such as ImageNet. Besides the potential performance gains, ViTs share the appealing property of class-level local attention maps [12]. These attention maps could be helpful not only for the classification task on CXRs but also for tasks where the localization of anatomical structures is required. However, ViTs do not impose prior knowledge of the local connectivity of image pixels, as is the case with convolutions. Thus, ViTs require an excessive amount of training data and are often only applicable when pre-trained on large-scale data sets [11].
This opens the question of whether ViTs can be leveraged for multi-label classification problems on CXRs. As large-scale data sets are crucial for pre-training ViTs, it is of interest whether their performance improvement over CNNs also holds in an image domain different from ImageNet, with smaller medical data sets available for fine-tuning. In this work, we leverage ViTs for the classification of pathologies in CXRs and investigate the use of knowledge distillation for data efficiency. In summary, our contribution is three-fold:
– We investigate the use of ViTs for multi-label classification in CXRs and compare their performance to CNNs.
– We study if knowledge distillation with data-efficient Vision Transformers (DeiT) [11] can improve the classification performance.
– We systematically compare the effect of varying training set sizes for CNNs, ViTs and DeiTs, respectively.

2 Methods

2.1 Data Set

We use the publicly available CheXpert data set [6]. The data set consists of 224,316 CXRs of 65,240 patients together with 14 labels that are automatically generated from radiology reports. There are three types of auto-generated labels: the labels 1 and 0 indicate positive and negative findings, respectively, and the third label, -1, denotes an uncertain decision. In this work, we treat all uncertain samples as positive samples. Further, we focus on the classification of five different pathologies, namely Atelectasis, Cardiomegaly, Consolidation, Edema and Pleural Effusion. We split our data into a train, validation and test set. 20% of the data are used for evaluation (D_test). From the remaining data, we sample 5 data folds, each consisting of 80% training data (D_train^i) and 20% validation data (D_val^i), where i indicates the fold. To simulate varying training set sizes, we sample subsets comprising {10, 20, ..., 90}% of D_train for all folds, respectively. We do not apply any specific pre-processing to the images but resize them to a resolution of 224×224 px. For data augmentation, we apply RandAugment [3] and random erasing.
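The paper does not include code, but the pre-processing and augmentation described above map closely onto standard torchvision components. The following is a minimal sketch under that assumption; the augmentation magnitudes, the erasing probability and the exact label handling are illustrative choices, not values reported by the authors.

```python
# Illustrative sketch (not the authors' code) of the input pipeline from Sec. 2.1:
# resize to 224x224, RandAugment and random erasing for training; resize only for evaluation.
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((224, 224)),      # no further pre-processing, only resizing
    T.RandAugment(),           # random augment [3]; magnitude left at the torchvision default
    T.ToTensor(),
    T.RandomErasing(p=0.25),   # random erasing on the tensor image; p is an assumed value
])

eval_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
])

def map_uncertain_to_positive(labels: torch.Tensor) -> torch.Tensor:
    """Treat uncertain CheXpert labels (-1) as positive (1), as described in the text."""
    return torch.where(labels == -1, torch.ones_like(labels), labels)
```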
2.2 Deep Learning Models

For our experiments, we utilize DenseNets [5] as baseline CNNs, as they have proven to be a strong baseline for the classification task on CXRs [13]. In general, CNNs utilize blocks of convolutions, together with normalization, non-linear activation functions and pooling operations, stacked on each other to map an input image to a feature vector. A linear layer maps the feature vector to the output vector, which is compared with the class labels. DenseNets add specific skip connections between the convolutional blocks to allow training deep stacks of these blocks [5]. We compare different versions of the baseline CNN, namely DenseNet-121 and DenseNet-201, where the main difference is the depth of the architecture and thus the number of trainable parameters.
In contrast to CNNs, ViTs do not process image arrays by convolutions. Instead, the image X ∈ R^(H×W×C) is cropped into N patches x ∈ R^(N×(P²·C)), where H and W are the dimensions of the image, C is the number of channels and P is the resolution of the cropped patches. The patches x are flattened and mapped to a fixed dimension D by a linear layer. Additionally, a class token is prepended to the mapping, which is later used as input for a classification layer. Furthermore, a 1-dimensional position embedding is added to each patch embedding. The resulting sequence of patch embeddings, class token and position embeddings is used as input to the encoder of the ViT. The encoder consists of multiple stacked transformer blocks, each containing a multi-headed self-attention layer and a multilayer perceptron layer, with normalization layers and skip connections in between.
We include different versions of ViTs in our experiments, namely ViT-Small (ViT-S) and ViT-Base (ViT-B). The differences between the versions are the number of encoder layers, the dimension of the embeddings D, the MLP configuration and the number of attention heads [12].
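As a concrete illustration of the tokenization just described (cropping into patches, linear projection to dimension D, prepended class token and 1-D position embedding), a minimal PyTorch sketch is given below. The module name, patch size and embedding dimension are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of the ViT input tokenization described above. Names and default
# sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)    # maps P^2*C -> D
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # 1-D position embedding

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.patch_size
        # Crop the image into N = (H/P)*(W/P) non-overlapping patches, flatten each to P^2*C.
        x = x.unfold(2, p, p).unfold(3, p, p)                       # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)   # (B, N, P^2*C)
        x = self.proj(x)                                            # (B, N, D)
        cls = self.cls_token.expand(b, -1, -1)                      # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed             # prepend class token, add positions
        return x                                                    # input sequence for the encoder
```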
We further include data-efficient Vision Transformers (DeiT) [11] in our study. DeiTs share the same overall architecture as ViTs. In addition to the class token, a distillation token is added to the patch embedding. Similar to the class token, the distillation token interacts with the patch embeddings through self-attention in the encoder blocks and is processed by a classification layer to obtain an output vector. It is used in a knowledge distillation framework, where the Kullback-Leibler divergence between the output of a teacher network and the output of the distillation token is added to the loss function, together with the loss between the class token and the ground truth. The authors of [11] speculate that by this, the inductive bias of CNNs can be distilled into ViTs, which makes DeiTs more data-efficient compared to plain ViTs.
We include the pre-trained versions DeiT-S and DeiT-B in our studies and investigate two use-cases of DeiTs. First, we use pre-trained DeiT networks that apply the knowledge distillation process only during pre-training on ImageNet. Second, we investigate using knowledge distillation with a trained DenseNet-201 as a teacher network during fine-tuning on the CheXpert data set. These distilled models are denoted as DeiT-S-Dist and DeiT-B-Dist, respectively.
We train our networks for a maximum of 50 epochs and use early stopping based on the validation loss. We use the binary cross-entropy loss as loss function, with inverse frequency weighting to account for the class imbalance in the training data. For both CNNs and ViTs, we use AdamW as optimizer and a batch size of 128. We scale our learning rate with a cosine schedule and use two warm-up epochs in which the learning rate is increased linearly. While we use an initial learning rate of lr = 0.0001 for CNNs, ViTs require a smaller initial learning rate of lr = 0.00005. We search the hyperparameters based on the performance on the validation set D_val.
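A minimal sketch of the fine-tuning objective described above is given below, assuming a student that outputs separate logits for the class token and the distillation token, and a trained DenseNet-201 teacher. Because the task is multi-label, the sketch treats each label as an independent Bernoulli distribution for the Kullback-Leibler term; this interpretation, the temperature and the equal weighting of the two loss terms are assumptions, not details reported in the paper.

```python
# Sketch of the distillation objective: class-imbalance-weighted BCE on the class-token
# output plus a KL term between teacher outputs and the distillation-token output.
import torch
import torch.nn.functional as F

def distillation_loss(cls_logits, dist_logits, teacher_logits, targets, pos_weight, T=1.0):
    # Supervised term: weighted BCE between class-token logits and the (float) labels;
    # pos_weight is assumed to carry the inverse class frequencies mentioned in the text.
    loss_cls = F.binary_cross_entropy_with_logits(cls_logits, targets, pos_weight=pos_weight)
    # Distillation term: KL divergence between teacher and distillation-token predictions,
    # treating each label as an independent Bernoulli distribution (assumption).
    p_t = torch.sigmoid(teacher_logits / T)
    p_s = torch.sigmoid(dist_logits / T)
    kl = p_t * (torch.log(p_t + 1e-8) - torch.log(p_s + 1e-8)) \
       + (1 - p_t) * (torch.log(1 - p_t + 1e-8) - torch.log(1 - p_s + 1e-8))
    return loss_cls + kl.mean()
```

In training, this objective would be minimized with AdamW, a batch size of 128 and the cosine learning-rate schedule with two warm-up epochs described above.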
3 Results

We report the Area under the Receiver Operating Characteristic curve (AUROC) and the F1 score to evaluate the classification performance. Both metrics are calculated as weighted averages over the five different pathology classes, where each class is weighted by the number of true instances for its label. We report the average performance of the 5-fold cross-validation together with the standard deviation.

Tab. 1: Classification performance on the CheXpert data set. AUROC and F1 scores are given in percent as the average of the 5 cross-validation folds together with the standard deviation. Param. denotes the number of trainable parameters in millions. The suffix *-Dist denotes models that are fine-tuned with knowledge distillation with DenseNet-201 as a teacher.

Model          F1           AUROC        Param. (10^6)
DenseNet-121   63.05±0.77   81.91±0.56    6.96
DenseNet-201   62.79±0.62   81.59±0.71   18.10
ViT-S          62.67±0.24   81.79±0.38   21.67
ViT-B          62.32±0.39   81.92±0.50   85.80
DeiT-S         63.85±0.93   83.02±0.70   21.67
DeiT-B         64.93±0.88   84.02±0.90   85.81
DeiT-S-Dist    63.97±1.17   82.73±1.06   21.67
DeiT-B-Dist    65.51±0.79   84.56±0.91   85.81

Fig. 1: Classification performance for different proportions of the CheXpert data set used as training set (x-axis: number of training samples; y-axis: AUROC; curves for DenseNet-201, ViT-B and DeiT-B-Dist). The average AUROC values of the 5-fold cross-validation are reported in percent. Standard deviations are visualized as enveloping intervals.

As shown in Table 1, the ViT models are on par with the DenseNet baselines. Notably, DenseNet-121 shows competitive performance to ViT-B while requiring significantly fewer parameters. Considering DeiT, both variants show superior classification performance compared to DenseNet and ViT. Comparing DeiT-B and DeiT-B-Dist, similar classification performance can be observed.
Figure 1 shows that for all models the data set size has a crucial impact on the classification performance. For DeiT-B-Dist, a higher performance gain can be observed compared to DenseNet-201 and ViT-B, especially when training with larger training sets. Overall, even for small data set sizes, the transformer-based models show performance similar to the DenseNets.
To visualize the pixel-wise attention of the networks, saliency maps are provided in Figure 2. For the DenseNets, a Grad-CAM approach is used to visualize the attention. For the transformer-based models, the self-attention weights are visualized. It is noticeable that both networks attend to meaningful regions in the CXR. While visualizing the attention weights of the transformers leads to local attention maps, the Grad-CAM-based saliency maps rather highlight coarse regions.

Fig. 2: Attention visualization for DeiT-B (left) and DenseNet-B (right). The exemplary image shown is labelled with Atelectasis and Pleural Effusion. For DeiT-B, the attention map is directly accessed from the last layer and interpolated to the image dimension. For DenseNet-B, Grad-CAM is used to generate the attention visualization.
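For reference, upscaling a transformer attention map to the image resolution (as in Fig. 2) could be sketched as follows. The code assumes an attention tensor from the last encoder block with shape (heads, tokens, tokens) and the class token at index 0; for a DeiT with an additional distillation token, the patch tokens would start at index 2. This is an illustrative sketch, not the authors' visualization code.

```python
# Sketch of upscaling a ViT/DeiT self-attention map to the input resolution, in the
# spirit of Fig. 2. Grid and image sizes are illustrative defaults.
import torch
import torch.nn.functional as F

def class_attention_map(attn, grid_size=14, img_size=224):
    # Attention of the class token (index 0) to all patch tokens, averaged over heads.
    cls_attn = attn[:, 0, 1:].mean(dim=0)                    # (num_patches,)
    cls_attn = cls_attn.reshape(1, 1, grid_size, grid_size)  # coarse patch grid
    # Interpolate the coarse grid to the image dimension for overlaying on the CXR.
    cls_attn = F.interpolate(cls_attn, size=(img_size, img_size),
                             mode="bilinear", align_corners=False).squeeze()
    # Normalize to [0, 1] for visualization.
    return (cls_attn - cls_attn.min()) / (cls_attn.max() - cls_attn.min() + 1e-8)
```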
4 Discussion and Conclusion

Recently, ViTs have shown performance gains over classical CNNs on generic images from benchmark data sets such as ImageNet. Furthermore, they add appealing properties like directly accessible and local attention maps. However, due to the missing inductive bias and the exceeding number of trainable parameters, training ViTs requires large-scale data sets [11].
In this work, we investigate whether we can utilize ViT models for multi-label classification on CXR images and compare their performance to a baseline CNN. We investigate the effect of different data set sizes and explore whether knowledge distillation can make the training more data-efficient.
Our results indicate that the amount of available training data might not be sufficient to reveal the true power of ViT models. We assume that for ViTs, increasing performance will occur at even larger data set sizes that are not included in this study. In contrast, the more data-efficient DeiT model already shows increasing performance for smaller training sets. While we can conclude that the distillation process makes the training more data-efficient, it is hard to verify whether the data efficiency is achieved by mimicking the inductive bias of the teacher CNN [11]. Furthermore, even though the required amount of labelled training data is reduced, large data sets are still required to achieve performance improvements over CNNs with transformer-based networks. However, regularizing the training by knowledge distillation proves beneficial and can help to train transformer-based models efficiently.
Besides the improved performance of the transformer-based models, they show local and dense saliency patterns. This observation indicates that the attention maps of transformers can be helpful for the localization of lung diseases from CXRs and have the potential to guide treatment planning.
Overall, we show that self-attention-based ViT models can be valuable alternatives for multi-label pathology classification, especially in combination with knowledge distillation. Our results motivate research on combinations of CNNs, which enforce local connectivity priors, and highly expressive ViTs with global attention. This could be a promising direction, especially for the application of ViTs in the medical domain, where annotated data sets are typically small.

Author Statement
Research funding: This work was partially funded by Grant Number KK5208101KS0. Conflict of interest: Authors state no conflict of interest.

References
[1] Leonard Berlin, Accuracy of diagnostic procedures: Has it improved over the past five decades?, AJR 188 (2007), 1173–8.
[2] Adrian P. Brady, Error and discrepancy in radiology: inevitable or avoidable?, Insights into Imaging 8 (2017), no. 1, 171–182.
[3] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le, RandAugment: Practical automated data augmentation with a reduced search space, NeurIPS 2020, vol. 33, Curran Associates, Inc., 2020, pp. 18613–18624.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR, 2021.
[5] Gao Huang, Zhuang Liu, Geoff Pleiss, Laurens van der Maaten, and Kilian Weinberger, Convolutional networks with dense connectivity, IEEE PAMI (2019).
[6] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng, CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, AAAI Press, 2019.
[7] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng, MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports, Scientific Data 6 (2019), no. 1, 317.
[8] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017), 60–88.
[9] United Nations Scientific Committee on the Effects of Atomic Radiation et al., Effects of ionizing radiation, Scientific Annexes E (2008), 203–204.
[10] Suhail Raoof, David Feigin, Arthur Sung, Sabiha Raoof, Lavanya Irugulpati, and Edward C. Rosenow III, Interpretation of plain chest roentgenogram, Chest 141 (2012), no. 2, 545–
[11] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou, Training data-efficient image transformers & distillation through attention, ICML, vol. 139, July 2021, pp. 10347–10357.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, Attention is all you need, NIPS 2017, vol. 30, Curran Associates, Inc., 2017.
[13] Erdi Çallı, Ecem Sogancioglu, Bram van Ginneken, Kicky G. van Leeuwen, and Keelin Murphy, Deep learning for chest X-ray analysis: A survey, Med. Image Anal. 72 (2021), 102125.