Does BERT need domain adaptation for clinical negation detection?

Abstract

Introduction: Classifying whether concepts in an unstructured clinical text are negated is an important unsolved task. New domain adaptation and transfer learning methods can potentially address this issue.

Objective: We examine neural unsupervised domain adaptation methods, introducing a novel combination of domain adaptation with transformer-based transfer learning methods to improve negation detection. We also want to better understand the interaction between the widely used bidirectional encoder representations from transformers (BERT) system and domain adaptation methods.

Materials and Methods: We use 4 clinical text datasets that are annotated with negation status. We evaluate a neural unsupervised domain adaptation algorithm and BERT, a transformer-based model that is pretrained on massive general text datasets. We develop an extension to BERT that uses domain adversarial training, a neural domain adaptation method that adds an objective to the negation task: the classifier should not be able to distinguish between instances from 2 different domains.

Results: The domain adaptation methods we describe show positive results, but, on average, the best performance is obtained by plain BERT (without the extension). We provide evidence that the gains from BERT are likely not additive with the gains from domain adaptation.

Discussion: Our results suggest that, at least for the task of clinical negation detection, BERT subsumes domain adaptation, implying that BERT is already learning very general representations of negation phenomena, such that fine-tuning even on a specific corpus does not lead to much overfitting.

Conclusion: Despite being trained on nonclinical text, the large training sets of models like BERT lead to large gains in performance for the clinical negation detection task.

Keywords: natural language processing, machine learning, domain adaptation, deep learning, negation

INTRODUCTION

One of the core tasks of clinical natural language processing (NLP) is concept extraction and normalization,1–3 which involves mapping words and phrases in unstructured health texts to concepts in terminologies. However, these concepts can carry a variety of “assertion statuses” that affect their meaning; they can be negated, hedged, generic, family-related, or conditional, depending on the context in which they are mentioned. Of these assertion statuses, negation is the most common, and it is crucially important because a negated concept essentially has the opposite meaning from its non-negated counterpart. The negation extraction problem has been studied extensively in the clinical informatics literature,4–8 with some solutions claiming high enough performance that negation may have been considered a solved problem. However, recent work has shown that, while existing methods perform well within any given dataset, performance suffers significantly in the cross-domain setting, dropping from an average of 86.9 F1 in-domain to 74.0 cross-domain.9 In that paper, Wu et al explore in depth a few possible explanations for the domain differences, including the size of the corpus, the length of named entities, and what types of entities are annotated; only the difference in entity length seems to be important. That work also showed large improvements from supervised domain adaptation, where small amounts of labeled training data from the target domain are used to update the classifier trained on the source domain.
More recent work has looked at the possibility of applying unsupervised domain adaptation (UDA), where only unlabeled target data is used to update the classifier, and showed that 25% of the cross-domain losses could be reclaimed with existing UDA methods.10 This represents an attempt to move in a more realistic direction, as obtaining labeled concept negation data at all the sites that want to run concept extraction systems is probably unrealistic. However, the most recent advances in neural domain adaptation and transfer learning have not yet been applied to the negation extraction problem. Recent basic research in domain adaptation methods has focused largely on neural network-based models that learn domain-invariant feature representations. A model called structural correspondence learning (SCL),11,12 which creates new features based on the weights of linear classifiers that predict feature values given other features, has been updated for the neural era. In neural SCL, a multi-layer perceptron is used to predict feature values from other features. These approaches have been successful13,14 but still use traditional feature engineering approaches to define the feature space, and, like the original SCL method, they require the practitioner to decide which parts of the feature space are the predictors and which should be predicted. Another strain of neural domain adaptation works directly on word inputs passed through an embedding layer.15,16 One of the most successful approaches along these lines is the domain adversarial neural network (DANN).15 The DANN model is an attempt to implement ideas from domain adaptation theory17 which say that classifier performance in the target domain will degrade proportionally to how easily the source and target domain are distinguished given the feature set. Based on this theoretical result, the DANN model explicitly trains a neural network that can predict the label of interest (eg, whether a concept is negated or not) but that is not able to tell apart the 2 domains from the feature set. The goal is that the earlier stage of the network should learn to map domain-specific cues to similar variables in the final layer’s representation while learning to discard domain-specific noise. The way this is implemented is that the network is given data from both source and target domain and, for each instance, has to predict both the task label (negation status) and the domain that it came from. The loss function combines the loss incurred from the prediction of the task label (negation) and the prediction of the domain that a given instance came from. Since the goal is to not be able to tell domains apart, a gradient reversal layer is inserted between the learned representation layer and the domain prediction (while the task prediction uses the learned representation layer directly). During the forward pass, the gradient reversal layer simply passes its input forward. However, during the backwards pass, this layer multiplies the gradients computed by back-propagation by −1, so that the updates to weights in earlier layers move the weights in a direction that makes it more difficult to tell the domains apart. During training, source instances are passed in with their gold label and a domain label, while target instances only have a domain label. Meanwhile, a parallel track of research is based on transfer learning.18–21 In this approach, very large models are first pretrained on general tasks like language modeling on massive datasets. 
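The gradient reversal mechanism described above for DANN can be made concrete with a short PyTorch sketch. This is our own minimal illustration rather than the original DANN implementation, and the class and function names are ours.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed (and scaled) gradients push earlier layers toward representations
        # that make the source and target domains harder to distinguish.
        return -ctx.lambda_ * grad_output, None

def grad_reverse(x, lambda_=1.0):
    return GradientReversal.apply(x, lambda_)
```

In a DANN-style model, the domain classifier is attached behind grad_reverse, while the task classifier reads the shared representation directly.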
The idea is that pretraining on diverse massive datasets teaches the network something general about language. Then, during a fine-tuning stage, a final classifier is trained on a specific supervised task of interest, and, typically, the weights throughout the network are updated as well. These models have obtained state-of-the-art performance on multiple NLP benchmarks, often by large margins. Recent work has adapted 1 of these models, bidirectional encoder representations from transformers (BERT), to the clinical domain. This is done by starting from downloadable checkpoints, continuing the pretraining on MIMIC III22 for several steps, then continuing on to the fine-tuning stage.23–25 The combination of transfer learning and domain adaptation is not yet well-studied. In the only recent relevant work we are aware of,26 the BERT model was adapted in a sequence labeling task by using the pretraining tasks (masked language modeling) in the target domain. While interesting, the task of part-of-speech tagging and the domains (early modern English and tweets) are very different than those that we address in this work. The work we describe here is thus at the forefront of this important research direction. OBJECTIVE The objective of this work is to apply and extend the latest advances in neural domain adaptation and transfer learning to the important task of clinical concept negation extraction. The questions this work explores are 1) whether neural UDA methods improve on existing UDA methods; 2) how models like BERT, trained on massive datasets, perform on cross-domain negation extraction task; 3) how to perform model selection in a domain adaptation setting; and 4) the degree to which models like BERT still overfit on source datasets, and whether domain adaptation techniques can be applied on top of BERT. MATERIALS AND METHODS Datasets We use the same 4 corpora of clinical notes with negation annotation as Miller et al10: 2010 i2b2/VA NLP Challenge Corpus [i2b2],1 the Multi-source Integrated Platform for Answering Clinical Questions corpus [MiPACQ],27 SHARP Seed [Seed], and SHARP Stratified [Strat]. The i2b2 dataset contains deidentified notes (mostly discharge summaries) from Beth Israel Deaconess Medical Center, Partners HealthCare, and University of Pittsburgh Medical Center. MiPACQ was developed for a question answering system but is annotated for concept assertion status. MiPACQ contains Mayo Clinic notes from patients with colon cancer, articles from a now-defunct online medical encyclopedia (Medpedia), and clinical questions collected by the National Library of Medicine (http://clinques.nlm.nih.gov). The Seed corpus contains notes from patients with pulmonary arterial disease at Mayo Clinic, and patients with breast cancer at Seattle Group Health. The Strat corpus contains notes from patients at Mayo Clinic, spread across a variety of specialties and note types, in an attempt to make the dataset as diverse and general as possible. All of these corpora are deidentified, and this work was therefore approved by the Institutional Review Board at Boston Children’s Hospital, as “not human subjects research.” We first examine differences in our datasets for 2 negation-related variables, cue word distribution and distribution of entity lists. For the first, we take 5 common negation cue terms from the Negex system4 (no, not, denies, without, and negative for) and query their frequency in all 4 datasets. Figure 1a shows the distribution of these cue terms. 
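A cue-frequency comparison of the kind shown in Figure 1a can be sketched as follows; the corpus loading and whitespace tokenization are placeholders (our assumptions) for whatever preprocessing each dataset actually requires.

```python
import re
from collections import Counter

CUES = ["no", "not", "denies", "without", "negative for"]

def cue_frequencies(sentences):
    """Return occurrences of each cue term per 1000 tokens over a list of sentence strings."""
    counts, n_tokens = Counter(), 0
    for sent in sentences:
        text = sent.lower()
        n_tokens += len(text.split())
        for cue in CUES:
            # Whole-word (or whole-phrase) matches only.
            counts[cue] += len(re.findall(r"\b" + re.escape(cue) + r"\b", text))
    return {cue: 1000.0 * counts[cue] / max(n_tokens, 1) for cue in CUES}

# Hypothetical usage, where corpora maps a corpus name to its list of sentences:
# for name, sentences in corpora.items():
#     print(name, cue_frequencies(sentences))
```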
Even among these common terms, there is substantial variation; for example, denies barely occurs in Seed while occurring much more often in all 3 other datasets. Terms with lower frequency are likely to be even more problematic. We also examined the occurrences of entities in lists across different corpora. Because lists are notoriously difficult for NLP systems to parse, differences in this type of negated entity may be a cause of differing performance. We used simple patterns to test whether an entity was part of a list, including whether it was followed or preceded by commas or coordination terms. Figure 1b shows that the proportion of entities in lists varies from 10% (Strat) to 34% (i2b2 and Seed). This means that systems trained on i2b2 and Seed will have many more examples of lists to learn from, perhaps making them better able to handle these difficult instances.

Figure 1. The plot in sub-figure (a) shows the distribution of 5 common negation cues in different corpora. Sub-figure (b) shows the prevalence of negated concepts in lists across different corpora.

Domain adaptation methods

The first domain adaptation algorithm we apply is a neural version of structural correspondence learning (NSCL).14 This method builds on the work of Ziser and Reichart,13 who first extended SCL. Their method replaces a set of linear classifiers with a single multi-layer perceptron with multiple sigmoid outputs. As in traditional SCL, the first step is defining “pivot features” for the task, which, in this case, is done by using mutual information feature selection between features and labels in the source domain. They then train their neural network to learn feature correspondences by passing in both source and target data without labels and predicting the value of all pivot features in an instance, given the values of all the remaining features. The hidden layer of this network is then treated as a feature extractor for training on the source data with the labels for the task of interest. In Miller’s extension,14 the task and the feature correspondence learning are done jointly, so that the feature extractor is optimized both for the task and for pivot feature prediction. In this work we apply it for the first time to clinical NLP tasks.

Recently, pretrained general-purpose language encoders such as ELMo,21 GPT,28 GPT2,29 BERT,20 and XLNet30 have been trained on vast amounts of unlabeled text and have brought significant performance gains for many individual tasks. Yet, the cross-domain performance of these models has not been fully examined. Picking the most representative encoder, BERT, we aim to study its cross-domain negation detection performance in the fine-tuning setting. BERT takes a word sequence of a certain length as its input. In order to mark the position of the entities being classified for negation, we use a pair of nonword tags, “es” and “ee”, to signify the start and end of an entity in an input sequence (as shown in Figure 2, and sketched in the snippet below).

Figure 2. Architecture that integrates domain adversarial training into a BERT-based classifier.
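A minimal sketch of this entity-marking input format, using the Hugging Face tokenizer: the tag strings follow the paper, but the helper name and the decision to register the tags as additional special tokens are our assumptions.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Register the entity markers so they are not split into word pieces; if this adds new
# vocabulary ids, the model's token embeddings must be resized to match the tokenizer.
tokenizer.add_special_tokens({"additional_special_tokens": ["es", "ee"]})

def mark_entity(tokens, start, end):
    """Wrap the entity span tokens[start:end] in 'es' ... 'ee' tags and return one string."""
    return " ".join(tokens[:start] + ["es"] + tokens[start:end] + ["ee"] + tokens[end:])

text = mark_entity("The patient denies chest pain .".split(), 3, 5)
# -> "The patient denies es chest pain ee ."
encoding = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
```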
We also investigate using a domain adversarial neural network on top of BERT (see Figure 2). Training instances from both source and target domains are fed into BERT and get their encoded representation from the top CLS (classification) embedding. The CLS embedding is then used for predicting 2 binary outputs: the label (negation 1/−1) and the domain that a given instance came from (source/target). A gradient reversal layer is applied to the domain prediction for domain adversarial back propagation. This layer acts as a pass-through layer during the forward pass of neural-network training: it simply copies its input forward. During back-propagation, it passes gradients backwards after multiplying them by −1, so that the weights in the earlier representation-learning layers in the network will move in a direction that makes it harder for them to memorize the domain. The primary difference between NSCL and BERT+DANN is that the former operates over traditional feature representations, and the latter operates directly over the tokens in a sentence. They are similar in that they both propose to learn representations that are domain-independent while being predictive. NSCL learns to model the structure of the feature space by jointly learning to predict the task label while trying to predict 1 subset of features from the remaining features. Since the feature prediction task can be applied to both source and target data, the representation learned for that task should avoid overfitting to 1 domain. The representation also needs to be useful for the primary task, so the network should learn to combine features in a way that obscures which corpus they came from. Experimental protocol The standard domain adaptation task setup is to treat 1 of the corpora as the source domain, train the model on the source, and then evaluate on another corpus simulating a target domain. Each corpus has standard splits of train, development, and test sets; but, when a corpus is the target, we evaluate on its train split as the larger size increases the stability of the estimate. For training UDA models in the domain-adversarial setting, we use input from both the source and target training splits to enable domain-adversarial training; but we hide the negation labels for the target instance so that, during training, the network only sees supervision from the source training data. Inputs are then fed in alternating batches of source and target data, so that the network is continually updating weights for both the task and domain classifiers. The source domain development set is used for tuning model hyperparameters and model selection. In the first experiment, we explore model selection for the UDA system. This experiment is intended to help us choose the best models to evaluate in the main experiment. We use the Seed/Strat pair as a “development pair” to experiment with different model selection methods and used the best settings from that pair on other pairs to avoid unrealistic tuning on each target dataset. We record performance across a grid search of hyperparameters (learning rate, training epochs, and domain-loss weight for UDA models) for the source development set and the target train set. In the next experiment, we compare several systems for the cross-domain negation extraction tasks. First, we include 2 baselines, the No adaptation setting and SCL. We compare those baselines against several systems: Neural SCL, BERT, and BERT+UDA. 
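For reference, the BERT+DANN architecture of Figure 2 can be sketched roughly as below, reusing the grad_reverse function from the earlier snippet; the layer sizes, attribute names, and use of bert-base-uncased are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn
from transformers import BertModel

class BertDann(nn.Module):
    def __init__(self, pretrained="bert-base-uncased", grl_lambda=1.0):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        self.negation_head = nn.Linear(hidden, 2)  # negated vs. not negated
        self.domain_head = nn.Linear(hidden, 2)    # source vs. target
        self.grl_lambda = grl_lambda

    def forward(self, input_ids, attention_mask):
        # The [CLS] vector summarizes the entity-marked sentence.
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        task_logits = self.negation_head(cls)
        # The domain head sits behind gradient reversal, pushing BERT toward
        # representations that do not reveal which corpus an instance came from.
        domain_logits = self.domain_head(grad_reverse(cls, self.grl_lambda))
        return task_logits, domain_logits
```

Training would then alternate source batches (task loss plus domain loss) with target batches (domain loss only), matching the alternating-batch protocol described above.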
To examine the BERT+UDA system, we use an ensemble technique based on the results from the first experiment. Specifically, the systems using different hyperparameters were ranked based on source development set performance, and we keep the 33% highest ranked systems. We use majority voting over this set of systems to get classification decisions for BERT+UDA. We also include the results that BERT obtains if it is allowed to use the same ensemble technique.

In the final experiment, we examine the overlap in the results between adapted and unadapted systems. Our motivation is that we may see improvements from 2 different types of systems (domain adapted and BERT-like models), and we would like to know to what extent these improvements are additive. This also helps us examine the issue of whether models like BERT are overfitting to source datasets during fine-tuning. To measure this, we look at the outputs of 2 systems (BERT and SCL) in comparison to the no adaptation baseline. For the subset of instances where the no adaptation system makes an error, we count how many instances BERT and SCL get right. We then take the intersection to find the number that BERT gets right that SCL gets wrong, and vice versa. Our intuition is that the systems are additive if the intersection is low. Since we are primarily interested in whether there is potential gain from using UDA on top of BERT, we introduce a metric, “% over BERT,” which we define as “# that only SCL fixes/# total BERT fixes.” In other words, we measure how much of an increase in corrections we would get over BERT if we added all of the instances that SCL uniquely gets correct.

RESULTS

Model selection results

Figure 3 shows a scatterplot of the model selection experiments, where the x-axis shows the F1 scores the BERT+UDA models obtain on the source domain development set and the y-axis shows the F1 scores on the target domain; each point represents a different configuration from the hyperparameter grid search. The model that gets the best source development set F1, 0.891, only gets a 0.783 F1 score on the out-of-domain test set, while the best model on the target domain, F1 = 0.854, gets F1 = 0.840 on the source development set. The correlation coefficient between the source development set F1 scores and the target test set F1 scores is 0.801 (the equivalent correlation coefficient for the vanilla BERT models is 0.775). This is quite a strong correlation, but it is nevertheless hard to single out 1 specific model from its development set performance and guarantee that it will perform best on the test set. From these results, we decided that an ensemble-based approach would be preferable to a single model. Therefore, for the remaining BERT+UDA experiments below, we use the models that perform in the top 33% on the source development set and take a majority vote across these models. For fairness, we also report the results of an ensemble of fine-tuned BERT classifiers with different hyperparameters, again using a majority vote from the top 33% of models based on source development set performance.

Figure 3. Scatterplot of Strat performance as a function of Seed Dev performance.

Table 1 shows the results of our main experiment.
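Before turning to those results, the majority-vote ensembling described above can be sketched as follows; the function names and the representation of predictions as parallel label lists are our own simplifications.

```python
from collections import Counter

def top_third(configs):
    """Keep the configurations in the top 33% ranked by source development-set F1."""
    ranked = sorted(configs, key=lambda c: c["dev_f1"], reverse=True)
    return ranked[: max(1, len(ranked) // 3)]

def majority_vote(predictions_per_model):
    """predictions_per_model: one equal-length list of labels per ensemble member."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions_per_model)]

# Hypothetical usage: each config carries its dev F1 and its target-set predictions.
# ensemble = top_third(configs)
# final_labels = majority_vote([c["target_preds"] for c in ensemble])
```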
We are able to closely replicate the results from Miller et al10 on “No Adapt” and “SCL,” with a small gain in SCL that is probably caused by changes in the preprocessing of the feature set. The No Adapt setting is a feature-engineered support vector machine (SVM), similar to that reported by Wu et al.9 Neural SCL shows larger gains than SCL, but has inexplicably low performance on Strat→Seed. BERT performs better than the first 2 domain adaptation methods, gaining 2 points on average and recording the best performance on 6 of the 12 pairs. The BERT+UDA ensemble obtains performance on par with NSCL but not as high as BERT on average. The BERT ensemble, included for a fair comparison, actually has lower performance than the single-model BERT chosen based on source development performance.

Table 1. Main negation results

Source    Target    No Adapt   SCL    NSCL   BERT   BERT+UDA Ensemble   BERT Ensemble
Seed      Strat     0.76       0.80   0.79   0.83   0.82                0.82
Seed      MiPACQ    0.66       0.70   0.75   0.74   0.71                0.72
Seed      i2b2      0.79       0.83   0.81   0.87   0.83                0.83
Strat     Seed      0.66       0.67   0.60   0.71   0.65                0.67
Strat     MiPACQ    0.67       0.68   0.74   0.72   0.73                0.74
Strat     i2b2      0.79       0.80   0.74   0.84   0.83                0.85
MiPACQ    Seed      0.73       0.73   0.73   0.75   0.73                0.72
MiPACQ    Strat     0.78       0.79   0.82   0.83   0.82                0.79
MiPACQ    i2b2      0.77       0.85   0.83   0.87   0.87                0.88
i2b2      Seed      0.65       0.72   0.76   0.72   0.71                0.72
i2b2      Strat     0.59       0.69   0.77   0.73   0.70                0.74
i2b2      MiPACQ    0.64       0.69   0.74   0.74   0.70                0.71
Average             0.71       0.75   0.76   0.78   0.76                0.77

Source and Target columns indicate which dataset (domain) was used to train the system (source) and which was used to evaluate it (target), simulating domain transfer. The remaining columns report F1 scores of the negation systems in the different adaptation setups explained in the main text. Abbreviations: BERT, bidirectional encoder representations from transformers; NSCL, neural structural correspondence learning; SCL, structural correspondence learning; UDA, unsupervised domain adaptation.

Table 2 shows the results of the analysis of the intersection of SCL and BERT errors. In most pairs, BERT fixes many mistakes, while SCL makes very few fixes. As a result, for most pairs, the additional gains available from SCL are minimal. This suggests that even the fixes that SCL does make are mostly subsumed by the fixes that BERT makes.

Table 2. SCL doesn't get many examples right that BERT doesn't also get

Source    Target    SCL uniquely fixed   BERT uniquely fixed   % over BERT
Seed      Strat     2                    47                    2.47
Seed      MiPACQ    10                   291                   3.13
Seed      i2b2      3                    207                   0.95
Strat     Seed      10                   85                    7.87
Strat     MiPACQ    7                    251                   2.69
Strat     i2b2      1                    164                   0.51
MiPACQ    Seed      18                   147                   11.39
MiPACQ    Strat     0                    38                    0
MiPACQ    i2b2      34                   103                   27.64
i2b2      Seed      16                   172                   7.96
i2b2      Strat     0                    44                    0
i2b2      MiPACQ    31                   320                   8.78
Average             22.8                 149.4                 6.12

DISCUSSION

In our primary result (Table 1), the single best BERT model scores highest on average, with 0.78 F1 and the best performance on 6 of the 12 pairs. This shows the strength of BERT as a universal encoder trained on a huge amount of data. This model alone is very robust to domain changes, at least for this negation task, where informative negation cues are close to the entities in question. We were somewhat surprised by the size of the increase BERT provided, but perhaps we should not have been, given other recent results. What is truly surprising, however, is that the BERT+UDA system, which adds domain adaptation on top of BERT, is not any better than BERT alone.
This result led to the obvious question of whether BERT is somehow subsuming what domain adaptation does, which is what originally led to the experiment featured in Table 2. In Table 2, we show that even when an adaptation algorithm is getting significantly better results than no adaptation, it is not doing much that BERT does not already do. First, by virtue of being better performing, BERT fixes more examples than the conventional SCL method does. But if we look at the instances that are fixed only by SCL, there are very few compared to what BERT fixes alone, meaning that for most pairs we get fewer than an additional 10% correct. This can help explain the perhaps surprising result that BERT+UDA rarely outperforms BERT: there are few gains to be had over vanilla BERT, and they may take some effort to learn in a way that does not harm performance on other examples.

Table 2 has 1 pair with results that stick out slightly: MiPACQ to i2b2, where SCL provides a gain of 27.64% on top of BERT. This pair is worth investigating to see whether UDA methods may still have potential for improving BERT. We examined the 34 instances that SCL gets correct beyond BERT’s improvements to see if there were patterns. We first find that all but 1 of these instances are false positives: the gold standard says the entity is not negated, but both the “No Adaptation” setting and BERT label it as negated. Further, we find that over half of these false positives occur in the setting of conditional lists, where the note lists a set of possible concerning outcomes. For example, in the sentence “She is asked to call with any fevers, chills, increasing weakness or numbness, or any bowel and bladder disruptions,” there are several events that need to be classified (fevers, chills, weakness, etc.), but, in the gold standard, these are all marked as not negated. These can cause confusion for some models, because the construction “any X, Y, or Z” can also be used for negation lists, although normally only in combination with a negation cue word. Next, we looked at whether BERT gets these wrong with any training set (ie, whether SCL was doing unusually well for the MiPACQ→i2b2 pair), or whether BERT can sometimes get them correct (ie, whether BERT did unusually poorly for the MiPACQ→i2b2 pair). We find that BERT performs unusually poorly for this pair. When the source corpus is either Seed or Strat, BERT can in fact get many of these false positives correct: 21 out of 33 for Seed and 23 out of 33 for Strat. This means that the MiPACQ corpus is uniquely bad at preparing BERT to handle lists of conditional items. Ongoing work in our group and many others is exploring a variety of techniques for understanding BERT-trained models, which we will apply to these models in future work.

While, on average, explicit domain adaptation performs worse than BERT, we saw in development that the best BERT+UDA model on the development set can outperform BERT, and so model selection deserves some attention. Our result in the first experiment showed that there are individual BERT+UDA models with higher performance than the best BERT model, but the mapping from source development performance to target performance does not hold up quite as well for BERT+UDA as it does for BERT alone. This is an interesting intellectual question, and future work should investigate it, but, given the relatively small gains that Table 2 suggests are available, it may not justify the highest priority level.
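For concreteness, the overlap analysis behind Table 2 and the “% over BERT” metric can be computed as in the sketch below; the function name and the representation of predictions as parallel lists are our own assumptions.

```python
def overlap_analysis(gold, baseline_preds, bert_preds, scl_preds):
    """Count fixes relative to the no-adaptation baseline and compute '% over BERT'."""
    baseline_errors = [i for i, (g, p) in enumerate(zip(gold, baseline_preds)) if g != p]
    bert_fixes = {i for i in baseline_errors if bert_preds[i] == gold[i]}
    scl_fixes = {i for i in baseline_errors if scl_preds[i] == gold[i]}
    only_scl = scl_fixes - bert_fixes   # instances only SCL fixes
    only_bert = bert_fixes - scl_fixes  # instances only BERT fixes
    pct_over_bert = 100.0 * len(only_scl) / max(len(bert_fixes), 1)
    return {"SCL uniquely fixed": len(only_scl),
            "BERT uniquely fixed": len(only_bert),
            "% over BERT": pct_over_bert}
```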
As mentioned in the introduction, there are several recent BERT models that have had their pretraining extended using the unlabeled MIMIC III data. For this work, we fine-tune with vanilla BERT, as we already see large gains over existing methods, and we wanted to examine these models deeply before studying the more complex setting with 2 pretraining phases. Preliminary experiments with ClinicalBERT on our development pair also showed that performance did not surpass vanilla BERT, despite extensive tuning. In other work, the task that benefits most from MIMIC pretraining of BERT is named entity recognition.25 In that task, one can imagine that exposure to the large MIMIC text dataset is important for recognizing medical terms. In our task, by contrast, the medical entities are given, and the task is more of a general language task in the sense that the model needs to find a negation cue and then relate it to the given entity. For that reason, it is not surprising that additional pretraining on MIMIC does not help.

These results suggest a number of future directions. First, while improved performance is desirable, the models produced by fine-tuning BERT are much larger and less interpretable than existing methods, including feature-based classifiers as well as rule-based methods. One direction that is already being explored in the general domain has been coined “BERTology,” the science of understanding how BERT works. This involves examining the value of attention heads and intermediate nodes to try to build a theory (or at least a mental model) of how BERT reasons about language. Improving our understanding of how BERT recognizes complex linguistic phenomena suggests an opportunity to leverage that understanding to develop better simple methods, for example by improving feature engineering in standard linear models. This direction can potentially address concerns about both the size and the complexity of these models. The next direction is extending the BERT model to other classification tasks related to concept status. Modeling uncertainty, genericity, and historical status, among others, are all important but are more difficult than the negation classification task. The success of BERT on so many NLP tasks suggests it will be wise to revisit these problems. As we saw in our error analysis, where conditional mentions were confused with negated mentions, there is overlap between these tasks. This suggests that an approach using joint inference or multi-task learning may be better able to model the interactions between the different concept status classification tasks.

The conclusions from this work are not guaranteed to apply to other clinical NLP tasks. As mentioned above, negation detection (and probably other assertion status classification tasks) likely benefits from the fact that BERT learns from a massive general dataset, for the task of relating negation cue words to named entities. In contrast, for tasks like finding medical named entities, even large datasets like Wikipedia will lack coverage, and, therefore, BERT may be more dependent on the source dataset to learn the task. In that case, it may be that BERT can overfit, but this work still needs to be done.

CONCLUSION

The main result of this work is that training BERT for clinical negation detection is likely to be as valuable as developing new domain adaptation methods. This does not mean that BERT has made domain adaptation unnecessary, because, as mentioned above, it may still benefit from domain adaptation for other NLP tasks.
However, our interpretation of these results is that the best path forward for practical negation detection is to use existing datasets with some flavor of BERT. Since there is still variation in performance based on the size and coverage of the source corpus, the best performing BERT-based negation model should be built using multiple clinical datasets. We are currently working on incorporating such models into the open source Apache cTAKES system to make them available to the public.

FUNDING

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award numbers R01LM012918 and R01LM012973. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

AUTHOR CONTRIBUTIONS

CL, SB, and TM designed the study. CL carried out the primary experiments. CL and TM carried out analysis and secondary experiments. All authors contributed to developing the outline and editing the manuscript. CL and TM wrote the manuscript.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

1. Uzuner Ö, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011; 18(5): 552–6.
2. Pradhan S, Elhadad N, South BR, et al. Task 1: ShARe/CLEF eHealth evaluation lab 2013. In: Proceedings of the ShARe/CLEF Evaluation Lab 2013; 2013.
3. Wang Y, Wang L, Rastegar-Mojarad M, et al. Clinical information extraction applications: a literature review. J Biomed Inform 2018; 77: 34–49.
4. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001; 34(5): 301–10.
5. Harkema H, Dowling JN, Thornblade T, et al. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 2009; 42(5): 839–51.
6. Mehrabi S, Krishnan A, Sohn S, et al. DEEPEN: a negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform 2015; 54: 213–9.
7. Sohn S, Wu S, Chute CG. Dependency parser-based negation detection in clinical narratives. AMIA Jt Summits Transl Sci Proc 2012; 2012: 1–8.
8. Bhatia P, Celikkaya B, Khalilia M. Joint entity extraction and assertion detection for clinical text. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019: 954–9.
9. Wu S, Miller T, Masanz J, et al. Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PLoS One 2014; 9(11): e112774.
10. Miller T, Bethard S, Amiri H, et al. Unsupervised domain adaptation for clinical negation detection. In: BioNLP 2017. Vancouver, Canada: Association for Computational Linguistics; 2017: 165–70. http://www.aclweb.org/anthology/W17-2320.
11. Blitzer J, Dredze M, Pereira F. Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. Prague, Czech Republic: Association for Computational Linguistics; 2007: 440.
12. Blitzer J, McDonald R, Pereira F. Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006); 2006. http://dl.acm.org/citation.cfm?id=1610094.
13. Ziser Y, Reichart R. Neural structural correspondence learning for domain adaptation. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada: Association for Computational Linguistics; 2017: 400–10. doi: 10.18653/v1/K17-1040.
14. Miller T. Simplified neural unsupervised domain adaptation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, MN: Association for Computational Linguistics; 2019: 414–9.
15. Ganin Y, Ustinova E, Ajakan H, et al. Domain-adversarial training of neural networks. J Mach Learn Res 2016; 17: 1–35.
16. Chen M, Xu Z, Weinberger KQ, et al. Marginalized denoising autoencoders for domain adaptation. In: Proceedings of the 29th International Conference on Machine Learning. Edinburgh, Scotland: Omnipress; 2012: 1627–34.
17. Ben-David S, Blitzer J, Crammer K, et al. A theory of learning from different domains. Mach Learn 2010; 79(1–2): 151–75.
18. Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, NM; 2018: 1638–49.
19. Howard J, Ruder S. Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics; 2018: 328–39.
20. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, MN: Association for Computational Linguistics; 2019: 4171–86.
21. Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, LA: Association for Computational Linguistics; 2018: 2227–37.
22. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3(1): 160035.
23. Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, MN: Association for Computational Linguistics; 2019: 72–8.
24. Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv 2019; abs/1904.05342.
25. Si Y, Wang J, Xu H, et al. Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc 2019; 26(11): 1297–304.
26. Han X, Eisenstein J. Unsupervised domain adaptation of contextualized embeddings: a case study in early modern English. CoRR 2019; abs/1904.02817. http://arxiv.org/abs/1904.02817.
27. Albright D, Lanfranchi A, Fredriksen A, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc 2013; 20(5): 922–30.
28. Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training. 2018. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf. Accessed January 21, 2020.
29. Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners. OpenAI Blog 2019; 1(8): 9.
30. Yang Z, Dai Z, Yang Y, et al. XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237; 2019.

© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com

Loading next page...
 
/lp/oxford-university-press/does-bert-need-domain-adaptation-for-clinical-negation-detection-CJ3YeFlA7B

References (34)

Publisher
Oxford University Press
Copyright
© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com
ISSN
1067-5027
eISSN
1527-974X
DOI
10.1093/jamia/ocaa001
Publisher site
See Article on Publisher Site

Abstract

Abstract Introduction Classifying whether concepts in an unstructured clinical text are negated is an important unsolved task. New domain adaptation and transfer learning methods can potentially address this issue. Objective We examine neural unsupervised domain adaptation methods, introducing a novel combination of domain adaptation with transformer-based transfer learning methods to improve negation detection. We also want to better understand the interaction between the widely used bidirectional encoder representations from transformers (BERT) system and domain adaptation methods. Materials and Methods We use 4 clinical text datasets that are annotated with negation status. We evaluate a neural unsupervised domain adaptation algorithm and BERT, a transformer-based model that is pretrained on massive general text datasets. We develop an extension to BERT that uses domain adversarial training, a neural domain adaptation method that adds an objective to the negation task, that the classifier should not be able to distinguish between instances from 2 different domains. Results The domain adaptation methods we describe show positive results, but, on average, the best performance is obtained by plain BERT (without the extension). We provide evidence that the gains from BERT are likely not additive with the gains from domain adaptation. Discussion Our results suggest that, at least for the task of clinical negation detection, BERT subsumes domain adaptation, implying that BERT is already learning very general representations of negation phenomena such that fine-tuning even on a specific corpus does not lead to much overfitting. Conclusion Despite being trained on nonclinical text, the large training sets of models like BERT lead to large gains in performance for the clinical negation detection task. natural language processing, machine learning, domain adaptation, deep learning, negation INTRODUCTION One of the core tasks of clinical natural language processing (NLP) is concept extraction and normalization,1–3 which involves mapping words and phrases in unstructured health texts to concepts in terminologies. However, these concepts can carry a variety of “assertion statuses” that affect the meaning; they can be negated, hedged, generic, family-related, or conditional, depending on the context in which they are mentioned. Of these assertion statuses, negation is the most common, and it is crucially important because it essentially has the opposite meaning as a non-negated concept. The negation extraction problem has been studied extensively in the clinical informatics literature,4–8 with some solutions claiming quite high performance to the extent that it may have been considered a solved problem. However, recent work has shown that, while existing methods perform well within any given dataset, performance suffers significantly in the cross-domain setting going from an average of 86.9 F1 in-domain to 74.0 in the cross-domain setting.9 In that paper, Wu et al explore a few possible explanations in depth for domain differences including size of the corpus, length of named entities, and what types of entities are annotated, and only the difference in entity length seems to be important. That work also showed great improvements from supervised domain adaptation, where small amounts of labeled training data from the target domain are used to update the classifier trained on the source domain. 
More recent work has looked at the possibility of applying unsupervised domain adaptation (UDA), where only unlabeled target data is used to update the classifier, and showed that 25% of the cross-domain losses could be reclaimed with existing UDA methods.10 This represents an attempt to move in a more realistic direction, as obtaining labeled concept negation data at all the sites that want to run concept extraction systems is probably unrealistic. However, the most recent advances in neural domain adaptation and transfer learning have not yet been applied to the negation extraction problem. Recent basic research in domain adaptation methods has focused largely on neural network-based models that learn domain-invariant feature representations. A model called structural correspondence learning (SCL),11,12 which creates new features based on the weights of linear classifiers that predict feature values given other features, has been updated for the neural era. In neural SCL, a multi-layer perceptron is used to predict feature values from other features. These approaches have been successful13,14 but still use traditional feature engineering approaches to define the feature space, and, like the original SCL method, they require the practitioner to decide which parts of the feature space are the predictors and which should be predicted. Another strain of neural domain adaptation works directly on word inputs passed through an embedding layer.15,16 One of the most successful approaches along these lines is the domain adversarial neural network (DANN).15 The DANN model is an attempt to implement ideas from domain adaptation theory17 which say that classifier performance in the target domain will degrade proportionally to how easily the source and target domain are distinguished given the feature set. Based on this theoretical result, the DANN model explicitly trains a neural network that can predict the label of interest (eg, whether a concept is negated or not) but that is not able to tell apart the 2 domains from the feature set. The goal is that the earlier stage of the network should learn to map domain-specific cues to similar variables in the final layer’s representation while learning to discard domain-specific noise. The way this is implemented is that the network is given data from both source and target domain and, for each instance, has to predict both the task label (negation status) and the domain that it came from. The loss function combines the loss incurred from the prediction of the task label (negation) and the prediction of the domain that a given instance came from. Since the goal is to not be able to tell domains apart, a gradient reversal layer is inserted between the learned representation layer and the domain prediction (while the task prediction uses the learned representation layer directly). During the forward pass, the gradient reversal layer simply passes its input forward. However, during the backwards pass, this layer multiplies the gradients computed by back-propagation by −1, so that the updates to weights in earlier layers move the weights in a direction that makes it more difficult to tell the domains apart. During training, source instances are passed in with their gold label and a domain label, while target instances only have a domain label. Meanwhile, a parallel track of research is based on transfer learning.18–21 In this approach, very large models are first pretrained on general tasks like language modeling on massive datasets. 
The idea is that pretraining on diverse massive datasets teaches the network something general about language. Then, during a fine-tuning stage, a final classifier is trained on a specific supervised task of interest, and, typically, the weights throughout the network are updated as well. These models have obtained state-of-the-art performance on multiple NLP benchmarks, often by large margins. Recent work has adapted 1 of these models, bidirectional encoder representations from transformers (BERT), to the clinical domain. This is done by starting from downloadable checkpoints, continuing the pretraining on MIMIC III22 for several steps, then continuing on to the fine-tuning stage.23–25 The combination of transfer learning and domain adaptation is not yet well-studied. In the only recent relevant work we are aware of,26 the BERT model was adapted in a sequence labeling task by using the pretraining tasks (masked language modeling) in the target domain. While interesting, the task of part-of-speech tagging and the domains (early modern English and tweets) are very different than those that we address in this work. The work we describe here is thus at the forefront of this important research direction. OBJECTIVE The objective of this work is to apply and extend the latest advances in neural domain adaptation and transfer learning to the important task of clinical concept negation extraction. The questions this work explores are 1) whether neural UDA methods improve on existing UDA methods; 2) how models like BERT, trained on massive datasets, perform on cross-domain negation extraction task; 3) how to perform model selection in a domain adaptation setting; and 4) the degree to which models like BERT still overfit on source datasets, and whether domain adaptation techniques can be applied on top of BERT. MATERIALS AND METHODS Datasets We use the same 4 corpora of clinical notes with negation annotation as Miller et al10: 2010 i2b2/VA NLP Challenge Corpus [i2b2],1 the Multi-source Integrated Platform for Answering Clinical Questions corpus [MiPACQ],27 SHARP Seed [Seed], and SHARP Stratified [Strat]. The i2b2 dataset contains deidentified notes (mostly discharge summaries) from Beth Israel Deaconess Medical Center, Partners HealthCare, and University of Pittsburgh Medical Center. MiPACQ was developed for a question answering system but is annotated for concept assertion status. MiPACQ contains Mayo Clinic notes from patients with colon cancer, articles from a now-defunct online medical encyclopedia (Medpedia), and clinical questions collected by the National Library of Medicine (http://clinques.nlm.nih.gov). The Seed corpus contains notes from patients with pulmonary arterial disease at Mayo Clinic, and patients with breast cancer at Seattle Group Health. The Strat corpus contains notes from patients at Mayo Clinic, spread across a variety of specialties and note types, in an attempt to make the dataset as diverse and general as possible. All of these corpora are deidentified, and this work was therefore approved by the Institutional Review Board at Boston Children’s Hospital, as “not human subjects research.” We first examine differences in our datasets for 2 negation-related variables, cue word distribution and distribution of entity lists. For the first, we take 5 common negation cue terms from the Negex system4 (no, not, denies, without, and negative for) and query their frequency in all 4 datasets. Figure 1a shows the distribution of these cue terms. 
Even among these common terms, there is substantial variation; for example, denies barely occurs in Seed while occurring more often in all 3 other datasets. Terms with lower frequency are likely to be even more problematic. We also examined the occurrences of entities in lists across different corpora. Because lists are notoriously difficult for NLP systems to parse, differences in this type of negated entity may be a cause of differing performance. We used simple patterns to test whether an entity was part of a list, including whether it was followed or preceded by commas or coordination terms. Figure 1b shows that the distribution of entities in lists varies from 10% (Strat) to 34% (i2b2 and Seed). This means that datasets trained on i2b2 and Seed will have many more examples of lists to learn from, perhaps making learning systems better able to handle these difficult instances. Figure 1. Open in new tabDownload slide The plot in sub-figure (a) shows the distribution of 5 common negation cues in different corpora. Sub-figure (b) shows the prevalence of negated concepts in lists across different corpora. Figure 1. Open in new tabDownload slide The plot in sub-figure (a) shows the distribution of 5 common negation cues in different corpora. Sub-figure (b) shows the prevalence of negated concepts in lists across different corpora. Domain adaptation methods The first domain adaptation algorithm we apply is a neural version of structural correspondence learning (NSCL).14 This method builds on the work of Ziser and Reichart,13 who first extended SCL. Their method replaces a set of linear classifiers with a single multi-layer perceptron with multiple sigmoid outputs. As in traditional SCL, the first step is defining “pivot features” for the task which, in this case, is done by using mutual information feature selection between features and labels in the source domain. They then train their neural network to learn feature correspondences by passing in both source and target data without labels and predicting the value of all pivot features in an instance, given the values of all the remaining features. The hidden layer of this network is then treated as a feature extractor for training on the source data with the labels for the task of interest. In Miller’s extension,14 this method was extended by doing the task and feature correspondence learning jointly, so that the feature extractor is optimized for both the task and for pivot feature prediction. In this work we apply it for the first time to clinical NLP tasks. Recently, pretrained general-purpose language encoders such as ELMo,21 GPT,28 GPT2,29 BERT,20 and XLNet30 have been trained on vast amounts of unlabeled text and have brought significant performance gains for many individual tasks. Yet, the cross-domain performance of these models hasn’t been fully examined. Picking the most representative encoder, BERT, we aim to study its cross-domain negation detection performance in the fine-tuning setting. BERT takes a word sequence of certain length as its input. In order to mark the position of the entities being classified for negation, we use a pair of nonword tags, “es” and “ee”, to signify the starting and ending of an entity in an input sequence (as shown in Figure 2). Figure 2. Open in new tabDownload slide Architecture that integrates domain adversarial training into a BERT-based classifier. Figure 2. Open in new tabDownload slide Architecture that integrates domain adversarial training into a BERT-based classifier. 
We also investigate using a domain adversarial neural network (DANN) on top of BERT (see Figure 2). Training instances from both the source and target domains are fed into BERT, and their encoded representations are taken from the top-layer CLS (classification) embedding. The CLS embedding is then used to predict 2 binary outputs: the negation label (1/−1) and the domain that a given instance came from (source/target). A gradient reversal layer is applied to the domain prediction branch for domain adversarial backpropagation (a minimal code sketch of this architecture appears below, after the experimental protocol). This layer acts as a pass-through layer during the forward pass of neural-network training: it simply copies its input forward. During backpropagation, it passes gradients backwards after multiplying them by −1, so that the weights in the earlier representation-learning layers of the network move in a direction that makes the learned representation less predictive of the domain.

The primary difference between NSCL and BERT+DANN is that the former operates over traditional feature representations, whereas the latter operates directly over the tokens in a sentence. They are similar in that both learn representations that are intended to be domain-independent while remaining predictive. NSCL learns to model the structure of the feature space by jointly learning to predict the task label while trying to predict 1 subset of features (the pivots) from the remaining features. Since the feature prediction task can be applied to both source and target data, the representation learned for that task should avoid overfitting to 1 domain, while also remaining useful for the primary task; in effect, the network learns to combine features in a way that obscures which corpus they came from.

Experimental protocol

The standard domain adaptation setup is to treat 1 of the corpora as the source domain, train the model on the source, and then evaluate on another corpus simulating a target domain. Each corpus has standard train, development, and test splits; but, when a corpus is the target, we evaluate on its train split, as its larger size increases the stability of the estimate. For training UDA models in the domain-adversarial setting, we use input from both the source and target training splits to enable domain-adversarial training, but we hide the negation labels for the target instances so that, during training, the network only sees supervision from the source training data. Inputs are fed in alternating batches of source and target data, so that the network continually updates weights for both the task and domain classifiers. The source domain development set is used for tuning model hyperparameters and for model selection.

In the first experiment, we explore model selection for the UDA system. This experiment is intended to help us choose the best models to evaluate in the main experiment. We use the Seed/Strat pair as a "development pair" to experiment with different model selection methods and use the best settings from that pair on the other pairs, to avoid unrealistic tuning on each target dataset. We record performance across a grid search of hyperparameters (learning rate, training epochs, and domain-loss weight for UDA models) on the source development set and the target train set.

In the next experiment, we compare several systems on the cross-domain negation extraction tasks. First, we include 2 baselines, the No Adaptation setting and SCL. We compare those baselines against several systems: Neural SCL, BERT, and BERT+UDA.
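As a concrete illustration of the BERT+DANN architecture in Figure 2, the following is a minimal PyTorch-style sketch of the gradient reversal layer and the 2 classification heads. The class names, the use of HuggingFace's BertModel, and the single weighting parameter lambd are illustrative assumptions, not the exact implementation evaluated here.

import torch
import torch.nn as nn
from transformers import BertModel  # assumption: HuggingFace transformers tooling

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd on the way back."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class BertDann(nn.Module):
    """Shared [CLS] encoding feeds a negation head and, through gradient reversal, a domain head."""
    def __init__(self, lambd=1.0):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.negation_head = nn.Linear(hidden, 2)  # negated vs not negated
        self.domain_head = nn.Linear(hidden, 2)    # source vs target
        self.lambd = lambd

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]      # top-layer [CLS] vector
        negation_logits = self.negation_head(cls)
        domain_logits = self.domain_head(GradientReversal.apply(cls, self.lambd))
        return negation_logits, domain_logits

In training, source and target batches would be alternated; the negation loss is computed only on source batches, while the domain loss is computed on both, pushing the encoder toward representations that are predictive of negation but not of domain.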
To examine the BERT+UDA system, we use an ensemble technique based on the results from the first experiment. Specifically, systems trained with different hyperparameters are ranked by source development set performance, and we keep the highest-ranked 33% of systems. We use majority voting over this set of systems to get classification decisions for BERT+UDA (a minimal sketch of this selection and voting procedure is given below). We also include the results that BERT obtains if it is allowed to use the same ensemble technique.

In the final experiment, we examine the overlap in the results between adapted and unadapted systems. Our motivation is that we may see improvements from 2 different types of systems (domain-adapted models and BERT-like models), and we would like to know to what extent these improvements are additive. This also helps us examine the issue of whether models like BERT are overfitting to source datasets during fine-tuning. To measure this, we look at the outputs of 2 systems, BERT and SCL, in comparison to the no adaptation baseline. For the subset of instances where the no adaptation system makes an error, we count how many instances BERT and SCL each get right. We then compare these sets of fixes to find the number that BERT gets right but SCL gets wrong, and vice versa. Our intuition is that the improvements are additive if the overlap between the 2 sets of fixes is low. Since we are primarily interested in whether there is potential gain from using UDA on top of BERT, we introduce a metric, "% over BERT," that we define as "# that only SCL fixes / # total BERT fixes." In other words, we measure how much of an increase in corrections we would get over BERT if we added all of the instances that SCL uniquely gets correct.

RESULTS

Model selection results

Figure 3 shows a scatterplot of the model selection experiments: the x-axis shows the F1 scores the BERT+UDA models obtain on the source domain development set, the y-axis shows the F1 scores on the target domain, and each point represents a different configuration from the hyperparameter grid search. The model that gets the best source development set F1, 0.891, gets only 0.783 F1 on the out-of-domain test set, while the best model on the target domain, F1 = 0.854, gets F1 = 0.840 on the source development set. The correlation coefficient between the source development set F1 scores and the target test set F1 scores is 0.801 (the equivalent correlation coefficient for the vanilla BERT models is 0.775). This is quite a strong correlation, but it is nevertheless hard to single out 1 specific model from its development set performance and guarantee that it will maximize performance on the target test set. From these results, we decided that an ensemble-based approach would be preferable to a single model. Therefore, for the remaining BERT+UDA experiments below, we use the models that perform in the top 33% on the source development set and take a majority vote over these models. For fairness, we also report the results of an ensemble of fine-tuned BERT classifiers with different hyperparameters, again using a majority vote from the top 33% of models based on source development set performance.

Figure 3. Scatterplot of Strat performance as a function of Seed development set performance.

Table 1 shows the results of our main experiment.
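Before turning to the results in Table 1, the following is a minimal sketch of the top-33% majority-vote ensemble used for the BERT+UDA Ensemble and BERT Ensemble columns. The input names are illustrative assumptions: dev_f1 holds each configuration's source development set F1 score, and predictions holds each configuration's binary target-set predictions.

import numpy as np

def top_third_majority_vote(dev_f1, predictions):
    """Keep the top 33% of configurations by source dev F1 and majority-vote their predictions."""
    order = np.argsort(dev_f1)[::-1]          # best source dev F1 first
    n_keep = max(1, len(order) // 3)          # top 33%, at least 1 model
    kept = np.stack([predictions[i] for i in order[:n_keep]])
    votes = kept.mean(axis=0)                 # fraction of kept models predicting "negated"
    return (votes >= 0.5).astype(int)         # majority decision per instance (ties -> negated)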
We are able to closely replicate the results of Miller et al10 for "No Adapt" and "SCL," with a small gain for SCL that is probably caused by changes in the preprocessing of the feature set. The No Adapt setting is a feature-engineered support vector machine (SVM), similar to that reported by Wu et al.9 Neural SCL shows larger gains than SCL but has inexplicably low performance on Strat→Seed. BERT performs better than the first 2 domain adaptation methods, gaining 2 points on average and recording the best performance on 6 of the 12 pairs. The BERT+UDA ensemble obtains performance on par with NSCL but not as high as BERT on average. The BERT ensemble, included for a fair comparison, actually has lower performance than the single BERT model chosen based on source development performance.

Table 1. Main negation results

Source   Target   No Adapt  SCL   NSCL  BERT  BERT+UDA Ensemble  BERT Ensemble
Seed     Strat    0.76      0.80  0.79  0.83  0.82               0.82
Seed     MiPACQ   0.66      0.70  0.75  0.74  0.71               0.72
Seed     i2b2     0.79      0.83  0.81  0.87  0.83               0.83
Strat    Seed     0.66      0.67  0.60  0.71  0.65               0.67
Strat    MiPACQ   0.67      0.68  0.74  0.72  0.73               0.74
Strat    i2b2     0.79      0.80  0.74  0.84  0.83               0.85
MiPACQ   Seed     0.73      0.73  0.73  0.75  0.73               0.72
MiPACQ   Strat    0.78      0.79  0.82  0.83  0.82               0.79
MiPACQ   i2b2     0.77      0.85  0.83  0.87  0.87               0.88
i2b2     Seed     0.65      0.72  0.76  0.72  0.71               0.72
i2b2     Strat    0.59      0.69  0.77  0.73  0.70               0.74
i2b2     MiPACQ   0.64      0.69  0.74  0.74  0.70               0.71
Average           0.71      0.75  0.76  0.78  0.76               0.77

Source and Target columns indicate which dataset (domain) was used to train the system (source) and to evaluate it (target), simulating domain transfer. Remaining columns report F1 scores of negation systems in the different adaptation setups explained in the main text. Abbreviations: BERT, bidirectional encoder representations from transformers; NSCL, neural structural correspondence learning; SCL, structural correspondence learning; UDA, unsupervised domain adaptation.

Table 2 shows the results of the analysis of the intersection of SCL and BERT errors. In most pairs, BERT fixes many mistakes, while SCL makes very few fixes. As a result, for most pairs, the additional gains available from SCL are minimal. This suggests that, even among the fixes that SCL makes, most are subsumed by the fixes that BERT makes.

Table 2. SCL doesn't get many examples right that BERT doesn't also get

Source   Target   SCL uniquely fixed  BERT uniquely fixed  % over BERT
Seed     Strat    2                   47                   2.47
Seed     MiPACQ   10                  291                  3.13
Seed     i2b2     3                   207                  0.95
Strat    Seed     10                  85                   7.87
Strat    MiPACQ   7                   251                  2.69
Strat    i2b2     1                   164                  0.51
MiPACQ   Seed     18                  147                  11.39
MiPACQ   Strat    0                   38                   0
MiPACQ   i2b2     34                  103                  27.64
i2b2     Seed     16                  172                  7.96
i2b2     Strat    0                   44                   0
i2b2     MiPACQ   31                  320                  8.78
Average           22.8                149.4                6.12

DISCUSSION

In our primary result (Table 1), the single best BERT model scores highest on average, with an F1 of 0.78 and the best performance on 6 of the 12 pairs. This shows the strength of BERT as a universal encoder trained on a huge amount of data. This model alone is very robust to domain changes, at least for this negation task, where informative negation cues are close to the entities in question. We were somewhat surprised by the size of the increase BERT provided, but perhaps we should not have been, given other recent results. What is truly surprising is that the BERT+UDA system, which adds domain adaptation on top of BERT, is not any better than BERT alone.
This result led to the obvious question of whether BERT somehow subsumes what domain adaptation does, which is what originally led to the experiment featured in Table 2. In Table 2, we show that even when an adaptation algorithm gets significantly better results than no adaptation, it is not doing much that BERT does not already do. First, by virtue of being better performing, BERT fixes more examples than the conventional SCL method does. But if we look at the instances that are fixed only by SCL, there are very few compared with what BERT fixes alone; for most pairs, they would add fewer than an additional 10% of corrections. This helps explain the perhaps surprising result that BERT+UDA rarely outperforms BERT: there are few gains to be had over vanilla BERT, and they may take some effort to learn in a way that does not harm performance on other examples.

Table 2 has 1 pair whose results stand out slightly: MiPACQ to i2b2, where SCL provides a gain of 27.64% on top of BERT. This pair is worth investigating to see whether UDA methods may still have potential for improving BERT. We examined the 34 instances that SCL gets correct beyond BERT's improvements to see whether there were patterns. We first find that all but 1 of these instances are false positives: the gold standard says the entity is not negated, but both the "No Adaptation" setting and BERT label it as negated. Further, we find that over half of these false positives occur in the setting of conditional lists, where the note lists a set of possible concerning outcomes. For example, in the sentence "She is asked to call with any fevers, chills, increasing weakness or numbness, or any bowel and bladder disruptions," there are several events that need to be classified (fevers, chills, weakness, etc.), but, in the gold standard, these are all marked as not negated. These can cause confusion for some models, because the construction "any X, Y, or Z" can also be used for negation lists, although normally only in combination with a negation cue word. Next, we looked at whether BERT gets these instances wrong with any training set (ie, whether SCL was doing unusually well for the MiPACQ→i2b2 pair), or whether BERT can sometimes get them correct (ie, whether BERT did unusually poorly for the MiPACQ→i2b2 pair). We find that BERT does unusually poorly for this pair. When the source corpus is either Seed or Strat, BERT can in fact get many of these false positives correct: 21 of 33 for Seed and 23 of 33 for Strat. This means that the MiPACQ corpus is uniquely bad at preparing BERT to handle lists of conditional items. Ongoing work in our group and many others is exploring a variety of techniques for understanding BERT-trained models, which we will apply to these models in future work.

While, on average, explicit domain adaptation performs worse than BERT, we saw in development that the best BERT+UDA model on the development set can outperform BERT, and so model selection deserves some attention. Our results in the first experiment showed that there are individual BERT+UDA models with higher performance than the best BERT model, but the mapping from source development performance to target performance does not hold up quite as well for BERT+UDA as it does for BERT alone. This is an interesting intellectual question, and future work should investigate it, but, given the relatively small gains that Table 2 suggests are available, it may not justify the highest priority.
As mentioned in the introduction, several recent BERT models have had their pretraining extended using the unlabeled MIMIC III data. For this work, we fine-tune vanilla BERT, as we already see large gains over existing methods and wanted to examine these models closely before studying the more complex setting with 2 pretraining phases. Preliminary experiments with ClinicalBERT on our development pair also showed that its performance did not surpass vanilla BERT, despite extensive tuning. In other work, the task that shows the largest benefit from MIMIC pretraining of BERT is named entity recognition.25 In that task, one can imagine that exposure to the large MIMIC text dataset is important for recognizing medical terms. In our task, by contrast, the medical entities are given, and the task is more of a general language task in the sense that the model needs to find a negation cue and then relate it to the given entity. For that reason, it is not surprising that additional pretraining on MIMIC does not help.

These results suggest a number of future directions. First, while improved performance is desirable, the models produced by fine-tuning BERT are much larger and less interpretable than existing methods, including feature-based classifiers as well as rule-based methods. One direction already being explored in the general domain has been dubbed "BERTology": the science of understanding how BERT works. This involves examining the values of attention heads and intermediate nodes to try to build a theory (or at least a mental model) of how BERT reasons about language. Improving our understanding of how BERT recognizes complex linguistic phenomena suggests an opportunity to leverage that understanding to develop better simple methods, for example by improving feature engineering in standard linear models. This direction can potentially address concerns about both the size and the complexity of these models. The next direction is extending the BERT model to other classification tasks related to concept status. Modeling uncertainty, genericity, and historical status, among others, is important but more difficult than the negation classification task. The success of BERT on so many NLP tasks suggests it will be wise to revisit these problems. As we saw in our error analysis, where conditional mentions were confused with negated mentions, there is overlap between these tasks. This suggests that an approach using joint inference or multi-task learning may be better able to model the interactions among the different concept status classification tasks.

The conclusions from this work are not guaranteed to apply to other clinical NLP tasks. As mentioned above, negation detection (and probably other assertion status classification tasks) likely benefits from the fact that BERT's massive general-domain pretraining is well suited to learning how negation cue words relate to nearby entities. In contrast, for tasks like finding medical named entities, even large datasets like Wikipedia will lack coverage, and, therefore, BERT may be more dependent on the source dataset to learn the task. In that case, BERT may still overfit, but this work remains to be done.

CONCLUSION

The main result of this work is that training BERT for clinical negation detection is likely to be as valuable as developing new domain adaptation methods. This does not mean that BERT has made domain adaptation unnecessary, because, as mentioned above, domain adaptation may still be beneficial for other NLP tasks.
However, our interpretation of these results is that the best path forward for practical negation detection is to use existing datasets with some flavor of BERT. Since there is still variation in performance based on the size and coverage of the source corpus, the best-performing BERT-based negation model should be built using multiple clinical datasets. We are currently working on incorporating such models into the open-source Apache cTAKES system to make them available to the public.

FUNDING

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award numbers R01LM012918 and R01LM012973. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

AUTHOR CONTRIBUTIONS

CL, SB, and TM designed the study. CL carried out the primary experiments. CL and TM carried out analysis and secondary experiments. All authors contributed to developing the outline and editing the manuscript. CL and TM wrote the manuscript.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

1. Uzuner Ö, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18(5):552–6.
2. Pradhan S, Elhadad N, South BR, et al. Task 1: ShARe/CLEF eHealth evaluation lab 2013. In: Proceedings of the ShARe/CLEF Evaluation Lab 2013; 2013.
3. Wang Y, Wang L, Rastegar-Mojarad M, et al. Clinical information extraction applications: a literature review. J Biomed Inform 2018;77:34–49.
4. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34(5):301–10.
5. Harkema H, Dowling JN, Thornblade T, et al. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 2009;42(5):839–51.
6. Mehrabi S, Krishnan A, Sohn S, et al. DEEPEN: a negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform 2015;54:213–9.
7. Sohn S, Wu S, Chute CG. Dependency parser-based negation detection in clinical narratives. AMIA Jt Summits Transl Sci Proc 2012;2012:1–8.
8. Bhatia P, Celikkaya B, Khalilia M. Joint entity extraction and assertion detection for clinical text. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019:954–9.
9. Wu S, Miller T, Masanz J, et al. Negation's not solved: generalizability versus optimizability in clinical natural language processing. PLoS One 2014;9(11):e112774.
10. Miller T, Bethard S, Amiri H, et al. Unsupervised domain adaptation for clinical negation detection. In: BioNLP 2017. Vancouver, Canada: Association for Computational Linguistics; 2017:165–70. http://www.aclweb.org/anthology/W17-2320.
11. Blitzer J, Dredze M, Pereira F.
Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Prague, Czech Republic: Association for Computational Linguistics; 2007:440. http://www.cs.brandeis.edu/~marc/misc/proceedings/acl-2007/ACLMain/pdf/ACLMain56.pdf.
12. Blitzer J, McDonald R, Pereira F. Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006); 2006. http://dl.acm.org/citation.cfm?id=1610094.
13. Ziser Y, Reichart R. Neural structural correspondence learning for domain adaptation. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada: Association for Computational Linguistics; 2017:400–10. doi:10.18653/v1/K17-1040.
14. Miller T. Simplified neural unsupervised domain adaptation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, MN: Association for Computational Linguistics; 2019:414–9.
15. Ganin Y, Ustinova E, Ajakan H, et al. Domain-adversarial training of neural networks. J Mach Learn Res 2016;17:1–35.
16. Chen M, Xu Z, Weinberger KQ, et al. Marginalized denoising autoencoders for domain adaptation. In: Proceedings of the 29th International Conference on Machine Learning. Edinburgh, Scotland: Omnipress; 2012:1627–34.
17. Ben-David S, Blitzer J, Crammer K, et al. A theory of learning from different domains. Mach Learn 2010;79(1–2):151–75.
18. Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, NM; 2018:1638–49.
19. Howard J, Ruder S. Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics; 2018:328–39.
20. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, MN: Association for Computational Linguistics; 2019:4171–86.
21. Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, LA: Association for Computational Linguistics; 2018:2227–37.
22. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016;3(1):160035.
23. Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, MN: Association for Computational Linguistics; 2019:72–8.
24. Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv 2019;abs/1904.05342.
25. Si Y, Wang J, Xu H, et al. Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc 2019;26(11):1297–304.
26. Han X, Eisenstein J. Unsupervised domain adaptation of contextualized embeddings: a case study in early modern English. CoRR 2019;abs/1904.02817. http://arxiv.org/abs/1904.02817.
27. Albright D, Lanfranchi A, Fredriksen A, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc 2013;20(5):922–30.
28. Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training. 2018. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf. Accessed January 21, 2020.
29. Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners. OpenAI Blog 2019;1(8):9.
30. Yang Z, Dai Z, Yang Y, et al. XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237; 2019.

© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).
