Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Evaluation of lexicon- and syntax-based negation detection algorithms using clinical text data

Evaluation of lexicon- and syntax-based negation detection algorithms using clinical text data IntroductionText is an important data type in any electronic health record (EHR) system and can be used in various files such as patient medical history, discharge summaries, radiology reports and laboratory test results. Files of this nature are unstructured text data that are difficult for humans to manually analyze. Therefore, the natural language processing (NLP) approach was designed based on the principle of analyzing the linguistic concepts of each file using various methods, which are classified into two levels, low and high [1]. Low-level tasks include tokenization, sentence splitting, part-of-speech tagging (noun, verb, adjective etc.), morphological analysis, syntactic parsing, coreference resolution and so on. High-level tasks include spelling/grammatical error identifications, named entity recognition (NER), word sense disambiguation, negation and uncertainty identification, relationship extraction, temporal inferences/relationship extraction and information extraction. Chapman et al. found that nearly half of the clinical concepts found in text files were negated, such as “patient denies chest pain”. Because of this, negation detection (ND) is one of the significant strategies used to denote whether a given concept is absent or present. NegEx is a simple, regular-expression-based ND algorithm. Today, the ND approach has been applied in a wide range of applications, including information retrieval systems [2], [3], cTAKES [4], GATE (general architecture for text engineering) [5], MITRE system [6] etc.In 2015, Ou and Patrick [7] classified the ND approach into three subtasks: lexicon, syntax and machine learning. First, lexicon-based algorithms follow a rule-based approach, relying on trigger terms and termination clues. Second, syntax-based algorithms also follow a rule-based approach, in which the number of rules is presented and the results are based on the dependency from the Stanford Dependency Parser (SDP) output. Third, machine learning-based algorithms mostly depend on using one of the classifier methods as a support vector machine. Although many studies of negation have been based on machine learning [8], [9], [10], their sources are not free to access. Therefore, the aim of this study is to investigate only those negation methods based on lexicon and syntax. Annotated data are very important in this research to measure the accuracy of the tested ND systems. In the medical field, very little annotated data are available due to patient privacy and confidentiality requirements. In 2010, i2b2 (Informatics for Integrating Biology and the Bedside) and VA Salt Lake City Health Care System manually annotated the patient reports from three institutions, which was accessed by signing a data use agreement. In this paper, the 2010 i2b2/VA data set is used as input data to evaluate the selected ND methods (further described in the Data Sources section).In addition, the performance of the five chosen ND algorithms can be analyzed using various statistical methods. To implement this work, a simple desktop application is first developed for the preprocessing tasks, which converts the unstructured data into structured data. Second, we modify each ND algorithm to accept specified input data. Third, the selected ND algorithms will be executed and evaluated on the 2010 i2b2 Clinical NLP Challenge data. Later, the post-processing step is conducted manually. To our knowledge, no previous research has analyzed the problem of these two ND approaches, especially lexicon- and syntax-based approaches.Review of literatureIn NLP, the ND method has been widely applied in biomedical research, particularly in unstructured text data [11], [12], [13], [14]. According to Eleatic philosophers, the early Buddhists in India had introduced the first explorations of negative concepts in ontology [15]. The Greek philosopher and scientist Aristotle, in his theory of negation, is rooted it in shifting the issue from the domain of ontology to logic and language. Based on the work of Ou and Patrick [7] in 2015, the ND algorithms are classified into three subtasks: lexicon, syntax and machine learning. Although the majority of papers on the subject have been based on the machine learning approach [8], [9], [10], these resources are not available for public access, and therefore, the machine learning-based approach is not considered in this work.Lexicon-based methodsRule-based techniques are one of the earliest, and still widely used, approaches among the clinical NLP challenges. To deal with these challenges, a different lexicon-based ND algorithm could be discovered similar to NegEx [16], NegExpander [17], NegFinder [18] and NegHunter [19]. NegEx is the first open-source algorithm for the regular expression-based approach. Recently, this algorithm has been successfully extended into applications such as ConText [20], pyConTextNLP [21], DEEPEN [22], cTAKES [4] and others. Chapman et al. tested the NegEx algorithm using patient discharge summaries as input data and found that the algorithm sensitivity was 78% and the specificity was 95%. The ConText algorithm was evaluated in the development sets of 120 reports (F-measure 97% for six-token window and 98% for end-of-sentence) and the test sets of 120 reports (F-measure 86%). Chapman et al. developed peFinder, a new framework that uses the module pyConTextNLP. This was designed for the classification of CT pulmonary angiography reports.Syntax-based methodsThe syntactic parser is a language processing method, different from lexicon-based ND methods. The challenge in lexicon-based ND methods is clearly described in this example, “there is no evidence of cervical lymph node enlargement”. Here, “no” is the negation signal used to detect that the concept “cervical lymph node enlargement” was negated. However, the test results of “cervical lymph node” were not considered negated based on discussions with physicians. Therefore, Hang and Lowe [23] intended to use a grammatical parser on the regular expression-based negation algorithm. In a biological database, the negative information about protein-protein interactions had been carried out using a full-dependency parser and semantic relations [13]. Zhu et al. [24] proposed a semantic parsing approach to BioScope corpus. A simple algorithm called dependency parser-based negation (DepNeg) showed that their F-score of 83% is slightly higher than the cTAKES negation module (82%) [25]. Mehrabi et al. [22] have developed a negation algorithm called DEEPEN using a transition-based dependency parser. They have tested the proposed algorithm on two different data sets, and the results showed that precision and recall were approximately 89–97% and 74–96%, respectively. In 2016, Gkotsis et al. [26] introduced a stand-alone negation tool using the probabilistic context-free grammar (PCFG), in which 6000 annotated sentences of mental health records were inputted in the training phase. Their results showed precision of 89% and recall of 95%. Over a decade, ND techniques were combined with other applications such as GATE [5], sentiment analysis [27], greedy algorithms [11], kernel methods [28] etc. In 2017, Kang et al. [29] reported that character and word embedding are more effective in identifying negations and tested this on Chinese clinical notes, which resulted in an F-score of 99%.Evaluation studies of ND methodsSince the comparison of negation approaches have been implemented, few studies have been reported. Four different ND methods were evaluated by Goryachev et al. [30]. In a 2013 report by Tanushi et al. [31], three different ND algorithms (NegEx, PyConTextNLP and SynNeg) had been analyzed among Swedish clinical texts. It was concluded that the syntax-based approach is best when dealing with longer/complex or shorter sentences. At last, the comparisons given in [32] are comprehensive, which point out that the ND method is still not generalized.Materials and methodsThis work describes two different types of negation algorithms, lexicon-based (or regular expression-based) and syntax-based. This section includes the main part of the work such as the database and statistical methods used and the five ND algorithms tested.Data sourcesThe human annotated clinical data set “2010 i2b2/VA NLP challenge” was developed by the i2b2 and Veterans Affairs (VA) Salt Lake City Health Care System [33]. This data set includes patient discharge summaries and progress notes, which are the input data used for this research. There are 73 discharge summaries collected from Beth Israel Deaconess Medical Center, 97 from Partners HealthCare and 256 progress notes from University of Pittsburgh Medical Center. In the 2010 i2b2/VA workshop, clinical NLP challenges are focused on three tasks: (1) extracting medical concepts and classifying them into problems, tests and treatments, (2) annotating these assertions and (3) finding the relation between medical problems, tests and treatments. In the end, there are 19,665 medical concepts labeled “problem”, 13,833 labeled “test” and 14,188 labeled “treatment”.In addition, there are four types of files used. One is the report file, which refers to individual patient data. Other files are concepts, assertion and relation, which shows the manually annotated data (see Table A1 or Table 2). The assertion annotation should be presented on the concepts grouped by problems. The assertion results will be divided into five different categories: present, absent, possible, conditional and hypothetical. Of these, we consider only two assertions: “absent” and “present”. In this study, the preprocessing step is to extract concepts, sentences and their assertion results (see sample preprocessed data in Table A2). After preprocessing, Table 1 is created using “absent” and “present” assertions from the 2010 i2b2/VA challenge data set.Table 1:Data set description.Data set descriptionsAbsentPresentTotal number of sentences34841958Average number of sentences20.4911.51Total number of words50,18351,816Average number of words per report295304Average number of words per sentence14.4026.46Lexicon-based ND algorithmsNegExThis has a simple regular expression-based ND algorithm and was introduced by Chapman et al. in 2001 [16]. This algorithm includes 281 negation phrases that are classified as conjunctions (joins two phrases together), pseudo-trigger terms, pre-negative trigger terms, post-negative trigger terms and pre- and post-negative trigger terms. The source code for NegEx was freely distributed by the author. The older version of NegEx did not use the possible conjunctions phrases, but the latest version, Apache 2.0, is upgraded with the negation phrase conjunctions.ConTextThis algorithm is extended from the NegEx algorithm by adding two subjects: temporality (recent, historical and hypothetical) and experiencer (patient and other) [20]. This is a lightweight desktop application. It can be used by a non-programmer and applied to different types of clinical reports (see Appendix Figure A1). In addition, this application can perform an information retrieval task by searching the trigger terms and finding whether it is preceded or followed by the indexed clinical findings. In this work, there are 143 negated, 10 historical, 11 hypothetical and 26 other trigger terms. There are also pseudo-triggers, including 17 for negated (e.g. “no increase”, “not cause”), 17 for historical (e.g. “social history”, “poor history”), 4 for hypothetical (e.g. “if negative”, “know if”) and 18 for other (e.g. “by her husband”, “by his brother”). Finally, the 12 termination terms include presentation, patient, because, diagnosis, ED, etiology, recent, remain, consistent, which, and, and but. The default value of the negation property is affirmed, default temporality is recent, and the default experiencer is the patient. The two main steps in the ConText algorithm are defined as follows:Step 1: Mark up all trigger terms, pseudo-trigger terms, and termination terms in the sentence.Step 2: Iterate through the trigger terms from left to right,If the trigger term is a pseudo-trigger term, skip to the next trigger term.Otherwise, it determines the scope of the trigger term and finds out whether the given concepts are present within the scope of the trigger term. If they are present, then the output is negated, if it they are not present, the output is affirmed.Although the ConText algorithm has two additionally proposed contextual properties such as temporality and experiencer, these were not considered, as this study only focused on negation property. This algorithm was written in the Java language and is hosted at the following web address: http://code.google.com/p/negex/.pyConTextNLPThe source code of the ConText algorithm was converted from Java to Python and called pyConText. To avoid name conflicts within the Python package index, pyConText was renamed pyConTextNLP. In this package, a number of items are used, where each item contains the following elements: literal, category, regular expression and rule. Literal is a lexical phrase, for example, “can rule out”, “cannot be excluded” etc. The category is what the item refers to, for example, a finding or an uncertainty term. The categories from the previous implementations of ConText included finding, negated existence term, conjunction, pseudo-negated existence term and temporality term. In pyConText, categories are user-defined, and there are no limits to the number of categories that can be used. Regular expression is optional, which captures the literal tags in the text. If a regular expression is not provided in the definition of the item, the literal category is used to directly generate the regular expression. The rule is also optional, which directs how to generate tagObjects from the item using the states forward, backward and bidirectional. The regular expression is used to identify any literal phrases in the sentence. When a literal phrase is identified, a tagObject is created. More details on this tagObject can be found in [21]. Today, several studies have focused on this algorithm for developing a web-based tool [3], extracting information and classifying the negative cases from the radiology reports [34], [35]. For doing further work, the authors freely distributed this package through the following web address: https://pypi.python.org/pypi/pyConTextNLP.Syntax-based ND algorithmsDEEPENThis stands for DEpEndency ParsEr Negation. This algorithm was proposed by Mehrabi et al. [22]. The SDP is incorporated into this algorithm, which annotates the concepts that are considered negated by the NegEx algorithm [22]. Note that when a concept is affirmed by the NegEx algorithm, a concept is not negated by the DEEPEN algorithm. The SDP parser uses the transition-based dependency relation between negation words and concepts. SDP encompasses 53 grammatical relations, for instance determiner, infinitival modifier etc. The relations are found for each word. In addition, SDP has a set of rules including conjunction and rule, preposition within rule, preposition with/in/within rule, nominal subject rule and suggest rule. The SDP output includes dependency relation, governor term and dependent term. Dependency relation is the grammatical relation between the dependent and the governor terms. The governor term refers to the word in the sentence that the dependency relation reports, and the dependent term is the word that is dependent of the governor term. For instance, in the sentence “Based on this, he required no operative intervention for his pseudocyst”, the output would be as follows: det(intervention-9, no-7). In this example, “det” is the dependency relation, “intervention” is the governor term, and “no” is the dependent term. The production chain concatenates sample sentences such as “No evidence of dilatation”. In every sentence, a production chain is generated that is composed of three levels of tokens: the governor of negation term (“evidence” in determiner; evidence-2, No-1), the dependents of first-level tokens (“of” in prepositional modifier; evidence-2, of-3) and dependents of second-level tokens (“dilatation” object of a proposition; of-3, dilatation-4). If a concept is found within a production chain, it is then confirmed that the concept is negation; otherwise, it is considered affirmation. Although the DEEPEN algorithm is used to decrease incorrect negations found when using NegEx, it cannot access the affirmed results as NegEx does and its performance would be lower on ungrammatical sentences. The DEEPEN source code was developed in Java and is free for access for research purposes at http://svn.code.sf.net/p/ohnlp/code/trunk/DEEPEN.Negation resolution (NR) algorithmThis is a new automated negation method integrating PCFG parsers, applied specifically to mental health records and used to determine whether a suicide risk is affirmed or negated [26]. A PCFG parser was used with the Stanford Core NLP toolkit, which generates a constituency parse tree for each sentence. Every sentence must be preprocessed before starting the NR algorithm. A syntactic representation (i.e. parse tree) and the token of the target concept (e.g. “suicide”) can also be carried out in the sentence structure. The basic principle of an NR algorithm is to reduce the problem of identifying negation scope in the given sentence. Gkotsis et al. [26] included 15 negation words to find the negation scope. Presently, we provide 128 additional negation cues and execute the ND tasks such as pruning, identification of subordinate clauses, identification of negation governing the target-node and NR. This algorithm was written in the Python programming language and is publicly available in the author’s repository (https://github.com/gkotsis/negation-detection). The difference between the DEEPEN and NR algorithms is that DEEPEN depends on NegEx and uses the dependency parser, whereas the NR approach is a fully standalone application based on the constituency parser. Moreover, the NR algorithm is now works only in a UNIX environment. Therefore, it does not currently work on Windows computers.CustomizationTable 2 shows an example of a patient report and its annotated files for concepts, assertions and relations. The present study is only interested in terms of annotated assertions that should be thoroughly investigated.Table 2:Example of i2b2-annotated files.File namesExamplesReportThe patient is stable without oxygen requirementThere was no evidence of acute rib factors on her chest X-rayThe patient has a history of neurogenic bladder, and therefore, she required Foley catheterizationConceptsc=“an oxygen requirement” 1:5 1:7 | | t=“problem”c=“her chest X-ray” 2:9 2:11 | | t=“test”c=“a Foley catheterization” 3:14 3:16 | | t= “treatment”Assertionsc=“an oxygen requirement” 1:5 1:7 | | t=“problem” | | a=“absent”Relationsc=“her chest X-ray” 2:9 2:11 | | r=“TeRP” | | c=“acute rib factors” 2:5 2:7c=“a foley catheterization” 3:14 3:16 | | r=“TrAP” | | c=“neurogenic bladder” 3:6 3:7The first problem is how to extract the concepts, sentences and assertions from two separate documents like patient reports and assertions. Manually doing this work is very difficult. To solve this problem, this study proposes a desktop application that can automatically extract the concepts along with their sentences and assertions. This output will be stored in a single tab-delimited file. The second problem is how to run the above ND algorithms, when the input is in a new format (tab-delimited file). For example, the NegEx algorithm is applied to a single input file, whereas ConText algorithm receives input from the text box (see Appendix Figure A1). Therefore, some modification is required to process each ND algorithm, which will be described later in this section.PreprocessingThis follows four steps (Figure 1). First, a new application is developed. The application can automatically read all files one by one within a folder containing all assertion files. The concepts and their sentences and assertions are carried out using the regular expression. In this case, concepts are extracted by the first double quotes, in which sentences are extracted by the line number, and assertions are extracted by the third double quotes. These outputs are written in a single tab-delimited file. Second, we manually filter the assertion status belonging to absent and present. Third, the manual assertions are removed. Fourth, we filter the sentences, which are included in selected negation words.Figure 1:Architecture of proposed evaluation of ND algorithms.Modification of selected ND algorithmsThe preprocessed data are of two attributes, clinical concepts and their sentences. Now, the input format should be changed to each ND algorithm. Table 3 shows a number of modifications being carried out on each algorithm.Table 3:Modifications of the five algorithms.AlgorithmsModificationsNegExNo changes are madeConTextFirst, the input format is corrected for preprocessed data. Second, the information retrieval process is stopped, which allows for a reduced time when finding negation scope within a sentence. Finally, we set the output to be stored in a comma separated file for further evaluationpyConTextNLPThis algorithm is updated by adding the following features:1. Handling the user-defined input data2. Target concepts are loaded into the item data3. To find given concepts that may present within the negation scope4. Writing the output in a comma separated fileDEEPENThis algorithm is improved by the following tasks:1. Get data from file instead of from pre-defined sentence2. The algorithm reads the given input file that contains concepts and sentences3. A production chain is generated which is composed of three levels of tokens4. If the concept is found within the production chain, the algorithm confirms that the sentence is negated. Otherwise, the sentence is affirmed5. Write the output in a comma separated fileNRThis algorithm is improved by the following tasks:1. Get data from file instead of from pre-defined sentence2. Previously, this algorithm was generated for the single concept of “suicide”. This algorithm is now modified to deal with more than one concept3. The final results of this algorithm are written in a comma-separated filePost-processingThis is fully a manual approach. After the execution of all ND algorithms, the user can perform some post-processing on the results. Data post-processing can be categorized into the following groups: knowledge filtering, interpretation and explanation, evaluation and knowledge integration [36]. In data post-processing, the evaluation results are taken by the statistical analysis, to be explained in the next section. This method is also useful in finding missing data, for example, if the algorithm generates null results. Another possibility is to view the noisy data, for example, if the algorithm generates unexpected results.WorkflowThe golden rule of workflow is “first organize, then computerize”. Hence, we first design our work in a more abstract way, without considering implementation, which is illustrated in Figure 1. The order of work can be described as follows: (i) preprocessing and setting the data in a common format, (ii) executing all ND algorithms, (iii) applying the post-processing tasks and (iv) evaluating individual system performance using statistical analysis. The first three steps are described in the Materials and Methods section. The statistical analysis is present in the next subsection.Statistical analysisThe statistical method determines the differences between the system output and the gold standard database. The results of the above-mentioned algorithms have to be proved by statistics. Individual system performance will be evaluated using statistical methods such as accuracy, precision, recall, F-measure, kappa and area under the curve (AUC). The confusion matrix is a 2∗2 table (see Table A3) that indicates both the predicted (column) and actual results (row). The results are classified in every ND algorithm as true positives (TP: the patients do not have the disease), true negatives (TN: the patients have the disease), false positives (FP: they actually they have the disease) and false negatives (FN: they actually do not have the disease). In this study, statistical significance was analyzed by the R software using two packages, caret and pROC.This study does not require ethical approval because the data set does not involve any vulnerable participants. Additionally, this data set has been collected freely from a registered online source by signing the data use agreement: https://www.i2b2.org/NLP/DataSets/Main.php.ResultsTypically, most negative clinical statements use six negation phrases: cannot, free, no, not, resolved and without. The same negation phrases are used in this research. Before preprocessing, the input data contained a total of 3484 absent sentences, in which 3048 sentences included the selected negation phrase. Now we assume that 3048 were responders and 436 were not. Likewise, of 1958 present sentences, 1470 sentences included the selected negation words. In this case, 1470 sentences were responders and 488 were not. All ND algorithms were executed using these two responder’s sentences and the results presented in Table 4.Table 4:Results of the five algorithms.AlgorithmsAbsentPresentAbsentPresentNoneAbsentPresentNoneNegEx13031745011213490ConText2507541026512050pyConTextNLP289115705349360DEEPEN2375673042010500NR2325143580110640720It is interesting to see that each individual system performed in the order of true-absent, false-absent, none-absent, true-present, false-present and none-present. We note from Table 4 that if the algorithm does not have any results for a particular input, the result is considered to be a “none” value. Here, none results are only produced by the NR algorithm when the target concept may be more than one word. Without considering these null results, the percentages of the NR algorithm were 94% of absent and 85% of present. Thus, the result is better than ConText (82% and 81%, respectively) and pyConTextNLP (94% and 63%, respectively). However, the performance of NR algorithm fails to work if a target concept contains more than one word.The pyConTextNLP algorithm generated results similar to the NR algorithm. Unexpectedly, the algorithm produced multiple results when the target concept was written more than once within a single sentence. Meanwhile, the ConText algorithm did not have a significant result due to the presence of some false-absent and false-present results. The percentage of the DEEPEN algorithm was 77% true absent and 71% true present. Finally, the test showed inadequate results from the NegEx algorithm. The results so far have been graphically displayed in Figure 2. The negation phrase “no” has occurred 2161 times in absent and 727 times in present, which is one of the greatest occurrences compared to other phrases. Table 5 shows the algorithm performance on each of the six negation phrases. Traditionally, negation algorithms have been tested on absent sentences. This work is alternatively executed within both absent and present sentences. The predicted results of each algorithm are labeled absent cases “A” and present cases “P”. Manual results are also indicated within the parentheses.Figure 2:Performance of algorithms.Table 5:Results of the ND algorithms between topmost negation-words.Trigger termsManualNegExConTextpyConTextDEEPENNRAPAPAPAPAPCannotA (13)310941129473P (30)255030228121813FreeA (47)2423291828192918176P (54)4862521242272723NoA (2161)9571204201814320629916914701550202P (727)55717022350425946814857931383NotA (235)1221139314223051795623256P (344)259852531914420010124327223ResolvedA (64)412316486222044177P (58)421645448101345511WithoutA (528)156372342186498304478132843P (257)1906711246691881191382041The highest average score for absent was 95% in pyConTextNLP, and the highest score for present was 80% in the ConText algorithm. However, it should be noted that a negation phrase can exactly match another phrase, such as “resolved completely” and “complete resolution of”.EvaluationAs mentioned in the above section, it is important to assess the quality of the five ND algorithms. First, we verified the algorithm results by the gold standard database, and we classified the results into TP, FP, TN and FN. Second, the evaluations were performed in terms of accuracy, precision, recall, F-measure, Cohen’s kappa and AUC. Table 6 summarizes the evaluation results. Note the total number of sentences (S), which included both present and absent sentences. The precision (P), recall (R) and F-measure (F) have separately calculated for absent (suffixes “_0”) and present (suffixes “_1”). In addition, Cohen’s kappa and AUC were calculated without dividing the results into two groups (present and absent). Additionally, the receiver operating characteristic (ROC) curve is a chart in which the TP rate is plotted on the y-axis, and the FP rate is plotted on the x-axis. The ROC curve is often used to find the relationship between sensitivity and specificity. The high value of F-score was (in bold) found in NR algorithm.Table 6:Evaluation results of the five algorithms.AlgorithmsSTPFPTNFNAP_0R_0F_0P_1R_1F_1KAUCNegEx451834911211303174536425347231619−367ConText45181205265250754182829868169746182pyConText45189365342891157849484896385736179DEEPEN45181050420237567375778481716654775NR3218664862151317878796918867767888Figure 3 shows the ROC curves, which were derived from our experimental results. It is interesting to note that the sensitivity of the NR algorithm is higher than that of the other four ND algorithms. The sensitivity of the DEEPEN, pyConTextNLP and ConText algorithms were slightly lower than expected, and there is certainly room for improvement.Figure 3:Results of ROC.Computational timeComputational time is an important metric to find the length of time that is required to process a computer program. Thus, we have calculated this time by milliseconds. Table 7 illustrates the total execution time for each methodology chosen for our research. The acronym PT stands for preprocessing time and PTND stands for the process time of ND algorithms. Note that the average time of a sentence has been taken in which execution time of single output (i.e. single sentence) was derived from the overall outputs (i.e. total number of sentences). It is easy to understand that the overall execution time is divided by the total number of sentences. The regular time consumption for preprocessing is 22 ms in absent sentences and in 46 ms in present sentences. The total turnaround time is calculated by the following formula:Table 7:Run time of the five algorithms (selected negation-words).AlgorithmsResultsPT (ms)PTND (ms)Total time (ms)Average timeNegExAbsent67,93542,245110,18036ConText67,935121,477189,41262pyConTextNLP67,93569168,62622DEEPEN67,93524,384,00024,451,9358022NR67,93549,360117,29548NegExPresent67,93520,37488,30960ConText67,93558,579126,51486pyConTextNLP67,93533068,26545DEEPEN67,93511,760,00011,827,9358046NR67,93515,00082,935111(1)Total time=PT+PTND.$${\text{Total time}} = {\text{PT}} + {\text{PTND}}{\text{.}}$$Later, the average time per sentence should be calculated through the following equation:(2)Average time=Total time/TNS.$${\text{Average time}} = {\text{Total time/TNS}}{\text{.}}$$The basic system requirement for this experimental work was based on the first four algorithms (NegEx, ConText, pyConTextNLP, DEEPEN), which fully depended on the Windows 32-bit operating system, and the last NR algorithm, which was executed on the Ubuntu 32-bit operating system. In Windows, the software applications used were NetBeans IDE 8.1 (NegEx, ConText, DEEPEN) and Python 2.7.11 (pyConTextNLP). Figure 4 shows the execution time of the ND algorithms for both absent and present assertions. According to this time calculation, the highest time consumption is shown in the DEEPEN algorithm.Figure 4:Execution time of ND algorithms.The time consumption of the NegEx algorithm is better than that of the ConText, DEEPEN and NR algorithms. Finally, this time analysis confirms that the pyConTextNLP is still superior to overall approaches. In syntax-based ND algorithms, the NR algorithm is acceptable in terms of both interpretability and prediction accuracy.DiscussionThe characteristics of lexicon- and syntax-based negation methods are analyzed in this study. Recently, various approaches have been proposed to solve the issues included with negation approaches. Although many studies have been published on the evaluation of different ND methods [7], [30], [32], most of them were not evaluated for the syntax-based negation algorithms, especially in regard to DEEPEN and NR. In this paper, these two methodologies are discussed on a large-scale. In our experimental results, pyConTextNLP and NR algorithms are better than others, as proven by their statistical and computational time scores. The best performance from the lexicon-based ND approaches is pyConTextNLP, which has two advantages: (1) the identification of negation scope is simple with the new forward and backwards options and (2) the execution time is much less than other similar approaches. The NR algorithm generated the parser outputs in less time. This is the main reason why the NR algorithm performed well in the syntax-based ND approach.A similar study was initially conducted by Goryachev et al. [30]. Based on their results, they suggested that the lexicon-based approach is the best; however, the lexicon-based approach is not 100% accurate in this study. Ou and Patrick [7] also compared the performance of the lexicon-, syntax- and machine learning-based approaches. They reported that the accuracy (92%) and κ (79%) of the lexicon method NegEx was higher than those of the machine learning-based methods. This is higher than our results in terms of accuracy (36%) and κ (−3%). In fact, we have executed the NegEx algorithm in more than 100 clinical notes, whereas they used only 100 clinical notes. According to Wu et al. [32], they had found that the F-measure of NegEx was 82% among 33,022 sentences in the 2010 i2b2/VA challenge data. Determining the assertion task in the same i2b2 test set, our F-measure results were measured as slightly lower than those of the machine learning-based approaches [8], [9], [10], [32]. However, our research could be a useful aid for further investigations into different negation methods. To the best of our knowledge, this is the first evaluation of the methodology chosen for our research.Error analysisTable 8 shows that there are errors between the two types of negation methods using the same sentences, as reported in the article [22]. Here, the two assertion results provided include P for present and N for absent. It is likely that most errors in the syntax-based approaches are based on poor parsing results. For example, if the parse trees contained errors, the algorithm did not work as expected. Therefore, more training for the parser is necessary, i.e. a domain-specific parser. In Table 6, the percentage of errors in the absent claims is as follows: NegEx 38%, ConText 11%, pyConTextNLP 3%, DEEPEN 14% and NR 9%. In the same order, the percentage of errors in present is 24%, 5%, 11%, 9% and 2%. The errors are explained with two types of structured sentences, simple and complex, and examples of the errors in simple sentences follows:Table 8:Summary of errors.SentencesGold standardNegExConTextpyConTextNLPDEEPENNRIf her pain should not have been resolved by that time, there is the possibility of repeating facet rhizotomyPPPPPPHowever, I suspect that her pain is not due to an underlying neurologic disorderPPPPPNShe denies any ear pain, sore throat, odynophagia, hemoptysis, shortness-of-breath, dyspnea on exertion, chest discomfort, anorexia, nausea, weight-loss, mass, adenopathy or painNPPNPNMolecular fragile-X results reveal no apparent PMR-1 gene abnormalityNPNNPNMrs. Jane Doe returns with no complaints worrisome for recurrent or metastatic oropharynx cancerNNNNP−She is not having any incontinence or suggestion of infection at this timeNPNNNNShe denies any blood in the stoolNNPNP−No feverNPNNNPNo history of diabetesNPNNNNNo pneumonia was suspectedNPPPN−History inconsistent with strokeNPNPPNHis dyspnea resolvedNNPPP−Elevated enzymes resolvedNNPNP−“Compared with previous CT dated 02-08-2010, there are no significant changes in the size of the lymph nodes”“1. No residual metabolically active lymph nodes in the neck and thorax”In the above first sentence, the contextual meaning is generally positive (i.e. no changes in the size of the lymph nodes), but which has been incorrectly identified as negated in NegEx, ConText and DEEPEN.Note that these three ND algorithms have made wrong decisions based on the pre-defined negation phrase “no significant”. In the second sentence, there was a typographical error of no space between numeric list “1” and the negation cue “no”. However, this is actually the negative statement, and there is no one best result. Seeing that, it is clear that typographical errors are also one of the challenges in clinical NLP research. Meanwhile, examples of errors in complex sentences are given below:“Highly FDG avid non necrotic non calcified lower cervical, mediastinal and periportal lymphadenopathy. There is no evidence of any primary lesion. Possibilities are lymphoma/sarcoidosis/tuberculosis. Biopsy of mediastinal lymph nodes suggested for further evaluation”“There is complete resolution of the cervical and supraclavicular lymph nodes, significant resolution of the mediastinal and left axillary lymphadenopathy and left upper lobe lung lesion”The first sentence above involves more than one clinical finding, and multiple sentences are included, which did not split in the preprocessing steps. Therefore, the negation results of the five ND algorithms are as follows: NegEx (absent), ConText (present), pyConTextNLP (present), DEEPEN (present) and NR (none). The second sentence involves the negation phrase “resolution”. This negation phrase was not added in any of the five ND algorithms. Therefore, the output of the five algorithms could not be correct for the second sentence. Hence, complex sentences should be avoided in the clinical reports. Otherwise, the ND algorithms should be improved with regard to complex sentences.LimitationsSome limitations are present in this work. First, this study has only focused on selected negation cues in two assertions. Second, the input database is small and the large body of work may have helped to verify this conclusion. Third, the preprocessing steps of this work are not generalizable. In fact, the preprocessing application was limited for another domain.ConclusionIn this paper, we evaluated five negation detection algorithms in terms of accuracy and computation time. The algorithms include NegEX, ConText, pyConTextNLP, DEEPEN and NR. The first three are lexicon-based and the last two are syntax-based. We developed a new application for preprocessing, which can automatically read all files one by one, in which the folder contains the collection of assertion files. We modified the five ND algorithms when the input is in a new format. The experimental results show that pyContextNLP and NR have better results than the others. However, despite their good performance in simple sentences, the results were not as successful in complex sentences. This is an important issue for future research. However, we hope that our research will be helpful in solving the difficulty of ND algorithms and encourage the validation of larger sample sizes. The ultimate aim of this evaluation addresses how we will design and develop a new negation algorithm in the future.AcknowledgmentsWe thank S. Shahul Hameed and Gokulalakshmi Elayaperumal from Sree Balaji Medical College and Hospital, Chennai, India, George Gkotsis from IoPPN, King’s College London, and Saeed Mehrabi from the School of Informatics and Computing, Indiana University, Indianapolis, IN, USA, who offered their continuous support in the implementation of this task.Author contributions: The authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.Research funding: None declared.Employment or leadership: None declared.Honorarium: None declared.Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.Appendix ATable A1:Example of single annotated i2b2 report.Sentences (report file)Chest CT scan was negative for pulmonary embolism but positive for consolidationAssertion annotationc=“pulmonary embolism” 21:6 21:7 | | t= “problem” | | a=“absent”(Annotated file)c=“consolidation” 21:11 21:11 | | t=“problem” | | a=“present”Table A2:A sample input data.ConceptsSentencesMassNo mass or vegetation is seen on the mitral valvePericardial effusionThere is no pericardial effusionEpileptiform featuresNo epileptiform features were seenInfectionCXR, LP, UA and abdominal CT showed no sign of infectionOrthostaticShe was not orthostaticA headacheHe did not complain about a headacheTable A3:Evaluation of the algorithm’s output using 2*2 table.Predicted outputTrue (negated)False (affirmed)Manual annotations True (negated)True positive (TP)False negative (FN) False (affirmed)False positive (FP)True negative (TN)Figure A1:A simple surface-based approach of ConText algorithm.References1.Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc 2012;18:544–51.NadkarniPMOhno-MachadoLChapmanWWNatural language processing: an introductionJ Am Med Inform Assoc201218544512.Koopman B, Bruza P, Sitbon L, Lawley M. Analysis of the effect of negation on information retrieval of medical data. In: Proc 15th Australas Doc Comput Symp 2010:89–92.KoopmanBBruzaPSitbonLLawleyMAnalysis of the effect of negation on information retrieval of medical dataProc 15th Australas Doc Comput Symp201089923.Scuba W, Tharp M, Mowery D, Tseytlin E, Liu Y, Drews FA, et al. Knowledge author: facilitating user-driven, domain content development to support clinical information extraction. J Biomed Semant 2016;7:42.ScubaWTharpMMoweryDTseytlinELiuYDrewsFAKnowledge author: facilitating user-driven, domain content development to support clinical information extractionJ Biomed Semant20167424.Garla V, Re V Lo, Dorey-Stein Z, Kidwai F, Scotch M, Womack J, et al. The Yale cTAKES extensions for document classification: architecture and application. J Am Med Inform Assoc 2011;18:614–20.GarlaVReV LoDorey-SteinZKidwaiFScotchMWomackJThe Yale cTAKES extensions for document classification: architecture and applicationJ Am Med Inform Assoc201118614205.Mitchell KJ, Becich MJ, Berman JJ, Chapman WW, Gilbertson J, Gupta D, et al. Implementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reports. Stud Health Technol Inform 2004;107:663–7.MitchellKJBecichMJBermanJJChapmanWWGilbertsonJGuptaDImplementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reportsStud Health Technol Inform200410766376.Clark C, Aberdeen J, Coarr M, Tresner-kirsch D, Wellner B, Yeh A, et al. Determining assertion status for medical problems in clinical records. McLean, VA: Mitre Corporation, 2011:2–6.ClarkCAberdeenJCoarrMTresner-kirschDWellnerBYehADetermining assertion status for medical problems in clinical recordsMcLean, VAMitre Corporation2011267.Ou Y, Patrick J. Automatic negation detection in narrative pathology reports. Artif Intell Med 2015;64:41–50.OuYPatrickJAutomatic negation detection in narrative pathology reportsArtif Intell Med20156441508.Jiang M, Chen Y, Liu M, Rosenbloom ST, Mani S, Denny JC, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inf Assoc 2011;18:601–6.JiangMChenYLiuMRosenbloomSTManiSDennyJCA study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summariesJ Am Med Inf Assoc20111860169.Clark C, Aberdeen J, Coarr M, Tresner-Kirsch D, Wellner B, Yeh A, et al. MITRE system for clinical assertion status classification. J Am Med Inform Assoc 2011;18:563–7.ClarkCAberdeenJCoarrMTresner-KirschDWellnerBYehAMITRE system for clinical assertion status classificationJ Am Med Inform Assoc201118563710.Minard A-L, Ligozat A-L, Ben Abacha A, Bernhard D, Cartoni B, Deléger L, et al. Hybrid methods for improving information access in clinical documents: concept, assertion, and relation identification. J Am Med Inform Assoc 2011;18:588–93.MinardA-LLigozatA-LBen AbachaABernhardDCartoniBDelégerLHybrid methods for improving information access in clinical documents: concept, assertion, and relation identificationJ Am Med Inform Assoc2011185889311.Ballesteros M, Francisco V, Díaz A, Herrera J, Gervás P. Inferring the scope of negation in biomedical documents. Lect Notes Comput Sci 2012;7181 LNCS:363–75.BallesterosMFranciscoVDíazAHerreraJGervásPInferring the scope of negation in biomedical documentsLect Notes Comput Sci20127181 LNCS3637512.Chapman WW, Dowling JN, Wagner MM. Fever detection from free-text clinical records for biosurveillance. J Biomed Inform 2004;37:120–7.ChapmanWWDowlingJNWagnerMMFever detection from free-text clinical records for biosurveillanceJ Biomed Inform200437120713.Sanchez-Graillet O, Poesio M. Negation of protein-protein interactions: analysis and extraction. Bioinformatics 2007;23:424–32.Sanchez-GrailletOPoesioMNegation of protein-protein interactions: analysis and extractionBioinformatics2007234243214.Morante R. Descriptive analysis of negation cues in biomedical texts. Statistics 2009;1429–36.MoranteRDescriptive analysis of negation cues in biomedical textsStatistics200914293615.Horn LR. Natural history of negation. J Pragmat 1989;16: 269–80.HornLRNatural history of negationJ Pragmat1989162698016.Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.ChapmanWWBridewellWHanburyPCooperGFBuchananBGA simple algorithm for identifying negated findings and diseases in discharge summariesJ Biomed Inform2001343011017.Aronow DB, Fangfang F, Croft WB. Ad hoc classification of radiology reports. J Am Med Inform Assoc 1999;6:393–411.AronowDBFangfangFCroftWBAd hoc classification of radiology reportsJ Am Med Inform Assoc1999639341118.Mutalik PG, Deshpande A, Nadkarni PM. Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc 2001;8:598–609.MutalikPGDeshpandeANadkarniPMUse of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLSJ Am Med Inform Assoc2001859860919.Gindl S, Kaiser K, Miksch S. Syntactical negation detection in clinical practice guidelines. Stud Health Technol Inform 2008;136:187–92.GindlSKaiserKMikschSSyntactical negation detection in clinical practice guidelinesStud Health Technol Inform20081361879220.Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 2009;42:839–51.HarkemaHDowlingJNThornbladeTChapmanWWConText: an algorithm for determining negation, experiencer, and temporal status from clinical reportsJ Biomed Inform2009428395121.Chapman BE, Lee S, Kang HP, Chapman WW. Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm. J Biomed Inform 2011;44:728–37.ChapmanBELeeSKangHPChapmanWWDocument-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithmJ Biomed Inform2011447283722.Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, et al. DEEPEN: a negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform 2015;54:213–9.MehrabiSKrishnanASohnSRochAMSchmidtHKestersonJDEEPEN: a negation detection system for clinical text incorporating dependency relation into NegExJ Biomed Inform201554213923.Huang Y, Lowe H. A novel hybrid approach to automated negation detection in clinical radiology reports. J Am Med Inform 2007;304–11.HuangYLoweHA novel hybrid approach to automated negation detection in clinical radiology reportsJ Am Med Inform20073041124.Zhu Q, Li J, Wang H. A unified framework for scope learning via simplified shallow semantic parsing. In: EMNLP 2010 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing 2010:714–24.ZhuQLiJWangHA unified framework for scope learning via simplified shallow semantic parsingEMNLP 2010 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing20107142425.Sohn S, Wu S, Chute CG. Dependency parser-based negation detection in clinical narratives. AMIA Jt Summits Transl Sci Proc AMIA Summit Transl Sci 2012;2012:1–8.SohnSWuSChuteCGDependency parser-based negation detection in clinical narrativesAMIA Jt Summits Transl Sci Proc AMIA Summit Transl Sci201220121826.Gkotsis G, Velupillai S, Oellrich A, Dean H, Liakata M, Dutta R. Don’t let notes be misunderstood: a negation detection method for assessing risk of suicide in mental health records. In: Proc 3rd Work Comput Linguist Clin Psychol Linguist Signal Clin Real 2016:95–105.GkotsisGVelupillaiSOellrichADeanHLiakataMDuttaRDon’t let notes be misunderstood: a negation detection method for assessing risk of suicide in mental health recordsProc 3rd Work Comput Linguist Clin Psychol Linguist Signal Clin Real20169510527.Lapponi E, Read J, Øvrelid L. Representing and resolving negation for sentiment analysis. In: Proc 12th IEEE Int Conf Data Min Work ICDMW 2012:687–92.LapponiEReadJØvrelidLRepresenting and resolving negation for sentiment analysisProc 12th IEEE Int Conf Data Min Work ICDMW20126879228.Shivade C, de Marneffe MC, Fosler-Lussier E, Lai AM. Extending NegEx with kernel methods for negation detection in clinical text. In: Proc Work Extra-Propositional Asp Mean Comput Semant NAACL 2015:41–6.ShivadeCde MarneffeMCFosler-LussierELaiAMExtending NegEx with kernel methods for negation detection in clinical textProc Work Extra-Propositional Asp Mean Comput Semant NAACL201541629.Kang T, Zhang S, Xu N, Wen D, Zhang X, Lei J. Detecting negation and scope in Chinese clinical notes using character and word embedding. Comput Methods Programs Biomed 2017;140:53–9.KangTZhangSXuNWenDZhangXLeiJDetecting negation and scope in Chinese clinical notes using character and word embeddingComput Methods Programs Biomed201714053930.Goryachev S, Sordo M, Zeng QT, Ngo L. Implementation and evaluation of four different methods of negation detection. Boston, MA: DSG, 2006.GoryachevSSordoMZengQTNgoLImplementation and evaluation of four different methods of negation detectionBoston, MADSG200631.Tanushi H, Dalianis H, Duneld M, Kvist M, Skeppstedt M, Velupillai S. Negation scope delimitation in clinical text using three approaches: NegEx, PyConTextNLP and SynNeg. In: Proc 19th Nord Conf Comput Linguist (NoDaLiDa 2013) 2013;1: 387–97.TanushiHDalianisHDuneldMKvistMSkeppstedtMVelupillaiSNegation scope delimitation in clinical text using three approaches: NegEx, PyConTextNLP and SynNegProc 19th Nord Conf Comput Linguist (NoDaLiDa 2013)201313879732.Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, et al. Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PLoS One 2014;9:e112774.WuSMillerTMasanzJCoarrMHalgrimSCarrellDNegation’s not solved: generalizability versus optimizability in clinical natural language processingPLoS One20149e11277433.Uzuner O, South BR, Shen S, DuVall SL, Uzuner Ö, South BR, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.UzunerOSouthBRShenSDuVallSLUzunerÖSouthBR2010 i2b2/VA challenge on concepts, assertions, and relations in clinical textJ Am Med Inform Assoc201118552634.Mowery DL, Chapman BE, Conway M, South BR, Madden E, Keyhani S, et al. Extracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysis. J Biomed Semantics 2016;7:26.MoweryDLChapmanBEConwayMSouthBRMaddenEKeyhaniSExtracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysisJ Biomed Semantics201672635.Chapman BE, Mowery DL, Narasimhan E, Patel N, Chapman WW, Heilbrun ME. Assessing the feasibility of an automated suggestion system for communicating critical findings from chest radiology reports to referring physicians. In: Proc 15th Work Biomed Nat Lang Process 2016:181–5.ChapmanBEMoweryDLNarasimhanEPatelNChapmanWWHeilbrunMEAssessing the feasibility of an automated suggestion system for communicating critical findings from chest radiology reports to referring physiciansProc 15th Work Biomed Nat Lang Process2016181536.Bruha I, Famili A. Postprocessing in machine learning and data mining. ACM SIGKDD Explor Newslett 2000;2:110–4.BruhaIFamiliAPostprocessing in machine learning and data miningACM SIGKDD Explor Newslett200021104 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bio-Algorithms and Med-Systems de Gruyter

Evaluation of lexicon- and syntax-based negation detection algorithms using clinical text data

Bio-Algorithms and Med-Systems , Volume 13 (4): 13 – Dec 20, 2017

Loading next page...
 
/lp/de-gruyter/evaluation-of-lexicon-and-syntax-based-negation-detection-algorithms-Kv5WoP0YC7

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
de Gruyter
Copyright
©2017 Walter de Gruyter GmbH, Berlin/Boston
ISSN
1896-530X
eISSN
1896-530X
DOI
10.1515/bams-2017-0016
Publisher site
See Article on Publisher Site

Abstract

IntroductionText is an important data type in any electronic health record (EHR) system and can be used in various files such as patient medical history, discharge summaries, radiology reports and laboratory test results. Files of this nature are unstructured text data that are difficult for humans to manually analyze. Therefore, the natural language processing (NLP) approach was designed based on the principle of analyzing the linguistic concepts of each file using various methods, which are classified into two levels, low and high [1]. Low-level tasks include tokenization, sentence splitting, part-of-speech tagging (noun, verb, adjective etc.), morphological analysis, syntactic parsing, coreference resolution and so on. High-level tasks include spelling/grammatical error identifications, named entity recognition (NER), word sense disambiguation, negation and uncertainty identification, relationship extraction, temporal inferences/relationship extraction and information extraction. Chapman et al. found that nearly half of the clinical concepts found in text files were negated, such as “patient denies chest pain”. Because of this, negation detection (ND) is one of the significant strategies used to denote whether a given concept is absent or present. NegEx is a simple, regular-expression-based ND algorithm. Today, the ND approach has been applied in a wide range of applications, including information retrieval systems [2], [3], cTAKES [4], GATE (general architecture for text engineering) [5], MITRE system [6] etc.In 2015, Ou and Patrick [7] classified the ND approach into three subtasks: lexicon, syntax and machine learning. First, lexicon-based algorithms follow a rule-based approach, relying on trigger terms and termination clues. Second, syntax-based algorithms also follow a rule-based approach, in which the number of rules is presented and the results are based on the dependency from the Stanford Dependency Parser (SDP) output. Third, machine learning-based algorithms mostly depend on using one of the classifier methods as a support vector machine. Although many studies of negation have been based on machine learning [8], [9], [10], their sources are not free to access. Therefore, the aim of this study is to investigate only those negation methods based on lexicon and syntax. Annotated data are very important in this research to measure the accuracy of the tested ND systems. In the medical field, very little annotated data are available due to patient privacy and confidentiality requirements. In 2010, i2b2 (Informatics for Integrating Biology and the Bedside) and VA Salt Lake City Health Care System manually annotated the patient reports from three institutions, which was accessed by signing a data use agreement. In this paper, the 2010 i2b2/VA data set is used as input data to evaluate the selected ND methods (further described in the Data Sources section).In addition, the performance of the five chosen ND algorithms can be analyzed using various statistical methods. To implement this work, a simple desktop application is first developed for the preprocessing tasks, which converts the unstructured data into structured data. Second, we modify each ND algorithm to accept specified input data. Third, the selected ND algorithms will be executed and evaluated on the 2010 i2b2 Clinical NLP Challenge data. Later, the post-processing step is conducted manually. To our knowledge, no previous research has analyzed the problem of these two ND approaches, especially lexicon- and syntax-based approaches.Review of literatureIn NLP, the ND method has been widely applied in biomedical research, particularly in unstructured text data [11], [12], [13], [14]. According to Eleatic philosophers, the early Buddhists in India had introduced the first explorations of negative concepts in ontology [15]. The Greek philosopher and scientist Aristotle, in his theory of negation, is rooted it in shifting the issue from the domain of ontology to logic and language. Based on the work of Ou and Patrick [7] in 2015, the ND algorithms are classified into three subtasks: lexicon, syntax and machine learning. Although the majority of papers on the subject have been based on the machine learning approach [8], [9], [10], these resources are not available for public access, and therefore, the machine learning-based approach is not considered in this work.Lexicon-based methodsRule-based techniques are one of the earliest, and still widely used, approaches among the clinical NLP challenges. To deal with these challenges, a different lexicon-based ND algorithm could be discovered similar to NegEx [16], NegExpander [17], NegFinder [18] and NegHunter [19]. NegEx is the first open-source algorithm for the regular expression-based approach. Recently, this algorithm has been successfully extended into applications such as ConText [20], pyConTextNLP [21], DEEPEN [22], cTAKES [4] and others. Chapman et al. tested the NegEx algorithm using patient discharge summaries as input data and found that the algorithm sensitivity was 78% and the specificity was 95%. The ConText algorithm was evaluated in the development sets of 120 reports (F-measure 97% for six-token window and 98% for end-of-sentence) and the test sets of 120 reports (F-measure 86%). Chapman et al. developed peFinder, a new framework that uses the module pyConTextNLP. This was designed for the classification of CT pulmonary angiography reports.Syntax-based methodsThe syntactic parser is a language processing method, different from lexicon-based ND methods. The challenge in lexicon-based ND methods is clearly described in this example, “there is no evidence of cervical lymph node enlargement”. Here, “no” is the negation signal used to detect that the concept “cervical lymph node enlargement” was negated. However, the test results of “cervical lymph node” were not considered negated based on discussions with physicians. Therefore, Hang and Lowe [23] intended to use a grammatical parser on the regular expression-based negation algorithm. In a biological database, the negative information about protein-protein interactions had been carried out using a full-dependency parser and semantic relations [13]. Zhu et al. [24] proposed a semantic parsing approach to BioScope corpus. A simple algorithm called dependency parser-based negation (DepNeg) showed that their F-score of 83% is slightly higher than the cTAKES negation module (82%) [25]. Mehrabi et al. [22] have developed a negation algorithm called DEEPEN using a transition-based dependency parser. They have tested the proposed algorithm on two different data sets, and the results showed that precision and recall were approximately 89–97% and 74–96%, respectively. In 2016, Gkotsis et al. [26] introduced a stand-alone negation tool using the probabilistic context-free grammar (PCFG), in which 6000 annotated sentences of mental health records were inputted in the training phase. Their results showed precision of 89% and recall of 95%. Over a decade, ND techniques were combined with other applications such as GATE [5], sentiment analysis [27], greedy algorithms [11], kernel methods [28] etc. In 2017, Kang et al. [29] reported that character and word embedding are more effective in identifying negations and tested this on Chinese clinical notes, which resulted in an F-score of 99%.Evaluation studies of ND methodsSince the comparison of negation approaches have been implemented, few studies have been reported. Four different ND methods were evaluated by Goryachev et al. [30]. In a 2013 report by Tanushi et al. [31], three different ND algorithms (NegEx, PyConTextNLP and SynNeg) had been analyzed among Swedish clinical texts. It was concluded that the syntax-based approach is best when dealing with longer/complex or shorter sentences. At last, the comparisons given in [32] are comprehensive, which point out that the ND method is still not generalized.Materials and methodsThis work describes two different types of negation algorithms, lexicon-based (or regular expression-based) and syntax-based. This section includes the main part of the work such as the database and statistical methods used and the five ND algorithms tested.Data sourcesThe human annotated clinical data set “2010 i2b2/VA NLP challenge” was developed by the i2b2 and Veterans Affairs (VA) Salt Lake City Health Care System [33]. This data set includes patient discharge summaries and progress notes, which are the input data used for this research. There are 73 discharge summaries collected from Beth Israel Deaconess Medical Center, 97 from Partners HealthCare and 256 progress notes from University of Pittsburgh Medical Center. In the 2010 i2b2/VA workshop, clinical NLP challenges are focused on three tasks: (1) extracting medical concepts and classifying them into problems, tests and treatments, (2) annotating these assertions and (3) finding the relation between medical problems, tests and treatments. In the end, there are 19,665 medical concepts labeled “problem”, 13,833 labeled “test” and 14,188 labeled “treatment”.In addition, there are four types of files used. One is the report file, which refers to individual patient data. Other files are concepts, assertion and relation, which shows the manually annotated data (see Table A1 or Table 2). The assertion annotation should be presented on the concepts grouped by problems. The assertion results will be divided into five different categories: present, absent, possible, conditional and hypothetical. Of these, we consider only two assertions: “absent” and “present”. In this study, the preprocessing step is to extract concepts, sentences and their assertion results (see sample preprocessed data in Table A2). After preprocessing, Table 1 is created using “absent” and “present” assertions from the 2010 i2b2/VA challenge data set.Table 1:Data set description.Data set descriptionsAbsentPresentTotal number of sentences34841958Average number of sentences20.4911.51Total number of words50,18351,816Average number of words per report295304Average number of words per sentence14.4026.46Lexicon-based ND algorithmsNegExThis has a simple regular expression-based ND algorithm and was introduced by Chapman et al. in 2001 [16]. This algorithm includes 281 negation phrases that are classified as conjunctions (joins two phrases together), pseudo-trigger terms, pre-negative trigger terms, post-negative trigger terms and pre- and post-negative trigger terms. The source code for NegEx was freely distributed by the author. The older version of NegEx did not use the possible conjunctions phrases, but the latest version, Apache 2.0, is upgraded with the negation phrase conjunctions.ConTextThis algorithm is extended from the NegEx algorithm by adding two subjects: temporality (recent, historical and hypothetical) and experiencer (patient and other) [20]. This is a lightweight desktop application. It can be used by a non-programmer and applied to different types of clinical reports (see Appendix Figure A1). In addition, this application can perform an information retrieval task by searching the trigger terms and finding whether it is preceded or followed by the indexed clinical findings. In this work, there are 143 negated, 10 historical, 11 hypothetical and 26 other trigger terms. There are also pseudo-triggers, including 17 for negated (e.g. “no increase”, “not cause”), 17 for historical (e.g. “social history”, “poor history”), 4 for hypothetical (e.g. “if negative”, “know if”) and 18 for other (e.g. “by her husband”, “by his brother”). Finally, the 12 termination terms include presentation, patient, because, diagnosis, ED, etiology, recent, remain, consistent, which, and, and but. The default value of the negation property is affirmed, default temporality is recent, and the default experiencer is the patient. The two main steps in the ConText algorithm are defined as follows:Step 1: Mark up all trigger terms, pseudo-trigger terms, and termination terms in the sentence.Step 2: Iterate through the trigger terms from left to right,If the trigger term is a pseudo-trigger term, skip to the next trigger term.Otherwise, it determines the scope of the trigger term and finds out whether the given concepts are present within the scope of the trigger term. If they are present, then the output is negated, if it they are not present, the output is affirmed.Although the ConText algorithm has two additionally proposed contextual properties such as temporality and experiencer, these were not considered, as this study only focused on negation property. This algorithm was written in the Java language and is hosted at the following web address: http://code.google.com/p/negex/.pyConTextNLPThe source code of the ConText algorithm was converted from Java to Python and called pyConText. To avoid name conflicts within the Python package index, pyConText was renamed pyConTextNLP. In this package, a number of items are used, where each item contains the following elements: literal, category, regular expression and rule. Literal is a lexical phrase, for example, “can rule out”, “cannot be excluded” etc. The category is what the item refers to, for example, a finding or an uncertainty term. The categories from the previous implementations of ConText included finding, negated existence term, conjunction, pseudo-negated existence term and temporality term. In pyConText, categories are user-defined, and there are no limits to the number of categories that can be used. Regular expression is optional, which captures the literal tags in the text. If a regular expression is not provided in the definition of the item, the literal category is used to directly generate the regular expression. The rule is also optional, which directs how to generate tagObjects from the item using the states forward, backward and bidirectional. The regular expression is used to identify any literal phrases in the sentence. When a literal phrase is identified, a tagObject is created. More details on this tagObject can be found in [21]. Today, several studies have focused on this algorithm for developing a web-based tool [3], extracting information and classifying the negative cases from the radiology reports [34], [35]. For doing further work, the authors freely distributed this package through the following web address: https://pypi.python.org/pypi/pyConTextNLP.Syntax-based ND algorithmsDEEPENThis stands for DEpEndency ParsEr Negation. This algorithm was proposed by Mehrabi et al. [22]. The SDP is incorporated into this algorithm, which annotates the concepts that are considered negated by the NegEx algorithm [22]. Note that when a concept is affirmed by the NegEx algorithm, a concept is not negated by the DEEPEN algorithm. The SDP parser uses the transition-based dependency relation between negation words and concepts. SDP encompasses 53 grammatical relations, for instance determiner, infinitival modifier etc. The relations are found for each word. In addition, SDP has a set of rules including conjunction and rule, preposition within rule, preposition with/in/within rule, nominal subject rule and suggest rule. The SDP output includes dependency relation, governor term and dependent term. Dependency relation is the grammatical relation between the dependent and the governor terms. The governor term refers to the word in the sentence that the dependency relation reports, and the dependent term is the word that is dependent of the governor term. For instance, in the sentence “Based on this, he required no operative intervention for his pseudocyst”, the output would be as follows: det(intervention-9, no-7). In this example, “det” is the dependency relation, “intervention” is the governor term, and “no” is the dependent term. The production chain concatenates sample sentences such as “No evidence of dilatation”. In every sentence, a production chain is generated that is composed of three levels of tokens: the governor of negation term (“evidence” in determiner; evidence-2, No-1), the dependents of first-level tokens (“of” in prepositional modifier; evidence-2, of-3) and dependents of second-level tokens (“dilatation” object of a proposition; of-3, dilatation-4). If a concept is found within a production chain, it is then confirmed that the concept is negation; otherwise, it is considered affirmation. Although the DEEPEN algorithm is used to decrease incorrect negations found when using NegEx, it cannot access the affirmed results as NegEx does and its performance would be lower on ungrammatical sentences. The DEEPEN source code was developed in Java and is free for access for research purposes at http://svn.code.sf.net/p/ohnlp/code/trunk/DEEPEN.Negation resolution (NR) algorithmThis is a new automated negation method integrating PCFG parsers, applied specifically to mental health records and used to determine whether a suicide risk is affirmed or negated [26]. A PCFG parser was used with the Stanford Core NLP toolkit, which generates a constituency parse tree for each sentence. Every sentence must be preprocessed before starting the NR algorithm. A syntactic representation (i.e. parse tree) and the token of the target concept (e.g. “suicide”) can also be carried out in the sentence structure. The basic principle of an NR algorithm is to reduce the problem of identifying negation scope in the given sentence. Gkotsis et al. [26] included 15 negation words to find the negation scope. Presently, we provide 128 additional negation cues and execute the ND tasks such as pruning, identification of subordinate clauses, identification of negation governing the target-node and NR. This algorithm was written in the Python programming language and is publicly available in the author’s repository (https://github.com/gkotsis/negation-detection). The difference between the DEEPEN and NR algorithms is that DEEPEN depends on NegEx and uses the dependency parser, whereas the NR approach is a fully standalone application based on the constituency parser. Moreover, the NR algorithm is now works only in a UNIX environment. Therefore, it does not currently work on Windows computers.CustomizationTable 2 shows an example of a patient report and its annotated files for concepts, assertions and relations. The present study is only interested in terms of annotated assertions that should be thoroughly investigated.Table 2:Example of i2b2-annotated files.File namesExamplesReportThe patient is stable without oxygen requirementThere was no evidence of acute rib factors on her chest X-rayThe patient has a history of neurogenic bladder, and therefore, she required Foley catheterizationConceptsc=“an oxygen requirement” 1:5 1:7 | | t=“problem”c=“her chest X-ray” 2:9 2:11 | | t=“test”c=“a Foley catheterization” 3:14 3:16 | | t= “treatment”Assertionsc=“an oxygen requirement” 1:5 1:7 | | t=“problem” | | a=“absent”Relationsc=“her chest X-ray” 2:9 2:11 | | r=“TeRP” | | c=“acute rib factors” 2:5 2:7c=“a foley catheterization” 3:14 3:16 | | r=“TrAP” | | c=“neurogenic bladder” 3:6 3:7The first problem is how to extract the concepts, sentences and assertions from two separate documents like patient reports and assertions. Manually doing this work is very difficult. To solve this problem, this study proposes a desktop application that can automatically extract the concepts along with their sentences and assertions. This output will be stored in a single tab-delimited file. The second problem is how to run the above ND algorithms, when the input is in a new format (tab-delimited file). For example, the NegEx algorithm is applied to a single input file, whereas ConText algorithm receives input from the text box (see Appendix Figure A1). Therefore, some modification is required to process each ND algorithm, which will be described later in this section.PreprocessingThis follows four steps (Figure 1). First, a new application is developed. The application can automatically read all files one by one within a folder containing all assertion files. The concepts and their sentences and assertions are carried out using the regular expression. In this case, concepts are extracted by the first double quotes, in which sentences are extracted by the line number, and assertions are extracted by the third double quotes. These outputs are written in a single tab-delimited file. Second, we manually filter the assertion status belonging to absent and present. Third, the manual assertions are removed. Fourth, we filter the sentences, which are included in selected negation words.Figure 1:Architecture of proposed evaluation of ND algorithms.Modification of selected ND algorithmsThe preprocessed data are of two attributes, clinical concepts and their sentences. Now, the input format should be changed to each ND algorithm. Table 3 shows a number of modifications being carried out on each algorithm.Table 3:Modifications of the five algorithms.AlgorithmsModificationsNegExNo changes are madeConTextFirst, the input format is corrected for preprocessed data. Second, the information retrieval process is stopped, which allows for a reduced time when finding negation scope within a sentence. Finally, we set the output to be stored in a comma separated file for further evaluationpyConTextNLPThis algorithm is updated by adding the following features:1. Handling the user-defined input data2. Target concepts are loaded into the item data3. To find given concepts that may present within the negation scope4. Writing the output in a comma separated fileDEEPENThis algorithm is improved by the following tasks:1. Get data from file instead of from pre-defined sentence2. The algorithm reads the given input file that contains concepts and sentences3. A production chain is generated which is composed of three levels of tokens4. If the concept is found within the production chain, the algorithm confirms that the sentence is negated. Otherwise, the sentence is affirmed5. Write the output in a comma separated fileNRThis algorithm is improved by the following tasks:1. Get data from file instead of from pre-defined sentence2. Previously, this algorithm was generated for the single concept of “suicide”. This algorithm is now modified to deal with more than one concept3. The final results of this algorithm are written in a comma-separated filePost-processingThis is fully a manual approach. After the execution of all ND algorithms, the user can perform some post-processing on the results. Data post-processing can be categorized into the following groups: knowledge filtering, interpretation and explanation, evaluation and knowledge integration [36]. In data post-processing, the evaluation results are taken by the statistical analysis, to be explained in the next section. This method is also useful in finding missing data, for example, if the algorithm generates null results. Another possibility is to view the noisy data, for example, if the algorithm generates unexpected results.WorkflowThe golden rule of workflow is “first organize, then computerize”. Hence, we first design our work in a more abstract way, without considering implementation, which is illustrated in Figure 1. The order of work can be described as follows: (i) preprocessing and setting the data in a common format, (ii) executing all ND algorithms, (iii) applying the post-processing tasks and (iv) evaluating individual system performance using statistical analysis. The first three steps are described in the Materials and Methods section. The statistical analysis is present in the next subsection.Statistical analysisThe statistical method determines the differences between the system output and the gold standard database. The results of the above-mentioned algorithms have to be proved by statistics. Individual system performance will be evaluated using statistical methods such as accuracy, precision, recall, F-measure, kappa and area under the curve (AUC). The confusion matrix is a 2∗2 table (see Table A3) that indicates both the predicted (column) and actual results (row). The results are classified in every ND algorithm as true positives (TP: the patients do not have the disease), true negatives (TN: the patients have the disease), false positives (FP: they actually they have the disease) and false negatives (FN: they actually do not have the disease). In this study, statistical significance was analyzed by the R software using two packages, caret and pROC.This study does not require ethical approval because the data set does not involve any vulnerable participants. Additionally, this data set has been collected freely from a registered online source by signing the data use agreement: https://www.i2b2.org/NLP/DataSets/Main.php.ResultsTypically, most negative clinical statements use six negation phrases: cannot, free, no, not, resolved and without. The same negation phrases are used in this research. Before preprocessing, the input data contained a total of 3484 absent sentences, in which 3048 sentences included the selected negation phrase. Now we assume that 3048 were responders and 436 were not. Likewise, of 1958 present sentences, 1470 sentences included the selected negation words. In this case, 1470 sentences were responders and 488 were not. All ND algorithms were executed using these two responder’s sentences and the results presented in Table 4.Table 4:Results of the five algorithms.AlgorithmsAbsentPresentAbsentPresentNoneAbsentPresentNoneNegEx13031745011213490ConText2507541026512050pyConTextNLP289115705349360DEEPEN2375673042010500NR2325143580110640720It is interesting to see that each individual system performed in the order of true-absent, false-absent, none-absent, true-present, false-present and none-present. We note from Table 4 that if the algorithm does not have any results for a particular input, the result is considered to be a “none” value. Here, none results are only produced by the NR algorithm when the target concept may be more than one word. Without considering these null results, the percentages of the NR algorithm were 94% of absent and 85% of present. Thus, the result is better than ConText (82% and 81%, respectively) and pyConTextNLP (94% and 63%, respectively). However, the performance of NR algorithm fails to work if a target concept contains more than one word.The pyConTextNLP algorithm generated results similar to the NR algorithm. Unexpectedly, the algorithm produced multiple results when the target concept was written more than once within a single sentence. Meanwhile, the ConText algorithm did not have a significant result due to the presence of some false-absent and false-present results. The percentage of the DEEPEN algorithm was 77% true absent and 71% true present. Finally, the test showed inadequate results from the NegEx algorithm. The results so far have been graphically displayed in Figure 2. The negation phrase “no” has occurred 2161 times in absent and 727 times in present, which is one of the greatest occurrences compared to other phrases. Table 5 shows the algorithm performance on each of the six negation phrases. Traditionally, negation algorithms have been tested on absent sentences. This work is alternatively executed within both absent and present sentences. The predicted results of each algorithm are labeled absent cases “A” and present cases “P”. Manual results are also indicated within the parentheses.Figure 2:Performance of algorithms.Table 5:Results of the ND algorithms between topmost negation-words.Trigger termsManualNegExConTextpyConTextDEEPENNRAPAPAPAPAPCannotA (13)310941129473P (30)255030228121813FreeA (47)2423291828192918176P (54)4862521242272723NoA (2161)9571204201814320629916914701550202P (727)55717022350425946814857931383NotA (235)1221139314223051795623256P (344)259852531914420010124327223ResolvedA (64)412316486222044177P (58)421645448101345511WithoutA (528)156372342186498304478132843P (257)1906711246691881191382041The highest average score for absent was 95% in pyConTextNLP, and the highest score for present was 80% in the ConText algorithm. However, it should be noted that a negation phrase can exactly match another phrase, such as “resolved completely” and “complete resolution of”.EvaluationAs mentioned in the above section, it is important to assess the quality of the five ND algorithms. First, we verified the algorithm results by the gold standard database, and we classified the results into TP, FP, TN and FN. Second, the evaluations were performed in terms of accuracy, precision, recall, F-measure, Cohen’s kappa and AUC. Table 6 summarizes the evaluation results. Note the total number of sentences (S), which included both present and absent sentences. The precision (P), recall (R) and F-measure (F) have separately calculated for absent (suffixes “_0”) and present (suffixes “_1”). In addition, Cohen’s kappa and AUC were calculated without dividing the results into two groups (present and absent). Additionally, the receiver operating characteristic (ROC) curve is a chart in which the TP rate is plotted on the y-axis, and the FP rate is plotted on the x-axis. The ROC curve is often used to find the relationship between sensitivity and specificity. The high value of F-score was (in bold) found in NR algorithm.Table 6:Evaluation results of the five algorithms.AlgorithmsSTPFPTNFNAP_0R_0F_0P_1R_1F_1KAUCNegEx451834911211303174536425347231619−367ConText45181205265250754182829868169746182pyConText45189365342891157849484896385736179DEEPEN45181050420237567375778481716654775NR3218664862151317878796918867767888Figure 3 shows the ROC curves, which were derived from our experimental results. It is interesting to note that the sensitivity of the NR algorithm is higher than that of the other four ND algorithms. The sensitivity of the DEEPEN, pyConTextNLP and ConText algorithms were slightly lower than expected, and there is certainly room for improvement.Figure 3:Results of ROC.Computational timeComputational time is an important metric to find the length of time that is required to process a computer program. Thus, we have calculated this time by milliseconds. Table 7 illustrates the total execution time for each methodology chosen for our research. The acronym PT stands for preprocessing time and PTND stands for the process time of ND algorithms. Note that the average time of a sentence has been taken in which execution time of single output (i.e. single sentence) was derived from the overall outputs (i.e. total number of sentences). It is easy to understand that the overall execution time is divided by the total number of sentences. The regular time consumption for preprocessing is 22 ms in absent sentences and in 46 ms in present sentences. The total turnaround time is calculated by the following formula:Table 7:Run time of the five algorithms (selected negation-words).AlgorithmsResultsPT (ms)PTND (ms)Total time (ms)Average timeNegExAbsent67,93542,245110,18036ConText67,935121,477189,41262pyConTextNLP67,93569168,62622DEEPEN67,93524,384,00024,451,9358022NR67,93549,360117,29548NegExPresent67,93520,37488,30960ConText67,93558,579126,51486pyConTextNLP67,93533068,26545DEEPEN67,93511,760,00011,827,9358046NR67,93515,00082,935111(1)Total time=PT+PTND.$${\text{Total time}} = {\text{PT}} + {\text{PTND}}{\text{.}}$$Later, the average time per sentence should be calculated through the following equation:(2)Average time=Total time/TNS.$${\text{Average time}} = {\text{Total time/TNS}}{\text{.}}$$The basic system requirement for this experimental work was based on the first four algorithms (NegEx, ConText, pyConTextNLP, DEEPEN), which fully depended on the Windows 32-bit operating system, and the last NR algorithm, which was executed on the Ubuntu 32-bit operating system. In Windows, the software applications used were NetBeans IDE 8.1 (NegEx, ConText, DEEPEN) and Python 2.7.11 (pyConTextNLP). Figure 4 shows the execution time of the ND algorithms for both absent and present assertions. According to this time calculation, the highest time consumption is shown in the DEEPEN algorithm.Figure 4:Execution time of ND algorithms.The time consumption of the NegEx algorithm is better than that of the ConText, DEEPEN and NR algorithms. Finally, this time analysis confirms that the pyConTextNLP is still superior to overall approaches. In syntax-based ND algorithms, the NR algorithm is acceptable in terms of both interpretability and prediction accuracy.DiscussionThe characteristics of lexicon- and syntax-based negation methods are analyzed in this study. Recently, various approaches have been proposed to solve the issues included with negation approaches. Although many studies have been published on the evaluation of different ND methods [7], [30], [32], most of them were not evaluated for the syntax-based negation algorithms, especially in regard to DEEPEN and NR. In this paper, these two methodologies are discussed on a large-scale. In our experimental results, pyConTextNLP and NR algorithms are better than others, as proven by their statistical and computational time scores. The best performance from the lexicon-based ND approaches is pyConTextNLP, which has two advantages: (1) the identification of negation scope is simple with the new forward and backwards options and (2) the execution time is much less than other similar approaches. The NR algorithm generated the parser outputs in less time. This is the main reason why the NR algorithm performed well in the syntax-based ND approach.A similar study was initially conducted by Goryachev et al. [30]. Based on their results, they suggested that the lexicon-based approach is the best; however, the lexicon-based approach is not 100% accurate in this study. Ou and Patrick [7] also compared the performance of the lexicon-, syntax- and machine learning-based approaches. They reported that the accuracy (92%) and κ (79%) of the lexicon method NegEx was higher than those of the machine learning-based methods. This is higher than our results in terms of accuracy (36%) and κ (−3%). In fact, we have executed the NegEx algorithm in more than 100 clinical notes, whereas they used only 100 clinical notes. According to Wu et al. [32], they had found that the F-measure of NegEx was 82% among 33,022 sentences in the 2010 i2b2/VA challenge data. Determining the assertion task in the same i2b2 test set, our F-measure results were measured as slightly lower than those of the machine learning-based approaches [8], [9], [10], [32]. However, our research could be a useful aid for further investigations into different negation methods. To the best of our knowledge, this is the first evaluation of the methodology chosen for our research.Error analysisTable 8 shows that there are errors between the two types of negation methods using the same sentences, as reported in the article [22]. Here, the two assertion results provided include P for present and N for absent. It is likely that most errors in the syntax-based approaches are based on poor parsing results. For example, if the parse trees contained errors, the algorithm did not work as expected. Therefore, more training for the parser is necessary, i.e. a domain-specific parser. In Table 6, the percentage of errors in the absent claims is as follows: NegEx 38%, ConText 11%, pyConTextNLP 3%, DEEPEN 14% and NR 9%. In the same order, the percentage of errors in present is 24%, 5%, 11%, 9% and 2%. The errors are explained with two types of structured sentences, simple and complex, and examples of the errors in simple sentences follows:Table 8:Summary of errors.SentencesGold standardNegExConTextpyConTextNLPDEEPENNRIf her pain should not have been resolved by that time, there is the possibility of repeating facet rhizotomyPPPPPPHowever, I suspect that her pain is not due to an underlying neurologic disorderPPPPPNShe denies any ear pain, sore throat, odynophagia, hemoptysis, shortness-of-breath, dyspnea on exertion, chest discomfort, anorexia, nausea, weight-loss, mass, adenopathy or painNPPNPNMolecular fragile-X results reveal no apparent PMR-1 gene abnormalityNPNNPNMrs. Jane Doe returns with no complaints worrisome for recurrent or metastatic oropharynx cancerNNNNP−She is not having any incontinence or suggestion of infection at this timeNPNNNNShe denies any blood in the stoolNNPNP−No feverNPNNNPNo history of diabetesNPNNNNNo pneumonia was suspectedNPPPN−History inconsistent with strokeNPNPPNHis dyspnea resolvedNNPPP−Elevated enzymes resolvedNNPNP−“Compared with previous CT dated 02-08-2010, there are no significant changes in the size of the lymph nodes”“1. No residual metabolically active lymph nodes in the neck and thorax”In the above first sentence, the contextual meaning is generally positive (i.e. no changes in the size of the lymph nodes), but which has been incorrectly identified as negated in NegEx, ConText and DEEPEN.Note that these three ND algorithms have made wrong decisions based on the pre-defined negation phrase “no significant”. In the second sentence, there was a typographical error of no space between numeric list “1” and the negation cue “no”. However, this is actually the negative statement, and there is no one best result. Seeing that, it is clear that typographical errors are also one of the challenges in clinical NLP research. Meanwhile, examples of errors in complex sentences are given below:“Highly FDG avid non necrotic non calcified lower cervical, mediastinal and periportal lymphadenopathy. There is no evidence of any primary lesion. Possibilities are lymphoma/sarcoidosis/tuberculosis. Biopsy of mediastinal lymph nodes suggested for further evaluation”“There is complete resolution of the cervical and supraclavicular lymph nodes, significant resolution of the mediastinal and left axillary lymphadenopathy and left upper lobe lung lesion”The first sentence above involves more than one clinical finding, and multiple sentences are included, which did not split in the preprocessing steps. Therefore, the negation results of the five ND algorithms are as follows: NegEx (absent), ConText (present), pyConTextNLP (present), DEEPEN (present) and NR (none). The second sentence involves the negation phrase “resolution”. This negation phrase was not added in any of the five ND algorithms. Therefore, the output of the five algorithms could not be correct for the second sentence. Hence, complex sentences should be avoided in the clinical reports. Otherwise, the ND algorithms should be improved with regard to complex sentences.LimitationsSome limitations are present in this work. First, this study has only focused on selected negation cues in two assertions. Second, the input database is small and the large body of work may have helped to verify this conclusion. Third, the preprocessing steps of this work are not generalizable. In fact, the preprocessing application was limited for another domain.ConclusionIn this paper, we evaluated five negation detection algorithms in terms of accuracy and computation time. The algorithms include NegEX, ConText, pyConTextNLP, DEEPEN and NR. The first three are lexicon-based and the last two are syntax-based. We developed a new application for preprocessing, which can automatically read all files one by one, in which the folder contains the collection of assertion files. We modified the five ND algorithms when the input is in a new format. The experimental results show that pyContextNLP and NR have better results than the others. However, despite their good performance in simple sentences, the results were not as successful in complex sentences. This is an important issue for future research. However, we hope that our research will be helpful in solving the difficulty of ND algorithms and encourage the validation of larger sample sizes. The ultimate aim of this evaluation addresses how we will design and develop a new negation algorithm in the future.AcknowledgmentsWe thank S. Shahul Hameed and Gokulalakshmi Elayaperumal from Sree Balaji Medical College and Hospital, Chennai, India, George Gkotsis from IoPPN, King’s College London, and Saeed Mehrabi from the School of Informatics and Computing, Indiana University, Indianapolis, IN, USA, who offered their continuous support in the implementation of this task.Author contributions: The authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.Research funding: None declared.Employment or leadership: None declared.Honorarium: None declared.Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.Appendix ATable A1:Example of single annotated i2b2 report.Sentences (report file)Chest CT scan was negative for pulmonary embolism but positive for consolidationAssertion annotationc=“pulmonary embolism” 21:6 21:7 | | t= “problem” | | a=“absent”(Annotated file)c=“consolidation” 21:11 21:11 | | t=“problem” | | a=“present”Table A2:A sample input data.ConceptsSentencesMassNo mass or vegetation is seen on the mitral valvePericardial effusionThere is no pericardial effusionEpileptiform featuresNo epileptiform features were seenInfectionCXR, LP, UA and abdominal CT showed no sign of infectionOrthostaticShe was not orthostaticA headacheHe did not complain about a headacheTable A3:Evaluation of the algorithm’s output using 2*2 table.Predicted outputTrue (negated)False (affirmed)Manual annotations True (negated)True positive (TP)False negative (FN) False (affirmed)False positive (FP)True negative (TN)Figure A1:A simple surface-based approach of ConText algorithm.References1.Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc 2012;18:544–51.NadkarniPMOhno-MachadoLChapmanWWNatural language processing: an introductionJ Am Med Inform Assoc201218544512.Koopman B, Bruza P, Sitbon L, Lawley M. Analysis of the effect of negation on information retrieval of medical data. In: Proc 15th Australas Doc Comput Symp 2010:89–92.KoopmanBBruzaPSitbonLLawleyMAnalysis of the effect of negation on information retrieval of medical dataProc 15th Australas Doc Comput Symp201089923.Scuba W, Tharp M, Mowery D, Tseytlin E, Liu Y, Drews FA, et al. Knowledge author: facilitating user-driven, domain content development to support clinical information extraction. J Biomed Semant 2016;7:42.ScubaWTharpMMoweryDTseytlinELiuYDrewsFAKnowledge author: facilitating user-driven, domain content development to support clinical information extractionJ Biomed Semant20167424.Garla V, Re V Lo, Dorey-Stein Z, Kidwai F, Scotch M, Womack J, et al. The Yale cTAKES extensions for document classification: architecture and application. J Am Med Inform Assoc 2011;18:614–20.GarlaVReV LoDorey-SteinZKidwaiFScotchMWomackJThe Yale cTAKES extensions for document classification: architecture and applicationJ Am Med Inform Assoc201118614205.Mitchell KJ, Becich MJ, Berman JJ, Chapman WW, Gilbertson J, Gupta D, et al. Implementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reports. Stud Health Technol Inform 2004;107:663–7.MitchellKJBecichMJBermanJJChapmanWWGilbertsonJGuptaDImplementation and evaluation of a negation tagger in a pipeline-based system for information extraction from pathology reportsStud Health Technol Inform200410766376.Clark C, Aberdeen J, Coarr M, Tresner-kirsch D, Wellner B, Yeh A, et al. Determining assertion status for medical problems in clinical records. McLean, VA: Mitre Corporation, 2011:2–6.ClarkCAberdeenJCoarrMTresner-kirschDWellnerBYehADetermining assertion status for medical problems in clinical recordsMcLean, VAMitre Corporation2011267.Ou Y, Patrick J. Automatic negation detection in narrative pathology reports. Artif Intell Med 2015;64:41–50.OuYPatrickJAutomatic negation detection in narrative pathology reportsArtif Intell Med20156441508.Jiang M, Chen Y, Liu M, Rosenbloom ST, Mani S, Denny JC, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inf Assoc 2011;18:601–6.JiangMChenYLiuMRosenbloomSTManiSDennyJCA study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summariesJ Am Med Inf Assoc20111860169.Clark C, Aberdeen J, Coarr M, Tresner-Kirsch D, Wellner B, Yeh A, et al. MITRE system for clinical assertion status classification. J Am Med Inform Assoc 2011;18:563–7.ClarkCAberdeenJCoarrMTresner-KirschDWellnerBYehAMITRE system for clinical assertion status classificationJ Am Med Inform Assoc201118563710.Minard A-L, Ligozat A-L, Ben Abacha A, Bernhard D, Cartoni B, Deléger L, et al. Hybrid methods for improving information access in clinical documents: concept, assertion, and relation identification. J Am Med Inform Assoc 2011;18:588–93.MinardA-LLigozatA-LBen AbachaABernhardDCartoniBDelégerLHybrid methods for improving information access in clinical documents: concept, assertion, and relation identificationJ Am Med Inform Assoc2011185889311.Ballesteros M, Francisco V, Díaz A, Herrera J, Gervás P. Inferring the scope of negation in biomedical documents. Lect Notes Comput Sci 2012;7181 LNCS:363–75.BallesterosMFranciscoVDíazAHerreraJGervásPInferring the scope of negation in biomedical documentsLect Notes Comput Sci20127181 LNCS3637512.Chapman WW, Dowling JN, Wagner MM. Fever detection from free-text clinical records for biosurveillance. J Biomed Inform 2004;37:120–7.ChapmanWWDowlingJNWagnerMMFever detection from free-text clinical records for biosurveillanceJ Biomed Inform200437120713.Sanchez-Graillet O, Poesio M. Negation of protein-protein interactions: analysis and extraction. Bioinformatics 2007;23:424–32.Sanchez-GrailletOPoesioMNegation of protein-protein interactions: analysis and extractionBioinformatics2007234243214.Morante R. Descriptive analysis of negation cues in biomedical texts. Statistics 2009;1429–36.MoranteRDescriptive analysis of negation cues in biomedical textsStatistics200914293615.Horn LR. Natural history of negation. J Pragmat 1989;16: 269–80.HornLRNatural history of negationJ Pragmat1989162698016.Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.ChapmanWWBridewellWHanburyPCooperGFBuchananBGA simple algorithm for identifying negated findings and diseases in discharge summariesJ Biomed Inform2001343011017.Aronow DB, Fangfang F, Croft WB. Ad hoc classification of radiology reports. J Am Med Inform Assoc 1999;6:393–411.AronowDBFangfangFCroftWBAd hoc classification of radiology reportsJ Am Med Inform Assoc1999639341118.Mutalik PG, Deshpande A, Nadkarni PM. Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc 2001;8:598–609.MutalikPGDeshpandeANadkarniPMUse of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLSJ Am Med Inform Assoc2001859860919.Gindl S, Kaiser K, Miksch S. Syntactical negation detection in clinical practice guidelines. Stud Health Technol Inform 2008;136:187–92.GindlSKaiserKMikschSSyntactical negation detection in clinical practice guidelinesStud Health Technol Inform20081361879220.Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform 2009;42:839–51.HarkemaHDowlingJNThornbladeTChapmanWWConText: an algorithm for determining negation, experiencer, and temporal status from clinical reportsJ Biomed Inform2009428395121.Chapman BE, Lee S, Kang HP, Chapman WW. Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm. J Biomed Inform 2011;44:728–37.ChapmanBELeeSKangHPChapmanWWDocument-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithmJ Biomed Inform2011447283722.Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, et al. DEEPEN: a negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform 2015;54:213–9.MehrabiSKrishnanASohnSRochAMSchmidtHKestersonJDEEPEN: a negation detection system for clinical text incorporating dependency relation into NegExJ Biomed Inform201554213923.Huang Y, Lowe H. A novel hybrid approach to automated negation detection in clinical radiology reports. J Am Med Inform 2007;304–11.HuangYLoweHA novel hybrid approach to automated negation detection in clinical radiology reportsJ Am Med Inform20073041124.Zhu Q, Li J, Wang H. A unified framework for scope learning via simplified shallow semantic parsing. In: EMNLP 2010 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing 2010:714–24.ZhuQLiJWangHA unified framework for scope learning via simplified shallow semantic parsingEMNLP 2010 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing20107142425.Sohn S, Wu S, Chute CG. Dependency parser-based negation detection in clinical narratives. AMIA Jt Summits Transl Sci Proc AMIA Summit Transl Sci 2012;2012:1–8.SohnSWuSChuteCGDependency parser-based negation detection in clinical narrativesAMIA Jt Summits Transl Sci Proc AMIA Summit Transl Sci201220121826.Gkotsis G, Velupillai S, Oellrich A, Dean H, Liakata M, Dutta R. Don’t let notes be misunderstood: a negation detection method for assessing risk of suicide in mental health records. In: Proc 3rd Work Comput Linguist Clin Psychol Linguist Signal Clin Real 2016:95–105.GkotsisGVelupillaiSOellrichADeanHLiakataMDuttaRDon’t let notes be misunderstood: a negation detection method for assessing risk of suicide in mental health recordsProc 3rd Work Comput Linguist Clin Psychol Linguist Signal Clin Real20169510527.Lapponi E, Read J, Øvrelid L. Representing and resolving negation for sentiment analysis. In: Proc 12th IEEE Int Conf Data Min Work ICDMW 2012:687–92.LapponiEReadJØvrelidLRepresenting and resolving negation for sentiment analysisProc 12th IEEE Int Conf Data Min Work ICDMW20126879228.Shivade C, de Marneffe MC, Fosler-Lussier E, Lai AM. Extending NegEx with kernel methods for negation detection in clinical text. In: Proc Work Extra-Propositional Asp Mean Comput Semant NAACL 2015:41–6.ShivadeCde MarneffeMCFosler-LussierELaiAMExtending NegEx with kernel methods for negation detection in clinical textProc Work Extra-Propositional Asp Mean Comput Semant NAACL201541629.Kang T, Zhang S, Xu N, Wen D, Zhang X, Lei J. Detecting negation and scope in Chinese clinical notes using character and word embedding. Comput Methods Programs Biomed 2017;140:53–9.KangTZhangSXuNWenDZhangXLeiJDetecting negation and scope in Chinese clinical notes using character and word embeddingComput Methods Programs Biomed201714053930.Goryachev S, Sordo M, Zeng QT, Ngo L. Implementation and evaluation of four different methods of negation detection. Boston, MA: DSG, 2006.GoryachevSSordoMZengQTNgoLImplementation and evaluation of four different methods of negation detectionBoston, MADSG200631.Tanushi H, Dalianis H, Duneld M, Kvist M, Skeppstedt M, Velupillai S. Negation scope delimitation in clinical text using three approaches: NegEx, PyConTextNLP and SynNeg. In: Proc 19th Nord Conf Comput Linguist (NoDaLiDa 2013) 2013;1: 387–97.TanushiHDalianisHDuneldMKvistMSkeppstedtMVelupillaiSNegation scope delimitation in clinical text using three approaches: NegEx, PyConTextNLP and SynNegProc 19th Nord Conf Comput Linguist (NoDaLiDa 2013)201313879732.Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, et al. Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PLoS One 2014;9:e112774.WuSMillerTMasanzJCoarrMHalgrimSCarrellDNegation’s not solved: generalizability versus optimizability in clinical natural language processingPLoS One20149e11277433.Uzuner O, South BR, Shen S, DuVall SL, Uzuner Ö, South BR, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.UzunerOSouthBRShenSDuVallSLUzunerÖSouthBR2010 i2b2/VA challenge on concepts, assertions, and relations in clinical textJ Am Med Inform Assoc201118552634.Mowery DL, Chapman BE, Conway M, South BR, Madden E, Keyhani S, et al. Extracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysis. J Biomed Semantics 2016;7:26.MoweryDLChapmanBEConwayMSouthBRMaddenEKeyhaniSExtracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysisJ Biomed Semantics201672635.Chapman BE, Mowery DL, Narasimhan E, Patel N, Chapman WW, Heilbrun ME. Assessing the feasibility of an automated suggestion system for communicating critical findings from chest radiology reports to referring physicians. In: Proc 15th Work Biomed Nat Lang Process 2016:181–5.ChapmanBEMoweryDLNarasimhanEPatelNChapmanWWHeilbrunMEAssessing the feasibility of an automated suggestion system for communicating critical findings from chest radiology reports to referring physiciansProc 15th Work Biomed Nat Lang Process2016181536.Bruha I, Famili A. Postprocessing in machine learning and data mining. ACM SIGKDD Explor Newslett 2000;2:110–4.BruhaIFamiliAPostprocessing in machine learning and data miningACM SIGKDD Explor Newslett200021104

Journal

Bio-Algorithms and Med-Systemsde Gruyter

Published: Dec 20, 2017

References