
Clinical Named Entity Recognition from Chinese Electronic Medical Records Based on Deep Learning Pretraining

Hindawi Journal of Healthcare Engineering, Volume 2020, Article ID 8829219, 8 pages. https://doi.org/10.1155/2020/8829219

Research Article

Clinical Named Entity Recognition from Chinese Electronic Medical Records Based on Deep Learning Pretraining

Lejun Gong (1,2), Zhifei Zhang (1), and Shiqi Chen (1)

1 Jiangsu Key Lab of Big Data Security & Intelligent Processing, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 Zhejiang Engineering Research Center of Intelligent Medicine, Wenzhou 325035, China

Correspondence should be addressed to Lejun Gong; glj98226@163.com

Received 14 August 2020; Revised 26 October 2020; Accepted 2 November 2020; Published 24 November 2020

Academic Editor: Jiafeng Yao

Copyright © 2020 Lejun Gong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background. Clinical named entity recognition is the basic task in mining electronic medical record text. It faces several challenges arising from the language features of Chinese electronic medical records: many compound entities, frequently missing sentence components, and unclear entity boundaries. Moreover, annotated corpora of Chinese electronic medical records are difficult to obtain. Methods. Aiming at these characteristics, this study proposes a Chinese clinical entity recognition model based on deep learning pretraining. The model uses word embeddings trained on a domain corpus and fine-tunes an entity recognition model pretrained on a related corpus. BiLSTM and Transformer are then used, respectively, as feature extractors to identify four types of clinical entities, namely, diseases, symptoms, drugs, and operations, from the text of Chinese electronic medical records. Results. The model achieves 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1 on the test dataset. Conclusions. These experiments show that the proposed Chinese clinical entity recognition model based on deep learning pretraining can effectively improve recognition performance.

1. Background

In recent years, medical informatization has produced a large number of electronic medical records. The electronic medical record not only completely preserves detailed information on the patient's diagnosis and treatment process but also has the advantages of a regular writing format and convenient retrieval and storage, and it can further support telemedicine. In addition, the rapid development of online consultation websites and case discussion forums produces a large amount of disease question-and-answer information. These medical texts take the same form as electronic medical records, and together these data make up a very large body of medical data resources.

Electronic medical records (EMRs) record the symptoms and examinations of patients from before admission through hospitalization, together with the disease diagnoses and treatment plans that medical personnel provide based on examination results; they are medical resources constructed by professionals. As the core data of the medical information system, how to exploit the large amount of potential medical information contained in electronic medical records has become a hot research direction. But electronic medical records are not fully structured data; semistructured or unstructured free text makes up the majority.
To convert these unstructured data into a structured form that a computer can process, natural language processing technology is needed for text mining. As the basic task of text mining and information extraction, the types of clinical entities to be recognized mainly include diseases, symptoms, operations taken by medical personnel (including inspection operations and treatment operations), and drugs. Although research on Chinese named entity recognition has been going on for some time, most of it focuses on the open domain. However, some studies have shown that the density of entity distribution in Chinese EMRs is much higher than that in open-domain texts. The proportion of entity characters in a corpus of Chinese EMRs is nearly twice that of a general Chinese corpus, which indicates that Chinese EMRs are a kind of knowledge-intensive text [1], so the data have considerable research value. But this density also creates more obstacles to the study of clinical named entity recognition from Chinese EMRs. Since the task is entity recognition over Chinese electronic medical records, this paper keeps the entities in Chinese character format. Work on this task has only just started, and it presents the following difficulties [2]:

(1) Clinical entities are of many types and large in number, and new entities appear as unregistered words, such as unregistered diseases, drugs, and inspections, which makes it difficult to build a comprehensive clinical dictionary or to obtain a disease, drug, or inspection dictionary.

(2) Clinical entities divide into simple entities and complex entities with relatively complex structure. The length of entities in EMRs is variable, and many clinical entities are longer than common entities. Nesting, aliases, and acronyms are frequent among clinical entities.

(3) In different parts of an EMR, the extension of a clinical entity differs, and category labeling can be ambiguous. The boundary between different named entities is not clear, and the names of clinical manifestations often appear inside disease names, with much mutual inclusion and crossover. For example, "上呼吸道感染" (upper respiratory tract infection) is generally considered a disease, but in some cases it also appears as a symptom.

Named entity recognition in EMRs has been studied abroad. Because EMRs involve specialized knowledge, the cost of corpus construction is high. Informatics for Integrating Biology and the Bedside (I2B2) has organized multiple record-related tasks and issued relevant corpora and a number of shared tasks since 2006 [3]. The concept recognition and relation extraction evaluation of I2B2 2010 was the first to systematically classify English electronic medical record named entities. The classification refers to the semantic types defined by UMLS, which divides EMR entities into three categories: medical problems (including diseases and symptoms), treatment, and examination.

Named entity recognition, as a key task of text data mining, has long been a research foundation and hotspot of natural language processing. Named entity recognition in the general domain mainly covers names, places, organizations, time expressions, and numerical expressions. In biomedicine, most current research focuses on identifying genes, proteins, cells, and other entities in the English medical literature. Although the specific objects identified are different, a large number of entity identification methods from the general domain can still be applied to the biomedical field. These include early approaches based on the combination of dictionaries and rules and approaches based on machine learning.

The method based on combining dictionaries and rules first matches the dictionary against the text and then applies the formulated rules for postprocessing and normalization. Its performance depends on the size of the dictionary and the quality of the rules [4]. Because of the variety and strong domain specificity of clinical entities, constructing dictionaries and formulating rules requires substantial manpower, which is not only time-consuming but also not portable. Therefore, dictionaries and rules are often used only as auxiliary means in named entity recognition.

In recent years, machine learning has been applied to named entity recognition [5, 6], including maximum entropy (ME), conditional random fields (CRF), support vector machines (SVM), and structural support vector machines (SSVM). Multiple deep learning methods are also applied to entity recognition tasks, for example, Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM).
In addition, models combining deep learning with traditional machine learning have been widely used.

Named entity recognition is often regarded as a sequential annotation task. Traditional machine learning methods such as CRF [7] can achieve good performance on sequence annotation tasks but rely heavily on manually selected features. By contrast, deep learning can learn features automatically, but a large amount of training data is needed to achieve excellent recognition performance [8, 9].

Related work on Chinese EMRs is developing rapidly [10, 11]. Compared with English corpora, Chinese text has fuzzy word boundaries and no explicit segmentation marks, which makes entity recognition harder to study [12]. The selection of features in traditional machine learning directly affects the recognition results, so most studies focus on the construction and selection of different features. Lei et al. [13] compared combinations of CRF, SSVM, SVM, and ME with a variety of features and recognized medical problems, examinations, treatments, and drugs in admission and discharge records. Wang et al. [14] used character position information and short clauses to reach an F1 value of 95.12% on a self-labeled traditional Chinese medicine text corpus. Literature [15] studies the influence of multifeature combinations, such as linguistic symbol features, part-of-speech features, keyword features, and dictionary features, on CRF sequence annotation.

There are also studies on Chinese clinical entity recognition using deep learning [16-18], whose models are basically the sequence model RNN and its variants. It is worth mentioning that Yang et al. [19] combined the characteristics of Chinese EMR language structure with the English EMR labeling specifications and, guided by professional medical personnel, drew up detailed regulations for labeling Chinese EMR named entities and entity relations, completing basic work for natural language processing research on Chinese EMRs. In addition, some identification methods combine deep learning with supervised learning [20, 21]. However, as far as we know, methods combining BiLSTM and Transformer [22] have not been applied to Chinese clinical named entity recognition.

In view of the above problems, this study proposes a named entity recognition method for Chinese EMRs based on pretraining. The method builds on word embedding pretraining and on fine-tuning an entity recognition model pretrained on a related corpus. BiLSTM and Transformer are used, respectively, as feature extractors to effectively recognize clinical entities in Chinese EMRs.
2. Methods

The problem of Chinese clinical entity recognition can be transformed into sequence labeling. The sequence annotation problem is to determine the output tag sequence B = (b_1, ..., b_n), with b_i ∈ L for 1 ≤ i ≤ n, for an input sequence A = (a_1, a_2, a_3, ..., a_n) and tag set L. Its essence is to classify each element of the input sequence according to its context.

There are two specific practices for implementing the deep learning pretraining mode: first, the input is initialized with EMR embeddings pretrained on a corpus from the same field; second, an entity recognition model pretrained on a related corpus is fine-tuned. We studied the effect of this model on the recognition of clinical entities as shown in Figure 1.

[Figure 1: Pipeline of deep learning pretraining (EMR embedding → feature extractors → fine-tuning).]

2.1. Datasets. Because of the patient privacy protection policy in China, it is difficult to obtain electronic medical records from hospitals. Therefore, 1,064 respiratory department records and 30,262 unrestricted records were crawled from the website https://www.iiyi.com. 200 of the 1,064 respiratory department EMRs were manually annotated according to the annotation specification shown in Table 1, based on [19] and the semantic types of English I2B2 and UMLS, covering four kinds of medical entities: disease, symptom, drug, and operation. Table 2 shows the distribution of the training set and test set.

Table 1: Labeling rules.

Disease — The diagnosis made by doctors, or entities ending with "病" or "症," collectively referred to as diseases. Example: 肺内隔离症.
Symptom — Discomfort, abnormalities, normal or abnormal examination results, or an unhealthy state of the patient, as well as the patient's self-reported history. Examples: 声音嘶哑、无结核病史.
Drug — The specific drug name or class of drug given to the patient during treatment. Examples: 地塞米松、抗生素.
Operation — Screening programs and treatments. A test item is given to a patient in order to discover, rule out, confirm, or learn more about the disease; treatment refers to the procedures and interventions imposed on patients to cure the disease or relieve symptoms. Examples: 拍胸片、抗感染、胸腔穿刺术.

Table 2: Distribution of entities among the training set and the test set.

Data           Disease   Symptoms   Drug   Operation   Total entities
Training set   701       2648       546    2138        6033
Test set       273       1043       208    918         2442

The skip-gram model of Word2vec was used to train EMR word embeddings on the 30,262 unannotated electronic medical records (115 MB), called the first dataset. In addition, to study the impact of the embedding corpus on the downstream task, we also use universal word embeddings trained on a 268 GB news corpus, called the second dataset.
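As an illustration of this embedding step, here is a minimal sketch using the gensim library. The corpus file name, the character-level tokenization, and all hyperparameters other than the skip-gram choice and the 150-dimension setting favored by the later experiments are assumptions; the paper does not publish its training script.

```python
# Minimal sketch of training skip-gram embeddings on raw EMR text with gensim.
# Assumptions: records are stored one per line in "emr_corpus.txt" (hypothetical
# path), and each Chinese character is one token, matching the paper's
# character-level annotation unit.
from gensim.models import Word2Vec

with open("emr_corpus.txt", encoding="utf-8") as f:
    sentences = [list(line.strip()) for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=150,  # embedding dimension (gensim >= 4.0; "size" in older versions)
    sg=1,             # sg=1 selects the skip-gram architecture
    window=5,         # context window (assumed default)
    min_count=2,      # drop very rare characters (assumed)
    workers=4,
)
model.wv.save_word2vec_format("emr_embedding_150d.txt")
```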
For the sequential annotation task of entity recognition, each tag is composed of two parts: the entity category and the position within the entity, with the character as the minimum annotation unit. In this study, the BIO representation is used: B marks the beginning of an entity, I marks the inside of an entity, and O marks characters that belong to no entity. With four entity types, the labeled corpus therefore contains 9 types of labels.
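To make the scheme concrete, here is a small illustrative example; the sentence is invented, and the tag spellings follow the B-sym/I-sym style visible in Figures 1 and 3.

```python
# Character-level BIO labeling for the four entity types. With B- and I- tags
# for disease (dis), symptom (sym), drug (dru), and operation (ope) plus the
# single O tag, there are 4 * 2 + 1 = 9 label types, as stated in the paper.
# The sentence and the exact tag spellings are illustrative assumptions.
sentence = "患者声音嘶哑"          # "The patient is hoarse"; 声音嘶哑 is a symptom
labels   = ["O", "O", "B-sym", "I-sym", "I-sym", "I-sym"]

assert len(sentence) == len(labels)  # one label per character
for ch, tag in zip(sentence, labels):
    print(ch, tag)
```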
2.2. Pretraining. In addition to the character embeddings for Chinese EMRs, there is a pretraining-with-fine-tuning mode. Annotating a corpus of Chinese EMRs is difficult, so, in order to make full use of the resources of previous studies, we fine-tune for our recognition task a clinical entity recognition model (https://github.com/baiyyang/medical-entity-recognition) trained on medical data from the CCKS2017 tasks. This model uses BiLSTM as the feature extractor followed by a CRF for sequence annotation; for convenience of description, this paper calls it BioModel. Although the labeling target of BioModel differs from our task, Chinese EMR texts share the same linguistic features, and the model's ability to learn these features transfers well to our task.
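The fine-tuning recipe can be pictured with the following PyTorch sketch. This is not the authors' code: the class, checkpoint path, and sizes are hypothetical, and the CRF layer is reduced to a comment; only the overall idea (reuse the pretrained encoder weights, reinitialize the task-specific output layer, and continue training on the new corpus) follows the paper.

```python
# Sketch of fine-tuning a pretrained BiLSTM tagger on a new tag set.
# All names, paths, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """BiLSTM encoder with a per-character emission layer (CRF omitted)."""
    def __init__(self, vocab_size=5000, embed_dim=150, hidden_dim=128, num_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)  # a CRF layer would decode these

    def forward(self, char_ids):
        out, _ = self.lstm(self.embed(char_ids))
        return self.emissions(out)

model = BiLSTMTagger()

# Load pretrained BioModel weights where shapes match; the emission layer is
# re-initialized because the original CCKS2017 tag set differs from ours.
pretrained = torch.load("biomodel_ccks2017.pt")  # hypothetical checkpoint
own = model.state_dict()
own.update({k: v for k, v in pretrained.items()
            if k in own and v.shape == own[k].shape})
model.load_state_dict(own)

# Fine-tune end-to-end with a small learning rate on the newly annotated EMRs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```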
2.3. Bidirectional Long Short-Term Memory-Conditional Random Fields. In recent years, a variety of deep learning methods have been widely applied to named entity recognition, usually RNN models and their variants. An RNN is theoretically capable of capturing long-range context relationships, but in practice RNN cells often fail because of vanishing or exploding gradients. Therefore, LSTM is usually used in practical applications.

LSTM uses a separate update gate Γ_u and forget gate Γ_f, as well as an output gate Γ_o. The update gate selectively updates the state at the current moment, while the forget gate selectively forgets the state of the previous moment; the output gate then controls the proportion of the current state that is output. Figure 2 depicts the internal structure of an LSTM cell [6]. The realization of LSTM is as follows:

    C̃(t) = tanh(W_c [h(t−1), x(t)] + b_c),
    Γ_u = sigmoid(W_u [h(t−1), x(t)] + b_u),
    Γ_f = sigmoid(W_f [h(t−1), x(t)] + b_f),
    Γ_o = sigmoid(W_o [h(t−1), x(t)] + b_o),
    C(t) = Γ_u ∗ C̃(t) + Γ_f ∗ C(t−1),
    h(t) = Γ_o ∗ tanh(C(t)).                (1)

[Figure 2: LSTM cell.]

In the named entity recognition task, simultaneous access to the context on both sides of the current moment helps prediction. However, LSTM's hidden state h(t) accepts only past information. Therefore, we use a bidirectional LSTM model to give each state its full context, using two independent hidden states, one running left to right and one right to left, to capture both past and future information.

BiLSTM converts the input sequence, through the embedding layer, into a vector sequence that is fed into the two LSTM networks, and then concatenates the forward and backward hidden-layer outputs into a Softmax layer for classification. However, LSTM can only learn the context relations of features; it cannot directly learn the context relations of tags. Without the constraint of state transitions, the model is liable to output a completely invalid tag sequence. Therefore, the Softmax layer is replaced with a CRF layer: the CRF remains responsible for sequence annotation, while the BiLSTM is responsible for automatic feature selection. Figure 3 depicts the BiLSTM-CRF model used for clinical entity recognition.

[Figure 3: BiLSTM-CRF (word embeddings → BiLSTM encoder → CRF layer).]
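For reference, equation (1) transcribes directly into a single NumPy time step, as in the sketch below; the weight shapes and random initialization are illustrative only.

```python
# One LSTM cell step, transcribing equation (1). [h(t-1), x(t)] denotes
# vector concatenation; shapes assume hidden size n_h and input size n_x.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps the concatenated [h(t-1), x(t)]; one matrix per gate."""
    hx = np.concatenate([h_prev, x_t])
    c_tilde = np.tanh(W["c"] @ hx + b["c"])   # candidate state  C~(t)
    g_u = sigmoid(W["u"] @ hx + b["u"])       # update gate      Γ_u
    g_f = sigmoid(W["f"] @ hx + b["f"])       # forget gate      Γ_f
    g_o = sigmoid(W["o"] @ hx + b["o"])       # output gate      Γ_o
    c_t = g_u * c_tilde + g_f * c_prev        # new cell state   C(t)
    h_t = g_o * np.tanh(c_t)                  # new hidden state h(t)
    return h_t, c_t

n_x, n_h = 150, 128                           # e.g., 150-d character embeddings
rng = np.random.default_rng(0)
W = {k: rng.normal(0, 0.1, (n_h, n_h + n_x)) for k in "cufo"}
b = {k: np.zeros(n_h) for k in "cufo"}
h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), W, b)
```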
2.4. Transformer-Conditional Random Fields. Although gate mechanisms such as LSTM alleviate the long-term dependence problem to some extent, LSTM still cannot handle especially long-range dependence. Moreover, the computation of LSTM is sequential: it can only proceed from left to right or from right to left, and loss of information in sequential computation is inevitable.

Transformer solves this problem by using attention to reduce the distance between any two positions in a sequence to a constant. Therefore, Transformer, as a feature extractor, has a stronger learning ability than LSTM and has been widely used in the past two years.

As shown in Figure 4, Transformer is a stack of encoders and decoders, and, as in all generation models, the output of the encoder is the input of the decoder [22]. All encoder blocks are structurally identical, but they do not share parameters. Each encoder block decomposes into two sublayers: self-attention and a Feed Forward Neural Network. After the data passes through the self-attention module, the weighted feature vector Z is obtained and sent to the next module of the encoder block, the Feed Forward Neural Network, to obtain the output FFN(Z) of the encoder block:

    Z = Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,
    FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2.                (2)

[Figure 4: Transformer (stacked encoder and decoder blocks with masked multihead attention, encoder-decoder attention, feed-forward layers, and a final Softmax over output probabilities).]

Here Q, K, and V are assumed to form a series of <Q, K, V> data pairs. For any element Q, the weight coefficient of the V corresponding to each K is obtained by computing the similarity between the current element Q and the other elements K; the weighted sum over V then gives the final attention value.

A decoder block has one more encoder-decoder attention sublayer than an encoder block. The two types of attention in the decoder weight the input and output, respectively: self-attention computes the relationship between the current output and the preceding outputs, while encoder-decoder attention computes the relationship between the current output and the encoder's input feature vector. In encoder-decoder attention, Q comes from the previous output of the decoder, and K and V come from the output of the encoder.

Multihead self-attention represents different ways of fusing the target word with the semantic vectors of other words in the text under various semantic scenes. Note that the mechanism has multiple sets of Q/K/V weight matrices, each randomly initialized; after training, each set embeds the input word (or the vector from the previous encoder/decoder) into a different representation subspace.
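Equation (2) can likewise be transcribed into a short NumPy sketch; the matrix shapes are illustrative, and the multihead projections and masking are omitted.

```python
# Scaled dot-product attention and the position-wise feed-forward network of
# equation (2): a sequence of n positions with d_k-dimensional queries/keys
# and d_v-dimensional values (sizes assumed for the demo).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise similarity between positions
    return softmax(scores) @ V        # weighted sum of the values

def ffn(Z, W1, b1, W2, b2):
    """FFN(Z) = max(0, Z W1 + b1) W2 + b2: ReLU between two linear maps."""
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

n, d_k, d_v, d_ff = 6, 64, 64, 256
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
Z = attention(Q, K, V)                # (n, d_v): one context vector per position
out = ffn(Z, rng.normal(size=(d_v, d_ff)), np.zeros(d_ff),
          rng.normal(size=(d_ff, d_v)), np.zeros(d_v))
```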
3. Results and Discussion

To assess the performance of the model over the whole dataset, this paper adopts the macroaverage, that is, the arithmetic mean of each performance indicator over the entity categories: macroprecision (Macro-P), macrorecall (Macro-R), and Macro-F1:

    Macro-P = (1 / N_c) Σ_{i=1..N_c} P_i,
    Macro-R = (1 / N_c) Σ_{i=1..N_c} R_i,
    Macro-F1 = (2 × Macro-P × Macro-R) / (Macro-P + Macro-R),                (3)

where N_c is the total number of entity categories, P_i is the precision for entity category i, and R_i is the recall for entity category i.
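Equation (3) amounts to the following small function. This is a plain transcription of the formulas; the sample values are made up for illustration, not results from the paper.

```python
# Macroaveraged metrics of equation (3) from per-category precision and recall.
def macro_metrics(per_class):
    """per_class: list of (precision, recall) pairs, one per entity category."""
    n_c = len(per_class)
    macro_p = sum(p for p, _ in per_class) / n_c
    macro_r = sum(r for _, r in per_class) / n_c
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
    return macro_p, macro_r, macro_f1

# Four categories (disease, symptom, drug, operation) with illustrative (P, R):
print(macro_metrics([(0.77, 0.75), (0.72, 0.74), (0.71, 0.71), (0.79, 0.81)]))
# -> (0.7475, 0.7525, ~0.7500)
```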
To investigate the effect of the embedding dimension on BiLSTM-CRF performance on the test set, we conducted the comparative experiments shown in Table 3 using the first dataset. Table 3 shows that if the dimension of the word embedding is too small, implicit semantic information is lost, while if it is too large, noise is introduced. The appropriate embedding dimension depends on the size and language characteristics of the corpus.

Table 3: Comparison results of different dimensions.

Dimension          Macro-P (%)   Macro-R (%)   Macro-F1 (%)
Random embedding   69.52         69.70         69.38
50                 53.42         54.31         53.74
150                72.48         72.54         72.51
300                55.36         61.03         57.88

In deep learning, the quality of the word embedding has a great influence on the recognition results of the deep neural network. In this study, the 150-dimension word embedding performed best. Therefore, the two word embeddings combined with BiLSTM-CRF and Transformer-CRF form the following four groups of experiments, using 150 dimensions and the two datasets, as shown in Table 4.

Table 4: Comparisons of different recognition models and different word embeddings.

Model                             Dataset   Macro-P (%)   Macro-R (%)   Macro-F1 (%)
BiLSTM-CRF + embedding            Second    68.37         70.84         69.58
BiLSTM-CRF + EMR embedding        First     72.48         72.54         72.51
Transformer-CRF + embedding       Second    52.70         69.50         59.90
Transformer-CRF + EMR embedding   First     52.70         72.10         60.70

The results show that EMR embedding outperforms the common embedding whichever model is used. Although the corpus for EMR embedding is smaller than that for the common embedding, the strong relevance of EMR embedding to the downstream task makes its effect significantly better than that of embedding trained on a universal corpus.

In addition, comparing the experimental results of BiLSTM-CRF and Transformer-CRF shows that, although the feature extraction ability of Transformer is theoretically better than that of BiLSTM, the complex model structure of Transformer requires a large amount of training data. With fewer training samples, Transformer does not perform as well as BiLSTM.

To study the effectiveness of pretraining with fine-tuning of a related entity recognition model, EMR embedding was adopted for BioModel fine-tuning on the first dataset, yielding a medical entity recognition model called BioModel-fine. As the basic network structure of BioModel is BiLSTM-CRF, the experimental control group also used the BiLSTM-CRF network with EMR embedding. The comparison between pretraining and no pretraining is shown in Table 5.

BioModel achieved 79% Macro-P, 80% Macro-R, and 80% Macro-F1 on its original CCKS2017 test set. However, the recognition target of its original corpus differs from ours: using the first dataset as test data, BioModel obtains 72.48% Macro-P, 72.54% Macro-R, and 72.51% Macro-F1, while BioModel-fine obtains 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1. Further details of BioModel-fine appear in Tables 5 and 6.

Table 5: Comparisons between pretraining and no pretraining.

Model           Macro-P (%)   Macro-R (%)   Macro-F1 (%)
BioModel        72.48         72.54         72.51
BioModel-fine   75.06         76.40         75.72

Table 6: Performance of BioModel-fine.

Type        P (%)   R (%)   F1 (%)
Disease     77.07   75.09   76.07
Drug        70.81   71.15   70.98
Operation   79.28   80.56   79.91
Symptom     71.74   74.12   72.91
Average     75.06   76.40   75.72

BioModel-fine keeps the model structure of BioModel and is initialized from it. Compared with the results without pretraining, fine-tuning significantly improves the experimental results by exploiting the implicit information that BioModel learned from its own training data. Figure 5 compares the per-entity F1 between pretraining and no pretraining.

Figure 5: Per-entity F1 (%) comparison.

                      Disease   Symptom   Drug    Operation
With pretraining      76.07     72.91     70.98   79.91
Without pretraining   68.24     71.07     69.32   76.15

The largest difference between the pretraining and nonpretraining modes is for disease. This is mainly because BioModel also recognized disease entities, and disease entities have similar word formation, often ending with "病" or "症"; moreover, their positions in the EMR are relatively stable, so these characteristics can be learned well from context. The smallest difference between the two modes is for drugs. Drug entities are mostly unfamiliar words whose formation differs greatly from the free text of the rest of the record; they are hard to learn, and the drugs appearing in EMRs from different departments tend to differ widely, so drug recognition is intrinsically difficult.

In addition, Chinese electronic medical records contain many long entities, and even super-long entities of more than 10 characters, such as "双侧腋下扪及黄豆大小淋巴结" and "右肺中叶大片密度增高阴影." This paper also computed character-length statistics for the BioModel-fine recognition results, giving an average entity length of 4.63 characters. Although the BioModel-fine model, based on a deep neural network, relies on the performance of adjacent words, the learned gate structure can retain more long-term effective information and thus has an advantage on long-range dependence. BioModel-fine generally shows greater sensitivity to medical entities with longer character lengths. Table 7 lists five examples of long entities identified by the BioModel-fine model.

Table 7: Five examples of long entities identified by the BioModel-fine model.

No.   Identified medical entity
1     双侧腋下扪及黄豆大小淋巴结 (symptom)
2     右肺中叶大片密度增高阴影 (symptom)
3     两肺纹理间可见边界不清的粟粒样微小淡结节影 (symptom)
4     急性心肌梗塞 (disease)
5     结核 PCR 扩增实验 (operation)

4. Conclusions

In this study, a pretraining method for Chinese electronic medical record named entity recognition is proposed, in view of the language features of Chinese EMRs (many compound entities, frequently missing sentence components, and unclear entity boundaries) and the difficulty of obtaining annotated corpora. The pretraining is divided into two steps. The first is to pretrain word embeddings on a corpus from the same field and, respectively, use BiLSTM and Transformer as feature extractors to identify medical entities in Chinese electronic medical records. The second is to fine-tune a named entity recognition model pretrained on another related corpus, so as to make full use of existing annotated corpora and effectively improve the recognition of Chinese clinical entities when annotated corpora are scarce. The model achieves 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1 on the Chinese electronic medical record test dataset. The experimental results show that the proposed Chinese clinical entity recognition model based on deep learning pretraining can effectively improve recognition performance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant nos. 61502243, 61502247, and 61572263), the Zhejiang Engineering Research Center of Intelligent Medicine under grant 2016E10011, the China Postdoctoral Science Foundation (2018M632349), and the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province in China (no. 16KJD520003).

References

[1] L. B. Zhang, Word Segmentation and Named Entity Mining Based on Semi Supervised Learning for Chinese EMR, Dissertation, Harbin Institute of Technology, Harbin, China.
[2] W. Li, D. Zhao, L. Bo et al., "Combining CRF and rule based medical named entity recognition," Application Research of Computers, vol. 32, no. 4, pp. 1082–1086, 2015.
[3] O. Uzuner, B. R. South, S. Shen et al., "2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 552–556, 2011.
[4] H. V. Cook and L. J. Jensen, "A guide to dictionary-based text mining," Methods in Molecular Biology, Bioinformatics and Drug Discovery, vol. 1939, pp. 73–89, 2019.
[5] X. Liu, S. Zhang, F. Wei, and M. Zhou, "Recognizing named entities in tweets," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, USA, June 2011.
[6] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," 2015, http://arxiv.org/abs/1508.01991.
[7] Y. Zhang, X. W. Wang, Z. Hou et al., "Clinical named entity recognition from Chinese electronic health records via machine learning methods," JMIR Medical Informatics, vol. 6, no. 4, p. e50, 2018.
[8] S. Chowdhury, X. S. Dong, L. J. Qian et al., "A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records," BMC Bioinformatics, vol. 19, no. 17 Suppl, p. 499, 2018.
[9] Y. H. Wu, M. Jiang, J. B. Lei et al., "Named entity recognition in Chinese clinical text using deep neural networks," Studies in Health Technology and Informatics, vol. 216, pp. 624–628.
[10] Y. Wu, X. Yang, J. Bian, Y. Guo, H. Xu, and W. Hogan, "Combine factual medical knowledge and distributed word representation to improve clinical named entity recognition," AMIA Annual Symposium Proceedings, vol. 2018, pp. 1110–1117, 2018.
[11] Y. Xu, Y. Wang, T. Liu et al., "Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries," Journal of the American Medical Informatics Association, vol. 21, no. e1, pp. e84–e92, 2014.
[12] H. Wang, W. Zhang, Q. Zeng, Z. Li, K. Feng, and L. Liu, "Extracting important information from Chinese operation notes with natural language processing methods," Journal of Biomedical Informatics, vol. 48, no. C, pp. 130–136, 2014.
[13] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, "A comprehensive study of named entity recognition in Chinese clinical text," Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.
[14] Y. Wang, Z. Yu, L. Chen et al., "Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study," Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.
[15] X. W. Zhang and Z. Li, "Chinese electronic medical record named entity recognition based on multi-feature fusion," Software Guide, vol. 16, no. 2, pp. 128–131, 2017.
[16] Y. B. Xia, J. L. Zhen, and Y. F. Zhao, "Electronic medical record named entity recognition based on deep learning," Electronic Science and Technology, vol. 31, no. 11, p. 31, 2018.
[17] F. Li, M. Zhang, B. Tian, B. Chen, G. Fu, and D. Ji, "Recognizing irregular entities in biomedical text via deep neural networks," Pattern Recognition Letters, vol. 105, pp. 105–113.
[18] Z. J. Liu, M. Yang, X. L. Wang et al., "Entity recognition from clinical texts via recurrent neural networks," BMC Medical Informatics and Decision Making, vol. 17, no. 2, p. 67, 2017.
[19] J. F. Yang, Y. Guan, B. He et al., "Chinese electronic medical record named entity and entity relationship corpus construction," Journal of Software, vol. 27, no. 11, pp. 2725–2746.
[20] Z. H. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," 2015, https://arxiv.org/abs/1508.01991.
[21] Q. K. Wei, T. Chen, R. F. Xu, Y. He, and L. Gui, "Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks," Database, vol. 2016, 2016.
[22] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," 2017, https://arxiv.org/abs/1706.03762.

Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2020 Lejun Gong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
2040-2295
eISSN
2040-2309
DOI
10.1155/2020/8829219

Abstract

Hindawi Journal of Healthcare Engineering Volume 2020, Article ID 8829219, 8 pages https://doi.org/10.1155/2020/8829219 Research Article Clinical Named Entity Recognition from Chinese Electronic Medical Records Based on Deep Learning Pretraining 1,2 1 1 Lejun Gong , Zhifei Zhang, and Shiqi Chen Jiangsu Key Lab of Big Data Security & Intelligent Processing, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanijing 210023, China Zhejiang Engineering Research Center of Intelligent Medicine, Wenzhou 325035, China Correspondence should be addressed to Lejun Gong; glj98226@163.com Received 14 August 2020; Revised 26 October 2020; Accepted 2 November 2020; Published 24 November 2020 Academic Editor: Jiafeng Yao Copyright © 2020 Lejun Gong et al. -is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Background. Clinical named entity recognition is the basic task of mining electronic medical records text, which are with some challenges containing the language features of Chinese electronic medical records text with many compound entities, serious missing sentence components, and unclear entity boundary. Moreover, the corpus of Chinese electronic medical records is difficult to obtain. Methods. Aiming at these characteristics of Chinese electronic medical records, this study proposed a Chinese clinical entity recognition model based on deep learning pretraining. -e model used word embedding from domain corpus and fine-tuning of entity recognition model pretrained by relevant corpus. -en BiLSTM and Transformer are, respectively, used as feature extractors to identify four types of clinical entities including diseases, symptoms, drugs, and operations from the text of Chinese electronic medical records. Results. 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1 aiming at test dataset could be achieved. -ese experiments show that the Chinese clinical entity recognition model based on deep learning pretraining can effectively improve the recognition effect. Conclusions. -ese experiments show that the proposed Chinese clinical entity recognition model based on deep learning pretraining can effectively improve the recognition performance. personnel provides based on examination results such as 1. Background disease diagnosis and treatment methods as medical re- In recent years, medical informatization has produced a sources constructed by professionals. As the core data of large number of electronic medical records. -e electronic medical information system, how to make use of the large medical record not only completely preserves the detailed amount of potential medical information contained in information of the patients’ diagnosis and treatment process electronic medical records has become one of the hot re- but also has the advantages of regular writing format, search directions. But electronic medical records are not convenient retrieval, and storage, and it can better help fully structured data. Semistructured or unstructured free telemedicine further. In addition, the rapid development of a text data make up the majority. In order to convert these large number of online consultation websites and cases unstructured data into structured form that can be recog- discussion forums will also produce a large number of nized by computer, it is necessary to use natural language disease question and answer information. 
-ese medical processing technology to conduct text mining. As the basic texts are in the same form as electronic medical records. task of text mining information extraction, the types of -ese data make up a very large amount of medical data clinical entities to be recognized mainly include disease, resources. symptoms, operations taken by medical personnel (in- Electronic medical records (EMR) record various cluding inspection operations and treatment operations), symptoms and examination measures taken by patients and drugs. Although the research on Chinese named entity from before admission to hospitalization and that medical recognition has been going on for some time, most of the 2 Journal of Healthcare Engineering the specific objects identified are different, a large number researches focus on the open field. However, some studies have shown that the density of entity distribution in Chinese of entity identification methods in the general field can still be applied to the biomedical field. -ese include early EMRs is much higher than that in open field texts. -e proportion of entity characters in the corpus of Chinese approaches based on the combination of dictionaries and EMRs is nearly twice that of the general Chinese corpus, rules and approaches based on machine learning. which indicates that Chinese EMRs are a kind of knowledge- -e method based on the combination of dictionary and intensive text [1], and the data has considerable research rule will match the dictionary and text firstly. -en it value. But this density also creates more obstacles to the combines with the formulated rules for postprocessing and study of clinical named entity recognition from EMRs in normalization. Its performance depends on the size of the Chinese. Since it is the entity recognition of Chinese elec- dictionary and the quality of the rules [4]. Due to the variety tronic medical record, this paper keeps the entity in Chinese of clinical entities and its strong professionalism, the con- struction of dictionaries and the formulation of rules need medical record as Chinese character format. -is task has just started. In addition, it still has the following difficulties amount manpower, which is not only time-consuming, but also not portabl e. -erefore, dictionaries and rules are often [2]. used as auxiliary means in the task of named entity (1) Clinical entities have various types and a large recognition. number, and there are new entities with unregistered In recent years, machine learning has been applied to words, such as unregistered disease, drug, and in- named entity recognition [5, 6], such as maximum entropy spection, which make it difficult to build a com- (ME), conditional random field (CRF), support vector prehensive clinical dictionary and obtain disease machine (SVM), and structural support vector machine dictionary, drug dictionary, or inspection dictionary. (SSVM). Multiple deep learning methods are also applied to (2) Clinical entities are divided into simple entities and entity recognition tasks. For example, Recurrent Neural complex entities with relatively complex structure. Networks (RNN) and Long Short-Term Memory Networks -e length of medical record entities in EMR is (LSTM). In addition, models combining deep learning with variable, and a large number of clinical entities are traditional machine learning have also been widely used. longer than common entities. -ere are a lot of Named entity recognition is often regarded as a se- nesting, alias, and acronym in clinical entities. 
quential annotation task. Traditional machine learning (3) In different parts of EMR, the extension of clinical methods, such as CRF [7], can achieve good performance in sequence annotation tasks but rely heavily on manually entity is different, and there is fuzzy classification in category labeling. -e boundary between different selected characteristics. By contrast, deep learning could named entities is not clear, and the names of clinical automatically learn features, but a large amount of training data is needed to achieve excellent recognition effect [8, 9]. manifestations often appear in the names of diseases. -ere are a lot of mutual inclusion and crossover Related work of Chinese EMR is developing rapidly [10, 11]. Compared with English corpus, Chinese text has phenomenon. For example, “上呼吸道感染” is generally considered to be the disease, but, in some fuzzy word boundary and no obvious segmentation mark, so it is difficult to study the entity recognition [12]. -e se- cases, it also appears as a symptom. lection of features in traditional machine learning will di- -e recognition of named entity in EMR has been rectly affect the effect of entity recognition, so most studies studied in foreign countries. Because EMR involves more focus on the construction and selection of different features. professional knowledge, the cost of corpus construction Lei et al. [13] compared the combination of CRF, SSVM, is higher. -e informatics for Integrating Biology and the SVM, and ME with a variety of characteristics and recog- Bedside (I2B2) organized the multiple records related nized medical problems, examination, treatment, and drugs tasks and issued the relevant corpus and a number of in the admission records and discharge records. Wang et al. shared tasks since 2006 [3]. -e task of concept recog- [14] used the character position information and short nition and relation extraction evaluation of I2B2 2010 is clauses to reach the F1 value of 95.12% in the self-labeled the first to systematically classify English electronic Chinese medicine text corpus. Literature [15] studies the medical record name entities. -is classification refers to influence of multifeature combination such as linguistic the semantic type defined by UMLS, which divides EMR symbol feature, part of speech feature, keyword feature, and entities into three categories, namely, medical problems dictionary feature on CRF sequence annotation. (including diseases and symptoms), treatment, and -ere are also relevant studies on Chinese clinical entity examination. recognition using the deep learning method [16–18], whose Named entity recognition, as the key task of text data model is basically the sequence model RNN and its variants. mining, has been the research foundation and hotspot of It is worth mentioning that Yang et al. [19] combined the natural language processing. Named entity recognition in characteristics of Chinese electronic language structure and, the general field mainly includes names, places, organi- with the help of the guidance of professional medical per- zations, time expressions, and numerical expressions. In sonnel, combined EMR label specification in English and the field of biomedicine, most of the current research made a detailed EMR in Chinese named entity and entity focuses on the identification of genes, proteins, cells, and relationship labeling regulations and completed the natural other entities in the English medical literature. 
Although language processing research in the field of EMR in Chinese Journal of Healthcare Engineering 3 basic work. In addition, there are some identification B-sym I-sym 0 0 methods that combine deep learning with supervised learning [20, 21]. However, as far as we know, BiLSTM and Transformer [22] combined methods have not been applied to clinical Chinese-named entity recognition. In view of the above problems, this study proposes a named entity recognition method for Chinese EMR based Feature extractors on pretraining. -e method is based on word embedding pretraining and fine-tuning of entity recognition model Fine-tuning pretrained by relevant corpus. BiLSTM and Transformer are, EMR embedding respectively, used as feature extractors to effectively recog- nize clinical entities in Chinese EMR. 2. Methods Head Ache , Suffer -e problem of Chinese clinical entity recognition can be transformed into sequence labeling. Sequence annotation Figure 1: Pipeline of deep learning pretraining. problem is to determine the output tag sequence B � b , . . . , b (b ∈ L, 1≤ i≤ n) for input sequence 1 n i In order to make full use of the resources of previous studies, A � a , a , a , ..., a and tag set L. Its essence is to classify 1 2 3 n it is used to fine-tune our recognition tasks based on a each element in the input sequence according to its clinical entity recognition model (https://github.com/ context. baiyyang/medical-entity-recognition) trained by medical -ere were two specific practices for implementing the data of CCKS2017 tasks. -is model uses BiLSTM as feature deep learning pretraining mode: firstly, the input is initialized extractor followed by CRF for sequence annotation. For the by the same field corpus pretraining EMR embedding and, convenience of description, this paper calls it as BioModel. secondly, the entity recognition model pretrained by relevant Although the labeling target of BioModel is different corpus is fine-tuning. We studied the effect of this model on the from our task. However, Chinese EMR texts all have the recognition of clinical entities as shown in Figure 1. same linguistic features, and the model’s ability to learn this language feature can be well transferred to our task. 2.1. Datasets. Because of the protection policy of patient privacy in China, it is difficult to obtain electronic medical 2.3. Bidirectional Long Short-Term Memory-Conditional records in hospitals. -erefore, we got 1,064 respiratory Random Fields. In recent years, a variety of deep learning records and 30,262 unrestricted records were crawled from methods have been widely applied in named entity recog- the website (https://www.iiyi.com). 200 of the 1,064 respi- nition tasks, usually using RNN model and its variants. RNN ratory department EMRs were manually annotated is theoretically capable of capturing remote context rela- according to the annotation specification shown in Table 1 tionships, but, in practice, RNN cells often fail due to based on [19] and the semantic types of English I2B2 and gradient disappearance or gradient explosion. -erefore, UMLS, indicating four medical entities of disease, symptom, LSTM is usually used in practical applications. drug, and operation. Table 2 shows the distribution of LSTM uses a separate update gate Γ and a forget gate Γ , u f training set and test set. as well as an output gate Γ . 
-e update gate selectively Skip-gram model of Word2vec was used to adopt the EMR updates the state of the current moment, while the forget word embedding from 30,262 sets of unmarked electronic gate selectively forgets the state of the previous moment. medical records (115 MB), called the first dataset. In addition, And then the output gate controls the proportion of the in order to study the impact of word embedding language on output of the current state. Figure 2 depicts the internal downstream task, we also use the universal word embedding structure of an LSTM cell [6]. -e realization of LSTM is as with 268G news corpus, called the second dataset. follows: For the sequential annotation task of entity recognition, the tag is composed of two parts: the entity category and the lo- 􏽥 C(t) � tanh W [h(t − 1), x(t)] + b , c c cation in the entity. In this study, BIO representation is used to Γ � sigmoidW [h(t − 1), x(t)] + b 􏼁, u u u represent the entity category and the position of the entity and then character as the minimum annotation unit. In the BIO Γ � sigmoid􏼐W [h(t − 1), x(t)] + b 􏼑, f f f representation, B is at the beginning of the entity, I is inside the (1) Γ � sigmoidW [h(t − 1), x(t)] + b 􏼁, entity, and O is not an entity. -erefore, the labeled corpus o o o contains 4 types of entities and 9 types of labels. C(t) � Γ C(t) + Γ C(t − 1), u f h(t) � Γ ∗ tanh(C(t)). 2.2. Pretraining. -ere is a pretraining with fine-tuning mode in addition to the character embedding of Chinese In the named entity recognition task, simultaneous EMR. It is difficult to annotate the corpus of Chinese EMRs. access to the context of the current moment can help predict 4 Journal of Healthcare Engineering Table 1: Labeling rules. Entity Definition Medical entities types -e diagnosis made by doctors to patients or entities ending with “病” or “症” are collectively Disease 肺内隔离症 referred to as diseases. Symptoms of discomfort, abnormalities, normal or abnormal examination results, or an unhealthy 声音嘶哑、无结核病 Symptoms state of the patient, as well as the patient’s self-reported history. 史 Drug -e specific drug name or class of drug given to the patient during treatment. 地塞米松、抗生素 -is includes screening programs and treatments. A test item is given to a patient in order to discover, deny, confirm, and find out more about the disease. Treatment refers to the treatment 拍胸片、抗感染、胸 Operation procedures and interventions that are imposed on patients to solve the disease or relieve 腔穿刺术 symptoms. Table 2: Distribution of entities among the training set and the test set. Data Disease Symptoms Drug Operation -e total number of entities Training set 701 2648 546 2138 6033 Test set 273 1043 208 918 2442 h (t) CRF layer B–sym I–sym O O C (t – 1) × + C (t) tanh c c c c 1 2 3 4 Sigmoid Sigmoid Bi-LSTM r r r r 1 2 3 4 Sigmoid encoder tanh h (t – 1) h (t – 1) l l l l 1 2 3 4 x (t) Word Head Ache , Suffer embeddings Figure 2: LSTM cell. Figure 3: BiLSTM-CRF. the current moment. However, LSTM’s hidden state h(t) still unable to do anything for the special long-term de- accepts only past information. -erefore, we use a bidi- pendence phenomenon. -e calculation of LSTM is limited rectional LSTM model to give the context of each state, using to sequence; that is, it can only be calculated in sequence two independent hidden states from left to right and from from left to right or from right to left, and the loss of in- right to left, while capturing past and future information. 
formation in the process of sequential calculation is BiLSTM converts the input sequence through the em- inevitable. bedding layer into a vector sequence input into two LSTM Transformer solves this problem by using the attention networks and then contact the forward and reverse two to reduce the distance between any two positions in the hidden layer outputs into the Softmax layer for classification. sequence to a constant. -erefore, Transformer, as a feature However, LSTM can only learn the context relation of extractor, has a stronger learning ability than LSTM and has features but cannot directly learn the context relation of tags. been widely used in the past two years. Without the constraint of state transition, the model is likely As shown in Figure 4, Transformer is stacked with en- to output a completely wrong tag sequence. -erefore, it is coder and decoder, and, like all generation models, the considered to replace Softmax layer with CRF layer. CRF is output of the encoder is the input of the decoder [12]. All still responsible for sequence annotation, and BiLSTM is encoder blocks are structurally identical, but they do not responsible for automatic feature selection. Figure 3 de- share parameters. Each encoder block can be decomposed scribes the BiLSTM-CRF model used in the clinical entity into two sublayers, composed of self-attention and Feed recognition. Forward Neural Network. After the data is passed through the self-attention module, the weighted feature vector Z is obtained and then sent to the next module of encoder block, 2.4. Transformer-Conditional Random Fields. Although the namely, Feed Forward Neural Network, to obtain the output structure of gate mechanism such as LSTM alleviates the FFN (Z) of an encoder block. problem of long-term dependence to some extent, LSTM is Journal of Healthcare Engineering 5 Output probabilities Somax Feed forword Encoder-decoder Feed forword N × Decoder attention block N × Encoder block Masked multihead Masked multihead attention attention Output Input Figure 4: Transformer. each kind of performance indicator, which can be divided QK Z � Attention(Q, K, V) � soft max 􏽰�� V, 􏼠 􏼡 into macroprecision (Macro-p), macrorecall (Macro-r), and (2) Macro-F1 (Macro-F1). FFN(Z) � max0, ZW + b 􏼁W + b . 􏽐 P 1 1 2 2 i�1 i Marco − P � , Among them, Q, K, and V are assumed to be composed of a series of <Q, K, V> data pairs. For any constituent 􏽐 R i�1 (3) element Q, the weight coefficient of each K corresponding to Marco − R � , V can be obtained by calculating the similarity between the c current element Q and other elements K, and then the 2 × Macro − P × Macro − R weighted sum of V can be carried out to obtain the final Marco − F1 � , Macro − P + Macro − R attention value. Decoder block has one more encoder-decoder attention N represents the total number of entity categories, P c i than encoder. -e two types of attention of decoder are used represents the precision of each category of entity, and R to calculate the weight of input and output, respectively. Self- represents the recall of each category of entity. attention is used to calculate the relationship between In order to investigate the effect of BiLSTM-CRF em- current output and preorder output. Encoder-decoder at- bedding of different dimensions on the test set, we con- tention calculates the relationship between the current ducted this set of comparative experiments as shown in output and the encoder input eigenvector. In encoder-de- Table 3 using the first dataset as test dataset. 
From Table 3, if coder attention, Q comes from the last output of decoder, the dimension of word embedding is too small, the implied and K and V come from the output of encoder. semantic information will be lost. If the dimension of word Multihead self-attention represents the different ways of embedded is too large, it will bring noise. How to set the fusion of the target word and the semantic vector of other dimension of word embedding is related to the size and the words in the text under various semantic scenes. Note that language characteristics of the corpus. there are multiple sets of Q/K/V weight matrices in the In deep learning, the quality of word embedding has a mechanism, each of which is randomly initialized, and after great influence on the recognition results of deep neural training, each set is used to embed the input word or the network. In this study, the experiment effect of 150-di- vector from the previous encoder/decoder into a different mension word embedding is the best. -erefore, two dif- representation subspace. ferent word embeddings combined with BiLSTM-CRF and Transformer-CRF form the following four groups of ex- periments using the 150-dimension and the two types of 3. Results and Discussion dataset as shown in Table 4. In order to comprehensively consider the performance of It can be seen from the results that EMR embedding has better common embedding than whatever model is used. the model on the whole dataset, macroaverage is adopted in this paper. Macroaverage refers to the arithmetic average of Although the corpus size of EMR embedding is smaller than 6 Journal of Healthcare Engineering Table 3: Comparison results of different dimensions. Different dimensions Marco-P (%) Marco-R (%) Marco-F1 (%) Random embedding 69.52 69.70 69.38 50 embeddings 53.42 54.31 53.74 150 embeddings 72.48 72.54 72.51 300 embeddings 55.36 61.03 57.88 Table 4: Comparisons of different recognition models and different word embedding. Models Dataset Marco-P (%) Marco-R (%) Marco-F1 (%) BiLSTM-CRF + embedding Second 68.37 70.84 69.58 BiLSTM-CRF + EMR embedding First 72.48 72.54 72.51 Transformer-CRF + embedding Second 52.70 69.50 59.90 Transformer-CRF + EMR embedding First 52.70 72.10 60.70 that of common embedding, the strong relevance of EMR Table 5: Comparisons between pretraining and not pretraining. embedding to downstream tasks makes the effect of EMR Models Marco-P (%) Marco-R (%) Marco-F1 (%) embedding significantly better than that of embedding with BioModel 72.48 72.54 72.51 universal corpus. BioModel-fine 75.06 76.40 75.72 In addition, by comparing the experimental results of BiLSTM-CRF and Transformer-CRF, it can be found that although the feature extraction ability of Transformer is Table 6: Performances of BioModel-fine. theoretically better than that of BiLSTM, the complex Types P (%) R (%) F1 (%) model structure of Transformer requires a large amount of Disease 77.07 75.09 76.07 training data for learning. With the case of fewer training Drug 70.81 71.15 70.98 samples, Transformer does not perform as well as Operation 79.28 80.56 79.91 BiLSTM. Symptom 71.74 74.12 72.91 To study the effectiveness of pretraining with the fine- Average 75.06 76.40 75.72 tuning related entity recognition model, EMR embedding was adopted to BioModel fine-tuning aiming at the first dataset, and the medical entity recognition model was recognized disease entity, and, for the disease entity, its word obtained, called BioModel-fine. 
In order to investigate the effect of the word-embedding dimension on BiLSTM-CRF, we conducted the comparative experiments shown in Table 3, using the first dataset as the test dataset. As Table 3 shows, if the dimension of the word embedding is too small, implied semantic information is lost; if it is too large, noise is introduced. The appropriate dimension depends on the size and the language characteristics of the corpus.

Table 3: Comparison results of different embedding dimensions.

  Dimensions          Macro-P (%)    Macro-R (%)    Macro-F1 (%)
  Random embedding    69.52          69.70          69.38
  50 dimensions       53.42          54.31          53.74
  150 dimensions      72.48          72.54          72.51
  300 dimensions      55.36          61.03          57.88

In deep learning, the quality of the word embedding has a great influence on the recognition results of the deep neural network. In this study, the 150-dimension word embedding performed best. Therefore, the two word embeddings (the common general-corpus embedding and the EMR embedding) were each combined with BiLSTM-CRF and Transformer-CRF, forming the four groups of experiments shown in Table 4, all using 150 dimensions and the two datasets.

Table 4: Comparisons of different recognition models and different word embeddings.

  Models                              Dataset    Macro-P (%)    Macro-R (%)    Macro-F1 (%)
  BiLSTM-CRF + common embedding       Second     68.37          70.84          69.58
  BiLSTM-CRF + EMR embedding          First      72.48          72.54          72.51
  Transformer-CRF + common embedding  Second     52.70          69.50          59.90
  Transformer-CRF + EMR embedding     First      52.70          72.10          60.70

It can be seen from the results that the EMR embedding outperforms the common embedding whatever model is used. Although the corpus on which the EMR embedding was trained is smaller than that of the common embedding, the strong relevance of the EMR embedding to the downstream task makes its effect significantly better than that of an embedding trained on a universal corpus.

In addition, comparing the experimental results of BiLSTM-CRF and Transformer-CRF shows that, although the feature extraction ability of Transformer is theoretically better than that of BiLSTM, the more complex model structure of Transformer requires a large amount of training data; with fewer training samples, Transformer does not perform as well as BiLSTM.

To study the effectiveness of pretraining combined with fine-tuning of a related entity recognition model, the EMR embedding was adopted and BioModel was fine-tuned on the first dataset; the resulting medical entity recognition model is called BioModel-fine. As the basic network structure of BioModel is BiLSTM-CRF, the experimental control group also used the BiLSTM-CRF network with the EMR embedding. The comparison between pretraining and no pretraining is shown in Table 5.

Table 5: Comparison between pretraining and no pretraining.

  Models           Macro-P (%)    Macro-R (%)    Macro-F1 (%)
  BioModel         72.48          72.54          72.51
  BioModel-fine    75.06          76.40          75.72

BioModel achieved 79% Macro-P, 80% Macro-R, and 80% Macro-F1 on its original CCKS2017 test set. However, the recognition target of its original corpus differs from ours: using the first dataset as test data, BioModel obtains 72.48% Macro-P, 72.54% Macro-R, and 72.51% Macro-F1, whereas BioModel-fine obtains 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1. Further details of BioModel-fine are given in Tables 5 and 6.

Table 6: Performance of BioModel-fine by entity type.

  Type         P (%)    R (%)    F1 (%)
  Disease      77.07    75.09    76.07
  Drug         70.81    71.15    70.98
  Operation    79.28    80.56    79.91
  Symptom      71.74    74.12    72.91
  Average      75.06    76.40    75.72

BioModel-fine shares its model structure with the pretrained BioModel. Compared with the results obtained without pretraining and fine-tuning, fine-tuning significantly improves the experimental results by exploiting the implied information that BioModel learned from its own training data. Further, F1 is used to compare pretraining and nonpretraining for each entity type, as shown in Figure 5.

Figure 5: F1 comparison between pretraining and no pretraining by entity type (with pretraining: Disease 76.07, Symptom 72.91, Drug 70.98, Operation 79.91; without pretraining: Disease 68.24, Symptom 71.07, Drug 69.32, Operation 76.15).

From the above, the largest difference between the pretraining and nonpretraining modes is for the disease type. This is mainly because BioModel also recognized disease entities, and disease entities have similar word formation, often ending with "病" or "症"; furthermore, the positions where they appear in an EMR are relatively stable, and these characteristics can be learned well from the context. In contrast, the smallest difference between the two modes is for the drug type. This is because drug entities are mostly unfamiliar words whose formation differs greatly from the free text of the other parts of the medical record, making them hard to learn; moreover, the drugs appearing in EMRs from different departments tend to differ considerably, so drug recognition is intrinsically difficult.

In addition, Chinese electronic medical records contain a large number of long entities, and even super-long entities of more than 10 characters, such as "双侧腋下扪及黄豆大小淋巴结" and "右肺中叶大片密度增高阴影." This paper also computed character-length statistics for the BioModel-fine recognition results, giving an average entity length of 4.63 characters. Although the BioModel-fine model, based on a deep neural network, relies on the performance of adjacent words, the learned gate structure can retain more long-term effective information and therefore has an advantage on implied characteristics with long-term dependence. BioModel-fine generally showed greater sensitivity to medical entities with longer character lengths. Table 7 lists five examples of long entities identified by the BioModel-fine model.

Table 7: Five examples of long entities identified by the BioModel-fine model.

  No.    Identified medical entity
  1      双侧腋下扪及黄豆大小淋巴结 (symptom)
  2      右肺中叶大片密度增高阴影 (symptom)
  3      两肺纹理间可见边界不清的粟粒样微小淡结节影 (symptom)
  4      急性心肌梗塞 (disease)
  5      结核 PCR 扩增实验 (operation)
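To summarize the two-step scheme evaluated above, the following PyTorch-style sketch shows the general shape of pretraining followed by fine-tuning. It is a minimal sketch under assumed names (the BiLSTMCRF skeleton, the biomodel.pt checkpoint, the toy batches), not the authors' implementation; for brevity, the CRF layer is replaced here by a plain token-level cross-entropy loss:

```python
import torch
from torch import nn, optim

class BiLSTMCRF(nn.Module):
    # Skeleton of the assumed network: embedding -> BiLSTM -> per-tag
    # emission scores. A real CRF layer (transition matrix plus Viterbi
    # decoding) is omitted in this sketch.
    def __init__(self, vocab_size=5000, emb_dim=150, hidden=128, num_tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):
        hidden_states, _ = self.lstm(self.emb(token_ids))
        return self.fc(hidden_states)  # (batch, seq_len, num_tags)

model = BiLSTMCRF()

# Step 1 (pretraining): start from weights learned on the related corpus.
# "biomodel.pt" is a hypothetical checkpoint name, not a released artifact.
try:
    model.load_state_dict(torch.load("biomodel.pt"))
except FileNotFoundError:
    pass  # in this sketch, fall back to random initialization

# Step 2 (fine-tuning): continue training on the target EMR corpus with a
# small learning rate. Two random batches stand in for the real dataloader.
emr_batches = [(torch.randint(0, 5000, (8, 40)), torch.randint(0, 9, (8, 40)))
               for _ in range(2)]
optimizer = optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
for token_ids, tag_ids in emr_batches:
    optimizer.zero_grad()
    logits = model(token_ids)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tag_ids.reshape(-1))
    loss.backward()
    optimizer.step()
```

The key design choice is that fine-tuning reuses all pretrained parameters rather than only the embedding, which is what lets BioModel-fine inherit the entity-boundary regularities learned from the related corpus.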
4. Conclusions

In this study, a pretraining-based method for Chinese electronic medical record named entity recognition is proposed, in view of the language features of Chinese EMRs (many compound entities, seriously missing sentence components, and unclear entity boundaries) and the difficulty of obtaining an annotated corpus. Pretraining is divided into two steps. The first step is to pretrain the word embedding on a corpus from the same field and to use BiLSTM and Transformer, respectively, as feature extractors to identify medical entities in Chinese electronic medical records. The second step is to fine-tune a named entity recognition model pretrained on another relevant corpus, so as to make full use of existing annotated corpora and effectively improve the recognition of Chinese clinical entities when annotated corpora are scarce. On the test dataset of Chinese electronic medical records, 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1 were achieved. The experimental results show that the proposed Chinese clinical entity recognition model based on deep learning pretraining can effectively improve recognition performance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant nos. 61502243, 61502247, and 61572263), the Zhejiang Engineering Research Center of Intelligent Medicine under grant 2016E10011, the China Postdoctoral Science Foundation (2018M632349), and the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province in China (no. 16KJD520003).

References

[1] L. B. Zhang, Word Segmentation and Named Entity Mining Based on Semi Supervised Learning for Chinese EMR, Dissertation, Harbin Institute of Technology, Harbin, China.
[2] W. Li, D. Zhao, L. Bo et al., "Combining CRF and rule based medical named entity recognition," Application Research of Computers, vol. 32, no. 4, pp. 1082–1086, 2015.
[3] O. Uzuner, B. R. South, S. Shen et al., "2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 552–556, 2011.
[4] H. V. Cook and L. J. Jensen, "A guide to dictionary-based text mining," Methods in Molecular Biology, Bioinformatics and Drug Discovery, vol. 1939, pp. 73–89, 2019.
[5] X. Liu, S. Zhang, F. Wei, and M. Zhou, "Recognizing named entities in tweets," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, June 2011.
[6] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," 2015, http://arxiv.org/abs/1508.01991.
[7] Y. Zhang, X. W. Wang, Z. Hou et al., "Clinical named entity recognition from Chinese electronic health records via machine learning methods," JMIR Medical Informatics, vol. 6, no. 4, p. e50, 2018.
[8] S. Chowdhury, X. S. Dong, L. J. Qian et al., "A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records," BMC Bioinformatics, vol. 19, no. 17 Suppl, p. 499, 2018.
[9] Y. H. Wu, M. Jiang, J. B. Lei et al., "Named entity recognition in Chinese clinical text using deep neural networks," Studies in Health Technology and Informatics, vol. 216, pp. 624–628, 2015.
[10] Y. Wu, X. Yang, J. Bian, Y. Guo, H. Xu, and W. Hogan, "Combine factual medical knowledge and distributed word representation to improve clinical named entity recognition," AMIA Annual Symposium Proceedings, vol. 2018, pp. 1110–1117, 2018.
[11] Y. Xu, Y. Wang, T. Liu et al., "Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries," Journal of the American Medical Informatics Association, vol. 21, no. e1, pp. e84–e92, 2014.
[12] H. Wang, W. Zhang, Q. Zeng, Z. Li, K. Feng, and L. Liu, "Extracting important information from Chinese operation notes with natural language processing methods," Journal of Biomedical Informatics, vol. 48, no. C, pp. 130–136, 2014.
[13] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, "A comprehensive study of named entity recognition in Chinese clinical text," Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.
[14] Y. Wang, Z. Yu, L. Chen et al., "Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study," Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.
[15] X. W. Zhang and Z. Li, "Chinese electronic medical record named entity recognition based on multi-feature fusion," Software Guide, vol. 16, no. 2, pp. 128–131, 2017.
[16] Y. B. Xia, J. L. Zhen, and Y. F. Zhao, "Electronic medical record named entity recognition based on deep learning," Electronic Science and Technology, vol. 31, no. 11, p. 31, 2018.
[17] F. Li, M. Zhang, B. Tian, B. Chen, G. Fu, and D. Ji, "Recognizing irregular entities in biomedical text via deep neural networks," Pattern Recognition Letters, vol. 105, pp. 105–113, 2018.
[18] Z. J. Liu, M. Yang, X. L. Wang et al., "Entity recognition from clinical texts via recurrent neural networks," BMC Medical Informatics and Decision Making, vol. 17, no. 2, p. 67, 2017.
[19] J. F. Yang, Y. Guan, B. He et al., "Chinese electronic medical record named entity and entity relationship corpus construction," Journal of Software, vol. 27, no. 11, pp. 2725–2746, 2016.
[20] Z. H. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," 2015, https://arxiv.org/abs/1508.01991.
[21] Q. K. Wei, T. Chen, R. F. Xu, Y. He, and L. Gui, "Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks," Database, vol. 2016, 2016.
[22] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," 2017, https://arxiv.org/abs/1706.03762.
