Chinese Named Entity Recognition Based on Knowledge Based Question Answering System
Yin, Didi;Cheng, Siyuan;Pan, Boxu;Qiao, Yuanyuan;Zhao, Wei;Wang, Dongyu
2022-05-26 00:00:00
Applied Sciences, Article

Didi Yin 1, Siyuan Cheng 1, Boxu Pan 2, Yuanyuan Qiao 2, Wei Zhao 1 and Dongyu Wang 2,*

1 State Grid Hebei Information & Telecommunication Branch, Shijiazhuang 050013, China; xtgs_yindd@he.sgcc.com.cn (D.Y.); xtgs_chengsy@he.sgcc.com.cn (S.C.); xtgs_zhaow@he.sgcc.com.cn (W.Z.)
2 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China; pbx@bupt.edu.cn (B.P.); yyqiao@bupt.edu.cn (Y.Q.)
* Correspondence: dy_wang@bupt.edu.cn; Tel.: +133-1105-0257

Citation: Yin, D.; Cheng, S.; Pan, B.; Qiao, Y.; Zhao, W.; Wang, D. Chinese Named Entity Recognition Based on Knowledge Based Question Answering System. Appl. Sci. 2022, 12, 5373. https://doi.org/10.3390/app12115373

Academic Editors: Min Yang, Qingshan Jiang and John (Junhu) Wang

Received: 11 May 2022; Accepted: 24 May 2022; Published: 26 May 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: The KBQA (Knowledge-Based Question Answering) system is an essential part of the smart customer service system. KBQA is a type of QA (Question Answering) system based on a KB (Knowledge Base). It aims to automatically answer natural language questions by retrieving structured data stored in the knowledge base. Generally, when a KBQA system receives a user's query, it first needs to recognize the topic entities of the query, such as names, locations and organizations. This process is called NER (Named Entity Recognition). In this paper, we use the Bidirectional Long Short-Term Memory-Conditional Random Field (Bi-LSTM-CRF) model and introduce the SoftLexicon method for a Chinese NER task. In addition, based on an analysis of the characteristics of the application scenario, we propose a fuzzy matching module that combines multiple methods. This module can efficiently correct erroneous recognition results, which further improves the performance of entity recognition. We combine the NER model and the fuzzy matching module into an NER system. To explore the applicability of the system in specific fields, such as the power grid field, we use power-grid-related original data collected by the Hebei Electric Power Company to adapt our system to the characteristics of data in that field. We build a dataset and a high-frequency word lexicon for the power grid field, which makes the proposed NER system better at recognizing power grid entities. We used cross-validation for validation. The experimental results show that the F1-score of the improved NER model on the power grid dataset reaches 92.43%. After the recognition results are processed by the fuzzy matching module, about 99% of the entities in the test set can be correctly recognized. This shows that the proposed NER system can achieve excellent performance in the power grid application scenario. The results of this work also help fill the gap in research on intelligent customer-service-related technologies in the power grid field in China.

Keywords: named entity recognition; knowledge-based question answering; power grid; smart customer service system; SoftLexicon; BERT model; word embedding

1. Introduction

Alan Mathison Turing put forward the concept of the "Turing test" in his famous paper "Computing Machinery and Intelligence", published in 1950, to judge whether a machine interacting with a human exhibits human intelligence [1]. Although a system that completely passes the "Turing test" is still far away, some high-performance human-computer interaction products have appeared on the market, such as Siri [2], ALIME [3] and NetEase Qiyu [4]. The KBQA system is the core component of these products. Generally, these voice assistants or intelligent customer services first use the NER module to extract the topic entity, such as a proper noun, a person's name or the name of an organization, from the text or voice input by the user. The retrieval is then built around this extracted entity to generate the answer to the user's query. Therefore, the accuracy of the NER module largely determines the performance of the whole system [5].

Traditional NER methods rely on hand-crafted features and struggle to mine deep semantic information, which leads to poor NER model performance when an OOV (Out of Vocabulary) problem occurs [6]. In recent years, with the continuous development of deep learning technology, DL-based (Deep Learning-based) NER methods have gradually become mainstream. Compared to traditional methods, a DL-based NER model can learn more semantic features through complex non-linear transformations. In 2008, Collobert and Weston proposed a deep NN (Neural Network) architecture that introduced neural networks into the NER task for the first time [7]. The large-scale application of this technology has greatly improved the accuracy and efficiency of NER. In 2013, Mikolov et al. [8] proposed the well-known CBOW (Continuous Bag of Words) and Skip-gram models, which led to the widespread adoption of such word-embedding methods in NLP (Natural Language Processing) tasks. Huang et al. [9] proposed a Bi-LSTM-CRF model, which is robust and depends less on word embeddings. This model produced SOTA (State Of The Art) accuracy on several NLP tasks. In 2017, the Google team proposed the Transformer model [10], which made extensive use of self-attention mechanisms to learn text representations with spectacular results. This mechanism has been widely used in NER tasks since then.

Compared to English NER tasks, Chinese NER is more difficult, since Chinese sentences are not naturally segmented by spaces as English sentences are. For Chinese NER, a common practice is to use a CWS (Chinese Word Segmentation) tool for word segmentation before applying word sequence labeling. However, the CWS tool cannot guarantee that all segmentations are correct, and these potential errors can greatly affect the performance of the NER model. Researchers have proposed several effective methods to address this problem. Zhang and Yang (2018) [11] proposed the Lattice-LSTM model, which makes full use of both character information and word information in Chinese text. This model reduces segmentation errors through lexicon matching. In 2020, Qiu's team proposed the FLAT (Flat-Lattice Transformer) [12] model to overcome the disadvantages introduced by the use of an RNN (Recurrent Neural Network) in the Lattice-LSTM. In the same year, Peng et al. [13] proposed the SoftLexicon model based on the Lattice-LSTM. The SoftLexicon model simplifies the complicated architecture of the Lattice-LSTM, making it easier to deploy to various downstream NLP tasks without affecting the structure of the original system frameworks.

In this work, we focus on the implementation of the Chinese NER task in the KBQA scenario. We choose the commonly used Bi-LSTM-CRF model to construct the main part of the NER model. At the same time, considering the negative effect of incorrect word segmentation, we apply the SoftLexicon model to mitigate it.
However, the SoftLexicon model has so far been shown to perform well only on MSRA [14], Weibo [15], OntoNotes [16] and other public domain datasets. Whether it can be applied to a domain-specific NER task needs to be verified. Therefore, in addition to the open-source Chinese public domain dataset NLPCC2016 [17], we also use the power-grid-related dataset collected by the Hebei Electric Power Company for experiments. The results show that the performance of the SoftLexicon-based NER model can be improved by expanding the matching lexicon with the high-frequency words of the application domain.

In a typical KBQA system, the entities recognized by the NER model are used to establish a connection between the knowledge base and the user's query. If the recognized entities cannot be retrieved in the knowledge base, the knowledge-base-based queries cannot be carried out. Therefore, a complete KBQA system should provide fuzzy matching so that incorrectly recognized entities can still be linked to their relevant information in the knowledge base. Some researchers realize this function by introducing deep learning methods [18,19]. For example, Francis-Landau et al. [20] proposed a method that uses CNNs (Convolutional Neural Networks) to learn vector representations of text and then uses cosine similarity scores between candidate entity vectors and text vectors to perform the matching. In this work, we regard fuzzy matching as a continuation of NER. Based on an analysis of the NER model's recognition results and the specific application scenarios, we propose a fuzzy matching module that combines multiple methods. This module combines hand-crafted rules with deep learning methods and can efficiently recall, from the knowledge base, the set of entities that are highly similar to the entity to be matched.
The user or system can select an entity from the set to correct the recognition result of the NER model, thereby improving the performance of the named entity recognition module.

The contributions of this work can be summarized as follows:
• We verify the performance of the SoftLexicon+Bi-LSTM-CRF model in NER tasks under a KBQA scenario. Moreover, we explore the applicability of this model in a non-public field, the power grid field, and improve the SoftLexicon method according to the application domain.
• To improve the performance of NER, we propose an efficient fuzzy matching module, based on a combination of multiple methods, that can correct entities incorrectly recognized by the NER model. This module can be easily deployed in a KBQA system and is portable and robust.
• We build a dataset and a lexicon for the power grid field based on the data provided by the Hebei Electric Power Company and use them to construct an NER system suitable for that field.

The experimental results show that the precision of the improved SoftLexicon+Bi-LSTM-CRF model on the power grid dataset is 91.24%, the recall is 93.65% and the F1-score is 92.43%. After the recognition results are processed by the fuzzy matching module, about 99% of the entities in the test dataset can be correctly recognized. These results show that the system achieves good performance on the NER task in the power grid field. Moreover, this system, including the power grid dataset and the power grid lexicon we produced, can be applied to the construction of the KBQA module of a power-grid-related smart customer service system.

2. Related Works

2.1. Power Grid Intelligent Customer Service System

As mentioned in the Introduction, due to the characteristics of Chinese, the development of Chinese intelligent customer service systems started late.
At present, in the public domains of finance, e-commerce and education, some excellent Chinese intelligent customer service systems, such as Tencent Qidian [21] and Sobot [22], have emerged. According to statistics, the market size of China's intelligent customer service industry reached CNY 78.8 billion in 2019, an increase of about 10.06% compared with 2018 [23]. More than 70% of companies have applied Chinese intelligent customer service systems to serve more than 100 million customers. However, the construction of intelligent customer service systems in some non-public domains in China is still at an initial stage. Taking the power grid field as an example, according to the report provided by the Hebei Electric Power Company, the customer service hotline of the power grid company is generally busy during peak hours, and calls from users cannot be connected. Introducing an intelligent customer service system can effectively alleviate these problems of the current manual customer service hotline.

Due to the limited application scenarios, research on power grid intelligent customer service systems currently focuses mainly on overviews [24,25], system design [26] and technical investigations [27,28]. For key modules in the system, such as the NER module, no specific technical solutions have been proposed, and few experiments have been carried out using data related to the Chinese power grid field because there are no open-source datasets. We convert the original power grid data collected by the Hebei Electric Power Company into a format that can be used for deep learning training and extract a lexicon of high-frequency words related to the power grid field. We use these data to fine-tune the NER model so that it can be applied to NER tasks in the power grid field.

2.2. Text Matching

The fuzzy matching module proposed in this paper is built on text matching algorithms. The purpose of text matching is to determine whether a pair of texts is semantically similar. Traditional text matching algorithms, such as TF-IDF (Term Frequency-Inverse Document Frequency) [29], BM25 [30], Edit Distance [31] and Simhash [32], are mainly unsupervised. For instance, the TF-IDF algorithm computes a TF-IDF weight by evaluating the importance of a word to a document and judges the similarity between texts by calculating cosine similarity. This algorithm is efficient and interpretable, but the matching result is not ideal for complex texts because it only considers word frequency and does not capture deep semantic information.

In recent years, with the development of deep learning technology, more and more researchers have chosen deep learning models to solve text matching tasks. Deep learning models excel at using context information to mine deep semantic information, which can effectively address the shortcomings of traditional algorithms [33]. For example, Arora et al. [34] proposed the unsupervised SIF (Smooth Inverse Frequency) method in 2016. This method represents a sentence vector as the weighted average of word vectors generated by GloVe [35], Word2Vec [8] or other methods and then adjusts these sentence vectors using PCA (Principal Component Analysis). The resulting sentence vectors can be used for text classification, text matching and other tasks. Moreover, the tBERT model proposed by Peinelt et al. (2019) [36] predicts semantic similarity by combining semantic features extracted by the BERT model with topic features extracted by a topic model. This model performs particularly well in domain-specific cases.

In this work, we consider the following three characteristics of our setting:
• The KBQA system needs to answer queries efficiently.
• For the data used in this work, entities are relatively short, normally no more than 10 characters.
• The query text input by the user in a KBQA system generally does not contain contextual information.

Based on the above analysis, we choose the unsupervised Word2Vec method to implement the fuzzy matching module. We generate word embeddings with the Word2Vec algorithm and obtain the sentence embedding of each entity by averaging the embeddings of its words. Then, we calculate the cosine similarity between these sentence embeddings for matching.

In addition, from an analysis of the NER model's recognition results, we find that even entities that are not correctly recognized usually contain some useful information. In a KBQA system in particular, the query sentence input by the user is relatively short, and the topic entity of the sentence is generally unique, so the recognition results tend to carry rich useful information. For example, a user enters the query "贝拉克·奥巴马的妻子是谁？" (Who is Barack Obama's wife?). The topic entity extracted by the NER model is "贝拉克·奥巴马" (Barack Obama). However, the knowledge base only stores information about the topic entity "贝拉克·侯赛因·奥巴马" (Barack Hussein Obama). Although the two entities are very similar, they cannot be matched exactly because they differ by one word (侯赛因, "Hussein"). For such entities, the traditional Edit Distance algorithm achieves better results in similarity comparison. Therefore, in this work, we combine a DL-based text matching algorithm with a traditional text matching algorithm to construct the fuzzy matching module. Moreover, we add hand-crafted rules tailored to the application scenario to improve the efficiency of fuzzy matching.

3. Model and Approach

In this work, we use an improved SoftLexicon+Bi-LSTM-CRF model to complete the NER task and propose a fuzzy matching module, based on the fusion of multiple methods, to correct incorrectly recognized entities. The Bi-LSTM-CRF model is a commonly used deep learning model in Chinese NER tasks. It can capture bidirectional long-term dependencies in a sequence and has a strong nonlinear fitting ability, which makes it suitable for modeling complex tasks. Compared with the traditional HMM (Hidden Markov Model), the MEMM (Maximum Entropy Markov Model) and other machine learning methods, the Bi-LSTM-CRF model has better performance and robustness [37,38]. The SoftLexicon method makes full use of word and word sequence information while avoiding the negative effect of word segmentation errors. It can be implemented simply by modifying the character representation layer of existing NER models such as the Bi-LSTM-CRF model. In this way, the performance of the Bi-LSTM-CRF model can be effectively improved on some tasks. We also make improvements to this method for our specific application scenarios.

However, no NER model can achieve 100% recognition accuracy. In the KBQA application scenario, the recognition results of an NER model always contain some entities that cannot be retrieved in the knowledge base for various reasons. As a result, the user's query cannot be connected with the knowledge base, and subsequent work such as entity disambiguation and relationship recognition cannot be carried out. For these entities, we propose a fuzzy matching module suitable for the KBQA system. This module improves the performance of the NER system by replacing these entities with highly similar entities that can be retrieved in the knowledge base. We first generate a candidate entity set using hand-crafted rules to reduce the matching scope.
Then, we calculate the similarity with a text matching method based on the Word2Vec algorithm and with the Edit Distance algorithm, respectively, and select the two entities with the highest similarity from each result to form the final entity set. The user or system can select a topic entity from this set to correct the incorrect recognition result of the NER model. According to the experimental results, for the two datasets used in this work, the accuracy of entity recognition improves to about 99% after processing with the module.

We combine the NER model with the fuzzy matching module into an NER system. The system reads the query sentence and identifies the topic entity of the query. The overall architecture of the system is shown in Figure 1.

Figure 1. The overall architecture of the NER system.

First, we preprocess the obtained data and use them to generate a simple knowledge base. The processed data are input into the NER model for training and prediction. We then try to retrieve the entities recognized by the NER model in the knowledge base (exact matching). For recognized entities that cannot be retrieved from the knowledge base (wrong recognition in Figure 1), the fuzzy matching module generates an entity set and provides it to the user or the system for selection. The selected entity becomes the topic entity of the query.

3.1. The NER Model

3.1.1. SoftLexicon Method

SoftLexicon was proposed to solve a long-standing problem of the Chinese NER task. When performing an NER task, words in the context need to be labeled first. An English sentence can be segmented by spaces, while a Chinese sentence cannot be segmented directly due to the characteristics of the language. Therefore, researchers tend to use a CWS (Chinese Word Segmentation) tool for word segmentation first and then use a word-based sequence labeling model to label the segmented sentence [13].
In this process, due to the complexity of Chinese, it is impossible for the CWS tool to segment every sentence precisely. These segmentation errors affect subsequent model training and entity prediction. To avoid these problems, some researchers conduct Chinese NER at the character level [39], but this discards the latent word information contained in the sentence. Therefore, Zhang and Yang (2018) integrated latent word information into a character-based NER model and proposed the Lattice-LSTM model, which can make full use of word and word sequence information without suffering from segmentation errors. This model achieved SOTA results on some datasets at the time. However, model training is slowed down by the insertion of additional word information into the input sequence. Moreover, the special structure of the Lattice-LSTM is difficult to transfer to other neural network models [11]. Therefore, Ma et al. (2020) simplified the architecture of the model and proposed the SoftLexicon model. Compared to the Lattice-LSTM, the SoftLexicon method has faster inference speed and is easier to implement.

The SoftLexicon method operates directly on the sequence representation layer of the NER model. For example, if the text is labeled with the "BMES" labeling scheme, the lexicon words matched by each character c in a sentence are classified into four word sets: "B" (Begin), "M" (Middle), "E" (End) and "S" (Single). If no lexicon word matches for one of these sets, the special symbol "None" is filled into that word set (Figure 2). The process can be formulated as follows:

$$
\begin{aligned}
B(c_i) &= \{\, w_{i,k},\ \forall w_{i,k} \in L,\ i < k \le n \,\} \\
M(c_i) &= \{\, w_{j,k},\ \forall w_{j,k} \in L,\ 1 \le j < i < k \le n \,\} \\
E(c_i) &= \{\, w_{j,i},\ \forall w_{j,i} \in L,\ 1 \le j < i \,\} \\
S(c_i) &= \{\, c_i,\ \exists\, c_i \in L \,\}
\end{aligned}
\tag{1}
$$

For a sentence $s = \{c_1, c_2, \ldots, c_n\}$, $c_i$ in Formula (1) denotes the $i$-th character of $s$, $w_{i,k}$ denotes the sub-sequence of $s$ from $c_i$ to $c_k$, and $L$ is the lexicon used in the experiment.

Figure 2. An example of generating the "BMES" word set [13]. The input sentence is "郑州黄河南路" (Zhengzhou Huanghe South Road). The light orange boxes are the "BMES" word sets of the characters "州" (state) and "河" (river).

After obtaining the word sets of each character, each word set is compressed into a fixed-dimension vector by pooling, with the frequency of each word in the dataset used as its weight:

$$
v^{s}(B) = \frac{4}{Z} \sum_{w \in B} z(w)\, e^{w}(w)
\tag{2}
$$

Here, $v^{s}(B)$ is the vector representation of the word set $B$, and $z(w)$ denotes the occurrence frequency of the matched word $w$ in the dataset. The symbol $e^{w}$ is the word embedding lookup table [40], which provides the pre-trained word embeddings. $Z$ is the sum of all $z(w)$.

Then, the word set vectors $[v^{s}(B), v^{s}(M), v^{s}(E), v^{s}(S)]$ of each character $c$ and its character embedding $h^{c}$ obtained from the language model are concatenated to form the final representation $e^{c}$ of the character (Formula (3)). These representations can be input into the Bi-LSTM layer for training or prediction.

$$
e^{c} = [\,h^{c};\ v^{s}(B),\ v^{s}(M),\ v^{s}(E),\ v^{s}(S)\,]
\tag{3}
$$

3.1.2. The Bi-LSTM-CRF Model

In this paper, we apply a Bi-LSTM to carry out sequence encoding. A Bi-LSTM is a bidirectional LSTM network, which can take full advantage of long-term dependencies from both directions of the input. The LSTM network is composed of several memory cells (Figure 3).

Figure 3. The general structure of the LSTM memory cell.

The memory cell is implemented as follows:

$$
\begin{aligned}
i_t &= \sigma(W_i [h_{t-1}, X_t] + b_i) \\
f_t &= \sigma(W_f [h_{t-1}, X_t] + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, X_t] + b_c) \\
o_t &= \sigma(W_o [h_{t-1}, X_t] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{4}
$$

where $i$, $f$, $c$ and $o$ are the input gate, forget gate, cell state and output gate, respectively; $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication; $W$ and $b$ are the weight matrices and bias vectors; $X_t$ is the input at step $t$; and $h$ is the hidden state vector. The output of each LSTM layer can be represented as $H = \{h_t \mid t = 1, 2, 3, \ldots, n\}$, where $n$ is the length of the input sequence. The hidden state sequences of the forward LSTM and the backward LSTM are concatenated to form the final sequence representation $H = [H_f, H_b]$ [41].

For the last layer of the NER model, we add a CRF (Conditional Random Field) layer to predict the labels of the input sequence. CRF combines the characteristics of MEMM and HMM [42] and is a typical sequence labeling algorithm. In this work, the Bi-LSTM layer processes the character representations of the input sequence generated by the SoftLexicon method, and the CRF layer predicts the labels from the output of the Bi-LSTM layer.

3.1.3. SoftLexicon+Bi-LSTM-CRF Model

In this work, we make some improvements to the SoftLexicon+Bi-LSTM-CRF model according to the specific application scenario. Since the SoftLexicon method directly adjusts the sequence representation layer of the NER model, the structure of the Bi-LSTM-CRF model does not need special changes. The input sentences are processed by the sequence representation layer using the SoftLexicon method, and the generated character representations are fed into the Bi-LSTM-CRF model to obtain the prediction results. The overall structure is shown in Figure 4.

Figure 4. The overall architecture of the SoftLexicon+Bi-LSTM-CRF model. (The SoftLexicon method is used to generate word embeddings at the sequence representation layer of the model.)

We use the commonly used BERT (Bidirectional Encoder Representations from Transformers) model to generate character embeddings [43]. The BERT model was pre-trained on a large corpus. In this work, we only need to fine-tune it on the corresponding labeled data (NLPCC2016 and the power grid dataset). Meanwhile, we use the SoftLexicon method to generate word embeddings.
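As a minimal sketch of the word-set matching in Formula (1), the Python function below collects the B, M, E and S sets for every character of a sentence. The function name `bmes_word_sets`, the toy lexicon and the brute-force scan over all sub-sequences are illustrative assumptions, not the implementation used in the experiments; a production system would more likely look words up through a trie over a much larger lexicon.

```python
from typing import Dict, List, Set


def bmes_word_sets(sentence: str, lexicon: Set[str]) -> List[Dict[str, Set[str]]]:
    """Return, for each character, the lexicon words that begin at it (B),
    pass through it (M), end at it (E), or consist of it alone (S)."""
    n = len(sentence)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]

    # Enumerate every sub-sequence w_{start,end} and keep those found in the lexicon.
    for start in range(n):
        for end in range(start, n):
            word = sentence[start:end + 1]
            if word not in lexicon:
                continue
            if start == end:
                sets[start]["S"].add(word)
                continue
            sets[start]["B"].add(word)
            sets[end]["E"].add(word)
            for middle in range(start + 1, end):
                sets[middle]["M"].add(word)

    # An empty word set is filled with the special symbol "None".
    for char_sets in sets:
        for tag in ("B", "M", "E", "S"):
            if not char_sets[tag]:
                char_sets[tag].add("None")
    return sets


# Toy lexicon (hypothetical) for the sentence used in Figure 2.
toy_lexicon = {"郑州", "黄河", "河南", "南路", "黄河南路", "河"}
for char, char_sets in zip("郑州黄河南路", bmes_word_sets("郑州黄河南路", toy_lexicon)):
    print(char, char_sets)
```

The resulting sets would then be pooled with Formula (2) and concatenated with the character embedding as in Formula (3).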
In the original SoftLexicon paper, the matching lexicon L is trained on the Chinese Gigaword Fifth Edition corpus [44]. The data sources of this corpus include newswire from the Xinhua News Agency, articles from Sinorama Magazine and news from the website of the Hong Kong Special Administrative Region, among others. In other words, the content of the corpus comes from the public domain. The NLPCC2016 dataset used in this work is a Chinese open domain question answering dataset, and its corpus is also from the public domain. Therefore, we can continue to use this matching lexicon in the NLPCC2016 experiments. However, the other dataset used in this paper, the power grid dataset, is domain-specific. Because this dataset contains a large number of power-grid-related proper nouns, using the above matching lexicon unchanged would cause some proper nouns not to be recognized correctly. Therefore, we clean and screen the data in the power grid database to obtain high-frequency proper nouns related to the power grid domain. These high-frequency words are used to expand the matching lexicon. We also use the Word2Vec method to generate word embeddings of these high-frequency words and add them to the word embedding lookup table.

3.2. Fuzzy Matching Module

Although more than 90% of the topic entities in questions can be recognized correctly by the proposed model, we find that there are some simple but effective ways to further improve its accuracy. Taking the NLPCC2016 dataset as an example, we divide the causes of topic entity recognition errors into three main cases:

1. The topic entity cannot be recognized normally due to some rarely used Chinese characters. The BERT model relies on a vocabulary when encoding tokens [45].
For those tokens that cannot be retrieved in the vocabulary (rarely used characters), they will be replaced by a special identifier “[UNK]”. The original tokens’ information will be discarded, making it impossible to exactly identify the topic entity. Although we can avoid some errors by expanding the vocabulary, it is impossible to add every rarely used Chinese character to the vocabulary in practical application. 2. The recognized topic entity cannot be linked to the entity in the knowledge base due to typos in user input. For KBQA-related systems, it is a common phenomenon that the user input contains typos. This leads to the fact that even if the entity boundary in the sentence is correctly demarcated, the extracted topic entity cannot exactly link to the knowledge base. 3. The NER model incorrectly recognizes the entity. No NER model can achieve 100% entity recognition accuracy. Inevitably, there will be some errors in recognition results due to the error of the model. Aiming at the above three kinds of topic entity recognition errors, we propose a fuzzy matching module to correct them. The construction of the module mainly relies on the following two features: (1) According to the analysis of the experimental results of the NER model, we find that most of the incorrect entities contain some effective information in either of the above cases. For example, consider the question “`åS-ýå
/ÀHöú°”? (Do you know when Chinese Kung Fu came into being?). The user wants to ask about “Chinese Kung Fu” but types “Chinese Uniform” when entering the query. Our model can extract the topic entity “中国工服” (Chinese uniform) from the sentence, but the intended word is “中国功夫” (Chinese kung fu). Because of the typo, the recognized entity cannot be retrieved in the knowledge base. However, the word “中国” (Chinese) is an important part of the correct word “中国功夫” (Chinese kung fu), and we can use this information to construct a fuzzy matching module.

(2) In the application scenario of the KBQA system, the scope of the user's query is limited. In other words, the user can only query content that is already stored in the knowledge base. This means that, when matching, the entity to be matched is obtained from the user's query, while the match targets can only be entities that already exist in the knowledge base. Because the matching scope is fixed, we can improve the efficiency of fuzzy matching by constructing hand-crafted rules.

Therefore, according to the above two characteristics, we propose a fuzzy matching module based on the combination of multiple methods. With the help of the information in the knowledge base, this module can fix errors in the recognition results of the NER model. In the following part, we introduce the construction method of the fuzzy matching module.

For a relatively large knowledge base, such as the NLPCC2016 dataset (43,063,796 pieces of data and 5,928,836 entities in total), it is unrealistic to match the recognized entity against every entity stored in the knowledge base. Therefore, an entity candidate set needs to be generated through primary screening first. Here, we propose a simple but effective rule-based method to generate the candidate set.

First, to simplify the knowledge base, we use a CWS tool [46] to segment all entities recorded in the knowledge base. In the word segmentation results, repeated words are combined into a set, and the ID numbers of their corresponding triples in the knowledge base are recorded as the ID set. In addition, for matching efficiency, we remove words whose frequency in the segmented results is more than 300 or less than 3. After processing the knowledge base in this way, we obtain a simplified word segmentation dictionary. An example of the process of generating the word segmentation dictionary is shown in Figure 5.

Figure 5. An example of the process of generating the word segmentation dictionary. (The words “卡西尼-惠更斯号” (Cassini–Huygens) and “卡西尼飞船” (Cassini Spacecraft) are entities stored in the knowledge base. [1029963, 5701047] is the ID set of the segmented word “卡西尼” (Cassini), and so on.)

Then, we use the CWS tool to segment the entity to be matched. The words produced by segmentation are matched exactly against the words in the word segmentation dictionary. If a match succeeds, all entities corresponding to the ID numbers recorded for that word in the dictionary are recalled to construct a candidate entity set. For an entity to be matched $e = \{c_1, c_2, c_3, \ldots, c_n\}$, where $c_i$ is a character of the entity $e$:

$$C_{entity} = \sum_{w \in D} \sum_{ID \in IDs} e_{kb} \tag{5}$$

Here $C_{entity}$ is the generated candidate entity set; $w$ is a word of the entity segmentation result $e = \{w_1, w_2, w_3, \ldots, w_n\}$; $D$ is the word segmentation dictionary obtained by processing the knowledge base; $IDs$ denotes the ID set of the matched word recorded in the word segmentation dictionary; and $e_{kb}$ is the entity corresponding to an ID in that set. The summations denote collecting the recalled entities into the set.

Through the above primary screening method, we narrow the matching scope of the fuzzy matching module. All entities stored in the knowledge base that contain partial information of the entity to be matched are extracted to form the candidate entity set. In the following part, we determine the final topic entity by comparing the similarity between the entity to be matched and the entities in the candidate set.
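The primary screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_segment` is a hypothetical stand-in for the CWS tool, the knowledge base is a small hypothetical `ID -> entity` dictionary using the English glosses from Figure 5, and "word frequency" is taken to mean the number of occurrences of a word in the segmentation results.

```python
from collections import defaultdict

# Toy vocabulary for a stand-in segmenter (assumption; the paper uses a CWS tool).
TOY_VOCAB = ["Cassini", "Huygens", "Spacecraft"]


def toy_segment(text):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    vocab = sorted(TOY_VOCAB, key=len, reverse=True)
    words, i = [], 0
    while i < len(text):
        for w in vocab:
            if text.startswith(w, i):
                words.append(w)
                i += len(w)
                break
        else:  # no vocabulary word starts here
            words.append(text[i])
            i += 1
    return words


def build_segmentation_dictionary(kb_entities, segment, min_freq=3, max_freq=300):
    """Map each segmented word to the set of KB IDs whose entity contains it.

    Words occurring more than max_freq or fewer than min_freq times are dropped,
    mirroring the thresholds (300 and 3) stated in the text.
    """
    word_to_ids = defaultdict(set)
    word_freq = defaultdict(int)
    for kb_id, entity in kb_entities.items():
        for word in segment(entity):
            word_to_ids[word].add(kb_id)
            word_freq[word] += 1
    return {w: ids for w, ids in word_to_ids.items()
            if min_freq <= word_freq[w] <= max_freq}


def recall_candidates(query_entity, dictionary, kb_entities, segment):
    """Eq. (5): collect every KB entity that shares a dictionary word with the query."""
    candidates = set()
    for word in segment(query_entity):
        for kb_id in dictionary.get(word, ()):
            candidates.add(kb_entities[kb_id])
    return candidates


# Hypothetical three-entity knowledge base; IDs are made up for illustration.
kb = {1: "Cassini-Huygens", 2: "Cassini Spacecraft", 3: "Saturn"}
# min_freq is lowered to 1 only because the toy KB is tiny.
dictionary = build_segmentation_dictionary(kb, toy_segment, min_freq=1)
candidates = recall_candidates("CassiniHuygens", dictionary, kb, toy_segment)
print(sorted(candidates))
```

With the toy data, the misspelled query entity (missing hyphen) still recalls both entities that share the word "Cassini", while the unrelated entity is never compared. The inverted index is what keeps the later similarity computation restricted to a small candidate set.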
According to the discussion in the Related Work section, we use the Word2Vec method and the Edit Distance algorithm to compute and compare similarity.

For the Word2Vec method, we first use the word embedding averaging method to obtain the sentence embedding of the entity to be matched. For an entity $e = \{w_1, w_2, w_3, \ldots, w_n\}$ that has been segmented using the word segmentation tool, we generate the word embedding of each word $w_i$ according to the word embedding lookup table $e^{w}$. Then we obtain the sentence embedding $S(e)$ of the entity by averaging these word embeddings:

$$S(e) = \frac{1}{n} \sum_{i=1}^{n} e^{w}(w_i) \tag{6}$$

Finally, we calculate the cosine similarity between the sentence embedding of the entity to be matched and the sentence embedding of each entity in the candidate set. In this work, we choose the two entities with the highest similarity in the candidate entity set as the output of the Word2Vec-based matching method.

For the Edit Distance algorithm, we use the FuzzyWuzzy tool to compare similarity [47]. The FuzzyWuzzy tool uses the Edit Distance (the minimum number of single-character edits required to change one word into the other) to measure the difference between sequences. This algorithm is efficient for computing the similarity between short texts. Here, we likewise choose the two entities with the highest similarity in the candidate entity set as the output of the Edit-Distance-based matching method.

Then we combine the entities obtained from the two methods to form a final entity set, merging any entity returned by both methods. When an entity recognition error occurs, this entity set is returned to the user for selection. In this way, the fuzzy matching module can correct incorrectly recognized entities.
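To make the two-branch ranking concrete, the following Python sketch combines embedding averaging with cosine similarity and an edit-distance score, then merges the top two results of each branch. It is a minimal illustration under stated assumptions: the two-dimensional vectors in `lookup` are invented toy values rather than trained Word2Vec embeddings, `segment` is a hypothetical stand-in for the word segmentation tool, words missing from the lookup table are simply skipped, and a hand-written Levenshtein distance normalized by the longer string is used in place of the FuzzyWuzzy tool, whose exact scoring may differ.

```python
import math


def sentence_embedding(words, lookup):
    """Average the word vectors of an entity (Eq. 6); out-of-vocabulary words are skipped."""
    vectors = [lookup[w] for w in words if w in lookup]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def top_k(scored, k=2):
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def fuzzy_match(query, candidates, segment, lookup, k=2):
    """Return the merged, de-duplicated top-k results of both similarity branches."""
    q_vec = sentence_embedding(segment(query), lookup)
    by_vec = []
    for cand in candidates:
        c_vec = sentence_embedding(segment(cand), lookup)
        if q_vec is not None and c_vec is not None:
            by_vec.append((cand, cosine(q_vec, c_vec)))
    by_edit = [(cand, edit_similarity(query, cand)) for cand in candidates]
    merged = []
    for name in top_k(by_vec, k) + top_k(by_edit, k):
        if name not in merged:
            merged.append(name)
    return merged


def segment(text):
    # Hypothetical segmenter for the English toy data: split on hyphens and spaces.
    return text.replace("-", " ").split()


# Invented toy embeddings (assumption), not trained vectors.
lookup = {"Cassini": [1.0, 0.0], "Huygens": [0.0, 1.0],
          "Spacecraft": [0.8, 0.2], "Probe": [0.4, 0.6]}
candidates = ["Cassini-Huygens", "Cassini Spacecraft", "Huygens Probe"]
result = fuzzy_match("Cassini Huygens", candidates, segment, lookup)
print(result)
```

In this toy run the correct entity ranks first in both branches, and each branch contributes one additional, different runner-up. This illustrates why merging the two outputs can return more than two entities and why the combination is more tolerant of errors than either measure alone.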
Instead of returning the entity set to the user, the system can also select the entity with the highest similarity as the topic entity of the query and proceed with the following operations, such as entity disambiguation and entity linking. An example of a query is shown in Figure 6.

Figure 6. The workflow of the fuzzy matching module. The input query asks “Do you know the power consumption of Cassini–Huygens?”. The topic entity extracted by the NER model is “卡西尼惠更斯号” (Cassini–Huygens). However, this entity cannot be retrieved in the knowledge base, so we pass it to the fuzzy matching module for further processing. After processing, the correct entity “卡西尼-惠更斯号” (Cassini–Huygens), which has the highest similarity to it, is included in the candidate results. Three related entities with high similarity are also included in the candidate results, which improves the robustness of the module.

4. Experimental Results and Discussion

4.1. Dataset

In this experiment, we use two knowledge bases as experimental data: the NLPCC2016 dataset and a power grid dataset. A KB is a structured database that contains a collection of triples in the form of (subject, relation, object) [48]. It is the core of the KBQA system. The system retrieves the knowledge base according to the query input by the user and returns the results of the query.

4.1.1. NLPCC2016

The NLPCC2016 dataset was provided by the NLPCC-ICCPOL (Conference on Natural Language Processing and Chinese Computing) in 2016 [17]. The dataset is made up of two parts. The first part is a large knowledge base that stores a total of 5,928,836 entities and 43,063,796 triples of those entities. The second part is the question–answer pair data, which the NLPCC-ICCPOL divided into a training set and a test set. The training set has 14,609 question–answer pairs, and the test set contains 9870 question–answer pairs. An example of the dataset is shown in Table 1.
Table 1. Sample data from the NLPCC2016 dataset. The upper part is the sample of triple in the knowledge base. The second part is the sample of the question–answer pair. ·È ||ýM ||i³å Knowledge Base (Andre || Nationality || Monaco) <question id=4>·È/ê*ý¶ºb (What’s Andre’s nationality) <triple id=4>·È || ýM ||i³å Question–Answer Pairs (Andre || Nationality || Monaco) <answer id=4>i³å (Monaco) The NLPCC2016 training set only provides unlabeled data. Therefore, we need to label the data in the dataset. We remove some invalid data and further divide the data into a training set, a validation set and a test set. Finally, there are 14,480 pieces of data in the training set, 1910 pieces of data in the validation set and 7690 pieces of data in the test set. In this work, we use the “BIO” tagging scheme (the Softlexicon method is also applicable to this tagging method) to label the data: B-begin, I-Inside and O-Outside. 4.1.2. Power Grid Dataset The power grid dataset contains 4630 pieces of data related to the services and systems of power grid companies. Compared to NLPCC2016, the dataset contains some specialist vocabulary related to the power grid field, such as “Q¯” (Wang XunTong), “ERP¢ U” (ERP order) “7ºã5÷” (Electricity price for multiple households) and so on. After screening, we obtained a total of 4523 available data. Since the amount of the data is relatively small, we divide the whole dataset into a training set (3500 pieces of data) and a test set (1023 pieces of data). Then we use the same method for labeling. An example of the original data is in Table 2. Table 2. Sample data from the power grid dataset. The table shows the original power grid data collected by the Hebei Electric Power Company. áoáÐô»(ûß»§èrûß»áoá