Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences


Publisher: Oxford University Press
Copyright: Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions
ISSN: 1067-5027
eISSN: 1527-974X
DOI: 10.1136/amiajnl-2013-001810
PMID: 23907286

Abstract

Objective: To develop, evaluate, and share (1) syntactic parsing guidelines for clinical text, with a new approach to handling ill-formed sentences, and (2) a clinical Treebank annotated according to the guidelines; and to document the process and findings for readers with similar interests.

Methods: Using random samples from a shared natural language processing challenge dataset, we developed a handbook of domain-customized syntactic parsing guidelines through iterative annotation and adjudication between two institutions. Special considerations were incorporated into the guidelines for handling ill-formed sentences, which are common in clinical text. Intra- and inter-annotator agreement rates were used to evaluate consistency in following the guidelines. We report quantitative and qualitative properties of the annotated Treebank, as well as its use to retrain a statistical parser.

Results: A supplement to the Penn Treebank II guidelines was developed for annotating clinical sentences. After three iterations of annotation and adjudication on 450 sentences, the annotators reached an inter-annotator F-measure agreement rate of 0.930 (intra-annotator rate: 0.948) on a final independent set. A total of 1100 sentences from progress notes, exhibiting domain-specific linguistic features, were annotated. A statistical parser retrained on combined general-English (mainly news text) annotations and our annotations achieved an accuracy of 0.811, higher than models trained on either general or clinical sentences alone. Both the guidelines and the syntactic annotations are available at https://sourceforge.net/projects/medicaltreebank.

Conclusions: We developed guidelines for parsing clinical text and annotated a corpus accordingly. The high intra- and inter-annotator agreement rates show good consistency in following the guidelines. The corpus proved useful for retraining a statistical parser, which achieved moderate accuracy.
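
The agreement figures above are bracketing F-measures in the PARSEVAL style: each annotator's parse is reduced to its set of labeled constituent spans, and precision and recall are computed between the two span sets. Below is a minimal sketch of that computation in Python with NLTK trees; note that standard EVALB scoring additionally ignores part-of-speech preterminals and punctuation, which this sketch does not, and the FRAG bracketing shown is a generic Penn Treebank II illustration of a verbless clinical fragment, not the paper's own convention.

    from nltk import Tree

    def labeled_spans(tree):
        """Collect (label, start, end) for every constituent in a parse."""
        spans = set()
        def walk(node, start):
            if isinstance(node, str):   # leaf token
                return start + 1
            end = start
            for child in node:
                end = walk(child, end)
            spans.add((node.label(), start, end))
            return end
        walk(tree, 0)
        return spans

    def bracketing_f1(tree_a, tree_b):
        """Labeled bracketing F-measure between two parses of one sentence."""
        a, b = labeled_spans(tree_a), labeled_spans(tree_b)
        matched = len(a & b)
        if matched == 0:
            return 0.0
        precision, recall = matched / len(a), matched / len(b)
        return 2 * precision * recall / (precision + recall)

    # Two annotators bracket the same verbless note fragment differently.
    t1 = Tree.fromstring("(FRAG (NP (NN Abdomen)) (ADJP (JJ soft)))")
    t2 = Tree.fromstring("(FRAG (NP (NN Abdomen) (JJ soft)))")
    print(bracketing_f1(t1, t2))   # 3 of 5 vs 4 spans match: F1 = 0.667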
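
The retraining result reflects a standard recipe: pool the general-English and clinical treebank annotations into one training set so the parser learns from both registers. The paper retrained a full statistical parser; the following is only a toy sketch of the pooling idea using NLTK's PCFG machinery, with hypothetical file names and formats.

    from nltk import Nonterminal, Tree, ViterbiParser, induce_pcfg

    def read_trees(path):
        """Read bracketed parses, one tree per line (hypothetical file layout)."""
        with open(path) as f:
            return [Tree.fromstring(line) for line in f if line.strip()]

    general = read_trees("wsj_trees.txt")        # general English (news) parses
    clinical = read_trees("clinical_trees.txt")  # clinical Treebank parses

    # Pool productions from both domains; induce_pcfg estimates rule
    # probabilities from their relative frequencies in the pooled set.
    productions = []
    for tree in general + clinical:
        productions.extend(tree.productions())
    grammar = induce_pcfg(Nonterminal("S"), productions)

    # Parse a held-out clinical sentence with the combined-domain grammar
    # (assumes every token was seen in training; real parsers smooth unknowns).
    parser = ViterbiParser(grammar)
    for parse in parser.parse("Patient denies pain".split()):
        print(parse)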

Journal

Journal of the American Medical Informatics Association (Oxford University Press)

Published: Nov 1, 2013
