Corpus-based Statistical Screening for Phrase Identification

Won Kim; W. John Wilbur

doi:10.1136/jamia.2000.0070499

Publisher: Oxford University Press
Copyright: American Medical Informatics Association
ISSN: 1067-5027
eISSN: 1527-974X
DOI: 10.1136/jamia.2000.0070499

Abstract

Purpose: The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists with a high percentage of useful material.

Method: The approach is to develop six different scoring methods that are based on different aspects of phrase occurrence. The emphasis here is not on lexical information or syntactic structure but rather on the statistical properties of word pairs and triples that can be obtained from a large database.

Measurements: The Unified Medical Language System (UMLS) incorporates a large list of humanly acceptable phrases in the medical field as a part of its structure. The authors use this list of phrases as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings.

Result: The authors find that each of the six scoring methods proves effective in identifying UMLS-quality phrases in a large subset of MEDLINE. These methods are applicable both to word pairs and word triples. All six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases.

Conclusion: Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.
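
Illustration: The 11-point average precision measure named under Measurements can be sketched as below. This is a minimal sketch, not the authors' code. It assumes the common information-retrieval definition (interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0), which the abstract does not spell out, and it assumes that recall is counted over the gold-standard phrases that actually occur among the ranked candidates. The function name and the sample phrases are hypothetical.

```python
def eleven_point_average_precision(relevance):
    """Interpolated 11-point average precision for one ranking.

    relevance: list of booleans in rank order (best-scored candidate first);
    True means the candidate phrase is in the gold standard.
    Assumption: recall is measured against the relevant items present in
    this list, not against an external total.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    # (recall, precision) measured at each rank where a relevant item occurs.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at a recall level is the highest precision
    # observed at any recall greater than or equal to that level.
    interpolated = []
    for i in range(11):
        level = i / 10
        reachable = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(reachable) if reachable else 0.0)

    return sum(interpolated) / 11


# Hypothetical example: candidate phrases ordered by some score, and a
# stand-in gold-standard set playing the role of the UMLS phrase list.
ranked_candidates = ["blood pressure", "of the", "heart failure", "in patients"]
gold_standard = {"blood pressure", "heart failure"}
flags = [phrase in gold_standard for phrase in ranked_candidates]
print(eleven_point_average_precision(flags))
```

A scoring method that places more gold-standard phrases near the top of the ranking yields a value closer to 1.0, which matches the abstract's description of a good method.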

Journal

Journal of the American Medical Informatics Association – Oxford University Press

Published: Sep 1, 2000
