Active Sampling for Entity Matching with Guarantees

Kedar Bellare, Yahoo Research and Facebook Inc.
Suresh Iyengar, Yahoo Research and Microsoft Research Lab India
Aditya Parameswaran, Yahoo Research and Stanford University
Vibhor Rastogi, Yahoo Research and Google Inc.

In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or nonduplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more nonduplicate pairs than duplicate pairs). To address this, a recent paper [Arasu et al. 2010] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case. Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sublinear label complexity (under certain distributional assumptions). Our algorithm uses as a black box any active learning module that minimizes 0-1 loss. We show
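The abstract's point about class imbalance, and the objective of maximizing recall subject to a precision constraint, can be illustrated with a small sketch. This is not the paper's algorithm: it is a brute-force search over thresholds on a made-up similarity score, and it reads every label, which is exactly the label cost the paper aims to avoid. The names `evaluate` and `best_threshold`, the toy scores, and the 0.85 precision bound are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float


def evaluate(labels, predictions):
    """Compare boolean predictions to boolean labels (True = duplicate pair)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    accuracy = (tp + tn) / len(pairs)
    # Convention assumed here: precision and recall are 1.0 when undefined.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return Metrics(accuracy, precision, recall)


def best_threshold(scores, labels, min_precision):
    """Exhaustively pick the score threshold with the highest recall
    among those whose precision is at least min_precision."""
    best = None
    for t in sorted(set(scores)):
        m = evaluate(labels, [s >= t for s in scores])
        if m.precision >= min_precision and (best is None or m.recall > best[1].recall):
            best = (t, m)
    return best


# Toy data (hypothetical): 10 duplicate pairs among 1000 candidate pairs.
dup_scores = [0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.6, 0.4]
nondup_scores = [0.8] + [0.6] * 4 + [0.1] * 985
scores = dup_scores + nondup_scores
labels = [True] * len(dup_scores) + [False] * len(nondup_scores)

# A classifier that calls every pair a nonduplicate has low 0-1 loss
# but finds no duplicates at all.
trivial = evaluate(labels, [False] * len(labels))
print(f"all-nonduplicate: accuracy={trivial.accuracy:.2f}, recall={trivial.recall:.2f}")

threshold, chosen = best_threshold(scores, labels, min_precision=0.85)
print(f"threshold={threshold}: precision={chosen.precision:.3f}, recall={chosen.recall:.2f}")
```

On this toy data the all-nonduplicate classifier reaches 99% accuracy with zero recall, which is why 0-1 loss is a poor target here. The threshold search settles on 0.8, the loosest cutoff that still keeps precision above the bound.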

Publisher: Association for Computing Machinery
Copyright: © 2013 by ACM Inc.
ISSN: 1556-4681
DOI: 10.1145/2500490

Journal

ACM Transactions on Knowledge Discovery from Data (TKDD), Association for Computing Machinery

Published: Sep 1, 2013
