Research Article
Domain-independent automatic keyphrase indexing with small training sets
Article first published online: 14 MAR 2008
DOI: 10.1002/asi.20790
© 2008 ASIS&T
Issue

Journal of the American Society for Information Science and Technology
Volume 59, Issue 7, pages 1026–1040, May 2008
Additional Information
How to Cite
Medelyan, O. and Witten, I. H. (2008), Domain-independent automatic keyphrase indexing with small training sets. J. Am. Soc. Inf. Sci., 59: 1026–1040. doi: 10.1002/asi.20790
Publication History
- Issue published online: 18 APR 2008
- Article first published online: 14 MAR 2008
- Manuscript Accepted: 26 OCT 2007
- Manuscript Revised: 22 OCT 2007
- Manuscript Received: 2 JUL 2007
Abstract
Keyphrases are widely used in both physical and digital libraries as a brief, but precise, summary of documents. They help organize material based on content, provide thematic access, represent search results, and assist with navigation. Manual assignment is expensive because trained human indexers must reach an understanding of the document and select appropriate descriptors according to defined cataloging rules. We propose a new method that enhances automatic keyphrase extraction by using semantic information about terms and phrases gleaned from a domain-specific thesaurus. The key advantage of the new approach is that it performs well with very little training data. We evaluate it on a large set of manually indexed documents in the domain of agriculture, compare its consistency with a group of six professional indexers, and explore its performance on smaller collections of documents in other domains and of French and Spanish documents.
