Chapter 1

Automatic Keyword Extraction from Individual Documents

Stuart Rose

E-mail address: stuart.rose@pnl.gov

Pacific Northwest National Laboratory, Richland, WA, USA

Dave Engel

E-mail address: dave.engel@pnl.gov

Pacific Northwest National Laboratory, Richland, WA, USA

Nick Cramer

E-mail address: nick.cramer@pnl.gov

Pacific Northwest National Laboratory, Richland, WA, USA

Wendy Cowley

E-mail address: wendy@pnl.gov

Pacific Northwest National Laboratory, Richland, WA, USA

First published: 04 March 2010
Citations: 164

Summary

Keywords are widely used to define queries within information retrieval (IR) systems because they are easy to define, revise, remember, and share. This chapter describes Rapid Automatic Keyword Extraction (RAKE), an unsupervised, domain-independent, and language-independent method for extracting keywords from individual documents. It details the algorithm and its configuration parameters and presents results on a benchmark dataset of technical abstracts, showing that RAKE is more computationally efficient than TextRank while achieving higher precision and comparable recall. The chapter then describes a novel method for generating stoplists, which is used to configure RAKE for specific domains and corpora. Finally, it applies RAKE to a corpus of news articles and defines metrics for evaluating the exclusivity, essentiality, and generality of extracted keywords, enabling a system to identify keywords that are essential or general to documents in the absence of manual annotations.
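The summary does not spell out the algorithm, so the following is only a minimal sketch of RAKE as it is commonly described: text is split into candidate phrases at punctuation and stop words, each word is scored by the ratio of its co-occurrence degree to its frequency, and each candidate phrase is scored by the sum of its word scores. The tiny stoplist, the tokenization regexes, and the function names (candidate_phrases, word_scores, rake) are assumptions made for illustration and are not taken from the chapter.

```python
import re
from collections import defaultdict

# Tiny illustrative stoplist (an assumption; practical stoplists are far larger).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "from", "in",
             "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    # Punctuation acts as a phrase delimiter.
    for fragment in re.split(r"[.,;:!?()\[\]\"\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in stopwords:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def word_scores(phrases):
    """Score each word as degree / frequency over the candidate phrases."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, including the word itself.
            degree[word] += len(phrase)
    return {word: degree[word] / freq[word] for word in freq}


def rake(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs, highest score first, ties broken alphabetically."""
    phrases = candidate_phrases(text, stopwords)
    scores = word_scores(phrases)
    ranked = {p: sum(scores[w] for w in p) for p in set(phrases)}
    return sorted(((" ".join(p), s) for p, s in ranked.items()),
                  key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "Keyword extraction is fast. Automatic keyword extraction from documents."
    for phrase, score in rake(sample):
        print(f"{score:4.1f}  {phrase}")
```

Because the degree-to-frequency ratio favors words that tend to appear inside longer phrases, multi-word candidates such as "automatic keyword extraction" rank above single words in this sketch. Swapping in a domain-specific stoplist changes where phrases are cut, which is why stoplist choice matters for configuring the method.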

Controlled Vocabulary Terms

benchmark polls
