Discovering emerging topics in textual corpora of galleries, libraries, archives, and museums institutions

For some decades now, galleries, libraries, archives, and museums (GLAM) institutions have provided access to information resources in digital format. Although some datasets are openly available, they are often not used to their full potential. Recently, approaches such as the so‐called Labs within GLAM institutions have promoted the reuse of digital collections in innovative and inspiring ways. In this article, we explore a straightforward computational procedure to identify emerging topics in periodical materials such as newspapers, bibliographies, and journals. The method is illustrated in three use cases based on public digital collections. Tools of this type are expected to promote further usage of the digital collections by researchers.


| INTRODUCTION
Over the past decades, national libraries, universities, archives, and museums have made periodic publications such as journals, magazines, and newspapers available in digital format. These collections open new opportunities for historians, journalists, and researchers, and also for the general public.
In parallel, standards have been promoted to increase interoperability and facilitate the access to large digital collections. The use of open licenses, for example, facilitates the reuse of the collections in new contexts including computationally driven research.
GLAM institutions have published digital collections using traditional methods based on websites and Application Programming Interfaces (APIs) (Europeana, 2020b). In some cases, advanced technologies such as the semantic web and Linked Open Data (LOD) are employed (BnF Data, 2014; Romero et al., 2018). Often, experimental methods are based on collaborative editing such as transcription, tagging, and description (Biblioteca Nacional de España, 2019; British Library, 2015; Library of Congress, 2018). Some innovative methods, such as collections as data, aim at the publication of digital collections as datasets amenable to computational use (Padilla et al., 2019). Numerous papers and best practices have highlighted the importance of using open data to ensure reproducibility in computational research (Baillieul et al., 2017, 2018; Rule et al., 2018).
Text mining techniques such as information extraction, categorization, clustering, natural language processing (NLP), and topic modeling have become very popular in the research community to search for relationships among text documents (Buenaño-Fernández et al., 2020). For such purposes, visualization methods can facilitate the analysis of text by providing interactive charts and dashboards. When periodic literature is available, it is possible to explore trends and temporal variations in the content. For example, we may be interested in the early detection of emerging topics, that is, novelties in the content. Note that this is different from the identification of trending topics, which are not necessarily new and can be in use for a longer period. For example, emerging subjects in scientific papers can help funding bodies to identify promising areas of research where additional investment may be recommended.
The purpose of this paper is to illustrate how a computational method can easily identify emerging topics in corpora provided by GLAM institutions when the content has a publication time mark, as is the case for periodical publications. Books could also, in principle, be amenable to this type of analysis, using, for example, the publication date, although the number of items should be large enough and, if there is a wide coverage of topics, results can be more difficult to compare and interpret. We will therefore examine use cases providing reproducible notebooks on the journals and newspapers provided by the British Library, the Biblioteca Virtual Miguel de Cervantes (BVMC), and the dblp computer science bibliography.
Our approach is based on a simple lexical analysis of the content, with no attempt to semantically cluster related concepts (for example, migration and refugee). It is, however, sometimes difficult to conclude if a new term should be considered as a new category or as the continuation of a pre-existing concept (for example, if deep learning and artificial neural networks should be categorized together).
The main contributions of this paper are the following: (a) A computational method to identify emerging topics in periodic publications; (b) a collection of notebooks based on newspapers and journals published by relevant GLAM institutions; and (c) a practical example of reuse of journals published as LOD.
The paper is organized as follows: After a brief description of the state of the art in Section 2, Section 3 introduces the framework to identify emerging topics in large corpora. Section 4 evaluates the framework by means of datasets from several relevant GLAM institutions and shows the results of its application. The paper concludes with an outline of the framework and future work.

| RELATED WORK
A number of GLAM institutions distribute digital collections which include periodical publications such as newspapers and journals. For instance, the Newspaper Navigator project of the Library of Congress allows users to browse over 1.5 million images through the Chronicling America collection (Library of Congress, 2020). The British Library's EThOS service comprises metadata descriptions of hundreds of thousands of PhD theses awarded by UK Higher Education institutions (Heather Rosie, 2021). Europeana newspaper explores the headlines, articles, advertisements, and opinion pieces from European newspapers from 20 countries, dating from 1618 to 1996 (Europeana, 2018, 2020a). The Atlas of Digitized Newspapers and Metadata is an open-access guide to a selection of newspaper databases around the world (Beals & Bell, 2020). The Early Journal Content (EJC) on JSTOR (a digital library of academic content in many formats and disciplines) includes public domain journal articles published in the United States before 1923 and articles published in other countries before 1870, including metadata, n-grams, and full text for text mining purposes (JSTOR, 2017). A sample of GLAM institutions providing access to periodical datasets can be found in the appendix (Table A1).
When digital collections are published as LOD by using standard vocabularies to describe their content, the collections can be enriched with external repositories, such as Wikidata and GeoNames, to include contextual information (Romero et al., 2019). In particular, open publication allows researchers to create Jupyter Notebooks, a type of publication which has become very popular in the research community and provides a web environment for transparent, collaborative, reproducible, and reusable data analyses. A notebook integrates detailed workflows, narrative text, and visualization of results. For instance, the GLAM Workbench (Sherratt, 2019) provides notebooks based on digital collections showing a collection of tools, tutorials, and examples. The GLAM Jupyter Notebooks explore the creation of machine actionable collections by means of datasets provided by several relevant GLAM institutions. The Library of Congress has published a collection of Jupyter Notebooks to query, download, and visualize cartographic material (Weinryb-Grohsgal, 2020). Recently, a new collection of Jupyter Notebooks has been published by the National Library of Scotland (National Library of Scotland, 2020). Additional Jupyter Notebooks are based on digitized newspaper data from the National Library of Estonia (Tinits, 2020).
So far, tools based on newspapers allow users to browse curated lists, and search the full text and metadata in a date range. In addition, NLP techniques, such as Named Entity Recognition (NER), have been applied to unstructured text in order to identify and classify named entities (places, persons, and locations) (Neudecker, 2016). Other examples are built upon machine learning techniques to detect shapes and objects, and search for similar images (Wevers & Lonij, 2017). Swiss and Luxembourgian newspapers have been explored in order to segment newspaper images and to classify detected segments according to a newspaper typology (Barman et al., 2020). A computational analysis of historical Hebrew newspapers has been performed to identify trends in the discourse (Segal et al., 2019).
Lately, word embeddings have attracted strong interest from the research community to perform tasks such as Named Entity Recognition, part-of-speech tagging, and text classification (Bakarov, 2018; Sung et al., 2020). Danish newspapers have been reused to identify how the use of language has evolved since the 18th century by means of experimental visualizations and word embeddings (KB Labs, 2016). Other approaches are based on more advanced models such as fastText and Bidirectional Encoder Representations from Transformers (BERT) (Hammou et al., 2020; Polignano et al., 2019).
The text mining technique topic modeling has become a popular procedure for clustering documents into semantic groups. Interactive visualization tools based on raw text data have been introduced by organizations such as DARIAH-DE that supports research in the humanities and cultural sciences with digital methods and procedures (DARIAH-DE, 2020).
There are existing web-based tools which assist in the discovery of trends in massive textual corpora: For example, Lansdall-Welfare and Cristianini employ n-grams and Zipf's law to identify the topics with the highest relevance. Burst detection in time series has also been addressed before to identify topics that grow in intensity for a period of time (Kleinberg, 2003). Detection of emerging topics has, however, a slightly different objective, as novelties may pass undetected before they become a trend or burst and, sometimes, early detection of such topics is required. Past work implements term-selection or clustering techniques to select relevant terms (Chandrakala et al., 2019). These methods employ standard techniques, such as stop-word removal and computation of TF-IDF frequencies, to select terms before analyzing their temporal evolution. In this paper, we show how a straightforward and reproducible method, weighting the terms according to their novelty, provides good results without the need for a term-selection or training step.
Although many studies are focused on newspapers and journals to extract information from text corpora, to the best of our knowledge, the definition of a framework for the identification of emerging topics in datasets provided by GLAM institutions has not been addressed. In this sense, the combination of the definition of a framework and the publication of a collection of notebooks enabling research reproducibility can help to increase the visibility of the datasets, encouraging researchers to reuse them.

| A SIMPLE METHOD TO IDENTIFY EMERGING TOPICS
The procedure depicted in Figure 1 works in four steps: 1. Identification, 2. Access and retrieval, 3. Machine processing, and 4. Human revision and visualization.
Depending on the characteristics of the dataset such as license, language, format, and accessing method, the application of the framework may slightly vary for each institution. Each step may require different adjustments in order to adapt the framework to other digital collections published by GLAM institutions.

| Identification
The identification step consists of the selection of existing resources which support the identification of emerging trends. It requires the application of a set of evaluation criteria including, among others: (a) License to identify if the resources are available for open access or using copyright restrictions; (b) accessibility in order to measure the extent to which data including metadata and full text are available; (c) completeness to measure the extent to which data are of sufficient depth for the task (e.g., dates of publication are distributed over a long enough period); and (d) ease of use to assess whether the collections are available by means of simple APIs or dump files.
Moreover, the type of the resources (including newspapers and journals) and their content and metadata are relevant aspects to take into consideration. In this sense, coverage, provenance, and transparency are crucial aspects to obtain optimum results. GLAM institutions have made a considerable effort in providing computational access to the digital collections and documentation about them, as well as making them available under open licenses to promote innovation and provide access to the audience including researchers and general users.

FIGURE 1 Framework employed to identify emerging topics in text corpora provided by GLAM institutions
Information retrieval systems built upon websites allow users to explore the digital collections provided by GLAM institutions. Particular sections may be devoted to specific content such as theses, maps, articles, journals, and newspapers. The content of journal and newspaper resources can be provided as PDF files and HTML, but also as text files that are more easily processed by computers. GLAM Labs have started to publish datasets ready for reuse in several formats, demonstrating examples of use and reuse (Mahey et al., 2019). In some cases, open-access data repositories such as Zenodo and figshare allow researchers to deposit datasets and research results.
Although many institutions publish their digital collections under open licenses, it is not infrequent to find images and resources with imprecise copyright statements (Schlosser, 2009). Some records may also use outdated models and vocabularies (Pitt Rivers Museum, 2020).

| Access and retrieve
Basic approaches for the access and retrieval of content focus on the extraction of text corpora and metadata by accessing the websites using a web crawler or a script. Advanced methods rely, however, on the use of APIs enabling the download of manageable slices of the data in several formats, such as the result of a search, or the works of an author or a subject. Several models to build APIs are available including, among others, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (Open Archives Initiative, 2002), the International Image Interoperability Framework (IIIF) (IIIF Consortium, 2014), and SPARQL (World Wide Web Consortium, 2013). IIIF has become very popular among the community and many GLAM institutions are adopting it for the publication of their digital collections including images. SPARQL is a powerful language to query a repository; however, its use requires some knowledge of semantic web technologies.
Simpler approaches are based on the publication of data dumps as large files which often require a preprocessing step before their usage. For example, Chronicling America provides access to the complete set of images, texts, and OCR coordinates as compressed archive files. The Bibliothèque nationale du Luxembourg also provides several newspaper datasets of increasing size (Bibliothèque nationale du Luxembourg, 2019).

| Machine processing
The procedure to identify emerging topics uses a corpus of textual data as input and identifies incipient terms. We have tested that a straightforward approach may suffice for newspaper and journal collections and, in general, for publications issued at regular intervals.
Given a collection $D = (d_1, d_2, \ldots, d_N)$ containing $N$ documents and $M$ distinct terms, the term frequency matrix $TF = (c_{mn})$ specifies the number $c_{mn}$ of occurrences of term $w_m$ in document $d_n$. The global frequency $g_m$ of term $w_m$ in the collection can be obtained by simply adding the content of every row, that is, $g_m = \sum_{n=1}^{N} c_{mn}$. If the addition is replaced by the count of nonzero values, one obtains the document frequency vector $DF = (f_1, f_2, \ldots, f_M)$, that is, the number $f_m$ of documents containing $w_m$ at least once.
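As a small worked illustration of these definitions (the numbers are invented for this example and do not come from any of the collections), consider $M = 2$ terms and $N = 3$ documents:

```latex
TF = \begin{pmatrix} 2 & 0 & 1 \\ 0 & 4 & 0 \end{pmatrix}
\quad\Rightarrow\quad
(g_1, g_2) = (2 + 0 + 1,\; 0 + 4 + 0) = (3, 4),
\qquad
DF = (f_1, f_2) = (2, 1).
```

The first term occurs three times spread over two documents, whereas the second occurs four times but in a single document, so a document frequency threshold of 2 would discard the second term despite its higher global frequency.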
The average date of the occurrences of term $w_m$ is simply: $$T = \frac{TF \times T}{DF},$$ where $T$ on the right-hand side is the list of document dates $(t_1, \ldots, t_N)$ and the quotient represents element-wise division. Terms which are novel in the collection will show the highest value of $T$ and, therefore, ranking terms according to this value should reveal which ones are the most recent. User-defined lower thresholds for the term and document frequencies will remove cases where the available information is not enough to draw significant conclusions; by default, we have selected $g_m \geq 10$ and $f_m \geq 3$. Stopwords and terms with a very high document frequency (by default, above $0.1N$) have also been removed. The pseudo-code of the procedure is shown below.
Required input: A list of documents $D = (d_1, \ldots, d_N)$ and a list of dates $T = (t_1, \ldots, t_N)$.
Optional input: thresholds $f_{min}$, $g_{min}$, $g_{max}$, and a range of dates $(t_{min}, t_{max})$.
Output: A sorted list of emerging topics.
Procedure:
1. Remove $d_n$ and $t_n$ if $t_n > t_{max}$ or $t_n < t_{min}$, and reindex $D$ and $T$ accordingly.
2. Compute the term frequency matrix $TF = (c_{mn})$.
3. Get the document frequency vector $DF = (f_m)$.
4. Add $TF$ rows for the global frequency $G = (g_m)$.
5. Remove $w_m$ from $TF$ and $DF$ (and reindex) if $f_m < f_{min}$, $g_m < g_{min}$, or $g_m > g_{max}$.
6. Compute $T = TF \times T / DF$.
7. Return the list of terms $w_m$ sorted according to $T$.

Terms in this approach can be single words or n-grams: An n-gram is a sequence of $n$ consecutive words such as "deep learning." In our experiments, terms of length $n = 1$ (words) and $n = 2$ (bigrams) were extracted from the documents, although storing terms with higher values of $n$ is also feasible.
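The steps of the procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of the published notebooks: it uses plain dictionaries instead of matrix libraries, tokenizes by whitespace, averages the dates over the documents that contain each term (so that the quotient by the document frequency yields a date), and discards any n-gram containing a stopword. The function names `extract_terms` and `emerging_terms` are hypothetical.

```python
from collections import Counter, defaultdict


def extract_terms(text, max_n=2):
    """Lowercase, split on whitespace, and return all n-grams up to max_n words."""
    words = text.lower().split()
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms


def emerging_terms(docs, dates, f_min=3, g_min=10, g_max=None,
                   t_min=None, t_max=None, stopwords=frozenset()):
    """Rank terms by the average date of the documents that contain them."""
    # Step 1: keep only documents inside the requested date range.
    kept = [(d, t) for d, t in zip(docs, dates)
            if (t_min is None or t >= t_min) and (t_max is None or t <= t_max)]

    g = Counter()                   # global frequency g_m
    f = Counter()                   # document frequency f_m
    date_sum = defaultdict(float)   # sum of dates of documents containing w_m

    # Steps 2-4: accumulate the counts document by document.
    for text, t in kept:
        counts = Counter(extract_terms(text))
        for term, c in counts.items():
            g[term] += c
            f[term] += 1
            date_sum[term] += t

    ranked = []
    for term in g:
        # Step 5: drop stopwords (assumption: any n-gram containing one)
        # and terms outside the frequency thresholds.
        if any(word in stopwords for word in term.split()):
            continue
        if f[term] < f_min or g[term] < g_min:
            continue
        if g_max is not None and g[term] > g_max:
            continue
        # Step 6: average date over the documents containing the term.
        ranked.append((term, date_sum[term] / f[term]))

    # Step 7: most recent average date first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

A call such as `emerging_terms(titles, years, f_min=3, g_min=20)` would return `(term, average date)` pairs with the most recent terms first. Weighting each date by the number of occurrences and dividing by the global frequency instead would give an occurrence-weighted average; which variant fits a collection best is a design choice left open here.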

| Human revision and visualization
Human revision of the results can improve the accuracy of this procedure, for example, discarding cases where an apparent novelty of a term is indeed a change of nomenclature.
Furthermore, the results can be transformed into interactive charts or graphical reports enabling researchers to understand, explain, and collect patterns from the data. Visualization tools such as interactive charts and scorecards (visual interfaces to easily find, analyze, and explore information) can lead to new discoveries which, in turn, can foster the exploitation of legacy material using new techniques and methods.

| EVALUATION OF THE FRAMEWORK
This section introduces three use cases to assess the framework proposed in Section 3. We have selected the datasets according to the following criteria:
• They are available for open access,
• The content is provided as full text,
• Dates of publication are distributed over a long enough period, and
• Simple access to the contents is provided through an API or in the form of bulk data files.
Jupyter Notebooks have been used to combine documentation, data, charts, and code: The project is available in GitHub as a collection of interactive notebooks executable in the cloud-based platform Binder. The notebook collection has been assigned a Digital Object Identifier (DOI) with the data archiving tool Zenodo (Candela et al., 2021).
The collection of Jupyter Notebooks is based on Python since it is a popular language with a low entry barrier (Koerner et al., 2020;Raschka et al., 2020). Binder is a well-known platform in the research community and provides an easy-to-use cloud environment to execute and reproduce the results.
The notebooks employ some open-source tools for handling data in Python such as NumPy, the Python Data Analysis Library (pandas), SciPy, and the Natural Language Toolkit (NLTK). Additional packages are used to create HTTP requests, and to retrieve and handle the results in a variety of formats such as JSON and CSV.
Regarding the optimization, we have identified the optimum results by setting the DF and TF thresholds to 3 and 20, respectively. Results are shown in Tables 3-10. The first column corresponds to the term identified as an emerging topic and the second column shows the average date in which the term appears in the documents. It is relevant to notice that the results may include several terms in the same year (e.g., the values 1994,4 and 1994,25 shown in Table 6).

| Doxa: A journal published as LOD at the BVMC
The catalog of the BVMC contains about 285,000 records and it was published as LOD in 2015. The repository was built using the Resource Description and Access (RDA) vocabulary to describe the items in the catalog (RDA Steering Committee and ALA Digital Reference, 2015). The LOD repository contains several types of materials such as videos, audios, images, books, journals, and maps, including metadata about the authors, dates, and subjects. The RDA vocabulary contains classes and properties to describe the resources that are linked by means of typed relationships. For instance, the whole-part relation is an association between a resource representing a part and a resource representing its corresponding whole that is used to describe journals. Figure 2 shows how the manifestations representing journals, volumes, and articles are linked by means of the property wholePartManifestationRelationship in the namespace rdam.
Doxa. Cuadernos de Filosofía del Derecho is a periodical publication issued every year since 1984 to promote the interaction between philosophers of law from Latin America and Latin Europe. The information regarding this publication has been included in the BVMC and it has been published as LOD in the repository, including metadata and text, being accessible by means of the public SPARQL endpoint.
In the examples below, the SPARQL API from data.cervantesvirtual.com/sparql was employed to retrieve the articles of the journal Doxa-Filosofía del derecho. Figure 3 shows the SPARQL query to retrieve the articles, including the PDF file containing the full text of each item. An overview of the results retrieved is shown in Table 1.
Although the LOD repository does not contain the full text, it can be easily retrieved through the URLs which are stored as Item entities in the RDA vocabulary using the property rdai:identifierForTheItem. The Tika library was used to extract the content of PDF files.
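A request to a SPARQL endpoint of this kind can be sketched with the Python standard library alone. The snippet below is an illustration under assumptions: the `http://` scheme of the endpoint URL is assumed, the query is a generic placeholder and not the query of Figure 3, and the helper names `build_sparql_url` and `fetch_json` are hypothetical. It relies only on the standard SPARQL protocol convention of passing the query text in a `query` parameter.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint named in the text; the "http://" scheme is an assumption.
ENDPOINT = "http://data.cervantesvirtual.com/sparql"

# Generic placeholder query, NOT the query shown in Figure 3.
PLACEHOLDER_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def fetch_json(endpoint, query, timeout=30):
    """Send the query and parse the JSON results (requires network access)."""
    request = Request(
        build_sparql_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json(ENDPOINT, PLACEHOLDER_QUERY)` would contact the live endpoint, so its availability and response format should be checked before relying on it.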
The metadata and full text retrieved are loaded as a Pandas DataFrame before the extraction of emerging topics takes place. In order to provide a snapshot of the application of the method, the analysis has been performed for three periods: 1984-1989 (164 issues, results shown in Table 1), the period 1990-1995 (224 issues), and the period 2010-2018 (261 issues).
The output was assessed by content managers of the library in charge of the curation of the journal, who identified the most relevant topics such as concepts, people, and locations, as well as particular terms that were highlighted as emerging topics in specific years.
In particular, the top 100 potential emerging topics obtained for each period were manually assessed, with an average success rate of about 50%. Tables 2-4 show the top 10 emerging topics that are automatically obtained by the framework for each period, while Tables 5-7 show the top 10 emerging topics obtained after the assessment by the content curators. In general, the overall assessment by the content managers was positive, and they provided useful feedback and comments about the emerging topics obtained for each period.
Additional tests were performed to compare the performance for different lengths of the time period. For example, when the intervals 2010-2019 and 2015-2019 were employed to obtain 20 potential emerging topics, the manual assessment found a success rate of 80% and 70%, respectively. Longer periods did not, however, lead to significant improvements.

| UK doctoral thesis metadata from EThOS: The British Library
The British Library provides the bibliographic metadata for all UK doctoral theses listed in EThOS, the UK's national thesis service (Heather Rosie, 2021). The data in this collection comprise PhDs awarded by UK Higher Education institutions since 1787.
The metadata is provided as a CSV file including several fields such as the title of the thesis, the name of the author, the abstract (long abstracts have been truncated), the year or the institution responsible for the thesis.
The CSV file containing the metadata is loaded as a Pandas DataFrame before the extraction of emerging topics takes place. The analysis has been performed for the period 2015-2021 including 112,776 PhDs. The top first 10 potential emerging topics obtained as a result are shown in

| The DBLP computer science bibliography
The DBLP computer science bibliography provides open bibliographic information on major computer science journals and proceedings since 1993. Inspired by the BibTeX format (a management tool for formatting lists of references), the project provides a dataset in XML format including the publication records (Ley, 2009). The metadata is provided as an XML file including an extensive list of publication records described with fields such as the type of document (e.g., article, PhD thesis or conference paper), the title, the name of the authors, the year, the URL, or the journal in which the article was published (Figure 8). First, a CSV file including only the title, year, and type of content (e.g., article) is extracted from the original XML file which is publicly available as a downloadable dataset (Carrasco & Candela, 2021).

TABLE 2 Top 10 emerging topics obtained as a result of the framework based on the journal Doxa (1984-1989)
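The extraction of a reduced CSV file from the XML records can be sketched as follows. This is an illustrative outline, not the script used to produce the published dataset: the sample records are invented, the set of record tags is a small assumed subset, and the helper names `extract_rows` and `rows_to_csv` are hypothetical. The records are streamed so that a large file does not need to fit in memory.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed subset of the record element names used by dblp.
RECORD_TAGS = {"article", "inproceedings", "phdthesis"}

# Invented sample with the same overall shape as dblp records.
SAMPLE_XML = """<dblp>
<article key="a1"><author>A. Author</author><title>Cloud computing survey</title><year>2011</year></article>
<inproceedings key="c1"><title>Cone metric spaces</title><year>2009</year></inproceedings>
</dblp>"""


def extract_rows(xml_file):
    """Stream the XML file and yield one dict per publication record."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag in RECORD_TAGS:
            title_elem = elem.find("title")
            # itertext() also collects text nested in markup inside the title.
            title = "".join(title_elem.itertext()) if title_elem is not None else ""
            yield {
                "title": title.strip(),
                "year": elem.findtext("year", default=""),
                "type": elem.tag,
            }
            elem.clear()  # release the memory held by the processed record


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "year", "type"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(extract_rows(io.BytesIO(SAMPLE_XML.encode("utf-8"))))
csv_text = rows_to_csv(rows)
```

The full dblp file declares character entities in an accompanying DTD, so parsing it would likely need additional entity handling beyond this sketch.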
The CSV file containing the metadata is loaded as a Pandas DataFrame before the extraction of emerging topics takes place. The analysis has been performed for two periods: 2000-2010 (570,116 articles) and 2010-2021 (1,548,362 articles). The top 10 potential emerging topics obtained for each period are shown in Tables 9 and 10. Figures 8-11 show the document frequency according to the period

| Discussion
The framework described in Section 3 can be optimized by selecting adequate values of the DF and TF thresholds. For collections containing a significant amount of text, such as Doxa and EThOS, $f_{min} = 3$ and $g_{min} = 20$ proved to work satisfactorily. For shorter texts, such as titles provided by dblp, additional tests were done by using a lower value for $g_{min}$ (e.g., 10). The results were similar; however, there were some differences which indicated that further configuration and optimization are needed to achieve the optimal performance.
The application of the methodology described here did not always lead to satisfactory results. For example, preliminary tests were performed on the Chronicling America newspapers collection (Bourbon News section) provided by the Library of Congress. An example to reproduce the results is included in the Jupyter Notebook collection. The output included noisy terms due to the low accuracy of the published transcription, which consists of unsupervised OCR of documents that often include unusual text styles and small fonts (see Table 11). Although this method is sensitive to OCR quality, it will not reveal the most common errors, since emerging terms are not the most frequent ones. One could, however, employ a probabilistic language model to detect mistakes in the transcription, as characters recognized incorrectly usually appear as content that deviates from the standard morphemes in the target language.
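One simple way to realize such a probabilistic language model is a character-bigram model with add-one smoothing, trained on a list of correctly spelled words and used to score candidate terms. The sketch below is only an assumption about how this idea could be implemented, since the text does not prescribe a specific model; the function names and the tiny training list are hypothetical.

```python
import math
from collections import Counter


def train_char_bigrams(words):
    """Count character transitions, using '^' and '$' as word boundaries."""
    pair_counts = Counter()
    first_counts = Counter()
    alphabet = set()
    for word in words:
        padded = "^" + word.lower() + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts, len(alphabet)


def avg_log_prob(word, model):
    """Average log-probability per character transition (higher is more word-like)."""
    pair_counts, first_counts, vocab_size = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen transitions at a small nonzero probability.
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + vocab_size)
        total += math.log(p)
    return total / (len(padded) - 1)


# Tiny invented training list; a real model would need a large clean vocabulary.
model = train_char_bigrams(
    ["the", "news", "paper", "bourbon", "county", "letter", "theater", "newspaper"]
)
```

Terms whose score falls below a threshold chosen on held-out clean text could then be flagged for human revision rather than discarded automatically.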
The last step of the methodology (human revision) is particularly relevant. Issues solved include:
• Ambiguous acronyms (such as CER, which according to EThOS could be expanded to Conditioned Emotion Response or Corporate Environmental Responsibility).
• Structural contents such as section titles (e.g., Section B).
Although the overall assessment by the content managers was positive, selecting the right period to achieve good performance is a challenging task due to several reasons including the quality of the content, the number of years included in the collection or the number of results to assess. In this way, the knowledge of the content curators contributes significant value to the assessment process.
During the evaluation of the results on the Doxa journal, the curators identified a number of potential impacts of the technique: (a) The enhancement of the catalog with suggestions for search keywords based on the identified emerging topics; (b) visualizations based on the emerging topics of a particular journal depicting the dynamic behavior of terms over time, as a hint for researchers; and (c) the identification of relevant items, such as popular works and topics, connected to a specific author, by analyzing author-oriented collections such as the Anales Galdosianos journal.

| CONCLUSIONS
The framework described in Section 3 provides a simple method to identify emerging topics in newspapers, bibliographies, and journals. The method has been applied to three datasets provided by relevant GLAM institutions. The overall assessment by the content curators regarding the journal Doxa was positive, with a success rate of 80% obtained when analyzing a period of 10 years.
The framework can be optimized by means of several parameters such as document and term frequency thresholds. The adoption of this procedure by other institutions may require an adjustment of the configuration in order to obtain optimal results. The differences of the datasets reused in our approach and the results obtained are useful to promote the reuse of the digital collections based on computational-driven research within GLAM institutions. In addition, OCR quality can pose a challenge when reusing text corpora.

FIGURE 10 DF of the term cone metric in dblp
FIGURE 11 DF of the term Cloud Computing in dblp
TABLE 11 Emerging topics obtained as a result based on Library of Congress' Chronicling America dataset (Bourbon News, 1913-1920)
Future work to be explored includes the normalization of the number of articles per year and the exploitation of the results by means of visualization techniques. The definition of a semantic vocabulary to describe the model as well as the outputs will be explored. We plan to explore datasets, such as the Current Research Information Systems digital repository, to check their applicability to support funding bodies in identifying promising areas of research.