Automated Geocoding of Textual Documents: A Survey of Current Approaches
Abstract
This survey article describes previous research addressing text‐based document geocoding, i.e. the task of predicting the geospatial coordinates of latitude and longitude that best correspond to an entire document, based on its textual contents. We describe (1) early document geocoding systems that use heuristics over place names mentioned in the text (e.g. names of cities and states), (2) probabilistic language modeling approaches, where generative models are built for different regions of the world (usually considering a discretization based on a rectangular grid) from the words occurring in a set of georeferenced training documents, and are then used to predict per‐region probabilities for previously unseen test documents, (3) combinations of different models and heuristics, including clustering procedures, feature selection approaches, and/or language models built from different sources, and (4) recent approaches based on discriminative classification models.
1 Introduction
Many text documents, from different application domains, can be said to be related to some form of geographic context. In the past, the relation between geographical location and language has been studied for instance as a sub‐field of sociolinguistics known as dialectology (Chambers 1998) and, more recently, several studies within the emerging field of computational sociolinguistics (Nguyen et al. 2015) have also addressed issues related to language and geography, such as regional patterns of linguistic variation (Huang et al. 2016; Gonçalves and Sánchez 2015), the diffusion of linguistic change (Eisenstein et al. 2014; Eisenstein 2015) or the way demographic variables interact with geography to affect language use (Pavalanathan and Eisenstein 2015). In recent years, given the increasing volume of unstructured information being published online, researchers from the areas of information retrieval and natural language processing have shown an increased interest in applying computational methods to extract geographic information from heterogeneous and unstructured data, including textual documents. This information is useful not only for scientific studies, e.g. in the aforementioned field of sociolinguistics, or in the spatial humanities (Bodenhamer et al. 2010; Gregory and Geddes 2014; Gregory et al. 2015), but also to support many different types of practical everyday applications.
Geographical Information Retrieval (GIR) has, for instance, captured the attention of many different researchers who work in fields related to text processing, in an attempt to go beyond the capabilities of traditional geographical information systems, which express information about the world in terms of associations to formal geospatial coordinates of latitude and longitude, instead of handling named places and locative expressions (Purves and Jones 2011; Hill 2006). The vision of a geospatial semantic Web has also attracted significant research, addressing issues related to: (1) the management of large knowledge bases describing entities such as locations (e.g. through endeavours like DBpedia (Bizer et al. 2009; Lehmann et al. 2014), YAGO (Hoffart et al. 2013) or the Geonames ontology (http://www.geonames.org/ontology/)); (2) the provision of interlinked data between different geospatial data sources (Moura and Davis 2014); and also issues related to (3) the analysis of textual contents in order to add semantic annotations, for instance by linking entities discovered in texts (e.g. names for people, organizations or locations) to entries in knowledge bases such as DBpedia (Mendes et al. 2011; Hoffart et al. 2011). Within the semantic Web, natural language processing, information retrieval and information extraction communities, much work has been done on extracting facts and relations about geospatial entities, specifically attempting to deal with the fact that geospatial expressions are often vaguely defined and context dependent. This includes studies focused on the extraction and normalization of named places from text (Roberts et al. 2010; Grover et al. 2010; Qin et al. 2010; Gelernter and Mushegian 2011; Gelernter and Balaji 2013; Zhang and Gelernter 2014; Moncla et al. 2014; Santos et al. 2015; Speriosu and Baldridge 2013; Inkpen et al. 2015; DeLozier et al. 2015; Awamura et al. 2015), studies focusing on geospatial topic modeling (Speriosu et al.
2010; Eisenstein et al. 2010), studies focusing on the extraction of locative expressions beyond named places (Liu et al. 2014; Wallgrün et al. 2014), studies focused on extracting itineraries as described in text (Moncla et al. 2015; Bekele 2014), studies focusing on the extraction of qualitative spatial relations between places (Khan et al. 2013; Wallgrün et al. 2014; Antelman and Cleary 2015), or studies focusing on the extraction and modeling of general spatial semantics from natural language descriptions (Recchia and Louwerse 2014; Kordjamshidi et al. 2011, 2013).
Computational models for understanding geospatial language are indeed a cardinal issue in several disciplines, and they can also provide critical support for multiple practical applications. The task of resolving individual place references in textual documents has been specifically addressed in several previous studies, with the aim of supporting subsequent GIR processing tasks, such as document retrieval or the production of cartographic visualizations from textual documents (Mehler et al. 2006; Lieberman and Samet 2011; Gregory and Hardie 2011; Adams et al. 2015; Borin et al. 2014). However, place reference resolution presents several non‐trivial challenges (Amitay et al. 2004), due to the inherent ambiguity of natural language discourse (e.g. place names often have other non‐geographic meanings, different places are often referred to by the same name, and the same places are often referred to by different names). Moreover, there are many vocabulary terms besides place names that can frequently appear in the context of documents related to specific geographic areas. People may, for instance, refer to vernacular names (e.g. The Alps or Southern Europe) or vague feature types (e.g. downtown) which do not have clear administrative borders (Hollenstein and Purves 2010; Schockaert 2011), and several other types of natural language expressions (e.g. culturally local features such as soccer vs. hockey, or stylistic and dialectical differences such as cool vs. kewl or kool) can indeed be geo‐indicative, even without making explicit use of place names (Adams and Janowicz 2012, 2013). Instead of trying to resolve the individual place references that are made in textual documents, several authors have noted that it may be more interesting to study methods for assigning entire documents to geospatial locations (Anastácio et al. 2009; Wing and Baldridge 2011; Laere et al. 2014; Melo and Martins 2015).
This survey article specifically focuses on text‐based document geocoding, i.e. the task of predicting the latitude and longitude coordinates of a given document, based on its entire textual contents. We present a detailed review of previous studies that have addressed this problem, leveraging different types of techniques from the areas of information retrieval, natural language processing and machine learning. Although some of the methods discussed in the survey leverage the resolution of toponyms in the text (i.e. they leverage methods for recognizing and disambiguating the place names occurring in the text), our focus is on surveying previous studies that specifically address document geocoding, instead of extensively describing toponym resolution approaches. It should also be noted that, in the context of Geographical Information Systems, the term geocoding often denotes the task of address geocoding (i.e. the task of converting postal address data into geographic coordinates of latitude and longitude). Address geocoding involves significantly different methods than those surveyed in this article (e.g. pre‐processing techniques for address standardization, similarity search methods for matching addresses against gazetteers, and interpolation methods over street segments associated to address ranges and/or parcels) and, for more details about address geocoding, the reader should refer to the overviews given by Goldberg et al. (2007) and Goldberg (2008).
Our survey starts with a section on early proposals, specifically describing two seminal document geocoding systems that relied on heuristics over locations mentioned in the texts (e.g. names for cities and states, as described in large gazetteers). Section 3 describes language modeling and information retrieval approaches for document geocoding, in which generative probabilistic models for different regions of the world are first estimated from training data and then used to assign per‐region probabilities to previously unseen documents. Section 4 describes recent extensions to the language modeling approaches from the previous section, specifically covering methods in which different strategies are used to address issues such as proper term selection, model smoothing by leveraging different data sources, or the final estimation of geospatial coordinates based on the per‐region probabilities. Section 5 discusses recent approaches based on document classification through discriminative models such as logistic regression, support vector machines, or multilayer neural networks. Section 6 summarizes the main aspects of the methods surveyed in the article, and discusses open challenges and ideas for future work. Finally, Section 7 concludes our survey.
2 Early Proposals
This section describes two seminal studies focusing on the problem of document geocoding, namely: (1) the GIPSY system from Woodruff and Plaunt (1994); and (2) the Web‐a‐where approach described by Amitay et al. (2004).
Woodruff and Plaunt (1994) described the Geo‐referenced Information Processing SYstem (GIPSY), i.e. a prototype retrieval service, integrating text‐based geocoding functionalities, for handling documents related to the region of California. In this system, document geocoding relied on an auxiliary dataset containing, among other items, a subset of the US Geological Survey's Geographic Names Information System (GNIS) database, which contained geospatial footprints for over 60,000 geographic place names in California. In this approach, each possible location for a given place name is associated with a weight that depends on: (1) the geographic terms extracted from the text; (2) the position within the document and the frequency of these terms; (3) knowledge of the geographic objects and their attributes on the auxiliary database; and (4) geospatial reasoning over the geographic regions of the object. GIPSY's document geocoding algorithm can be divided into three main steps, which we outline next.
Step (1) is a parsing stage, where the document's relevant keywords and phrases are extracted. Terms and phrases related to geospatial locations in the auxiliary dataset are collected from the text, along with other lexical constructs containing spatial information, such as adjacent, south of or neighboring. The terms and phrases are also weighted, according to a combination of different heuristics.
In Step (2), in order to obtain geospatial footprints (e.g. polygonal representations for the area covered by a given place) for the places referenced in the document, the system uses the auxiliary geospatial dataset containing information such as the locations of cities, names and locations of endangered species, bio‐regional characteristics of climate regions, etc. The system identifies the spatial locations that are the most similar to the geographic terms retrieved from the text in Step (1), also looking for synonymy (e.g. the Latin and the common name of a species) and hierarchical containment relations (e.g. when the auxiliary dataset does not have geospatial information about a given sub‐species, although there is information on a hierarchically superior species) between geospatial concepts in the auxiliary dataset. Finally, geographic reasoning is also applied, e.g. for handling expressions like south of California.
After extracting all the possible locations for all the terms and phrases denoting places in a given document, the final step overlays the geospatial footprints in order to estimate approximate locations. Every combination of place name, weight, and geospatial footprint is represented as a three‐dimensional polyhedron with a polygonal base on the plane formed by the x and z axes, and that is elevated upward on the y axis according to the corresponding weight. The polygons for a given document are added one by one to a skyline that starts empty. When adding a polygon there are three possible scenarios: (1) the polygon to add does not intersect any of the other polygons, and is mapped to y = 0 (see the top representation on Figure 1); (2) the polygon to add is contained in a polygon that has already been added, and its base is positioned in a higher plane (see the middle representation in Figure 1); or (3) the polygon to add intersects, but is not fully contained by, one or more polygons, and thus the polygon to add is split. The portion that does not intersect any polygon is laid at y = 0, and the portions that intersect polygons are put on top of the existing polygons they intersect (see the bottom representation in Figure 1).

Illustrations for the cases when two polygons do not intersect each other (top), when one polygon is contained in another polygon (middle), and when a polygon partially intersects another polygon (bottom)
After all the polygons for a given document have been added to the skyline, one can estimate the geospatial region that best fits a document, either by calculating a weighted average of the regions that have the highest elevation in the skyline generated for the document, or by assuming that the document is located in the highest region of the skyline.
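The skyline idea can be illustrated with a simplified one‐dimensional analogue (an illustrative sketch, not GIPSY's actual implementation), in which weighted intervals play the role of weighted polygons: overlapping evidence accumulates, and the document is located at the highest‐elevation region.

```python
def skyline_estimate(footprints):
    """footprints: list of (start, end, weight) tuples, one per place reference.
    Returns the midpoint and the summed weight of the highest sub-interval."""
    # Collect the boundaries of all intervals and sweep over the pieces
    # between consecutive boundaries, summing the weights that cover each.
    points = sorted({p for s, e, _ in footprints for p in (s, e)})
    best_height, best_segment = 0.0, None
    for left, right in zip(points, points[1:]):
        mid = (left + right) / 2.0
        height = sum(w for s, e, w in footprints if s <= mid <= e)
        if height > best_height:
            best_height, best_segment = height, (left, right)
    return sum(best_segment) / 2.0, best_height

# Two overlapping "footprints" reinforce each other around the interval [2, 4],
# so the estimate falls there rather than on the heavier isolated interval.
location, height = skyline_estimate([(0, 4, 1.0), (2, 6, 2.0), (8, 9, 2.5)])
```

The same sweep generalizes to two dimensions by intersecting polygons instead of intervals, which is essentially what GIPSY's polyhedron overlay does.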
In another seminal publication concerned with document geocoding, Amitay et al. (2004) described the Web‐a‐Where system for associating geographic regions to Web pages. A hierarchical gazetteer was used in this work, organizing the world according to continents, countries, states (for some countries only) and cities. The gazetteer contained approximately 40,000 unique places, and 75,000 names for those unique places, including different spellings and abbreviations. As stated in the introduction of this article, several previous efforts (e.g. within the semantic Web community) have specifically dealt with the development of large gazetteers and ontological resources for describing places (Hill et al. 1999; Chaves et al. 2005; Goodchild and Hill 2008; Bizer et al. 2009; Stadler et al. 2012; Hoffart et al. 2013; Lehmann et al. 2014), which are important resources in the context of methods for resolving place names referenced in textual documents. A seminal paper by Hill (2000) provides a brief overview on the core elements of digital gazetteers, in the context of text mining and information retrieval applications.
The Web‐a‐Where geocoding algorithm has three main steps, namely: (1) place name spotting; (2) place name disambiguation; and (3) geographical focus determination.
In the first step, the goal is to find all the possible locations mentioned in a given Web page. The authors relied on a simple procedure, based on dictionary matching and using the names in the gazetteer, so there are terms and phrases that, despite being extracted, do not actually correspond to places.
The second step, i.e. disambiguation, addresses this particular issue through a pipeline of heuristics. According to Amitay et al. (2004) there are two possible sources of ambiguity, namely geo/geo cases (for example London in England vs. London in Ontario) and geo/non‐geo cases (e.g. the term London in Jack London, which should be considered as part of a name for a person). In order to disambiguate place names, the surrounding words are analyzed. If a surrounding word is a country name, or the name of a state in which the place could be contained, a high confidence level is assigned to that place name. Each reference that remains unresolved, after the application of this first heuristic, is then assigned a low confidence and associated with the geographical place with the highest population. When there is already a resolved place name and there are other unresolved occurrences of the same place name in the document, the unresolved place names are assigned to the same geographic region as the resolved place name, with an in‐between value for the confidence. Finally, the algorithm attempts to better estimate the location of place names having a low confidence, through a deeper analysis of the surrounding context. For instance, if we have Hamilton and London occurring together with Ontario, one can infer that the most probable location for Ontario is the Canadian province with this particular name, because this would be the only place in the gazetteer that has both Hamilton and London as descending cities. In these cases, the confidence levels are also increased.
The final step of the geocoding algorithm concerns finding the geospatial focus, i.e. the region that is most closely related to the contents of the document (e.g. a Web page mentioning San Francisco and San Diego in California should perhaps have its focus on the gazetteer node corresponding to California → United States → North America, whereas a document mentioning the countries Italy, Portugal, Greece and Germany should perhaps have its focus as the taxonomy node corresponding to Europe). The focus determination algorithm attempts to be as specific as possible, by selecting nodes closer to the leaves of the hierarchical gazetteer, but that simultaneously cover the different places referenced in the text. For instance, if a document mentions several times the place name Lisbon, and if no other cities nor countries are mentioned, then the focus will be the gazetteer node corresponding to the city of Lisbon → Portugal → Europe, rather than to the node Portugal → Europe or the node Europe that, although correct, would not be as specific as the first taxonomy node.
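The focus determination step can be sketched as follows; the scoring used here is a simplification (the deepest gazetteer node shared by all resolved mentions) rather than the exact heuristics of Amitay et al. (2004), and the place hierarchy is invented for illustration.

```python
def find_focus(mention_paths):
    """mention_paths: list of gazetteer paths, one per resolved place mention,
    e.g. ('Europe', 'Portugal', 'Lisbon').  Returns the deepest node that
    still covers every mention (the longest common prefix of the paths)."""
    prefix = []
    for level in zip(*mention_paths):  # zip truncates to the shortest path
        if len(set(level)) == 1:
            prefix.append(level[0])
        else:
            break
    return tuple(prefix)

# Two cities in California -> the focus is the state node, not either city.
focus = find_focus([
    ('North America', 'United States', 'California', 'San Francisco'),
    ('North America', 'United States', 'California', 'San Diego'),
])
```

When a document repeatedly mentions a single place, the common prefix is that place's full path, matching the Lisbon example above where the most specific covering node is preferred.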
To evaluate the proposed approach, Amitay et al. (2004) used a dataset collected from the Open Directory Project (http://www.dmoz.org), with almost one million English pages having a geographic focus (i.e. the authors used the part of the Web directory that organizes pages according to geographical locations). The Web‐a‐Where system predicted the country that was the focus of each of these articles with an accuracy of 92%, and an accuracy of 38% was measured for the task of correctly predicting the city that was the focus of each of these articles.
3 Language Modeling Approaches
Instead of relying on heuristic rules, several document geocoding methods are based on probabilistic language modeling approaches (Wing and Baldridge 2011; Roller et al. 2012; Dias et al. 2012), similar to those commonly used in the area of information retrieval (Ponte and Croft 1998; Zhai and Hirst 2008). These methods leverage training data (i.e. documents known to be associated to particular geospatial regions) in order to estimate the parameters of models that are able to assign a probability to previously unseen documents. By having language models for different geospatial regions, a document can be geocoded by choosing the model (i.e. the geospatial region) that best explains its textual contents.
Language models correspond to generative approaches, in the sense that they define a probabilistic mechanism for explaining the generation of natural language contents. One of the most popular language‐modeling techniques is the n‐gram model, which relies on a Markov assumption in which the probability of a given word at the i‐th position of a sequence (i.e. documents are seen as sequences of words), can be approximated to depend only on the k previous words, instead of depending on all of the previous words, i.e.
$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1}) \quad (1)$$

The simplest n‐gram models are based on unigrams, where the k in the previous formula is actually equal to zero, and where the probability of a sequence of words is approximated as the multiplication of the occurrence probabilities of each word (i.e. word occurrences are seen as independent events), based on a set of training documents. This intuition can be formalized as shown in the next equation, where $count(w_i)$ corresponds to the number of times that word $w_i$ is seen in the training corpus:
$$P(w_i) = \frac{count(w_i)}{\sum_{w_j \in V} count(w_j)} \quad (2)$$

A more sophisticated approach corresponds to the bigram language model, where the probability of a word depends on the previously occurring one, as shown in the following equation:
$$P(w_i \mid w_{i-1}) = \frac{count(w_{i-1} \, w_i)}{count(w_{i-1})} \quad (3)$$

This approach can also be extended to take into account the last two words (i.e. a trigram language model), the last three words (i.e. a 4‐gram model), etc. Nonetheless, n‐gram language models can be hard to estimate for large n, given the number of parameters that would be involved, and given the fact that events corresponding to the occurrence of particular n‐grams will be very sparse, even with large training corpora.
It is important to notice that sparsity can be an issue even with bigram or unigram language models. For instance, according to Equation 2, the probability of seeing a particular sequence of words $P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i)$ will be zero if any of the $w_i$ words occurring in the sequence is never seen in the training data. In bigram language models, and according to Equation 3, the probability $P(w_i \mid w_{i-1})$ will be zero if word $w_{i-1}$ never precedes word $w_i$, even if both these words indeed occur in the training data. Smoothing is thus required in order to build effective language models. One simple method is the Laplace (i.e. add‐one) smoothing technique (Chen and Goodman 1998), where we adjust the empirical counts by adding a residual value to the cases that are never seen in the training data. For instance, in the case of Equation 2, we would add one to the value of $count(w_i)$, and add the size of the vocabulary $|V|$ to the total number of word tokens $N$ (i.e. unseen events will now have a probability of $1 / (N + |V|)$ instead of zero). The probability of observing a word $w_i$ in the context of a bigram language model, after Laplace smoothing, would be given by:

$$P(w_i \mid w_{i-1}) = \frac{count(w_{i-1} \, w_i) + 1}{count(w_{i-1}) + |V|} \quad (4)$$
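The maximum‐likelihood and Laplace‐smoothed estimates of Equations 2 through 4 can be sketched as follows, over a toy corpus (the corpus and whitespace tokenization are illustrative only).

```python
from collections import Counter

# Toy training corpus, tokenized by whitespace.
corpus = "new york is in new york state".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    # Equation 2: count(w) divided by the total number of word tokens.
    return unigrams[w] / len(corpus)

def p_bigram_laplace(w, prev):
    # Equation 4: add-one smoothing, so unseen bigrams keep non-zero mass.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(vocab))
```

For example, the unsmoothed bigram probability of "state" following "new" would be zero under Equation 3, but the Laplace‐smoothed estimate remains strictly positive.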
Taking inspiration from previous work by Serdyukov et al. (2009), concerned with geocoding Flickr photos based on their tags, Wing and Baldridge (2011) investigated the use of language modelling methods for automatic document geocoding. The authors started by applying a regular geodesic grid to divide the Earth's surface into discrete rectangular cells (i.e. a set J of regions spanning a fixed number of degrees of latitude by a fixed number of degrees of longitude). Each of these cells can be seen as being associated to a cell‐document, corresponding to a concatenation of all the training documents known to be located within the cell's region. The cells that do not contain any training documents are ignored.
The unsmoothed probability of observing a word $w_i$, in a specific cell C from the set of possible cells J, is given by the following equation, where $count(w_i, d)$ corresponds to the number of times word $w_i$ is seen in a document d:

$$P(w_i \mid C) = \frac{\sum_{d \in C} count(w_i, d)}{\sum_{w_j \in V} \sum_{d \in C} count(w_j, d)} \quad (5)$$

An equivalent distribution $P(w_i \mid d)$ exists for each test document d, corresponding to the unsmoothed probability of observing each word $w_i$ in document d:

$$P(w_i \mid d) = \frac{count(w_i, d)}{\sum_{w_j \in V} count(w_j, d)} \quad (6)$$

To address the issue of events that are never observed in the training data (i.e. words from the vocabulary that are never seen in particular cells and/or documents), the authors proposed using the Good‐Turing smoothing technique (Chen and Goodman 1998; Good 1953). The smoothed probability for observing a word $w_i$ from a vocabulary V, in a document d and considering a corpus of training documents D, is thus given by:

$$\tilde{P}(w_i \mid d) = \begin{cases} (1 - \alpha_d) \, P(w_i \mid d) & \text{if } count(w_i, d) > 0 \\ \alpha_d \, \dfrac{P(w_i \mid D)}{\sum_{w_j : \, count(w_j, d) = 0} P(w_j \mid D)} & \text{otherwise} \end{cases} \quad (7)$$

In the previous equation, $\alpha_d$ is the probability mass reserved for unseen words, which is equal to the probability of observing a word only once in document d (i.e. the fraction of tokens in d corresponding to words that occur exactly once). An analogous procedure is used to smooth the cell distributions.
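A sketch of this discounting scheme is shown below; it is one interpretation of the smoothing just described (seen words keep a discounted maximum‐likelihood estimate, and the mass alpha is redistributed over unseen words proportionally to a global distribution), and the global distribution and document are invented for illustration.

```python
from collections import Counter

def smoothed_distribution(doc_tokens, global_dist):
    """Discounted document distribution: alpha (the fraction of tokens from
    words seen exactly once, a Good-Turing estimate of the unseen mass) is
    moved onto the words of global_dist that the document never mentions."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    alpha = sum(1 for c in counts.values() if c == 1) / total
    unseen = {w for w in global_dist if w not in counts}
    unseen_mass = sum(global_dist[w] for w in unseen) or 1.0  # avoid div by 0
    dist = {}
    for w in global_dist:
        if w in counts:
            dist[w] = (1.0 - alpha) * counts[w] / total
        else:
            dist[w] = alpha * global_dist[w] / unseen_mass
    return dist

# 'porto' is never seen in the document, yet receives non-zero probability.
global_dist = {'lisbon': 0.25, 'porto': 0.25, 'city': 0.5}
dist = smoothed_distribution(['lisbon', 'lisbon', 'city'], global_dist)
```

Note that the resulting distribution still sums to one: the discount applied to the seen words is exactly the mass handed to the unseen ones.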
The geocoding algorithm attempts to find the cell‐document that is the most similar to the test document, afterwards assigning the centroid coordinates of the region that corresponds to the most similar cell. Three different methods were tested, namely: (1) Kullback‐Leibler divergence; (2) naïve Bayes; and (3) the average cell probability.
The Kullback‐Leibler (KL) divergence measures how well a distribution encodes another one, and the smaller the obtained value the closer both distributions are. Each cell C has a distribution that represents the probability of observing each of the words of the vocabulary, and we can thus find the cell $\hat{C}$, in the grid, whose distribution best encodes the distribution for a test document d, given by:

$$\hat{C} = \arg\min_{C \in J} \; KL\big(P(\cdot \mid d) \,\|\, P(\cdot \mid C)\big) = \arg\min_{C \in J} \sum_{w_i \in V} P(w_i \mid d) \log \frac{P(w_i \mid d)}{P(w_i \mid C)} \quad (8)$$

Naïve Bayes corresponds to a simple generative language model, based on unigrams (McCallum and Nigam 1998). The most similar cell $\hat{C}$ is, in this case, given by the following equation, corresponding to a maximum a posteriori probability rule for deciding the most likely region:

$$\hat{C} = \arg\max_{C \in J} P(C \mid d) = \arg\max_{C \in J} \frac{P(C) \, P(d \mid C)}{P(d)} \quad (9)$$

Note that in the previous formula, P(d) is the same for every cell C in the grid of J cells, and since the objective is to find the most probable cell, we can remove the denominator from the equation. The probability of observing a document d within the context of a cell C can be computed by multiplying the probabilities for each word in document d occurring in the context of cell C (i.e. the model assumes that words occur independently). Thus, the most likely region $\hat{C}$ can be computed according to:

$$\hat{C} = \arg\max_{C \in J} P(C) \prod_{w_i \in d} P(w_i \mid C) \quad (10)$$

In the previous equation, P(C) can be computed by dividing the number of training documents in cell C by the total number of documents in the training corpus.
Finally, the method based on the average cell probability corresponds to the following equation, where the sum over $C' \in J$ corresponds to a sum over all the cells from J that contain training documents:

$$\hat{C} = \arg\max_{C \in J} \frac{1}{|d|} \sum_{w_i \in d} \frac{P(w_i \mid C)}{\sum_{C' \in J} P(w_i \mid C')} \quad (11)$$
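The three decision rules can be illustrated over toy data; the two cell distributions, priors and test document below are invented for illustration, and each function follows one of the equations referenced in the text.

```python
import math

# Toy smoothed models: two cells, three vocabulary words, uniform priors.
cells = {
    'cell_a': {'prior': 0.5, 'words': {'beach': 0.6, 'snow': 0.1, 'city': 0.3}},
    'cell_b': {'prior': 0.5, 'words': {'beach': 0.1, 'snow': 0.6, 'city': 0.3}},
}
doc = ['beach', 'beach', 'city']

def kl_best(cells, doc):
    # Equation 8: pick the cell minimizing KL(P(.|d) || P(.|C)); words absent
    # from the document contribute zero, so the sum runs over doc words only.
    p_doc = {w: doc.count(w) / len(doc) for w in set(doc)}
    def kl(c):
        return sum(p * math.log(p / cells[c]['words'][w]) for w, p in p_doc.items())
    return min(cells, key=kl)

def nb_best(cells, doc):
    # Equation 10: maximum a posteriori with a unigram likelihood (log space).
    def log_post(c):
        return math.log(cells[c]['prior']) + sum(math.log(cells[c]['words'][w]) for w in doc)
    return max(cells, key=log_post)

def acp_best(cells, doc):
    # Equation 11: average, over the document's words, of P(w|C) normalized
    # across all cells.
    def acp(c):
        return sum(cells[c]['words'][w] / sum(cells[o]['words'][w] for o in cells)
                   for w in doc) / len(doc)
    return max(cells, key=acp)
```

For this beach‐heavy test document all three rules agree on the first cell, although on real data the rules can and do disagree, which is why the authors compared them empirically.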
Wing and Baldridge used two different datasets to evaluate the proposed methods, namely a collection of georeferenced articles from the English version of Wikipedia, taken from a dump produced in 2010, and a collection of georeferenced Twitter documents (i.e. concatenations of tweets belonging to an individual user) collected by Eisenstein et al. (2010), from the 48 states of the continental region of the USA. Prediction errors, for each test document, were computed through the great‐circle distance between the predicted location and the ground‐truth location.
The best results were obtained with the approach based on the Kullback‐Leibler divergence, although the naïve Bayes method was a close second. For the Wikipedia dataset, the best results corresponded to a mean error of 221 km and a median error of just 11.8 km. On the Twitter dataset, the best results corresponded to a mean error of 892 km and a median error of 479 km.
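The great‐circle distance used as the error measure above can be computed with the haversine formula; the sketch below assumes a mean Earth radius of 6,371 km, and the Lisbon/Madrid coordinates are rounded illustrative values.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points, in kilometres,
    via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Error for a prediction of Madrid (40.42N, 3.70W) when the ground truth is
# Lisbon (38.72N, 9.14W): roughly five hundred kilometres.
error = great_circle_km(38.72, -9.14, 40.42, -3.70)
```

Averaging these per‐document distances yields the mean errors reported throughout this section, while the median is less sensitive to the few documents geocoded to the wrong continent.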
In subsequent work by the same team, Roller et al. (2012) reported on two improvements over the previously described method. The first improvement relates to the use of k‐d‐trees to construct an adaptive grid (Bentley 1975). This method deals better with sparseness (i.e. with a regular grid, the training documents concentrate around popular regions and many of the cells thus associate to very few textual contents), producing variable‐sized cells that have roughly the same number of training documents in them. The second improvement relates to the use of a different approach to choose the final geospatial coordinates from the most probable cell. Instead of assigning to the test document the coordinates of the centroid for the most probable cell, the authors instead proposed to assign the geographic midpoint (Jenness 2008) of the training documents that are contained in that cell. In order to find the most probable cell for a given document, the authors again used an approach based on the Kullback‐Leibler divergence, given that this method achieved the best results in the previous study. Roller et al. used three different datasets in their study, namely the two datasets described in the previous article (Wing and Baldridge 2011), and a third one consisting of 38 million tweets located inside North America. Roller et al. (2012) improved on the results reported by Wing and Baldridge (2011), for instance reducing the mean error in the Wikipedia dataset from 221 to 181 km, and the median error from 11.8 to 11.0 km, by using k‐d‐trees instead of uniform grids, and by using the geographic midpoint instead of the centroid of the most probable cell. The mean error was further reduced to 176 km, in an approach that combined uniform grids with k‐d‐trees (i.e. considering cell‐documents produced by applying either a uniform grid or a k‐d‐tree). 
On the new Twitter dataset, the best results were obtained using only the cells produced through a k‐d‐tree, and these corresponded to a mean error of 860 km and a median error of 463 km.
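The geographic midpoint (Jenness 2008) used above to pick the final coordinates can be sketched by averaging the points as three‐dimensional unit vectors and converting the mean vector back to latitude and longitude; for brevity this is the unweighted variant, whereas weighted versions scale each vector before averaging.

```python
import math

def geographic_midpoint(points):
    """Midpoint of a list of (lat, lon) pairs, in degrees: average the
    corresponding unit vectors on the sphere and project the mean back."""
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))

# Midpoint of two points on the 40th parallel: the longitude is halfway,
# while the latitude bulges slightly poleward, as on a great circle.
midpoint = geographic_midpoint([(40.0, -9.0), (40.0, -3.0)])
```

Unlike a naive average of latitudes and longitudes, this construction behaves correctly near the 180th meridian and the poles, which matters for cells produced by a global grid.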
Dias et al. (2012) also evaluated several techniques for automatically geocoding textual documents, based on language models and using only the text of the documents as input evidence. Experiments were made using georeferenced Wikipedia articles written in English, Spanish and Portuguese, taken from dumps produced in 2012.
To partition the geographic space, Dias et al. (2012) relied on a Hierarchical Triangular Mesh (HTM), i.e. a multi‐level recursive approach to decompose the Earth's surface into triangular regions with almost equal shapes and sizes (Szalay et al. 2005). An HTM offers a convenient approach for indexing data georeferenced to specific points on the surface of the Earth. The method starts with an octahedron – see the first shape on the leftmost part of Figure 2, adapted from original figures available on the HTM website (http://www.skyserver.org/htm/) – that has eight spherical triangles, four in the northern hemisphere and four in the southern hemisphere, resulting from the projection of the edges of the octahedron onto a spherical approximation for the Earth's surface. These eight spherical triangles correspond to level 0 regions. Each extra level of decomposition divides each of the regions into four new ones, creating smaller regions of approximately equal area. This decomposition is done by adding vertices to the midpoints on each of the regions from the previous level, and then creating great‐circle arc segments to connect the new vertices in each region, as one can see in the rightmost part of Figure 2. The two rows on the left of Figure 2 illustrate the first six levels of the HTM decomposition, resulting from recursively dividing the original eight spherical triangles. The total number of regions n, for the level of resolution r, is given by
$n = 8 \times 4^{r}$. For more information about the HTM procedure, please refer to the paper by O'Mullane et al. (2000), which also describes the HEALPix decomposition that will be introduced in Section 5 of this article.
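Since the eight base triangles are each split into four per extra level, the region count grows geometrically with the resolution, which can be checked directly:

```python
def htm_regions(r):
    """Number of HTM regions at resolution r: eight level-0 spherical
    triangles, each recursively split into four per extra level."""
    return 8 * 4 ** r

# Region counts for the resolutions used by Dias et al. (2012), i.e. 4 to 10.
counts = [htm_regions(r) for r in range(4, 11)]
```

The endpoints of this range, 2,048 and 8,388,608 regions, match the figures reported for the experiments discussed below.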

The hierarchical triangular sphere decomposition that is used by the HTM procedure (Szalay et al. 2005)
In the experiments made by Dias et al. (2012), the resolution that was used in the HTM approach varied from 4 to 10, i.e. from 2,048 to 8,388,608 regions associated to language models. The authors also experimented with a greedy hierarchical procedure that involved two layers of classifiers, with a first classification layer considering a coarse‐grained HTM resolution, and a second layer using a finer resolution.
The authors represented the documents through n‐grams of either characters or terms (i.e. using 8‐grams of characters or word bigrams). The language models based on these representations were essentially generative approaches relying on the chain rule, smoothed by a linear interpolation with lower order models, and considering a probability of 1.0 for the sum of all sequences of a given length, following on ideas from previous publications concerned with the use of language models for document classification (Carpenter 2005; Peng et al. 2003). The authors estimated the distribution P(C) for each cell C in a set of cells J, and also the probability of a document d occurring within the region covered by each specific cell, i.e. P(d|C). With these two probabilities, it is then possible to estimate the a posteriori probability of a document belonging to a specific cell, i.e. P(C|d), through the application of Bayes theorem. These probabilities, combined with post‐processing techniques, allowed the authors to decide which is the most likely location of a given document.
Every document in the test set is assigned to the most similar region(s) from the set J, i.e. the one(s) with higher values for P(C|d). Finally, the latitude and longitude coordinates of each test document can be assigned, for instance based on the centroid coordinates of the most similar regions.
Four different post‐processing techniques were tested for the assignment of the coordinates to the documents: (1) the coordinates of the centroid of the most likely region; (2) the weighted geographic midpoint (Jenness 2008) of the coordinates of the most likely regions; (3) a weighted average of the coordinates of the neighboring regions, according to the HTM decomposition, of the most likely region; and (4) the weighted geographic midpoint of the coordinates of the k most similar training documents, contained in the most likely region for the document.
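The weighted geographic midpoint used by the second and fourth post-processing techniques can be sketched as follows, assuming the usual recipe of averaging weighted unit vectors in 3D Cartesian space (cf. Jenness 2008); the function name and data layout are ours, not taken from the paper:

```python
import math

def weighted_geographic_midpoint(points):
    """Weighted geographic midpoint of (lat, lon, weight) triples:
    convert each point to 3D Cartesian coordinates on the unit sphere,
    take the weighted average vector, and convert back to latitude
    and longitude (in degrees)."""
    x = y = z = total = 0.0
    for lat, lon, w in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += w * math.cos(phi) * math.cos(lam)
        y += w * math.cos(phi) * math.sin(lam)
        z += w * math.sin(phi)
        total += w
    x, y, z = x / total, y / total, z / total
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon
```

For instance, the equally-weighted midpoint of (0°, 0°) and (0°, 90°) is (0°, 45°); increasing a point's weight pulls the midpoint towards it.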
The second and third techniques require well-calibrated probabilities regarding the possible classes, while the approach based on language models that was used by the authors is known to produce extreme estimates. In order to calibrate the probabilities, the authors applied a sigmoid transformation to the estimates, with a parameter σ that was adjusted empirically.
The best results were obtained in the English Wikipedia collection, using n‐grams of characters together with the post‐processing method that used the k most similar documents. These best results corresponded to an average error of 265 km and a median error of 22 km. For the Spanish and Portuguese collections, the authors obtained average errors of 273 and 278 km, respectively, and median errors of 45 and 28 km, using the same procedures. The errors reported by the authors correspond to the distance between the ground‐truth location and the predicted one, using Vincenty's geodetic formulae (Vincenty 1975), which is slightly more accurate than the great‐circle distance.
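For reference, the great-circle distance mentioned above can be sketched with the haversine formula; this is the spherical approximation, whereas Vincenty's formulae model the Earth as an ellipsoid. The radius of 6,371 km is a common convention, not a value from the paper:

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance via the haversine formula, a simpler
    (and slightly less accurate) alternative to Vincenty's geodetic
    formulae for computing geocoding errors."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```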
4 Combinations of Different Models and Heuristics
This section details recent studies that introduced innovative approaches for geocoding textual documents, going beyond the language modeling methods in the previous section. These previous publications involved techniques such as: (1) the combination of language models from different sources (Laere et al. 2014); (2) the use of feature selection methods to improve the predictions (Han et al. 2014; Cheng et al. 2013); or (3) the use of document representations obtained through more sophisticated probabilistic topic models (Adams and Janowicz 2012). Laere et al. (2014) studied the use of textual information from social media (i.e. tags from Flickr photos and terms from Twitter messages), to help georeference Wikipedia documents.
The first step of the proposed algorithm involved finding the most probable region C, from a set of regions J, for a given document d, through the use of language models. Instead of using a regular grid for discretizing the geographic space, as done by Wing and Baldridge (2011), the authors relied on a k-medoids algorithm, which clusters the training documents into several regions. Each of these k clusters contains approximately the same number of documents, which means that small areas will be created for very dense regions, and larger areas for sparser regions. The k-medoids algorithm is similar to k-means, but more robust to outliers (Kaufman and Rousseeuw 1987).
After clustering, the authors applied the simple procedure outlined in Algorithm 1, based on a regular geodesic grid, to remove terms that are not geo‐indicative (i.e. terms occurring in documents that are dispersed all over the world, and that are thus not related to specific locations). This algorithm assigns a score to each term from the vocabulary, and the lower this score is, the more geo‐indicative is the corresponding term. The obtained scores are used to perform term selection, prior to the training of language models from the documents.
Algorithm 1. Geographic spread filtering algorithm, originally presented by Laere et al. (2014).
Place a grid over the world map, with each cell having sides that correspond to 1° of latitude and longitude
for each unique term wk in the training data do
    for each cell Cj ∈ J do
        Determine c, i.e. the number of documents in Cj containing term wk
        if c > 0 then
            for each Ci ∈ N(Cj), i.e. the neighbouring cells of Cj in J, do
                Determine ci, i.e. the number of documents in Ci containing term wk
                if ci > 0, and Ci and Cj are not already connected, then
                    Connect cells Ci and Cj
                end if
            end for
        end if
    end for
    score(wk) ← number of remaining connected components
end for
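The spread computation of Algorithm 1 can be sketched as follows, assuming a 1° grid and counting connected components of occupied cells with a depth-first search; the data layout and function names are ours, and the exact normalization of the published score is not reproduced here:

```python
from collections import defaultdict

def geographic_spread(docs, cell_deg=1.0):
    """Count, per term, the connected components of occupied grid
    cells, in the spirit of Algorithm 1; `docs` is a sequence of
    ((lat, lon), terms) pairs.  Fewer components suggests a more
    geo-indicative term."""
    occupied = defaultdict(set)  # term -> set of (row, col) grid cells
    for (lat, lon), terms in docs:
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        for term in terms:
            occupied[term].add(cell)

    def components(cells):
        seen, count = set(), 0
        for start in cells:
            if start in seen:
                continue
            count += 1
            stack = [start]          # depth-first search over the
            while stack:             # 8-connected neighbourhood
                r, c = stack.pop()
                if (r, c) in seen:
                    continue
                seen.add((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        if (r + dr, c + dc) in cells:
                            stack.append((r + dr, c + dc))
        return count

    return {term: components(cells) for term, cells in occupied.items()}
```

For instance, a term used only around one city yields a single connected component, while a term used on two continents yields two.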
Given a clustering of the training instances into J regions, and given the vocabulary V obtained from the document collection after the removal of non‐informative terms, it is then possible to estimate language models for each cluster. The authors estimated separate language models from Wikipedia, Flickr and Twitter, using the textual contents from Wikipedia articles and from the tweets, and/or the textual tags assigned to photos in Flickr.
For each test document d, the authors calculated the corresponding a posteriori probabilities P(C|d) for each cluster C ∈ J. Naïve Bayes, as explained in the previous section, was the method chosen to find the region C to which the document should be assigned. In order to deal with the issue of rare events, the authors relied on a Bayesian smoothing procedure with Dirichlet priors, corresponding to the following equation for estimating each term probability P(wi|C):
(12) P(wi|C) = (occ(wi, C) + μ · P(wi|D)) / (|C| + μ)

In the formula, occ(wi, C) is the number of occurrences of term wi in the documents of cluster C, |C| is the total number of term occurrences in those documents, and μ is the smoothing parameter. The probability of a term wi given the entire corpus of training documents D, i.e. P(wi|D), can be computed from the ratio between the total number of occurrences for term wi, and the total number of term occurrences in the entire corpus.
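The Dirichlet-smoothed estimate can be sketched as shown next; the value μ = 2500 is a common default from the information retrieval literature, not a value reported by the authors:

```python
from collections import Counter

def dirichlet_lm(cluster_terms, corpus_terms, mu=2500.0):
    """Dirichlet-smoothed term probability for a cluster:
    P(w|C) = (count(w, C) + mu * P(w|D)) / (|C| + mu),
    where P(w|D) is the term's relative frequency in the whole
    corpus and |C| is the cluster's total term count."""
    c_counts = Counter(cluster_terms)
    d_counts = Counter(corpus_terms)
    c_len, d_len = len(cluster_terms), len(corpus_terms)

    def p(w):
        return (c_counts[w] + mu * d_counts[w] / d_len) / (c_len + mu)

    return p
```

Note that the smoothed estimates still sum to one over the corpus vocabulary, as a proper distribution should.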
To find the most probable region for a document d, the authors proposed to combine the language models estimated from the different sources. This corresponds to the following equation, where the family of sets S contains the three sets of language models (i.e. one for each cluster C), estimated from Wikipedia, Flickr, and Twitter:

(13) P(C|d) ∝ P(C) · Π_{m ∈ S} Pm(d|C)^λm

In this last formula, the weight of a given set of models m can be controlled by the parameter λm. The authors found that a small weight should be assigned to the Twitter models, because tweets often contain noisy information.
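Working in log space, a weighted product of per-source probabilities becomes a weighted sum of log-probabilities. A minimal sketch, with illustrative source names and λ weights (the actual tuned values are not given here):

```python
def combine_sources(per_source_logprob, weights):
    """Combine per-source log-probabilities log P_m(d|C) for each
    region C with source weights lambda_m, and return the region
    with the highest combined score (a weighted sum in log space,
    i.e. a weighted product of the probabilities)."""
    regions = set().union(*(s.keys() for s in per_source_logprob.values()))
    scores = {c: sum(weights[m] * s.get(c, float("-inf"))
                     for m, s in per_source_logprob.items())
              for c in regions}
    return max(scores, key=scores.get)
```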
Due to memory limitations, the authors only considered the top 100 regions with the highest probabilities for each test document and for each set of language models m ∈ S, which means that the exact probabilities for some regions C ∈ J will be missing from the estimate that is used in Equation 13. The probability of a given document d being assigned to a region C is therefore approximated from the probabilities that remain available. After discovering the area C that best relates with document d, the authors experimented with three different ways for choosing the geospatial coordinates for d, namely approaches that involve: (1) the medoid; (2) the Jaccard similarity coefficient; and (3) a similarity score obtained through the use of a well-known information retrieval system, named Lucene (https://lucene.apache.org), for indexing the collection of training documents.
The medoid is calculated by searching for the document d̂ that is closest to all the other documents in C, and assigning its coordinates to document d:

(15) d̂ = argmin_{dj ∈ C} Σ_{dk ∈ C} dist(dj, dk)

In the previous equation, dist(dj, dk) is the geodesic distance between the coordinates of documents dj and dk, both belonging to a region C, for instance as computed through the great-circle distance procedure.
The approach that leverages the Jaccard similarity coefficient is based on searching for the training document in the region C that is the most similar to d, according to this particular similarity measure, and then assigning its coordinates to d. The Jaccard similarity coefficient is computed through the ratio between the number of terms that are common to the two documents, and the number of distinct terms occurring in either of them.
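Both the medoid of Equation 15 and the Jaccard coefficient can be sketched in a few lines; the distance function is left as a parameter, since the paper uses geodesic distances (the toy distance in the usage example below is a placeholder, not the real measure):

```python
def medoid(region_docs, dist):
    """Equation 15 sketch: the document in the region whose summed
    distance to all the other documents is minimal."""
    return min(region_docs,
               key=lambda d1: sum(dist(d1, d2) for d2 in region_docs))

def jaccard(terms1, terms2):
    """Jaccard coefficient: shared distinct terms over the distinct
    terms occurring in either document."""
    a, b = set(terms1), set(terms2)
    return len(a & b) / len(a | b) if a or b else 0.0
```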
Finally, the procedure based on Lucene is similar to the previous one, except that it uses Lucene's internal similarity measure, which relies on a dot product between vector representations for the documents (i.e. Lucene leverages vector-space model representations (Salton et al. 1975), in which documents correspond to vectors in a |V|-dimensional space, and where the individual terms are weighted according to the TF-IDF procedure (Salton and Buckley 1988), which will be described in the following section).
In their experiments, Laere et al. (2014) started by using the Wikipedia dataset from Wing and Baldridge (2011). However, they detected a number of shortcomings in the aforementioned dataset, such as the absence of a distinction between articles that describe a specific point location and articles whose geospatial footprint cannot be approximated to a single location, such as countries and rivers. For these reasons, the authors created a new dataset, where every test document has a precise location inside the bounding box of the UK. This dataset has a similar size to the one used by Wing and Baldridge (2011), containing 21,839 test articles and 376,110 training articles.
The authors also created two social media datasets, namely one from Flickr containing 32 million georeferenced photos associated to descriptive tags, and another one from Twitter containing 16 million georeferenced tweets.
When using training documents from Wikipedia, the baseline results corresponded to a median error of 4.17 km for the task of geocoding Wikipedia documents. The best results for the combination of training documents from Twitter and Wikipedia lowered the median error to 3.69 km, when geocoding Wikipedia documents. The best combination of training documents from Wikipedia and Flickr resulted in a median error of just 2.16 km, and when combining Wikipedia, Twitter and Flickr, the best results corresponded to a median error of 2.18 km.
Cheng et al. (2010, 2013) have, in turn, focused on the evaluation of document geocoding techniques when applied to the task of predicting Twitter user locations, based on the text of their tweets. The authors argued that effectively geo‐locating a Twitter user, based on the textual contents of his or her messages, is particularly challenging, given that Twitter status updates are inherently noisy, mixing a variety of daily interests (e.g. food, sports, chatting with friends, etc.) and often relying on the abbreviations and non‐standard vocabulary that is typical of informal communication over the Internet. Traditional gazetteer concepts and proper place names may therefore not be present in the content of tweets. Moreover, even if we could isolate the location‐sensitive terms, each user may have interests that span multiple places beyond their immediate home locations, meaning that the content of tweets may be skewed towards words and phrases that are more consistent with outside locations (e.g. New Yorkers may post about NBA games in Los Angeles).
The approach proposed by Cheng et al. (2010) combines a standard generative classifier based on word unigrams, as described in the previous section, together with a separate probabilistic method for identifying words in tweets having a local geographic scope (i.e. only these terms are selected as features for classification). The authors also proposed using a novel model smoothing approach, for refining the probability estimates.
Concretely, given the words extracted from a user's tweets (i.e. given a document d), the authors propose to estimate the probability of that user being located in a city C from a set of cities J, according to the following model:

(16) P(C|d) = Σ_{wi ∈ d} P(C|wi) · P(wi)

In the formula, P(wi) denotes the probability of the word wi in the whole dataset of users' tweets, and P(C|wi) identifies, for each word wi, the likelihood that it was issued by a user located in city C ∈ J. Both probabilities can be obtained simply by counting and normalizing events in the training data, and the city with the highest probability can be taken as the user's estimated location. In order to handle data sparseness in terms of the different unigrams that are used within tweets, the authors considered different approaches for smoothing the probability distributions, including standard Laplace smoothing as introduced in the previous section, and data-driven geographic smoothing approaches, which take geographic nearness into consideration by looking at the neighbors of a city at different granularities (i.e. by smoothing the distribution considering the overall prevalence of words within states, or by considering a lattice-based neighborhood approach for smoothing at a more refined city-level scale).
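Equation 16 amounts to a simple accumulation over the words in a user's tweets; in this sketch the probability tables hold illustrative toy values, not estimates from real data:

```python
def city_posterior(doc_words, p_city_given_word, p_word):
    """Score each candidate city as sum over words wi in the document
    of P(C|wi) * P(wi); the highest-scoring city is taken as the
    user's estimated location.  Words absent from the tables simply
    contribute nothing."""
    scores = {}
    for w in doc_words:
        for city, p in p_city_given_word.get(w, {}).items():
            scores[city] = scores.get(city, 0.0) + p * p_word.get(w, 0.0)
    return scores
```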
In the case of state-level smoothing, the authors start by aggregating the probabilities of a word wi over the cities located within a specific state, and consider the average of the summation as the probability P(S|wi) of the word wi occurring in the state S. The probability for word wi to be located in a city C can then be obtained from a combination of the city probability and the state probability, according to:

(17) P′(C|wi) = λ · P(C|wi) + (1 − λ) · P(S|wi)

In the formula, C stands for a city in the state S, and 1 − λ is the amount of smoothing (i.e. a small value of λ indicates a large amount of state-level smoothing).
The lattice-based neighborhood smoothing procedure instead starts by dividing the map of the continental US into lattices of 1° by 1° square degrees. Letting wi denote a specific word, L a lattice, and CL the set of cities in lattice L, the per-lattice probability of a word wi can be formalized as P(L|wi) = (1/|CL|) · Σ_{C ∈ CL} P(C|wi). In addition, the authors considered the lattices around L as its neighbors N(L) (i.e. they also used the nearest lattices in all eight directions). Introducing μ as the parameter of neighborhood smoothing, the lattice probability is updated as follows:

(18) P′(L|wi) = μ · P(L|wi) + ((1 − μ) / |N(L)|) · Σ_{L′ ∈ N(L)} P(L′|wi)

In order to use the smoothed lattice-based probability, another parameter λ is introduced to aggregate the real probability of wi being issued from the city C, and the smoothed lattice probability. Finally, the lattice-based per-city word probability can be formalized as follows, where C is a city with a center point within the lattice L:

(19) P″(C|wi) = λ · P(C|wi) + (1 − λ) · P′(L|wi)

When tokenizing the tweets into word unigrams, the authors eliminated all occurrences of a standard list of stop words, as well as Twitter screen names (i.e. tokens starting with @), hyperlinks, and punctuation. Moreover, when generating the word distributions, the authors only considered words that occurred at least 50 times, and disregarded variations of similar words. Specifically, the authors used an adapted version of the Jaccard similarity coefficient to check whether a newly encountered token is a different word type, thus handling informal spellings as a single unique token (e.g. awesoome, awesooome or awesome would be considered as a single word type). In order to further filter the list of words, avoiding noise items that do not convey a strong sense of location, the authors proposed a separate procedure based on previous work by Backstrom et al. (2008), in which the authors introduced a model of spatial variation for analyzing the geographic distribution of terms in search engine query logs.
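The smoothing steps of Equations 17-19 are all linear interpolations, and can be sketched as follows; the μ and λ values used here are illustrative, not the values tuned by the authors, and Equation 17's state-level interpolation has the same shape as `city_prob`:

```python
def lattice_prob(cities_in_lattice, p_city_given_word):
    """Per-lattice probability: average of P(C|w) over the cities
    whose centers fall inside the lattice cell."""
    probs = [p_city_given_word.get(c, 0.0) for c in cities_in_lattice]
    return sum(probs) / len(probs) if probs else 0.0

def smooth_lattice(p_lat, neighbour_p_lats, mu=0.7):
    """Equation 18 sketch: interpolate a lattice's probability with
    the mean over its (up to eight) neighbouring lattices."""
    if not neighbour_p_lats:
        return p_lat
    return mu * p_lat + (1 - mu) * sum(neighbour_p_lats) / len(neighbour_p_lats)

def city_prob(p_city, p_lat_smoothed, lam=0.8):
    """Equation 19 sketch: combine the raw per-city probability with
    the smoothed probability of the lattice containing the city."""
    return lam * p_city + (1 - lam) * p_lat_smoothed
```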
Intuitively, a local word is one with a high local focus and a fast dispersion, i.e. a word that is very frequent at some central point and whose use then drops off rapidly as we move away from that point. Non-local words, on the other hand, may have multiple central points with no clear dispersion. These ideas were used in the model of spatial variation for the geographic distribution of terms, which assumes a center point for each term (i.e. the point at which the term should occur most frequently, with frequency then falling off according to distance from the center) and two parameters: (1) a constant c which identifies the frequency at the center point; and (2) an exponent α which controls how fast the frequency falls as we move further away from the center. The formula for the model, which computes the probability of a term wi being related to a place at a distance x from the center point, is given by P(x) = c · x^(−α). A larger value for α identifies a more compact geographic scope for a word, while a smaller value for α is associated with a more global distribution.
In the context of tweets, the authors proposed to determine the focus c and dispersion α for each tweet word wi, by deriving the optimal parameters that fit the observed data. For a word wi, given the center location of a city C together with the central point for the word, the frequency c and the exponent α, the authors computed the maximum-likelihood value. Considering the city C, suppose all users tweet the word wi from the city a total of n times. The authors multiply the overall probability by (c · dC^(−α))^n and, if no users in the city C tweet the word wi, they multiply the overall probability by 1 − c · dC^(−α). In the formula, dC is the distance between city C and the center point of the word wi. By adding logarithms of probabilities instead of multiplying probabilities, in order to avoid underflows, the complete equation for the likelihood value is as follows:

(20) f(c, α) = Σ_{C : nC > 0} nC · log(c · dC^(−α)) + Σ_{C : nC = 0} log(1 − c · dC^(−α))

In the previous equation, nC is the number of tweets from city C that contain the word wi, so the first summation ranges over the cities with tweets containing wi, while the second ranges over the remaining cities.
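The log-likelihood of Equation 20 can be sketched as follows, assuming per-city tweet counts and precomputed distances to the candidate center point (the data layout is ours):

```python
import math

def log_likelihood(c, alpha, city_counts, distances):
    """Log-likelihood of the spatial variation model p = c * d**(-alpha):
    cities where the word was tweeted n > 0 times contribute
    n * log(c * d**-alpha), and cities where it never appears
    contribute log(1 - c * d**-alpha)."""
    ll = 0.0
    for city, n in city_counts.items():
        p = c * distances[city] ** (-alpha)
        ll += n * math.log(p) if n > 0 else math.log(1.0 - p)
    return ll
```

Maximizing this value over c and α, for a chosen center point, yields the parameters that best fit the observed tweet counts.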
Backstrom et al. (2008) proved that f(c,α) has exactly one local maximum over its parameter space, which means that when a center point is chosen, we can iterate c and α to find the largest value for the likelihood. Given the focus c and dispersion α for every word, one can label as local words all tweet terms with a sufficiently high focus and fast dispersion. However, instead of using thresholds over these values, the authors used a local word classifier (i.e. a model corresponding to a classification tree) to assign words to either a local or a non‐local class, using as features the values returned by the aforementioned model, together with coordinates for the geo‐center of the word and the count of the word's occurrences over a large corpus. A total of 3,183 words occurring in a large Twitter corpus were classified as local words, and they were then used in the computation of the distributions.
The experimental evaluation of the method from Cheng et al. (2010, 2013) was based on Twitter users located within the continental US. For building the testing dataset, the authors crawled Twitter data and selected tweets from users that, in their profiles, listed locations that have a valid city‐level label in the form of <cityName>, <cityName,stateName> or <cityName,stateAbbreviation>, considering all valid cities listed in the Census 2000 US Gazetteer from the US Census Bureau. The resulting dataset had a total of 130,689 users with 4,124,960 status updates, representative of the actual population of the US. For their test data, the authors gathered a separate set of active users with over 1,000 tweets in their timelines, who had listed their location in the form of latitude and longitude coordinates. The test dataset consisted of 5,190 users, also distributed across the continental US, with more than 5 million tweets in total.
Through their experiments, the authors found that, on average, 51% of randomly sampled Twitter users are placed within 100 miles of their actual geospatial location, through the analysis of just 100 of their tweets. The average error distance was, in this case, 670 miles. Increasing the amount of data results in more precise location estimates and, when leveraging all the available data (i.e. using over 1,000 tweets per user), the average error distance was 536 miles. Without considering model smoothing or feature selection based on locality and dispersion, only 10.12% of the 5,119 users in the test set were geo-located within 100 miles of their real location, and the average error distance was 1,773 miles. When using feature selection without smoothing, approximately 49.8% of the Twitter users were geo-coded correctly, and the results when using Laplace smoothing or state-level smoothing corresponded to accuracies of 48% and 50%, respectively. The authors argued that the obtained results clearly attest to the importance of using proper feature selection and model smoothing approaches.
Han et al. (2014) also focused on predicting Twitter user locations, based on the text of their tweets (i.e. using Naïve Bayes classifiers that follow the same ideas presented in the previous section). The authors tested the use of numerous feature selection methods to extract geo-indicative words (e.g. place names, dialectal words, local slang and local references) from the tweets, instead of using the complete term vocabulary for building the language models.
The Information Gain Ratio (IGR) was the authors' first choice for a term selection metric, which can be computed as the ratio between the information gain of a term, relative to a set of classes, and its intrinsic entropy. As classes, the authors chose to use city-level aggregations for the tweets. Following Quinlan (1993), the IGR of a particular term wi given a set of classes J can be estimated as shown next, where P(wi) is the probability of observing the word wi, whereas P(w̄i) is the probability of not observing the word wi:

(21) IGR(wi) = (H(J) − P(wi) · H(J|wi) − P(w̄i) · H(J|w̄i)) / (−P(wi) · log P(wi) − P(w̄i) · log P(w̄i))

In the previous equation, H(J) is the entropy of the class distribution, while H(J|wi) and H(J|w̄i) are the conditional entropies of the classes given, respectively, the presence or the absence of the term; the denominator corresponds to the intrinsic entropy of the term itself. Another method that was tested for the task of finding geo-indicative words was based on the notion of Geographic Density (GeoDen), under the assumption that one should consider selecting the terms that occur in dense regions. The geographic density of a term wi is given by:

(22) GeoDen(wi) = (Σ_{Cj ∈ Jwi} P(Cj|wi)) / ((Σ_{Cj, Ck ∈ Jwi, j ≠ k} dist(Cj, Ck)) / (|Jwi| · (|Jwi| − 1)))

In the previous equation, Jwi is the subset of cities from J that are associated to training documents where the word wi is used, while dist(Cj, Ck) is the great-circle distance between cities Cj and Ck; the denominator thus corresponds to the average pairwise distance between the cities where the term occurs.
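The density computation of Equation 22 can be sketched as follows; how the original work handles terms occurring in a single city is not stated in this survey, so that case is handled with an assumption here:

```python
def geoden(p_city_given_word, dist):
    """GeoDen sketch: sum of P(C|w) over the cities where the word
    occurs, divided by the average pairwise distance between those
    cities; terms used in a few nearby cities score highest."""
    cities = list(p_city_given_word)
    num = sum(p_city_given_word.values())
    if len(cities) < 2:
        return num  # assumption: a single city has no pairwise distances
    pairs = [(a, b) for i, a in enumerate(cities) for b in cities[i + 1:]]
    avg = sum(dist(a, b) for a, b in pairs) / len(pairs)
    return num / avg
```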
The authors evaluated the different approaches in the task of predicting the home location for Twitter users, considering several different datasets, including the North American dataset from the study by Roller et al. (2012), which contains 38 million tweets from 500,000 users in 378 cities. The authors report a median error of 571 km when using the full-text, 260 km when using the IGR as a feature selection criterion, and 282 km when using the GeoDen function. The authors also state that, over this dataset, approximately 17, 26 and 25% of the Twitter users could be assigned to the correct city, respectively when using the full-text, the IGR or the GeoDen strategies. In all three strategies, users could be assigned to the correct country with an accuracy of approximately 80%. One of the tests performed by these authors involved the use of a very large set of non-geotagged tweets, as a complement to a dataset of Twitter messages collected from all over the world (i.e. each Twitter user was associated to the contents of his or her messages, considering both geotagged and non-geotagged messages). The authors showed that the addition of non-geotagged contents could lead to a better estimation of language models, significantly improving the obtained results (e.g. the median error dropped from 913 to 170 km in the tests with this sample of Twitter users from all over the world, when complementing the training data with information from the non-geotagged tweets).
In another previous study, Adams and Janowicz (2012) demonstrated that, besides place names, some other natural language expressions can also be highly geo-indicative. For instance, the words traffic, income, skyline, government, poverty or employment probably refer to a large city, whereas park, hike, rock, flow, water, above or view most likely occur in the context of a description for a national park. The occurrence of these words can thus inform models for the estimation of geographic locations from textual contents.
In order to test if general terms can indeed provide good hints to discover the geographic location associated with a document, the authors relied on experiments with two different data sources, namely travel blog entries and Wikipedia articles in English. When pre‐processing the data, besides removing stop‐words and stemming the word tokens (i.e. reducing inflected and/or derived words into their base form), the authors also removed the occurrences of all the place names from every document, relying on the Yahoo Placemaker (https://developer.yahoo.com/geo/placemaker/) Web service to identify the places that were mentioned in the texts.
Adams and Janowicz (2012) divided the Earth's surface using a geodesic grid with a regular width and height, based on decimal degrees. Each training document was then assigned to the cell where its location is contained. A cell can be seen as the concatenation of all the training documents contained in its region, similar to what has been said for several of the studies that were previously discussed.
After the data pre-processing stage, Adams and Janowicz (2012) applied the Latent Dirichlet Allocation (LDA) technique (Blei et al. 2003) to discover a set T of latent topics from the corpus of documents D associated to geospatial coordinates. The result is a |T|-sized vector θd, created for each document, with the corresponding probability for each topic. According to the authors, these vectors can be seen as representing the observed frequency of each topic at the document's geospatial location.
LDA is essentially another type of generative probabilistic model which assumes that documents result from a mixture of topics. Each document is associated to a distribution over the different topics, and each topic has a distribution over all the possible words. This model assumes that, in a document, words are chosen first by selecting a topic, according to the document's topic distribution, and then by selecting a word according to the selected topic's distribution over the words. Figure 3 illustrates the original LDA topic model, using the plate notation to capture the dependencies among the many variables, in a concise representation. The boxes are plates representing replicates of variables. The outer plate represents a set of M documents, while the inner plate represents the repeated choice of topics and words within a document containing N terms. In the figure, the variables z represent the specific topics associated to each word w in each document, sampled from the corresponding θ topic distribution, which is in turn parametrized with a Dirichlet prior α. The per‐topic word distributions are also parametrized with a Dirichlet prior β. For more information about the LDA topic model, please refer to the article by Blei et al. (2003).

Plate notation representation for the LDA topic model (Blei et al. 2003)
After finding the parameters of the LDA topic model, and after associating each document d to a topic vector θd, the following step was to estimate the topic vector for each cell in the discrete representation of the Earth, by averaging the topic values for each document in that given cell. Finally, the centroid geospatial coordinates of all the documents in each cell are associated with the cell's corresponding topic vector.
The authors calculated, for each different topic, a probability surface over the geographic space, by applying the Kernel Density Estimation (KDE) method (Carlos et al. 2010; Brunsdon 1995), using the cells' centroid points together with the corresponding topic values, and discarding the cell points whose topic value equals zero. KDE can be seen as a generalization of histogram-based density estimation, using a kernel function at each point instead of relying on counts over a discrete grid. More formally, the kernel density estimate for a point pi is given by the following equation:

(23) P(pi) = (1/(|D| · h)) · Σ_{pj ∈ D} K(dist(pi, pj)/h)

In the formula, dist(pi, pj) is the geospatial distance between points pi and pj, and |D| is the total number of georeferenced points. The kernel bandwidth is given by the parameter h, which controls the maximum area in which an occurrence has influence. Choosing a correct value for h is very important: if it is too low, the estimates will be under-smoothed, whereas if it is too high the estimates will be over-smoothed, since each point will affect a large area. The authors opted to set the bandwidth parameter to twice the width of the geodesic grid squares that were used initially. Finally, K(.) is the kernel function, which integrates to one and controls how the density diminishes as the distance to the target location increases. Adams and Janowicz (2012) used the Epanechnikov kernel function, given by:

(24) K(u) = (3/4) · (1 − u²) if |u| ≤ 1, and K(u) = 0 otherwise

Having the probability surfaces for each different topic, one can estimate a test document's location. The first step is to create a topic vector for the test document, through LDA, and then compute a weighted raster overlay via map algebra operations over the topic probability surfaces, where only the topics with a probability greater than random are considered, and where the weight for each topic is the product of the test document's topic weight and its normalized inverse entropy (i.e. a measure of the topic's geo-indicativeness).
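The kernel density estimate of Equations 23 and 24 can be sketched as follows; the distance function and bandwidth are left as parameters, and the toy distance used in the test below is a placeholder for the geospatial distance:

```python
def epanechnikov(u):
    """Epanechnikov kernel: 3/4 * (1 - u^2) for |u| <= 1,
    and 0 beyond the bandwidth."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(point, data, h, dist):
    """Kernel density estimate at `point`: kernel-weighted
    contributions from every reference point, scaled by the number
    of points and the bandwidth h."""
    return sum(epanechnikov(dist(point, p) / h) for p in data) / (len(data) * h)
```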
In order to test this method, sets of 200 held-out documents were used from the travel blog and Wikipedia datasets. The best results were obtained when using the top 30 most likely locations to estimate a document's coordinates (i.e. taking the midpoint of these top 30 locations). Over the Wikipedia dataset, this strategy predicted 75% of the articles within 505 km of the true location, and 50% of the articles had errors under 80 km. For the travel blog dataset, half of the instances had their location predicted with a distance error smaller than 1,169 km.
5 Recent Approaches Based On Discriminative Classifiers
In a recent study, Wing and Baldridge (2014) improved on the results of language modeling methods for document geocoding, by using discriminative classification models. The authors combined a hierarchical division of the Earth's surface into rectangular regions with the use of logistic regression classifiers, effectively relying on a greedy search procedure, over the hierarchy, to reduce the time and the storage space required to train and apply the models.
Within the area of natural language processing, logistic regression models are commonly used to address binary classification tasks (Berger et al. 1996). Multi-class problems can also be handled through these models, for instance through the use of the one-versus-all scheme (Rifkin and Klautau 2004), which involves training a single classifier per class, with the samples of that class as positive examples and all other samples as negatives. The probability of a logistic regression model assigning the positive class y = +1 to a document x is estimated according to:

(25) P(y = +1|x) = 1 / (1 + e^(−θᵀx))

In the previous equation, x is the document to be classified, represented by a vector of features (e.g. one feature for each of the n terms in the vocabulary, capturing the association between the term and the document), and θ is a vector of weights over the features, chosen according to the following minimization problem:

(26) min_θ λ‖θ‖² + Σ_{i=1}^{M} log(1 + e^(−yi · θᵀxi))

In the equation above, λ > 0 is a parameter controlling the amount of regularization during model training, and the weights are influenced by the training data {(xi, yi)}_{i=1}^{M}, with xi ∈ ℝⁿ and yi ∈ {−1, +1}. The minimization problem corresponds to a convex function, so there is a unique global minimum with respect to θ. Unfortunately, there is no closed-form solution, so iterative methods (e.g. stochastic gradient descent) are commonly used. After model training, a maximum a posteriori rule can again be used to perform classification (i.e. we can assign the class with the highest probability value, as estimated by the model).
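Equation 25 and the one-versus-all prediction rule can be sketched as follows, assuming the per-class weight vectors have already been trained (the toy weights in the test below are illustrative):

```python
import math

def p_positive(theta, x):
    """Logistic model: P(y = +1 | x) = 1 / (1 + exp(-theta . x))."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def one_vs_all_predict(models, x):
    """One-versus-all prediction: score the document with every
    per-class binary model and return the class whose model assigns
    the highest positive-class probability."""
    return max(models, key=lambda c: p_positive(models[c], x))
```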
Discriminative classification approaches such as logistic regression models are generally unable to scale to encompass several thousand or more distinct classes. To overcome the limitations of discriminative classifiers, in terms of the maximum number of cells they can handle, Wing and Baldridge (2014) proposed using a hierarchical classification scheme, inspired by previous work from Silla and Freitas (2011).
To construct the hierarchy, the authors start with a root cell that spans the entire Earth. From there, they build a tree of cells at different scales, from coarse to fine. A cell at a given level is subdivided to create smaller cells at the next level of resolution that altogether cover the same area as their parent. Wing and Baldridge (2014) have specifically experimented with different approaches for discretizing the Earth's surface according to a hierarchy of cells, based on uniform grids or relying on a k-d-tree. In the case of uniform grids, each cell was sub-divided according to a factor r (i.e. each cell is divided into r² sub-cells), also ignoring the empty cells (i.e. those that do not contain any training documents). In the case of k-d-trees, a sub-division factor r is also considered, and if the level 1 sub-division has a bucket size of b, then the level 2 sub-division will have a bucket size of b/r, the level 3 sub-division will have a bucket size of b/r², and so on.
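The uniform-grid case can be sketched in a few lines: each cell is split into r × r sub-cells, and sub-cells without any training documents are discarded. The cell representation (bounding boxes as tuples) and the toy document locations are invented for this illustration.

```python
# A minimal sketch of the uniform-grid subdivision described above: each cell
# is split into r x r sub-cells, and sub-cells that contain no training
# documents are discarded. The representations here are invented for clarity.

def subdivide(cell, r):
    """Split a cell (min_lat, min_lon, max_lat, max_lon) into r*r sub-cells."""
    min_lat, min_lon, max_lat, max_lon = cell
    dlat = (max_lat - min_lat) / r
    dlon = (max_lon - min_lon) / r
    return [(min_lat + i * dlat, min_lon + j * dlon,
             min_lat + (i + 1) * dlat, min_lon + (j + 1) * dlon)
            for i in range(r) for j in range(r)]

def contains(cell, point):
    lat, lon = point
    return cell[0] <= lat < cell[2] and cell[1] <= lon < cell[3]

def non_empty_children(cell, r, doc_locations):
    """Keep only the sub-cells that contain at least one training document."""
    return [c for c in subdivide(cell, r)
            if any(contains(c, p) for p in doc_locations)]

root = (-90.0, -180.0, 90.0, 180.0)
docs = [(38.7, -9.1), (51.5, -0.1)]     # two documents in the same quadrant
children = non_empty_children(root, 2, docs)
```

Applying `non_empty_children` recursively to each surviving cell yields the coarse-to-fine tree of cells described above.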
Leveraging one of these hierarchical divisions of the Earth's surface, the authors used a local classifier per parent approach (Silla and Freitas 2011), in which an independent classifier is learned for every node of the hierarchy above the leaf nodes. The probability of a node C_j in the hierarchy is the product of the local probability for that node, conditioned on its parent, and the probability of all the ancestors of the node, given by the following recursive equation:

(27) P(C_j) = P(C_j \mid \mathrm{Parent}(C_j)) \times P(\mathrm{Parent}(C_j))

In the previous equation, \mathrm{Parent}(C_j) corresponds to the parent node of C_j in the hierarchy, and the probability of the root node is taken to be one. In order to avoid computing the probability of each leaf cell in Equation 27 (i.e. finding the most probable leaf node would involve computing probabilities for all possible paths from the root to the leaves), the authors instead used a stratified beam search, where only the b cells with the highest probability are kept at each level. A tight beam drastically reduces the number of classifications that are required.
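The stratified beam search can be sketched as follows, with a toy two-level hierarchy and hand-picked local classifier probabilities standing in for the outputs of the per-parent classifiers (all names and values below are invented for the example).

```python
# A sketch of the stratified beam search over Equation 27: at each level of
# the hierarchy, path probabilities are extended by the local classifier
# output P(child | parent), and only the top-b cells are kept.

def beam_search(children, local_prob, root, beam_width, depth):
    """Return (cell, probability) pairs surviving a stratified beam search.

    children(cell)   -> list of child cells;
    local_prob(cell) -> P(cell | parent(cell)), the local classifier output.
    """
    frontier = [(root, 1.0)]                          # P(root) = 1
    for _ in range(depth):
        expanded = [(child, p * local_prob(child))    # Equation 27
                    for cell, p in frontier
                    for child in children(cell)]
        expanded.sort(key=lambda t: t[1], reverse=True)
        frontier = expanded[:beam_width]              # keep only the top-b cells
    return frontier

# A toy 2-level hierarchy: root -> {A, B}; A -> {A1, A2}; B -> {B1, B2}.
tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "B2"]}
probs = {"A": 0.6, "B": 0.4, "A1": 0.3, "A2": 0.7, "B1": 0.9, "B2": 0.1}

result = beam_search(lambda c: tree.get(c, []), probs.get, "root",
                     beam_width=2, depth=2)
best_leaf, best_prob = result[0]
```

With a beam of two, both children of the root survive the first level, so the leaf with the highest overall probability (0.6 × 0.7 = 0.42) is found without scoring every possible path.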
In order to further increase the computational performance, the authors relied on a very efficient implementation of logistic regression classifiers (https://github.com/JohnLangford/vowpal_wabbit), which leverages an online procedure for model training, together with feature hashing (i.e. an efficient strategy for building fixed‐sized feature vectors representing each document through frequency counts of terms, which relies on a hash function for assigning each term to a specific position in the feature vector).
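Feature hashing, as used in that implementation, can be illustrated with a short sketch; the particular hash function and the dimensionality of 16 are arbitrary choices for this example.

```python
import hashlib

# A minimal sketch of the feature hashing ("hashing trick") mentioned above:
# each term is mapped to a fixed-size vector position by a hash function, so
# the feature space stays bounded regardless of the vocabulary size.

def hashed_vector(tokens, n_features=16):
    vec = [0] * n_features
    for tok in tokens:
        # A stable hash (Python's built-in hash() is salted across runs).
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n_features] += 1        # frequency count at the hashed position
    return vec

v = hashed_vector("new york new york city".split())
```

No term-to-index dictionary needs to be stored; the cost is that distinct terms may occasionally collide into the same position.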
In their experiments, the authors compared the aforementioned procedure, based on logistic regression, against three baseline methods, namely: (1) a naïve Bayes classifier with Dirichlet smoothing; (2) a naïve Bayes classifier with Dirichlet smoothing and considering only the top N features according to their Information Gain Ratio (IGR); and (3) a non‐hierarchical logistic regression model, equivalent to relying only on the leaf nodes. The authors also relied on several different datasets, namely: (1) samples from the English, German and Portuguese versions of Wikipedia, taken from dumps produced in 2014; (2) Twitter datasets with users located within the US or from all over the world; and (3) the CoPhIR dataset of georeferenced images from Flickr, together with their corresponding descriptive tags (Bolettieri et al. 2009).
The document geocoding approach based on the hierarchy of logistic regression classifiers, and relying on a k‐d‐tree discretization, achieved the best results for the English Wikipedia (i.e. a mean error of 168.7 km and a median error of 15.3 km). The naïve Bayes baseline achieved a median error of 21.1 km, and worse results were obtained when using the IGR as a feature selection criterion. On the Twitter datasets, the best results corresponded to a median error of 170.5 km over the US dataset, and a median error of 490 km on the dataset covering the entire world. Both of these cases corresponded to the use of a hierarchy of logistic regression classifiers with a regular discretization, although results in terms of the mean error were slightly better when using a k‐d‐tree discretization.
In our own previous work, we have also evaluated a hierarchical approach based on discriminative classifiers, in this case using a different discretization of the geographic space and relying on linear Support Vector Machines (SVMs) as the classifiers for each level in the hierarchy (Melo and Martins 2015).
The documents were again represented as feature vectors, where the features correspond to individual terms and where the weights were computed according to the term frequency times inverse document frequency (TF-IDF) procedure (Salton and Buckley 1988). When using TF-IDF, the weight for a term, in the context of a given document, is obtained by multiplying a term frequency component (i.e. the number of times the term occurs in the document, normalized in order to deal with large documents and/or very frequent terms, for instance through a logarithmic dampening) by an inverse document frequency component (i.e. the ratio between the number of documents in the training dataset and the number of documents where the term occurs, also dampened through the logarithm). Notice that this differs from the term weighting procedure used in the previous work by Wing and Baldridge (2014), which relied on term frequency alone.
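The TF-IDF weighting can be sketched directly from its definition; the toy corpus below is invented for illustration.

```python
import math

# A small sketch of TF-IDF term weighting: a logarithmically dampened term
# frequency multiplied by the log-dampened inverse document frequency.
# The toy corpus is invented for the example.

corpus = [["lisbon", "river", "bridge"],
          ["london", "river", "thames"],
          ["porto", "river", "lisbon"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    df = sum(1 for d in corpus if term in d)       # document frequency
    idf = math.log(len(corpus) / df)               # inverse document frequency
    return (1.0 + math.log(tf)) * idf              # sublinear TF times IDF

w_lisbon = tf_idf("lisbon", corpus[0], corpus)     # occurs in 2 of 3 documents
w_river = tf_idf("river", corpus[0], corpus)       # occurs in every document
```

A term like "river" that appears in every document gets a weight of zero, while a more region-discriminative term like "lisbon" keeps a positive weight, which is exactly the behavior desired when weighting geographic evidence.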
As a discretization procedure, Melo and Martins (2015) relied on the Hierarchical Equal Area isoLatitude Pixelization (HEALPix) technique (O'Mullane et al. 2000; Górski et al. 2005), which can be used to produce a recursive multi-level subdivision of a spherical approximation to the Earth's surface, based on curvilinear quadrilateral regions that, in each subdivision and similarly to the aforementioned HTM procedure, cover an equal surface area. The base resolution for HEALPix contains 12 regions, and each region can be recursively subdivided into four new ones – see Figure 4, adapted from the original illustration provided on the HEALPix website (http://healpix.jpl.nasa.gov), which shows how a sphere can be partitioned into 12, 48, 192, and 768 regions. There is a parameter r that controls the resolution of the HEALPix decomposition, which takes values that are always a power of two. The total number of regions n for a given resolution level r is given by n = 12 \times r^2. Melo and Martins (2015) have specifically evaluated their geocoding method using a hierarchy of HEALPix discretizations where r is equal to 4, 64, 256, and 1,024.
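The relation n = 12 × r² is easy to check in code, both for the four resolutions shown in the figure and for the hierarchy of resolutions used by Melo and Martins (2015).

```python
# The HEALPix resolution parameter r is always a power of two, and each
# resolution level partitions the sphere into n = 12 * r^2 equal-area cells.

def healpix_num_cells(r):
    assert r >= 1 and (r & (r - 1)) == 0, "r must be a power of two"
    return 12 * r * r

figure_levels = [healpix_num_cells(r) for r in (1, 2, 4, 8)]
hierarchy = [healpix_num_cells(r) for r in (4, 64, 256, 1024)]
```

The first list reproduces the 12, 48, 192, and 768 regions shown in the illustration, while the second gives the cell counts for the four levels of the classification hierarchy (from 192 cells up to roughly 12.6 million cells at r = 1,024).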

Figure 4 Orthographic views associated with the first four levels of the HEALPix tessellation (Górski et al. 2005)
The actual geocoding was based on a hierarchy of linear Support Vector Machine (SVM) classifiers, which make their decisions based on a linear combination of the features, plus a bias term. In the case of binary classification problems where the target classes are c \in \{-1, +1\}, if \hat{y}(d) is the predicted value for a document d, if the vector \theta = (\theta_1, \ldots, \theta_n) corresponds to the weights associated to each of the n features, and if \theta_0 represents the bias term, then \hat{y}(d) = \mathrm{sign}(\theta \cdot d + \theta_0).
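The linear decision rule is a one-liner; the weights and document vector below are invented for the example.

```python
import numpy as np

# The SVM decision rule sketched above: a linear combination of the features
# plus a bias term, with the sign giving the predicted class.

theta = np.array([0.5, -1.0, 2.0])     # learned feature weights (invented)
theta0 = -0.25                         # bias term
d = np.array([1.0, 1.0, 1.0])          # feature vector for one document

score = float(theta @ d + theta0)      # 0.5 - 1.0 + 2.0 - 0.25 = 1.25
prediction = 1 if score >= 0 else -1
```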
As in the case of logistic regression models, multi-class classification can be handled by first converting the problem into a set of binary tasks, through the one-versus-all scheme (i.e. a set of binary classifiers, one for each possible sub-region, was associated to each level of the hierarchical representation of the Earth's surface). Given |D| training instances \{(d_i, c_i)\}_{i=1}^{|D|}, in two classes c_i \in \{-1, +1\}, SVMs are trained by solving the following optimization problem in order to find the feature weights \theta and the bias term \theta_0, under the constraints c_i (\theta \cdot d_i + \theta_0) \geq 1 - \xi_i and \xi_i \geq 0:

(28) \min_{\theta, \theta_0, \xi} \; \lambda \lVert \theta \rVert^2 + \frac{1}{|D|} \sum_{i=1}^{|D|} \xi_i

In the formula, the parameters \xi_i are non-negative slack variables which measure the degree of misclassification of the data instances d_i, and \lambda > 0 is a regularization term controlling model complexity.
After model training, and in order to geocode a given test document, the authors used a greedy procedure that starts with the application of the root‐level classifier for selecting the most likely region. This region is then subdivided in the next level of the hierarchy, where a new independent classifier again chooses the most likely region. The process repeats itself, until the most likely leaf region is chosen. The centroid coordinates of that leaf region are finally assigned to the test document that is being geocoded.
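The greedy top-down procedure can be sketched as follows; the tiny hierarchy, keyword-overlap "classifier", and centroids are invented stand-ins for the per-level SVMs and the HEALPix regions.

```python
# A sketch of the greedy top-down geocoding procedure described above: at each
# level, the classifier for the current region picks the most likely
# sub-region, until a leaf is reached and its centroid is returned.

children = {"earth": ["europe", "america"],
            "europe": ["iberia", "balkans"]}
centroids = {"iberia": (40.0, -4.0), "balkans": (42.5, 21.0),
             "america": (15.0, -90.0)}

def local_classifier(region, doc):
    # Stand-in for a per-level SVM: score each sub-region by keyword overlap.
    keywords = {"europe": {"lisbon", "madrid"}, "america": {"lima"},
                "iberia": {"lisbon", "madrid"}, "balkans": {"belgrade"}}
    return max(children[region],
               key=lambda sub: len(keywords.get(sub, set()) & set(doc)))

def geocode(doc, root="earth"):
    region = root
    while region in children:            # descend until a leaf region
        region = local_classifier(region, doc)
    return centroids[region]             # centroid of the chosen leaf region

coords = geocode(["lisbon", "tram", "hills"])
```

Unlike the beam search of Wing and Baldridge (2014), this greedy descent commits to a single region at each level, so an early misclassification cannot be recovered at deeper levels.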
The document geocoding method from Melo and Martins (2015) was evaluated with articles from the English, German, Spanish and Portuguese Wikipedias, taken from dumps produced in 2014. The authors report a mean error of 83 km and a median error of 9 km for the case of the English Wikipedia, with slightly worse results in the other languages.
Liu and Inkpen (2015) proposed yet another approach based on discriminative models, in this case relying on multilayer neural networks. The authors have specifically evaluated two similar models for the task of estimating Twitter users’ locations based on their textual messages, namely one model that predicts the US state where the user is located, and a second model that predicts the latitude and longitude coordinates of the user's location.
In the multilayer neural networks that were used by Liu and Inkpen (2015), each neuron is connected to all neurons in the subsequent layer, and each neuron is also associated with a non‐linear activation function (i.e. a sigmoid function) that transforms its outputs. The considered models leverage multiple hidden layers (i.e. the authors proposed to use what is now commonly referred to as a deep neural network), and the transformed outputs of each layer are the inputs of the subsequent layer in the neural network. Model training is based on back‐propagation (Rumelhart et al. 1986) and stochastic gradient descent (i.e. the errors in the final output layer are back‐propagated to preceding layers and used to update the weights of each layer). Figure 5 presents the architecture of both models that were proposed, showing that they differ only in the definition of the output layers. The neurons forming each layer are fully interconnected, and a layer together with its reconstruction of the input corresponds to what is commonly referred to as a denoising auto‐encoder (Vincent et al. 2008) (i.e. to force the hidden layers to discover robust features, they are trained to reconstruct their input from a corrupted version of it). The input to both models is a vector space representation for the textual contents, consisting of frequency counts for the 5,000 most frequent term unigrams, bi‐grams and trigrams. The dimensionality of the hidden layers is equal to that of the input layer.
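The corruption step that gives denoising auto-encoders their name can be sketched in a few lines; masking noise (randomly zeroing a fraction of the inputs) is one common choice, and the 30% corruption level here is an arbitrary value for the example.

```python
import numpy as np

# A sketch of the input corruption used by denoising auto-encoders: a random
# fraction of the input components is set to zero, and the auto-encoder is
# trained to reconstruct the clean input from this corrupted version.

rng = np.random.default_rng(42)

def corrupt(x, corruption_level=0.3):
    mask = rng.random(x.shape) >= corruption_level   # keep ~70% of the inputs
    return x * mask

x = np.ones(1000)                  # stand-in for an n-gram count vector
x_tilde = corrupt(x)
kept_fraction = x_tilde.sum() / x.size
```

The reconstruction loss is then measured against the clean input x, not against the corrupted x_tilde, which forces the hidden layer to learn features robust to missing evidence.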

Figure 5 The deep neural network models that were proposed by Liu and Inkpen (2015)
The first model proposed by Liu and Inkpen (2015) uses three layers of denoising auto-encoders. As shown in Figure 5, each layer of denoising auto-encoders also serves as a hidden layer of a standard neural network that predicts the class that best corresponds to the input. The top auto-encoding layer of the network works as the input for a logistic regression model whose output is a softmax layer. The softmax function is defined as follows, where the numerator z_C is the input to the softmax function that corresponds to a class C, and where the denominator is the summation over all possible inputs (i.e. over all J possible classes):

(29) \mathrm{softmax}(z_C) = \frac{e^{z_C}}{\sum_{j=1}^{J} e^{z_j}}

The softmax layer has the same number of neurons as the number of possible output labels (i.e. each document is assigned to a class C, from a set of J possible classes), and the value of each neuron can be interpreted as the probability for the corresponding label, given the input. The label C with the highest probability is returned as the prediction that is made by the model.
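The softmax function in Equation 29 is straightforward to implement; subtracting the maximum input before exponentiating leaves the result unchanged but avoids numerical overflow.

```python
import numpy as np

# Equation 29 as code: a numerically stable softmax over class scores.

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])           # one score per possible class
predicted_class = int(np.argmax(p))
```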
In the case of the first model, the probability of a label C given the input (i.e. given the result of the last layer of denoising auto-encoders) is computed through the following equation, where \theta_s is the feature weight matrix, with J rows, and where b = (b_1, \ldots, b_J) are the biases for each class (i.e. b_C is the bias term for a class C):

(30) P(C \mid d_N) = \mathrm{softmax}(\theta_s d_N + b)_C

In the formula, N corresponds to the number of hidden layers (i.e. N = 3, given that the authors used three layers of denoising auto-encoders), and d_N is the output of the denoising auto-encoder on top. To calculate the output d_i of the i-th hidden layer, with 1 \leq i \leq N, the following equation is used:

(31) d_i = \mathrm{sigmoid}(\theta_i d_{i-1} + b_i)

In the formula, sigmoid(.) is the activation function, \theta_i and b_i respectively correspond to the weight matrix and the biases of the i-th hidden layer, and d_0 is the raw input (i.e. the n-gram features) generated from a given text. The model returns the label \hat{C} that maximizes Equation 30 as its class prediction:

(32) \hat{C} = \arg\max_{C} P(C \mid d_N)

For their second model, Liu and Inkpen (2015) replaced the logistic regression layer, on top of the network, by a multivariate linear regression layer, which directly attempts to predict latitude and longitude coordinates. Specifically, the output of the second model is given by the following equation, where \hat{y} = (\mathrm{latitude}, \mathrm{longitude}), where \theta_r is the weight matrix of the linear regression layer (i.e. a matrix with two rows, corresponding to the latitude and longitude coordinates), where b_r are the biases of the linear regression layer, and where d_N is the output of the denoising auto-encoder that is located on top (i.e. the topmost hidden layer):

(33) \hat{y} = \theta_r d_N + b_r

In both cases, model training was done by layers, i.e. by first minimizing a squared error loss function for the first layer of denoising auto-encoders, then for the second layer, and finally for the third. The authors also followed the suggestions of Glorot et al. (2011) in terms of adding statistical noise to the denoising auto-encoders. After fitting the parameters of the intermediate layers, a fine-tuning phase, in which the top layers as well as all other parameters are adjusted, used different training criteria for each model (i.e. the negative log-likelihood as the loss function for the first model, and the great-circle distance between the true and the predicted geospatial coordinates in the case of the second model). To prevent overfitting, the authors adopted an early-stopping technique, in which training stops when the model's performance over a validation set no longer improves (Yao et al. 2007; Bengio 2012).
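The forward pass of both models can be sketched together: three sigmoid hidden layers, topped either by a softmax classification layer or by a two-output linear regression layer. All weights below are random placeholders (in the original work they are learned layer by layer and then fine-tuned), and the dimensionalities are arbitrary choices for this example.

```python
import numpy as np

# A sketch of the forward pass in Equations 30-33: three sigmoid hidden
# layers (the stacked auto-encoder part), topped either by a softmax
# classification layer (first model) or by a two-output linear regression
# layer (second model). All weights are random placeholders.

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 10, 10, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layers (Equation 31): d_i = sigmoid(theta_i d_{i-1} + b_i).
layers = [(rng.normal(size=(n_hidden, n_in if i == 0 else n_hidden)),
           np.zeros(n_hidden)) for i in range(3)]

def hidden_output(d0):
    d = d0
    for theta_i, b_i in layers:
        d = sigmoid(theta_i @ d + b_i)
    return d

d0 = rng.normal(size=n_in)             # stand-in for the n-gram feature vector
dN = hidden_output(d0)

# First model (Equations 30 and 32): softmax over J classes.
theta_s = rng.normal(size=(n_classes, n_hidden))
b_s = np.zeros(n_classes)
scores = theta_s @ dN + b_s
p = np.exp(scores - scores.max()); p /= p.sum()
predicted_class = int(np.argmax(p))

# Second model (Equation 33): linear regression to (latitude, longitude).
theta_r = rng.normal(size=(2, n_hidden))
b_r = np.zeros(2)
lat_lon = theta_r @ dN + b_r
```

Note that the regression head is unbounded, so in practice its outputs must still be interpreted (or clipped) as valid latitude and longitude values.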
The authors reported on experiments with the datasets from Eisenstein et al. (2010) and from Roller et al. (2012), both of which have also been used by other studies surveyed in this article. To evaluate the first model, the authors defined a classification task where each user is assigned to one particular US state. Using the dataset from Eisenstein et al. (2010), the authors measured an accuracy of 34.8%, against accuracies of 30.1% and 27.5% that were, respectively, obtained with baseline methods corresponding to naïve Bayes or SVM classifiers. The second model was instead evaluated through the median and mean error distances, in kilometers, from the actual location to the predicted location. The authors measured mean error distances of 855.9 and 733 km, respectively over the datasets from Eisenstein et al. (2010) and from Roller et al. (2012).
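The error measure used in these evaluations is the great-circle distance between the true and predicted coordinates, which a standard haversine formula computes (the Earth radius of 6,371 km is a common spherical approximation). The toy error list below also illustrates why the median is reported alongside the mean: a single badly geocoded document inflates the mean but barely moves the median.

```python
import math
from statistics import mean, median

# Great-circle (haversine) distance between two latitude/longitude points.

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Three well-geocoded documents and one outlier, all on the equator.
errors = [great_circle_km(0, 0, 0, d) for d in (0.1, 0.2, 0.3, 90.0)]
median_error = median(errors)
mean_error = mean(errors)
```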
Despite the interesting results, the authors argue that better models can perhaps be built, for instance by further exploring hyper‐parameter tuning, by increasing the dimensionality of the hidden layers, or through better feature engineering. In a way, the study from Liu and Inkpen (2015) represents an initial attempt at the use of deep learning methods over the task of document geocoding. Future endeavors may, for instance, consider the use of other types of neural network architectures (e.g. recursive neural networks combined with attention mechanisms), recently shown to achieve superior results in other types of text classification problems (Tang et al. 2015).
6 Discussion and Open Challenges
Table 1 summarizes the results obtained in the different studies that were surveyed in this article, attesting to the progress that has been achieved in recent years. We excluded from this table the articles from Woodruff and Plaunt (1994), Amitay et al. (2004), and Adams and Janowicz (2012), due to the fact that these authors used significantly different procedures for reporting the obtained results (i.e. these authors used very different corpora in their studies, and they did not report results in terms of the median distance towards the ground truth locations). Most previous studies reported on geocoding quality through experiments with datasets collected from Twitter or from Wikipedia, measuring the median and the mean geospatial distance from the predicted locations to the locations given in gold standard annotations (i.e. the geospatial coordinates originally associated with Wikipedia articles or with Twitter users). It is particularly interesting to consider the median distance in the context of evaluating document geocoding methods, given that the median is relatively robust to outliers, and given that distances are easy to measure and probably also easier to interpret than metrics of classification accuracy. Wikipedia and Twitter contents are also quite convenient for the evaluation of document geocoding methods, given that very large datasets, in multiple languages, can easily be collected, and given that the gold standard annotations are also directly available (e.g. geospatial coordinates for many Wikipedia articles are assigned and curated by human editors). Still, future work in the area should perhaps also use other sources of textual data (e.g. news articles, historical documents, etc.) together with Wikipedia and/or Twitter collections to facilitate the comparison against previous studies.
| Study | Dataset | Main contributions | Median | Mean |
|---|---|---|---|---|
| Wing and Baldridge (2011) | Wikipedia (dump from 2010) | Unigram models + regular grid + KL divergence | 11.8 km | 221 km |
| Wing and Baldridge (2011) | Smaller Twitter dataset (USA) | Unigram models + regular grid + KL divergence | 479.0 km | 967 km |
| Cheng et al. (2010, 2013) | Medium Twitter dataset (USA) | Unigram models | ‐ | 2,854 km |
| Cheng et al. (2010, 2013) | Medium Twitter dataset (USA) | Unigram models + feature selection | ‐ | 868 km |
| Cheng et al. (2010, 2013) | Medium Twitter dataset (USA) | Unigram models + feature selection + smoothing | ‐ | 862 km |
| Roller et al. (2012) | Wikipedia (dump from 2010) | Combined grid + centroid of most probable cell | 13.4 km | 176 km |
| Roller et al. (2012) | Larger Twitter dataset (USA) | K‐d‐tree grid + centroid of most probable cell | 463.0 km | 860 km |
| Dias et al. (2012) | Wikipedia (dump from 2011) | N‐gram models + HTM discretization | 25.8 km | 255 km |
| Laere et al. (2014) | Wikipedia (UK dataset) | K‐medoids clusters + feature selection | 4.2 km | ‐ |
| Laere et al. (2014) | Wikipedia (UK dataset) | K‐medoids clusters + feature selection + Flickr data | 2.2 km | ‐ |
| Han et al. (2014) | Larger Twitter dataset (USA) | IGR feature selection | 260.0 km | ‐ |
| Han et al. (2014) | Twitter dataset (world) | IGR feature selection | 640.0 km | ‐ |
| Wing and Baldridge (2014) | Wikipedia (dump from 2013) | Logistic regression + k‐d‐tree grid | 15.3 km | 169 km |
| Wing and Baldridge (2014) | Larger Twitter dataset (USA) | Logistic regression + regular grid | 170.5 km | 704 km |
| Wing and Baldridge (2014) | Twitter dataset (world) | Logistic regression + regular grid | 490.0 km | 1,670 km |
| Melo and Martins (2015) | Wikipedia (dump from 2014) | SVM classifiers + HEALPix discretization + TF‐IDF | 8.9 km | 83 km |
| Liu and Inkpen (2015) | Smaller Twitter dataset (USA) | Deep neural network + word n‐gram features | ‐ | 856 km |
| Liu and Inkpen (2015) | Larger Twitter dataset (USA) | Deep neural network + word n‐gram features | 377.0 km | 733 km |
It should also be noted that several of the studies presented in Table 1 used slightly different corpora in their experiments (i.e. different collections of documents from Wikipedia or from Twitter), and thus some of the results shown in Table 1 cannot be said to be directly comparable. For instance, Wing and Baldridge (2014) reported worse results for the English Wikipedia than in a previous publication by the same team (Roller et al. 2012), simply because a different (i.e. newer and larger) sample of articles from the English Wikipedia was used in these later tests. Nonetheless, the method reported by Wing and Baldridge (2014) is still perhaps the most accurate, because in tests with a re-implementation of the geocoding method from Roller et al. (2012), using the newer dataset, the authors measured errors that were significantly higher. Still, some of the studies listed in Table 1 are indeed directly comparable, because the same datasets have been used (i.e. the entries in Table 1 that have the same values in the second column have used the exact same datasets in the evaluation experiments) and/or because the authors have made efforts to replicate the experiments of previous studies. The discriminative classification methods from Wing and Baldridge (2014) and Melo and Martins (2015) are relatively similar (e.g. both studies rely on a hierarchy of linear classifiers, leveraging the vector space model approach for representing the documents) and also achieve relatively similar results. These methods currently correspond to the state-of-the-art in the area, and the results agree with the intuition that discriminative approaches generally outperform generative models in supervised text classification problems, particularly when large training datasets are available (Ng and Jordan 2002), although the training of discriminative models can also be computationally more demanding.
The results in Table 1 also show that geocoding Twitter users based on their messages is much more challenging than geocoding Wikipedia documents. These results should not be surprising, given that Wikipedia articles are longer, thus providing more context on which to base predictions. Wikipedia articles also tend to use more toponyms and words that correlate strongly with particular places, while tweets tend to discuss quotidian details, often using abbreviated and slang language to overcome the limit of 140 characters per message.
The lowest error reported in Table 1 corresponds to a median error of 4.2 km when geocoding Wikipedia documents through the highly tuned generative approach described by Laere et al. (2014). However, these authors used a dataset with Wikipedia pages referring only to locations within the UK, while the other studies listed in Table 1 used Wikipedia pages from all around the world. The mean and median errors reported in Table 1, which are perhaps too high for practical applications related to location-based services and hyper-local search, should be seen in the highly challenging context of dealing with locations and textual contents at a global scale.
Despite the interesting results that have been reported in the articles surveyed here, there are also many possibilities for future improvements, as well as for the application of these ideas within subsequent tasks related to geospatial text mining and retrieval. We believe that there are indeed many practical applications for the procedures surveyed in this article, although document geocoding should often be accompanied by a classification procedure to distinguish documents that should indeed be geocoded from those that do not discuss location‐related aspects (Anastácio et al. 2009). In the case of hierarchical methods such as that from Melo and Martins (2015), a first node in the hierarchy of classifiers could perhaps be used to check if a given document is indeed related to some geospatial region.
The document geocoding procedures surveyed in this article can, for instance, be used to compute document-level priors for aiding in the task of resolving individual place references in the text (Santos et al. 2015; Speriosu and Baldridge 2013; DeLozier et al. 2015). As in the task of document geocoding, recent studies focusing on place reference resolution have also shifted from gazetteer matching and rule-based spatial minimization methods into machine-learned classifiers that use features of the text surrounding the different place names. Advances in document geocoding can be used to improve the resolution of individual place references, or even to aid in the more general problems of named entity disambiguation and text semantification (Cucerzan 2007; She et al. 2015; Hoffart et al. 2011; Mendes et al. 2011), either through the incorporation of novel/better features for capturing the geographic context, or through the direct application of similar classification methods to the task of disambiguating place references (DeLozier et al. 2015).
Generative approaches for document geocoding, such as the ones surveyed in this article, can perhaps also be further enhanced. Some of the recent advances in the area of Information Retrieval (IR) have specifically focused on language modeling approaches that go beyond the methods used in previous studies concerned with document geocoding, for instance by attempting to directly capture word burstiness (Xu and Akella 2010; Cummins et al. 2015). Previous IR studies have also evaluated the application of more advanced modeling approaches in the task of geocoding Flickr photos, for instance by using notions of proximity or hierarchical containment between geographic regions, in order to smooth language models (O'Hare and Murdock 2013; Murdock 2014), or by considering models that do not attempt to discretize the geographic space (Kling et al. 2014; Flatow et al. 2015).
Although this survey has focused on studies addressing the automated geocoding of textual documents, there are many previous studies instead focusing on multimedia resources, including photos (Workman et al. 2015; Islam et al. 2015; Hare et al. 2014), videos (Li et al. 2014) and even resources like music files (Zhou et al. 2014) or series of images from webcams (Jacobs et al. 2011). Several of these studies have proposed interesting methods for leveraging multimedia content descriptors, external resources and/or associated textual annotations. For future work, it would be interesting to see if some of the ideas proposed for geocoding multimedia contents could also result in improvements for the specific task of geocoding textual documents.
The discriminative classification approaches discussed in Section 5 of this article can perhaps also be improved through the use of text representations and feature weighting schemes that go beyond the classical vector space model and the TF-IDF procedure. One can, for instance, compensate for document length through procedures such as pivoted document length normalization (Singhal et al. 1996), or account for term proximity by propagating term weights across terms that co-occur within a given proximity context (Lebanon et al. 2007; Büttcher et al. 2006). One can also consider using supervised term weighting methods that favour terms that are better at discriminating between different geospatial regions, for instance through the inverse class frequency procedure and other variants of this idea (Lan et al. 2009; Lertnattee and Leuviphan 2012). Another related idea is to leverage graph-based term weighting procedures, which can be made to account not only for term frequency and proximity, but also for different types of linguistic regularities (Rousseau and Vazirgiannis 2013; Blanco and Lioma 2012).
Instead of exploring more advanced term weighting procedures, other authors have instead attempted to encode term proximity and other types of linguistic regularities within the regularization component of discriminative classification methods, showing that these procedures can result in significant improvements for different types of text classification applications (Yogatama and Smith 2014a, 2014b). The main idea behind these model regularization procedures relates to instantiating potentially overlapping groups of input features corresponding to the linguistic regularities of interest to the problem at hand (e.g. in document geocoding applications, we could perhaps explore the idea that only a few sentences from a given document will be useful for predicting the document's geographic location, thus instantiating groups corresponding to the different sentences where a term occurs). During model training, the optimization procedure promotes the idea of group sparseness (i.e. predefined groups of weights are encouraged to either go to zero, as a group, or not). For future work, it would also be interesting to see if SVM or logistic regression classifiers, in combination with these data‐driven regularizers, could indeed result in significant improvements.
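The group-sparsity mechanism behind these regularizers can be illustrated with the proximal operator of the group lasso penalty, which shrinks each predefined group of weights toward zero as a unit and zeroes out entire groups whose norm falls below the threshold. The groups, weights, and threshold below are invented for the example.

```python
import numpy as np

# A sketch of the group-sparsity idea: the proximal operator of the group
# lasso penalty applies w_g <- max(0, 1 - t / ||w_g||) * w_g to each group g,
# so whole groups of weights are either zeroed out together or kept.

def group_soft_threshold(weights, groups, threshold):
    w = np.array(weights, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        w[idx] = scale * w[idx]
    return w

weights = np.array([0.1, -0.1, 3.0, 4.0])
groups = [[0, 1], [2, 3]]              # e.g. features grouped by sentence
w_new = group_soft_threshold(weights, groups, threshold=0.5)
```

The first group, whose norm is below the threshold, is zeroed out entirely, while the second group is only mildly shrunk; in the geocoding setting, this corresponds to discarding whole sentences that carry no geographic evidence.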
Another idea for future work concerns combining discriminative and generative classification models. Previous studies have shown that generatively trained classifiers often perform better when there are few training examples (i.e. the generative assumptions place some structure on the models that prevent overfitting), and they provide a principled way for treating missing information, or for supporting semi‐supervised learning. On the other hand, classifiers trained discriminatively often perform better with sufficient training data. Recent machine learning literature has explored several hybrids of these two approaches (e.g. through ensemble schemes) and, for future work, we can perhaps also consider exploring hybrid classification models (Raina et al. 2004; Wang and Manning 2012). Examples of promising approaches include maximum‐margin supervised topic models such as MedLDA (Zhu et al. 2009), which combine max‐margin prediction as in SVMs with the mechanism behind hierarchical Bayesian topic models, yielding latent topical representations that are more discriminative and suitable for classification tasks.
Recently, we have also witnessed an increasing interest in the use of topic-based representations for textual documents that go beyond the vector space model approach used in some of the works mentioned in this article. Examples include probabilistic topic models such as Latent Dirichlet Allocation (Blei et al. 2003), which has already been mentioned in this article and in which the documents are represented as probability distributions over latent topics prior to their classification, as well as approaches based on unsupervised embeddings for words and/or entire documents (Soyer et al. 2015; Liu et al. 2015; Le and Mikolov 2014).
In particular, unsupervised word embeddings trained by maximizing the prediction of contextual words have become quite popular within the field of natural language processing, having been shown to be highly effective in capturing fine-grained semantic properties of words (Gupta et al. 2015), and also being commonly used as features within different types of classification problems, particularly when relying on multilayer neural networks such as those used in the study by Liu and Inkpen (2015). For instance, in the work described by Mikolov et al. (2013), commonly referred to as word2vec's skip-gram model, the idea is to leverage large corpora to estimate the optimal word embeddings, maximizing the probability that the words within a given window size are predicted correctly, and leveraging a simple two-layer neural network in which the top layer corresponds to a log-linear model. After the model is trained, the words are mapped into a vector space such that semantically similar words have similar vector representations (e.g. strong is close to powerful). Interestingly, this approach has also been extended in order to learn representations for geographically situated language (Bamman et al. 2014), capturing a geographically informed notion of semantic similarity by mapping words in different regions into the same vector space. Still, word embeddings have not yet been explored as features in models for document geocoding.
Following on the success of word embedding techniques such as word2vec's skip-gram model, researchers have tried to extend these models beyond word-level representations, specifically aiming to achieve phrase-level or sentence-level representations. A simple compositional approach involves using a weighted average of the embeddings for all the words in the document, losing the word order in the same way as the standard vector space models do. However, authors like Le and Mikolov (2014) or Soyer et al. (2015) have proposed more sophisticated approaches, consisting of unsupervised frameworks that learn representations for variable-length pieces of text. In the paragraph vector approach from Le and Mikolov (2014), the representations are trained for predicting words in a document (i.e. the authors concatenate the paragraph vector with several word vectors from a document, and predict the following word in the given context). Soyer et al. (2015) instead described an approach that can be used to represent documents, in different languages, according to a common vector space, leveraging only a few aligned cross-lingual sentences. These approaches can naturally be used to build effective document representations, where the dimensionality is much lower than that of typical representations based on individual word occurrences. These lower-dimensional representations can facilitate the use of more advanced learning methods, for instance using non-linear combinations of the input features in order to make predictions. Having cross-lingual representations (Soyer et al. 2015; Faruqui and Dyer 2014) can also be of particular interest in the task of document geocoding, given that training datasets for languages other than English will be smaller in size, and thus by transferring knowledge from existing English corpora we can perhaps significantly improve the results for other languages.
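The simple compositional baseline mentioned above (averaging word embeddings into a document vector) can be sketched as follows; the tiny 3-dimensional "embeddings" are invented, whereas real ones would come from a model such as skip-gram trained on a large corpus.

```python
import numpy as np

# A sketch of the averaging baseline for document representations: the
# document vector is the mean of its word embeddings, and documents can then
# be compared through cosine similarity. The embeddings are invented.

embeddings = {"lisbon": np.array([0.9, 0.1, 0.0]),
              "porto":  np.array([0.8, 0.2, 0.0]),
              "london": np.array([0.0, 0.9, 0.1])}

def doc_vector(tokens):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)       # word order is lost, as in BOW models

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = doc_vector(["lisbon", "porto"])
sim_pt = cosine(d1, doc_vector(["porto"]))
sim_uk = cosine(d1, doc_vector(["london"]))
```

Even this crude composition preserves the geographic signal carried by the word vectors: the Portuguese document stays closer to "porto" than to "london" in the shared space.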
Future studies on the subject of document geocoding, leveraging low‐dimensional representations for the textual contents such as those produced through modern embedding methods, can also consider experimenting with different models learned from training data, perhaps more expressive and/or more adequate for this task. For instance, models based on decision trees, or based on ensembles of decision trees, can be made to use variations of the information gain criterion that are better suited to georeferenced training data (Jiang et al. 2012; Li and Claramunt 2006).
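As an illustration of how a split criterion adapted to georeferenced data might look (this is a sketch, not the exact criterion from the cited studies), one can score a candidate split by the reduction it achieves in the geospatial spread of the training coordinates, in place of the reduction in class entropy. The sketch below approximates geodesic spread with squared Euclidean distances over latitude/longitude, which a real implementation would replace with great‐circle distances.

```python
def geo_variance(coords):
    """Spread of a set of (lat, lon) points around their centroid, using
    squared Euclidean distance as a simple stand-in for geodesic distance."""
    if not coords:
        return 0.0
    n = len(coords)
    clat = sum(c[0] for c in coords) / n
    clon = sum(c[1] for c in coords) / n
    return sum((c[0] - clat) ** 2 + (c[1] - clon) ** 2 for c in coords) / n

def split_gain(coords, mask):
    """Analogue of information gain for georeferenced data: the reduction
    in geospatial variance achieved by splitting on boolean feature `mask`."""
    left = [c for c, m in zip(coords, mask) if m]
    right = [c for c, m in zip(coords, mask) if not m]
    n = len(coords)
    weighted = (len(left) * geo_variance(left)
                + len(right) * geo_variance(right)) / n
    return geo_variance(coords) - weighted

# Documents mentioning "tagus" cluster near Lisbon, the rest near London,
# so splitting on that feature yields a large reduction in spread.
coords = [(38.7, -9.1), (38.8, -9.2), (51.5, -0.1), (51.4, -0.2)]
has_tagus = [True, True, False, False]
```

A tree induction algorithm would evaluate this gain for every candidate word feature and split on the one that best separates the training documents into geographically compact groups.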
Another idea concerns the use of metric learning approaches (Wang and Sun 2015; Kulis 2013), capable of estimating the true distance between the locations of documents, based on their textual representations. If the distance between documents can indeed be accurately estimated, then the document geocoding problem can be addressed through a weighted interpolation from the coordinates of the k most similar training documents (Jenness 2008), with weights corresponding to the estimated geospatial distances. A recent study from Rahimi et al. (2015) has successfully explored label propagation algorithms for geocoding Twitter users (i.e. state‐of‐the‐art results were obtained by propagating information on a similarity graph built from user mentions in Twitter messages, together with dongle nodes corresponding to the results of a geocoding method leveraging textual information). Similar transductive methods can, for instance, be combined with learned distance metrics between the textual documents.
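The weighted interpolation idea can be sketched as follows, using bag‐of‐words cosine similarity as a stand‐in for a learned distance metric (a metric learning approach would supply estimated geospatial distances instead). The token lists and coordinates are illustrative, and simple averaging of latitude/longitude is only a rough approximation that misbehaves near the poles and the 180° meridian.

```python
import math
from collections import Counter

def cosine_bow(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_geocode(test_tokens, training, k=2):
    """Predict coordinates as a similarity-weighted interpolation from the
    coordinates of the k most similar training documents."""
    test_bow = Counter(test_tokens)
    scored = sorted(((cosine_bow(test_bow, Counter(toks)), (lat, lon))
                     for toks, lat, lon in training), reverse=True)[:k]
    total = sum(s for s, _ in scored) or 1.0
    lat = sum(s * c[0] for s, c in scored) / total
    lon = sum(s * c[1] for s, c in scored) / total
    return lat, lon

training = [
    ("the tagus river flows through lisbon".split(), 38.7, -9.1),
    ("lisbon is the capital of portugal".split(), 38.7, -9.2),
    ("london lies on the thames".split(), 51.5, -0.1),
]
lat, lon = knn_geocode("a walk along the tagus in lisbon".split(),
                       training, k=2)
```

Here the test document is most similar to the two Lisbon documents, so the interpolated prediction falls near Lisbon; with a learned metric, the weights would instead be derived from the estimated geospatial distances.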
7 Conclusions
Geography is an integral part of human communication, since every piece of information is created in a location, intended for an audience in the same or other locations, and may discuss yet other locations. Recently, geographical information retrieval has captured the attention of many different researchers who work in fields related to language processing and to the retrieval and mining of relevant information from large document collections. With the rise of unstructured information being published online, we have also witnessed an increased interest in applying computational methods to extract geographic information from heterogeneous and unstructured data, including textual documents. In this article, we summarized previous research on text‐based document geocoding, i.e. on the task of predicting the geospatial coordinates of latitude and longitude that best correspond to an entire document, based on its textual contents. The previous studies described in our survey ranged from early document geocoding systems that use heuristics over place names recognized in the texts, to supervised machine learning methods, with current state‐of‐the‐art methods leveraging discriminative classification approaches. We compared the different methods through experimental results reported in the original articles (i.e. based on tests over English datasets collected from Wikipedia or from Twitter), and we also extensively discussed open challenges for future work in the area.
Acknowledgements
This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), through project grants with references PTDC/EIA‐EIA/109840/2009 (SInteliGIS), EXPL/EEI‐ESS/0427/2013 (KD‐LBSN), and EXCL/EEI‐ESS/0257/2012 (DATASTORM), as well as through the INESC‐ID multi‐annual funding from the PIDDAC programme (UID/CEC/50021/2013). We would particularly like to thank our colleagues, Ivo Anastácio, Duarte Dias, João Santos, Pável Calado and Mário J. Silva, for their comments on preliminary versions of this work.
References




