Potential of deep learning segmentation for the extraction of archaeological features from historical map series

Abstract Historical maps present a unique depiction of past landscapes, providing evidence for a wide range of information such as settlement distribution, past land use, natural resources, transport networks, toponymy and other natural and cultural data within an explicitly spatial context. Maps produced before the expansion of large-scale mechanized agriculture reflect a landscape that is lost today. Of particular interest to us is the great quantity of archaeologically relevant information that these maps recorded, both deliberately and incidentally. Despite the importance of the information they contain, researchers have only recently begun to automatically digitize and extract data from such maps as coherent information, rather than manually examine a raster image. However, these new approaches have focused on specific types of information that cannot be used directly for archaeological or heritage purposes. This paper provides a proof of concept of the application of deep learning techniques to extract archaeological information from historical maps in an automated manner. Early twentieth century colonial map series have been chosen, as they provide enough time depth to avoid many recent large-scale landscape modifications and cover very large areas (comprising several countries). The use of common symbology and conventions enhances the applicability of the method. The results show deep learning to be an efficient tool for the recovery of georeferenced, archaeologically relevant information that is represented as conventional signs, line-drawings and text in historical maps. The method can provide excellent results when an adequate training dataset has been gathered and is therefore at its best when applied to the large map series that can supply such information.
The deep learning approaches described here open up the possibility to map sites and features across entire map series much more quickly and coherently than other available methods, creating the potential to reconstruct archaeological landscapes at continental scales.

| INTRODUCTION
The use of historical maps has a long tradition in archaeological research and has played an important role in a wide array of scientific disciplines. As a 'frozen' image of a territory, maps provide information about the landscape and society of the period in which they were created. Beyond that, they are also useful for analysing aspects of the landscape which have since been truncated or destroyed by the types of large-scale transformations conducted over the last century, in particular, mechanized agriculture and urban expansion. Archaeologists have been using these sources for a long time in order to carry out regressive analysis and reconstruct successive landscape phases through time, employing old maps for the study of historical patterns in settlement, road networks and field systems (Bevan & Conolly, 2002; Chouquer, 1996; Crawford, 1926; Hoskins, 1955; Orengo & Palet, 2009; Vermeulen, Antrop, Hageman, & Wiedemann, 2001; Vion, 1989). Historical maps can also be used to identify archaeological sites or features which were reported, on purpose or accidentally, by the surveyors (Lape, 2002; Orengo & Fiz, 2008; Panich, Schneider, & Byram, 2018; Petrie et al., 2019; Rondelli, Stride, & García-Granero, 2013).
Systematic survey and mapping have been an essential and widely used instrument of statecraft for centuries, used to conquer, control, manage, tax, exploit, divide and protect areas. Since the late eighteenth century, the development of survey techniques on the one hand and political and ideological interests on the other pushed several European states to undertake systematic mapping of their own territories at an unprecedented scale and extent (Kent, Vervust, Demhardt, & Millea, 2020). This step change in European map production was almost immediately applied in their colonial dominions, starting during the nineteenth century, thereby reaching large parts of the world as an inseparable companion of enlightenment, imperialism, agricultural intensification and the industrial revolution. In the aftermath of the First World War, imperial dominions extended through large parts of the Middle East, which marked the beginning of the use of aerial survey techniques for large-scale mapping.
The Cassini Carte of France, the British Ordnance Survey and the Russian mapping of Siberia and Central Asia are examples of grand projects that are well known and employed within archaeological research. The two series used in this work can be placed in this context: the Survey of India (SoI), which was initially developed in parallel with the expansion of British control in India during the nineteenth century (Edney, 2009; Sarkar, 2020), and the 1:50,000 series derived from the works performed by the Bureau Topographique du Levant (BTL, later renamed Service Géographique des Forces Française Libres du Levant), created in 1918 under the authority of the Service Géographique de l'Armée (Le Douarin, 2020). Despite significant differences in technical apparatus, many of these maps were produced to a very high standard, with a spatial accuracy almost comparable to that of modern maps at similar scales.
The vast amount of information resulting from the continuous systematic mapping projects conducted between the late eighteenth century and the middle of the twentieth century remains in physical archives. Many institutions are currently developing digitizing programs to make these maps more readily available. Maps have been digitized on demand for research purposes, and digital repositories are becoming available. However, the number of digital historical maps is still relatively small in comparison to the total coverage, and collecting the maps necessary to ensure coverage of a large study area usually requires access to several repositories and the digitization of archive-stored originals often hosted in multiple institutions.
These map series offer considerable potential for archaeological and historical research and also heritage protection and management, as they often record archaeological sites, historical monuments and other features of archaeological interest, either deliberately or incidentally, through features such as place names, specific symbology or topographic expressions. These colonial map series were produced intensively during the nineteenth and early twentieth centuries, and they depict landscapes that have been substantially modified since their production. During the last half century, the adoption of mechanized agriculture, intensive irrigation, urban development and, in some areas, conflict and large-scale looting have dramatically changed the landscapes reflected in these maps, making them much more valuable for archaeologists and historians. In many cases, they include archaeological features which are difficult to identify today and may be entirely destroyed.
Associated information such as toponyms is also very valuable, as they document historic knowledge that might also be lost today. In many regions, the quality and importance of the information contained in these map series should qualify them as one of the basic, most relevant sources for archaeological, historical and heritage research. However, this has rarely been the case, and, although many projects make use of these historical maps, there have been few systematic attempts to extract information as large-scale quantifiable georeferenced data.
Up to now, most uses of historical map collections have relied on the digitization of maps as raster image files and their more or less systematic georeferencing in GIS environments (Orengo, Krahtopoulou, Garcia-Molsosa, Palaiochoritis, & Stamati, 2015; Petrie et al., 2019). However, the most time-consuming part, the extraction of features of historic-archaeological interest, has had to be done using manual approaches. This process has involved the visual identification and location of features and their digitization using vector formats that could correspond to points (the fastest of the methods), lines or polygons (which provide extra information such as shape and area but require a higher investment of labour). There has been a recent increase in the development of approaches directed towards the automatic vectorization of maps (Shbita et al., 2020). These take advantage of current developments in machine learning (ML) and deep learning (DL) approaches to computer vision (CV), with neural networks (NNs) having a prominent role. These approaches largely remain experimental and complex and do not categorize elements of archaeological interest. Notably, the archaeologically relevant information is included within other categories of data such as topography, toponymy or specific map symbology and still requires manual extraction and analysis.
In this paper, we provide a first proof of concept for the automatic extraction of features of archaeological interest from large series of historical maps using DL approaches. For this purpose, we have selected two map series depicting areas of high archaeological potential that were produced by two different colonial governments: the SoI series and the French Levant 1:50,000 series.

Even where archaeologists have taken the plunge with automated detection, the complex alignment of sources, technical capacities and research questions required can produce disappointing outcomes. As a result, there is some understandable scepticism towards its practical utility (Casana, 2014; Palmer, 2020).
Despite their potential, historical maps have been left outside this approach. This is likely due to several factors:
1. A certain amount of preprocessing, such as digitization and georeferencing, is necessary before ML methods can be applied.
2. Maps are not always easy to access, and there are few complete historical map series that can be freely accessed and downloaded in digital form.
FIGURE 2 (1 and 2) Location of the area of Syria covered by the maps used in this test. Different examples of the representation of potential archaeological mounds (3-5) and the presence of settlement ruins (6-8). Note that 'tell' (Arabic for settlement mound) may appear as the name of a mound feature or as a toponym in the absence of an obvious topographic feature. This convention may be due to the placement of the names on the map, a real difference in the location of the named village and the tell site, or the destruction of the tell in advance of the mapping of the region [Colour figure can be viewed at wileyonlinelibrary.com]

3. Maps are subjective sources made by surveyors whose interest was rarely the recording of archaeological sites. As such, they often lack strict, systematic parameters that can be used to identify archaeological sites. Sites can be represented through their topography but also through conventional signs, both intended and unintended. This means that the same type of cultural element might be represented by different symbology within the same collection or even the same map. Moreover, whether a site mark is included on a map is highly dependent on the surveyor's perception.
4. In contrast with current DL archaeological applications, which usually focus on simple shapes such as mounds visible in lidar-derived topographic data, features of interest in historical maps, even within the same object class, present inconsistent and irregular shapes. Their detection requires a much larger quantity of training data and the use of data augmentation techniques.
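Data augmentation of the kind mentioned in point 4 can be as simple as generating the eight rotation and flip variants of each annotated map tile. A minimal sketch under that assumption (tiles as nested lists standing in for raster chips; purely illustrative, not the platform's actual procedure):

```python
def augment_tile(tile):
    """Return the 8 dihedral-group variants of a square tile
    (4 rotations x optional horizontal flip)."""
    def rot90(t):
        # Rotate 90 degrees clockwise: reverse rows, then transpose.
        return [list(row) for row in zip(*t[::-1])]

    variants = []
    current = tile
    for _ in range(4):
        variants.append(current)
        variants.append([row[::-1] for row in current])  # horizontal flip
        current = rot90(current)
    return variants
```

Applied to a small set of tagged training areas, this alone multiplies the effective training data eightfold before any photometric augmentation is considered.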
The first two factors that have influenced the use of historical maps can be at least partially overcome by choosing appropriate map series and working across multiple institutions. We might also expect accessibility of map series to increase over time as more institutions digitize their collections. Importantly, the age of many of the map series means they are no longer subject to copyright and it is possible to make them publicly available with limited restrictions on reuse. The last two factors are more challenging to overcome, but we believe that a systematic extraction of archaeological and heritage features from historical maps is not just possible but beneficial under certain circumstances.
It is important not to overlook the fact that many of the identified features need to be verified in the field. This presents additional problems because landscape change and inaccuracies in the recording, georeferencing and placement of tags can make the ground checking of map-recorded features complicated.
Our approach to implementing ML on historical maps is based on two separate steps.

| Georeferencing of high-resolution digitized historical maps
For the SoI map series, a detailed description of the georeferencing procedures that have been developed and implemented can be found in Petrie et al. (2019) and Green et al. (2019). A short summary is offered here.
Both ESRI's ArcMap (ESRI, 2020) and QGIS georeferencing tools (a plugin using GDAL in the case of QGIS; QGIS, 2020) were employed for the georeferencing process, using WGS84 as the geodetic datum. Ground control points (GCPs) were obtained in ArcGIS through its basemap service and in QGIS using high-resolution Bing and Google imagery services imported as XYZ tiles.
Because the maps were digitized using either a photographic camera or a barrel scanner and their preservation state was not ideal, we employed a minimum of 20 clearly identifiable GCPs distributed evenly across each map. These consisted mostly of canal, road, and railroad intersections, which were some of the few landscape elements that have been preserved since the early twentieth century.
GCPs for each map were evaluated using their RMSE values, and unreliable GCPs were eliminated to achieve the best possible result.
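The GCP evaluation described above can be sketched as a least-squares fit of an affine pixel-to-WGS84 transform, with iterative removal of the worst-fitting points until an RMSE target is met. This is an illustrative reconstruction, not the ArcMap/QGIS procedure actually used; the function names, RMSE target and GCP data are assumptions:

```python
import numpy as np

def fit_affine(px, geo):
    """Solve geo = [x, y, 1] @ coeffs for a 3x2 affine matrix by least squares."""
    px = np.asarray(px, float)
    X = np.hstack([px, np.ones((len(px), 1))])            # design matrix (n, 3)
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(geo, float), rcond=None)
    return coeffs                                          # (3, 2)

def rmse(px, geo, coeffs):
    """Total RMSE and per-point residual norms for a fitted transform."""
    X = np.hstack([np.asarray(px, float), np.ones((len(px), 1))])
    resid = X @ coeffs - np.asarray(geo, float)
    per_point = np.linalg.norm(resid, axis=1)
    return np.sqrt((per_point ** 2).mean()), per_point

def prune_gcps(px, geo, target_rmse, min_gcps=20):
    """Iteratively drop the single worst GCP while RMSE exceeds the target,
    never going below `min_gcps` points (20 per map in this study)."""
    px, geo = list(px), list(geo)
    while len(px) > min_gcps:
        coeffs = fit_affine(px, geo)
        total, per_point = rmse(px, geo, coeffs)
        if total <= target_rmse:
            break
        worst = int(per_point.argmax())
        del px[worst], geo[worst]
    return px, geo, fit_affine(px, geo)
```

In practice the GIS tools also support higher-order polynomial and spline transforms; the affine case is shown only because it keeps the sketch short.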

| CNN-based DL segmentation of features of interest in digitized historical maps
Although mounded shapes characterize many of the archaeological sites in our two test areas, these features do not follow a unique or standard form of representation. Mounds and other features of interest can be represented using a variety of symbols and toponymy (see Petrie et al., 2019). In this work, we test three different types of representation. The strategy adopted here makes use of segmentation approaches as, besides site location, we were interested in the shape and size of the features of interest, in particular mound representations. Rather than employing a single detector to classify the whole map series, we developed different detectors focusing on specific elements of interest. This strategy allowed a more focused training process in which only one particular element per detector was tagged, avoiding confusion between classes.
Given the number of classifiers required to detect all objects of archaeological interest, we employed Picterra, an online ML platform that provides a simple and intuitive graphical interface for the selection of training data. Picterra uses a U-Net-based architecture (Ronneberger, Fischer, & Brox, 2015) for the ML object instance segmentation. Convolutional neural networks (CNNs) are DL architectures that, among other uses, can identify and outline predefined object classes from raster images through the patterns in pixel relations. This approach is well suited for identifying individual objects that are not necessarily identical but share a similar representation on the maps.
Typical DL methods combine object detection, which classifies individual objects and locates each within a bounding box, with semantic segmentation, which classifies each image pixel into a category, and instance segmentation, which differentiates between object instances. Picterra's implementation uses a CNN architecture based on U-Net. The algorithm automatically performs a series of preprocessing steps, including data augmentation, which aims to provide well-balanced and effective training data for the development of the model. The training of the models uses cloud-based distributed computing, which greatly speeds up the process. In addition, by applying proprietary postprocessing techniques to the output of the U-Net model, it separates the per-pixel classification results into separate objects, effectively outputting Mask R-CNN-like instance segmentation results without the overhead of a large and overly complex network that requires abundant training data. In this way, the training and testing of detectors can be achieved very quickly without the need to gather large amounts of annotated images. There are two other reasons why Picterra was considered an adequate platform for this research instead of developing our own open access detectors: (a) the research aim was to test the potential of DL for the detection of multiple map features, and a fast and efficient method therefore allowed us to experiment until an adequate detector for each feature was achieved; (b) the symbology and representation of archaeological features can vary greatly between series and between maps in individual series, meaning other researchers will have to train their own algorithms to fit the specific features and symbology of the maps they are using.

Library and the Bodleian Library also hold substantial collections of the 1″ to 1-mile series, and although there is much overlap, these collections also complement each other.
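The post-processing step described above, separating a per-pixel (semantic) classification into distinct object instances, can be sketched with a simple 4-connected flood fill over the binary output mask. This is only a minimal stand-in: Picterra's actual post-processing is proprietary.

```python
def label_instances(mask):
    """mask: list of lists of 0/1 (a binary segmentation output).
    Returns a same-shaped grid where each 4-connected blob of 1s
    receives a distinct positive integer id, plus the blob count."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not labels[i][j]:
                current += 1                       # start a new instance
                stack = [(i, j)]
                while stack:                       # iterative flood fill
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y][x] and not labels[y][x]:
                        labels[y][x] = current
                        stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return labels, current
```

Each labelled blob can then be vectorized to a polygon and treated as one candidate feature, which is what makes per-detection measurements such as area thresholds possible downstream.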
The US Army produced copies of the SoI maps, and these copies have been digitized by the University of Texas (US Army Map Service, 1955), which has made them publicly available.

| Map series and training of the algorithm
In the case of the SoI maps, these have previously been employed to support archaeological survey in South Asia, in particular in the Indian State of Haryana. The maps of the French Levant 1:50,000 series have been used as a reference by archaeological surveys for many decades (Braemer, 1984, 1988).

• Detector 4 (Figure 4: 14) focuses on the conventional symbol referring to ruins. It resembles a grouping of 'L'-shaped marks, perhaps indicating walls. Rather than using the single 'L' shape, which would have resulted in the detection of a large number of false positives given the simplicity of the symbol and its common appearance in other map features unrelated to ruins, the algorithm was trained using the ensemble of signs used to represent a single site. This is a complex type of symbolic representation. Although single symbols (such as red triangles) would have presented a much easier target, these composite symbols are challenging because of (1) the simplicity of the 'L' shape, which forms a part of many other symbols including letters, and (2) the variable and changing way in which they are employed to represent sites. We used eight training areas from four different maps, which contain 235 features in total.

| RESULTS
Within the 47 maps of the SoI analysed, 13,130 features were identified through the DL process before using the size threshold, 322 features in the early 1900s maps and 12,808 in the 1930s maps.
Applying the size threshold resulted in 638 high probability features (162 in the early 1900s maps and 476 in the 1930s maps).
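The 2 ha size threshold applied above can be sketched as a simple area filter over the detected polygons, with areas computed from vertex coordinates via the shoelace formula. The threshold value comes from the paper; the function names and polygon data are illustrative:

```python
def polygon_area_m2(vertices):
    """Shoelace formula; vertices as (x, y) tuples in metres (projected CRS)."""
    n = len(vertices)
    s = 0.0
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def filter_by_area(polygons, min_ha=2.0):
    """Keep only detections whose footprint is at least `min_ha` hectares."""
    min_m2 = min_ha * 10_000.0  # 1 ha = 10,000 m^2
    return [p for p in polygons if polygon_area_m2(p) >= min_m2]
```

In a GIS workflow the same filter is a one-line attribute query on a polygon layer's area field; the explicit version is shown to make the unit conversion visible.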
Comparing the results with the systematic manual identification, all detectors employed managed to detect at least 90% of the features identified through the manual identification (Table 1). In the case of the SoI, the detectors missed only seven features in total (Figure 3: 13-16 and 28), with the nuance that in a few cases the area identified is not large enough and these features are thus excluded once the 2 ha threshold is applied.
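The kind of accuracy comparison against manual identification summarized in Table 1 can be sketched as greedy matching of detected points to manual points within a distance tolerance, from which precision and recall follow. The 250 m tolerance and the coordinates are illustrative assumptions, not values from the paper:

```python
import math

def precision_recall(detected, manual, tol=250.0):
    """detected, manual: lists of (x, y) coordinates in metres.
    A detection is a true positive if it lies within `tol` of an
    as-yet-unmatched manual feature (nearest-first, greedy)."""
    unmatched = list(manual)
    tp = 0
    for d in detected:
        best, best_dist = None, tol
        for m in unmatched:
            dist = math.hypot(d[0] - m[0], d[1] - m[1])
            if dist <= best_dist:
                best, best_dist = m, dist
        if best is not None:
            unmatched.remove(best)   # each manual feature matches at most once
            tp += 1
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(manual) if manual else 0.0
    return precision, recall
```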

| DISCUSSION
The CNN-based automated detection and instance segmentation method presented here is able to produce a reliable approximation of mounds and other features in both the SoI and Levant map series.
These processes allow the production of digitized and geolocated areas of archaeological interest that can be used in the design of ground truth survey strategies and cultural heritage protection. It constitutes a quick and effective approach to develop preliminary information and initial hypotheses on the location, distribution and patterning of archaeological sites over large areas. These results could be combined with the analysis of remote sensing datasets to provide further support for interpretations made from map sources. Ultimately, however, field validation is still needed to confirm that a location is of archaeological interest.
Compared with the manual approaches commonly used, the automated detection results for the SoI maps are particularly effective in the identification of large mounds, which are strongly associated with archaeological sites. Of the 135 known archaeological sites depicted as mounds within the study area (Mughal et al., 1996), all but three were identified by the algorithm. Nonetheless, the algorithm has a less discriminating and interpretative capability than detailed human visual inspection. That limitation adds noise to the dataset, owing to other types of small, roughly circular features that are represented on the maps in a very similar way to settlement mounds, increasing the number of false positives. Size thresholds clearly have some potential for overcoming this problem, but at the cost of missing some points of archaeological interest. A larger training dataset might help to better identify the shape of the features, which could increase the effectiveness of the threshold.
However, the SoI-focused detectors offer more coherent results than those resulting from manual extraction made by a group of analysts, particularly if they are not experienced or an effort has not been made to standardize interpretations between team members.

TABLE 1 Summary of the accuracy obtained by the different detectors (see also Figures 3 and 4). a Manual identification does not correspond exactly to automated detected features plus missed features: variation ranges from 0 (D1b and D3) to 8 (D2) cases for several reasons: features missed during the manual inspection, features close to each other joined by the detectors, or small features that fall under the threshold. b The numbers given here for the SoI maps are the result of applying a 2 ha threshold (see Figure 3).
The detectors tested in the Levant series have provided a first insight into two different types of archaeological information contained in historical maps: toponymic references and conventional signs. Despite using a limited number of maps for the training, three for toponyms and four for complex site symbols, the results obtained show that the detectors missed relatively few of the targeted symbols and characters but did introduce a significant percentage of false positives. Thus, further work is still needed before these detectors can be employed effectively. The relatively small training dataset used for these maps, contrasted with the much better performance of the SoI detectors, suggests that larger training datasets would improve the results.

The use of computing platforms like Picterra provides a useful avenue for the implementation of automated detection in archaeological research. It is unreasonable to expect that all those archaeologists that could benefit from these approaches in academia, commercial archaeology or heritage management agencies will be able to build and train their own algorithms, especially given the computational capacity required. In that sense, the possibility of accessing ready-made instruments and platforms is beneficial for testing different approaches and sources, but it also offers the chance to involve more traditional archaeologists in the development of their own detectors and to help them understand the potential of the application of these technologies in our discipline. In that regard, automatic detection is now in a position to start making a practical contribution to the discipline and to be implemented as another instrument in the archaeologists' toolkit, in a similar way to which GIS was assimilated over the last 20 years (Wheatley & Gillings, 2013).

| CONCLUSIONS
Automatic detection and instance segmentation of objects in digitized historical maps using ML CNN-based approaches offer an efficient way forward for the retrieval of unique information of archaeological interest. However, the use of these approaches needs to take into account:
• The detection and masking capacity of the detector in terms of counting precision and recall and shape accuracy. Some features will be more easily and unequivocally detected than others, and this limitation must be taken into account when using these data for archaeological analysis and interpretation. Sites themselves are rarely detected directly; rather, proxies that can be used to extract information about sites are documented. Given the variations in map quality discussed above, this information can only be considered an approximation of the true number, size, form and location of the features of interest.
• The possibility of incorporating further datasets, both from other remote sources, such as satellite imagery, and through field-based ground checking. Given the inexact nature of counts, locations and shapes, the presence of a small percentage of false positives and the difference in accuracy and recording practices between individual surveyors, we argue that the best way to conceive of the results is through a probabilistic framework. This is particularly true of large-scale approaches where cross-referencing of information obtained through different methods and sources can be used to weight possible sites. The use of complementary approaches and sources has enormous potential to obtain probabilistic site distribution maps across large areas.
• The range and scale of the map series available. These approaches are most useful when applied to large map series where objects from many maps can be employed to train the different detectors and the time invested in training them will be compensated by their application to several hundred maps. Colonial map series, in particular, show similar symbology and survey approaches, and they extend across very large areas, often spanning several modern countries. These are factors that can make the development of multiple object-focused detectors worth the time invested in training them, particularly in comparison to manual approaches. The use of these techniques for small areas composed of a few maps is not recommended, as it will be difficult to obtain enough training data to develop an efficient detector and the time required to do this may exceed that which would be needed for expert-led manual detection.
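The probabilistic framework suggested in the second point can be sketched by combining independent evidence sources for a candidate site, each contributing a confidence score. The source names, weights and combination rule are illustrative assumptions, not a method from the paper:

```python
def combined_site_probability(evidence):
    """evidence: dict of source name -> probability that the source's
    positive reading is correct (assumed independent). Combines them as
    1 - product(1 - p_i): the chance at least one reading is a true hit."""
    miss = 1.0
    for p in evidence.values():
        miss *= (1.0 - p)
    return 1.0 - miss
```

Under this rule, a map-detector hit corroborated by satellite imagery scores higher than either source alone, while a field check that confirms the site pushes the combined probability to 1.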
The positive results of this first application of object segmentation using the SoI and French Levant maps open up the possibility of scaling up our analysis to larger areas covered by these colonial map series. Other large map series, such as those produced by Soviet cartographers across the USSR and parts of Europe and Asia (Davies & Kent, 2017), offer similar potential. We hope that, in time, colonial map series can be used for the understanding and protection of cultural heritage and local cultures instead of the direct and indirect exploitation for which they were originally intended.

ACKNOWLEDGEMENTS
(https://wamstrin.wordpress.com/). We would especially like to thank the staff of the Map Room and Imaging Services at the University Library at the University of Cambridge for providing access to and high-resolution copies of the SoI 1″ to 1-mile maps in their collection.
We would also like to thank Leo Rocher from Picterra for his availability and willingness to discuss several aspects of Picterra's technical processes. The manual extraction of data from the Levant map series covering Syria was undertaken as part of the Vanishing Landscape of Syria Project, which was funded by the Leverhulme Trust, Grant F00128/AR, and directed by Graham Philip, University of Durham.

CONFLICT OF INTEREST
The authors declare that no conflict of interest exists.

DATA AVAILABILITY STATEMENT
Though the code used in this paper cannot be shared, given the use of a proprietary cloud computing platform to train and run the detection process, readers can access the training data and results obtained for each of the detectors via the following link, which requires a free Picterra account (https://forms.gle/L8YngAd87eckYSDQA).

This proof of concept aimed to test the use of DL approaches for the extraction of features of archaeological interest from historical map series. The paper's results confirm the potential of DL-based segmentation. Future research will be geared towards the development of effective open-source detectors trained using larger collections of diverse feature types, which should significantly improve on the results presented in this paper.