Exploring human–nature interactions in national parks with social media photographs and computer vision

Understanding the activities and preferences of visitors is crucial for managing protected areas and planning conservation strategies. Conservation culturomics promotes the use of user‐generated online content in conservation science. Geotagged social media content is a unique source of in situ information on human presence and activities in nature. Photographs posted on social media platforms are a promising source of information, but analyzing large volumes of photographs manually remains laborious. We examined the application of state‐of‐the‐art computer‐vision methods to studying human–nature interactions. We used semantic clustering, scene classification, and object detection to automatically analyze photographs taken in Finnish national parks by domestic and international visitors. Our results showed that human–nature interactions can be extracted from user‐generated photographs with computer vision. The different methods complemented each other by revealing broad visual themes related to level of the data set, landscape photogeneity, and human activities. Geotagged photographs revealed distinct regional profiles for national parks (e.g., preferences in landscapes and activities), which are potentially useful in park management. Photographic content differed between domestic and international visitors, which indicates differences in activities and preferences. Information extracted automatically from photographs can help identify preferences among diverse visitor groups, which can be used to create profiles of national parks for conservation marketing and to support conservation strategies that rely on public acceptance. The application of computer‐vision methods to automatic content analysis of photographs should be explored further in conservation culturomics, particularly in combination with rich metadata available on social media platforms.


Introduction
Protected areas are considered a cornerstone for protecting species and ecosystems (Watson et al. 2014). In many countries, iconic national parks act as the flagships of the protected-area network. Historically, the establishment of national parks originated from the desire to preserve scenic landscape areas of national or regional importance (Lee 1972; Schullery & Whittlesey 2003). Even today, protected-area visitor rates are associated with access to scenic landscapes, available visitor activities, and biodiversity values (Neuvonen et al. 2010; Siikamäki et al. 2015; Hausmann et al. 2017). Recreational use of conservation areas may directly or indirectly help fund conservation on-site and gain political support for conservation (Di Minin et al. 2013; Whitelaw et al. 2014; Balmford et al. 2015), even if the relationship between recreational visits and nature conservation is sometimes complex (Bateman & Fleming 2017; Buckley 2018).
To develop the recreational use of protected areas in line with conservation goals, protected-area organizations in many countries actively gather visitor information. Depending on the organization, visitor information may be collected with registration forms at the park entrance, by placing counters along the paths, or by conducting surveys or interviews on-site or online (Pietilä & Fagerholm 2019). Information about different groups of visitors may then be used to guide national park management and marketing actions, as well as conservation strategies (Kruger et al. 2017).
Although it is acknowledged that understanding the human dimensions of environmental issues supports nature conservation (Bennett et al. 2017; Sutherland et al. 2018), traditional on-site approaches for collecting information are time-consuming and costly. Therefore, user-generated online content is increasingly used as an information source in conservation science under the emerging subfield of conservation culturomics (Arts et al. 2015; Di Minin et al. 2015; Ladle et al. 2016), and interest is also increasing among practitioners.
Social media data are a particularly interesting source of information for understanding human-nature interactions because they provide spatially and temporally explicit data on visits, together with rich textual and visual content (Toivonen et al. 2019). Although textual content analysis can provide useful insights on nature conservation (Ladle et al. 2016), there have been calls for increased attention to visual communication in conservation culturomics research (Sherren et al. 2017; Ghermandi & Sinclair 2019). Focusing on photographs as the source of information avoids the challenges that arise from language and from the limitations of textual analysis (Carter et al. 2013).
Social media photographs have already proven useful, for example, for obtaining information on visitor preferences or activities in national parks (Hausmann et al. 2018) and on cultural ecosystem services across landscapes (Richards & Friess 2015; Van Berkel et al. 2018; Pickering et al. 2020). Laborious manual analyses of photographs are now complemented by automated visual content analysis methods. State-of-the-art computer-vision methods allow for information to be extracted from large volumes of photographs by classifying the content into predefined classes (such as landscapes), by recognizing discrete objects (such as species), or by grouping together similar images for human analysts. These approaches have recently been used to monitor species (Sharma et al. 2018) and to examine aesthetic preferences (Seresinhe et al. 2017, 2018) and human activities and preferences (Richards & Tunçer 2018; Gosal et al. 2019; Koylu et al. 2019).
We aimed to contribute to the application of computer-vision methods to visual content analysis in protected-area visitor monitoring. We evaluated the applicability of 3 computer-vision methods for extracting information on human-nature interactions in national parks with social media photographs. We aimed to answer questions that are typically analyzed by visitor surveys, such as the preferences of different visitor groups or geographical differences in activities. Our study area was Finnish national parks, and we used Flickr data for our exploration. We sought to answer the following questions: What information can state-of-the-art computer-vision methods extract from social media photographs? Do different visitor groups share different types of photographs from national parks? How does photographic content vary between different types of national parks?
To answer our questions, we collected geotagged Flickr data from the 20 most popular national parks in Finland. We classified the users into national and international visitors based on their profile information. We applied 3 computer-vision methods to the photographs, namely, semantic clustering of photographic content, scene classification, and instance-level object detection, and evaluated their applicability to visitor monitoring of protected areas. Using our findings, we considered the potential and challenges of using social media photographs and computer-vision methods to understand the use of and values associated with protected areas and in conservation more broadly.

Study Area
Finland has 40 national parks located from the hemiboreal coastal zone to the tundra of the northernmost parts of Lapland. Visitor numbers are rising steadily. In 2019, the parks received more than 3.2 million visitors (https://www.metsa.fi/web/en/visitationnumbers). In 2020, the numbers have surged due to the COVID-19 crisis and people wishing to visit nature. The parks are managed by Parks and Wildlife Finland (Metsähallitus). The organization has systematically collected information on national park use, activities, and preferences for several decades (Kajala et al. 2007) and established profiles of the parks (broader description in Appendix S1). Because the natural and seminatural landscapes are relatively similar throughout Finland, we wanted to see if the computer-vision methods used could reveal differences between national parks located across different landscape regions. We focused our analysis on 20 popular national parks based on the availability of Flickr photographs. We grouped the selected national parks into 4 broad landscape categories for further analysis (Fig. 1).

Downloading Flickr Data
Flickr is a social media platform for sharing images and video, and it is particularly popular among professional photographers and nature enthusiasts (Di Minin et al. 2015). The Flickr API allows open access to Flickr content in compliance with the restrictions set by photo owners (https://www.flickr.com/help/terms/api). Geotagged Flickr posts correspond relatively well to the popularity of Finnish national parks. We used data from Flickr instead of other platforms (such as Twitter or Instagram) because its terms of service allow for the application of computer-vision methods to analyze the visual content of photographs (Toivonen et al. 2019).
First, we searched the Flickr API (https://www.flickr.com/services/api/) for all geotagged Flickr posts located within 500 m of all Finnish national parks (n = 40) in January 2019. This returned 14,585 geotagged posts uploaded by 969 unique users from 2002 to 2019. Second, we downloaded the images at their original size up to the highest available size allowed by the application programming interface (1024×768 pixels). In total, 13,363 images were available for download. Finally, we selected 20 parks with the highest Flickr post counts (>100) as our final data set for content analysis: 12,759 images uploaded by 824 unique users. The amount of Flickr data and Finnish national park visitor counts are available in Appendix S2.
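The search step can be illustrated with a minimal sketch that only builds a request URL for the Flickr API method flickr.photos.search. It is not the query used in the study: the padded bounding box is a simplifying assumption that stands in for a 500 m buffer around a park boundary, the coordinates and API key are hypothetical, and the function names are ours.

```python
import math
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def padded_bbox(min_lon, min_lat, max_lon, max_lat, pad_m=500.0):
    """Expand a bounding box by roughly pad_m metres on every side."""
    mid_lat = (min_lat + max_lat) / 2.0
    dlat = pad_m / 111_320.0  # approximate metres per degree of latitude
    dlon = pad_m / (111_320.0 * math.cos(math.radians(mid_lat)))
    return (min_lon - dlon, min_lat - dlat, max_lon + dlon, max_lat + dlat)


def search_url(api_key, bbox, page=1):
    """Build a flickr.photos.search URL for geotagged photos inside bbox."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "bbox": ",".join(f"{v:.6f}" for v in bbox),  # min_lon,min_lat,max_lon,max_lat
        "has_geo": 1,
        "extras": "geo,date_taken,url_l",
        "per_page": 250,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)


# Hypothetical park extent (decimal degrees) and placeholder API key.
bbox = padded_bbox(24.40, 60.25, 24.70, 60.35)
print(search_url("YOUR_API_KEY", bbox))
```

A bounding box overestimates the buffered park area, so results from such a query would still need to be filtered against the actual park geometry.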

User Classification
We manually classified the 824 unique users in the data set as national (from Finland) or international and by gender based on the information available in public profiles. We detected the probable country of residence for each user, primarily based on the self-reported home location in the user profile. If the user had not reported their home location, we combined information from the user's name, profile descriptions, and linked websites to determine the country of residence. In some cases, forenames and surnames can give a good indication of the geographic region of origin (Longley et al. 2015), particularly when combined with other information. If profile information was not sufficient, we also considered the geographic distribution of photographs for determining the home location. For example, we classified users as locals if they mentioned a Finnish hometown in the profile description, used the Similarly, we classified users as internationals if the profile information referred to a place of residence outside of Finland. We recorded gender as male or female based on the username, profile picture, and other available information. For some users, it was not possible to detect the home location or gender due to limited or ambiguous information.

Automating Content Analysis with Computer Vision
We used 3 computer-vision methods for automatic visual content analysis of photographs taken at national parks. These methods use deep neural networks, a family of machine learning algorithms (LeCun et al. 2015). Semantic clustering involves using a pretrained neural network to extract a high-dimensional feature vector that represents the semantic content of the photograph, whose dimensionality is then reduced to enable plotting low-dimensional representations to explore similarities and differences between photographs and their contents. We used a neural network trained to classify objects into 1000 categories as a feature extractor. Scene classification involves classifying photographs into predefined categories, providing a set of potential category labels and their associated probabilities (Zhou et al. 2018). Instance-level object detection detects individual instances of objects belonging to predefined categories and their locations in the photograph. This method returns the predicted label of the object, its associated probability, and its predicted location in the photograph. The computer-vision methods are summarized in Table 1. All images required preprocessing because the computer-vision methods used required the input size to be of fixed dimensions. Images were resized to 224×224 pixels for feature extraction and scene classification, and to 512×512 pixels for instance-level object detection. Because most images did not have an aspect ratio of 1:1 (equal height and width), we resized the images to a fixed height of 224 or 512 pixels before cropping 224 or 512 pixels in the middle of the image. This kind of center crop, which assumes that the most important content is centered in the photograph, preserves the shape of objects in the image because the aspect ratio is not altered, although some objects at the edges of the photograph may be lost during preprocessing.
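The resize-and-center-crop step can be sketched as a small geometry calculation. This is an illustration under an assumption rather than the preprocessing code used in the study: the text describes resizing to a fixed height, whereas the sketch scales the shorter side to the target so that portrait images also remain large enough to crop. For landscape images the two are equivalent. The function name is ours, and the crop box uses the (left, upper, right, lower) order.

```python
def center_crop_geometry(width, height, target):
    """Return (resized_size, crop_box) for a square center crop.

    The image is scaled so that its shorter side equals `target`, which keeps
    the aspect ratio, and a target x target window is then cut from the middle.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# A 1024x768 download prepared for feature extraction (224) and detection (512).
print(center_crop_geometry(1024, 768, 224))
print(center_crop_geometry(1024, 768, 512))
```

The strips that fall outside the crop box are the image edges that the text notes may be lost during preprocessing.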

Semantic Clustering
We evaluated several neural network architectures and pretrained models for semantic clustering. The neural network architectures included VGG16 (Simonyan & Zisserman 2015), NASNet (Zoph et al. 2018), Xception (Chollet 2017), ResNet50 (He et al. 2016), and ResNeXt101 (Xie et al. 2017), which were trained to classify images into the 1000 object categories in the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al. 2015). We evaluated the performance of each architecture and model qualitatively by extracting high-dimensional feature vectors, the size of which ranged from 512 to 2048 dimensions. We then used UMAP (uniform manifold approximation and projection) (McInnes et al. 2018), a dimensionality reduction algorithm, to reduce the feature vectors to 2 dimensions for visualization. The UMAP algorithm reduces dimensionality by learning to map points between high- and low-dimensional spaces, while attempting to preserve the structure of the high-dimensional graph (McInnes et al. 2018  to be used together with UMAP to semantically cluster the photographs based on their content.
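The intuition behind semantic clustering is that photographs with similar content have nearby feature vectors. The sketch below does not implement UMAP or any of the networks listed above. It only illustrates the idea of a neighborhood in feature space with toy 4-dimensional vectors. Cosine similarity as the measure and all names and values are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(features, query_id, k=2):
    """Return the k photo ids most similar to query_id, most similar first."""
    query = features[query_id]
    scored = [
        (cosine_similarity(query, vec), pid)
        for pid, vec in features.items()
        if pid != query_id
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


# Toy vectors standing in for 512- to 2048-dimensional network features.
features = {
    "ski_1": [0.9, 0.1, 0.0, 0.2],
    "ski_2": [0.8, 0.2, 0.1, 0.1],
    "forest_1": [0.1, 0.9, 0.3, 0.0],
    "forest_2": [0.0, 0.8, 0.4, 0.1],
}
print(nearest_neighbors(features, "ski_1", k=1))
```

A dimensionality reduction algorithm such as UMAP aims to keep neighborhoods of this kind together when the vectors are projected to 2 dimensions for plotting.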

Scene Classification
For scene classification, we used a neural network with the VGG16 architecture (Simonyan & Zisserman 2015) trained on the Places365 data set (Zhou et al. 2018) and implemented by Kalliatakis (2017). Places365 is a data set that contains 1.8 million images belonging to 365 indoor and outdoor scene categories. The data set was developed for scene classification, that is, the task of recognizing the type of a visual scene represented in an image. Places365 is a subset of the larger Places2 database, which contains 10 million images for 434 scene categories. For each image, we retrieved the top-3 predicted labels and their associated probabilities.
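Retrieving the top-3 labels and their probabilities can be sketched as follows. The sketch assumes that the classifier produces one raw score per scene category and that a softmax converts the scores to probabilities. The scores, the 4-category label list, and the function names are hypothetical, and a real Places365 model would output 365 scores.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(labels, scores, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for one photograph.
labels = ["forest_path", "tundra", "snowfield", "ski_slope"]
scores = [2.0, 0.5, 1.0, -1.0]
for label, prob in top_k(labels, scores):
    print(f"{label}: {prob:.2f}")
```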

Instance-Level Object Detection
For instance-level object detection, we used a neural network with the Mask R-CNN architecture trained on the Microsoft COCO (common objects in context) data set, which features 80 object categories consisting of everyday objects, such as persons, household items, and animals (Lin et al. 2014). We used a Mask R-CNN implementation by Abdulla (2017). Mask R-CNN provides each object detected and segmented from an image with a probability that reflects the confidence of the model about the prediction. To improve the results, we included only object instances detected with a confidence of 0.7 (70%) or higher.
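The confidence filter can be sketched in a few lines. The detection format (a dictionary with label, score, and box) is a hypothetical stand-in for the output of a detector and is not the data structure of the Mask R-CNN implementation cited above.

```python
MIN_CONFIDENCE = 0.7  # threshold stated in the text


def keep_confident(detections, threshold=MIN_CONFIDENCE):
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Hypothetical detections for one photograph; boxes are (x1, y1, x2, y2).
detections = [
    {"label": "person", "score": 0.98, "box": (120, 80, 260, 400)},
    {"label": "backpack", "score": 0.70, "box": (150, 140, 230, 260)},
    {"label": "bird", "score": 0.42, "box": (400, 30, 430, 55)},
]
print([d["label"] for d in keep_confident(detections)])
```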

User Groups
Out of the 824 unique users, we identified 62% as locals and 33% as internationals. Visitors were mostly from Europe, the United States, and Japan. For 5% of the users, who contributed 2% of the photographs, the exact country of residence could not be determined, but they were counted as international visitors in the final classification. Gender classification revealed a strong gender bias; 79% of users were men. Only 10% of the users were classified as women, who contributed 4% of the photographs. We could not determine gender for 11% of the user profiles. These profiles either had no clear indication of gender or they were organizational profiles. Overall, local men had posted 72% of all photographs. (Details on origin and gender classification in Appendix S3.)

Automatic Visual Content Analysis with Computer Vision
Semantic clustering with 2-dimensional UMAP representations of original images showed that photographs with similar semantic content were clustered (Fig. 2), indicating that the neural network could extract distinctive high-level semantic information from the photographs. Individual clusters feature photographs of seasonal activities, such as skiing and orienteering; objects, such as dogs, plants, and humans; and landscapes, such as sky with auroras or shoreline views. Photographs with human activities are clustered together, whereas landscapes form their own clusters. Furthermore, photographs of forests during winter and summer, as well as seascapes, form individual clusters, to name just a few examples. The photographic content also seems to cluster according to the landscape region. Further visualizations allow drawing comparisons between popular photographic content among national and international visitors, as well as different landscape regions (Fig. 3) and seasons (Appendix S6).
Figure caption (fragment): plots for national and international visitors and image feature plots for (b) Lapland Fells, (c) Eastern Hills, (d) Forests and Lakes, and (e) Archipelago regardless of user origin (the darker the color, the denser the cluster of semantically similar photographs).
Scene classification predicted a scene category for each photograph (Fig. 4). The classifier identified 325 unique scene categories in the data set. Natural and seminatural scene categories, for example, forest path (9%), broadleaf forest (7%), snowfield (4%), and tundra (4%), were the most common. The 10 most common scene categories were found in 41% of the photographs (Fig. 4). Manual validation confirmed that the classification results were mostly meaningful. Distinct scene classes were positioned in distinct areas in the 2-dimensional UMAP plot, providing further validation (Appendix S8). The average confidence for the predictions ranged from 0.48 for the most likely category to 0.16 for the second and 0.084 for the third categories. The poorest confidence values were associated with close-ups and portraits of humans, which is not surprising given that the neural network was trained to classify visual scenes, not objects.
Instance-level object detection found objects in 7821 photographs. The most common object was a person, which was detected in approximately 37% of all photographs and in 60% of photographs that were predicted to contain some object. Most photographs contained a single person (49%) followed by 2 (20%) and 3 (10%) persons. Most photographs (4058) contained only one unique detected object, whereas the maximum number of unique detected objects in a single photograph was 11. To obtain some indication of the activities of users, we looked at the most common objects other than person recognized with instance-level object detection (Table 2 & Appendix S11). Many objects are related to activities, including backpack (present in 13% of photographs with objects), bench (9%), bird (7%), and boat (6%). Objects directly related to sport activities included bicycle (3%), skis (3%), sports ball (2%), frisbee (2%), and kite (1%).
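Summaries of this kind can be derived from per-photograph label lists. The sketch below assumes a hypothetical mapping from photo id to the labels of detected object instances (already filtered by confidence). It counts each label once per photograph when computing shares, and counts person instances separately. The function names and toy data are ours.

```python
from collections import Counter


def label_shares(photo_labels):
    """Share of photographs with at least one object that contain each label."""
    with_objects = {pid: set(labels) for pid, labels in photo_labels.items() if labels}
    n = len(with_objects)
    if n == 0:
        return {}
    counts = Counter(label for labels in with_objects.values() for label in labels)
    return {label: count / n for label, count in counts.items()}


def persons_per_photo(photo_labels):
    """Distribution of the number of persons among photographs with a person."""
    return Counter(
        labels.count("person")
        for labels in photo_labels.values()
        if "person" in labels
    )


# Toy data: one list of detected instance labels per photograph.
photo_labels = {
    "a": ["person", "person", "backpack"],
    "b": ["person"],
    "c": ["bird"],
    "d": [],
}
print(label_shares(photo_labels))
print(persons_per_photo(photo_labels))
```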

Differences between National and International Visitors
The results revealed differences between national and international visitors to the parks. Photographs taken by the 2 groups largely overlapped each other in the visualization in Fig. 3, which suggests that both national and international visitors take photographs with largely similar content, but certain differences between these 2 groups may be identified by comparing Figs. 2 and 3. For example, almost all photographs of orienteering were taken by national visitors and in the same landscape region (Appendix S13). Photographs of forests taken in the summer were more common among the locals, whereas international visitors shared photographs of forests in the winter, selfies, and skiing. Similar differences appeared in the scene classification results when looking at the most confidently identified scene categories: national visitors post more photographs belonging to forest path and broadleaf forest categories, whereas international visitors shared photographs of ski slope, snowfield, and tundra (Fig. 4).
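A comparison of this kind can be sketched by computing, for each visitor group, the share of photographs assigned to each scene category and then ranking the categories by the difference between groups. The records below are toy values, not data from the study, and the function names are ours.

```python
from collections import Counter, defaultdict


def category_shares(records):
    """Map each group to the share of its photographs in each category."""
    counts = defaultdict(Counter)
    for group, category in records:
        counts[group][category] += 1
    return {
        group: {cat: n / sum(c.values()) for cat, n in c.items()}
        for group, c in counts.items()
    }


def share_differences(shares, group_a, group_b):
    """Categories sorted by share in group_a minus share in group_b."""
    categories = set(shares[group_a]) | set(shares[group_b])
    diffs = {
        cat: shares[group_a].get(cat, 0.0) - shares[group_b].get(cat, 0.0)
        for cat in categories
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Toy records: (visitor group, most likely scene category) per photograph.
records = [
    ("national", "forest_path"), ("national", "forest_path"),
    ("national", "forest_path"), ("national", "snowfield"),
    ("international", "snowfield"), ("international", "snowfield"),
    ("international", "forest_path"), ("international", "tundra"),
]
shares = category_shares(records)
print(share_differences(shares, "national", "international"))
```

Positive values mark categories that are relatively more common among the first group, and negative values mark those more common among the second.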
Objects were detected in 4938 photographs (61%) uploaded by national visitors and 1741 photographs (59%) by international visitors. We identified the most common objects from photographs taken by both groups (Table 2 & Appendix S11). A person was the most common category, present in 3706 photographs (37%) by national and in 1023 photographs (37%) by international visitors. On average, photographs taken by international visitors featured more people than those taken by national visitors, regardless of the season or landscape region, although photographs by national visitors featured more persons during summers and in the Forests and Lakes landscape region (Appendix S14). Common objects detected among both visitor groups were related to physical activities (backpack, bicycle, and skis) and eating and picnicking (bench and dining table). The category dog reflects both dog walking and dog sleigh riding. The latter is a popular activity primarily among international visitors (Appendix S12). Other categories that reflect nature photography (bird and potted plant) were more popular among national visitors.

Differences between National Parks in Different Landscape Regions
The plots for each landscape region showed distinct clusters of photographs from each region (Fig. 2). A closer visual examination of clusters featuring orienteering and forest paths showed that they came mostly from the national parks in the Forests and Lakes region in southern and central Finland. Winter photographs were mostly taken in the Lapland Fells or Eastern Hills. Sea and lakeside photographs were predominantly from more southern landscape regions (Archipelago and Forests and Lakes). Many activity photographs with dogs, skis, or bikes were distributed across landscape regions. The cluster for orienteering (Fig. 2) overlapped largely with the Forests and Lakes region, forest path, and park scene categories and coincided temporally with orienteering events (Appendix S13). The results for scene classification revealed a similar trend (Fig. 4). Photographs classified to the forest path category came mostly from the Forests and Lakes region, whereas most photographs classified as tundra or ski slope were taken in the Lapland Fells. Due to the visual similarity of certain landscapes in Finland, some photographs have clearly been misclassified, for example, tundra in the archipelago shores (Table 2 & Appendix S8).

Discussion
We used 3 computer-vision methods to automate the visual content analysis of photographs from national parks, and to evaluate their usability in understanding differences between regions and visitor groups. To support the application of these methods in practice, we concentrated on models that were available off-the-shelf and pretrained to perform a given task. In other words, their application does not require provision of manually labeled data or advanced in-house programming. Our results showed that each of the methods provided a view of the photo content and could be useful for a range of information needs in protected-area user monitoring and management. Many photographs taken in Finnish national parks featured landscapes, and we used scene classification to classify photographs into predefined categories (Zhou et al. 2018). The model predicted scene categories that fit the landscape regions defined for Finnish national parks: tundra and ski slope were commonly predicted for photographs taken in the Lapland Fells, forest path in the Forests and Lakes region, and creek in the Eastern Hills region, which featured prominent river landscapes. Although some of these predictions may sound trivial, they confirm that scene classification produces meaningful results and provides quantifications of the representation of these landscapes in the photograph content. In our case, scene classification provided information on the most photogenic landscapes in each landscape region and separately for national and international visitors. If photographs represent landscape values (van Zanten et al. 2016), our results suggest that the international visitors value winter landscapes and activities like skiing and dog sledging, whereas Finnish visitors appreciate summer forests, autumn colors, and activities like orienteering, biking, and cooking.
The finding that international visitors, for example, value snow and use commercial services more than local visitors do is in accordance with individual park-level visitor surveys (see https://julkaisut.metsa.fi/), but social media photo analysis provides more nuance at broader geographical scales and higher temporal resolution.
Instance-level object detection predicts instances of predefined object categories and their locations in the photograph. We found this approach to be useful for separating landscape photographs from close-up photographs and their combinations. In earlier works, visitor activities have been classified manually based on the contents of social media photographs. Identifying the objects present in photographs automatically contributed to this need. The most common objects (e.g., backpack, skis, boat, or bird) can be directly associated with activities that have been identified as the most popular in visitor surveys. Instance-level object detection can also be used to select photographs for further analysis. To exemplify, inspecting photographs with dogs revealed a major difference between national and international visitors in our data. Both groups share photographs of dogs, but almost all photographs of dogs taken by international visitors were taken on organized dog sleigh safaris in the Lapland Fells, whereas national visitors shared photographs of dogs mainly from forest walks (Appendix S12). This illustrates that analyzing the results of automatic visual content analysis can reveal differences between park activities and visitor groups and how visitors use services provided by the local economy.
Unlike the first 2 methods, semantic clustering does not assign photographs or objects detected in them into predefined categories. Rather, it is useful for automatically organizing large volumes of photographs without any prior knowledge of their content. This enables a rapid overview of visual content posted across all protected areas by revealing meaningful clusters of photographs featuring different landscapes, animals, and human activities. This information may provide protected-area managers with rapid situational awareness. In the case of Finnish national parks, semantic clustering enabled identifying subtle differences between visitor groups and national parks across the entire data set. We propose that this method can be used to obtain an overall understanding of the photographic content posted from even broad areas of interest. All 3 computer-vision methods provided complementary perspectives to the automatic analysis of social media photography. In our case, automatic content analysis of photographs confirmed previous insights from visitor surveys, such as preferred activities, but also provided a completely new level of detail compared with traditional visitor surveys. These insights include emerging or event-type activities (e.g., orienteering in some parks), differing preferences between Finnish and foreign visitors (e.g., interest in dog sledging and other commercially organized wintertime activities among foreigners), and differing seasonality in visual content among visitor groups. In well-managed parks, the local park management is often familiar with their most popular activities. The proposed methods can provide an equally detailed understanding of visitor activities at national and regional scales yet provide a fine-grained view at a temporal resolution of individual events. 
Considering high costs involved with traditional visitor surveying, our positive experiences suggest that these methods may considerably improve understanding of visits to protected areas and human-nature interaction in general, particularly in areas where detailed monitoring of visitors is not feasible.
Like many recent studies on green areas (Sherren et al. 2017; Ghermandi & Sinclair 2019; Toivonen et al. 2019), we used Flickr as our data source in this study. Other social media platforms, such as Instagram, may capture a broader variety of human activities (Hausmann et al. 2018), but their content is not available for download or their terms do not allow computer-vision analysis (Toivonen et al. 2019). Therefore, despite the biases in the user base and the more limited content, Flickr continues to be a relevant source of data for visual content analysis, particularly when applying automated methods. Because the most photographed object on social media platforms is often people, both analysis and reporting of results must follow appropriate ethical practices (Zook et al. 2017; Di Minin et al. 2021). Compared with manual analyses of the photographs, the application of computer-vision methods may be less intrusive because individual photographs are not viewed by a human except when verifying the output from algorithms.
Beyond visitor monitoring and social media analyses, computer-vision methods are broadly relevant to various needs of conservation science. They may make it easier, for example, to analyze phenological changes (Correia et al. 2020), observe the occurrence of species (Willi et al. 2019), or track illegal wildlife trade. These methods hold much potential for further development in terms of combining semantic representations of content with other sources of information. For example, semantic clustering could be enriched by combining semantic representations of photographs with metadata related to time, place, user profile, and camera type, allowing the resulting visualizations to incorporate information about both photographs and their context. Furthermore, analyzing the combinations of textual and visual content would likely provide an even more comprehensive picture of visitor preferences and activities in nature.
Our findings suggest that applying the computer-vision methods to social media photographs is a useful addition to the visitor monitoring toolkit in protected areas. Different methods provide complementary views to large collections of user-generated photographs by identifying landscapes or objects that stand in for specific activities or simply by organizing large volumes of photographs based on their semantic content. The proposed methods improve constantly as new architectures, models, and data sets are developed and made openly available, which allows them to be rapidly incorporated into the analysis workflows of conservation science. We thus propose that the application of computer-vision methods to social media data should be explored further under the umbrella of conservation culturomics.