What Happens Where During Disasters? A Workflow for the Multifaceted Characterization of Crisis Events Based on Twitter Data

Twitter data are a valuable source of information for rescue and relief activities in case of natural disasters and technical accidents. Several methods for disaster- and event-related tweet filtering and classification are available to analyse social media streams. Rather than processing single tweets, taking into account space and time is likely to reveal even more insights regarding local event dynamics and impacts on population and environment. This study focuses on the design and evaluation of a generic workflow for Twitter data analysis that leverages that additional information to characterize crisis events more comprehensively. The workflow covers data acquisition, analysis and visualization, and aims at the provision of a multifaceted and detailed picture of events that happen in affected areas. This is approached by utilizing agile and flexible analysis methods providing different and complementary views on the data. Utilizing state-of-the-art deep learning and clustering methods, we investigate whether our workflow is suitable to reconstruct and picture the course of events during major natural disasters from Twitter data. Experimental results obtained with a data set acquired during Hurricane Florence in September 2018 demonstrate the effectiveness of the applied methods but also indicate further interesting research questions and directions.

ment, spatio-temporal analysis of Twitter data (Steiger, de Albuquerque, & Zipf, 2015) and social media visual analytics (Ngamassi, Malik, Zhang, & Edbert, 2017), the following needs for further research can be pointed out:
• Two main types of information about crisis situations are usually desired: understanding "the big picture" (for humanitarian and governmental emergency management organizations) versus finding "implicit and explicit requests related to emergency needs that should be fulfilled or serviced as soon as possible" ("actionable insights" for local police forces and firefighters).
• Methods that can automatically identify actionable messages from a live data stream during emergency events, including their urgency and information category, are essential for crisis responders to launch rapid relief efforts.
• One research objective should be to develop novel methods and approaches towards the spatio-temporal analysis and exploration of social media data, also by leveraging existing geographic knowledge.
• The spatial dimension in social media data has been given particular attention, while other dimensions have not been fully exploited in data fusion.
• Visual analytics tools should be developed to help determine the progress of an emergency and to help identify the next possible solutions that could be used to alleviate damages.
• The fusion of social media data with census data and remote-sensing imagery is currently mainly about simple overlaying and aggregation and could therefore be further investigated.
Motivated by these findings, this work focuses on a flexible exploration of Twitter data streams that is based on spatio-temporal analyses and takes into account multiple dimensions of tweets, that is, space, time and message content. Our proposed workflow is intended to provide both a "big picture" of the course of events within an affected area and the identification of actionable tweets. The big picture describes when, where and what real and virtual event-related topics are discussed and how they change over time and space, whereas actionable information is conveyed by single messages indicating that help is urgently required.
The workflow covers data acquisition, analysis and visualization, and is intended to provide a multifaceted and detailed picture of events that happened in the affected area. The involved methods for short text analysis and clustering are identified according to the criteria flexibility, low computational costs, applicability to short messages and the successful application in other related works.
Furthermore, each of the methods is intended to provide a different complementary view on the data. For instance, local hotspot analysis is applied to analyse the spatial distribution of crisis-related tweets within affected areas, whereas spatio-temporal tweet clustering and classification is used to analyse the discussed topics within clusters. In order to investigate the capabilities of the workflow to reconstruct and picture the course of natural disaster events, it is exemplarily applied to a Twitter data set acquired during hurricane Florence in September 2018.
The article is organized as follows. In the next section, related work in the context of spatial and spatio-temporal data analysis is discussed. From a methodological point of view, this mainly involves clustering techniques, whereas from an application-oriented perspective, the focus is on event detection and spatio-temporal process analysis. Based on the review, our workflow for the multifaceted analysis of Twitter data related to natural disasters is proposed in Section 3. The components of the workflow are in-

| RELATED WORK
Exploring and reconstructing the course of (sub-) events potentially involves many different tasks, such as crawling, filtering, localizing and ranking of information, the detection, tracking and summarization of events, text classification, semantic enrichment and topic modelling. Systems for event-related social media data analysis usually focus on a specific challenge and therefore comprise a fixed sequence of tasks for which a fixed set of methods is utilized. Abel, Hauff, Houben, Stronkman, and Tao (2012) introduced Twitcident, a framework and Web-based system for filtering and searching information about real-world incidents or crises. Information filtering is adapted to the current temporal context by event profiling and semantic enrichment of tweets. Faceted search and analytic tools allow users to retrieve particular information from the enriched data. With a focus on automated classification of crisis-related tweets, a platform using Artificial Intelligence for Disaster Response (AIDR) is proposed by Imran, Castillo, Lucas, Meier, and Vieweg (2014). The Emergency Analysis Identification and Management System (EAIMS) proposed by McCreadie, Macdonald, and Ounis (2016) focuses on real-time detection of emergency events, related information finding and credibility analysis tools for use over social media during emergencies. Event Tracker (Thomas, McCreadie, & Ounis, 2019) aims to provide a unified view of an event, integrating information from emergency response officers, the public (via social media) and also volunteers from around the world. Its components are real-time identification of critical information, automatic grouping of content by the information needs of response officers, and real-time volunteer management and communication.
All mentioned systems contribute significantly to deriving an overview of the situation in areas affected by natural or man-made disasters. However, we can observe that the spatial and the spatio-temporal information of the data is not fully exploited. Despite the grouping of content in Thomas et al. (2019), analyses are mostly conducted tweet by tweet. A further result of our review is that spatio-temporal analyses are usually done in the context of a specific application, for example earthquake damage assessment (Resch, Uslaender, & Havas, 2018), or small-scale events (Huang et al., 2018), and are therefore often not transferable to other event types. This motivates our generic approach and the exploitation of message content, space and time.
In order to be application independent, we chose a rather method-oriented structure of the subsequent review. Since clustering is one of the key techniques in the field of spatio-temporal data mining (Shi et al., 2016), common clustering techniques are reviewed in the following. Clustering is often utilized for event detection and for spatio-temporal process analysis. Therefore, related work in this context is also discussed.

| Clustering approaches
A statistical approach for hotspot analysis is the Getis-Ord G* method (Ord & Getis, 1995). This local spatial autocorrelation approach is applied for disaster footprint estimation and damage assessment based on social media data in Resch et al. (2018).
Kernel density estimation (KDE) methods estimate a density surface from a set of point-based locations by adding functions, for example Gaussians, centred at each data point. In Lee, Gong, and Li (2017) (Cheng & Wicks, 2014) is an approach to analyse data points (incidents) within a space-time cube. A cylindrical window, of varying radius (space) and height (time), is moved across all possible space-time locations. Based on the number of observed incidents compared to the number of expected incidents, clusters of interest are identified.
In density-based spatial clustering of applications with noise (DBSCAN) (Ester, Kriegel, Sander, & Xu, 1996), the number of clusters is identified based on the quantity of densely connected components. Method parameters are the radius and the minimum number of neighbours within this radius. From these parameters, clusters with different shapes and similar densities are found. ST-DBSCAN (Birant & Kut, 2007) is an extension of DBSCAN discovering clusters according to non-spatial, spatial and temporal features.
Gaussian mixture modelling is a parametric approach for clustering data. Advantages of such model-based approaches are that they can handle differently scaled variables, for example resulting from different units of measure. Furthermore, since mixtures of different distributions are allowed, the joint analysis of continuous and categorical data is enabled. Thorough surveys of state-of-the-art mixed data clustering algorithms are provided in Ahmad and Khan (2019) and Hunt and Jorgensen (2011).
With special emphasis on the joint analysis of numerical and categorical data, a fast density-based clustering algorithm (FDCA) is proposed in Jinyin, Huihao, Chen, Shanqing, and Zhaoxia (2017). In a one-time scan through all data points, the cluster centres are determined automatically. A data similarity metric is designed to enable the joint clustering of numerical, categorical and ranking attributes.
However, the choice of the radius defining the local neighbourhood around each data point is not trivial and small changes can significantly affect the shapes and sizes of the clusters.
In contrast to the clustering approaches reviewed above, topic modelling techniques model tweets or documents as a mixture of topics. Latent Dirichlet allocation (LDA) (Blei, Ng, & Jordan, 2003) is a widely used probabilistic topic model, where each topic has a probability distribution over the terms contained in the documents.
Capturing representative topics from short texts with limited context is known to be challenging. Additionally, topic modelling approaches usually incur high computational costs while not being very effective in handling parallel events.
A suitable technique for dimensionality reduction and data analysis is non-negative matrix approximation (NNMA) (Dhillon & Sra, 2005), also known as non-negative matrix factorization (NMF) (Cichocki & Phan, 2009). As pointed out in Casalino, Castiello, Buono, and Mencar (2018), NMF is known to provide meaningful interpretations of mined information and therefore is successfully applied in various domains including the analysis of tweets.

| Event detection
Supervised, unsupervised and semi-supervised methods for the detection of physically occurring events as well as for emerging or popular topics based on Twitter data are discussed in Ramachandran and Ramasubramanian (2018) and Hasan, Orgun, and Schwitter (2018).
The authors conclude that further work is required to propose effective measures to filter out spam and trivial events. Furthermore, geo-tagging, the additional incorporation of videos and images as well as the detection of spambots, rumours and false reports is yet to be explored in more detail. Detecting events based on bursty keywords alone as a measure is not encouraged, as it may also return noisy data.
In Cheng and Wicks (2014), STSS is applied for event detection using Twitter data. STSS looks for clusters within a data set across both space and time, regardless of tweet content. It is expected that clusters of tweets will emerge during spatio-temporally relevant events, as people will tweet more than expected in order to describe the event and spread information. The method successfully detects events such as football matches, train and flight delays, and a helicopter crash from Twitter data.
With a special emphasis on real-world events, an approach for geo-spatial Twitter event detection is proposed in Walther and Kaisser (2013). Target events are often expected on a rather small scale, meaning that they happen at a specific place in a given time period, and are often covered by only a few tweets. After a pre-selection of tweets based on their geographical and temporal proximity, clusters (event candidates) are constituted based on simple distance measures in space and time. 41 features that address various aspects of the event candidates are then used to rank the tweets and to make a binary decision as to whether a tweet cluster constitutes a real-world event or not.
In Berlingerio et al. (2013), event identification is done by utilizing a spatio-temporal clustering approach based on the Louvain method (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). The original method, intended to analyse graph networks for community detection, was extended to also take into account the textual content of a tweet. The method successfully identified major events of the Occupy Wall Street movement in 2011.

| Spatio-temporal process analysis
The detection of small-scale spatio-temporal events from geo-tagged tweets based on ST-DBSCAN is the focus of Huang et al. (2018). After summarization of the word frequencies for each cluster, potential topics are modelled by latent Dirichlet allocation (LDA). ST-DBSCAN has been shown to be a good method for the detection of several different small-scale event types, for example gunshot events. However, a careful selection of the clustering parameters is required.
An approach based on machine learning and geo-visualization to identify events in cities and trace the development of these events in real-time based on Twitter data is presented in Zhou and Xu (2017).
The stream data are pre-processed and grouped into one-hour intervals. In order to find tokens related to events, bursty word detection techniques are applied. DBSCAN is utilized for spatial clustering and random forests are trained to classify tokens as event-related or unrelated. Due to the low proportion of geo-tagged tweets within the processed stream, only major local events could be detected.
A statistical approach for studying the spatio-temporal distribution of geo-located tweets in urban environments is proposed in Santa, Henriques, Torres-Sospedra, and Pebesma (2019). Instead of detecting and characterizing unique events, repeated patterns are identified by utilizing negative binomial regression analysis, PCA and hierarchical clustering. This approach may help to characterize the usual Twitter usage behaviour as a starting point for anomaly or event detection.
In Liang, Lin, and Peng (2018), a method for discovering the spatio-temporal process of a typhoon using Sina Weibo microblog data is presented. Support vector machines are utilized for classifying the text messages into four crisis-related classes. Further spatio-temporal analyses are based on keyword searches. The authors point out that more research is required to determine the most appropriate scale for spatial analyses.
In Resch et al. (2018), machine learning topic models (LDA) and spatio-temporal analysis (local spatial autocorrelation) of social media data are combined for hotspot detection, disaster footprint estimation and damage assessment. LDA is applied to a Twitter data stream in a cascading fashion in order to extract earthquake- and damage-related tweets. After statistical topic validation, hotspot analysis is conducted based on local spatial autocorrelation (Getis-Ord G*).
In Martin, Li, and Cutter (2017), the spatio-temporal variability in social media response is examined and an approach to leverage geotagged tweets to assess the evacuation responses of residents is developed. The approach involves (keyword- and location-based) retrieval of tweets, creation and filtering of data sets, and statistical and spatial processing to extract and map the results. Spatial and temporal analyses, which are done separately in this study, are mainly based on counting related tweets (per time and/or per county). Evacuation-related analyses, that is, how many Twitter users were evacuated, their evacuation destination and return date, involve the identification and tracking of active local users during different event periods.
In contrast to directly using Twitter data, a framework for discovering evolving domain-related spatio-temporal patterns is proposed in Shi et al. (2016). Given a target domain, a dynamic query expansion is employed to extract related tweets which are then used to form spatio-temporal Twitter events. Spatial clustering is based on the use of multi-level constrained Delaunay triangulation to capture the spatial distribution patterns of Twitter events. An additional spatio-temporal clustering process is then performed to reveal spatio-temporal clusters and outliers that are evolving into spatial distribution patterns.
A method for spatio-temporal anomaly detection through visual analysis of geo-located Twitter messages is proposed in Thom, Bosch, Koch, Woerner, and Ertl (2012). The approach enables the interactive analysis of location-based microblog messages in real-time by means of scalable aggregation and geo-located text visualization.
For this purpose, a novel cluster analysis approach is used to distinguish between local event reports and global media reaction to detect spatio-temporal anomalies automatically. The main tool for data analysis and visualization is the so-called ScatterBlogs system.
Its purpose is to enable analysts to work on quantitative and qualitative findings by not only automatically identifying anomalies, but also summarizing and labelling event candidates, and providing interaction mechanisms to examine them.
In Hu, Wang, Guin, and Zhu (2018), an STKDE-based framework for predictive crime hotspot mapping and evaluation is proposed.
A statistical significance test is designed to filter out false positives in the density estimates. Furthermore, a new metric is proposed to evaluate predictive hotspots at multiple scales.

| PROPOSED WORKFLOW
The main goal is to reconstruct and picture the course of events during natural disasters based on Twitter data. The workflow and the involved methods should therefore have the following properties:
• The workflow should be application- and event-independent.
• The focus is on the analysis of geo-located tweets posted within the affected area.
• A complete tool chain ranging from data acquisition to data visualization is desired.
• Both the processing of full crisis data sets and the incremental processing of incoming streamed data are targeted.
• The utilized methods shall provide a multifaceted, summarizing and also detailed picture of (sub-) events that happen in the affected area.
• Involved methods should be flexible, computationally tractable, applicable to short messages and successfully applied in other related works.
Our resulting Twitter data stream analysis workflow is depicted in Figure 1. Due to the sequential structure, the process can either be applied to a full data set or in a repeated manner, for example to analyse the latest incoming data every hour. Data acquisition is done by using the Twitter Streaming API. Capturing reactions from affected individuals is ensured by restricting crawling to a specified area of interest. The first step of data analysis is the filtering of crisis-related tweets. The tweets identified as crisis-related can directly be used for spatio-temporal and topic analyses. In case of spatio-temporal analysis, clustering techniques are applied in order to group posted messages with respect to time and space. Topic modelling is applied to automatically identify the discussed semantic topics in a set of messages, for example those posted during one day or within a spatio-temporal cluster. Each identified topic is characterized by a set of describing keywords that serve as a summarization. Each message is assigned to multiple topics based on probabilities.
In parallel, the tweets can further be classified into crisis-related information classes, such as Donations and Volunteering, Infrastructure and Utilities and Affected Individuals. The latter class is an example of potentially actionable tweets that might directly trigger an alert mechanism, which should be confirmed and further processed by a human interpreter.
Spatio-temporal analysis is conducted by applying methods for local hotspot detection, density estimation and clustering. The results can be visualized on interactive maps, map sequences and 3D plots. Furthermore, motivated by Alam, Ofli, Imran, and Aupetit (2018) and Thom et al. (2012), the class distribution within clusters and word clouds are useful tools to characterize and visualize tweet contents and their changes over time. The results from topic modelling methods applied to all crisis-related tweets (in contrast to cluster-wise analysis) are intended to provide a complementary view on the data.
Tweets not related to natural disasters are discarded in this work.
Following the approach in Santa et al. (2019), these tweets might be useful to characterize the usual local Twitter behaviour when no event takes place and may therefore be utilized to support event and anomaly detection in future works.
In the following, the choice of the workflow design and the identified methods for data analysis are described in detail.

| Tweet filtering
According to Shi et al. (2016), we can distinguish between the analysis of unfiltered and of domain-related Twitter data. By focusing on the domain of natural disasters, that is, by filtering crisis-related tweets, the amount of data to analyse in the subsequent steps is likely to be reduced significantly. We therefore favour a domain-related ap- is therefore utilized in this work. With this approach, where filtering is understood as a binary classification problem, superior results compared to other methods, such as Naïve Bayes, support vector machines (Burel & Alani, 2018), random forests and logistic regression were obtained (Nguyen et al., 2017). Implementation was done in Python 3.6 by utilizing tensorflow 1 and keras. 2 Each message is pre-processed with a set of standard operations, that is lowercasing, multiple space replacing, normalizing URLs, usernames and hashtags, removing special characters, normalizing digits and tokenization. Each word is then transformed into a real-valued vector of 300 elements by using a pre-trained word embedding specifically trained on 52 million crisis-related Twitter messages (Imran, Mitra, & Castillo, 2016). Hence, each tweet is represented as a matrix of size n × k, where n is the maximum number of words per tweet occurring in the training data and k = 300 is the embedding vector length. In case the number of words is lower than n, zero padding is applied. Then, a convolutional layer applies m kernels of different widths w in parallel to the input matrix. We use standard values m = 128 and w = (3, 4, 5), because parameter variation did not affect the results significantly. Global max pooling is performed after each of these convolutional layers, and the results are concatenated to a vector. This new embedding is then fed into a fully connected layer to determine the final class.
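The pre-processing, zero padding and convolution steps described above can be illustrated with a small pure-Python sketch. It is not the authors' tensorflow/keras model: the placeholder tokens (`<url>`, `<user>`, `<hashtag>`, `<number>`), the toy two-dimensional embedding and the ReLU activation are assumptions made for illustration, and a single hand-written kernel stands in for the m = 128 learned kernels per width.

```python
import re

def preprocess(text):
    """Normalize a raw tweet and split it into tokens (placeholders are hypothetical)."""
    text = text.lower().replace("<", " ").replace(">", " ")
    text = re.sub(r"https?://\S+", " <url> ", text)      # normalize URLs
    text = re.sub(r"@\w+", " <user> ", text)             # normalize usernames
    text = re.sub(r"#(\w+)", r" <hashtag> \1 ", text)    # mark hashtags, keep the word
    text = re.sub(r"\d+", " <number> ", text)            # normalize digits
    text = re.sub(r"[^a-z<>\s]", " ", text)              # remove special characters
    return text.split()                                  # also collapses multiple spaces

def to_matrix(tokens, embedding, n, k):
    """Build the n x k tweet matrix; unknown words and padding rows are zero vectors."""
    rows = [list(embedding.get(tok, [0.0] * k)) for tok in tokens[:n]]
    rows += [[0.0] * k for _ in range(n - len(rows))]
    return rows

def conv_global_max(matrix, kernel, bias=0.0):
    """Apply one kernel of width w = len(kernel) (requires n >= w), ReLU, global max pooling."""
    w = len(kernel)
    responses = []
    for start in range(len(matrix) - w + 1):
        total = bias
        for offset in range(w):
            total += sum(a * b for a, b in zip(matrix[start + offset], kernel[offset]))
        responses.append(max(0.0, total))
    return max(responses)

# Toy usage with k = 2 instead of 300 and a hypothetical embedding table.
tokens = preprocess("Flooding on HWY 17!!  Help @NCEmergency http://t.co/xyz #Florence")
embedding = {"flooding": [1.0, 0.0], "help": [0.0, 1.0]}
matrix = to_matrix(tokens, embedding, n=12, k=2)
feature = conv_global_max(matrix, kernel=[[1.0, 1.0]] * 3)
```

In a full model, one such pooled feature would be computed per kernel and width, and the concatenated vector would be passed to the fully connected layer.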

| Tweet classification and topic analysis
The application of state-of-the-art methods to classify crisis-related  Resch et al. (2018) and Casalino et al. (2018), these are useful approaches to reliably identify specific topics discussed on Twitter by avoiding explicit classification. We decided to utilize our own Python 3.6 implementation of NMF using the sklearn 3 library in this work, since NMF results have been shown to offer a high degree of understandability of the extracted topics in several application areas such as bioinformatics, pattern recognition, image analysis, educational data mining and document clustering (Casalino et al., 2018). In contrast to using topic modelling to filter a specific subset of messages (e.g. messages related to specific event types), we use it to reveal insights into the discussed topics and their spatio-temporal distribution within the fraction of crisis-related tweets.
As a starting point, the corpus (in our case the crisis-related tweets) is encoded in a term-document (or term-tweet) matrix, whose rows correspond to n terms in a vocabulary and whose columns are related to m tweets. NMF (Cichocki & Phan, 2009) basically is a dimensionality reduction technique which can be applied to decompose this matrix into two low-rank factor matrices constrained to have only non-negative elements and such that the product of these matrices approximates the term-tweet matrix. The rank k defines the number of hidden topics and has to be defined manually.
Compared to other traditional dimensionality reduction methods, NMF is able to uncover latent low-dimensional structures intrinsic in high-dimensional data and provides a non-negative, part-based representation of the data, which enhances meaningful interpretations of mined information (Casalino et al., 2018).
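To make the factorization concrete, the following sketch decomposes a tiny term-tweet matrix V (n terms by m tweets) into non-negative factors W (n by k) and H (k by m). It is a didactic stand-in, not the sklearn-based implementation used in this work: it assumes the classic multiplicative update rules for the Frobenius error, and the vocabulary, counts and helper names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorize the n x m matrix v into w (n x k) and h (k x m), both non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h

def top_terms(w, vocabulary, topic, count=3):
    """Return the highest-weighted terms of one topic (one column of w)."""
    ranked = sorted(range(len(vocabulary)), key=lambda i: w[i][topic], reverse=True)
    return [vocabulary[i] for i in ranked[:count]]

# Hypothetical corpus: tweets 0-1 mention flooding terms, tweets 2-3 mention power terms.
vocabulary = ["flood", "water", "road", "power", "outage", "dark"]
v = [[2, 1, 0, 0], [1, 2, 0, 0], [1, 1, 0, 0],
     [0, 0, 2, 1], [0, 0, 1, 2], [0, 0, 1, 1]]
w, h = nmf(v, k=2)
keywords = [top_terms(w, vocabulary, topic) for topic in range(2)]
```

The columns of W give the keyword weights per topic, while the columns of H indicate how strongly each tweet is associated with each topic. The rank k = 2 is chosen by hand here, as in the text.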

| Spatio-temporal analysis
The filtered and classified tweets can then be further analysed by taking into account space and time. Since local hotspot analysis, STKDE and (ST-) DBSCAN clustering have been shown to provide reasonable results in different applications related to tweet analysis (Hu et al., 2018; Huang et al., 2018; Lee et al., 2017; Resch et al., 2018; Zhou & Xu, 2017), they were identified as suitable methods for this work. Message content is not taken into account with these approaches. However, the tweet labels obtained by the CNN classifier can be used to visualize the content of the clusters. Furthermore, a visualization of the discussed topics in terms of word clouds (see, for example, Thom et al., 2012, and Casalino et al., 2018) is applied. In the following, the selected methods are described in detail.
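As a rough illustration of how cluster contents can be summarized, the sketch below counts class labels and word frequencies per cluster. The input format (tuples of cluster id, class label and tokens) and the function name are assumptions; the word counts are the kind of weights a word cloud renderer would typically consume, and stop-word removal is left out for brevity.

```python
from collections import Counter, defaultdict

def summarize_clusters(tweets):
    """tweets: iterable of (cluster_id, class_label, tokens) tuples."""
    classes = defaultdict(Counter)   # class distribution per cluster
    words = defaultdict(Counter)     # word frequencies per cluster
    for cluster_id, label, tokens in tweets:
        classes[cluster_id][label] += 1
        words[cluster_id].update(tokens)
    return classes, words

# Hypothetical labelled and clustered tweets.
tweets = [
    (0, "Infrastructure and Utilities", ["power", "outage", "downtown"]),
    (0, "Infrastructure and Utilities", ["power", "lines", "down"]),
    (0, "Affected Individuals", ["family", "needs", "help"]),
    (1, "Donations and Volunteering", ["donate", "water", "shelter"]),
]
classes, words = summarize_clusters(tweets)
```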

| Local hotspot detection
The goal of applying the Getis-Ord G* statistic is to detect local coldspots and hotspots of crisis-related Twitter activity. This is accomplished by clustering high and low tweet occurrences and measuring the concentration of occurrences in specific areas. We used the tweets posted within congressional districts for our experiments.

A JavaScript implementation 4 and the open-source Geographic Information System Software QGIS 5 for pre- and post-processing and visualization were utilized in this work.
The Getis-Ord $G^*$ statistic measures the degree of association that results from the concentration of weighted points (or area represented by a weighted point) and all other weighted points included within a radius of distance $d$ from the original weighted point (Ord & Getis, 1995). Hence, it is a tool for hotspot and coldspot analysis, in which high and low values are clustered and the concentration of these values in a specific area is measured. A simple definition of the $G^*$ statistic of a specific point $i$ is given by

$$G_i^* = \frac{\sum_{j=1}^{n} w_{i,j}\, x_j}{\sum_{j=1}^{n} x_j},$$

where $x_j$ is the attribute value (e.g. number of tweets) for feature $j$, $w_{i,j}$ is the spatial weight between entity $i$ and $j$, and $n$ is the number of entities (e.g. points or areas). $G_i^*$ is usually standardized by the sample mean $\bar{X}$ and the sample standard deviation $S$:

$$G_i^* = \frac{\sum_{j=1}^{n} w_{i,j}\, x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^2 - \left(\sum_{j=1}^{n} w_{i,j}\right)^2}{n-1}}},$$

with

$$\bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2}.$$

An entity with a high value might be interesting, but may not be a statistically significant hotspot in terms of the $G^*$ statistic. To be a statistically significant hotspot, an entity will have a high value and be surrounded by other features with high values as well. The local sum for an entity and its neighbours is compared proportionally to the sum of all entities. When the local sum is much different from the expected local sum, and that difference is too large to be the result of random chance, a statistically significant $G_i^*$ value (Z-score) results. $G_i^*$ values around zero indicate a random distribution.
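The standardized statistic can be sketched in a few lines of Python. The sketch assumes the commonly used z-score form, in which the weighted local sum minus its expectation (sample mean times the sum of weights) is divided by S times the square root of (n times the sum of squared weights minus the squared sum of weights) over (n - 1). The binary neighbourhood weights and the tweet counts are hypothetical, and this is not the JavaScript implementation referred to in the text.

```python
import math

def getis_ord_gi_star(values, weights):
    """Return one Gi* z-score per entity; weights[i][j] must include i itself."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(x * x for x in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        w_sum = sum(weights[i])
        w_sq_sum = sum(w * w for w in weights[i])
        local_sum = sum(w * x for w, x in zip(weights[i], values))
        # The denominator vanishes if all values are equal or if every weight is 1.
        denom = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom)
    return scores

# Five hypothetical areas along a line; each area neighbours itself and adjacent areas.
values = [10, 9, 1, 1, 1]            # e.g. normalized tweet counts
weights = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
scores = getis_ord_gi_star(values, weights)
```

Areas with high values that are also surrounded by high values receive large positive scores (hotspots), whereas low-valued neighbourhoods receive negative scores (coldspots).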

| STKDE density estimation
In contrast to the $G^*$ statistic, space-time kernel density estimation (STKDE) might be directly applied to data points in the spatio-  where the density $f$ is computed based on all points $(x_i, y_i, t_i)$ for which the spatial distance $d_i$ and the temporal distance $t_i$ to the current point are lower than the thresholds $h_s$ and $h_t$, respectively.
Hence, the indicator function $I$ returns 0 if one of these thresholds is exceeded and 1 otherwise. For the kernel functions $k_s$ and $k_t$, the Epanechnikov kernel (Epanechnikov, 1969) is applied, where each data point is weighted according to its distance in time ($t_i$) and space ($d_i$) to the current voxel (the closer the data point, the higher the weight).
Since information regarding the population is not taken into account with this approach, the resulting density estimates picture absolute occurrences of crisis-related tweets. This is useful information that can support disaster (sub-) event detection. In case a normalization by population density is desired, the dual KDE approach proposed in Wang, Ye, and Tsou (2016) might be applied instead.
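A minimal sketch of the density estimate at a single voxel is given below. It assumes one common STKDE form, in which the sum of kernel products over all points inside both bandwidths is divided by n times h_s squared times h_t, and it assumes the standard Epanechnikov normalizations (0.75 in one dimension, 2/π for the radially symmetric two-dimensional case). Coordinates are taken to be planar (e.g. km) with time in days; all names are hypothetical.

```python
import math

def epanechnikov_1d(u):
    """Temporal kernel k_t; u is temporal distance divided by h_t."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def epanechnikov_2d(u):
    """Radially symmetric spatial kernel k_s; u is spatial distance divided by h_s."""
    return (2.0 / math.pi) * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def stkde(x, y, t, points, h_s, h_t):
    """Density at voxel (x, y, t) estimated from points [(x_i, y_i, t_i), ...]."""
    total = 0.0
    for x_i, y_i, t_i in points:
        d = math.hypot(x - x_i, y - y_i)   # spatial distance to the voxel
        dt = abs(t - t_i)                  # temporal distance to the voxel
        if d < h_s and dt < h_t:           # indicator function I
            total += epanechnikov_2d(d / h_s) * epanechnikov_1d(dt / h_t)
    return total / (len(points) * h_s ** 2 * h_t)

# One hypothetical tweet at the origin, evaluated with h_s = 7.5 km and h_t = 0.3 d.
points = [(0.0, 0.0, 0.0)]
at_point = stkde(0.0, 0.0, 0.0, points, h_s=7.5, h_t=0.3)
nearby = stkde(1.0, 0.0, 0.0, points, h_s=7.5, h_t=0.3)
farther = stkde(5.0, 0.0, 0.0, points, h_s=7.5, h_t=0.3)
outside = stkde(10.0, 0.0, 0.0, points, h_s=7.5, h_t=0.3)
```

Evaluating `stkde` on every node of a regular space-time grid yields the density volume that can then be visualized as map sequences or 3D plots.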

| ST-DBSCAN clustering
ST-DBSCAN (Birant & Kut, 2007) is a clustering approach based on DBSCAN (density-based spatial clustering of applications with noise) (Ester et al., 1996). In DBSCAN, the density associated with a point is obtained by counting the number of points in a region of specified radius around the point. Points with a density above a chosen threshold are grouped into clusters. Compared to so-called parametric approaches, such as Gaussian mixture models, no assumption about the cluster shapes is introduced.
Hence, DBSCAN has the ability to discover clusters with arbitrary shape such as linear, concave and oval. Furthermore, the number of clusters is estimated. ST-DBSCAN can cluster spatio-temporal data according to non-spatial, spatial and temporal attributes.
Besides a distance measure (e.g. Euclidean distance) between data points, a second similarity measure is introduced in order to describe non-spatial similarity between attributes, such as tem- (1)
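The density-based expansion can be sketched as follows. This is a simplified variant and not the full algorithm of Birant and Kut (2007): it assumes that two points are neighbours when they are within a spatial radius `eps_space` and a temporal threshold `eps_time` of each other, and it omits any further non-spatial attribute comparison. Parameter names and the toy data are hypothetical.

```python
import math

NOISE = -1

def st_dbscan(points, eps_space, eps_time, min_pts):
    """points: list of (x, y, t). Returns one cluster label per point (-1 = noise)."""
    def neighbours(i):
        xi, yi, ti = points[i]
        return [j for j, (xj, yj, tj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps_space and abs(ti - tj) <= eps_time]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)              # includes point i itself
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue                   # already assigned, do not expand again
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:   # j is a core point: keep expanding
                queue.extend(j_neighbours)
        cluster += 1
    return labels

# Two hypothetical bursts at the same place but five days apart, plus one isolated tweet.
points = [(0, 0, 0.0), (1, 0, 0.1), (0, 1, 0.2),
          (0, 0, 5.0), (1, 0, 5.1), (0, 1, 5.2),
          (50, 50, 2.0)]
labels = st_dbscan(points, eps_space=2.0, eps_time=0.5, min_pts=3)
```

Because the temporal threshold separates the two bursts, a purely spatial DBSCAN would merge them into one cluster, whereas this variant keeps them apart.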

| EXPERIMENTAL RESULTS
In this section, the proposed workflow is applied and evaluated in detail. Besides tweet filtering and classification, special emphasis is In the following, the used data, conducted experiments and results of the three main processing steps of the proposed workflow, that is filtering, classification and topic modelling as well as spatio-temporal analysis are described.

| Data set
Hurricane Florence, the wettest tropical cyclone on record in the Carolinas, made landfall as a category 1 hurricane early on 14 September 2018. Due to weather forecasts predicting the trajectory of the hurricane to pass the area of North and South Carolina, a corresponding area (see Figure 2) was defined for Twitter data acquisition. From 12 to 19 September, around 600,000 geo-located tweets from this area were acquired.
Since a geo-location is available for only a small fraction of Twitter data, location-based crawling introduces bias by neglecting the significantly larger fraction of information without geo-location. On the other hand, filtering by keywords might be problematic as well, since unrelated messages containing the keywords will be retrieved, while important messages not containing any of these will be discarded. However, location-based filtering is required in order to provide tweets from directly affected individuals, whereas tweets from users who are not directly involved, but contribute to discussions, are discarded.

| Tweet filtering
The CNN model was trained with a balanced set of around 120,000 crisis-related and unrelated tweets covering various types of natural and man-made disasters. The resulting temporal distribution of related and unrelated tweets, binned into intervals of two hours, is shown in Figure 3. A daily tweet activity pattern can be observed, with maximum values of around 10,000 to 12,000 tweets per interval until around midnight.
14.9% (around 30,000) of all tweets (around 600,000) are classi- [...] revealed that general crisis-related aspects and the heavy 2018 California wildfires were discussed. However, most of these tweets turned out to be false positives.
The filtering approach was recently enhanced and further investigated in terms of cross-event and cross-type applications (Wiegmann, Kersten, Klan, Potthast, & Stein, 2020). Here, an average misclassification rate of 3.8% for 5 million unrelated tweets was achieved.

| Spatio-temporal analysis
In this section, the results obtained with local hotspot detection, STKDE density estimation, ST-DBSCAN clustering and NMF topic modelling are discussed.

| Local hotspot detection
As pointed out in Resch et al. (2018), useful spatial information cannot be derived without taking into account the population density.
We therefore normalized the number of tweets in each district by the number of inhabitants per square kilometre, available from ESRI (2018). To be a statistically significant hotspot, an area must have a high normalized tweet occurrence and be surrounded by other areas with high values as well. The local sum for an area and its neighbours is compared proportionally to the sum of all areas. When the local sum differs significantly from the expected local sum, a statistically significant Z-score results.
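The normalization and the local-sum comparison described above can be sketched as follows. This is a minimal illustration of the Getis-Ord Gi* statistic, not the implementation used in the experiments: the binary contiguity weights, the six-district chain layout and all numbers are assumptions chosen for the example.

```python
import math

def getis_ord_gi_star(values, neighbours):
    """Return a Getis-Ord Gi* z-score for every area.

    values     -- normalized tweet counts, one per area
    neighbours -- neighbours[i] lists the indices of areas adjacent to area i
                  (binary weights are an assumption of this sketch)
    """
    n = len(values)
    mean = sum(values) / n
    # Global standard deviation as used in the Gi* definition.
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    z_scores = []
    for i in range(n):
        members = set(neighbours[i]) | {i}   # Gi* includes the focal area itself
        w_sum = len(members)                 # sum of the binary weights
        local_sum = sum(values[j] for j in members)
        expected = mean * w_sum              # expected local sum
        # With binary weights, the sum of squared weights equals w_sum.
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z_scores.append((local_sum - expected) / denom if denom > 0 else 0.0)
    return z_scores

# Hypothetical districts arranged in a chain (each borders its predecessor and successor).
tweets = [12, 10, 14, 90, 120, 100]            # tweets per district
density = [60, 50, 70, 30, 40, 50]             # inhabitants per square kilometre
normalized = [t / d for t, d in zip(tweets, density)]
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
z = getis_ord_gi_star(normalized, chain)
```

In this toy layout, district 4 has a high value and high-valued neighbours, so its z-score exceeds 1.96, while district 1 sits in a low-valued neighbourhood and receives a z-score below -1.96.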
In our experiments, we determined the G*-statistics with a spa- [...] The two maps on the lower right side in Figure 4 [...]

| STKDE density estimation
In STKDE, densities are estimated for each point in a regular grid with a chosen spatial resolution of 5 km and a temporal resolution of 0.1 d. Optimal values of h_s = 7.5 km and h_t = 0.3 d for the spatial and temporal analysis bandwidths were identified empirically.
A 2D view of the results is shown in Figure 5. According to the estimated density, each point is colour-coded and adjusted in size.
As expected, the areas around Charlotte and Raleigh are identified as the densest crisis-related tweet activity areas. Less dense areas correspond to less densely populated areas, for example Durham, Fayetteville, Columbia and Wilmington. It is worth noting that the 2D view in Figure 5 does not represent the result of a 2D analysis in which all disaster-related tweets from different points in time are projected into the 2D plane. A 3D view of the same STKDE result in the spatio-temporal domain is depicted in Figure 6. In this [...] Applying STKDE to tweets representing more specific topics is a useful option but might introduce challenges due to data sparsity.
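To make the role of the two bandwidths concrete, the following sketch evaluates a space-time kernel density at a few grid points. It assumes Epanechnikov kernels in space and time and planar coordinates in kilometres; the kernel choice, the function name `stkde` and the sample events are assumptions for illustration, since this section does not state them.

```python
import math

def stkde(grid_points, events, h_s=7.5, h_t=0.3):
    """Space-time kernel density at each (x_km, y_km, t_days) grid point.

    h_s -- spatial bandwidth in km, h_t -- temporal bandwidth in days.
    Epanechnikov kernels are an assumption of this sketch.
    """
    n = len(events)
    densities = []
    for gx, gy, gt in grid_points:
        total = 0.0
        for ex, ey, et in events:
            u2 = ((gx - ex) ** 2 + (gy - ey) ** 2) / h_s ** 2  # squared scaled distance
            w = abs(gt - et) / h_t                              # scaled time difference
            if u2 < 1.0 and w < 1.0:                            # inside both bandwidths
                k_s = (2.0 / math.pi) * (1.0 - u2)              # 2D spatial kernel
                k_t = 0.75 * (1.0 - w * w)                      # 1D temporal kernel
                total += k_s * k_t
        densities.append(total / (n * h_s ** 2 * h_t))
    return densities

# Hypothetical events: three tweets close in space and time, one isolated tweet.
events = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.05), (0.0, 1.0, 0.1), (50.0, 50.0, 2.0)]
grid = [(0.0, 0.0, 0.0), (50.0, 50.0, 2.0), (25.0, 25.0, 1.0)]
d = stkde(grid, events)
```

The grid point near the three co-located tweets receives the highest density, the point at the isolated tweet a lower one, and the point far from all events a density of zero. Sparse topic-specific subsets would leave most grid points at zero, which illustrates the sparsity issue mentioned above.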

| ST-DBSCAN clustering
So far, rather high-level and broad event developments in space and over time could be observed. According to Walther and Kaisser (2013), specific real-world events (as parts or sub-events of major events) can often be expected on a rather small scale, meaning that they happen at a specific place in a given time period and are often covered by only a few tweets. In order to capture these types of (sub-)events, ST-DBSCAN is applied to our set of crisis-related tweets.
A colour-coded scatter plot of cluster centres obtained with a spatial threshold of t_s = 5 km, a temporal threshold of t_t = 0.5 hr and a minimum number of tweets min_t = 5 is depicted in Figure 7. Each of the 385 identified clusters is represented by a circle with a randomly chosen colour, where the circle size depends on the number of tweets contributing to a cluster. At first sight, the ST-DBSCAN clusters correspond to the densities of the STKDE results in Figure 6.
Dense STKDE areas, in which a lot of tweets were posted within a specific time range, also exhibit large ST-DBSCAN clusters. The crucial difference between STKDE and ST-DBSCAN is that the latter assigns each tweet to a cluster, whereas STKDE provides a density value for each point in space and time. [...] Clusters with a few tweets might be related to one specific topic, whereas a large cluster, representing all tweets posted during one day within a city, can be expected to cover several different topics. This motivates further analyses of the clusters and messages, which are described in the next section.

FIGURE 7 3D view of ST-DBSCAN results derived from crisis-related tweets (spatial threshold t_s = 5 km, temporal threshold t_t = 0.5 hr and minimum number of tweets min_t = 5). Each coloured dot represents a cluster and is scaled according to the number of tweets contributing to this cluster.
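The effect of the three parameters can be illustrated with a simplified ST-DBSCAN sketch. It is an assumption-laden reduction of the method: the non-spatial similarity measure of the original algorithm is omitted, distances are computed with the haversine formula, the neighbourhood count includes the tweet itself, and the sample tweets are hypothetical.

```python
import math

NOISE = -1

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude pairs."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def st_dbscan(tweets, t_s=5.0, t_t=0.5, min_t=5):
    """tweets: list of (lat, lon, time_in_hours). Returns one label per tweet."""
    def neighbours(i):
        lat, lon, t = tweets[i]
        # A neighbour must be close in time AND in space (includes i itself).
        return [j for j, (la, lo, tj) in enumerate(tweets)
                if abs(tj - t) <= t_t and haversine_km(lat, lon, la, lo) <= t_s]

    labels = [None] * len(tweets)
    cluster = 0
    for i in range(len(tweets)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_t:
            labels[i] = NOISE                # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster          # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue                     # already assigned to a cluster
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_t:             # core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels

# Hypothetical tweets: six within minutes near Wilmington, one at the same
# place ten hours later, and one in Charlotte.
sample = [
    (34.2257, -77.9447, 10.0), (34.2260, -77.9450, 10.1), (34.2250, -77.9440, 10.2),
    (34.2270, -77.9430, 10.3), (34.2240, -77.9460, 10.4), (34.2265, -77.9445, 10.5),
    (34.2257, -77.9447, 20.0), (35.2271, -80.8431, 10.2),
]
labels = st_dbscan(sample)
```

In this sketch, tweets that fail the density criterion are labelled as noise (-1). The late tweet at the same location is not merged into the cluster, which shows how the temporal threshold separates events that share a place but not a time.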

| Cluster analysis and interpretation
Two different approaches are chosen here: the visualization of class distributions and the visualization of word clouds for each cluster.

ST-DBSCAN identified 385 spatio-temporal clusters. Around 300 clusters contain only a few tweets (around 10) and 100 clusters are rather large (approximately 100-1,000 tweets). In Figure 8, the CrisisLex information class distributions for the ten largest clusters, comprising around 500-4,000 tweets, are depicted. All tweets iden- [...] For instance, the number of messages expressing sympathy and support (see cluster 61 with 556 tweets in Figure 8) has a peak on 14 September, the day of the landfall. After the landfall, more messages related to donations and volunteering (clusters 14 and 28) as well as infrastructure and utilities and affected individuals (cluster 11) can be observed. More significant changes, for example the absence of one or more of the information classes, cannot be observed here.
Like class distribution histograms, word clouds visualizing the most frequent terms tend to be rather unspecific for large clusters. For clusters containing only a few messages, however, class distribution histograms are not appropriate. Instead, word clouds turn out to be quite useful and informative in this case.
Without the need to read individual messages, the word clouds immediately provide an overview of the frequently used terms. Six selected example word clouds are shown in Figure 9. Interestingly, we found that the small clusters can often be connected to real happenings, such as flood and tornado warnings, road and beach closures, power outages, damage, fallen trees and power lines, as well as the need for or availability of donations in a specific area. As shown in Figure 9(f), virtual events are also found, such as a critical discussion regarding the handling of outdated emergency posts and news prioritizing emergency news sources.

| NMF topic modelling
Before tweet encoding, we applied some standard pre-processing steps, such as filtering retweets, lowercasing, removing user names, links, double spacings and numbers, tokenization, stop word removal and stemming. For NMF, we identified the following parameters to be suitable: k = 20 topics, regularization factor α = 0.1 and a regularization mixing parameter r = 20, which ensures a balanced combination of element-wise L1 and L2 regularization. We found 10 of the 20 identified topics to be human-interpretable and crisis-related (see Table 1). The remaining topics relate to specific places, to things people talk about while waiting (e.g. food), to their expectations about when they can return to their homes after being evacuated, and to mixtures of rather non-actionable or non-emergency-related topics (according to manual inspection of random samples).
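The pre-processing steps applied before tweet encoding (retweet filtering, lowercasing, removal of user names, links, numbers and double spacings, tokenization and stop word removal) can be sketched as follows. The regular expressions, the tiny stop word list and the function name `preprocess` are assumptions for illustration; stemming is left out because it would normally rely on an external library.

```python
import re

# Tiny illustrative stop word list (assumption; a real run would use a full list).
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "on", "at", "to", "of", "and", "for"}

def preprocess(tweets):
    """Return a list of token lists, one per non-retweet message."""
    cleaned = []
    for text in tweets:
        if text.startswith("RT @"):                  # filter retweets
            continue
        text = text.lower()                          # lowercasing
        text = re.sub(r"@\w+", " ", text)            # remove user names
        text = re.sub(r"https?://\S+", " ", text)    # remove links
        text = re.sub(r"\d+", " ", text)             # remove numbers
        text = re.sub(r"\s+", " ", text).strip()     # collapse double spacings
        tokens = re.findall(r"[a-z]+", text)         # simple tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
        cleaned.append(tokens)                       # stemming omitted in this sketch
    return cleaned

# Hypothetical example messages.
raw = [
    "RT @user: Stay safe everyone",
    "Flooding on Market St in Wilmington!! @NWS https://t.co/xyz 3 roads closed",
]
docs = preprocess(raw)
```

The resulting token lists would then be encoded, for example as a term-document matrix, before the factorization into k topics.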
As an example, the spatio-temporal distributions of the topic flood for 15-17 September are shown in Figure 10. Each dot on the maps represents one or more tweets assigned to the topic. The larger the marker, the more tweets contribute to the topic in the corresponding region. From these plots, clear occurrence peaks in Charlotte and in Durham can be observed on 16 and 17 September, respectively. Besides these rather large-scale events, which are clearly visible because a large fraction of the population is affected, high-impact flood events represented by only a handful of tweets, for example in Wilmington, are also visible. A subsequent clustering and summarization, for example using word clouds, would help to capture and visualize the present information in this case as well.
Further relations between the mapped NMF topics and real events reported on local news sites could be identified, for example in the case of traffic-related problems, road and highway closures, and power outages.

| DISCUSSION
The experimental results demonstrate that the proposed workflow has great potential to meet our defined goals of capturing local [...] spatio-temporal clusters tends to be a mixture of various topics that is difficult to interpret in detail. This motivates the application of topic modelling techniques. Neglecting information about space and time, the topics discussed are extracted from the collection of all tweets. The spatio-temporal distribution of topics can then be visualized, for example on daily maps. In our experiments, this daily analysis is applied to Twitter data for the whole region affected by the hurricane. Ten meaningful crisis-related topics, such as Flood, Traffic and Situation Report, could be identified automatically by NMF (see Table 1). As demonstrated for the topic Flood (see Figure 10), mapping these topics and observing topic changes over time enables the identification of real events. We conclude from our findings that topic modelling is also likely to be a good approach for extracting more fine-grained information from large ST-DBSCAN clusters.
Our findings indicate that the proposed workflow is a very effective tool. However, further experiments, improvements and adjustments are necessary. Applying filtering, clustering, topic modelling and hotspot analysis in a varied order is likely to reveal more detailed event- and user-specific information. For instance, a sequence of filtering crisis-related tweets, extracting damage-related tweets by classification or topic modelling, and then clustering (or hotspot detection) might be a good approach to map potential damage reported via Twitter.

| CONCLUSIONS AND OUTLOOK
In this paper, a general workflow to capture large-scale and local natural disaster event dynamics from Twitter data is proposed. Even though the involved methods have been shown to provide a meaningful and multifaceted view of the events during hurricane Florence at different granularities, this workflow can be seen as a starting point for various improvements, further experiments and interesting research directions. Due to the desired flexibility and transferability of the workflow, another sequence of methods might be more appropriate for specific applications. In turn, this flexibility would enable the mining of more detailed information, for instance when topic modelling or a further clustering step is applied to a large cluster found by ST-DBSCAN.
Qualitative and exemplary results of STKDE, ST-DBSCAN, hotspot analysis, NMF topic modelling and cluster analysis based on word clouds and semantic class histograms are presented in this work in order to demonstrate the effectiveness and value of our workflow. However, more systematic and quantitative experiments are required in order to evaluate each workflow component in detail. Related to this, an interesting and important topic for further research is how spatio-temporal analysis results can be meaningfully evaluated, since ground truth information, in particular for very local events, is often not available or is distributed over various data sources (news articles, feeds and blogs). Furthermore, workflow transferability to different event types has to be evaluated.
Two interesting but contrary research directions are possible for the workflow development. One possibility would be to add more interactivity, workflow flexibility and visualization options, for instance with a graphical user interface that allows users to inspect results and to adjust method parameters according to specific requirements. The other would be to increase the degree of automation in order to obtain a fully automated and near real-time event detection and tracking system. For the latter, modules for continuous data acquisition, information localization, event detection, data storage and others have to be added to the workflow. Additional available information, such as tweet metadata and the directed network of "@username" mentions on Twitter, could also be exploited for data analysis.
With respect to the summarized open research directions in the introduction, the extraction and ranking of important actionable messages should be further investigated. For this, it would be helpful to gain more insights on the specific language characteristics for these types of short messages. Furthermore, the combination and fusion of analysis results with other information sources should be addressed in future research, since data solely obtained from a single source are likely to only represent a specific subgroup of citizens affected by a crisis and thus reveal just a partial view on the event.

ACKNOWLEDGEMENTS
The authors would like to thank the National Institute of Standards and Technology (NIST) and the organizers of the Text REtrieval Conference Incident Streams Track (TREC-IS) for labelling hurricane Florence tweets as well as Felix Juch for supporting the experiments for this paper. Open access funding enabled and organized by Projekt DEAL.