Satellite Video Remote Sensing for Flood Model Validation

Satellite‐based optical video sensors are poised as the next frontier in remote sensing. Satellite video offers the unique advantage of capturing the transient dynamics of floods with the potential to supply hitherto unavailable data for the assessment of hydraulic models. A prerequisite for the successful application of hydraulic models is their proper calibration and validation. In this investigation, we validate 2D flood model predictions using satellite video‐derived flood extents and velocities. Hydraulic simulations of a flood event with a 5‐year return period (discharge of 722 m³ s⁻¹) were conducted using Hydrologic Engineering Center—River Analysis System 2D in the Darling River at Tilpa, Australia. To extract flood extents from satellite video of the studied flood event, we use a hybrid transformer‐encoder, convolutional neural network (CNN)‐decoder deep neural network. We evaluate the influence of test‐time augmentation (TTA)—the application of transformations on test satellite video image ensembles during deep neural network inference. We employ Large Scale Particle Image Velocimetry (LSPIV) for non‐contact‐based river surface velocity estimation from sequential satellite video frames. When validating hydraulic model simulations using deep neural network segmented flood extents, critical success index peaked at 94% with an average relative improvement of 9.5% when TTA was implemented. We show that TTA offers significant value in deep neural network‐based image segmentation, compensating for aleatoric uncertainties. The correlations between model predictions and LSPIV velocities were reasonable and averaged 0.78. Overall, our investigation demonstrates the potential of optical space‐based video sensors for validating flood models and studying flood dynamics.


10.1029/2023WR034545
Satellite videos present unique challenges in the form of limited annotated data and the complexity of both spatial and temporal information, which calls for advanced architectures suited to understanding the complex spatial and temporal relationships present in dynamic video scenes. Further developments in optical flow monitoring techniques, a facet of computer vision, are enabling the analysis of the motion of objects within a sequence of images or video frames. Specifically, non-intrusive optical flow-based techniques for the estimation of velocities in rivers have been demonstrated as a viable tool for acquiring spatially distributed flow velocity information in natural environments (Pearce et al., 2020; Perks et al., 2020). Estimating surface river velocities from satellite-based video is a promising domain for flood science, although issues around the low temporal resolution of satellite video, image pre-processing and satellite platform drift demand further investigation.
Two-dimensional hydraulic models are an integral tool for understanding flood dynamics. Outputs from these models are actively used for flood risk management (Tsakiris, 2014), infrastructure design (Shrestha et al., 2022), disaster preparedness (Nkwunonwo et al., 2020) as well as in modeling the future of flooding under different climate change scenarios (Mishra et al., 2018), thus aiding in the design of long-term adaptation strategies. Inundation modeling using hydraulic models serves as the principal tool for understanding the intensity of riverine flood hazards. Despite their extensive applicability, two-dimensional hydraulic models have rarely undergone rigorous validation against observed data to assess the skill of their predictions (Pasternack, 2011; Wing et al., 2021). In general, 2D modeling studies often use minimal data to assess model accuracy, meaning that models typically target the most basic performance benchmarks, and in many cases, models have simply not been validated (Molinari et al., 2019). Indeed, many regions around the world lack comprehensive and high-quality data of commensurate coverage that can be used to validate these models (Rollason et al., 2018). A handful of studies exemplify the validation of 2D models, including those by Bernhofen et al. (2018), Eilander et al. (2023), and Wing et al. (2017). These studies utilized satellite-derived flood extents to validate global flood models, which offer a macro-level understanding of flood risk but do not capture the fine-scale intricacies of local flood dynamics. Even fewer studies report the validation of two-dimensional hydraulic model simulations using velocities, such as Barker et al. (2018), Fischer et al. (2015), and Williams et al. (2013), who relied upon traditional point-velocity measurements for assessing model skill. In fact, whilst some two-dimensional models might accurately replicate flood extents, substantial deviations of simulated velocity might be observed when compared to field observations (G. Li et al., 2022). Further, there is wide recognition that both two-dimensional models and observations come with their own set of uncertainties, such as those linked to extreme discharge measurements, terrain data accuracy and observation field data errors, which can introduce discrepancies when validating models (Grimaldi et al., 2016; Schumann, 2017). A systematic investigation of the role of alternative spatially distributed data sets for validating 2D flood models is therefore pertinent.
Remote sensing for flood inundation studies relies on two categories of sensors for monitoring surface water dynamics: microwave and optical sensors (Dasgupta et al., 2018; Grimaldi et al., 2020). Flood water pixel identification from optical satellite imagery has conventionally relied largely on spectral water indices (e.g., the Normalized Difference Water Index (NDWI) (McFeeters, 1996) and the modified NDWI (H. Xu, 2006)) as well as supervised and unsupervised classification. These techniques have been known to misclassify (overestimate) water bodies (Khalid et al., 2021). Machine learning methods, such as Support Vector Machine classifiers and Random Forest algorithms, have also been adopted in several floodplain mapping studies (e.g., Mobley et al., 2021; Nandi et al., 2017), with significant contributions to near real-time flood hazard mapping (Ho et al., 2021). Big data analytics and computer vision techniques (specifically, deep learning) are now paving the way for automated delineation of flood extents with high accuracy (J. Wang et al., 2022). Deep convolutional neural networks (CNNs) have revolutionized binary and multi-class image classification and are especially relevant in time-sensitive applications such as flood inundation mapping (Shastry et al., 2023). CNNs overcome several key limitations of traditional machine learning in image classification tasks; they are highly scalable and can process large amounts of complex data with little human intervention. CNNs can also leverage transfer learning: the use of networks pre-trained on extremely large data sets, then fine-tuned for new tasks, enhancing model generalization and avoiding overfitting (Tan et al., 2018). CNNs have benefited greatly from the rapid development of large labeled data sets, such as ImageNet, which offer high-quality training images at an unprecedented scale (1.3 million training images, 50,000 validation images and 100,000 test images spanning 1,000 classes), allowing generic features learned to be used in complex classifications of presumably disparate data sets (Huh et al., 2016; Ridnik et al., 2021; Yamashita et al., 2018).
Semantic image segmentation using CNN-based networks entails pixel-level identification, classification, and labeling. Applications of deep learning networks for semantic segmentation of floods in remote sensing images have mostly been demonstrated on fully convolutional networks (FCNs) and, to a lesser extent, encoder-decoder architectures. Hashemi-Beni and Gebrehiwot (2021) utilized FCN-8s to generate binary classification maps of flood-inundated areas. Gebrehiwot et al. (2019) applied a CNN-based network (FCN-16s) to extract flooded regions from Unmanned Aerial Vehicle (UAV) imagery. Basnyat et al. (2021) utilized a modified version of the U-Net architecture for binary segmentation tasks in their flood detection system, while Girisha et al. (2019) successfully utilized both FCN-32s and U-Net for semantic segmentation of UAV videos within an urban zone.
Transformers, a class of neural network architecture originally built to solve sequence-to-sequence problems in natural language processing (e.g., the transformer-based chatbot ChatGPT [Generative Pre-trained Transformer] (Y. Liu et al., 2023)), have now been adapted as a complement to CNNs for semantic segmentation of remotely sensed imagery, attaining state-of-the-art performance (see, e.g., Gu et al., 2022; Z. Xu et al., 2021; Zhang et al., 2022). Transformers, unlike CNNs, rely on "self-attention" mechanisms, which allow them to extract and use information from arbitrarily large contexts of the input data (e.g., pixels in an image) simultaneously. This enables the network to capture long-range dependencies and consider global context, making it well suited for understanding complex patterns and relationships within the data. CNNs, on the other hand, can only exploit local information due to their small convolutional kernel sizes. However, it is worth noting that, due to the quadratic complexity of self-attention, Transformers can be computationally more expensive than CNNs, especially for large images, and do not generalize well when trained on insufficient amounts of data. Consequently, hybrid architectures that combine the strengths of transformers and CNNs to achieve a balance between global context modeling and computational efficiency for semantic segmentation tasks have been proposed (e.g., A. He et al., 2023; Q. He et al., 2023; Z. Zhou et al., 2023). Although the adoption of these architectures is still evolving, their robustness in understanding complex scenes that exhibit large variations within the same class and subtle differences between different classes, such as in video remote sensing, is yet to be explored.
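The quadratic cost and global context of self-attention mentioned above can be made concrete with a minimal NumPy sketch. This is a single, unparameterized attention step for illustration only; real transformer layers add learned query/key/value projections, multiple heads, and positional encodings.

```python
import numpy as np

def self_attention(X):
    """Minimal single-head self-attention sketch (no learned weights)."""
    # Pairwise similarity of every token with every other token: an
    # (N, N) matrix, which is where the quadratic cost in N comes from.
    scores = X @ X.T / np.sqrt(X.shape[1])
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output row mixes information from ALL input rows: global context.
    return weights @ X

# 64 "pixels" with 8 features each; doubling N quadruples the score matrix.
X = np.random.default_rng(0).normal(size=(64, 8))
out = self_attention(X)
```

Because every output element depends on all N inputs, a pixel's representation can draw on distant regions of the image, at the price of an N × N weight matrix.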
Traditional methods for measurement of instantaneous water flow velocity, including impellor-type current meters, electromagnetic flow meters, acoustic Doppler velocimeters and acoustic Doppler current profilers (aDcp), remain limited during flood conditions due to logistical challenges such as inaccessibility of flooded areas, flow turbulence, as well as limitations in instrument measurement ranges. Image velocimetry, a non-contact method used to measure fluid flow velocities by analyzing images of flow patterns, such as those from video, has gained traction as a method of collecting river velocity data (e.g., Dal Sasso et al., 2021; Pearce et al., 2020). Large Scale Particle Image Velocimetry (LSPIV) is a frequently adopted technique for water-surface velocity analysis and relies on tracking the motion of appropriate artificial or natural "seeding" particles, such as bubbles or debris, between successive images in a time series. LSPIV based on UAV video has been demonstrated in several studies (e.g., C. Chen et al., 2021; Lewis et al., 2018; W.-C. Liu et al., 2021). Legleiter and Kinzel (2021) successfully derived surface river flow velocities from satellite video to within 8.65% of independent radar gage-based measurements, building upon the work of Kääb et al. (2019), who utilized near-simultaneous satellite still images (acquired with a nominal time lapse of ∼90 s between each other) to estimate river surface velocities in the Yukon River, Alaska. Depending on the video acquisition frame rate, video sequences of between 8 and 30 s are sufficient for image velocimetry over large spatial extents in rapid fashion (Legleiter & Kinzel, 2020; Pearce et al., 2020; Strelnikova et al., 2023). The next step in the domain of satellite video-based image velocimetry is the deployment of these velocity estimates in an operationally useful manner: the validation of hydraulic model predictions.
The aim of this investigation is to present the first attempt at validating flood model simulations using satellite video observations. The principal contributions of this work are threefold. First, we leverage the temporal richness of satellite video to fine-tune a hybrid CNN-Transformer network for semantic segmentation of flood extents. Specifically, we evaluate the accuracy of six variants of a transformer encoder-CNN decoder architecture in segmenting our test video image. Second, we utilize optical flow techniques to analyze the motion of naturally occurring features between consecutive video frames to estimate the velocity of flow in our study reach. Finally, we utilize these flood extents and velocities to validate 2D flood model outputs, explicitly accounting for uncertainty in our 2D modeling. As non-intrusive optical flow measurement and deep-learning based techniques for delineation of flood pixels in near-real time evolve, and further constellations are established, satellite video presents a potential opportunity to exploit a temporally rich data source, capable of providing data to comprehensively validate flood model predictions. Though limited to cloud-free acquisitions, the unmatched temporal resolution of satellite video can compensate for the limitation of optical atmospheric imaging windows.

Overview
The investigation consisted of four stages (Figure 1). In Stage 1, Jilin-1 satellite video-based flood extents are derived using a state-of-the-art deep-learning network. We adopt a hybrid architecture consisting of a transformer-based encoder and a CNN-based decoder, which allows us to integrate the strengths of both transformers and CNNs: whilst CNNs excel at local feature extraction, transformers are stronger at capturing global context and long-range dependencies. Specifically, we adopt SegFormer (Xie et al., 2021) as the encoder, a lightweight yet efficient and powerful model that has attained state-of-the-art semantic segmentation performance on popular benchmarks such as the Cityscapes data set and ADE20K (Cordts et al., 2016; B. Zhou et al., 2017). We utilize a U-Net decoder, which allows the model to leverage multi-level information, reconstruct spatial details, and efficiently integrate low-level and high-level features to generate accurate and fine-grained segmentation masks. We evaluate the segmentation capabilities of six variants of the SegFormer encoder (B0-B5) and narrow down to one model series. We train the selected deep learning model, relying on transfer learning and Test-Time Augmentation (TTA) for improved flooded-class prediction accuracy. TTA involves the creation of augmented copies of a test data set, following which the deep-learning network returns ensemble predictions of class labels that are averaged to produce segmentation maps.
In Stage 2, we utilize LSPIV to compute river surface velocity vectors over two cloud-free subsets of our video (Figure 2). Stage 3 involves simulation of a recent flood event using a hydraulic model (Hydrologic Engineering Center-River Analysis System [HEC-RAS] version 6.0) driven by uncertain discharge estimates derived from Bayesian analysis of stage-discharge relations (see Section 2.3.2). This HEC-RAS 2D model is calibrated using stage height from the gauge at Tilpa (Figure 2). Stage 4 validates the HEC-RAS 2D model predictions using both the deep-learning segmented images from Stage 1 and the velocity vectors estimated in Stage 2.

Study Area
The study area covers a 6.5-km-long section of the middle Darling River at the Tilpa floodplain, located in north-western New South Wales (NSW), which is part of the Murray-Darling River basin in south-eastern Australia (Figure 2). A gauging station at Tilpa (Station number 425900) records watercourse discharge every 15 min from 1995 to present, for an upstream catchment area of 502,500 km². Extreme multi-day rainfall, caused by a series of deep low-pressure systems, resulted in intense storms and major flooding in eastern NSW from 22 February to 9 March 2022. Between 19 January and 11 February 2022, the Darling River at Tilpa was above the major flood level (11.5 m), with the flood wave peaking at 12.3 m on 31 January 2022 (Bureau of Meteorology, 2022). The town of Tilpa experienced extensive floodplain inundation.
The Darling River at Tilpa has a river style (Brierley et al., 2002) that is meandering, planform controlled, and features a discontinuous floodplain. Creeks are connected to the main channel at Tilpa and fill with water when the stage at Tilpa is high. This complex floodplain configuration demands high-resolution 2D flood inundation modeling. We calibrated our HEC-RAS 2D model using stage height observations from the gauging station at Tilpa (Figure 2) and validated our model simulations at two locations (A and B, Figure 2), geographical extents that adequately represent the complexity of the floodplain and enabled us to capture fine-scale flood hydraulics at a resolution sufficient to comprehensively assess inundation dynamics. Additionally, using these two locations, rather than the whole study area, overcomes the potential problem of jagged prediction patches, which are artifacts associated with the reconstruction of large deep learning-based prediction mosaics. This is a limitation when making flood pixel predictions using neural networks, which generally results in data loss around the borders of large image patches (Heller et al., 2018; Yuan et al., 2021).

Satellite Video
At present, two commercial companies operate constellations with satellite-based optical video sensors. Planet's SkySat offers satellite video products in panchromatic mode at a Ground Sampling Distance of 1.1 m and a frame rate of 30 frames per second (fps), with a duration between 30 and 120 s (Bhushan et al., 2021). Jilin-1 GF-03, launched by the Chang Guang Satellite Technology Company, is another high-resolution commercial satellite video constellation. Jilin-1 collects 0.5-1.2 m 4K high-definition color video at frame rates of up to 10 fps with a revisit capability of up to six times a day globally (Y. Chen et al., 2022; European Space Agency, 2022). Currently, both platforms are limited to 30-120 s duration video captures. For our study, satellite video (Table 1) was acquired on 5 February 2022 at 23:12 UTC by Jilin-1 GF-03. The video has a spatial resolution of 1.22 m and was acquired at five frames per second for a duration of 28 s, yielding 140 frames, with each frame measuring 12,000 × 5,000 pixels.

Rating Curve Uncertainty Estimation
The influence of rating curve uncertainties on streamflow time series estimates is particularly pronounced in natural river systems, especially during floods, when a rating curve is typically extrapolated beyond the maximum gauging in the rating, resulting in significant systematic errors (Horner et al., 2018). The propagation of errors from flood model forcing data (i.e., streamflow) to eventual model outputs warrants explicit consideration of uncertainties associated with the rating curve. Kiang et al. (2018) investigated different techniques for estimating stage-discharge rating curve uncertainty, concluding that the choice of method is fully dependent upon the constraints of the specific application. Bayesian inference has, however, been suggested as a robust technique to handle independent gauging errors and provide precise discharge series uncertainty envelopes (Ocio et al., 2017) and was adopted here.
Rating curve uncertainty in this investigation is assessed using the BaRatin (Bayesian Rating Curve) method (Le Coz et al., 2014), which combines uncertain gaugings and prior hydraulic knowledge to derive uncertain stage-discharge relations. The BaRatin framework defines stage and discharge measurement uncertainties as Gaussian distributions with a mean of zero and is composed of three main components: (a) a measurement error model, consisting of prior estimates of parameters based on preliminary hydraulic analysis of a gauging station; (b) posterior rating curves, which are derived from a simulation consisting of gauging data; and (c) the application of Markov Chain Monte Carlo and Bayesian inference to sample the posterior distribution of the rating curve parameters, relying on information contained in observed gaugings. The eventual rating curve equation is based on a matrix of hydraulic controls that relates discharge Q to stage h using power functions:

Q(h) = Σ_(r=1)^(N_segment) 1_([k_r, k_(r+1)])(h) Σ_(j=1)^(N_control) M(r,j) a_j (h − b_j)^(c_j)

In the above equation, M(r,j) is the matrix of controls, and the notation 1_I(h) denotes a function equal to 1 if h is included in the interval I, and 0 otherwise. Segments in the rating curve (N_segment) are user defined, while segment limits k_r, coefficients a_j, offsets b_j and exponents c_j are inferred.
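The piecewise power-function form of the rating curve can be sketched in a few lines of Python. This is a deliberately simplified illustration, with one active hydraulic control per stage segment (the full BaRatin formulation sums several controls via the matrix M(r,j)), and the segment parameters below are invented for the example, not the fitted Tilpa controls.

```python
def rating_curve(h, segments):
    """Hedged sketch of a piecewise power-law stage-discharge relation.

    Each tuple (k_lo, k_hi, a, b, c) describes one stage segment: within
    [k_lo, k_hi) -- the indicator 1_I(h) -- the active control contributes
    a * (h - b)**c to discharge Q (m3/s).
    """
    for k_lo, k_hi, a, b, c in segments:
        if k_lo <= h < k_hi:
            return a * max(h - b, 0.0) ** c
    return 0.0

# Two hypothetical controls: an in-channel control, then a floodplain
# control engaging above 8 m (illustrative values only).
segments = [(0.0, 8.0, 30.0, 0.0, 1.7), (8.0, 20.0, 55.0, 2.0, 1.6)]
q_low = rating_curve(5.0, segments)     # stage within the low-flow control
q_peak = rating_curve(12.3, segments)   # stage at the 31 January 2022 peak
```

In BaRatin the parameters k_r, a_j, b_j, c_j are not fixed numbers but posterior samples, so evaluating the curve over many samples yields the discharge uncertainty envelopes used to force the hydraulic model.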
For our flood model simulations, we forced our hydraulic model using streamflow time series (Figure 3) based on two scenarios: (a) Q_obs, observed discharge drawn directly from the Tilpa gauging station; and (b) Q_maxpost, discharge computed from measured stage and the MaxPost rating curve (a rating curve that maximizes the posterior distribution of a set of parameters inferred from Bayes' theorem; see Kiang et al., 2018 for detail). The Q_maxpost streamflow time series is based on uncertain rating curves (with 95% confidence bounds), computed by Bayesian analysis of prior hydraulic controls and gaugings with individual uncertainties.

Transformer-Based Encoder
Transformer-based models were initially designed for Natural Language Processing tasks and excelled over CNNs and Recurrent Neural Network models (e.g., Long Short-Term Memory networks, which process sequence elements recursively and can only attend to short-term context) thanks to their "self-attention" mechanism. In the context of semantic segmentation, the self-attention mechanism helps the model understand the relationships between different spatial locations (pixels) in an image. By using self-attention, Transformers can capture long-range dependencies and understand global context, which is essential for accurate semantic segmentation, where the label of a pixel may depend on distant regions in the image. Although the field of deep learning is evolving rapidly, the most successful transformer architectures adapted for semantic segmentation tasks include the Vision Transformer (ViT) (Dosovitskiy et al., 2021), the Swin Transformer (Z. Liu et al., 2021) and SegFormer (Xie et al., 2021). Here we leverage SegFormer, which uses a hierarchical Transformer architecture (called "Mix Transformer") as its encoder and a lightweight decoder for segmentation. SegFormer's encoder leverages tokenization, self-attention, and hierarchical aggregation to efficiently capture important visual information from input images, making it well suited for semantic segmentation tasks. Here, we fine-tune six variants of SegFormer (B0-B5) with increasing model sizes, offering improved performance at the cost of increased computational requirements.

Convolutional Neural Network (CNN)-Based Decoder
Convolutional neural networks (CNNs) are a widely used architecture in deep learning, initially proposed by Fukushima (1980) and refined by Lecun and Bengio (1995). CNNs, a class of artificial neural networks, are mostly defined as a series of layers, with the initial layers performing feature learning and the final layers performing classification. CNNs consist of three types of layers: (a) convolutional layers, which are the first layers to extract features from an input image and whose outputs (feature maps) are then passed on to sequential layers; (b) pooling layers, which take feature maps as inputs and progressively reduce the spatial size of the feature maps, controlling model overfitting; and (c) an activation function applied to the outputs of the CNN that enables the model to capture non-linear behavior in the input data (Hosseiny, 2021).
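The three layer types above can be illustrated with a toy NumPy example: a hand-written convolution, a ReLU activation, and a max-pooling step. This is a conceptual sketch, not how deep-learning frameworks implement these operations (they use optimized, batched, learnable versions).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image, producing a
    feature map that responds to local patterns (the convolutional layer)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max pooling: halve the spatial size, keeping the strongest response
    per window, which reduces parameters and helps control overfitting."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6 x 6 "image"
edges = conv2d(image, np.array([[-1.0, 1.0]]))     # horizontal-gradient detector
pooled = max_pool(np.maximum(edges, 0))            # ReLU activation, then 2 x 2 pool
```

Stacking many such convolution/activation/pooling stages is what lets a CNN progress from edge-like features to task-specific ones such as water boundaries.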
Whilst a plethora of deep-learning CNNs exist, the most prominent architectures that have attained state-of-the-art performance in semantic segmentation of remote sensing images are U-Net (Ronneberger et al., 2015), Google's DeepLab (L.-C. Chen et al., 2016) and PSPNet (Zhao et al., 2017). In a comparative study of semantic segmentation of remote sensing images using the three aforementioned models, J. Hu et al. (2019) reported U-Net achieving the best accuracy. A more recent investigation by Y. Sun et al. (2022), who similarly carried out an intercomparison of the three CNN networks, found U-Net outperforming the other two models in the segmentation of remotely sensed images.
Originally developed for biomedical image segmentation, U-Net (and its variants) has gained prominence in diverse fields for its ability to leverage data augmentation to efficiently learn from a small number of annotated images. U-Net's fairly simple architecture, consisting of a downsampling (encoder) and an upsampling (decoder) path, allows for precise localization of segmented pixels. The decoder is a crucial component that contributes to the strength of the U-Net architecture in semantic segmentation tasks: it is responsible for upsampling the low-resolution feature maps from the encoder to the original input image resolution while also fusing multi-level features to create the final segmentation maps. Thanks to the U-Net decoder's low memory requirements, as well as its ability to be trained end-to-end (meaning both the encoder and decoder are learned jointly during the training process), we deployed a U-Net decoder for our segmentation tasks.

Neural Network Training
Training deep neural networks typically requires a vast amount of data, which may not always be available. Here, rather than train our model from scratch, we fine-tuned our encoder to improve its segmentation capabilities on our data set. Fine-tuning entails taking a pre-trained model's encoder (in our case, SegFormer models B0-B5) and further training it on a specific data set, the objective being to adapt the learned features to perform well on a new task. For instance, several studies have utilized pre-trained models, such as VGG16 or ResNets, which have learned to recognize general image features from ImageNet, then fine-tuned the models on much smaller data sets for specific image segmentation tasks (see, e.g., Hashemi-Beni & Gebrehiwot, 2020; Tong et al., 2020). We leverage the temporal richness of video to extract images used for fine-tuning our model. Although our video was acquired at a native frame rate of 5 Hz (resulting in 140 images in sequence), we subsample our video at a much lower frame rate of 1 Hz by retaining only every fifth frame from the original 5 Hz series, resulting in a sequence consisting of 28 images. Skipping frames enabled us to decouple temporal information in our images. When frames are skipped, the information captured in one frame is temporally further apart from the information in the skipped frames. As a result, the sequential order of frames is disrupted, and the temporal information becomes less tightly coupled, reducing the strength of autocorrelation. Also, because there was minimal movement or low temporal variation in consecutive satellite video scenes, skipping frames omits redundant or near-duplicate information, which reduces autocorrelation since repetitive patterns are less prominent. To ensure that all our extracted frames were aligned and free from any motion artifacts resulting from the satellite platform's vibrations, we aligned our images in TrakEM2 (Cardona et al.,
2012). At the training stage, the full-sized satellite images could not be loaded onto the network due to memory limitations, a common challenge faced when training deep learning models, which require extensive memory to store input images, weight parameters and activations as images are channeled through the network. Therefore, the original images were split into patches of 256 × 256 × 3. To artificially diversify and increase the size of our training data set, we deployed data augmentation. Data augmentation is a technique that reduces generalization errors (model overfitting) by adding a range of deformations and noise to the training data. To implement data augmentation, we used the Albumentations library (Buslaev et al., 2020) in Python, where we applied vertical and horizontal flips, transposition, grid distortions, elastic transforms and random gamma to both the images and the corresponding binary masks (Figure 4). A total of 5,660, 1,698, and 566 image patches (256 × 256 × 3) were used for training, validating and testing the network, respectively.
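The patch-splitting step can be sketched with plain NumPy. This is an illustrative re-implementation of what libraries such as Patchify do, shown on a small synthetic frame; the study's full Jilin-1 frames measure 12,000 × 5,000 pixels.

```python
import numpy as np

def to_patches(image, size=256):
    """Split an (H, W, C) image into non-overlapping size x size patches,
    discarding any partial border tiles (where border data loss occurs)."""
    h, w, c = image.shape
    rows, cols = h // size, w // size
    cropped = image[:rows * size, :cols * size]
    return (cropped.reshape(rows, size, cols, size, c)
                   .swapaxes(1, 2)
                   .reshape(rows * cols, size, size, c))

# Small synthetic RGB frame standing in for a satellite video frame.
frame = np.zeros((2560, 1280, 3), dtype=np.uint8)
patches = to_patches(frame)   # 10 x 5 = 50 patches of 256 x 256 x 3
```

Each patch (with its mask) then becomes one training sample; the augmentation transforms are applied identically to a patch and its binary mask so labels stay aligned.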
To initiate training, pre-trained SegFormer weights were imported and hyper-parameters set as follows: we used a batch size of eight and a learning rate of 1 × 10⁻⁴, with a callback to reduce the learning rate by a factor of 2 × 10⁻⁴ when the validation metric (Intersection-Over-Union/Jaccard Index) stopped improving. The deep learning model was implemented within the PyTorch deep-learning framework and trained on an NVIDIA GeForce RTX 2070 Super Graphical Processing Unit for 150 epochs. The trained network learned to associate images and masks and make predictions on a test image that was independent of the 28 frames used in the original training/validation/testing pipeline.
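The "reduce the learning rate when the validation metric stops improving" callback can be expressed as a few lines of pure Python. This is a conceptual stand-in for framework schedulers such as PyTorch's ReduceLROnPlateau; the factor and patience values below are illustrative, not the study's exact settings.

```python
class ReduceOnPlateau:
    """Sketch of a plateau-based learning-rate callback: halve the learning
    rate after `patience` epochs with no improvement in validation IoU."""

    def __init__(self, lr=1e-4, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_iou):
        if val_iou > self.best:          # metric improved: reset the counter
            self.best, self.bad_epochs = val_iou, 0
        else:                            # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor   # drop the learning rate
                self.bad_epochs = 0
        return self.lr

sched = ReduceOnPlateau()
for iou in [0.60, 0.65, 0.66, 0.66, 0.66, 0.66]:   # IoU plateaus after epoch 3
    lr = sched.step(iou)
```

After three epochs without an IoU gain, the sketched scheduler halves the learning rate from 1e-4 to 5e-5, letting training take smaller steps near a minimum.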

Neural Network Evaluation Metrics
Model segmentation performance was assessed using two metrics: Binary Cross Entropy loss (L_BCE) and Intersection over Union (IoU; also known as the Jaccard Index).
L_BCE compares model-predicted probabilities to the ground truth labels (see Section 2.3.1), which can either be 0 or 1. It then computes a score that penalizes the probabilities based on their distance from the expected values. L_BCE calculates the difference between the actual and predicted probability distributions for predicting class 1 (Jadon, 2020). The score is minimized, and a perfect value is zero:

L_BCE = −(Y log(Ŷ) + (1 − Y) log(1 − Ŷ))

where Y denotes the ground truth label while Ŷ is the predicted probability of the classifier.
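A small worked example of the per-pixel loss, averaged over a handful of pixels, makes the behavior concrete: confident, correct probabilities score near zero, while probabilities far from the 0/1 labels are penalized heavily.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over pixels; 0 for a perfect prediction."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Four pixels, labels 1 = flooded, 0 = non-flooded (toy values).
near_perfect = bce([1, 0, 1, 1], [0.99, 0.01, 0.98, 0.97])  # small loss
poor = bce([1, 0, 1, 1], [0.40, 0.70, 0.30, 0.20])          # large loss
```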
IoU is a simple ratio that measures the pixel-wise overlap between the ground truth and predicted regions (Bebis et al., 2016):

IoU = |A ∩ B| / |A ∪ B|

where A and B denote the ground truth and the predicted segmentation maps, respectively. IoU ranges between 0 and 1: if the model prediction is perfect, IoU = 1, and the lower the IoU, the worse the predicted result.
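For binary masks the ratio reduces to counting pixels, as in this minimal example:

```python
def iou(mask_a, mask_b):
    """Intersection over Union for two binary masks (flattened 0/1 lists)."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a == 1 or b == 1)
    return inter / union if union else 1.0  # two empty masks agree perfectly

truth = [1, 1, 0, 0, 1, 0]   # ground-truth flooded pixels
pred  = [1, 0, 0, 0, 1, 1]   # predicted flooded pixels
score = iou(truth, pred)     # intersection = 2, union = 4
```

Because IoU only counts flooded pixels, it is insensitive to the large non-flooded background, which is why it suits class-imbalanced flood masks better than plain accuracy.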
Since our binary segmentation task involved a data set with an inherent class imbalance, we utilize L_BCE and IoU as our loss functions during model training. Although we report other metrics, including recall, precision and the F1 score (see Text S1 in Supporting Information S1 for a description of these metrics), we do not use them as loss functions during training, as these metrics would strongly bias our results toward the class that occupies a large portion of our images (flood imagery has disproportionately few pixels per image identified as flooded).

Test-Time Augmentation (TTA)
Although data augmentation has been widely used when training deep learning networks, a less common way of improving semantic segmentation prediction accuracy is TTA (see, e.g., Gonzalo-Martín et al., 2021; J. Liu et al., 2022; Moshkov et al., 2020; Wieland et al., 2023). TTA involves applying augmentation transformations to the test image to create an ensemble of predictions, which are then averaged to improve prediction results. Assuming x to be our input image and τ a transformation operation, and choosing T = {τ_1, τ_2, …, τ_|T|} as a set of augmentations applied at model inference time, we can formulate TTA as:

ŷ = (1/|T|) Σ_(i=1)^(|T|) τ_i⁻¹(Θ_target(τ_i(x)))

where Θ_target is the neural network trained on our target data set (satellite video frames) and τ_i⁻¹ reverts each transformation before the ensemble is averaged.
In our application, we supplied the network with a full-sized satellite video scene, which was split into 256 × 256 × 3 patches without overlap. The network then made predictions on multiple transformations of the image patches, creating an ensemble of predictions. The transformations were then reverted, a process known as dis-augmentation, following which predictions were averaged, thresholded, then stitched back to the original full-size image using the Patchify library in Python. Image parsing, which assigns semantic class labels (in this case, flooded versus non-flooded pixels), for our trained deep learning network returned probability maps, which were thresholded to binarize final pixel classes. Assuming a base (true) segmentation probability map is returned by the network during testing, binary segmentation is done by thresholding as follows (Hong et al., 2021):

P = δ(p(y); τ), where δ(p; τ) = 1 if p ≥ τ, and 0 otherwise

where P ∈ {0,1}^(W×H) is a predicted segmentation label and p(y) is the true probability map of the potential segmentation labels y ∈ {0,1}^(W×H) for an image with height H and width W. δ is the thresholding function with a threshold τ. We set a fixed τ value of 0.5 for our model training and testing.
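The transform/predict/dis-augment/average/threshold loop can be sketched with NumPy. The `model` below is a dummy stand-in (a brightness-based probability map) used only to make the pipeline runnable; the transform set here is a minimal flip ensemble, not the study's exact augmentation list.

```python
import numpy as np

def tta_predict(model, image, threshold=0.5):
    """TTA sketch: predict on transformed copies, revert each transform
    ('dis-augmentation'), average the probability maps, then threshold."""
    transforms = [
        (lambda x: x,          lambda x: x),           # identity
        (lambda x: x[::-1, :], lambda x: x[::-1, :]),  # vertical flip + revert
        (lambda x: x[:, ::-1], lambda x: x[:, ::-1]),  # horizontal flip + revert
    ]
    probs = [invert(model(apply(image))) for apply, invert in transforms]
    mean_prob = np.mean(probs, axis=0)                 # ensemble average
    return (mean_prob >= threshold).astype(np.uint8)   # binary flood mask

# Dummy "network": probability of flood rises with pixel brightness.
model = lambda img: img / img.max()
image = np.array([[10.0, 200.0], [180.0, 20.0]])       # tiny 2 x 2 "patch"
mask = tta_predict(model, image)
```

With a real network the three probability maps would differ slightly, and averaging them smooths out prediction noise before the 0.5 threshold is applied.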

LSPIV Velocity Estimation
Whilst a variety of image velocimetry algorithms have been deployed for river surface velocity estimation (see Perks et al., 2020 for a summary), we rely on the frequently used PIVlab algorithm (Thielicke & Sonntag, 2021; Thielicke & Stamhuis, 2014) for our LSPIV analysis. Our general LSPIV workflow is depicted in Figure 1. For ease of processing, we crop our original video, focusing on reaches A and B (Figure 2). We sub-sample our videos to lower frame rates of 1, 0.5 and 0.25 Hz by retaining only every fifth, tenth, and twentieth frame respectively from the original 5 Hz series. By lowering the frame rate, the time interval between consecutive images is increased. As a result, tracked features move farther between frames, leading to larger displacements, which translate to larger feature sizes in the PIV analysis and make it easier for the PIV algorithm to accurately track the features. We then stabilize all our images in TrakEM2 to counter any residual motion effects from the satellite platform.
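The sub-sampling step amounts to simple stride-based frame selection; a minimal sketch (our own helper, assuming frames are stored as a sequence):

```python
def subsample(frames, original_hz=5.0, target_hz=1.0):
    """Keep every k-th frame so the effective frame rate drops to target_hz.

    E.g., 5 Hz -> 1 Hz keeps every 5th frame; 5 Hz -> 0.25 Hz keeps every 20th.
    """
    step = int(round(original_hz / target_hz))
    return frames[::step]
```

Lower target rates lengthen the inter-frame interval, so a tracer moving at a given speed covers proportionally more pixels between retained frames.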
We import and process our images in PIVlab. To optimize the quality of our velocimetry results, we pre-process the images by applying contrast limited adaptive histogram equalization (Pizer, 1990; Yadav et al., 2014), which enhances contrast and improves the visibility of details in an image. We also applied an intensity high-pass filter, which removed low-frequency background noise and enhanced the visibility of flow features in the images, making it easier for the PIV algorithm to accurately detect and track them (Thielicke & Stamhuis, 2014). A crucial parameter input to the PIVlab algorithm is the interrogation area (IA), usually measured in pixels. Whilst a large IA may improve the accuracy of velocity measurements, it can also lead to a loss of spatial resolution.
Conversely, a small IA may provide higher spatial resolution but could also be sensitive to noise or errors in particle tracking. A step size, usually 50% of the IA, determines the spacing of PIV output vectors. Here, we utilized an IA of 128 pixels with reducing sizes over 4 passes (i.e., 128-64-32-8 pixels). For our PIV analysis, we utilized the FFT window deformation (direct Fourier transform correlation with multiple passes and deforming windows) cross-correlation algorithm (for details, see Astarita, 2008; Thielicke & Sonntag, 2021; Thielicke & Stamhuis, 2014). We post-processed our computed velocities to remove spurious velocity vectors caused by poor particle tracking, image artifacts, or other issues. Our post-processing specifically entailed a standard deviation filter, used to remove noisy or erroneous velocity vectors from the calculated velocity field based on their standard deviation values (set to PIVlab's default value of 8 in our study). Velocity vectors were georeferenced within PIVlab from an image coordinate system back into a projected coordinate reference system (GDA 1994 MGA Zone 55) and exported to ArcGIS Pro for analysis.
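The standard deviation filter is applied internally by PIVlab; the NumPy sketch below only illustrates the principle (reject vectors lying more than n standard deviations from the field mean), not PIVlab's exact implementation:

```python
import numpy as np

def stddev_filter(u, v, n_std=8.0):
    """Mask velocity vectors outside mean ± n_std·std per component
    (PIVlab-style post-processing; default n_std mirrors the value of 8
    used in the study). Rejected vectors are set to NaN."""
    valid = np.ones(u.shape, dtype=bool)
    for comp in (u, v):
        mu, sigma = np.nanmean(comp), np.nanstd(comp)
        valid &= np.abs(comp - mu) <= n_std * sigma
    u_f, v_f = u.astype(float).copy(), v.astype(float).copy()
    u_f[~valid] = np.nan  # spurious vectors masked in both components
    v_f[~valid] = np.nan
    return u_f, v_f
```

A large threshold such as 8 standard deviations removes only gross outliers (e.g., correlation failures) while leaving genuine velocity variability untouched.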

Flood Model
Two-dimensional (2D) flood flow modeling was accomplished using HEC-RAS version 6.0, building on previous flood modeling studies using this code (e.g., Mokhtar et al., 2018; Navarro-Hernández et al., 2023; Pradhan et al., 2022). The model domain was 4.4 × 7.7 km. Topography was defined from a 1 m resolution LiDAR-based bare-earth digital surface model acquired when the river channel was dry (Geoscience Australia, 2022). A heterogeneous 2D computational mesh was generated using a cell size of 30 × 30 m within the floodplain and 2 × 2 m between channel banks (Figure 2). Model upstream inflow boundary conditions were set using 15-min interval gauged streamflow data from the gauge located at Tilpa. We simulated two scenarios with different upstream boundary forcing: model M_obs, based on observed discharge data; and model M_qmaxpost, using uncertain streamflow estimates (see Section 2.3.2.1). An energy slope value was used as the outflow boundary condition; the gradient, equivalent to the normal depth, was estimated by computing the bed slope along the terrain profile. Unsteady 2D flow simulations were conducted from 00:00 4 February to 23:59 6 February 2022 using HEC-RAS's diffusive wave equations, since the flood wave was not highly dynamic (Brunner et al., 2020; Yalcin, 2020).
Model calibration data were available in the form of stage height from the gauge at Tilpa (Figure 2). 2D model runs (see Section 2.3.2.1) were calibrated by adjusting a spatially discretized Manning's roughness coefficient n over a parameter space between 0.025 and 0.033, in 0.002 increments, until model simulations closely matched observed stage height data. Modeled stage height accuracy was assessed using two commonly used metrics (e.g., Alghafli et al., 2023; Moriasi et al., 2007): the percent bias (PBIAS) and root-mean-square error (RMSE).
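A short sketch of the two calibration metrics, following the standard definitions of Moriasi et al. (2007) (function names are ours):

```python
import numpy as np

def rmse(obs, sim):
    """Root-mean-square error between observed and simulated stage."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((sim - obs) ** 2)))

def pbias(obs, sim):
    """Percent bias, PBIAS = 100 * sum(obs - sim) / sum(obs).
    Under this common convention, positive values indicate model
    underestimation and negative values overestimation."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(100.0 * np.sum(obs - sim) / np.sum(obs))
```

Both metrics would be recomputed for each candidate Manning's n until the simulated stage series best matches the gauge record.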

Analysis: Flood Model Validation Using Observed Extents
We validated models M_obs and M_qmaxpost against flood masks derived from the observed satellite video frames. HEC-RAS modeled and satellite video-derived flood extents were converted into binary masks (wet/dry) representing only flood extent, then overlapped using ArcGIS Pro software. Binary masks derived from the satellite video frames were resampled to 1 m using the nearest neighbor method, both to preserve the detail of the model output's higher resolution (particularly close to the floodplain, where model performance matters most due to the occurrence of complex inundation dynamics) and because validation performance needed to be compared at the same spatial resolution. During resampling, binary pixel interpolation did not yield new additional values, avoiding false accuracy errors. Since we were resampling a binary raster with only one of two values (0 or 1), each new pixel in the target 1 m resolution raster was assigned the value of the nearest corresponding pixel, meaning the extents of the binary raster remained unchanged because no new information was introduced. The overlap between the modeled inundation boundaries and the observed satellite video extents was calculated according to the number of pixels showing model agreement, overprediction and underprediction, as per the states in Table 2. Data from these computations were then used to calculate performance scores.
To analyze model skill in reproducing the observed flood inundation extent, we rely on well-established spatial performance measures which account for the most critical attributes of model simulation precision: model bias (E), hit rate (HR, the proportion of the observed flood event that was successfully predicted by the model), the false alarm ratio (F), and a rigorous composite measure, the Critical Success Index (CSI), which penalizes model overprediction (Bates et al., 2021; Bernhofen et al., 2018; Wing et al., 2017).
Error bias (E) indicates whether the model is biased toward overprediction or underprediction:

E = (F_m − F_o)/F_o

where F_m is the total modeled flood extent and F_o the total observed flood extent. A bias score of 0 indicates an unbiased model, while positive and negative scores indicate a tendency toward overprediction and underprediction respectively.
The HR measures the proportion of the observed flood that was simulated by the model, ignoring whether the observed flood extents were exceeded:

HR = (F_m ∩ F_o)/F_o

where F_o is the total observed flood extent. The HR ranges from 1 (entire flood captured) to 0.
The false alarm ratio (F) measures whether the model has a tendency to overpredict flood extent and can range from 0 (no false alarms) to 1 (all false alarms):

F = (F_m − (F_m ∩ F_o))/F_m

The CSI, a measure of the accuracy of a flood model, accounts for both overprediction and underprediction and can range from 0 (no match between modeled and observed data) to 1 (perfect match between modeled and

Table 2 Confusion Matrix of Cell Descriptors in Binary Classification of Flood Masks
observed data). CSI omits regions that are dry in both predictions and observations, since the flood model can accurately forecast these:

CSI = (F_m ∩ F_o)/(F_m ∪ F_o)

where F_m ∩ F_o is the intersection of the modeled and observed flood extents, or the number of correct predictions, and F_m ∪ F_o is the union of the modeled and observed extents.
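All four scores can be computed directly from pairs of binary masks. The NumPy sketch below is our own illustration; the bias form E = (F_m − F_o)/F_o is one common definition consistent with the description above (0 = unbiased, positive = overprediction), though other formulations exist in the literature:

```python
import numpy as np

def flood_metrics(modeled, observed):
    """Spatial skill scores from binary (wet=1 / dry=0) flood masks."""
    m, o = modeled.astype(bool), observed.astype(bool)
    hits = np.sum(m & o)           # wet in both: F_m ∩ F_o
    misses = np.sum(~m & o)        # observed wet, modeled dry (underprediction)
    false_alarms = np.sum(m & ~o)  # modeled wet, observed dry (overprediction)
    F_m, F_o = hits + false_alarms, hits + misses
    union = hits + misses + false_alarms  # F_m ∪ F_o
    return {
        "E":   (F_m - F_o) / F_o,    # bias: 0 = unbiased, >0 = overprediction
        "HR":  hits / F_o,           # hit rate: 1 = entire flood captured
        "F":   false_alarms / F_m,   # false alarm ratio: 0 = no false alarms
        "CSI": hits / union,         # critical success index
    }
```

Cells that are dry in both masks never enter these ratios, mirroring CSI's exclusion of correctly predicted dry areas.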

Analysis: Flood Model Validation Using Observed Velocities
Whereas the HEC-RAS 2D model estimates depth-averaged water velocities, satellite video-based LSPIV computations yield surface river velocities. For a like-for-like comparison, we depth-average our LSPIV velocities. A well-established technique for converting surface velocities into depth-averaged velocities is to utilize published values of depth-averaging constants, also referred to as alpha coefficients, α (Biggs et al., 2021; Creutin et al., 2003). Rantz (1982) proposed α values of between ∼0.85 and 0.86 for natural channels. Hauet et al. (2018) recommend α values of 0.8 for water depths of less than 2 m and 0.9 for greater water depths, with an uncertainty of ±15% at the 90% confidence level within natural rivers. Vigoureux et al. (2022) suggested α values of between 0.85 and 1.2 for their LSPIV analysis and experimented with values between 0.8 and 1.0 for their depth-averaging constants. To quantify the impact of the choice of α coefficient on our estimated velocities, we utilize α coefficients of between 0.7 and 1.
We compare HEC-RAS model predictions with LSPIV velocities via linear regression, fitting a linear trend line and using R² and slope as indicators of model performance. The coefficient of determination, R², between HEC-RAS 2D model velocities and LSPIV velocities remains unaffected when the LSPIV velocities are adjusted by varied α values. However, other statistical indicators are sensitive to the choice of α coefficient, including the mean velocity difference between model and LSPIV-based velocities (with an optimum difference of 0 m s⁻¹), which we use to evaluate the accuracy of our validation.
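The insensitivity of R² to α, together with the sensitivity of the mean difference, can be illustrated with a small NumPy sketch (our own helper; depth averaging is simply v_depth = α · v_surface):

```python
import numpy as np

def alpha_sensitivity(model_v, surface_v, alphas=(0.7, 0.8, 0.9, 1.0)):
    """R² between model and LSPIV velocities, plus the mean difference
    model - alpha * surface for each candidate alpha. The correlation is
    invariant to the positive scaling; the mean difference is not."""
    model_v = np.asarray(model_v, float)
    surface_v = np.asarray(surface_v, float)
    r = np.corrcoef(model_v, surface_v)[0, 1]
    diffs = {a: float(np.mean(model_v - a * surface_v)) for a in alphas}
    return r ** 2, diffs
```

Because Pearson correlation is unchanged by multiplying one series by a positive constant, R² alone cannot discriminate between α values; the bias-type indicators must do that.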

Segmentation Accuracy
We evaluated the segmentation performance of six variants of our model based on SegFormer encoders (B0-B5) of increasing sizes, all pre-trained on ImageNet, to select an architecture that robustly segments flood pixels from a satellite video test image set.We retained the same hyperparameters of learning rate, batch size and number of epochs for all models during the testing phase to attain a fair comparison between the different backbones.
Quantitative results of model performance are detailed in Table 3. Overall, we find the segmentation performance of all models to be satisfactory (IoU > 0.7). In general, the performance of the deep-learning networks was on par with values reported in the literature where hybrid CNN/transformer networks have been deployed for semantic segmentation (e.g., T. Wang et al., 2022). Unsurprisingly, we found the SegFormer-B5/U-Net network attained the most accurate scores across all metrics, except for precision, where the SegFormer-B3/U-Net network returned a better score of 0.9212. This can be explained by the higher number of parameters, which leads to better segmentation accuracy but at the cost of additional computational demand. Visual test results are presented in Figure 5 and further support the quantitative scores from model training and testing.

Flood Model Calibration
The 2D hydraulic model was calibrated to observed stage at the Tilpa gauging station by varying Manning's n roughness. In our investigation, both models were considered to be adequately calibrated, with RMSE (0.46, 0.45 m) and PBIAS (0.77, 0.16) for models M_obs and M_qmaxpost respectively (Table 4), comparable to ranges reported in other similar studies (Timbadiya et al., 2011; Zeiger & Hubbart, 2021). Median error (Figure 6c) increased marginally for model M_obs, with both models showing a strong correlation between modeled and observed stages (R² of 0.72 and 0.71 for models M_obs and M_qmaxpost respectively) (Figure 6).

Flood Extent Validation Using Deep Learning-Based Flood Extents
Results from the validation of both models at locations A and B (Figure 2) using satellite video-derived flood extents are detailed in Table 5. The predictive accuracy of both models was tested against both the uncertainty associated with the model forcing data (streamflow) and the influence of TTA. We report both results in detail in the following sections.

Model Comparison Scores
We assess the performance of both models at validation location A and then location B, followed by a summary of the key findings from both reaches.
At validation location A, the predictive skill of models M_obs and M_qmaxpost was very similar, with CSIs ranging between 0.873 and 0.91. For both models, CSI declined when the models were validated against lower-quality observations, that is, without implementation of TTA. When the model was forced using uncertain streamflow data, CSI averaged 0.918 as compared to 0.891 when using observed streamflow, a 3% improvement in performance. Average model bias declined marginally (by 3.3%) when forcing our model with uncertain streamflow. At validation location B, we observed lower values of CSI, ranging between 0.700 and 0.862, representing, on average, a 16% decline in model accuracy. As at validation location A, model performance scores improved across the board when TTA was applied. We do, however, observe a consistent drop in model HR when TTA is applied at both validation locations A and B.
For both models at both validation locations, underestimation of the flood extent is concentrated in the downstream end of the study reach (validation location B), where, on average, the HR was 85.5% as compared to 96.7% at validation location A. To contextualize these results, the skill of both models is in line with previous studies (e.g., Ongdas et al., 2020, using Sentinel-1B), which reported HR scores of between 59% and 77%. The models forced with uncertain streamflow data yielded superior results for all assessment metrics except HR, which declined whenever we applied TTA. Validation bias overall ranged between 0.808 and 1.152, with our scores being in line with those from other comparable validation studies in the literature, which range from −1 to 5 (e.g., Wing et al., 2017, 2021).

Test-Time Augmentation (TTA) and Model Performance Scores
Figure 7 shows an overview of the influence of TTA on the binarization of flood pixels in the observed data. Results differed substantially when TTA was applied during validation of models M_obs and M_qmaxpost at both locations. At both reaches, CSI scores ranged from 0.7 to 0.91, and improved to between 0.803 and 0.941 when TTA was implemented. Overall, these CSI results are in line with other flood model validation studies, which range from 0.1 to 0.9 (e.g., Mester et al., 2021; Wing et al., 2017).
HR scores generally followed a negative trend when TTA was implemented, in contrast to the other performance measures. Since this metric only considers the proportion of wet observed pixels, ignoring whether observed flood extents were exceeded, increasing the flood pixel class had an inverse impact on flood capture. For both models, average HR scores decreased by 0.1% at location A and 4.8% at location B. With TTA implemented, models M_obs and M_qmaxpost reached a nearly identical skill level across all metrics, indicating the importance of TTA in our segmentation strategy, which resulted in better agreement between model-predicted and observed flood extents.

Flood Velocity Validation Using LSPIV
Figure 8 shows the comparison of satellite-derived velocity estimates against 2D model predictions, depth-averaged using varying α coefficients (see Section 2.5.2). Gard (2008) provides criteria for assessing whether 2D model predictions are validated based on the correlation coefficient between measured and simulated velocities, with a correlation of 0.6-0.8 being moderately strong and 0.8-1 being very strong. Further, Ballard et al. (2010) proposed that an R-value of 0.6 (R² = 0.36) constitutes a validated 2D model, with Pasternack (2011) recommending an R² of between 0.4 and 0.8 between observed and 2D model-predicted velocities. Correlations between model predictions and LSPIV velocities were reasonable and averaged 0.78. Water velocities predicted by both model variants at both reaches were generally within the observed variability of the LSPIV data.
We find that accounting for discharge uncertainty had minimal influence on velocity validation metrics. A key influence on the accuracy of 2D model velocity predictions, however, was the choice of α coefficient. We report consistently low values of mean absolute error as α approaches 1, with an average value of 0.12 m s⁻¹. We partly attribute these results to the use of high-quality topography for our 2D modeling. These results affirm the findings of previous work, such as that by Lane et al. (1999), who similarly found that topographic specification plays a larger role in constraining model velocity predictive ability than inflow data.

Flood Extent Segmentation Using a Hybrid Transformer/CNN Network
Satellite video, unlike traditional still imagery, is composed of both static and motional context (temporal information): static context comprises the scenes that remain relatively still or unchanged, with minimal or no movement over a period of time, while motional context encompasses the dynamic elements. Both static and motion-based contexts share a significant correlation, with image semantic segmentation particularly benefiting from the former (e.g., Hashemi-Beni & Gebrehiwot, 2020; Leach et al., 2022) and video semantic segmentation from the latter (P. Hu et al., 2020; Y. Li et al., 2018). Although research into means of simultaneously exploiting both the static and motional contexts has been attempted (G. Sun et al., 2022), we did not pursue this line of inquiry for the reasons outlined below.
While video segmentation can be valuable for analyzing dynamic changes during a flood event and for various research purposes, image segmentation provides a focused and practical approach for flood model validation tasks, where the primary goal is to assess model accuracy against specific flood extents at critical time points. Hydraulic flood model predictions are typically produced as snapshots in time, representing flood extents at specific time points. By using a single image for validation, our validation process aligns naturally with the instantaneous nature of the 2D hydraulic model outputs. Additionally, state-of-the-art deep learning models for video semantic segmentation, such as 3D convolutional networks or spatio-temporal transformers, can be computationally intensive and may require specialized hardware to achieve real-time performance. Moreover, many current deep learning models process video frames independently, leading to limited temporal consistency in segmentation results. We therefore opted to exploit image semantic segmentation.
In evaluating the performance of our hybrid transformer-encoder, CNN-decoder structure, we evaluated segmentation performance with increasing SegFormer encoder depths (i.e., SegFormer B0-B5). The training and inference performance of our six models was largely comparable (Table 3), with the SegFormer-B5 encoder yielding the best performance, thanks to it having the greatest number of trainable parameters (81M, as compared to 60, 44, 24, 13 and 3M for models B4 down to B0). It is essential to clarify that our intention was not to directly compare a standalone U-Net with a transformer-based model. Instead, we aimed to emphasize the performance of our hybrid architecture, which leverages the transformer's capabilities as an encoder and the U-Net as a decoder. Although it has been shown that transformers make strong encoders for semantic segmentation tasks (L. Wang et al., 2022), they come at the cost of increased computational requirements. For our decoder, we find the original (vanilla) U-Net architecture still allowed us to obtain acceptable performance.
In fact, while several variants of the U-Net architecture have been proposed (e.g., Attention U-Net, Inception U-Net, U-Net++) to improve segmentation performance, studies have found that the marginal segmentation performance gains of these new architectures may not be worthwhile, as they come at significantly increased complexity and computational demand (e.g., Kugelman et al., 2022). Our results demonstrate that a vanilla U-Net decoder combined with a transformer backbone can still provide comparable and competitive segmentation results. This finding has significance, as the operationalization of these segmentation techniques will have greater impact when the complexity of model structures is low and accuracy is high, which would encourage adoption in flood mapping studies. We note that although other hybrid structures composed of transformer-based encoder and CNN-based decoder architectures exist (e.g., TransUNet (J. Chen et al., 2021), based on the ViT, and DC-Swin (L. Wang et al., 2022), based on the Swin transformer (Z. Liu et al., 2021)), it would be unmanageable to test them all in this investigation. We speculate that any performance gains would be marginal at best.
Image segmentation performed by deep neural networks exploits two powerful features: (a) the capacity of deep learning architectures to overcome data scarcity by leveraging previously trained networks, and (b) data augmentation techniques, which increase the diversity and size of the training data set by applying various transformations to the original images, improving the model's generalization performance by reducing overfitting and capturing more representative features of the underlying data distribution. A recent study by Wieland et al. (2023) attributed improvements in their segmentation performance to data augmentation, which helped introduce variations, as would be expected in real-world scenarios, that were otherwise not present in their training data set. Gonçalves et al. (2023) and Wieland et al. (2023) also found that adapting pre-trained models (trained on non-remote-sensing-specific databases such as ImageNet) fine-tuned on a custom data set, as in our case, led to superior model performance compared to initializing a model with random weights. Although domain-specific data sets of flood images with labeled masks exist, such as Sen1Floods11, based on Sentinel 1 and 2 (Bonafilia et al., 2020), SEN12-FLOOD, based on a set of multimodal Synthetic Aperture Radar (SAR) and multispectral satellite imagery (Rambour et al., 2020), WorldFloods, based on Sentinel-2 images (Mateo-Garcia et al., 2021), and FloodNet (Rahnemoonfar et al., 2021), training a good deep-learning model on data from disparate sensors remains a complex task. Data from different sensors may exhibit significant domain shifts, meaning that statistical characteristics of the data (e.g., color distribution, resolution, lighting conditions) can vary widely. Deep learning models are sensitive to such variations, and when faced with data from disparate sensors, they may struggle to generalize well to unseen sensor
data. Further, data from disparate sensors may have inconsistencies in labeling or annotation standards, with ground truth labels not being directly comparable between sensors, leading to challenges in creating a consistent and accurate training data set.
The uncertainty of the inundation data used to validate our models has the potential to introduce observational bias. That said, our bias scores were all positive and small, indicating that satellite video-derived flood extents were either largely matched or slightly overestimated by model predictions. Satellite-based observations, both optical and radar, face well-documented limitations, including cloud cover as well as uncertainties associated with the timing of sensor overpass relative to the advancement of the flood wave. Incorporation of some of these observational uncertainties (such as the timing of the sensor overpass) would clearly offset some of the error related to our validation. Previous studies have relied on machine learning models (e.g., Tanim et al., 2022), thresholding techniques (e.g., Tiampo et al., 2022) and histogram-based models (e.g., Singh & Kansal, 2022) to derive flood extents for model validation. Departing from these well-established methods, our trained neural network extracted flood extents from red, green, and blue (RGB) images in a topographically complex natural river floodplain. Our results demonstrate that satellite video RGB imagery, which is essentially a stack of images rich in temporal dynamics rather than a snapshot in time, can attain acceptable accuracies for flood model validation and can be used in place of multispectral/hyperspectral imagery. Further, as satellite video remote sensing emerges as a key technology in earth observation, there is great scope for advancing the synergy between deep learning algorithms and high spatial/temporal resolution RGB data in operational flood hazard mapping studies.

Effects of TTA on Segmentation Performance
Although implementing TTA introduced additional computational overhead during inference, since we had to process multiple augmented versions of each test image, the performance gains offered outweighed the computational cost. Similar to the findings of J. Liu et al. (2022), who utilized TTA during semantic segmentation of remote sensing images, we found that the skill of both our models improved. In general, both models M_obs and M_qmaxpost were highly sensitive to TTA, with significant disparity between model skill at the two validation locations (Table 5). CSI, model bias and F scores all improved (by 10.3%, 67.7%, and 2.4% respectively), while the flood capture/HR decreased, indicating an increase in model overprediction as the segmented flood pixel class reduced. The sensitivity of model CSI performance to TTA was heightened at validation location B, where flood extent was less constrained by topography, leading to more favorable F scores (mean of 0.004) as compared to location A (mean of 0.051). It is further manifest from our results that where model performance matters most, in zones of complex channels with localized spill, our HR, CSI and F scores remained high, demonstrating satisfactory flood extent prediction accuracy by our hydraulic model. Overall, we find that TTA helped our model to generalize better (with IoU > 0.8) and reduced overfitting to our test image set.

2D Model Validation Using Satellite Video Derived Flood Extents
The performance of our 2D hydrodynamic models in reproducing observed flood extent varied significantly, highlighting a key result: discharge data uncertainty is a fundamental driver of out-of-bank flood processes and plays a crucial role in model evaluation. It is well established in the literature that an assessment of the uncertainty of input boundary data is cardinal to any model validation exercise (Grimaldi et al., 2016; Hoch & Trigg, 2019). Streamflow data from stage-discharge relations have been reported to have errors as large as 20% or more during extreme floods (McMillan et al., 2012), further underlining the importance of quantifying discharge data uncertainty when deriving inundation maps from hydraulic models (Bales & Wagner, 2009). We observed that the input boundary condition (inflow hydrograph) had a strong effect on the modeled flood extent at both the lower and upper reaches of our study area, where flood capture varied by an average of 5% when using uncertain discharge estimates to drive our model, which also explained the variance in other validation scores. Although inundation extent by itself may not be sufficient for the assessment of model skill, and is primarily effective in flat, extensive floodplains (Hesselink et al., 2003), our results account for uncertainty typically ignored in most inundation modeling studies. At a minimum, we propose that an appropriate application of 2D model simulations for the generation of flood inundation maps must account for discharge uncertainty, especially in highly variable terrain, if reliable extent maps are to be expected.
Although most of the differences between the skill of our models could be attributed to streamflow uncertainty, there remains some uncertainty linked to the satellite observation data, likely related to the timing of image acquisition. Previous research (Horritt, 2006) found that satellite inundation data may carry added uncertainty if the overpass does not coincide with the advancement of the flood wave, which was the case in our investigation, as the satellite video was acquired during the receding phase of the flood event when the floodplain was gradually dewatering. Nevertheless, an improvement in correctly identifying flooded pixels when using uncertain streamflow estimates shows that our analysis adopted a reasonable compromise with regard to total uncertainties, and that the presence of uncertainty in the observed satellite video did not significantly impact our findings. These conclusions further advance the evidence presented by Grimaldi et al. (2016), who similarly singled out the uncertainties of upstream boundary conditions as the prime propagator of error when calibrating and validating hydraulic models. Evaluating the skill of models therefore remains a challenging task, due to the residual uncertainty in validation data coupled with the inconsistencies inherent in evaluation metrics, whose reliability varies depending on flood magnitude.

2D Model Validation Using Satellite Video Derived Velocities
When validating 2D model predictions using satellite video LSPIV velocities, the moderate R² values (ranging between 0.61 and 0.63) indicated that a substantial portion of the variance in the observed LSPIV velocities was explained by the model predictions. However, the fact that there was little improvement in model performance even when accounting for discharge uncertainty underscores the complexities involved in accurately capturing every aspect of fluid dynamics during flood events. Although our 2D model predictions performed on par with results from the scientific literature (Barker et al., 2018; Pasternack, 2011), the uncertainty associated with the choice of a depth-averaging constant remains. Complexities surrounding spatial variations of flow velocities, channel geometry, bed roughness and the transient nature of river flows mean that published depth-averaging constants must be used judiciously. Similar to other studies where LSPIV velocities were depth-averaged using α coefficients (e.g., Le Coz et al., 2010; Masafu et al., 2022; Vigoureux et al., 2022), we find α values of between 0.9 and 1 to be more realistic. We acknowledge that the gold standard for defining an α coefficient is to use aDcp observations, but field-based river velocity measurements are logistically challenging and often impossible during high flow. Here we build upon previous studies that have estimated river surface velocities (e.g., Legleiter & Kinzel, 2021) and further show that an independent satellite video data set can be used to validate 2D model simulations.

Limitations in Applying Satellite Video for Flood Model Validation
Our study evidenced the utility of satellite video data sets in assessing reach-scale 2D hydraulic model simulations.
The likely future proliferation of nanosatellite constellations will lead to more opportunities to acquire satellite video. However, satellite video data sets come with limitations that can affect their applicability and accuracy in analyzing river dynamics. While video data sets capture dynamic changes over the time of recording, their spatial resolution is still limited compared to dedicated high-resolution still imagery. Moreover, the limited spectral bands in satellite video mean that some information relevant to river dynamics is not available, especially when compared to multispectral still imagery and SAR data. Near-infrared (NIR) bands are particularly beneficial in providing crucial information for delineating flood extents. SAR imagery, which operates in the microwave region of the electromagnetic spectrum and can capture data under various atmospheric conditions and at different times (night and day), is highly reliable during stormy or cloudy conditions when optical sensors are ineffective. Similar to other commercial satellite data, satellite video is not yet open access and requires tasking for image acquisition. Further, the current satellite video imaging catalog is very limited compared to other publicly available still-image satellite data.
Our use of temporally and spatially autocorrelated data from the same event for both training and testing can lead to overestimation of the deep neural network's performance, because the network was not challenged with data that significantly differed from what it had already encountered. This narrow data set scope limited our neural network's exposure to diverse flood characteristics such as varying water levels, different terrain types, or urban versus rural flood dynamics. In urban settings, complexity increases due to diverse features such as buildings, roads, and varied land use. Our hybrid deep neural network, fine-tuned with rural catchment data, might not perform as well in urban landscapes without additional training on urban-specific features. To address the potential overestimation of our network's performance due to the use of autocorrelated data, we fine-tuned a deep neural network that had been pre-trained on ImageNet, a large and diverse data set encompassing a wide range of scenes and objects. This preliminary training provided our model with a foundational understanding of varied features and textures, which is beneficial for initial feature detection in flood scenarios.
Despite this initial step, the use of more diverse and distinct flood scenes for training and testing remains crucial.
While the pre-training on ImageNet partially compensated for the lack of diversity in our flood-specific data set, it cannot entirely substitute for the direct input of varied flood scenes. Future research could focus on incorporating data sets that include multiple flood events from different geographical locations and times. This would significantly improve the ability of deep neural networks to generalize and perform accurately across various real-world flooding scenarios. Additionally, the challenge of determining optimum threshold values in semantic segmentation was amplified by our data set's limited diversity. Addressing this limitation by testing and optimizing deep neural networks across a variety of flood events would aid in identifying more universally applicable threshold values. The development of more adaptable neural network models, capable of adjusting threshold parameters based on the characteristics of each unique flood event, remains a key area for future improvement. Nonetheless, our approach is pioneering in its use of satellite video for flood analysis, providing a valuable proof of concept and a baseline for future models trained and validated on more varied data sets.
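As an illustration of the threshold-selection problem discussed above, the sketch below sweeps candidate probability thresholds over a network's output and keeps the one maximizing the critical success index (CSI) against an annotated mask. The toy probability map, the threshold grid, and the helper names are assumptions for illustration, not the study's implementation.

```python
import numpy as np

def csi(pred, truth):
    """Critical success index: TP / (TP + FP + FN) for binary masks."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def best_threshold(prob_map, truth, thresholds=np.linspace(0.1, 0.9, 17)):
    """Sweep candidate thresholds over a probability map and return the
    first threshold maximizing CSI against the annotated mask."""
    scores = [csi((prob_map >= t).astype(int), truth) for t in thresholds]
    i = int(np.argmax(scores))
    return thresholds[i], scores[i]

# Toy probability map and ground truth (illustrative only): flood pixels
# receive probabilities in [0.7, 1.0], background in [0.0, 0.3).
rng = np.random.default_rng(0)
truth = (rng.random((64, 64)) > 0.5).astype(int)
prob = np.clip(truth * 0.7 + rng.random((64, 64)) * 0.3, 0, 1)
t_best, score = best_threshold(prob, truth)
```

A threshold tuned this way on a single event is unlikely to transfer to other floods, which is precisely the generalization concern raised above.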

Satellite Video for Flood Risk Science: Current Status and Future Perspectives
High-resolution full-color earth observation video presents a fundamental paradigm shift away from the conventional periodic snapshots offered by most current satellite platforms with imagery sensors. Although full-motion satellite video offers previously unavailable temporal insights for monitoring flood dynamics, these data will not necessarily replace conventional earth observation missions; rather, they will augment current sensors to foster further advances in hydrological process understanding. Moving from single-image analysis to processing video streams presents a unique challenge owing to the sheer size of satellite video. However, parallel developments in data analytics, such as machine and deep learning techniques, are making the processing of such big data sets less computationally intensive. Satellite video sensors suffer from the key constraint of cloud cover, which impairs their ability to provide quality imagery during potentially critical storms that are typically accompanied by cloudiness. However, as demonstrated here, for large catchments where a period of clear skies follows a storm and the passage of a flood wave has a relatively long duration, daylight satellite video acquisition is feasible. Although spatial and temporal cloud removal methods for optical satellite imagery have been proposed (e.g., Huang et al., 2015; Z. Li et al., 2019), their application to satellite video scenes is yet to be explored, and changes in cloud obscuration are likely to be minimal during a single video sequence.
Nanosatellite constellations, such as CubeSats, which are highly modular and inexpensive imagers, are rapidly enabling real-time delivery of RGB video with very high-frequency revisits, as short as 15 min for some constellations (Ivliev et al., 2022; Liddle et al., 2020; Lomaka et al., 2022; Marinan et al., 2013). As satellite video becomes increasingly available to commission, new opportunities for applications such as the direct estimation of discharge from space are now on the horizon. Legleiter and Kinzel (2021) used satellite video to estimate flow velocities of the Tanana River directly from space. The production of digital surface elevation models from satellite video using Structure-from-Motion (SfM) photogrammetry techniques was proposed by d'Angelo et al. (2014, 2016).
Combining space-derived velocities with quality digital elevation and bathymetry data presents the potential for estimating discharge from flash floods and ungauged catchments where in-situ gauges might be too expensive to install and operate. The UK Centre for Ecology and Hydrology's FluViSat study ("FluViSat: Hydrological Flow Measurements from Satellite Video," https://www.ceh.ac.uk/our-science/projects/Fluvisat, last access: 16 August 2023), currently underway, aims to demonstrate the computation of surface velocities and river discharges using satellite video. Nanosatellites (satellites with a mass between 1 and 10 kg, including CubeSats and SunSats) are widely believed to be the future of low Earth orbit observations. These satellites are cheap to launch and provide high revisit capabilities compared to larger satellite missions. Such disruptive earth observation technologies, combined with rapidly advancing big data analytics, will transform satellite video applications in flood hydrology.
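The velocity-area principle behind space-based discharge estimation can be sketched as follows: surface velocities are scaled by α to depth-averaged values, multiplied by panel areas derived from bathymetry, and summed across the cross section. All numbers below are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

# Velocity-area sketch: discharge from remotely sensed surface velocities
# and surveyed bathymetry. All values are hypothetical.
alpha = 0.9                                        # surface-to-depth-average coefficient
v_surface = np.array([0.6, 1.1, 1.4, 1.2, 0.7])    # m/s, at panel midpoints
depth = np.array([1.2, 2.5, 3.1, 2.6, 1.4])        # m, from bathymetry
width = np.array([8.0, 8.0, 8.0, 8.0, 8.0])        # m, panel widths

# Q = sum over panels of (depth-averaged velocity x panel cross-sectional area)
discharge = np.sum(alpha * v_surface * depth * width)  # m^3/s
```

In an ungauged-catchment application, the bathymetry term is the main obstacle, since optical video observes only the water surface.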

Conclusion
Using satellite video-derived flood extents and velocities, we assessed the skill of a two-dimensional hydraulic model in predicting a flood event. Two sets of simulations were undertaken: one accounted for discharge uncertainty while the other did not. Flood extents were derived from satellite video scenes using a hybrid transformer-encoder, CNN-decoder deep neural network. Leveraging the transformer's self-attention mechanism and the CNN's effectiveness at local spatial feature detection, we attained robust predictions of flood pixels. Implementing TTA while delineating flood extents resulted in further improvements in segmentation performance.
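A minimal sketch of flip-based TTA for segmentation is shown below: each augmented prediction is mapped back to the original orientation before averaging, so the probability maps align pixel-wise. The `predict` stand-in and the choice of flips are assumptions for illustration; the study's exact augmentation ensemble is not restated here.

```python
import numpy as np

def predict(image):
    """Stand-in for the segmentation network's forward pass, returning a
    per-pixel flood probability map. Hypothetical placeholder only."""
    return np.clip(image, 0.0, 1.0)

def predict_with_tta(image):
    """Average predictions over flip-based test-time augmentations.
    Each transform is inverted on the output before averaging."""
    preds = [
        predict(image),                           # identity
        np.fliplr(predict(np.fliplr(image))),     # horizontal flip, un-flipped
        np.flipud(predict(np.flipud(image))),     # vertical flip, un-flipped
    ]
    return np.mean(preds, axis=0)
```

Averaging over geometric transforms tends to smooth out prediction noise at flood boundaries, which is consistent with TTA compensating for aleatoric uncertainty as reported above.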
Validation of both models using flood extents showed good model performance. Models were also validated using satellite-based LSPIV; to the best of our knowledge, this study presents the first attempt of this nature to use satellite-derived LSPIV velocities to validate 2D model predictions. Although uncertainty remains regarding the choice of a depth-averaging constant (α), our linear regression models showed that model-predicted and LSPIV velocities had a statistically significant relationship. This demonstrates the notable benefits of non-contact velocity estimation from space, especially during high-flow conditions when the use of traditional river velocity measurement techniques is precluded. Although discrepancies between model and satellite-derived flood extents can be linked to other uncertainties, such as those associated with the satellite data itself, the topography, and the model, this study emphasized the importance of accounting for uncertainty in discharge, a dominant yet often neglected source of uncertainty that is especially heightened during flood events. We find that accounting for discharge uncertainty had less impact on metrics associated with velocity-based validation than on those associated with inundation extent. This is likely because model velocity estimates are subject to greater spatial variability arising from factors such as hydraulic parameters, roughness coefficients, and channel geometry. The wider implication of this study is the demonstration that high-resolution satellite video has significant potential as a source of temporally rich data for the validation of velocity and extent predictions. With rapid advances in remote sensing sensor technologies and constellations, satellite video is likely to become more straightforward and cheaper to commission. In many regions of the world where ground-based hydrological observations are not routine, this will open the door to wide-scale, real-time observations of flow velocities and extents, enabling further progress in the science of flood modeling.

Figure 2 .
Figure 2. The Darling River at Tilpa study area, located in the Murray-Darling basin (shaded inset), New South Wales, Australia, with a basemap of Jilin-1 satellite video acquired at 23:12 UTC on 5 February 2022. Panels A and B indicate flood model validation locations. Panel C presents the hybrid mesh used in hydraulic modeling.

Figure 3 .
Figure 3. (a) Stage-discharge relationship for the Darling River at Tilpa gauging station from February 2021-2022. (b) Posterior rating curves and associated uncertainties derived using the BaRatin method. (c) Discharge time series. Q obs is the observed discharge. Q high and Q low are discharges with associated stage (non-systematic and systematic stage measurement errors) and rating curve (parametric and structural/remnant errors) uncertainty. Q maxpost is the upper 95% confidence band of a streamflow time series based on an uncertainty analysis of the rating curve. The rug plot to the right depicts the distribution of Q obs and Q maxpost discharges.
using the Scale-Invariant Feature Transform algorithm (see detail in Text S2 in Supporting Information S1). Stabilized frames were exported to Intel's open-source Computer Vision Annotation Tool and manually labeled. Labeled masks were then converted to binary format in Python (with background = 0, flood = 1). These annotated masks served as the ground truth for training and validation of our deep neural network.

Figure 4 .
Figure 4. End-to-end deep learning network training, validation and testing pipeline.

Figure 5 .
Figure 5. Results of the semantic segmentation of a sample satellite video patch using varied combinations of SegFormer encoders coupled with a U-Net decoder. The annotated mask serves as the ground truth, with water pixels labeled blue and background pixels labeled white.

Figure 6 .
Figure 6. Hydrologic Engineering Center-River Analysis System (HEC-RAS) calibration results. Observed water surface elevations are compared against calibrated Model M obs (a) and M qmaxpost (b) predictions over a time step of 15 min on 5 February 2022 (04:00-12:30). The coefficient of determination (R²) and root-mean-square error represent the results of regression analysis of the data. (c) Raincloud plot (a boxplot with a half-side violin plot) showing the error distributions for both models. In the boxplots, the orange dot represents the median score, the box encompasses the second and third quartiles, and the top and bottom whiskers respectively represent the largest and smallest values within 1.5 times the interquartile range.

Figure 7 .
Figure 7. Semantic segmentation results at reaches A (a) and B (b). The first panel at the top left shows the observed satellite video flood image, followed by Hydrologic Engineering Center-River Analysis System (HEC-RAS) flood outputs from Models M obs and M qmaxpost, and the corresponding binary segmentation maps from zoomed-in insets, first with no Test-Time Augmentation (TTA) and then with TTA applied.

Figure 8 .
Figure 8. Satellite video-based Large Scale Particle Image Velocimetry (LSPIV) velocities versus Hydrologic Engineering Center-River Analysis System (HEC-RAS) 2D model simulations for Models M obs and M qmaxpost at reaches A (a, b) and B (c, d), respectively.

Table 1
Jilin-1 GF03C02 Satellite Sensor Specifications and Video Product Information

Table 3
Segmentation Accuracy for Increasing Sizes of SegFormer Encoders (B0-B5) Coupled With a U-Net Decoder

Table 4
2D Model Calibration Metrics for Models M obs and M qmaxpost

Table 5
Validation Metrics for Models M obs and M qmaxpost Against Observed Satellite Video Data at Validation Locations A and B (See Figure 2)