Unsupervised defect segmentation with pose priors

Single-shot semantic bounding box detectors, trained in a supervised manner, are popular in computer vision-aided visual inspection. These methods have several key limitations: (1) bounding boxes capture too much background, especially when images undergo perspective transformation; (2) domain-specific data are scarce and costly to label; and (3) videos or multi-frame data yield redundant or incorrect detections, and selecting the best detection and screening for outliers is a nontrivial task. Recent developments in commercial augmented reality and robotic hardware can be leveraged to support inspection tasks. A common capability of these platforms is the ability to obtain image sequences and camera poses. In this work, the authors leverage pose information as a "prior" to address the limitations of existing supervised, single-shot, semantic detectors for the application of visual inspection. The authors propose unsupervised semantic segmentation with pose prior (USP), which builds on unsupervised image segmentation by differentiable feature clustering, coupled with a novel outlier rejection and stochastic consensus mechanism for mask refinement. USP was experimentally validated on a spalling quantification task using a mixed reality headset (Microsoft HoloLens 2). In addition, a sensitivity study was conducted to evaluate the performance of USP under environmental and operational variations.

estimates. In other words, a structural inspection involves classifying the defect, localizing it, and quantifying it. To this end, researchers have adopted various convolutional neural network (CNN) architectures with the capability to classify, detect, and segment defects in inspection images; see Figure 1 (Çelik & König, 2022; Li et al., 2022; Pan & Zhang, 2022; Sirca Jr & Adeli, 2018; Spencer Jr et al., 2019).
Classification networks are typically composed of a CNN encoder followed by fully connected layers and a softmax activation function. In this scheme, an input image is first encoded; the encoding is then fed into a fully connected multilayer perceptron, which outputs the predicted class probabilities. Classifiers are widely used in computer vision and have found numerous applications in civil engineering (Kabir et al., 2008; Liu et al., 2019; Spencer Jr et al., 2019). Specifically in structural inspection, researchers have utilized classification networks for post-earthquake collapse classification, structural component and damage recognition, classification of bridge infrastructure, crack identification, and more (Gao & Mosalam, 2018; Lin et al., 2017; Rafiei & Adeli, 2017; Yang et al., 2018; Yeum et al., 2019b; Zhang et al., 2016; Zhang & Yuen, 2021). Classification networks are also a popular choice for crack detection. For example, researchers have utilized a sliding window approach, in which the input image is divided into a grid of some fixed size and each cell of the grid is fed into the classifier for prediction (Cha et al., 2017; Fan et al., 2019; Kim & Cho, 2018; Xu et al., 2019a, 2019b). The cells where the classifier predicts that a defect exists then become the detection. For specific tasks (e.g., crack detection), the sliding window may be preferred over bounding box detectors because bounding box labels may inadvertently capture large swaths of background (e.g., shear cracks, radial cracks, large perspective transformations). Generally, however, sliding window classifiers are not preferred because they are computationally expensive and the resolution of the detection is limited by the size of the sliding window.
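For illustration, a minimal sketch of the sliding-window scheme follows; the `classifier` callable and all parameter values are hypothetical placeholders, not taken from the cited works:

```python
import numpy as np

def sliding_window_detect(image, classifier, win=64, stride=64,
                          defect_class=1, thresh=0.5):
    """Divide the image into a grid of windows and flag the cells where the
    classifier predicts a defect. Returns a boolean grid (one entry per cell)."""
    h, w = image.shape[:2]
    grid = np.zeros((h // stride, w // stride), dtype=bool)
    for i in range(0, h - win + 1, stride):
        for j in range(0, w - win + 1, stride):
            patch = image[i:i + win, j:j + win]
            probs = classifier(patch)  # e.g., softmax output of a CNN classifier
            grid[i // stride, j // stride] = probs[defect_class] > thresh
    return grid
```

As the surrounding text notes, each window requires a separate forward pass, which is why this scheme is costly compared with single-shot detection, and why its resolution is capped by the window size.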
Detectors, like classifiers, initially utilize a CNN encoder to process the input image, the output of which is used by a region proposal network (RPN) to localize the region of interest (ROI) in the image (Farhadi & Redmon, 2018; Ren et al., 2015). The predicted region usually takes the form of a bounding box. Compared to early detectors that employed the sliding window method described above, state-of-the-art single shot detectors (SSDs) are orders of magnitude more efficient (Farhadi & Redmon, 2018). Detectors function by exploiting the spatial continuity constraint that is inherent to the CNN; in short, CNN encoders can encode class labels as well as the localization of the classes. Owing to this property, modern SSDs are highly efficient, while remaining easy to customize thanks to the relative ease of labeling data and the availability of transfer learning and fine-tuning. These advantages have made detectors (e.g., YOLO, F-RCNN, EfficientDet) very popular in the civil engineering domain for everything from bridge inspection and traffic studies to work-site safety (Cha et al., 2018; Charron et al., 2019; Chen & Gupta, 2017; Farhadi & Redmon, 2018; Ren et al., 2015; Spencer Jr et al., 2019; Tan et al., 2020; Yeum et al., 2019a). Due to the advantages and accessibility of detectors, some civil engineering firms have begun testing detectors for integration into their existing workflows (Rubenstone, 2020).
Up to this point, detectors have been used to "localize the ROI," that is, to find ROI locations in image coordinates. However, the ROI must be projected to world coordinates (e.g., 3D coordinates) for damage quantification, such as measuring length or area. Suppose inspectors have an accurate bounding box detector and capture an image directly normal to a perfectly rectangular ROI. The bounding box will overlap the ROI, and the physical area of the ROI can be calculated by obtaining the scale of the image (e.g., mm/pixel). If the inspector instead steps to either side and captures the same ROI at an angle, the ROI appears as a parallelogram, and the bounding box containing it now includes pixels that are not part of the ROI; the greater the perspective transformation, the larger the projected unwanted background region. To resolve this issue, only those pixels belonging to the ROI class should be projected; detecting those pixels is called semantic segmentation.
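To make the quantification step concrete, consider a hedged numerical example with a hypothetical image scale of $s = 2$ mm/pixel and a segmented ROI of $N = 10{,}000$ pixels captured at normal incidence:

$$A_{\mathrm{ROI}} = N \cdot s^2 = 10{,}000 \times (2\ \mathrm{mm})^2 = 40{,}000\ \mathrm{mm}^2 = 0.04\ \mathrm{m}^2$$

Under perspective distortion, a bounding-box pixel count $N$ would include background pixels and inflate $A_{\mathrm{ROI}}$; counting only per-pixel ROI labels avoids this bias.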
Semantic segmentation is the association of each pixel in an image with a class label. The most popular semantic segmentation methods utilize a derivative of the U-Net architecture (Chen et al., 2018; Guo et al., 2021; Kendall et al., 2015; Ronneberger et al., 2015). Like a detector, the U-Net utilizes both the class and spatial information of an encoder. To recover the dense pixel-wise classification (semantic segmentation), a decoder is used to up-sample the encoded tensors (Ronneberger et al., 2015). Regardless of the specific architecture, a CNN's spatial continuity constraint enables the encoder to encode spatial information, which can be used by a learned decoder to produce segmentation masks. Semantic segmentation is usually necessary to quantify damage on a physical scale (Kim & Cho, 2020; Li et al., 2019; McLaughlin et al., 2020; Mirzaei & Adeli, 2019; Wijnands et al., 2021; Yeum & Dyke, 2015; Zhang & Yuen, 2021). In our recent work, mixed reality (MR) devices are used to create a local 3D map of a bridge, and spalling is quantified by ray-casting a segmentation of the spalled area in the image onto the 3D map (Al-Sabbag et al., 2022).
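To illustrate the encoder-decoder idea, the following is a toy PyTorch sketch (not any of the cited networks): the encoder halves the spatial resolution while deepening features, and the decoder up-samples back to dense per-pixel class scores.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder segmenter, illustrative only."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 1/2 resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 1/4 resolution
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),  # back to 1/2
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),      # full-size logits
        )

    def forward(self, x):
        return self.dec(self.enc(x))  # (B, n_classes, H, W) pixel-wise scores

# Dense labels for a random image: argmax over the class channel.
mask = TinySegNet()(torch.rand(1, 3, 64, 64)).argmax(dim=1)
```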
However, utilizing a semantic segmentation method presents some real-world challenges (Kendall et al., 2015). First, creating labeled segmentation data sets for training supervised segmentation models is time-consuming, tedious, and expensive. This is especially the case for irregularly shaped ROIs with jagged edges (e.g., the spalling defects in Figure 1), which are particularly difficult to label manually. Additionally, in a technical field, labeling often requires the labelers to possess domain knowledge, which can greatly increase costs, making many labeling tasks impractical (Al-Sabbag et al., 2022). Second, in a typical inspection using robotic platforms (e.g., drones), videos are captured and many frames contain the same ROI, captured from different camera positions and angles; it is challenging to obtain a unique segmentation of the ROI by weeding out incorrect segmentation results or merging suboptimal ones. The consequence is that inspectors have to asynchronously review a large number of segmentations to select the best result.
To address these real-world challenges of current semantic segmentation methods and enable quantitative inspection, the authors propose unsupervised semantic segmentation with pose prior (USP), which leverages recent developments in machine learning and the capabilities of existing robotic inspection platforms. The proposed method utilizes a series of images featuring the ROIs, taken from various known camera poses. Images are first preprocessed, through homography, such that each image in the series appears to be taken from the same pose; they are then cropped with an identical bounding box. Preprocessed images are segmented using unsupervised semantic segmentation, and the resulting masks are fed into our stochastic consensus module. The stochastic consensus module attempts to find pixel-wise label consensus by sampling different pixel locations across the different segmentation masks. Found consensus pixels are then utilized as seed points as the preprocessed images are re-segmented. This process is repeated until consensus between segmentation masks is reached. In this manner, the proposed unsupervised semantic segmentation can provide accurate segmentation masks without using expensive labeled data, while enforcing consistency between segmentation results from multiple frames. The authors experimentally demonstrate the capability of the proposed USP using a case study of spalling segmentation. In this experiment, a sequence of spalling images and their pose information were collected using the Microsoft HoloLens 2 as a data collection platform (Al-Sabbag et al., 2022). Using the pose information of the sequence of spalling images, the authors successfully applied the proposed USP to segment the accurate boundary of spalling for quantification.

WEAKLY SUPERVISED AND UNSUPERVISED SEGMENTATION
Supervised semantic segmentation methods require a densely labeled segmentation data set, which is time-consuming, tedious, and expensive to create. A common workaround is to utilize weakly supervised or unsupervised learning methods, where weakly supervised methods require some alternative label and unsupervised methods require no labels whatsoever. The catch is that full supervision generally outperforms weak supervision and no supervision. However, recent research has significantly closed this supervision gap, which may in the future lead to more widespread application of weakly supervised and unsupervised learned models (Geirhos et al., 2021). The following sections introduce popular trends in weakly supervised and unsupervised segmentation.

Weakly supervised segmentation
There are many different methods in the field of weakly supervised segmentation, but by far the most prominent methodology is based on the class activation map (CAM). CAMs were first proposed by Zhou et al., who substituted the fully connected layer of a traditionally learned encoder with global average pooling layers, which enable the recovery of an attention heat map of the predicted class (Zhou et al., 2016). This is possible because a CNN encoder encodes both class and spatial information, which can be directly visualized as the CAM. CAM methods are considered weakly supervised because they use a less labor-intensive label (classification) to produce a much more labor-intensive one (segmentation). This labor saving greatly increases the ability of CAMs to scale to larger data sets. The CAM concept has been extended by others, for example, in gradient-based localization (Grad-CAM) and Grad-CAM++ (Chattopadhay et al., 2018; Selvaraju et al., 2017). The various derivatives of CAMs seek to extend the functionality of the original CAM paper: Grad-CAM uses gradients for all layers of the encoder, rather than only the last layer, and Grad-CAM++ extends Grad-CAM by allowing the activation of multiple instances of a class in an image (Chattopadhay et al., 2018; Selvaraju et al., 2017).
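As a minimal sketch of the original CAM computation (array shapes are illustrative; the feature maps and classifier weights would come from a trained network):

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """CAM in the style of Zhou et al. (2016): weight the final conv feature
    maps by the classifier weights of the target class, then sum over channels.
    features:   (C, H, W) activations from the last conv layer
    fc_weights: (num_classes, C) weights of the layer after global avg pooling
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)          # keep positive class evidence only
    return cam / (cam.max() + 1e-8)     # normalize to a [0, 1] heat map
```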
However, the Achilles heel of CAM-based methods is that the localization is often limited to small discriminative regions (Bae et al., 2020). This reflects the tendency of a supervised model to "learn shortcuts": the model has little incentive to learn the full representation of a class when a shortcut performs nearly as well most of the time (e.g., as a dog's face is its most discriminative feature, CAMs tend to activate only on the dog's face, neglecting other features like the body and tail), resulting in under-segmentation and thus low segmentation accuracy. This problem is well documented in the literature, and many researchers have attempted to address it in various ways (e.g., self-supervised equivariant attention mechanisms, stochastic inference) (Bae et al., 2020; Lee et al., 2019; Wang et al., 2020). Bae et al. proposed a combination of three techniques: threshold average pooling, negative weight clamping, and percentile-based thresholding (Bae et al., 2020). Lee et al. proposed stochastic inference, where parts of an image are stochastically selected such that the CAM is generated for both descriptive and nondescriptive parts of the image (Lee et al., 2019). Finally, Wang et al. proposed a self-supervised Siamese network architecture, where one network takes the image and the other a transformed image; the CAM from the transformed image is regularized and used to generate a self-supervised loss (Wang et al., 2020).

Unsupervised segmentation
Unsupervised methods are characterized by the means to generate a gradient (e.g., a loss) from the raw data set. There are two prominent methodologies for unsupervised segmentation: generative adversarial networks (GANs) and feature clustering (Abdal et al., 2021; Bielski & Favaro, 2019; Goodfellow et al., 2014; Hwang et al., 2019; Ji et al., 2019; Karras et al., 2020; Kim et al., 2020; Liu et al., 2021; Zhu et al., 2017). GANs, first proposed by Goodfellow et al., are at their simplest made of two components: a generator and a discriminator (Goodfellow et al., 2014). The generator takes some random noise and transforms it into a "fake" data instance. The generated "fake" data instance, paired with a "real" data instance, is provided to the discriminator, which attempts to identify which instance is real and which is fake. The discriminator's prediction generates a gradient that is propagated through both the generator and the discriminator. For completeness, the generator usually takes the form of a decoder and, conversely, the discriminator usually takes the form of an encoder. The original form of GAN proposed by Goodfellow et al. is not directly applicable to semantic segmentation.
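The adversarial training loop can be sketched as follows; this is a toy illustration of the generator/discriminator gradient flow described above (modules, sizes, and learning rates are arbitrary assumptions, not from the cited works):

```python
import torch
import torch.nn as nn

# Toy generator (decoder-like) and discriminator (encoder-like) modules.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real):                     # real: (B, 784) batch of real data
    b = real.size(0)
    fake = G(torch.randn(b, 16))        # generator maps noise -> "fake" data
    # 1) Discriminator: label real as 1, fake as 0.
    d_loss = bce(D(real), torch.ones(b, 1)) + \
             bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Generator: try to make the discriminator predict "real" for fakes.
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```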
Instead, derivatives of GANs, such as the CycleGAN or StyleGAN architectures, are popular for unsupervised semantic segmentation (Karras et al., 2020; Zhu et al., 2017). In general, a GAN-based unsupervised segmentation method, while not requiring labels, requires training on a data set from the target domain, and the quality of that data set and the hyperparameters can heavily impact model performance. As a result, real-world applications of GANs have been relatively limited. Additionally, a popular assumption is that the ROI and background can be decoupled (e.g., the object to be segmented is in the foreground against some background) (Abdal et al., 2021; Bielski & Favaro, 2019; Liu et al., 2021). This assumption is not valid for most structural inspection tasks because the defect usually exists on the background surface itself (e.g., damage on a concrete surface).
Next, feature clustering is a well-established unsupervised machine learning approach, exemplified by methods such as K-nearest neighbor (KNN) that utilize similarity to cluster entries (Hwang et al., 2019; Ji et al., 2019; Kim et al., 2020). The intuition translates well to unsupervised semantic segmentation, which seeks to cluster pixels by their similarity. Hwang et al. proposed a two-stage expectation maximization framework, utilizing spherical (von Mises-Fisher) K-means clustering to sort pixels for segmentation and a KNN-based sorting loss that is propagated back to the encoder (Hwang et al., 2019). Ji et al. proposed a Siamese architecture where an image and its randomly transformed version are fed into the respective heads of the Siamese network. The Siamese network is composed of an encoder and a fully connected layer, where some weights are shared between the encoders and fully connected layers. The segmentation is then produced by invariant information clustering, which attempts to maximize the mutual information between the original and transformed images (Ji et al., 2019). Kim et al. proposed a simpler method where a loss, generated from feature similarity and spatial continuity, is minimized (Kim et al., 2020). A unique aspect of the method proposed by Kim et al. is its ability to seamlessly operate in a semi-supervised manner, which is accomplished by utilizing seed point priors.

Remarks
Semantic segmentation is a crucial process for accurate defect quantification in the context of a structural inspection. Typically, training a semantic segmenter requires a large segmentation data set, which must be labeled at great cost by domain experts. Weakly supervised and unsupervised segmentation methods instead bypass the need for expensive segmentation labels. Subsections 2.1 and 2.2 described various weakly supervised and unsupervised semantic segmentation methodologies. In this study, the authors implemented the unsupervised feature clustering method by Kim et al. for damage segmentation. Unsupervised feature clustering is a relatively simple methodology, and its main advantage is the ability to utilize seed point priors. This provides an effective short-term memory mechanism to propagate high-fidelity predictions. Unsupervised feature clustering enables the authors to incorporate multiple images via consensus and to refine segmentation results through iterations.

Overview
To generate segmentations, USP requires images that capture the ROIs, their respective poses (represented by projection matrices in a map frame), and a reference frame (RF). In this work, the authors assume that data collection platforms (e.g., ground robots, smartphones, or MR devices) collect image data and their relative poses using a simultaneous localization and mapping (SLAM) or structure from motion (SfM) system. Then, users manually select a bounding box of a target ROI, or it can be detected automatically by a bounding box detector such as YOLO or F-RCNN (Chen & Gupta, 2017; Farhadi & Redmon, 2018; Ren et al., 2015; Tan et al., 2020). The main task of USP is to segment and extract the boundary of the ROI from its bounding box. USP carries out three key processes: pre-processing, segmentation and outlier rejection, and stochastic consensus. A feedback mechanism, illustrated in Figure 2, is implemented to ensure convergence of the segmentation. Sample resultant images, corresponding to each key process, are shown in Figure 3. Note that the images in Figure 3 are obtained from the experimental validation in Section 4, for the case where spalling damage is the target ROI.
USP starts with a stream of images gathered from an image collection platform that supports SLAM or SfM. For example, an inspection platform collects videos following the data collection path shown in Figure 3. The images, which include the target ROI (i.e., spalling damage) ("Data Collection" in Figure 3), and their relative poses are stored. Then, a bounding box of the ROI is annotated on one of the images, denoted the reference image, either manually or by a bounding box detector. As the relative poses of the collected images are known, and assuming the ROI is on a planar surface, all images are transformed to the pose of the reference image, that is, the image with the ROI bounding box. Thus, the spalling damage is placed at the same location in all images. Then, the bounding box region in each image is cropped; these crops will henceforth be referred to as frames, shown as "Prepros. Frames" in Figure 3. Next, all frames are naively segmented using unsupervised segmentation. Then, a similarity metric is used to identify outlier segmented frames, assuming outliers are not similar to the majority of the segmented frames ("Seg. & outlier rejection"). If the number of inlier segmentations is sufficient, they are utilized in the stochastic consensus module to generate a sparse seed point mask to guide segmentation in the next iteration. Otherwise, unsupervised segmentation is re-run until enough inlier segmentation frames are generated. With the inlier segmentation frames, stochastic consensus is executed to generate seed points where there is semantic class consensus. The sparse seed points are utilized to re-segment the images, which results in more refined segmentation masks. Each time the seed point mask is used to re-segment the frames, an iteration counter is incremented. When sufficient consensus is found or the maximum number of iterations is reached, an ensemble of inliers is used to create the output segmentation ("Final seg. mask").
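The overall loop can be summarized in hedged pseudocode (Python syntax, but not runnable as-is: every helper name here is an illustrative placeholder, not the authors' implementation; the individual steps are sketched in the subsections that follow):

```python
def usp(images, poses, ref_idx, bbox, max_iters=3):
    """Pseudocode of the USP loop: rectify -> segment -> reject outliers ->
    stochastic consensus -> re-segment with seed points."""
    # Pre-processing: warp every view to the RF pose, crop 1.5x the bbox.
    frames = [crop(warp_to_reference(img, pose, poses[ref_idx], bbox),
                   bbox, scale=1.5)
              for img, pose in zip(images, poses)]
    seeds = None
    for it in range(max_iters):
        masks = [unsupervised_segment(f, seed_points=seeds) for f in frames]
        inliers = reject_outliers(masks, ssim_threshold=0.6 + 0.05 * it)
        if len(inliers) <= len(masks) // 2:
            continue                          # re-run until enough inliers
        seeds = stochastic_consensus(inliers) # pixel-wise label agreement
        if converged(inliers):
            break
    return majority_vote(inliers)             # ensemble of inlier masks
```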
Note that USP relies on the homography between the frames, so if the damage is in a nonflat region or on a corner, USP cannot be used. In such cases, the underlying 3D surface geometry must be modeled, and more sophisticated methods would be required for quantification.

Pre-processing
The pre-processing starts with a sampled set (or sequence) of images with known poses, containing the selected or detected ROI. To start, all images are transformed (rectified) to the RF view. This is accomplished by exploiting the homography between images. A common way to compute the homography is through feature point correspondences, using the four-point algorithm (direct linear transform) with RANSAC (Hartley & Zisserman, 2003). However, this method may be unreliable when scenes contain multiple surfaces (e.g., the inside of a spalling region is not on the concrete surface) or when there are insufficient features on the surface, which can be caused by a myriad of issues, including a lack of distinct textures (e.g., smooth concrete) and poor lighting (e.g., a bridge soffit). Thus, to ensure the robustness of the homography calculations, the authors leverage camera pose information to compute the homography between images. Most sensing platforms supporting SLAM or SfM fuse additional sensor inputs, such as inertial measurement units (IMU), the Global Positioning System (GPS), and/or depth sensors, to ensure that pose estimation remains robust in such environments (Macario Barros et al., 2022). Suppose that $\mathbf{X}_i$ are the 3D points corresponding to the corners of the ROI bounding box chosen by the user on the RF (bounding box in Figure 4). They are computed by ray-casting the four corners of the bounding box (image coordinates on the RF) onto an underlying 3D mesh. The homography matrix from the RF to the other views can then be computed from Equation (1) (Li et al., 2013; Malis & Vargas, 2007):

$$\mathbf{H}_{ba} = \mathbf{K}\,\mathbf{R}_b\left(\mathbf{I} + \frac{(\mathbf{t}_a - \mathbf{t}_b)\,\mathbf{n}^{\top}}{d}\right)\mathbf{R}_a^{\top}\,\mathbf{K}^{-1} \qquad (1)$$

Here, the RF is camera $a$, and the homography between the RF and each additional view, symbolized as camera $b$, is calculated as illustrated in Figure 4; $\mathbf{K}$ is the camera intrinsic matrix, assumed shared across views. The normal vector $\mathbf{n}$ is obtained by fitting a plane through the $\mathbf{X}_i$ using least-squares error minimization. The map origin frame is an arbitrary coordinate system in 3D space, selected as the origin when performing SLAM, so that all poses are measured relative to it. $\mathbf{R}_a$ and $\mathbf{R}_b$ are the rotation matrices of image $a$'s and $b$'s poses, and $\mathbf{t}_a$ and $\mathbf{t}_b$ are the translation vectors of their camera centers, respectively. $d$ is the shortest distance from the plane to camera $a$.
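For illustration, a minimal NumPy sketch of this computation follows, under the conventions stated above (world-to-camera rotations, camera centers expressed in the map frame); exact sign conventions depend on the SLAM system, so this is a sketch rather than a definitive implementation:

```python
import numpy as np

def fit_plane(X):
    """Least-squares plane through 3D points X of shape (N, 3): returns the
    unit normal (right singular vector of least variance) and the centroid."""
    c = X.mean(axis=0)
    n = np.linalg.svd(X - c)[2][-1]
    return n, c

def pose_homography(K, R_a, t_a, R_b, t_b, n, d):
    """Plane-induced homography mapping RF (camera a) pixels to camera b
    pixels, following Equation (1). n: unit plane normal in the map frame,
    oriented toward camera a; d: distance from the plane to camera a's center."""
    H = K @ R_b @ (np.eye(3) + np.outer(t_a - t_b, n) / d) @ R_a.T @ np.linalg.inv(K)
    return H / H[2, 2]  # normalize so H[2, 2] == 1
```

In practice, each view $b$ would then be resampled into the RF perspective by inverse warping with this mapping (e.g., with OpenCV's `cv2.warpPerspective`).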
After all images are transformed into the RF's perspective, the bounding box is used to crop each image. The crop size is set to 1.5 times the bounding box dimensions so that background around the damage is included. These additional background regions serve two purposes: first, to help separate the damage region from the nondamaged one, and second, to facilitate outlier rejection, described in Subsection 3.3.2. If there is no error in the homography estimation, the ROI location is identical across the rectified image patches.
Finally, a Gaussian blur is applied to all images to eliminate high frequencies (Li et al., 2013). Empirically, a small blur has a positive effect on the segmentation outcomes of the proposed method; this is expected, as it increases the similarity of neighboring pixels, which aids the unsupervised semantic segmentation.

Unsupervised semantic segmentation and outlier rejection
After pre-processing, the series of images should appear like the RF image (e.g., share the same perspective); see Figure 3. However, the content of these image frames is not the same, because of varying lighting conditions and corruptions such as motion blur, which may afflict even the RF. Thus, USP runs segmentation on multiple frames, rather than just the single RF, and the results are fed into the outlier rejection process. This process starts with applying unsupervised segmentation to each preprocessed image frame. The outcome of this process is a segmentation mask for each image, which may vary greatly and is likely to include some wrong (outlier) segmentations. To merge them, a structural similarity (SSIM) score between the masks is calculated to identify inliers. The inliers are passed to the stochastic consensus method to generate seed points, as described in Subsection 3.4.

Unsupervised segmentation
In this study, the unsupervised semantic segmentation network is adapted from "Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering" by Kim et al. (2020). This technique attempts to iteratively satisfy (i.e., minimize the loss arising from) three incompatible requirements; R1: pixels of similar features should be assigned the same label (e.g., the same semantic class should have the same label); R2: spatially continuous pixels should be assigned the same label (e.g., semantic classes should be large); and R3: the number of unique labels should be large (e.g., there should be many semantic classes).
The network is a simple, lightweight CNN model, featuring only convolution, normalization, and argument-of-the-maxima (argmax) operations for differentiable clustering, presented in Figure 5. The model is composed of a series of 2D convolution, ReLU activation, and batch normalization (batch-norm) layers, referred to as the convolutional block (CONV block). The user can specify the number of convolutional blocks; the default number used in this study is three. One last convolution and batch-norm are applied before the argmax function and loss are calculated. Seed points obtained from stochastic consensus, when available after the first iteration, can be utilized by the model to train in a semi-supervised mode.
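A PyTorch sketch of this architecture follows (layer sizes follow the description above and the 100 starting classes mentioned in the next subsection; other details are assumptions, not the authors' exact code):

```python
import torch.nn as nn

class FeatureClusterNet(nn.Module):
    """Lightweight segmentation CNN in the style of Kim et al. (2020):
    repeated conv + ReLU + batch-norm blocks, a final 1x1 conv + batch-norm,
    and an argmax over channels for feature clustering."""
    def __init__(self, in_ch=3, n_feat=100, n_blocks=3):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(n_blocks):  # CONV blocks (default: three)
            layers += [nn.Conv2d(ch, n_feat, 3, padding=1),
                       nn.ReLU(), nn.BatchNorm2d(n_feat)]
            ch = n_feat
        layers += [nn.Conv2d(n_feat, n_feat, 1), nn.BatchNorm2d(n_feat)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        response = self.net(x)            # (B, n_feat, H, W) response map
        labels = response.argmax(dim=1)   # cluster/label index per pixel
        return response, labels
```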
In traditional machine learning, an epoch is defined as one pass of a learning algorithm over the entire training set, and inference is obtaining a prediction from a trained model. In this context, unsupervised semantic segmentation trains the model on each image during inference. As the model trains, the number of unique classes is reduced, until the minimum number of classes is reached or the current epoch reaches the maximum number of epochs. The minimum number of classes should reflect the number of semantic classes the image possesses. The authors recommend setting it to the number of expected semantic classes plus one, due to the edge effects described in the following paragraphs. The appropriate maximum number of epochs depends on many factors, so the authors recommend keeping the default.
As discussed in Section 2, the loss function receives three inputs (including seed points); the two main inputs are the response map, a tensor of class probabilities (e.g., of dimension image width × image height × number of starting classes, 100 by default) from the last batch normalization layer, and the argmax of that response map along the channel dimension. In the unsupervised segmentation mode, without seed points, the network minimizes the similarity loss $L_{\mathrm{sim}}$ and the spatial continuity loss $L_{\mathrm{con}}$ in Equation (2). In semi-supervised mode, the model additionally minimizes the seed point prior loss, referred to as a scribble loss $L_{\mathrm{scr}}$ (Kim et al., 2020) ("scribble" indicates scribble lines manually drawn on target classes). If seed points are used, the scribble loss $L_{\mathrm{scr}}$ is combined with the similarity loss $L_{\mathrm{sim}}$ and the spatial continuity loss $L_{\mathrm{con}}$ via the weight factors $\alpha$ and $\beta$ to calculate the total network loss $L$ in Equation (3). Kim et al. set $\alpha$ and $\beta$ to 1 and 0.5, respectively, likely to ensure sufficient segmentation guidance while preventing the optimization space from being overly restrictive.

$$L = L_{\mathrm{sim}} + \alpha L_{\mathrm{con}} \qquad (2)$$

$$L = L_{\mathrm{sim}} + \alpha L_{\mathrm{con}} + \beta L_{\mathrm{scr}} \qquad (3)$$

$L_{\mathrm{sim}}$ utilizes a sparse categorical cross-entropy between the response map and the argmax of the same response map, for each pixel. The purpose of $L_{\mathrm{sim}}$ is to encourage the model to learn more efficient filters that better cluster similar features in the image. For example, if the class probabilities for a pixel are all equal, this produces a high loss; the high loss encourages that pixel to predict a single class with high probability, resulting in many distinct clusters. Thus, the similarity loss fulfills requirements R1 and R3.
Next, $L_{\mathrm{con}}$ is a mean absolute error (MAE) applied to the response map in the vertical and horizontal directions (applying it to the argmax-ed labels directly would block gradient flow). For example, the horizontal component of $L_{\mathrm{con}}$ is calculated by computing the MAE between each pixel and its neighbor to the right. A similar process is used to calculate the vertical component, which is summed with the horizontal component to yield $L_{\mathrm{con}}$. This process is invalid at the edges of the response map because the padding of the network's convolution layers creates unexpected edge effects; it is therefore recommended that the boundary of the segmentation be discarded. In sum, $L_{\mathrm{con}}$ penalizes neighboring pixels that are not in the same class, which promotes clustering of similar pixels and fulfills requirement R2.
Finally, $L_{\mathrm{scr}}$ is designed to allow the user to guide the segmentation. In this work, however, the authors utilize it as a memory mechanism between iterations of segmentation and the outlier rejection process applied to multi-frame data. The scribble loss is also implemented as a sparse categorical cross-entropy and penalizes predictions that are not congruent with the pixels indicated by scribbles. This allows high-confidence predictions from the current iteration to be utilized in subsequent iterations, which aids in the "training" of the unsupervised segmentation.
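The three losses can be sketched in PyTorch as follows; this is a hedged re-expression of Equations (2) and (3), not the authors' exact implementation (tensor layouts and the seed-mask encoding with -1 for "no seed" are assumptions):

```python
import torch
import torch.nn.functional as F

def usp_losses(response, seed_mask=None, alpha=1.0, beta=0.5):
    """response: (B, C, H, W) raw response map from the last batch-norm layer.
    seed_mask: optional (B, H, W) long tensor of seed labels, -1 where no seed."""
    target = response.argmax(dim=1)            # pixel-wise pseudo labels (no grad)
    # L_sim: cross-entropy between the response map and its own argmax (R1, R3).
    l_sim = F.cross_entropy(response, target)
    # L_con: mean absolute difference between horizontally / vertically adjacent
    # entries of the response map, penalizing label discontinuity (R2).
    l_con = (response[:, :, 1:, :] - response[:, :, :-1, :]).abs().mean() + \
            (response[:, :, :, 1:] - response[:, :, :, :-1]).abs().mean()
    loss = l_sim + alpha * l_con               # Equation (2)
    if seed_mask is not None:                  # semi-supervised mode
        valid = seed_mask >= 0
        if valid.any():                        # L_scr at seed pixels (Equation 3)
            l_scr = F.cross_entropy(response.permute(0, 2, 3, 1)[valid],
                                    seed_mask[valid])
            loss = loss + beta * l_scr
    return loss
```

A training loop would then, for each frame, repeatedly forward the frame through the network, compute this loss, and step an optimizer until the class count or epoch cap described above is reached.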
The original implementation used the stochastic gradient descent (SGD) optimizer to minimize the combined $L_{\mathrm{sim}}$, $L_{\mathrm{con}}$, and $L_{\mathrm{scr}}$. Instead, the authors substituted SGD with the Adam optimizer, which improved convergence and thus decreased compute time (Kingma & Ba, 2014).

SSIM matrix and inlier detection
A problem with unsupervised segmentation is that, while the various semantic classes are assigned different labels, which argmax index corresponds to the ROI cannot be directly resolved. For spalling damage, the ROI is likely located in the central region, which is the foreground, while clusters close to the border of the bounding box are background. However, spalling can be concave, and other damage types, such as cracks, occupy small regions that need not span the central region. Thus, although the authors process frames containing the ROI, they cannot directly determine which cluster is the segmented ROI. To address this issue, image patches in the pre-processing step are cropped larger than the original bounding box detected by damage detectors. This ensures that the background covers more area (pixels), which improves segmentation performance and allows the ROI to be discerned from the background by simply sorting the pixel counts (e.g., the background will correspond to the most numerous pixel value).
The magnification factor $m$, the amount by which the bounding box is enlarged, is determined by a simple geometric relationship. Consider the rectangular bounding box that includes the ROI (e.g., a spalling defect), where $x$ and $y$ are its width and height in pixels. A larger region, expressed as a magnification factor $m$ of the original bounding box, is cropped. The idea is that the area of the background, $(m^2 - 1)xy$, should be larger than the maximum area of the foreground, $xy$, which means $m^2 - 1$ must be greater than 1. Thus, the minimum $m$ is $\sqrt{2}$. This study uses a rounded-up value of 1.5.
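Written out, the constraint is:

$$m^2 xy - xy \;\ge\; xy \;\Longrightarrow\; m^2 \ge 2 \;\Longrightarrow\; m \ge \sqrt{2} \approx 1.41,$$

which rounds up to the value of 1.5 used in this study.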
The quality of the semantic segmentations can vary because the unsupervised segmentation method is nondeterministic: the network parameters are randomly initialized and stochastically optimized (e.g., by the Adam optimizer) at inference. That is, the frames do not produce identical segmentation results in the segmentation step. Also, as stated earlier, USP incorporates multiple frames to improve the quality of the segmentation. However, some frames could be degraded by environmental factors such as low lighting and motion blur. Thus, it is necessary to reject incorrect segmentations (outliers) and prevent them from influencing the final segmentation result. Due to the nature of unsupervised methods, there is no ground truth against which to check whether the segmentation results are correct. Thus, the authors utilize similarity comparisons between the segmented frames, on the assumption that inlier segmentations form the majority and that the similarity among inlier segmentations is higher than the similarity among outliers (i.e., wrong segmentations differ more from one another than correct segmentations do). To identify outliers, each segmentation result is compared to every other using the structural similarity metric (SSIM), which returns a value between 0 (no similarity) and 1 (identical images) (Wang et al., 2004).
The SSIM metric was originally developed to quantify image quality between two images, for example, the quality of compressed images (Wang et al., 2004). Recently, the SSIM metric has become popular in machine learning applications and is valued as a measure of spatial and visual similarity (Nilsson & Akenine-Möller, 2020). In this work, the authors compute the SSIM for each pair of segmentation masks, such that if the SSIM of a pair is above some threshold, the two masks are considered similar. Inliers are defined by similarity to the majority of the other masks.
The SSIM threshold is used to discern between inlier and outlier segmentations and was determined empirically. Initially, the SSIM threshold is set at 0.6 and is increased by 0.05 with each complete iteration (i.e., segmentation and outlier rejection followed by stochastic consensus). A higher SSIM threshold encourages the segmentations to converge (i.e., become more congruent) with each iteration. After identifying inlier segmentations, and if inliers outnumber outliers, only the inlier masks are passed on to stochastic consensus (Subsection 3.4).
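A minimal sketch of the pairwise SSIM test follows, using scikit-image's `structural_similarity`; the majority rule here is an illustrative reading of "similar to the majority of other images," not the authors' exact criterion:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def find_inliers(masks, threshold=0.6):
    """Compare every pair of segmentation masks with SSIM and keep the masks
    that are similar to more than half of the others."""
    n = len(masks)
    similar = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            s = ssim(masks[i].astype(float), masks[j].astype(float),
                     data_range=float(masks[i].max() - masks[i].min()) or 1.0)
            similar[i, j] = similar[j, i] = s >= threshold
    return [i for i in range(n) if similar[i].sum() > (n - 1) / 2]
```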

Stochastic consensus
The purpose of the stochastic consensus process is to generate seed points, which are used as input in the next iteration of unsupervised semantic segmentation. Seed points are a form of pseudo-labeling, where high-confidence predictions are used as supervision to enhance the training process. As mentioned in Subsection 3.3.1, seed points are used to generate the scribble loss in Equation (3).
As the output segmentation masks in each iteration are used to create pseudo-labels, the authors expect that some labels might be incorrect or need improvement. To ensure sufficient supervision while minimizing incorrect labeling, the authors sample a small percentage of high-confidence pixels (i.e., seed points) from the inlier segmentation masks; this percentage is referred to as the seed point coverage. If all inlier masks predict the same class at a pixel location, that location becomes a seed point for that class. These seed points, which encode the multi-frame consensus, are fed back into the "guided" unsupervised segmentation network.
USP maintains a counter that is incremented each time a seed point mask is generated, to keep track of the iterations. The counter provides a break condition: the loop terminates when the iteration count reaches the maximum or when no outlier frames are detected. Upon reaching such a break condition, stochastic consensus returns a single segmentation from the set of inlier segmentations. The set of inlier segmentations is treated as a voting ensemble, where the prediction at each pixel location is the majority prediction.
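The consensus and voting steps can be sketched as follows; the -1 "no seed" encoding and the default coverage value are assumptions for illustration, not values from the paper (Table 1 lists the actual hyperparameters):

```python
import numpy as np

def stochastic_consensus(inlier_masks, coverage=0.05, rng=None):
    """At randomly sampled pixel locations, emit a seed label where all inlier
    masks agree; -1 marks pixels with no seed."""
    rng = rng or np.random.default_rng()
    stack = np.stack(inlier_masks)                    # (N, H, W)
    agree = (stack == stack[0]).all(axis=0)           # pixel-wise label consensus
    sampled = rng.random(stack.shape[1:]) < coverage  # stochastic pixel sampling
    return np.where(agree & sampled, stack[0], -1)

def majority_vote(inlier_masks):
    """Final segmentation: per-pixel majority label across the inlier ensemble."""
    stack = np.stack(inlier_masks).astype(int)
    counts = np.apply_along_axis(np.bincount, 0, stack,
                                 minlength=stack.max() + 1)  # (labels, H, W)
    return counts.argmax(axis=0)
```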

EXPERIMENTAL STUDY
Spalling is one of many defects that inspectors look for during routine bridge inspections. The dimensions of the spalling are key parameters for determining the condition of a bridge component (MTO, 2008). In this experiment, the authors test the performance of USP on spalling damage in an in-service bridge. The experiment was conducted at the Gardiner underpass located on Wickman Road, Toronto, Ontario, Canada, which is a pedestrian-accessible road tunnel running under the Gardiner Expressway. The underpass contains spalling damage on the sides of the tunnel that are accessible at ground level, shown in Figure 6. In this test, a Microsoft HoloLens 2 (HL2) was used to collect sequences of images of two different spalling regions from different angles. The HL2's spatial mapping and tracking capabilities enabled the recording of the camera pose at the moment each image was captured, as well as a 3D spatial mesh of the environment. The HL2 uses its own proprietary SLAM algorithm, which has been demonstrated to produce sub-centimeter accuracy in both indoor and outdoor settings (Al-Sabbag et al., 2022).
From each set of images, an RF was chosen by the inspector in the field using a custom HL2 interface, allowing manual selection of the bounding box that encompasses each spalling region. Note that this process could be automated if a bounding box detector were applied continuously to all frames. The corners of the selected bounding box are then anchored onto the spatial mesh; these will be referred to as the 3D bounding box points. The 3D bounding box points are used to fit a plane, and the distance from the fitted plane to the reference image is found. Then, the homography is computed using Equation (1).
Two spalling defects (D1 and D2) were found on the wall of the bridge tunnel and tested in this experiment. The authors captured eight images using the HL2, each with a resolution of 2272 by 1278 pixels. Figure 7a shows these images in temporal order (left to right and top to bottom); the image with the red border was chosen as the RF. The bounding box of the spalling region on the RF is manually annotated in Figure 7b (marked in red). The box in blue is the region to be cropped, which is 1.5 times larger than the bounding box (see Subsection 3.3.2).
Once the homography between the RF and the other frames has been estimated, the images are transformed to the RF and cropped per the selected bounding box. Figure 7c shows the cropped images of D1 and D2 (D1 and D2 patches), whose resolutions are 695 by 375 pixels and 491 by 240 pixels, respectively. Finally, a Gaussian blur with a kernel size of 5 is applied to the patches to further reduce high-frequency noise, which aids the convergence of the segmentation. The Gaussian kernel size is selected to decrease $L_{\mathrm{con}}$ by increasing the similarity between adjacent pixels; however, this may not be ideal for images whose features are not sufficiently distinct (e.g., very small or very dark images).
After preprocessing, the patches are input into the unsupervised segmentation and outlier rejection process. A workstation with a single Nvidia Titan V GPU is used to compute the segmentation. Due to hardware limitations, each frame was processed sequentially rather than in parallel. The authors have created a Kubernetes deployment with GPU-enabled Docker containers, along with multi-thread-capable scripts, to support mobile compute servers for future deployment. In this experiment, the authors found that each frame takes between 10 and 15 s in the first iteration and between 3 and 5 s in subsequent iterations. For eight sequential frames, the total processing time ranges between 100 and 150 s. Model parameters are kept consistent for both D1 and D2 and are summarized in Table 1.
Three iterations of unsupervised segmentation and stochastic consensus were performed for both D1 and D2. It was observed that, in most cases, the model converges, without outlier frames, by the second iteration (the first iteration with seed points). Figure 7d shows samples of inlier and outlier segmentations for each spalling defect; the first pair of segmented frames are inliers and are visibly more similar to each other than the following pair, which are outliers. Each color in the segmentation represents a different class and is randomly generated for visual effect. Next, the inlier frames are used to generate seed point priors, shown in Figure 7e. Seed points generated from the same cluster are marked in the same color. Through the generated seed points, the segmentation starts to take shape, where higher point densities represent higher confidence and sparse regions indicate greater uncertainty. These seed points are then used by the unsupervised segmentation to improve the segmentation in the next cycle.
After the final cycle, the inliers are used to create the segmentation by ensembling the segmented frames. Figure 7f shows the ground truth and predicted spalling segmentations for D1 and D2, overlaid on each RF.
To evaluate the segmentation accuracy, the authors compare the ground truth segmentation (manual) and the predicted segmentation using the mean intersection over union (mIoU) metric. To evaluate the impact of the proposed USP, the results of an ablation study are included in Table 2. The ablation study considers the average accuracy of unsupervised segmentation of a single frame (Single frame), of the inlier segmentations (Inliers), and of an ensemble of inlier segmentations (by majority vote).
The results show that unsupervised segmentation of single frames can generate erroneous or outlier segmentations, which decreases the average segmentation accuracy. Using the proposed outlier rejection mechanism, outlier or incorrect segmentations can be eliminated, which increases the average accuracy. Using known pose priors to generate an ensemble of inlier segmentations also has a positive effect on accuracy. Finally, using high-confidence pixel predictions (seed points) to refine additional iterations of unsupervised segmentation provides the best overall results.
The authors conducted additional experiments on four image sequences that capture different instances of damage (Figure A1a-d in the Appendix). The first and second columns (four images) in Figure A1 are samples of the raw images and their frames, which were manually selected to encapsulate the ROI (damage). The third and fourth columns are the ground-truth and predicted segmentations overlaid on each RF, in the same format as Figure 7f. The bounding box can be utilized as a bound, such that predictions falling outside the prescribed bounding box are automatically removed. Table 3 shows the mIoU accuracy for each case. Overall, USP separates the damage boundary from the background well. In summary, USP improves segmentation accuracy by utilizing image poses and high-confidence pixel predictions to refine the unsupervised segmentation of spalling damage. By utilizing a randomly initialized model and no additional images, the authors sought to highlight the impact of the USP methodology itself.

SENSITIVITY STUDY
In this section, the authors conduct a sensitivity study of USP under environmental variations. Poor lighting and low-feature environments are a reality of real-world structural inspections and can negatively affect sensor performance. For example, poor lighting can cause motion blur in video frames, and a low-feature environment (e.g., a clean surface or repetitive patterns) can result in poor pose estimation. Incorporating multiple frames is one of the major benefits of USP, making it less sensitive to these sources of error. In this sensitivity study, the authors explore the effects of motion blur and pose estimation error on the segmentation results of USP. Note that the conditions during the experiment in Section 4 were favorable: clear images could be captured, and the defects and their locations could be localized with high accuracy. Thus, the authors simulate the two effects using the imgaug library (Jung et al., 2020), which is widely used for image augmentation in machine learning.

Motion blur
Motion blur is caused by rapid camera or object movement, or by long exposure, and is characterized by streaking in the direction of movement. A poorly lit environment (e.g., the underside of a bridge) requires long exposure times to capture clear images, causing motion blur. Generally, motion blur has a negative effect on segmentation performance because it reduces the sharpness of the ROI boundary and the quality of the features in the image.
To simulate motion blur, the authors utilize the motion blur image corruption function in imgaug, which was proposed by Hendrycks and Dietterich (Hendrycks & Dietterich, 2019; Jung et al., 2020). Hendrycks and Dietterich simulate motion blur severity by applying a Gaussian blur of increasing kernel size in a random direction. The function offers five severity levels, of which the authors utilize severity 1 and severity 2, corresponding to (radius, sigma) pairs of (10, 3) and (15, 5), respectively (Hendrycks & Dietterich, 2019). Sample images processed by the imgaug library's motion blur are shown in Figure 8.
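For reference, this corruption is exposed in imgaug's `imgcorruptlike` module (which wraps the Hendrycks & Dietterich corruptions and requires the optional `imagecorruptions` package); a synthetic image is used below so the snippet runs standalone:

```python
import numpy as np
from imgaug.augmenters import imgcorruptlike

# A stand-in frame; in the study this would be a pre-processed image patch.
image = np.random.randint(0, 255, (240, 480, 3), dtype=np.uint8)

blur_s1 = imgcorruptlike.MotionBlur(severity=1)(image=image)  # mild streaking
blur_s2 = imgcorruptlike.MotionBlur(severity=2)(image=image)  # stronger streaking
```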
To test the effect of motion blur on USP, synthetic motion-blurred images are included in the same data set used in Section 4. The experiment is repeated with 10% to 90% of the images affected by each motion blur level (severity 1 and 2).
Then, the ROI in the images is segmented and compared with the respective ground truth using the mIoU metric. A summary of the results is shown in Figure 9. Solid and dashed lines represent motion blur severity 1 and 2, respectively, and grey and black represent the test data sets D1 and D2. For example, "D1 severity 1" indicates the segmentation performance for damage 1 (D1) when a certain percentage of images (x-axis) is affected by motion blur of severity 1.
From Figure 9, a couple of trends can be observed. First, a broad decline in mIoU is observed as the percentage of images affected by motion blur climbs above 50%. This is an expected result: the more images affected by motion blur, the more difficult it is for stochastic consensus to find inlier segmentations and seed points to propagate. Next, between motion blur severity 1 and 2, the more severely affected images have on average 4.8% lower mIoU, although the difference is not significant. This also matches intuition: the more severe the motion blur, the greater the distortion of the ROI and the corruption of the features in the image.

Pose estimation error
Many factors can affect the performance of pose estimation, including but not limited to sensor performance, the SLAM or SfM pose estimation itself, and the external environment (e.g., background textures). In the application of USP, pose estimation errors manifest as perspective distortion errors arising from inaccurate homography estimation between the RF and the other images.
To simulate pose estimation error, imgaug's perspective transform function was used to apply random errors in the perspective transformation. Sample images with different perspective transformation errors are shown in Figure 10. Figures 10b and 10c show that the ROI locations in the cropped frames deviate slightly from the original location in Figure 10a. The percentage indicates how far the perspective transformation's corner points may be displaced from the image's corner points, from which a random four-point transformation is applied (Jung et al., 2020).
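For reference, this corresponds to imgaug's `PerspectiveTransform` augmenter, whose `scale` parameter plays the role of the corner-displacement percentage quoted above; again, a synthetic image keeps the snippet standalone:

```python
import numpy as np
from imgaug import augmenters as iaa

# A stand-in frame; in the study this would be a pre-processed image patch.
image = np.random.randint(0, 255, (240, 480, 3), dtype=np.uint8)

jitter_5 = iaa.PerspectiveTransform(scale=0.05)(image=image)   # ~5% corner shift
jitter_10 = iaa.PerspectiveTransform(scale=0.10)(image=image)  # ~10% corner shift
```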
Similar to the motion blur tests, images with perspective transformation errors are included in the data set, and the experiments are repeated with an increasing percentage of such images. A summary of the results is shown in Figure 11; all legends are the same as in Figure 9, except for the distortion types.
As with motion blur, a mild decrease in mIoU is observed as more affected images are included in the data set, and a striation between the less distorted and more distorted images can be observed. These trends are expected because greater severity of the perspective transformation correlates with a greater change in the geometry and location of the ROI, and a higher percentage of affected images reduces the effectiveness of stochastic consensus.

Remarks
In both the motion blur and perspective distortion studies, it can be observed that USP is resilient until more than 50% of the images are corrupted. This resilience is likely attributable to outlier rejection, which removes incorrectly segmented frames.

FIGURE 1 Example of classification, detection, and segmentation of a defect (spalling): the classification task (left) determines whether spalling damage exists; the detection task (middle) identifies the location of the spalling (bounding boxes) on the image; the segmentation task (right) estimates the exact pixel boundary of the spalling (dark regions).

FIGURE 3 Sample images in each key process of USP, described in Figure 2.

FIGURE 4 Image rectification process using the poses from camera a in (a) and camera b in (b), estimated relative to the map origin frame.

FIGURE 5 Unsupervised semantic segmentation model.

FIGURE 6 Overview of the test bridge: the experiment was conducted on the opposite oversized gravel shoulder.

FIGURE 7 Outcomes of the USP process for D1 (left column) and D2 (right column): (a) sequence of the images collected (RFs are marked with red), (b) annotated bounding box (red) and cropped region (blue) for a frame, (c) pre-processed frames (RFs are marked with red), (d) samples of inlier (first pair) and outlier (last pair) segmented frames, (e) seed points from stochastic consensus after one iteration of USP, and (f) ground truth (left) and predicted (right) segmentation overlaid on the RF.

TABLE 1 Hyperparameters for the segmentation network and stochastic consensus.

FIGURE 8 Sample images corrupted by synthetic motion blur: (a) original image (D1) with no motion blur, (b) image with severity 1 motion blur, and (c) image with severity 2 motion blur.

FIGURE 9 Segmentation sensitivity to motion blur.

FIGURE 10 Sample images corrupted by pose estimation errors: (a) original image (D1), (b) image with 5% perspective transform, and (c) image with 10% perspective transform.

FIGURE 11 Segmentation sensitivity to pose estimation errors.

TABLE 2 mIoU (percentage) in different scenarios for D1 and D2.