Multiattribute multitask transformer framework for vision‐based structural health monitoring

Using deep learning (DL) to recognize building and infrastructure damage from images is becoming popular in vision-based structural health monitoring (SHM). However, many previous studies focus solely on the existence of damage in the images, directly treating the problem as a single-attribute classification, or separately address the location or extent of the damage as a localization or segmentation problem. The abundant information in images from multiple sources and the intertask relationships are not fully exploited. In this study, the vision-based SHM problem is first reformulated into a multiattribute multitask setting, where each image carries multiple labels describing its characteristics. Subsequently, a general multiattribute multitask detection framework, namely ϕ-NeXt, is proposed, which introduces 10 benchmark tasks covering classification, localization, and segmentation. Accordingly, a large-scale data set containing 37,000 pairs of multilabeled images is established. To pursue better performance in all tasks, a novel hierarchical framework, namely the multiattribute multitask transformer (MAMT2), is proposed, which integrates multitask transfer learning mechanisms and adopts a transformer-based network as the backbone. Finally, for benchmarking purposes, extensive experiments are conducted on all tasks and the performance of the proposed MAMT2 is compared with several classical DL models. The results demonstrate the superiority of MAMT2 in all tasks, which reveals a great potential for practical applications and future studies in both structural engineering and computer vision.


Multiattribute multitask problem
In recent years, there has been an increasing trend of using machine learning (ML) in structural health monitoring (SHM) (Lin et al., 2022; Rafiei & Adeli, 2017b, 2018; Soleimani-Babakamali et al., 2022). In particular, applying deep learning (DL) in vision/image-based SHM (Chun et al., 2022; Jang et al., 2019; Li et al., 2022; Pan & Yang, 2022; Zhang & Lin, 2022; Zhao et al., 2022) shows a significant performance improvement over traditional computer vision (CV) methods, for example, edge detection based on extracted vision features (Cha et al., 2017). However, many studies (Cha et al., 2017; Dorafshan et al., 2018; Xu et al., 2018) are mainly concerned with the existence of structural damage in the images and directly treat the problem as a single-attribute classification where each image has only one label, for example, damaged or undamaged. In reality, vision patterns in structural images provide abundant information well beyond the damage state, for example, the scale of the object, the type of structural component, and the severity of the damage. This is similar to a multiattribute recognition problem (Chen et al., 2012; Lampert et al., 2009), where such attributes can be very informative for rapid postdisaster reconnaissance efforts and, ultimately, decision making. Moreover, these attributes may have hidden relationships with each other, that is, a certain hierarchy. For example, knowing that a reinforced concrete structural wall is heavily cracked or has significant spalling of the concrete cover of its reinforcement is positively correlated with the fact that this wall is indeed damaged. Therefore, it is important to rethink the framework of vision-based SHM and formulate it as a multiattribute classification problem considering the interattribute relationships. Furthermore, based on domain expertise, prior knowledge on attributes of direct interest to SHM can be prioritized; for example, compared to knowing the color of a building, whether it is cracked or not is of higher priority to SHM. For each of these specific attributes, the aim is to recognize the most appropriate label to describe its characteristics among several choices; for example, the attribute of structural component type may have labels such as "beam," "column," or "wall." Therefore, each attribute can be treated as a single classification task and each image will go through multiple tasks.
Analogous to the CV domain, besides the image classification task, object detection (Liang, 2019; Pan & Yang, 2020; Yang et al., 2022) and segmentation (Sajedi & Liang, 2021; Zheng et al., 2022) are two important applications in vision-based SHM. Technically, these two tasks can sometimes be treated as downstream tasks of the classification task, for example, using pretrained parameters from the classification task to fine-tune for localization and segmentation (He et al., 2017; Liu et al., 2021). There have been many studies of object detection and segmentation in vision-based SHM (Cha et al., 2018; Zhou et al., 2022). However, most of these studies considered either task independently. Similar to the multiattribute classification, the results of the localization and segmentation tasks can be redefined as two special attributes, that is, the object location is represented by the coordinates of the bounding box of the target object and the pixel-wise labels annotate the class of each pixel of the image. Finally, by adding these two attributes to the multiple classification attributes, vision-based SHM can be comprehensively reformulated and unified as a multiattribute multitask problem. It should be noted that vision-based SHM has a broad scope beyond the above-mentioned tasks. Other practical vision tasks such as structural response (e.g., displacement) monitoring and external load (e.g., vehicle load) estimation, defined as global-level tasks (Dong & Catbas, 2021), also have the potential to be unified into the multiattribute multitask setting, where their outputs can be treated as new attributes or as downstream tasks of the classification. However, only the classical vision tasks (classification, localization, and segmentation) are covered and discussed in this paper.
Based on an extensive literature search, no related studies have comprehensively considered multiple structural attributes and different vision tasks simultaneously, and only a few multitask studies (Bull et al., 2022; Hoskere et al., 2020; Huang et al., 2019) are reported, which conducted early explorations on small-scale data sets. This can be mainly attributed to the shortage of, and difficulty in obtaining, such multiattribute data sets; in particular, acquiring multiple labels for each attribute is much more costly than single-attribute labeling, which significantly increases the labeling efforts of human annotators. In CV, besides common benchmark data sets for validating new algorithms and models, for example, MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009), several multiattribute data sets were also developed and open-sourced, for example, AwA (Lampert et al., 2009), the Clothing Attributes Dataset (Chen et al., 2012), and CelebA (Liu et al., 2015). Compared to these data sets, it is more challenging to construct multiattribute structural image data sets for the following reasons: (1) Structural images are recognized in a more abstract way. In CV benchmark data sets, the contents of the images are usually objects, for example, animals and ships, whose vision patterns are regular and easy to understand. On the other hand, in structural images, the description of some attributes, for example, the damage severity of a column, is based on the subjective judgement of experts via their experience, which is difficult to standardize from the vision patterns perspective.
(2) Labeling structural images requires specific domain knowledge, for example, proficiency in recognizing structural component and/or damage types, which increases the cost and difficulty of obtaining a large-scale data set. This barrier is similar to those facing the construction of medical image data sets (Irvin et al., 2019).
(3) Missing and wrong labels for multiattribute tasks are unavoidable in the annotation process. For example, some attributes in structural images may be ambiguous (e.g., certain damage patterns of a reinforced concrete wall may fall within the range of an intermediate damage severity, i.e., between minor and moderate damage), causing human annotators to make mistakes or even skip labeling.
For some large-scale data sets, due to time and resource limitations, only part of the data or their attributes are labeled while the rest remain unlabeled. These difficulties lead to missing/wrong labels for certain attributes, which conventional ML/DL methods are incapable of handling effectively.
To the best of the authors' knowledge, in structural engineering, and especially in vision-based SHM, there are no open-sourced multiattribute multitask benchmark image data sets aside from the published PEER Hub ImageNet (ϕ-Net) (Gao & Mosalam, 2020). Therefore, establishing such data sets is important not only to structural engineers, to expand the achievements of previous studies (especially the multitask ones; Bull et al., 2022; Hoskere et al., 2020; Huang et al., 2019) with large-scale data volume and high-variety tasks, but also to CV activities in general, to explore state-of-the-art techniques in practical engineering applications.

Multitask learning (MTL) and transformer
The multiattribute multitask problems can be better solved by adopting MTL techniques (Long et al., 2017; Ruder, 2017). By sharing representations between related tasks, MTL can improve the model with better generalization to all tasks. MTL is beneficial because it has an implicit data augmentation mechanism, which enables a better representation by averaging noise patterns. For certain tasks with noisy data or only limited high-dimensional data, as in the vision-based SHM problem, MTL can utilize additional information, for example, relevance or irrelevance from other tasks, to focus the model's attention on key features. Furthermore, MTL can be treated as a regularizer, which reduces the risk of overfitting (Ruder, 2017). To date, various MTL methods have been developed for different scenarios. For example, Long et al. (2017) proposed a multilinear relationship network (MRN), which has multiple heads represented by multiple fully connected (FC) layers for different tasks. By placing tensor priors on the multiple task-specific FC layers, the model is able to jointly learn transferable features and multilinear relationships between tasks. Lu et al. (2017) redesigned a compact MTL architecture following a bottom-up approach. The architecture starts with a thin network and then widens dynamically during training, which results in a tree-like deep network, where similar tasks are grouped in the top task-specific layers. Misra et al. (2016) proposed cross-stitch units, which share representations between different tasks as linear combinations. Through training, these units can learn the optimal linear combinations and find the best shared representations for a given set of tasks. Kendall et al. (2018) derived a principled loss function considering the homoscedastic uncertainty of each task, which improves MTL performance on scene understanding tasks with a unified architecture for semantic segmentation, instance segmentation, and per-pixel depth regression. In short, through new architectural designs, adaptive training methods, parameter sharing, adapted loss functions, and so forth, MTL achieves a variety of useful implementations.
The important questions in this study include how to adapt MTL to the proposed SHM problems and how to implement it efficiently to achieve state-of-the-art performance. Considering the internal relationships between tasks, for example, hierarchy and causality, transfer learning (TL) techniques (Gao & Mosalam, 2018; Pan & Yang, 2009) and the latest powerful transformer network (Vaswani et al., 2017) are innovative supplements to the original MTL methods. In vision-based SHM, numerous studies (Feng et al., 2019; Gao & Mosalam, 2018; Lin et al., 2022; Mosalam et al., 2019) have demonstrated that, through TL, DL models can inherit better parameters from pretraining on a large-scale, information-rich source domain, which helps the models utilize the abundant features and domain relationships and also reduces the dependency on a large amount of data in the target domain (Gao & Mosalam, 2022). As mentioned above, a new architectural design is helpful in promoting MTL performance, and the transformer is such a new network structure. The transformer does not include any recurrence or convolution operations; instead, it is constructed in an encoder-decoder architectural style and relies heavily on attention mechanisms, for example, self-attention and multihead self-attention (MSA). Compared to the classical convolutional neural network (CNN), self-attention can access a larger receptive field, which can better capture the hidden relationships between inputs and also across tasks (Vaswani et al., 2017). The transformer was first developed in the natural language processing (NLP) field but was later shown to achieve equivalent or even better performance than CNNs, with multiple variations in certain scenarios, for example, the Vision Transformer (ViT) (Dosovitskiy et al., 2020) and the Swin transformer (Liu et al., 2021). Therefore, a novel computational framework integrating MTL, TL, and transformers is a good candidate to solve multiattribute multitask vision-based SHM problems and achieve state-of-the-art performance on them.

Objectives and contributions
Four major contributions of this paper are listed as follows: (1) The vision-based SHM problem is reformulated into a multiattribute multitask problem to ubiquitously cover the vision tasks, for example, classification, localization, and segmentation. (2) A multiattribute multitask structural image detection framework along with a data set called "Next PEER Hub ImageNet" (ϕ-NeXt) is established, and a general multiattribute data set forming algorithm is proposed. (3) A novel hierarchical framework called the "multiattribute multitask transformer" (MAMT2) is proposed, which integrates the knowledge of MTL, TL, and transformers to achieve state-of-the-art performance. (4) Extensive benchmark experiments are conducted and analyzed under different approaches and scenarios, which provide a reference for future relevant studies.

Gao and Mosalam (2018, 2020) developed a general structural image detection framework, namely ϕ-Net, including several basic tasks for the purpose of automated damage assessment, where each task represents one structural attribute. According to the logic of the hierarchical framework in ϕ-Net, a structural image is processed layer by layer following tree branches, namely Pixel, Object, and Structural. However, previous work on ϕ-Net focused on the multiclass classification problems of eight key attributes, namely (1) scene level, (2) damage state, (3) concrete cover spalling condition, (4) material type, (5) collapse mode, (6) component type, (7) damage level, and (8) damage type, and these attributes were treated as eight independent tasks, which does not fully explore and utilize the internal relationships among all attributes.

MULTIATTRIBUTE MULTITASK SHM PROBLEM STATEMENT
Based on the previous ϕ-Net studies, a new multiattribute multitask SHM problem is redefined herein. First, the same eight classification tasks are maintained as the fundamental ones. Subsequently, damage localization and segmentation tasks are appended to them as downstream tasks. Finally, these 10 tasks are reorganized into a similar but simplified framework, named ϕ-NeXt, as shown in Figure 1. In this paper, damage localization and segmentation tasks for only the concrete cover spalling, as a representative example, are presented. Similar to the original ϕ-Net, the number of tasks in ϕ-NeXt can be expanded based on demand, and more typical tasks, for example, locating and segmenting damage patterns of reinforcing bar exposure, steel corrosion, and masonry crushing, are listed as future extensions. The definitions of the 10 tasks are summarized below:

• Task 1: Scene level. It is defined as a three-class classification: Pixel level, Object level, and Structural level, which represents the distance toward the target from a close, midrange, and far range, respectively.
• Task 2: Damage state. It is a binary classification: Damaged and Undamaged, which is straightforward and describes the general condition of a structure's or component's health. The damage patterns include concrete cracking or spalling, reinforcing bar exposure, buckling or fracture, and masonry crushing.
• Task 3: Spalling condition. It is a binary classification: Spalling (SP) and Non-spalling (NSP), where spalling means the loss of concrete cover material (covering the reinforcing steel) from a structural component surface.
• Task 4: Material type. It is a binary case: Steel and Others, which intends to identify the construction material of the structure or component. For simplicity, all material types other than steel are grouped into Others.
• Task 5: Collapse mode. It recognizes the severity of damage that occurred in the structural-level image: Non-collapse, Partial collapse, and Global collapse.
• Task 6: Component type. It identifies the type of structural component: Beam, Column, Wall, and Others. It is conducted if the image is recognized as object level.
• Task 7: Damage level. It recognizes the severity of component damage from object-level images: Undamaged, Minor damage, Moderate damage, and Heavy damage.
• Task 8: Damage type. It describes the type of damage that occurred in structural components from object-level images based on a complex, irregular, and even abstract semantic vision pattern: Undamaged, Flexural damage, Shear damage, and Combined damage.
• Task 9: Damage localization. It localizes the damage patterns by bounding boxes. In this study, for brevity, only spalling areas are monitored.
• Task 10: Damage segmentation. It quantifies the damage by finding the whole damaged area, where each pixel has its own label, and regions of pixels with the same label are grouped and segmented as one object (class). Similar to task 9, each pixel is labeled as either SP or NSP, and the images are from task 3.

Based on the ϕ-NeXt hierarchical relationships and intertask (attribute) dependency, a structural image may go through multiple tasks where the output (label) of each task is treated as a single structural attribute. Thus, a sequential set of attributes is obtained for the purpose of structural health assessment. For details about the task definitions, for example, additional physical interpretations, refer to Gao and Mosalam (2019, 2020).

THE ϕ-NEXT DATA SET
With the above-defined tasks in the ϕ-NeXt framework, a well-labeled data set is established, which is a multitask version of the original ϕ-Net data set. In some scenarios, an online image may have been described/annotated, as one form of labeling, by different experts with respect to different attributes, for example, expert A describes its scene level (task 1) and expert B determines its damage state (task 2). Therefore, an efficient label merging algorithm is developed herein to merge labels of the same image into a multiattribute multitask setting. Gao and Mosalam (2019, 2020) open-sourced the ϕ-Net with eight independent data sets on https://apps.peer.berkeley.edu/phi-net/. Along with the image data, multiclass labels are provided for the first eight classification tasks defined in Section 2. As mentioned above, spalling damage is the focus of the damage localization and segmentation tasks herein. Therefore, only spalling-related images are selected from the ϕ-Net for these two tasks. Besides, newly collected spalling-related images are included. For tasks 9 and 10, the labeling tool "Labelme" (Wada, 2018) is used. Finally, the corresponding bounding boxes of the spalled regions and the pixel-wise labels are obtained. However, the collection and labeling of the data are still conducted in an independent manner, that is, given one image, the user does not know all attributes at the same time. In other words, one image can have multiple copies in different parts of the data set. Therefore, finding the same image is a key goal of the developed algorithm, and the Message-Digest algorithm 5 (MD5) check method (Rivest, 1992) is adopted for this purpose. Suppose there are T classification tasks; tasks T + 1 to T + L + S are the L localization plus S segmentation tasks. In the computational loops, the MD5 checksum values of all images from the different sources/data sets are computed, and all images with the same MD5 value are grouped for merging their labels.

Data collection and labeling
For the classification attributes (t ≤ T), if one attribute has more than two labels from different sources, a majority voting mechanism is used; otherwise, the label is kept. For the more complex and informative attributes, that is, the localization (T < t ≤ T + L) and segmentation (T + L < t ≤ T + L + S) attributes, the coordinates of the bounding boxes and the pixel-wise labels are stored directly. If no label is found for a certain attribute, the label "NA" is assigned, representing the missing information. Finally, the labels for each attribute are merged and saved to a JavaScript Object Notation (JSON) file. It is noted that the proposed label merging algorithm not only works for existing data, but can also be easily applied to newly collected data.
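To make the merging step concrete, the following Python sketch groups images by MD5 checksum and merges their per-attribute labels with majority voting and "NA" placeholders. It is a minimal illustration assuming a simple list-of-records input; the field names, file names, and helper functions are hypothetical and are not the released ϕ-NeXt tooling.

```python
import hashlib
import json
from collections import Counter, defaultdict

T, L, S = 8, 1, 1  # classification, localization, and segmentation attribute counts

def md5sum(path):
    """MD5 checksum used to identify identical images across sources."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def merge_labels(records):
    """records: list of (image_path, {attribute_index: label}) from all sources."""
    groups = defaultdict(list)
    for path, labels in records:
        groups[md5sum(path)].append(labels)

    merged = {}
    for digest, label_dicts in groups.items():
        attrs = {}
        for t in range(1, T + L + S + 1):
            votes = [d[t] for d in label_dicts if t in d]
            if not votes:
                attrs[t] = "NA"                                  # missing attribute
            elif t <= T:
                attrs[t] = Counter(votes).most_common(1)[0][0]   # majority vote
            else:
                attrs[t] = votes[0]                              # boxes/masks stored directly
        merged[digest] = attrs
    return merged

# Example: persist the merged multiattribute labels to a JSON file.
# json.dump(merge_labels(records), open("phi_next_labels.json", "w"))
```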
In this study, T = 8, L = S = 1, and all labels are further encoded. For the first eight classification attributes, the labels are encoded as integers from 1 to C corresponding to the C predefined classes; for example, in a three-class classification, C = 3 and y ∈ {1, 2, 3}. The "NA" label is encoded as "−1," which can be easily filtered out during training and excluded from the loss computation. Besides, one spalling localization task and one spalling segmentation task are covered. Finally, by processing the data from the ϕ-Net and the additional images, 37,000 images with multiple labels are obtained.
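For instance, in PyTorch the "−1" code can be skipped in the loss via the built-in ignore_index argument of the cross-entropy; the snippet below is a sketch of this filtering behavior (with labels shifted to the 0-based indexing PyTorch expects), not the authors' released training code.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)            # batch of 4 images, three-class task (C = 3)
labels = torch.tensor([0, 2, -1, 1])  # 0-based class indices; -1 encodes "NA"

# Samples labeled -1 contribute neither to the loss nor to its gradients.
loss = F.cross_entropy(logits, labels, ignore_index=-1)
print(loss)
```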

Benchmark data set split and statistics
For benchmarking purposes, splitting the training and test sets is an essential step in the label merging algorithm. However, as mentioned in Gao and Mosalam (2020), adopting a single fixed training-to-test splitting ratio is inappropriate for multiattribute problems because of label imbalance issues. Unlike a single-attribute task, a coarse and fixed split may disrupt the distributions of labels in the training and test sets and lead to a biased split in certain attributes; for example, all labels of the collapse mode attribute (task 5) may fall in the training set, leaving the images in the test set with no meaningful labels (i.e., NA) related to task 5. Therefore, following the recommendation from a previous study (Gao & Mosalam, 2020), a training-to-test split ratio within 8:1 to 9:1 is maintained herein across all attributes (tasks), and the label distributions are kept consistent between training and test. Since most data come directly from the already built ϕ-Net, where each image has been assigned to either the training or the test set, the split ratio check is only conducted for the newly added images, which have been identified by the MD5 check, as presented in the pseudo-code in Algorithm 1.
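The per-attribute ratio check can be expressed compactly. The sketch below (with a hypothetical data layout of one label dictionary per image) verifies that every attribute's training-to-test ratio stays within the 8:1 to 9:1 band while ignoring "NA" entries:

```python
def split_ratios(train_labels, test_labels, num_attrs=10):
    """Each argument: list of dicts {attribute_index: label or 'NA'}, one per image."""
    for t in range(1, num_attrs + 1):
        n_train = sum(1 for d in train_labels if d.get(t, "NA") != "NA")
        n_test = sum(1 for d in test_labels if d.get(t, "NA") != "NA")
        ratio = n_train / max(n_test, 1)
        flag = "OK" if 8.0 <= ratio <= 9.0 else "REBALANCE"
        print(f"attribute {t}: {n_train}/{n_test} = {ratio:.2f} [{flag}]")
```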

MAMT2 FRAMEWORK
To comprehensively address the multiattribute multitask problems, a unified framework across classification, localization, and segmentation is proposed, namely MAMT2 (refer to Figure 2). The following two subsections describe the details of this framework.

TL-based MTL
The realization of the classification differs from that of localization and segmentation, that is, identifying discrete class labels of images is more straightforward than obtaining bounding boxes and pixel-wise labels of the target objects.
The localization and segmentation tasks usually require certain networks, for example, a regional proposal network (RPN) (Lin et al., 2017), to first locate the regions of interest (RoIs). Therefore, one of the novelties of the proposed MAMT2 is its two-step design, which conducts classification and the other tasks (localization and segmentation) asynchronously and hierarchically. As illustrated in Figure 2, first, the coupled multiattribute classification problems in ϕ-NeXt are solved using a shared backbone network followed by multihead branches, where each branch corresponds to one specific task (attribute) t ∈ {1, 2, …, 8}. Each branch is constructed by a multilayer perceptron (MLP) classifier and the number of output neurons equals the number of classes C_t of the corresponding classification task. Suppose there are N^(1) data pairs {(x_i, y_{i,t})}, i ∈ {1, 2, …, N^(1)}, where y_{i,t} is the ground truth label, that is, y_{i,t} = c, c ∈ {1, 2, …, C_t}, and denote the classifier prediction as ŷ_{i,t}. The joint cross-entropy loss, L^(1), is computed as follows:

$$L^{(1)} = -\sum_{t=1}^{8} \sum_{i=1}^{N^{(1)}} p(y_{i,t}) \log p(\hat{y}_{i,t}) \quad (1)$$

where p(y) and p(ŷ) are the probability distributions of the ground truth and the predictions, respectively. Previous studies (Gao & Mosalam, 2018, 2020, 2022; Pan & Yang, 2009) have demonstrated that the features shared among the tasks contribute complementary information and are useful in enhancing accuracy and reducing training costs. It is noted that MAMT2 is a general framework, where the backbone can be any type of network, for example, a CNN, a recurrent neural network (RNN), or a transformer. In the CV domain, the CNN has served as the standard network for the past decades, and most studies in the vision-based SHM area adopted a CNN as their best choice. Recent studies (Dosovitskiy et al., 2020; Liu et al., 2021) show that transformer-based networks, for example, ViT and the Swin transformer, have started to achieve equivalent or even better performance than CNNs in classification, localization, and segmentation tasks. Therefore, for the purpose of establishing a solid benchmark performance, the Swin transformer, one of the state-of-the-art backbone networks for many CV applications (Liu et al., 2021; Xu et al., 2021), is selected in this study.
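A minimal PyTorch sketch of this first step is given below, with a generic feature extractor standing in for the Swin transformer backbone. The module names are illustrative, and the per-task class counts follow the task definitions in Section 2; this is a sketch of the described architecture, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadClassifier(nn.Module):
    """Shared backbone with one MLP classification head per attribute (Equation 1)."""
    def __init__(self, backbone, feat_dim, num_classes=(3, 2, 2, 2, 3, 4, 4, 4)):
        super().__init__()
        self.backbone = backbone                 # e.g., a Swin transformer encoder
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, c))
            for c in num_classes
        )

    def forward(self, x):
        z = self.backbone(x)                     # shared representation, shape (B, feat_dim)
        return [head(z) for head in self.heads]  # one set of logits per task

def joint_loss(logit_list, label_list):
    """Sum of per-task cross-entropies; -1 ("NA") labels are skipped."""
    return sum(F.cross_entropy(logits, labels, ignore_index=-1)
               for logits, labels in zip(logit_list, label_list))
```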
Based on the concept of TL, the shared backbone network used in classification can share key features and information with the localization and segmentation tasks. Therefore, the transformer backbone trained on the multitask classification is directly inherited by the localization and segmentation tasks. To fully benefit from the multiscale features extracted at each stage of the transformer, the backbone is further connected with a feature pyramid network (FPN), as illustrated in Figure 3. The 1 × 1 convolutional (Conv) operations are performed on the feature maps from the last layer of each stage to project their depth to a uniform space d_1. Additionally, these extended features from stages 3 to 1 are fused with the features of their succeeding stages (i.e., stages 4 to 2) along with up-sampling operations. Subsequently, another round of Conv operations with d_2 3 × 3 filters is performed, and finally, a stack of multiscale feature maps (aka pyramid feature maps) is generated, that is, P_1 to P_4. In this study, both d_1 and d_2 are taken as 256. Similar to the classical object detection framework Faster R-CNN, these pyramid feature maps are fed into an RPN (Ren et al., 2015) to generate a set of rectangular object proposals, which are parameterized to the same number of reference boxes, named anchors, on the raw input image. Following the same proposal selection procedure (Ren et al., 2015), multiple RoIs are determined on the feature maps for all scales, where each RoI is a rectangular window region of the extracted feature maps. To improve the efficiency of small target detection and relieve the misalignment issue in the feature maps, the pyramid RoIAlign operation (He et al., 2017) is performed. Inspired by the design of Mask R-CNN (He et al., 2017), two branches are added to further process the feature maps of each RoI, where one branch consists of multiple Conv layers to determine the mask for each object, and the other branch is constructed by FC layers to obtain the bounding box coordinates of the object and the class label in each box. The total loss of this step, L^(2), consists of five parts, refer to Equation (2): (1) the classification loss of the RPN, L_cls^RPN; (2) the regression loss of the RPN, L_box^RPN; (3) the classification loss of the prediction box, L_cls; (4) the regression loss of the bounding box, L_box; and (5) the average binary cross-entropy loss of the mask, L_mask. Because the first two RPN loss terms are the same as those discussed in Ren et al. (2015), they are not discussed further in this paper.

$$L^{(2)} = L_{cls}^{RPN} + L_{box}^{RPN} + L_{cls} + L_{box} + L_{mask} \quad (2)$$

The classification loss L_cls is a categorical cross-entropy loss. Suppose there are N^(2) data for the localization and segmentation problem, the number of objects in the i-th data is n_{o,i}, and the total number of objects is N_o. Denoting the ground truth class label as y and the probability distribution of the prediction as p(ŷ), L_cls is computed as follows:

$$L_{cls} = -\frac{1}{N_o} \sum_{i=1}^{N^{(2)}} \sum_{j=1}^{n_{o,i}} \log p(\hat{y}_{i,j}) \quad (3)$$

To compute the regression box loss L_box, the bounding box coordinates of the prediction and the ground truth are parameterized via the anchor box coordinates, that is, t* and t, respectively (Equation 4), where x, y, w, h represent the box center coordinates, width, and height, the superscript * represents the ground truth, and the subscript a represents the anchor:

$$t_x = \frac{x - x_a}{w_a},\; t_y = \frac{y - y_a}{h_a},\; t_w = \log\frac{w}{w_a},\; t_h = \log\frac{h}{h_a};\qquad t^*_x = \frac{x^* - x_a}{w_a},\; t^*_y = \frac{y^* - y_a}{h_a},\; t^*_w = \log\frac{w^*}{w_a},\; t^*_h = \log\frac{h^*}{h_a} \quad (4)$$

Suppose there are n_{a,i} anchors generated in the proposal of the i-th data and the total number of anchors is N_a. The regression box loss is computed by the smooth L1 loss (Equation 5) between the t* and t values of all anchor boxes (Equation 6). More details can be found in Ren et al. (2015).

$$\mathrm{smooth}_{L_1}(z) = \begin{cases} 0.5 z^2, & |z| < 1 \\ |z| - 0.5, & \text{otherwise} \end{cases} \quad (5)$$

$$L_{box} = \frac{1}{N_a} \sum_{i=1}^{N^{(2)}} \sum_{j=1}^{n_{a,i}} \mathrm{smooth}_{L_1}\left(t^*_{i,j} - t_{i,j}\right) \quad (6)$$

Suppose that in the i-th data there are n_{r,i} RoIs extracted, and each RoI has a shape of m × m × K corresponding to the m × m binary-value masks of the K classes. Each pixel in the mask is either 1 or 0, representing the existence of the object or not. The mask loss L_mask is computed as an average pixel-wise binary cross-entropy loss over the mask of the class ŷ determined from the classification branch, as follows:

$$L_{mask} = -\frac{1}{N_p} \sum_{r=1}^{N_p} \left[ y_r \log p(\hat{y}_r) + (1 - y_r) \log\left(1 - p(\hat{y}_r)\right) \right] \quad (7)$$

where N_p is the total number of pixels in the mask corresponding to class ŷ, y_r indicates whether the r-th pixel belongs to the class ŷ, and p(ŷ_r) is the probability distribution of the r-th pixel prediction.
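As a sanity check on Equations (5) to (7), the sketch below evaluates the smooth L1 box term and the pixel-wise binary cross-entropy mask term using PyTorch's built-in equivalents on illustrative tensors; the shapes and values are placeholders only.

```python
import torch
import torch.nn.functional as F

# Smooth L1 (Equations 5 and 6) between parameterized predicted and ground truth boxes.
t_pred = torch.randn(16, 4)                   # (t_x, t_y, t_w, t_h) per anchor
t_star = torch.randn(16, 4)
box_loss = F.smooth_l1_loss(t_pred, t_star)   # mean over all anchor coordinates

# Mask loss (Equation 7): average pixel-wise binary cross-entropy for the RoI masks
# of the class chosen by the classification branch.
mask_logits = torch.randn(8, 28, 28)                    # 8 RoIs, one 28 x 28 mask each
mask_target = torch.randint(0, 2, (8, 28, 28)).float()  # binary ground truth masks
mask_loss = F.binary_cross_entropy_with_logits(mask_logits, mask_target)
```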

Swin transformer backbone
There are two major advantages of adopting the Swin transformer in this study: (1) its hierarchical mechanism and (2) its shifted window multihead self-attention (SW-MSA).These two advantages are discussed below.
Figure 4: Connection between the six blocks in stage 3.

Hierarchical mechanism
First, the Swin transformer constructs hierarchical feature maps by starting from small-size patches and gradually merging neighboring patches in deeper layers. This mechanism, illustrated in Figure 3, reduces the feature map size while increasing each patch's receptive field (the region of the input image that is mapped back from the current feature map), which helps the network capture long-range relations. Following the workflow in Figure 2, given an image input X ∈ R^(H × W × 3) (H: height, W: width), and defining an image patch with a pixel resolution of p × p, the Swin transformer first partitions the input image into (H/p) × (W/p) nonoverlapping patches. Each patch is concatenated along the third channel, and the input is then transformed into a feature shape of (H/p, W/p, 3 × p × p). In this study, the input image size is rescaled to 224 × 224 and a 4 × 4 patch size is adopted. Thus, the input is partitioned into 56 × 56 patches (224/4 = 56) and then transformed to a feature shape of (56, 56, 48), where 48 = 3 × 4 × 4. These features are processed through four similar stages, whose differences are: (1) the use of a different number of Swin transformer blocks and (2) the linear embedding layer is only used in stage 1, with patch merging placed instead before the transformer blocks of the remaining stages. In stage 1, a linear embedding layer projects the features into a predetermined dimension d, that is, (H/p, W/p, d). In this study, the projected dimension d is taken as 96, inherited from the Swin-T model (Liu et al., 2021), that is, the features are further projected to a (56, 56, 96) feature space.
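The partition arithmetic can be reproduced with simple tensor reshapes. The sketch below mirrors the 224 × 224 input, the 4 × 4 patch size, and the stage-1 linear embedding to d = 96; it traces shapes only and is not the Swin transformer implementation itself.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 224, 224, 3)               # (batch, H, W, channels)
p = 4                                          # patch size

# Partition into (H/p) x (W/p) patches and flatten each patch to 48 = 3*p*p channels.
patches = x.reshape(1, 224 // p, p, 224 // p, p, 3)
patches = patches.permute(0, 1, 3, 2, 4, 5).reshape(1, 56, 56, 3 * p * p)
print(patches.shape)                           # torch.Size([1, 56, 56, 48])

embed = nn.Linear(48, 96)                      # stage-1 linear embedding, d = 96
features = embed(patches)                      # -> (1, 56, 56, 96)
```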
The features are subsequently fed into consecutive Swin transformer blocks, and this process is repeated in the remaining stages with patch merging operations. Each Swin transformer block has the same configuration, with layer normalization (LN), a shortcut connection, and a two-layer MLP, except for the type of self-attention, that is, window multihead self-attention (W-MSA) or SW-MSA, which are introduced in Section 4.2.2. Following Liu et al. (2021), the Swin transformer block is always repeated an even number of times, for example, 2, 4, and 6. In this study, 2, 2, 6, and 2 repetitions are adopted for stages 1 to 4, respectively. An example of the connection between the six consecutive transformer blocks in stage 3 is illustrated in Figure 4, where W-MSA and SW-MSA are used alternately.
To realize a hierarchical representation, the input patches of stages 2 to 4 are merged through merging operations to shrink the feature size from the shallow to the deep layers of the network, that is, from the input to the output. The patch merging operation downsamples the input by a merging rate of r by grouping r × r regions and concatenating the patches depth-wise, expanding the depth by r × r. An example of a 4 × 4 patch merging using a 2 × 2 merging rate is illustrated in Figure 5. It first separates the patches into four 2 × 2 groups with indexing, then merges the subparts with the same index into one group, and finally concatenates them in the depth channel. This operation shrinks the feature dimension by 2 and expands the depth by 4. In this study, a merging rate r = 2 is adopted. To avoid overly expanding the depth (×4 after each merging), a 1 × 1 convolution operation is performed depth-wise to reduce the depth expansion by a factor of 2. Therefore, the feature shapes of stages 1 to 4 in the MAMT2 backbone are (56, 56, 96), (28, 28, 192), (14, 14, 384), and (7, 7, 768), respectively. As illustrated in Figure 3, through patch merging, the transformer blocks form a pyramid-like feature representation, which can utilize both fine and coarse features (from small and large patches) to learn the scale, to extract multiscale semantic and contextual information, and to ultimately improve the performance in image localization and segmentation. This is consistent with the multiattribute multitask objectives.
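A sketch of one 2 × 2 patch merging step under the layout described above is shown next; the depth reduction is written as a linear layer over the channel dimension, which is equivalent to the depth-wise 1 × 1 convolution mentioned above.

```python
import torch
import torch.nn as nn

def patch_merge(x, out_proj):
    """x: (B, H, W, C) -> (B, H/2, W/2, 2C): group 2x2 patches, then reduce depth."""
    x0 = x[:, 0::2, 0::2, :]    # top-left patch of each 2x2 group
    x1 = x[:, 1::2, 0::2, :]    # bottom-left
    x2 = x[:, 0::2, 1::2, :]    # top-right
    x3 = x[:, 1::2, 1::2, :]    # bottom-right
    merged = torch.cat([x0, x1, x2, x3], dim=-1)   # depth expands by 4 (-> 4C)
    return out_proj(merged)                        # project 4C -> 2C

x = torch.randn(1, 56, 56, 96)                     # stage-1 output
proj = nn.Linear(4 * 96, 2 * 96)                   # halves the 4x depth expansion
print(patch_merge(x, proj).shape)                  # torch.Size([1, 28, 28, 192])
```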

Shifted window multihead self-attention
In the field of vision-based SHM, for example, the benchmark tasks defined in ϕ-NeXt, global information is as important as local information. For example, task 1 (scene level) can be regarded as depending on global information, utilizing the scales of the different objects in the image for the prediction; on the contrary, task 10 (damage segmentation) can be viewed as utilizing the local information from the damage pixels, usually surrounding the damaged regions. Therefore, a general detection framework such as MAMT2 should simultaneously consider both global and local information, and the W-MSA and SW-MSA adopting MSA computation in the transformer backbone, along with the above-mentioned hierarchical mechanism, satisfy this requirement.
To obtain global self-attention, MSA computes the relationships of each patch against all other patches, which results in a quadratic complexity with respect to the number of patches. The image resolution in the ϕ-NeXt data set is 224 × 224, and it is inefficient, and accordingly impractical, to adopt the standard MSA at such a resolution. Therefore, window-based methods, that is, W-MSA and SW-MSA, are used in the present MSA computation (Figure 6). The standard W-MSA partitions the input data/features into multiple equal square windows, where each window of size M covers an M × M region of the feature maps obtained from the previous stage, and then self-attention is computed within each window, refer to Figure 6a.
Similar to the definitions in Section 4.2.1, let Ĥ, Ŵ, and Ĉ be the height, width, and channel number of the input window region of the data/feature map X̂. Given an M × M window size, the input is first flattened to X_f ∈ R^(N_w × d_w), where N_w = ĤŴ/M² and d_w = M²Ĉ. Suppose there are h heads adopted in the MSA computation; X_f is further split into h subsets along its channel dimension, that is, X_f = {X_f^1, …, X_f^h}. For the k-th head, X_f^k ∈ R^(N_w × d_w/h), k ∈ {1, …, h}, is embedded into a high-dimensional space by a linear model with parameter W_e^k ∈ R^((d_w/h) × d_e), and the projected window feature Z^k is as follows:

$$Z^k = X_f^k W_e^k, \quad Z^k \in \mathbb{R}^{N_w \times d_e} \quad (8)$$

Denote three weight matrices W_Q^k ∈ R^(d_e × d_Q), W_K^k ∈ R^(d_e × d_K), W_V^k ∈ R^(d_e × d_V), where d_Q = d_K = d_V = d_e/h (refer to the classical setting in Vaswani et al., 2017). The corresponding Query (Q), Key (K), and Value (V) of the feature Z^k for the k-th attention head are computed as follows:

$$Q^k = Z^k W_Q^k, \quad Q^k \in \mathbb{R}^{N_w \times d_Q} \quad (9)$$
$$K^k = Z^k W_K^k, \quad K^k \in \mathbb{R}^{N_w \times d_K} \quad (10)$$
$$V^k = Z^k W_V^k, \quad V^k \in \mathbb{R}^{N_w \times d_V} \quad (11)$$

The result of the k-th head attention, A^k ∈ R^(N_w × d_V), becomes

$$A^k = \mathrm{Softmax}\left(\frac{Q^k (K^k)^T}{\sqrt{d_K}} + B\right) V^k \quad (12)$$

where the Softmax function converts the input vector of numbers into a vector of probabilities, and B is the relative position bias (Liu et al., 2021). In the above, R indicates the space of real numbers and the superscript T indicates the transpose. Concatenating the outputs of all heads, the complete attention score A is obtained, that is, A = concat{A^1, A^2, …, A^h} ∈ R^(N_w × h d_V). Subsequently, A is projected back to the dimension of X_f by another linear model with parameter W_o ∈ R^(h d_V × d_w), and the final attention output Â is given as follows:

$$\hat{A} = A W_o \quad (13)$$

Using W-MSA significantly reduces the computational complexity and cost (Liu et al., 2021). However, the pixels and patches are nonoverlapping between windows, which causes the loss of certain relations between windows. To address this issue, W-MSA and SW-MSA are used alternately in the consecutive transformer blocks, as illustrated in Figure 4. By shifting the window partition, new information from adjacent patches becomes accessible in the attention computation, which strengthens the cross-window connections; for example, in Figure 6b, the M × M window shifts to the bottom right by ⌊M/2⌋ pixels, where ⌊·⌋ indicates the floor function of its argument. After window shifting, the number of windows and the window configuration are changed. To avoid increasing the computational burden, an efficient batch computation approach proposed in Liu et al. (2021) is adopted. By cyclically shifting certain patches from the top left to the bottom right (e.g., patches A, B, and C in Figure 6b), the rearranged image/feature maps have the same number of windows and window configuration as used in the previous W-MSA computation, and the attention can be computed in the same way as in W-MSA. To avoid mixing up the information from the moved patches with adjacent parts within a window, a masking mechanism is adopted as well (Figure 6b).
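The cyclic shift at the heart of SW-MSA can be sketched with torch.roll, followed by the usual window partition; the attention masking of cross-boundary patches is omitted here for brevity, so this is illustrative rather than complete.

```python
import torch

M = 7                                   # window size
x = torch.randn(1, 56, 56, 96)          # feature map from the previous block

# Shift the feature map by floor(M/2) so that the new window partition mixes
# patches from formerly adjacent windows (SW-MSA).
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))

# Partition into nonoverlapping M x M windows for the attention computation.
B, H, W, C = shifted.shape
windows = (shifted.reshape(B, H // M, M, W // M, M, C)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(-1, M * M, C))   # (num_windows, M*M, C)
print(windows.shape)                        # torch.Size([64, 49, 96])
```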
With a deeper network from stage 1 to stage 4, both W-MSA and SW-MSA are repeated multiple times. The deeper layers not only have larger receptive fields but also maintain the local information of the data during forward propagation. This has been found to be effective in enhancing performance for image classification, object detection, and semantic segmentation problems (Liu et al., 2021; Xu et al., 2021).

EXPERIMENT SETUP
To set the benchmark performance on the ϕ-NeXt data set and verify the effectiveness of the proposed MAMT2 framework, a series of experiments including classification, localization, and segmentation is conducted, and the effects of different approaches and models are compared and discussed. The implementation of these experiments is based on TensorFlow and PyTorch and performed on a CyberpowerPC with a single GPU (CPU: Intel Core i7-8700K @ 3.7 GHz, 6 cores; RAM: 32 GB; GPU: Nvidia GeForce RTX 2080Ti).

Classification experiments
In the classification experiments, six comparison cases are designed and separated into three groups: baseline, hierarchical transfer learning (HTL), and MTL. For all cases, the evaluation metric is the overall accuracy, which is computed as the ratio of the number of correct predictions to the total number of tested cases.
Stochastic gradient descent with a piece-wise decayed learning rate is adopted, where the learning rate is divided by 10 when the test accuracy is trapped at a plateau. As common methods to enhance model performance, especially on small data sets, TL and data augmentation are applied. All baseline models are pretrained on ImageNet (Deng et al., 2009). For data augmentation, the training images in each batch are transformed with a random combination of the following six cases: (a) horizontal translation within 10% of the total width, (b) vertical translation within 10% of the total height, (c) rotation within 5°, (d) zoom in up to 120% of the original size, (e) zoom out down to 80% of the original size, and (f) horizontal flip.
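These six augmentation cases map naturally onto standard torchvision transforms; the following is a sketch under the stated ranges, not necessarily the exact pipeline used in the experiments.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(
        degrees=5,                 # (c) rotation within 5 degrees
        translate=(0.1, 0.1),      # (a, b) horizontal/vertical shift within 10%
        scale=(0.8, 1.2),          # (d, e) zoom out to 80%, zoom in to 120%
    ),
    transforms.RandomHorizontalFlip(),  # (f) horizontal flip
    transforms.ToTensor(),
])
```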

HTL
As shown in Figure 1, certain relationships between the tasks/attributes are defined based on domain knowledge.
To utilize this kind of hierarchy information, one relevant approach, named HTL, is adopted for comparison. In HTL, the knowledge between tasks is transferred through designed hierarchical paths, which aims to improve the model performance in hard tasks (e.g., 7 and 8) via knowledge and information from easy tasks (e.g., 1 and 2). For benchmarking purposes, four transfer paths including only tasks 1, 2, 3, 7, and 8 are designed (Figure 7), and it is expected that the harder tasks 7 and 8 can benefit from the knowledge in the other, easier ones. The intuition behind this is that task 1 (scene level) contains the most images compared to the other tasks, and it can provide more information for the subsequent tasks. Moreover, the damage state is the most important attribute in vision-based SHM, and it has a direct relationship with spalling, damage level, and damage type. For simplicity, only the ResNet50 model is adopted in this experiment and the remaining settings are the same as the baseline. Along each path, the model for the current task is inherited from the previous task and is then used as the pretrained model for the next task. Each task is repeated three times to select the best model.

MTL
To further investigate the advantage of using a transformer-like network in handling the interrelationships among tasks, the performance of MAMT2 is compared with a conventional MTL model using a CNN as the backbone instead of a transformer, denoted as C-MTL. Similar to MAMT2, for a batch of image samples in an iteration, the parameters of the CNN and the corresponding classifiers are trained simultaneously using labels from different tasks and are optimized by the joint cross-entropy loss, Equation (1). Moreover, similar to the HTL settings, ResNet50 is adopted as the C-MTL backbone, the ImageNet pretrained parameters are utilized before training, and the remaining settings are kept the same. The eight classifiers are FC layers whose output dimensions correspond to the number of classes of the eight considered classification tasks.

Localization and segmentation experiments
In the localization and segmentation experiments, two comparison cases are considered, namely Mask R-CNN and MAMT2. Because the ϕ-NeXt labels follow the style of MS COCO (Lin et al., 2014), Mask R-CNN (He et al., 2017), one of the most powerful and prevalent detection networks, is selected to represent the baseline performance in this study. In addition, since a regular image size of 224 × 224 is used in this experiment, the RPN in MAMT2 is inherited from a Mask R-CNN model pretrained on the MS COCO data set (Lin et al., 2014).
Based on the definitions in the MS COCO data set and previous studies (He et al., 2017; Lin et al., 2014; Liu et al., 2021), the standard evaluation metrics for both localization and segmentation are the average precision (AP) and the average recall (AR) with varying intersection over union (IoU) thresholds α. The IoU is defined as the ratio of the intersection area to the union area of the ground truth and the predicted bounding boxes in localization, or masks in segmentation. For one data sample, define the true positives (TP) as the number of bounding boxes/masks with IoU larger than α, the false positives (FP) as the number of boxes/masks with IoU equal to or less than α, and the number of missed detections (i.e., there is a box/mask in the ground truth but the model does not detect it) as the false negatives (FN). The precision and recall are computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (14)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (15)$$

The conventional way to compute AP is by integrating the area under the precision-recall curve when the IoU threshold is α = 0.5, which is denoted as AP50. Besides, for benchmarking purposes, following the work in He et al. (2017) and Liu et al. (2021), the standard COCO metrics including a set of AP variations, that is, AP75, AP_S, AP_M, and AP_L, are adopted. AP75 is similar to AP50 but is computed using α = 0.75. AP_S, AP_M, and AP_L are used for evaluating the model's detection ability on small (ground truth bounding box resolution < 32 × 32), medium (32 × 32 ≤ ground truth bounding box resolution ≤ 96 × 96), and large (ground truth bounding box resolution > 96 × 96) objects, respectively. When computing these three metrics, the AP values are computed 10 times using a set of IoU thresholds from α = 0.5 to 0.95 with an increment of 0.05 and then averaged. In addition, a set of AR variations is computed, that is, AR50, AR75, AR_S, AR_M, and AR_L. AR50 and AR75 are computed under α = 0.5 and 0.75, respectively. Similar to AP_S, AP_M, and AP_L, the metrics AR_S, AR_M, and AR_L are computed as the average recall values for α = 0.5 to 0.95 with an increment of 0.05, considering only the data with ground truth bounding boxes of the above-mentioned three scales.
Besides the standard evaluation metrics, the mean IoU (mIoU) is also computed and compared for both the localization and segmentation tasks. This metric is also widely used in previous relevant SHM studies (Sajedi & Liang, 2021; Zhao et al., 2022). Herein, the mIoU is calculated by averaging the IoU over all classes.
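The box IoU underlying all of these metrics is straightforward to compute. The sketch below is a plain implementation for axis-aligned boxes in (x1, y1, x2, y2) format, with a worked example:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive when box_iou(pred, gt) > alpha,
# e.g., alpha = 0.5 for AP50 and alpha = 0.75 for AP75.
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~= 0.143
```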

EXPERIMENTAL RESULTS

Classification
The test accuracies of the six models discussed above are shown in Table 1, where the best results are highlighted in bold. Benefiting from TL and data augmentation, the baseline models show relatively good performance, with accuracies ranging from the 70% to the 90% range. ResNet achieves a leading position in most tasks, while VGG19 obtains more competitive performance in tasks 6 and 8. The training accuracy of all baseline models was almost 100%, which indicates that the current level of model complexity is able to handle these tasks, and slight model modifications or elaborate training techniques may help further enhance the accuracy.
In HTL, domain knowledge is added into the path, and the results for all paths are presented in Table 2. Compared to the ResNet baseline, in general, task 8 gains a nearly 1% to 2% enhancement in all cases, but there is an accuracy drop in tasks 3 and 7 in paths B and C. Comparing the four paths for the purpose of improving the hard tasks, path A is considered the best, leading to improvement over the baseline in all tasks, that is, 0.3% in task 3, 1.4% in task 7, and 2.8% in task 8. However, its performance in task 8 is still lower than VGG19, which is mainly attributed to the ResNet50 baseline. In short, HTL using shareable task-dependent characteristics can obtain some degree of improvement, but it still has the risk of rendering slightly worse results in some cases, and thus its performance is highly dependent on both the predefined path and the source domain model. Therefore, with limited choices of path and model, HTL cannot fully capture and exploit the intertask relationships. In addition, HTL requires fine-tuning the model multiple times, which significantly increases the computational cost, and the performance-to-cost ratio is not economical. MTL usually learns shared representations among multiple tasks (Argyriou et al., 2007), which enables the model to generalize well to these tasks. From Table 1, it is obvious that the MTL models achieve better performance than the baseline and HTL models, especially in the hard tasks. CNN-based models enlarge their receptive fields by stacking convolution and pooling layers; however, they lose local information of the data during forward propagation. On the contrary, through MSA and the hierarchical mechanism, MAMT2 extracts features with a large receptive field while maintaining global attention from the shallow layers to the deeper layers. These characteristics efficiently capture the intertask relationships and utilize the abundant features and information of structural images from different source domains.
To further explore the classification principles of the transformer backbone in MAMT2, saliency maps (reflecting the most active pixel regions related to the model prediction) of two sets of examples (object-level and structural-level images) from the test data set are presented in Figure 8. The saliency maps herein are generated from the last feature maps of the shared backbone by the Gradient-based Class Activation Maps (Grad-CAM) method, following the interpretation procedure proposed in Gao and Mosalam (2022). Since all eight classifiers share the same transformer backbone, it is observed that the saliency maps share a similar but rough shape (e.g., damage region) and focus (e.g., damage location) among the different tasks/attributes, which reflects the sharable features learned by the MTL mechanism. In addition, for the different classification tasks, the classifiers in each branch adjust their weights accordingly to fit the prediction, which leads to changes in the saliency maps.
Taking the object-level images in Figure 8a as examples, the saliency maps of tasks 1 (scene level) and 6 (component type) present a large and broad activated region, which is evidence of utilizing the global information in the images, consistent with the human way of thinking. In other words, compared to a CNN, the global attention mechanism in the transformer backbone strengthens the model's ability to utilize global information. One interesting finding in tasks 2 (damage state) and 3 (spalling condition) is that MAMT2 can, to some extent, understand the semantic meaning of spalling. The objectives of tasks 2 and 3 are to judge the occurrence of damage and spalling, respectively, where the former is more general than the latter. In samples 2 and 4, the activated region for task 2 covers both cracks and spalling, while the regions in task 3 are more limited to spalled areas. This is a good indicator that the backbone model can distinguish the difference between general damage and spalling, and also understand what and where the spalling is. Such feature characteristics are helpful in localizing and segmenting the spalled area, which are tasks 9 and 10. Based on ϕ-NeXt (Figure 1) and path A (Figure 7), tasks 7 (damage level) and 8 (damage type) can be treated as downstream tasks of task 2, where the saliency maps show identical patterns.
Similar observations are made for the structural-level images in Figure 8b. The saliency maps of tasks 1, 2, and 5 (collapse mode) are roughly identical, but the latter two are more precise regarding the semantic meanings of damage and collapse. Structural-level images usually have more content and complex patterns, requiring a comprehensive judgment of global information. Herein, the saliency map coverage is relatively complete and large enough to cover the regions of collapsed buildings and debris. Therefore, these results demonstrate the rationality and effectiveness of the transformer-based backbone in conducting multiple classification tasks simultaneously.
In summary, from the perspective of the classification tasks in vision-based SHM, MAMT2 has the following merits: (1) It achieves state-of-the-art performance in most tasks, (2) it exploits the interrelationships and global information in the data, and (3) it significantly enhances the training efficiency compared to other approaches. From practical considerations, the test accuracy for tasks 1 and 4 is promising for real practical applications, and that of the remaining, that is, hard, tasks is still acceptable and can be improved with more data.
Figure 9: Localization and segmentation results.

Localization
The localization results are compared through the rectangular regions (represented by their coordinates) of the ground truth and the predicted bounding boxes. These are listed in Table 3 (the best results are highlighted in bold), and the results for 10 samples are shown in Figure 9.
In general, the performance of the proposed MAMT2 is better than Mask R-CNN (MR-CNN in Table 3 for short) under all evaluation metrics. With respect to AP, MAMT2 achieves a nearly 80% AP50 score and a 6.5% enhancement over MR-CNN under a single IoU threshold α = 0.5, which is very promising compared to the results reported in several benchmark experiments (He et al., 2017; Liu et al., 2021). When α increases to a stricter value of 0.75, the Mask R-CNN performance decreases significantly and MAMT2's leading advantage increases from 6.4% to 8.6%, which indicates a more robust performance. As for the evaluation of objects at three different scales, it is found that both models encounter difficulty in identifying small objects (< 32 × 32), where MAMT2 leads by only 2.2 points. With the increase in object size, MAMT2 obtains a significant improvement. Especially for large objects (> 96 × 96), the AP_L of MAMT2 outperforms Mask R-CNN by 18.5%. Complementary to the AP score, the AR score measures the assertiveness of the object detectors for a given class, that is, the AR scores presented in Table 3 are computed for "spalling." Both models are insensitive to the IoU threshold, where the AR50 and AR75 scores are the same, but MAMT2 outperforms Mask R-CNN by 5.6%. Similar to the trend observed in AP, MAMT2 achieves greater improvement than Mask R-CNN across the different scales, especially the obtained 15.2% increase in detecting large objects. As for the mIoU, MAMT2 outperforms Mask R-CNN by 7.6%, which further demonstrates the superiority of the proposed MAMT2 from another perspective.
The AP, AR, and mIoU scores only reflect the general condition of both models. Several examples are illustrated in Figure 9 to complement the discussion, where samples 1 and 2 represent the most common spalling patterns and both Mask R-CNN and MAMT2 achieve relatively satisfactory performance. Consistent with the AP and AR conclusions, in samples 4, 8, 9, and 10, Mask R-CNN missed certain spalling regions at different scales while MAMT2 accurately captured them all. Furthermore, for some large spalling patterns, for example, samples 3 and 6, Mask R-CNN produced duplicated boxes; on the contrary, MAMT2 output a single precise box, which demonstrates the robustness of MAMT2 with respect to the detection of large objects.
In conclusion, due to MAMT2's strong multiscale ability, it can achieve better performance on objects of different sizes than classical models such as Mask R-CNN. It improves the detection ability for small and medium objects moderately and boosts the performance on large objects significantly. Therefore, the MAMT2 model increases the confidence of researchers and engineers in accurately detecting spalling areas larger than 96 × 96 pixels for practical usage.

Segmentation
The segmentation results are listed in Table 3, which are computed by a pixel-wise comparison of the ground truth and the predicted masks. The same 10 samples used in the localization task are considered for the segmentation task, as shown in Figure 9. Similar to the observations in the localization task, MAMT2 presents better performance than Mask R-CNN under all evaluation metrics, and both models achieve more accurate results for larger objects, that is, > 96 × 96. It is noted that MAMT2 obtains an 81.3% AP score under the IoU threshold α = 0.5, which outperforms Mask R-CNN by 11%, and it also leads Mask R-CNN by 13.8% in mIoU, indicating a significant improvement over the current mainstream methods. Furthermore, the performance differences between MAMT2 and Mask R-CNN in detecting small and medium objects increase, where AP_S and AP_M are improved by 5.3% and 8.3%, respectively, and AR_S and AR_M are increased by 4.6% and 6.2%, respectively. This is reflected in samples 4 and 10 in Figure 9, where Mask R-CNN made a wrong prediction in sample 4 and missed the detection of the middle and right-hand spalled areas in sample 10.
In samples 1, 2, 3, and 6, both models cover the spalled areas, but the masks generated by MAMT2 are more precise, especially near the edges of the spalling areas, and the mask of Mask R-CNN is incomplete in sample 3. Besides, influenced by the localization performance, Mask R-CNN sometimes does not generate meaningful masks, as shown in samples 4 and 5. Similar to the results in localization, in samples 4, 7, 8, 9, and 10, Mask R-CNN missed a few spalled regions while MAMT2 recognized all of them precisely, except for missing a small spalling area in sample 7. Therefore, the results demonstrate the promising and robust performance of MAMT2 in spalling segmentation, especially for medium-size and large-size spalling areas.
It is noted that both MAMT2 and Mask R-CNN have multiple branches, which can output localization and segmentation results simultaneously. In this study, the inference/prediction times per image for the two models are 0.094 s (Mask R-CNN) and 0.055 s (MAMT2), which demonstrates the higher computational efficiency of the proposed MAMT2.

CONCLUSIONS AND EXTENSIONS
In this paper, several typical vision-based SHM problems are first reformulated into a multiattribute multitask setting to describe the characteristics of the collected structural image data in a more comprehensive manner.
Extending the ϕ-Net, a more general structural image detection framework, namely ϕ-NeXt, is proposed, which introduces 10 benchmark tasks covering classification, localization, and segmentation. Accordingly, a multilabeled data set along with the data set forming and label merging algorithms is established, which contains 37,000 images with multiple attributes.
To better address the multiattribute multitask problems, a novel hierarchical framework, namely MAMT2, is proposed. The Swin transformer is adopted as the backbone, whose hierarchical mechanism and SW-MSA are thought to improve the model's recognition ability. Benefiting from the MTL mechanism, MAMT2 can utilize rich information and intertask relationships from different sources and attributes. The joint training mechanism with a shared backbone network improves the recognition accuracy and preserves computational efficiency. In addition, through TL, the trained backbone for the classification tasks is directly inherited as the pretrained model for the localization and segmentation tasks.
Finally, the proposed MAMT2 is validated on the 10 benchmark tasks in ϕ-NeXt. In the classification tasks, MAMT2 is compared with baseline models (VGG16, VGG19, and ResNet50) trained separately on single tasks, HTL models with four hierarchical transfer paths designed based on domain knowledge, and one MTL model using a CNN backbone (C-MTL). The results show that MAMT2 achieves the best performance in all tasks except task 6 (where its results are still comparable with the other models), and both MTL-based models (MAMT2 and C-MTL) outperform the conventional methods, especially in the hard tasks (i.e., tasks 5 to 8). In the localization and segmentation tasks, the numerical (AP and AR scores) and visual results demonstrate the more precise and robust performance of MAMT2 compared with the classical model, that is, Mask R-CNN. The extensive experimental results presented herein provide a benchmark reference for future relevant studies pursuing state-of-the-art performance in both structural engineering and general CV applications.
Beyond this study, a few extensions are worth future investigation. For the localization and segmentation tasks, only concrete cover spalling is studied, and locating and quantifying other damage types (e.g., steel corrosion, fracture, and masonry crushing) need further experiments when the corresponding data become available. Beyond the three classical vision tasks, other important tasks such as quantitative structural response monitoring and external load estimation have the potential to benefit from MAMT2, for example, by treating them as new attributes or as downstream tasks of the fundamental classification task utilizing the transferred abundant information. In addition, to pursue state-of-the-art performance in practice, several novel algorithms and methods, for example, neural dynamic classification (Rafiei & Adeli, 2017a), dynamic ensemble learning (Alam et al., 2020), finite element machine for fast learning (Pereira et al., 2020), self-supervised learning (Rafiei et al., 2022), and multimodal learning (Wang et al., 2022), can be implemented into MAMT2, especially for backbone training.


Figure 3: Hierarchical mechanism.

Figure 6: Window-based multihead attention computation mechanism.
Figure 7: Hierarchical transfer learning path.
ACKNOWLEDGMENTS
This research received funding support from: (i) the California Department of Transportation (Caltrans) for the "Bridge Rapid Assessment Center for Extreme Events (BRACE2)" project, Task Order 001 of the PEER-Bridge Program agreement 65A0774 to the Pacific Earthquake Engineering Research (PEER) Center, (ii) the Artificial Intelligence Institute for Food Systems (AIFS), https://aifs.ucdavis.edu/, and (iii) the Taisei Chair of Civil Engineering at the University of California, Berkeley.
Table 1: Test accuracy of the classification models.
Table 3: Localization and segmentation results (%).