Nodes2STRNet for structural dense displacement recognition by deformable mesh model and motion representation

Displacement is a critical indicator for mechanical systems and civil structures. Conventional vision-based displacement recognition methods mainly focus on the sparse identification of limited measurement points, and the motion representation of an entire structure remains very challenging. This study proposes a novel Nodes2STRNet for structural dense displacement recognition using a handful of structural control nodes based on a deformable structural three-dimensional mesh model, which consists of a control node estimation subnetwork (NodesEstimate) and a pose parameter recognition subnetwork (Nodes2PoseNet). NodesEstimate calculates the dense optical flow field based on FlowNet 2.0 and generates structural control node coordinates. Nodes2PoseNet uses the structural control node coordinates as input and regresses structural pose parameters by a multilayer perceptron. A self-supervised learning strategy with a mean square error loss and L2 regularization is designed to train Nodes2PoseNet. The effectiveness and accuracy of dense displacement recognition and the robustness to light condition variations are validated by seismic shaking table tests of a four-story-building model. Comparative studies with the image-segmentation-based Structure-PoseNet show that the proposed Nodes2STRNet achieves higher accuracy and better robustness against light condition variations. In addition, NodesEstimate does not require retraining when faced with new scenarios, and Nodes2PoseNet has high self-supervised training efficiency with only a few control nodes instead of fully supervised pixel-level segmentation.

At the early stage, vision-based methods 2-9 required installing artificial targets on structures. Wahbeh et al. 10 installed light-emitting diode lights on bridges to measure structural displacement in low-light conditions, such as at night. However, artificial targets are either labor-consuming or challenging to access in field applications.
Measurement methods based on the visual features of structures without artificial targets have been proposed, for example, feature point matching, 11,12 digital image correlation (DIC), 13 correlation filters, and motion amplification. Feature point matching algorithms include the scale-invariant feature transform (SIFT), 14,15 speeded-up robust features, 16 and Kanade-Lucas-Tomasi (KLT). 17 These algorithms calculate image feature points through a feature operator and match the feature points across different images. In addition to manually designed feature operators, Dong and Catbas 18 designed a deep-learning-based feature operator (Visual Geometry Group) and combined it with SIFT to identify the displacement of a two-span bridge model. The correlation filter method 19-21 uses a self-training iterative algorithm to track the initial object. Zhao et al. 22 combined support correlation filters and KLT to improve the robustness of displacement measurement. However, all these feature point matching methods can only obtain single-point displacements of the structure. Helfrick et al. 23 and Baqersad et al. 24 applied the DIC algorithm to measure the vibration and rotation of different types of structures and obtain the displacement field. Almeida et al. 25 used the DIC algorithm to measure planar deformation from a set of images. In addition to fixed cameras, unmanned aerial vehicles (UAVs) are also widely used to obtain videos and identify structural displacement. 26,27 Laser-embedded light detection and ranging technology has been implemented on UAVs to scan three-dimensional (3D) point clouds and determine structural displacement.
28,29 In the above vision-based methods, the identification procedure for displacement can be divided into two steps: identification of the pixel displacement on the image, and then conversion of the pixel displacement into real-world displacement using a scaling factor or camera matrix transformation. Structure-PoseNet 30 proposed a structural displacement identification method based on a deformable structural 3D mesh model (DSMM), in which the displacement is directly obtained from the coordinates in the mesh model. Generally, 3D dynamic displacement recognition of a structure includes the identification of 3D models and model poses. To identify the structural pose parameters, image features should first be refined from video frames. These features should contain the motion of structural components; exclude the image background, structural texture, light illumination, and shadow; and be sensitive to subtle motion. Deep-learning-based computer vision techniques can be used to extract structural features. To realize the abovementioned goals, an appropriate two-dimensional (2D) representation of structural motion should be adopted. The semantic segmentation mask was selected as the structural motion representation in Structure-PoseNet. 30 However, it is sensitive to the variation of light conditions, and the semantic segmentation mask lacks gradient variations, limiting its effectiveness in identifying dense displacement. Moreover, the training efficiency is limited because two new subnetworks, ParaNet and CompNet, need to be retrained when the resolution of the input image changes.
In addition to semantic segmentation, dense optical flow 31,32 can extract structural motion features in video. Dense optical flow recognizes the motion velocity field of pixels in the image while ignoring other unnecessary information. Optical flow is sensitive to small pixel movements, which is significant for motion identification.
After the representations of structural motion features are obtained, they are further converted into structural pose parameters. This process can be accomplished through deep-learning networks.
In this study, Nodes2STRNet is proposed for dense structural displacement identification, which consists of a control node estimation subnetwork (NodesEstimate) and a pose parameter recognition subnetwork (Nodes2PoseNet). The NodesEstimate subnetwork takes each video frame as input and outputs the 2D positions of control nodes, and the Nodes2PoseNet subnetwork takes the coordinates of control nodes as input and outputs the structural pose parameters.
Finally, the dense displacement of the structure is obtained based on the deformed structural 3D mesh model. The remainder of this paper is organized as follows. Section 2 introduces the proposed Nodes2STRNet. In Section 3, the proposed method is validated through shaking table tests on a four-story-building model. Conclusions are summarized in Section 4.

2 | METHODOLOGY
This study accomplishes dense structural displacement recognition based on the following ideas. First, the motion of structural control nodes in the DSMM represents the motion of the structure; therefore, a NodesEstimate subnetwork is established to generate the 2D position coordinates of control nodes from each video frame based on dense optical flow. Then, a Nodes2PoseNet subnetwork is established to model the mapping relationship between control node coordinates and structural pose parameters. Finally, the structural 3D mesh model is deformed according to the estimated structural pose parameters, and the dense displacement of the structure is obtained.
An overall schematic of the proposed Nodes2STRNet is shown in Figure 1. The workflow is completed through three steps as follows: Step 1: The NodesEstimate subnetwork uses each frame in the original video of a structure as input, outputs the 2D dense optical flow field relative to the original frame, and calculates the 2D control node heatmap of each frame.
Step 2: The 2D control node coordinates are converted from the control node heatmap of each frame by calculating the centroid and inputted into the Nodes2PoseNet subnetwork to obtain the structural pose parameters.
Step 3: The structural pose parameters are utilized to generate the DSMM in each frame, and the dense displacement of the structure is finally refined from the coordinates of the vertices in the DSMM. Section 2 is organized in the logical order described below. Before the methodology details, the rationale of the method and the overall framework are introduced.
Then, Section 2.1 shows how a DSMM is established for a structure, as it forms the geometric foundation of the proposed method, and Section 2.2 introduces several key concepts and variable definitions for the input and output quantification, mainly including control nodes, 2D heatmaps, and coordinate transformation. Afterward, Sections 2.3 and 2.4 describe the network structures of NodesEstimate and Nodes2PoseNet, respectively. Finally, Section 2.5 illustrates the efficient self-supervised training strategy of the proposed method.
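The three-step workflow above can be sketched as a single loop. This is a minimal illustration only: `flow_net`, `warp_heatmaps`, `centroids`, `nodes2pose`, and `deform_mesh` are hypothetical stand-ins for NodesEstimate's FlowNet 2.0, the heatmap warping, the centroid extraction, the Nodes2PoseNet MLP, and the DSMM deformation.

```python
import numpy as np

def recognize_dense_displacement(frames, frame0, flow_net, warp_heatmaps,
                                 heatmaps0, centroids, nodes2pose, deform_mesh):
    """Per-frame dense displacement following Steps 1-3 of the workflow."""
    # Reference vertex positions from the undeformed (initial-frame) pose.
    v_ref = deform_mesh(nodes2pose(centroids(heatmaps0)))
    out = []
    for frame in frames:
        flow = flow_net(frame0, frame)          # Step 1: dense optical flow field
        hm_t = warp_heatmaps(heatmaps0, flow)   # Step 1: control-node heatmaps
        pose = nodes2pose(centroids(hm_t))      # Step 2: structural pose parameters
        v_t = deform_mesh(pose)                 # Step 3: deform the DSMM
        out.append(v_t - v_ref)                 # dense displacement of mesh vertices
    return out
```

Because the two subnetworks only communicate through control-node coordinates, any callable with matching shapes can be plugged in for either stage.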

2.1 | Deformable structural 3D mesh model
DSMM categorizes a structure into 3D elements, and each element is represented by a 3D deformable mesh model. The mesh model in an oscillation sequence is divided into the initial mesh model M_0 (as shown in Figure 2) and the time-variant mesh model M_t with pose parameters P_t (defined later). The basic elements of the initial mesh model M_0 are the vertices V_0. Sets of adjacent vertices are connected by edges E to form faces. All faces form the surface of the mesh model M_0, which can be expressed as an undirected graph G with vertices V_0 and edges E:

M_0 = G(V_0, E).  (1)

At frame t, M_t shares the same undirected graph G with M_0. The coordinates of the vertices V_t shift with the oscillation of the structure; therefore, M_t is an undirected graph consisting of the new vertices V_t and the same edges E:

M_t = G(V_t, E).  (2)

The set of vertices inside each cross-section of a structural component forms a common section, which can also be regarded as an undirected graph:

S_{i,0} = G_{S_i}(V_{S_i,0}, E_{S_i}),  i = 1, 2, …, n_a,  (3)

where n_a is the number of common sections and is determined by the vertex interval in the direction perpendicular to the cross-sections. Each common section moves rigidly with the structure:

S_{i,t} = T(S_{i,0}, P_{S,t,i}),  (4)

where S_{i,0} is the initial state of the common section S_{i,t}, and T is the conversion function with translation and rotation transformations, assuming that the common section S_{i,t} is rigid. (The pose parameters P_{S,t,i} are regressed by the Nodes2PoseNet introduced in Section 2.4.) Combining Equations (1)-(4), M_t can be determined after all common sections S_i are obtained:

M_t = G(∪_{i=1}^{n_a} V_{S_i,t}, E).  (5)

Considering that G_{S_i} and E_{S_i} are time-invariant and V_{S_i,0} is known, M_t is fully determined by H_{S,t,i} and R_{S,t,i}. Figure 3 shows the schematics of the structural pose parameters in the DSMM.

FIGURE 1 Identification workflow of the proposed Nodes2STRNet. NodesEstimate, control node estimation subnetwork; Nodes2PoseNet, pose parameter recognition subnetwork.
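The rigid conversion function T applied to each common section (translation plus rotation, as assumed above) can be sketched in 2D. `transform_section` is a hypothetical helper operating on an array of section vertices; rotating about the section centroid is an assumption about where the rotation axis sits.

```python
import numpy as np

def transform_section(v0, h, r):
    """Rigid transform of a common section: rotation r (radians) about the
    section centroid followed by translation h, applied to vertices v0 (n, 2)."""
    c = v0.mean(axis=0)                          # section centroid
    rot = np.array([[np.cos(r), -np.sin(r)],
                    [np.sin(r),  np.cos(r)]])    # 2D rotation matrix
    return (v0 - c) @ rot.T + c + h              # rotate about centroid, translate
```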

2.2 | Definition of control nodes and 2D heatmaps
The 3D coordinates of the control nodes are projected into 2D image coordinates through the camera matrix:

N_c2 = K [R_c | T_c] N_c3,  (6)

where R_c and T_c represent the rotation matrix and translation matrix of the camera, respectively.
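The camera matrix transformation mapping 3D control nodes to 2D image coordinates can be sketched as a standard pinhole projection. The intrinsic matrix `K` below and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def project_to_image(points_3d, K, R_c, T_c):
    """Pinhole projection of 3D control nodes to 2D pixel coordinates using
    intrinsics K, camera rotation R_c, and camera translation T_c."""
    cam = points_3d @ R_c.T + T_c      # world -> camera coordinates
    uvw = cam @ K.T                    # camera -> homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division -> (x, y) pixels
```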
The 2D heatmaps in the initial frame Hm_0^i are generated around the 2D control nodes following a normalized 2D Gaussian distribution:

Hm_0^i(x, y) = N((x, y); N_c2^i, σ),  i = 1, 2, …, N_nodes,  (7)

where i is the index of the control nodes, N_nodes is the number of control nodes, and N(μ, σ) is the probability density function of a 2D Gaussian centered at the control node coordinate with standard deviation σ.
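A minimal sketch of generating a normalized 2D Gaussian heatmap around a control node. The spread `sigma` is a hypothetical parameter not specified in the text.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=3.0):
    """Normalized 2D Gaussian heatmap of size (h, w) centered at pixel (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return hm / hm.sum()               # normalize so the heatmap sums to 1
```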

2.3 | NodesEstimate subnetwork
The NodesEstimate subnetwork utilizes each video frame as input, predicts the dense optical flow between each frame and the initial frame by FlowNet 2.0, and then calculates the heatmaps of the 2D control nodes for each frame, as shown in Figure 1:

Hm_t^i = Warp(Hm_0^i, Flow(I_0, I_t)),  (8)

where I_t and I_0 denote the tth and initial frames, Flow denotes the optical flow calculation process (FlowNet 2.0 32 in this study), and Hm_0^i denotes the heatmap of the ith control node in the initial frame, obtained from Equation (7) in Section 2.2. The NodesEstimate subnetwork only performs the feedforward inference process to calculate the control node heatmaps using the pretrained FlowNet 2.0.
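The heatmap propagation step can be sketched as a forward warp that moves each pixel's heatmap mass by its optical flow vector. This nearest-neighbor version is a simplification; the interpolation scheme actually used is not detailed in the text.

```python
import numpy as np

def warp_heatmap(hm0, flow):
    """Propagate an initial heatmap hm0 (H, W) by a dense flow field (H, W, 2),
    where flow[y, x] = (dx, dy) in pixels."""
    h, w = hm0.shape
    hm_t = np.zeros_like(hm0)
    ys, xs = np.nonzero(hm0)
    for y, x in zip(ys, xs):
        dx, dy = flow[y, x]
        ty, tx = int(round(y + dy)), int(round(x + dx))
        if 0 <= ty < h and 0 <= tx < w:
            hm_t[ty, tx] += hm0[y, x]   # accumulate displaced heatmap mass
    return hm_t
```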

2.4 | Nodes2PoseNet subnetwork
As shown in Figure 1, the Nodes2PoseNet subnetwork utilizes the heatmap centroid coordinates as input and predicts the structural pose parameters as output:

P̂_t = MLP([N_c2,t^1, N_c2,t^2, …, N_c2,t^{N_nodes}]),  (10)

The Nodes2PoseNet subnetwork utilizes a multilayer perceptron (MLP) architecture with four hidden layers (as shown in Figure 7).
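A numpy sketch of the Nodes2PoseNet MLP forward pass. The four hidden layers with ten neurons each and the 20 control nodes follow the text; the number of control sections, the ReLU activation, and the initialization scheme are illustrative assumptions.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass: flattened 2D control-node coordinates -> pose parameters."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)      # hidden layers with ReLU (assumed)
    return x @ weights[-1] + biases[-1]     # linear output layer

def init_nodes2pose(n_nodes=20, n_sections=5, hidden=10, seed=0):
    """Input: 2 coordinates per node; output: (H, R) per control section."""
    rng = np.random.default_rng(seed)
    sizes = [2 * n_nodes] + [hidden] * 4 + [2 * n_sections]
    weights = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]
    return weights, biases
```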

2.5 | Self-supervised training strategy of Nodes2PoseNet
Figure 8 shows the overall schematic of the self-supervised training process for the Nodes2PoseNet subnetwork, which includes the following steps: (i) Structural pose parameters Ĥ and R̂ are randomly generated following a uniform distribution in a fixed range as the ground-truth values:

Ĥ ~ Uniform(H_a, H_b),  R̂ ~ Uniform(R_a, R_b),  (9)

where H_a, H_b, R_a, and R_b denote the preset lower and upper bounds for the structural pose parameters H and R, and "Uniform" denotes the uniform distribution function.
(ii) The randomly generated structural pose parameters Ĥ and R̂ are applied to the initial structural 3D model, and the 3D spatial coordinates of all control nodes are obtained according to Equation (5).
(iii) The 3D spatial coordinates are converted into 2D camera coordinates by camera matrix transformation to generate the 2D coordinates of the control nodes according to Equation (6). The mean-square error (MSE) loss function with L2 regularization is adopted to calculate the regression error between the ground-truth structural pose parameters and the predicted values corresponding to all the control nodes:

L = (1/N_nodes) Σ_{i=1}^{N_nodes} [(Ĥ_i − H_i)² + λ_1 (R̂_i − R_i)²] + λ_2 L_2,  (11)

where L denotes the loss function used to update the Nodes2PoseNet subnetwork, N_nodes denotes the number of control nodes, λ_1 denotes the weight coefficient between H and R, L_2 denotes the regularization term to avoid overfitting, and λ_2 denotes the regularization coefficient.
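The MSE-plus-L2 loss can be written down directly. Here `lam1` and `lam2` correspond to λ1 and λ2; their default values are illustrative, not taken from the paper.

```python
import numpy as np

def pose_loss(h_pred, r_pred, h_true, r_true, params, lam1=1.0, lam2=1e-4):
    """MSE regression loss over translation H and rotation R, plus an L2
    penalty on the network parameters to avoid overfitting."""
    mse = np.mean((h_pred - h_true) ** 2) + lam1 * np.mean((r_pred - r_true) ** 2)
    l2 = sum(np.sum(p ** 2) for p in params)   # L2 regularization term
    return mse + lam2 * l2
```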
(v) Return to the first step, repeat steps (i)-(iv), and iteratively update Nodes2PoseNet until the loss falls below a preset value ε (set as 10^-5 in this study).
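Steps (i)-(v) can be sketched as a loop. `project_nodes` stands in for Equations (5)-(6) (pose to 2D control-node coordinates) and `net_step` for one predict/loss/Adam-update cycle of Nodes2PoseNet; both names are hypothetical.

```python
import numpy as np

def self_supervised_train(project_nodes, net_step, bounds, eps=1e-5,
                          max_iter=1000, seed=0):
    """Self-supervised training loop for Nodes2PoseNet (steps i-v)."""
    rng = np.random.default_rng(seed)
    (h_a, h_b), (r_a, r_b) = bounds
    loss = np.inf
    for _ in range(max_iter):
        h_gt = rng.uniform(h_a, h_b)          # (i) random ground-truth pose
        r_gt = rng.uniform(r_a, r_b)
        coords = project_nodes(h_gt, r_gt)    # (ii)-(iii) 3D model -> 2D coordinates
        loss = net_step(coords, h_gt, r_gt)   # (iv) predict, compute loss, update
        if loss < eps:                        # (v) stopping criterion
            break
    return loss
```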

3 | Experimental setups of shaking table test and network training
To verify the identification accuracy of dense displacement and the robustness to different light conditions, a seismic shaking table test of a four-story-building model is conducted (Figure 4). Table 1 shows the investigated earthquake ground motions and their intensities for the five videos recorded during the shaking table tests, and the corresponding waveforms are shown in Figure 9. More details can be found in previous studies. 22,30 As shown in Figure 5, a total of 20 control nodes are set on the junction points of the ground and the first floor and on the joints of columns and beams at each story. The 3D coordinates of these control nodes are converted into 2D image coordinates by the known camera matrix, as shown in Table 2.
Each video frame is input to the NodesEstimate subnetwork, which predicts the dense optical flow between itself and the initial frame; the predicted dense optical flow field is converted into RGB images for visualization through color wheel conversion.

4 | Results and discussion
This study proposes a novel Nodes2STRNet for structural dense displacement recognition based on DSMM and motion representation.
The structural pose parameters P_{S,t,i} control the motion of S_{i,t}, as shown in Figure 3. A few specific sections S_{C_i} (i = 1, 2, …, n_b) with equal intervals are selected as control sections, where n_b is a hyperparameter for the number of control sections. The selection of n_b is a trade-off between the prediction accuracy of dense displacement and the model parameter volume of Nodes2STRNet. The structural pose parameters P_{S,t,i} consist of the translation H_{S,t,i} and rotation angle R_{S,t,i}. The pose parameters of the other common sections S_i are calculated by cubic spline interpolation of P_{S,t,C_i} in the control sections S_{C_i}. In Sections 2.2 and 2.3, the NodesEstimate subnetwork is designed to calculate the 2D control node heatmaps from each video frame. In Section 2.4, the control node heatmaps are fed into the proposed Nodes2PoseNet subnetwork to obtain the structural pose parameters. Therefore, the structural control nodes are used as intermediate connections between the two subnetworks of Nodes2STRNet. Heatmaps of structural control nodes represent structural motion and can be calculated by dense optical flow. NodesEstimate uses FlowNet 2.0 32 to extract dense optical flow, representing the velocity field in a video frame. Compared to semantic segmentation masks, the dense optical flow contains sufficient information density because of the pixel-level velocity gradient. Control nodes N_c3(X, Y, Z) have a clear physical meaning of spatial location in real-world 3D coordinates, as shown in Figure 4.
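The cubic spline interpolation of pose parameters from the n_b control sections to all common sections can be sketched with SciPy. The section heights `z_control` and `z_all` are hypothetical coordinates along the structure's height.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_section_poses(z_control, h_control, r_control, z_all):
    """Interpolate translation H and rotation R from control sections at
    heights z_control to all common sections at heights z_all."""
    h_all = CubicSpline(z_control, h_control)(z_all)
    r_all = CubicSpline(z_control, r_control)(z_all)
    return h_all, r_all
```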
Control nodes are usually set on the joints of structural elements, for example, the joints of columns and beams at each story of a frame structure. A higher density of control nodes results in a higher spatial resolution of the 3D mesh model, which improves the identification accuracy of dense structural displacement. To calculate the 2D coordinates of control nodes N_c2(x, y) in the initial frame, the 3D coordinates of the control nodes in the initial 3D mesh model and the camera matrix are required, as given by Equation (6).

FIGURE 2 Initial structural 3D model of a four-story building.
FIGURE 3 Schematics of structural pose parameters in DSMM with common and control sections. DSMM, deformable structural 3D mesh model.

The dense optical flow represents the pixel-level displacement vector field between the original frame and a subsequent frame. As shown in the left part of Figure 1, the optical flow between two frames can be obtained by FlowNet 2.0. The optical flow field can be converted into an RGB visualization through color wheel conversion, where the modulus and direction of the optical flow vector are denoted by saturation and hue, respectively, as shown in Figure 6. By applying the predicted dense optical flow field to the heatmaps of the control nodes in the initial frame, the corresponding heatmaps in the following frames can be obtained, as shown in the right part of Figure 1, yielding the heatmap of the ith control node in the tth frame, Hm_t^i. After the 2D heatmaps are obtained by the NodesEstimate subnetwork, the Nodes2PoseNet subnetwork is established to generate structural pose parameters from the 2D coordinates of the control nodes. The heatmap centroid coordinates of the ith control node in the tth frame can be calculated as:

CH_t^i = Σ_{x=1}^{H} Σ_{y=1}^{W} x · Hm_t^i(x, y) / Σ_{x=1}^{H} Σ_{y=1}^{W} Hm_t^i(x, y),
CW_t^i = Σ_{x=1}^{H} Σ_{y=1}^{W} y · Hm_t^i(x, y) / Σ_{x=1}^{H} Σ_{y=1}^{W} Hm_t^i(x, y),

where H and W denote the height and width of the frame and are indexed by x and y, Hm_t^i denotes the heatmap of the ith control node in the tth frame, obtained using Equation (8) in Section 2.3, and CH_t^i and CW_t^i denote the centroid coordinates N_c2,t^i of the ith control node in the tth frame in the height and width directions, respectively.

FIGURE 6 Color wheel conversion for RGB visualization of optical flow with arrow and modulus annotation.
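The centroid equations above reduce to an intensity-weighted mean over the heatmap; a minimal sketch:

```python
import numpy as np

def heatmap_centroid(hm):
    """Centroid (CH, CW) of a heatmap: intensity-weighted mean over rows
    (height direction) and columns (width direction)."""
    ys, xs = np.mgrid[0:hm.shape[0], 0:hm.shape[1]]
    total = hm.sum()
    return (ys * hm).sum() / total, (xs * hm).sum() / total
```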

FIGURE 7 MLP architecture of the Nodes2PoseNet subnetwork. MLP, multilayer perceptron.

The MLP (Figure 7) is trained with a self-supervised strategy without manual annotations (details are explained in Section 2.5). According to Equation (10), each video frame generates a pair of input (the flattened 2D centroid coordinates of the N_nodes control nodes) and output (the structural pose parameters of all control sections). Ten neurons are included in each of the four hidden layers. Compared with the Structure-PoseNet architecture in the previous study, 30 the Nodes2PoseNet subnetwork can be directly transferred from the training data sets to actual recognition scenarios. The artificially generated data in ParaNet of Structure-PoseNet cannot perfectly simulate some real-world scenarios because of slight differences in structural morphology and prediction variations between semantic segmentation masks from synthetic environments and actual videos. Therefore, prediction errors exist in real-world recognition using Structure-PoseNet trained on such data. As a comparison, the Nodes2PoseNet subnetwork utilizes the centroid coordinates of a few control nodes as input, which only includes the necessary information and avoids uncontrollable noise compared with semantic segmentation masks of all pixels.

TABLE 2 The 3D real-world and 2D image coordinates of control nodes in the initial structural model.

FIGURE 10 Representative recognition results of dense optical flow by the NodesEstimate subnetwork. NodesEstimate, control node estimation subnetwork.
FIGURE 11 Training loss descending curve for the Nodes2PoseNet subnetwork. MSE, mean-square error; Nodes2PoseNet, pose parameter recognition subnetwork.
FIGURE 12 Comparison results of recognized and LVDT displacement time-histories and frequency distributions in BM16. FFT, fast Fourier transform; LVDT, linear variable differential transformer; PGAs, peak ground accelerations.
FIGURE 13 Comparison results of recognized and LVDT displacement time-histories and frequency distributions in BM18.
FIGURE 14 Comparison results of recognized and LVDT displacement time-histories and frequency distributions in BM19.
FIGURE 15 Comparison results of recognized and LVDT displacement time-histories and frequency distributions in BM22.

(iv) The flattened 2D coordinates of the control nodes are utilized as input to the Nodes2PoseNet subnetwork according to Equation (10), which outputs the predicted H and R; the regression loss with respect to the ground-truth values Ĥ and R̂ is calculated, and the Nodes2PoseNet subnetwork is updated by the Adam algorithm.

FIGURE 16 Comparison results of recognized and LVDT displacement time-histories and frequency distributions in BM25. FFT, fast Fourier transform; LVDT, linear variable differential transformer; PGAs, peak ground accelerations.

Four LVDTs (i.e., DX1, DX2, DX3, and DX4) were installed to measure the displacement in the vibration direction with a sampling frequency of 256 Hz. A fixed camera was set at a distance of 8 m from the building model, with a video frame rate of 60 Hz and an original resolution of 2160 × 3840. Five videos, namely, BM16, BM18, BM19, BM22, and BM25, were recorded under different earthquake ground motions and light illuminations.

Representative recognition results of dense optical flow by the NodesEstimate subnetwork are shown in Figure 10. The number above each subfigure represents the frame number in the video, and the structural motion direction can be intuitively observed from the different hues in the RGB map of the optical flow field. The length of the optical flow vector is normalized to the range between 0 and 1. The hyperparameters for training the Nodes2PoseNet subnetwork are set as follows. The upper and lower bounds of the uniform distribution used to randomly generate the structural pose parameters H and R are set to cover the expected motion range of the shaking table tests. Adam optimization is utilized with an initial learning rate of 0.0001, a batch size of 5, 100 iterations per epoch, and 50 training epochs. A learning rate decay strategy is adopted, halving the rate every 10 epochs. The loss descending curve during the training process of the Nodes2PoseNet subnetwork is shown in Figure 11.
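The stated learning-rate schedule (starting at 1e-4 and halved every 10 epochs) can be expressed as a one-line function:

```python
def learning_rate(epoch, base_lr=1e-4, decay_every=10):
    """Step-decay schedule: the rate is halved once per decay_every epochs."""
    return base_lr * 0.5 ** (epoch // decay_every)
```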

Figures 12-16 show the comparison between the displacement time-histories recognized by the proposed Nodes2STRNet and those measured by the LVDTs at DX1-4, together with the corresponding frequency distributions obtained by the fast Fourier transform (FFT) algorithm. The results show that the multistory displacements recognized by the proposed Nodes2STRNet match well with those of the LVDTs under different peak ground accelerations. The prediction error tends to be larger at the bottom of the structure; one possible reason is that some parts at the bottom are missing in specific video frames, leading to recognition inconsistency in actual videos using a well-trained model.
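The frequency comparison can be reproduced with a one-sided FFT amplitude spectrum. `displacement_spectrum` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def displacement_spectrum(disp, fs):
    """One-sided FFT amplitude spectrum of a displacement time history.
    disp: samples; fs: sampling rate in Hz (e.g., 256 for the LVDTs)."""
    n = len(disp)
    amp = np.abs(np.fft.rfft(disp)) / n * 2       # single-sided amplitude
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)        # frequency axis in Hz
    return freqs, amp
```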

FIGURE 17 Dense displacement recognition results by the proposed Nodes2STRNet in BM16. (A) Schematic of dense displacement measurement points MP.1-17 (unit: millimeter). (B) Recognized time histories of dense displacement. PGA, peak ground acceleration.

Figure 17 shows the dense displacement recognition results by the proposed Nodes2STRNet in BM16, in which MP.1-17 represent the dense measurement points along the height direction of the structure. More results for BM18, BM19, BM22, and BM25 are provided in the Appendix.

Figure 20 compares the effects of light condition variations on the robustness of Structure-PoseNet and Nodes2STRNet. Apparent pixel-level noise exists in the semantic segmentation masks of Structure-PoseNet for BM25 compared with BM16. In contrast, the corresponding control nodes are well identified by Nodes2STRNet with good robustness. The results validate that the proposed Nodes2STRNet is more robust to light condition variations among different video frames. Similarly, low video quality can cause significant recognition noise in the semantic segmentation masks of Structure-PoseNet, affecting the recognition accuracy of dense structural displacements. As a comparison, the proposed Nodes2STRNet is flexible with respect to the video resolution because the NodesEstimate subnetwork does not require a training process, whereas Structure-PoseNet requires a fixed input resolution and must be retrained when faced with a new video of a different resolution. Table 3 compares the average root-mean-square error (RMSE) and pixel-wise root-mean-square error (RMSE-PX) for DX1-4 using Structure-PoseNet 30 and the proposed Nodes2STRNet. The results show that for BM16, BM19, BM22, and BM25, the proposed Nodes2STRNet achieves significant improvements, further validating its effectiveness.

The proposed Nodes2STRNet comprises two subnetworks, NodesEstimate and Nodes2PoseNet. The NodesEstimate subnetwork utilizes each video frame as input, generates the dense optical flow field based on FlowNet 2.0, and outputs the structural control nodes. Nodes2PoseNet uses the structural control nodes as input and predicts the structural pose parameters using an MLP. DSMMs are generated according to the structural pose parameters with motion representation for the entire structure, in which the structural control nodes are utilized as intermediate connections between the two subnetworks. The dense displacements of the structure can finally be obtained from the DSMMs in different video frames. A self-supervised learning strategy is designed to train the Nodes2PoseNet subnetwork. An MSE loss with L2 regularization is adopted to calculate the regression error between the ground-truth and predicted structural pose parameters of all control nodes. The recognition effectiveness and accuracy of dense displacement, as well as the robustness to light condition variations, are validated by shaking table tests of a four-story frame structure scale model. The results show that the average RMSE-PX of the multistory displacement histories using the proposed Nodes2STRNet ranges from 0.16 to 0.69 pixels. Compared with the image-segmentation-based Structure-PoseNet using all pixel information, the proposed Nodes2STRNet gains higher robustness against light condition variations with a more straightforward self-supervised training process using only a few control nodes. In addition, the proposed Nodes2STRNet obtains higher flexibility in the input video resolution because the NodesEstimate subnetwork applies a pretrained FlowNet 2.0 to generate the dense optical flow field without additional training when faced with new scenarios. The effects of the P wave and S wave on structural dense displacement recognition based on DSMM and motion representation by the proposed Nodes2STRNet will be investigated in a future study.

FIGURE A2 Dense displacement recognition results by Nodes2STRNet in BM19. PGA, peak ground acceleration.
FIGURE A3 Dense displacement recognition results by Nodes2STRNet in BM22. PGA, peak ground acceleration.