Autonomous flight strategy of an unmanned aerial vehicle with multimodal information for autonomous inspection of overhead transmission facilities

This study proposes an innovative method for achieving autonomous flight to inspect overhead transmission facilities. The proposed method not only integrates multimodal information from novel sensors but also addresses three essential aspects to overcome the existing limitations in the autonomous flight of unmanned aerial vehicles (UAVs). First, a novel deep neural network architecture, titled the rotational bounding box with a multi-level feature pyramid transformer, is introduced for accurate object detection. Second, a safe autonomous method for approaching the transmission tower is proposed using multimodal information from an optical camera and 3D light detection and ranging. Third, a simple yet accurate control strategy is proposed for tracking transmission lines without necessitating gimbal control, because it keeps the UAV's altitude in sync with that of the transmission lines. Systematic analyses conducted in both virtual and real-world environments confirm the effectiveness of the proposed method. The proposed method not only enhances the performance of autonomous flight but also provides a safe operating platform for inspection personnel.

A control method for vibration was proposed by considering external dynamic loads (Li & Adeli, 2018). A wavelet neural network-based nonlinear vibration control algorithm (Wang & Adeli, 2015) and a dynamic control method (Gutierrez Soto & Adeli, 2018) have also been proposed to secure the stability and durability of infrastructure. Studies have further been conducted to propose novel deep neural networks, including the estimation of concrete compressive strength (Rafiei et al., 2017) and construction cost estimation (Rafiei & Adeli, 2018).
To secure the overall safety and reliability of infrastructure, appropriate strategies for operation and maintenance (O&M) are also important because most infrastructure, including transmission facilities, bridges, and railways, has a lifespan of over 20 years while being exposed to severe environments that are not perfectly considered in the design phase. Note that appropriate O&M can ensure that such infrastructure maintains its functionality over the designed lifespan and continues to operate safely and efficiently, because deteriorating infrastructure suffers from various failures, including cracks, corrosion, and deformations (Y. Xu et al., 2018).
For crack detection, a method was proposed using a Gabor filter (Salman et al., 2013) and effective image-processing methods including denoising, sharpening, and edge detection (Y. Zhang, 2014). For corrosion, a method was proposed to distinguish corroded from uncorroded regions using a gray-level co-occurrence matrix (Medeiros et al., 2010) and wavelet transformation (Jahanshahi & Masri, 2013). For deformation, a real-time road inspection method was proposed through morphological refinement (Koch & Brilakis, 2011). Methods for evaluating the performance of suspension structures (Zhou et al., 2022) and measuring deformations in 3D data-based structures have also been proposed (H. S. Park et al., 2007; S. W. Park et al., 2015). A further study integrated inspection methods for vibration control (Javadinasab et al., 2021).
Diverse efforts have also included notable studies in the application of novel sensors because innovative sensors are being leveraged to bring new dimensions to infrastructure inspection, offering enhanced capabilities in detecting and diagnosing issues. Specifically, studies have been conducted on geometric structure and location analysis through 3D scanning (Esmorís et al., 2023). Studies have also included the estimation of inclination with a 3D scanner in transmission towers (Lu et al., 2022) and the development of deformation detection models for structures through evolutionary learning (Oh et al., 2017). Novel sensors including infrared and corona cameras have been employed in monitoring power infrastructure (Sriram & Sudhaker, 2021). Methods have been proposed for detecting overheating (Ha et al., 2011) and corona discharge (Davari et al., 2020) in power facilities. These proactive approaches with advanced sensor technologies underscore the commitment to maintaining the integrity and efficiency of power infrastructure.
These efforts have improved the efficiency and accuracy of infrastructure inspections. However, these inspection protocols cannot solve the difficulty of accessing large infrastructure, especially infrastructure located in complex terrain. This limited accessibility exposes inspectors to dangerous situations, including electric shock and fall risks, especially in the transmission line corridor, which is an ultrahigh-voltage and high-altitude environment. These concerns could be addressed by robotic transportation systems. In particular, unmanned aerial vehicles (UAVs) have gained attention because UAVs can easily access infrastructure at any location while deploying novel noncontact sensors (Pastor et al., 2007). Specifically, specialized UAVs have been proposed for a variety of applications, including deflection estimation of bridges (Zhuge et al., 2022), railway track detection (Tong et al., 2023), inspection of large buildings (Mader et al., 2016), assessment of contaminant distribution and mobility (Martin et al., 2016), and monitoring defects in photovoltaic power plants (Libra et al., 2019). These trials mark the beginning of a new era in the O&M of infrastructure. However, these inspection methods were installed on UAVs that mostly deployed a manual control method, resulting in several challenges. Specifically, manual operation of UAVs depends highly on the skill and experience of the operator, which might result in inconsistent results. Manual inspection also requires continuous attention and control, which can be taxing and may lead to human errors, especially in complex or hazardous environments. Moreover, manual operation limits the scalability of inspections, as each UAV requires a dedicated operator, reducing the efficiency and potential for widespread deployment. Hence, these issues suggest that innovative autonomous flight methods should be studied for effective inspection of infrastructure with UAVs.
Autonomous flight methods for UAVs are predominantly categorized based on the sensors employed. Notably, considerable attention has been given to the global positioning system (GPS), 3D light detection and ranging (LiDAR), and vision sensors because these sensors have proven to be effective in the development of autonomous flight methods.
GPS-guided autonomous flight in UAVs primarily revolves around using satellite signals to navigate and maintain the UAV's position. This technology is crucial for tasks requiring geo-location precision, such as surveying large areas or following predetermined flight paths. Studies in this field have focused on enhancing the accuracy and reliability of GPS navigation, especially in challenging environments where signal interference is common (Cui & Ge, 2003). An integration method for GPS reception data was proposed to enhance the positional accuracy of low-cost GPS receivers (Islam & Kim, 2014). A regression method was also proposed through a robust Gaussian process to enhance the accuracy of GPS (Lin et al., 2019). Despite this progress, challenges persist in dealing with the uncertainty of GPS signals and the limitations in measuring the coordinates of the infrastructure under inspection, which are essential for establishing the UAV's flight path.
3D LiDAR can generate a detailed 3D map of the environment, which is crucial in areas where GPS signals are weak or unavailable. In particular, this sensor provides high-precision distance measurements and is thereby effective for obstacle detection, navigation, and detailed terrain mapping. Studies in this area have focused on optimizing point cloud data (PCD) processing for real-time applications (Wurm et al., 2010). Signal-processing methods deploying a graphics processing unit (GPU) and a method for real-time LiDAR data analysis (J. Zhang & Singh, 2017) have also been proposed. Studies have further been conducted to develop navigation systems, including an autonomous ground vehicle navigation method (Pfrunder et al., 2017) and a machine learning approach for the autonomous navigation of UAVs (Tullu et al., 2021). However, the limited measurement range and environmental sensitivity of 3D LiDAR are significant constraints. These limitations can be particularly challenging when intensively exploring wide areas, raising the need to integrate 3D LiDAR with other sensors to provide a more comprehensive understanding of the environment.
Vision sensor-guided autonomous flight in UAVs leverages cameras and sophisticated image-processing methods to enable UAVs to perceive and interact with their environment dynamically. Studies have focused on enhancing computer vision capabilities to make UAV navigation more adaptive and reliable (Fujiyoshi et al., 2019). A modular and generic system was proposed through computer vision to improve the decision-making processes of UAVs (Alsalam et al., 2017). A flight method was proposed to assess bolt loosening of infrastructure (Pan et al., 2023). Additionally, from an odometry perspective, studies have been conducted to estimate the position and orientation of UAVs, including robust visual-inertial odometry for fast autonomous flight (Sun et al., 2018) and a red-green-blue-depth (RGB-D) camera-based visual positioning system (H. Zhang et al., 2021). However, limited 2D information is a significant drawback of this approach, limiting its capability to capture the comprehensive 3D aspects of the environment. This limitation significantly challenges the accurate discernment of spatial relationships and depth, aspects that are particularly critical in infrastructure inspection, which demands high safety standards. Additionally, the reliance on visual data often results in limitations under low-visibility conditions, such as fog or darkness, hindering the UAV's operational efficiency. Hence, these drawbacks result in the need for multimodal information to enhance the robustness and accuracy of autonomous navigation in UAVs.
The necessity of using multimodal information arises from the inherent limitations of individual sensor modalities. These limitations include issues such as occlusion, limited field of view (FOV), and susceptibility to adverse environmental conditions. To address these challenges and enhance the robustness of autonomous flight, extensive studies have been conducted on the integration of multiple sensor modalities. Specifically, a multi-sensor calibration method was proposed based on Gaussian-process-estimated moving target trajectories (Peršić et al., 2021). An object distance estimation method was proposed through an accurate fusion approach based on a geometrical transformation and projection method (Kumar et al., 2020). An object-tracking method was proposed through adaptive multi-sensor fusion, considering the specific properties and limitations of different sensor types (Lombaerts et al., 2022). However, the effectiveness of multimodal information hinges on the accuracy and reliability of handling each individual modality. This necessity has spurred considerable interest in the application of deep neural networks because they excel in learning complex patterns and correlations from datasets, achieving near-human levels of performance across various domains.
In response to this necessity, researchers have increasingly turned to deep neural networks. Recently, extensive studies have been conducted in the field of deep neural networks (Martins et al., 2020). These achievements include developing optimized models for precise classification (Rafiei & Adeli, 2017), finite element machine classifiers (Pereira et al., 2020), learning algorithms for ensemble design (Alam et al., 2020), self-supervised learning for electroencephalography (Rafiei et al., 2023), and models for classifying electroencephalography signals (Hassanpour et al., 2019). Studies have also been conducted to develop deep neural networks for UAVs, including a tracking method using several cameras (Unlu et al., 2019) and an obstacle detection method for autonomous flight (Dionisio-Ortega et al., 2018). These studies contribute to developing deep neural networks that are widely used in the autonomous flight of intelligent transportation systems. In particular, object detection is paramount in autonomous flight because it directly impacts the UAV's capability to navigate and perform its tasks safely. Specifically, optical images have been widely used for object detection with convolutional neural networks (CNNs) employing irrotational bounding boxes to detect and track objects of interest. However, this approach presents several challenges when detecting infrastructure because of inherent limitations. First, an irrotational bounding box might include significant background noise because infrastructure has various aspect ratios, which can decrease object detection accuracy. Second, basic CNN models have limitations in handling global features. They often struggle with spatial hierarchical structures and cannot consistently perform at a high level across objects of varying scales. These limitations are particularly problematic in detecting complex facilities in infrastructure.
Despite intensive studies on numerous autonomous flight methods, autonomous flight in the overhead transmission line corridor remains predominantly reliant on GPS-guided technologies (J.-Y. Park et al., 2020) because of several critical factors. The overhead transmission line corridor extends over vast and varied terrain, demanding navigational reliability and robustness on the level provided by GPS systems, which offer consistent coverage over extensive areas. This factor poses a significant challenge for systems integrating multimodal information from 3D LiDAR and vision sensors, especially in the overhead transmission line corridor, where safety is paramount. These systems require not only advanced hardware but also sophisticated algorithms for data processing, which must operate flawlessly in a wide range of environmental conditions. Moreover, practical considerations currently favor GPS-guided systems, including operational simplicity and the lower cost of implementation and maintenance. However, it is undeniable that the efficient field implementation of autonomous flight in the overhead transmission line corridor depends on the successful integration of multimodal information, because two critical limitations still exist in inspection methods using GPS-guided autonomous flight in actual field applications. First, the inspection time is significantly long because of two preliminary tasks: measuring the coordinates of the transmission towers and understanding the surrounding environment around the transmission lines to secure a flight path. Second, gimbal control is required to properly align the inspection sensors with the transmission lines because the sag of transmission lines varies with environmental temperature. This configuration necessitates the addition of accessories, including motors and a frame, to the UAV system, which in turn increases the UAV's weight, making it challenging to ensure extended flight times. To overcome these limitations, this study proposes an autonomous flight strategy for a UAV with multimodal information for the autonomous inspection of overhead transmission facilities. The novelty and major contributions of this study are as follows.
1. The proposed autonomous flight strategy uniquely employs multimodal information from the novel sensors of 3D LiDAR and an optical camera. The 3D LiDAR provides 3D geometric information around overhead transmission facilities, and the optical camera supplements the sparse detail of the PCD from the 3D LiDAR, resulting in more comprehensive information for accurate autonomous navigation of UAVs around power transmission facilities. Note that this strategy eliminates the need for preliminary preparations, significantly reducing inspection time.
2. A novel deep neural network, titled the rotational bounding box with multi-level feature pyramid (RoMP) Transformer, is proposed for object detection. This neural network focuses on detecting transmission towers and insulator strings, which have high aspect ratios, in optical images. The implementation of this neural network enables the UAV to control its altitude and direction when approaching the transmission tower and defines the start and end points for tracking the transmission line. A comparative study underscores the superiority of the RoMP Transformer in object detection compared with other neural networks.
3. Effective environmental cognition and signal-processing methods enable accurate extraction of the curved features of transmission lines and thereby achieve precise tracking of transmission lines. This flight strategy involves maintaining the UAV's altitude at the same level as the transmission lines, enabling effective inspection of the transmission facilities without gimbal control. Furthermore, flight without gimbal control allows for a lightweight UAV configuration, enabling extended flight times.
4. Extensive experiments in virtual and field environments confirm that the proposed method successfully completes all missions. Quantitative analysis further demonstrates that the proposed method exhibits performance superior to GPS accuracy. This feature is attributed to the outstanding performance of the novel deep neural network and the management strategy for the PCD.
The remainder of this paper is organized as follows. In Section 2, the autonomous approach to transmission towers and autonomous tracking of transmission lines are explained. Section 3 describes virtual and field experiments conducted to validate the proposed method. In Section 4, the results of the virtual and field experiments are analyzed and discussed. Finally, Section 5 summarizes the conclusions and discusses possible future study directions.

METHOD
The proposed autonomous flight method aims to inspect overhead transmission facilities with minimal intervention by the inspectors in two phases (Figure 1). In the first phase, the UAV approaches the transmission tower of interest without GPS information (① in Figure 1) based on optical images from the optical camera. A novel neural network for object detection, the RoMP Transformer, is employed to detect transmission towers around the UAV. An inspector (electrician) selects the transmission tower of interest for inspection. Subsequently, the UAV approaches the transmission tower by comparing the relative direction of the transmission tower detected by the neural network with the actual direction of the UAV head. The control strategy of this phase shifts to the next phase when the distance between the UAV and tower converges to a safe distance, so as not to lose control of the UAV, because the electromagnetic force emitted from the live-line transmission lines distorts the compass in the inertial measurement unit (IMU). The distance between the UAV and the tower of interest is estimated using the PCD from the 3D LiDAR. Several signal-processing methods are employed to detect the tower of interest from the PCD. In the second phase, the UAV tracks transmission lines based on the PCD measured by the 3D LiDAR (② in Figure 1). Specifically, a voxel map (VM) of the surrounding environment is generated by the UAV turning approximately 360° in place to extract a pathway for the transmission lines. Next, the UAV tracks the transmission lines from one tower to the other by maintaining a constant safe distance from the transmission lines at the same height as the transmission lines, based on the extracted pathway. The other tower is recognized from the insulator strings at the end of the transmission line through the neural network for object detection with optical images from the optical camera, and then the UAV moves to the next transmission line for inspection with the measured PCD.

The entire procedure suggests that the RoMP Transformer plays a critical role in the detection of transmission facilities based on optical images, significantly contributing to both phases of the transmission tower approach and the transmission line tracking. Note that the optical camera (the blue sector in Figure 1) and the 3D LiDAR (the red sector in Figure 1) are not only used individually in different phases but also together as multimodal information (the purple sector in Figure 1). This usage of multimodal information aims to create a more effective system by leveraging the unique advantages of each sensor, because optical cameras excel in capturing high-resolution visual details and color information for object recognition, while 3D LiDAR provides accurate distance measurements but offers sparser data than the optical camera. Note also that the proposed method involves two manual selections for the convenience of inspectors: the designation of the transmission tower after the UAV takes off and the inspection direction based on the tower when tracking lines. These manual selections do not come at the expense of the novelty of the proposed method because this process can be automated by implementing a specific inspection protocol that instructs the UAV to sequentially approach and inspect all detected transmission towers upon take-off. This automation can also include a thorough tracing process on both sides of each tower, facilitating a comprehensive inspection of all adjacent transmission line corridors by repeatedly taking off and landing. In other words, the proposed method could easily evolve toward full automation, enhancing efficiency and coverage in transmission tower inspections. The next subsection describes the RoMP Transformer in brief, and the detailed methods used in both phases are described in the following subsections.
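The two-phase procedure above can be summarized as a minimal state machine. The `Phase` names and the distance-threshold rule below are illustrative assumptions for the sketch, not the paper's implementation:

```python
from enum import Enum


class Phase(Enum):
    APPROACH_TOWER = 1   # camera-guided tower approach (phase 1 in Figure 1)
    TRACK_LINES = 2      # LiDAR-guided line tracking (phase 2 in Figure 1)


def next_phase(phase, tower_distance_m, safe_distance_m):
    """Toy phase-transition rule: switch from the approach phase to the
    line-tracking phase once the LiDAR-estimated distance to the tower
    converges to the safe stand-off distance (threshold is illustrative)."""
    if phase is Phase.APPROACH_TOWER and tower_distance_m <= safe_distance_m:
        return Phase.TRACK_LINES
    return phase
```

In the real system the transition also protects the compass: staying at or beyond the safe distance keeps the IMU out of the strongest electromagnetic field around the live lines.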

RoMP Transformer
The RoMP Transformer is a novel deep neural network for detecting overhead transmission facilities, including transmission towers and insulator strings, in optical images for the autonomous flight of UAVs. The architecture of the RoMP Transformer features four key characteristics (Figure 2).

FIGURE 2 Architecture of the rotational bounding box with multi-level feature pyramid (RoMP) Transformer for object detection. DIoU, distance intersection over union; mSKEWIoU, modified SKEW intersection over union; MSML, multi-scale and multi-level; MLP, multi-layer perceptron; LiDAR, light detection and ranging; RANSAC, random sample consensus; UAV, unmanned aerial vehicle.

First, a rotational bounding box minimizes distortion from the background image when detecting objects in different environments (① in Figure 2). Object detection using irrotational bounding boxes ignores the orientation of an object in an image. Hence, the background image inside the irrotational bounding boxes provides unnecessary features during training, resulting in low accuracy when a neural network detects objects of interest in different environments. In particular, transmission towers and insulator strings include a significant amount of background image because of their high aspect ratios (D. Kim et al., 2021), suggesting that a neural network with an irrotational bounding box is not appropriate for an autonomous UAV inspecting power facilities. Notably, the rotational bounding box not only enables the neural network to detect objects with a high aspect ratio but also secures robustness in detecting objects in different environments, implying that this method is effective for an autonomous UAV inspecting power facilities. Note also that the rotational bounding box includes parameters for the center point (x, y), size and shape (w, h), and angle θ (① in Figure 2), suggesting that only one additional parameter, the angle θ, is added to an irrotational bounding box. Therefore, the computational efficiency of the rotational bounding box is similar to that of the irrotational bounding box.
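As a concrete illustration of this five-parameter box, the following sketch converts the (center, size, angle) parameterization into corner points. The function name and the radian convention are assumptions for illustration, not the paper's implementation:

```python
import numpy as np


def rotated_box_corners(cx, cy, w, h, theta):
    """Return the 4 corner points of a rotational bounding box
    parameterized by center (cx, cy), size (w, h), and angle theta
    in radians, as described for the RoMP Transformer."""
    # Corner offsets of the axis-aligned box, centered at the origin.
    dx, dy = w / 2.0, h / 2.0
    offsets = np.array([[-dx, -dy], [dx, -dy], [dx, dy], [-dx, dy]])
    # Rotate each offset by theta, then translate to the box center.
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return offsets @ rot.T + np.array([cx, cy])
```

Setting theta to zero recovers an ordinary irrotational box, which is why the extra parameter adds almost no computational overhead.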
Second, a multi-scale and multi-level (MSML) feature pyramid network effectively constructs MSML feature maps to detect objects of various sizes and different levels of complexity (② in Figure 2). The feature extraction of the RoMP Transformer is executed by the MSML feature extraction module, which adopts the architecture of a multi-level convolutional autoencoder. The MSML features are extracted as

$F_i^l = G\big(C_l(F_i^{l-1}),\, F_b\big), \quad l = 1, \dots, L$  (1)

where $F_i^l$ and $F_b$ denote the feature at the $i$th scale in the $l$th convolutional autoencoder and the base feature, respectively; $G$ and $C_l$ denote the feature fusion process over the shallow and deep features extracted by $C_l$ and the $l$th convolutional operation, respectively; and $L$ denotes the level, that is, the number of convolutional autoencoders. Then, the MSML feature fusion module integrates the features extracted through the convolutional autoencoders using concatenation and elementwise computation-based $1 \times 1$ convolution ($1 \times 1\,Conv$). Concatenation combines features along the channel axis, while the $1 \times 1\,Conv$ reduces the channel axis expanded by concatenation. These computational methods aim to integrate features of various sizes and complexities. The concatenated feature map can be presented as $F_{cat} = [F^1, F^2, F^3, \dots, F^L]$, where $F^l = (F_1^l, F_2^l, F_3^l, \dots, F_i^l) \in \mathbb{R}^{H_i \times W_i \times C}$ refers to the features at each scale in the $l$th convolutional autoencoder. Then, the concatenated features are condensed through an elementwise operation on the channel axis by executing the $1 \times 1\,Conv$, resulting in the final MSML feature map. Note that the multi-scale architecture is effective for constructing a multi-scale feature map by extracting a variety of features from objects of different sizes. Moreover, the multi-level architecture is effective because it concatenates shallow and deep feature maps to preserve semantic information. In other words, the architecture of the MSML feature pyramid network enables the RoMP Transformer to detect both large simple objects and small complex objects.
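The concatenate-then-compress fusion step can be sketched in a few lines of NumPy. The random weight and the function name are illustrative assumptions (a trained network would learn the 1×1 kernel):

```python
import numpy as np


def fuse_msml_features(features, out_channels, rng=None):
    """Sketch of the MSML feature-fusion step: concatenate feature maps
    along the channel axis, then compress channels with a 1x1 convolution.

    `features` is a list of arrays of shape (H, W, C_l), one per level,
    all sharing the same spatial resolution H x W.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    concat = np.concatenate(features, axis=-1)           # (H, W, sum of C_l)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    weight = rng.standard_normal((concat.shape[-1], out_channels))
    return concat @ weight                               # (H, W, out_channels)
```

The key design point this illustrates is that concatenation widens the channel axis while the 1×1 convolution restores a fixed channel count, so downstream layers see the same tensor shape regardless of how many levels were fused.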
Third, the pyramid vision transformer (PVT) correlates local pixel positions with each feature and embeds this information into the MSML feature maps to enhance object detection performance (③ in Figure 2). Specifically, the PVT combines various scales of vision transformer blocks with MSML feature maps as input. The PVT includes three stages: patch embedding, position embedding, and a transformer encoder. Patch embedding groups a set of pixels into one unit and treats it as a single pixel in a 2D image. The patch embedding process in the PVT reshapes an image $X \in \mathbb{R}^{H \times W \times C}$ into $\mathbb{R}^{N \times (P^2 C)}$, where $H$ and $W$ denote the input image resolution, $C$ and $P$ denote the number of channels and the size of the patch, and $N = HW/P^2$ denotes the total number of patches. Positional embedding allows the multi-layer perceptron to operate with 2D position information. The positional embedding process in the PVT is calculated as

$PE(pos, 2i) = \sin\!\big(pos/10000^{2i/d}\big), \qquad PE(pos, 2i+1) = \cos\!\big(pos/10000^{2i/d}\big)$  (2)

where $pos$ and $d$ denote the position of the patch and the entire dimension of the flattened features, respectively. This process prevents the loss of 2D location information caused by the flattening process in the PVT. The transformer encoder performs attention operations using the key $K$ denoting the main pixel value, the query $Q$ denoting the set of pixels providing information, and the semantic result value $V$ for the key. The three values $Q$, $K$, and $V$ are calculated using the parameters $W_Q$, $W_K$, and $W_V$, where $W_Q$, $W_K$, and $W_V$ denote the query matrix, the key matrix, and the value matrix, respectively. Attention in the PVT using spatial reduction attention (SRA) is calculated as

$\mathrm{SRA}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{Q\,\mathrm{SR}(K)^{T}}{\sqrt{d_{head}}}\right)\mathrm{SR}(V)$  (3)

where $\mathrm{SR}(\cdot)$ denotes the spatial-reduction operation. This SRA method significantly reduces the computational cost of the PVT, optimizing computational efficiency. Shifted window partitioning is also included in the PVT to mitigate the inductive bias problem in vision transformer-based neural networks (Z. Liu et al., 2021). The PVT utilizes a single-level neural network, that is, the value of $L$ in Equation (1) is specified as unity, whereas the MSML feature extraction uses a value of two. Finally, the PVT fuses the features extracted in the vision transformer layer by executing concatenation and $1 \times 1$ convolution. The concatenated feature map is presented as $F_{cat} = [F^1, F^2, F^3, \dots, F^L]$, where $F^l = (F_1^l, F_2^l, F_3^l, \dots, F_i^l) \in \mathbb{R}^{H_i \times W_i \times C}$ refers to the feature at the $i$th scale in the PVT. Hence, the PVT strengthens the feature maps by fusing the relative positional information of each pixel with the MSML features, thereby improving both the prediction accuracy and robustness of object detection.
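The patch-embedding reshape ($N = HW/P^2$) and the sinusoidal positional embedding can be sketched as follows. This is a minimal NumPy illustration of the standard operations, not the paper's PVT code:

```python
import numpy as np


def patch_embed(image, P):
    """Reshape an H x W x C image into N = HW / P^2 flattened patches,
    each of dimension P^2 * C, as in the PVT patch-embedding stage."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)        # (N, P^2 * C)


def positional_embedding(N, d):
    """Sinusoidal positional embedding: sin on even feature indices,
    cos on odd indices, for N patch positions and dimension d."""
    pos = np.arange(N)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Adding `positional_embedding(N, d)` to the flattened patches restores the 2D location information that the flattening step would otherwise discard.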
Fourth, bounding box optimization effectively localizes the classified objects (④ in Figure 2). Bounding box localization should be optimized by considering the characteristics of the object of interest because training a neural network might result in an offset between the predicted bounding boxes and the ground truth. The proposed method improves the localization of detected objects by fusing two intersection-over-union (IoU) calculation methods: the modified SKEW IoU ($mSKEWIoU$) and the distance IoU ($DIoU$). The fused loss is calculated as

$L_{IoU} = 1 - mSKEWIoU + DIoU$  (4)

where $L_{IoU}$ denotes the IoU loss with $mSKEWIoU$ and $DIoU$. The $mSKEWIoU$ is subtracted and the $DIoU$ is added in Equation (4) at the training stage because $mSKEWIoU$ and $DIoU$ converge to unity and zero, respectively, when the predicted box coincides with the ground truth. The $mSKEWIoU$ method calculates the intersection area ($IArea$) using the Shoelace formula (Braden, 1986), which sums the areas of multiple triangles between the prediction box and the ground truth. The value of the $IArea$ is calculated as

$IArea = \dfrac{1}{2}\left|\displaystyle\sum_{i=1}^{n}\big(x_i y_{i+1} - x_{i+1} y_i\big)\right|$  (5)

where $x_i$ and $y_i$ denote the $x$ and $y$ coordinates of the intersection points between the rotational bounding boxes. The union area ($UArea$), which refers to the whole area of the prediction box and the ground truth, is calculated as

$UArea = A_{gt} + A_{pred} - IArea$  (6)

where $A_{gt}$ and $A_{pred}$ denote the ground truth area and the prediction area of the instances. The value of the $mSKEWIoU$ between the prediction box and the ground truth box is presented as $mSKEWIoU = IArea/UArea$. Note that the $mSKEWIoU$ significantly reduces the computational complexity of the SKEW IoU (Huang et al., 2018). Moreover, the $DIoU$ calculation method determines the level of the intersection based on the distance between the center coordinates of the ground truth box and the predicted bounding box ($c_{gt}$ and $c_{pred}$, ④ in Figure 2) to minimize the offset (Zheng et al., 2020). The $DIoU$ is calculated as

$DIoU = \dfrac{\rho^2(c_{pred},\, c_{gt})}{\lVert c_{\max} - c_{\min}\rVert^2}$  (7)

where $c_{\max}$ and $c_{\min}$ denote the maximum coordinate values $(x_{\max}, y_{\max})$ and the minimum values $(x_{\min}, y_{\min})$ of the smallest box enclosing both bounding boxes (④ in Figure 2), and $\rho(\cdot)$ denotes the Euclidean distance. These two IoU calculation methods reflect the distance between the centers and the correlation by calculating the $IArea$ between the two bounding boxes. Therefore, the proposed bounding box optimization method, fusing the $mSKEWIoU$ and the $DIoU$, improves object localization and classification.
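A minimal NumPy sketch of the fused loss for the axis-aligned special case (θ = 0) is given below; the Shoelace formula computes the polygon areas, and the DIoU term is the center-distance penalty. The function names and the simplified rectangle intersection are assumptions for illustration (the full method intersects rotated polygons):

```python
import numpy as np


def shoelace_area(pts):
    """Polygon area via the Shoelace formula over vertices (x_i, y_i)."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))


def fused_iou_loss(box_p, box_g):
    """Sketch of L = 1 - IoU + DIoU for axis-aligned boxes (cx, cy, w, h)."""
    def corners(b):
        cx, cy, w, h = b
        return np.array([[cx - w/2, cy - h/2], [cx + w/2, cy - h/2],
                         [cx + w/2, cy + h/2], [cx - w/2, cy + h/2]])
    cp, cg = corners(box_p), corners(box_g)
    # Intersection rectangle of the two axis-aligned boxes.
    lo = np.maximum(cp.min(axis=0), cg.min(axis=0))
    hi = np.minimum(cp.max(axis=0), cg.max(axis=0))
    inter = np.array([[lo[0], lo[1]], [hi[0], lo[1]],
                      [hi[0], hi[1]], [lo[0], hi[1]]])
    iarea = shoelace_area(inter) if np.all(hi > lo) else 0.0
    uarea = shoelace_area(cp) + shoelace_area(cg) - iarea   # Equation (6)
    iou = iarea / uarea
    # DIoU penalty: squared center distance over squared enclosing diagonal.
    d2 = (box_p[0] - box_g[0]) ** 2 + (box_p[1] - box_g[1]) ** 2
    enc_lo = np.minimum(cp.min(axis=0), cg.min(axis=0))
    enc_hi = np.maximum(cp.max(axis=0), cg.max(axis=0))
    c2 = np.sum((enc_hi - enc_lo) ** 2)
    return 1.0 - iou + d2 / c2
```

As the text notes, the loss vanishes when the prediction coincides with the ground truth (IoU reaches unity and the center-distance penalty reaches zero), and any offset increases it.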

Transmission tower approaching
The UAV approaches the transmission tower with information from the optical images first and the PCD second.This subsection describes an autonomous flight method for approaching a transmission tower of interest using optical images and PCD sequentially.
In the first phase ((1) in Figure 1), the UAV takes off to a sufficient height at an appropriate location, enabling the optical camera mounted on the UAV to survey the transmission towers around it. The take-off location must be selected so that the UAV does not enter the electromagnetic field, because the risk of entering the field increases as the angle between the UAV's path toward the transmission tower and the transmission lines decreases. The minimum angle for the UAV to avoid entering the electromagnetic field is 42° for all transmission lines. It is also necessary to take off from a position approximately 130 m away so that the targeted transmission tower can be captured from the take-off station within the vertical FOV of the optical camera, because transmission towers generally range in height from 30 to 100 m. Additionally, the take-off height can be set using geographic and transmission tower information provided by Korea Electric Power Corporation (KEPCO). The take-off height is expressed as

$$h = d \tan\!\left(\frac{\theta_v}{2}\right) + m$$

where $h$, $\theta_v$, $d$, and $m$ denote the take-off height, the vertical FOV of the optical camera, the distance between the targeted transmission tower and the take-off station, and a margin based on geographic and transmission tower information, respectively. The RoMP Transformer detects the transmission towers around the UAV as it turns approximately 360° in place (Figure 3a). When the RoMP Transformer detects a transmission tower, the relative direction $\psi$ of the tower with respect to true north is calculated as

$$\psi = \omega t + \theta_h\!\left(\frac{x_c}{w} - \frac{1}{2}\right)$$

where $\omega$, $t$, $\theta_h$, $w$, and $x_c$ denote the rotational speed of the UAV, the rotation time elapsed from the start of rotation until the transmission tower is detected, the horizontal FOV of the optical camera, the width of the optical image, and the width coordinate of the center of the bounding box, respectively. The rotation speed $\omega$ is predetermined by an inspector (electrician). The horizontal FOV $\theta_h$ and width $w$ are provided by the specification sheet of the optical camera. Hence, the rotation time $t$ and the bounding box center coordinate $x_c$ are calculated for each image frame to compute the relative direction $\psi$ of the transmission tower with respect to true north. A predefined margin angle is used to cluster transmission towers into the same group. Transmission towers in the same cluster are recognized as one transmission tower to avoid recognizing a single tower as multiple towers when errors occur in real-time data communication. The relative direction $\psi$ with respect to true north that represents a cluster is set to the average of the estimated $\psi$ values for that cluster.
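As an illustrative sketch of the take-off height and relative-direction relations described above (function names, degree-based units, and the clustering margin default are my own choices, not the paper's):

```python
import math

def takeoff_height(distance_m, vertical_fov_deg, margin_m):
    """Take-off height needed to keep a distant tower inside the vertical FOV,
    assuming the camera looks horizontally from the take-off station."""
    half_fov = math.radians(vertical_fov_deg) / 2.0
    return distance_m * math.tan(half_fov) + margin_m

def tower_direction_deg(omega_deg_s, t_s, h_fov_deg, img_width_px, bbox_cx_px):
    """Relative direction (from true north) of a tower detected while the UAV
    rotates in place: rotation angle so far plus the angular offset of the
    bounding box center from the image center."""
    offset = h_fov_deg * (bbox_cx_px / img_width_px - 0.5)
    return (omega_deg_s * t_s + offset) % 360.0

def cluster_directions(directions_deg, margin_deg=5.0):
    """Group detections whose directions differ by less than margin_deg and
    represent each cluster by its mean direction (simple 1-D clustering)."""
    clusters = []
    for d in sorted(directions_deg):
        if clusters and d - clusters[-1][-1] < margin_deg:
            clusters[-1].append(d)
        else:
            clusters.append([d])
    return [sum(c) / len(c) for c in clusters]
```

A detection centered in the frame contributes zero offset, so the estimated direction reduces to the accumulated rotation angle, matching the clustering step that averages repeated detections of the same tower.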
In the second phase ((2) in Figure 1), an autonomous flight path is generated as a straight line between the UAV and the transmission tower of interest when the inspector (electrician) selects that tower. The UAV then flies autonomously and detects transmission tower images using the RoMP Transformer. A detailed flowchart of approaching the transmission tower with optical images is shown in (a) in Figure 3b. Specifically, the UAV approaches the transmission tower by controlling the relative direction of the tower, recognized by the RoMP Transformer, to correspond to the center of the image frame. The UAV is controlled to minimize the error $e$ (Figure 3a) between the tower detected by the RoMP Transformer and the center of the image frame in real time. Notably, the RoMP Transformer fails to detect the transmission tower when the UAV approaches within approximately 100 m, because the optical image can no longer capture the complete shape of the tower, causing the UAV to lose its pathway. This phenomenon occurs because the RoMP Transformer was trained only on fully shaped transmission towers in the images. In this case, the control strategy shifts to the next phase based on two criteria. First, the height of the bounding box of the previously detected transmission tower is checked against a threshold of three-fourths of the height of the optical image. This criterion is effective because the transmission tower nearly fills the optical image as the UAV approaches it. Second, the number of consecutive misses in detecting the transmission tower is counted and compared with a predefined threshold of three to compensate for incorrect cognition of the RoMP Transformer. The control strategy shifts to the next phase when both criteria are fulfilled. In this process, images from the optical camera and PCD from the 3D LiDAR are processed concurrently in real time for smooth and uninterrupted phase changes.
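The two hand-off criteria can be expressed as a small check; the three-fourths ratio and the three-miss threshold come from the text, while the function and argument names are illustrative:

```python
def should_switch_to_lidar(bbox_height_px, img_height_px, consecutive_misses,
                           height_ratio=0.75, miss_threshold=3):
    """Return True when control should hand over from the image-based approach
    to the PCD-based approach: the last detected tower nearly filled the frame
    AND the detector has since missed it several frames in a row."""
    nearly_full = bbox_height_px >= height_ratio * img_height_px
    lost_track = consecutive_misses >= miss_threshold
    return nearly_full and lost_track
```

Requiring both conditions prevents a single spurious miss far from the tower (or a momentary large bounding box) from triggering the phase change prematurely.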
In the third phase, the relative location of the transmission tower with respect to the UAV is calculated from the PCD measured by the 3D LiDAR ((b) in Figure 3b). Specifically, the 3D PCD representing the ground is first removed by applying random sample consensus (RANSAC) ((b)-① in Figure 3b). RANSAC analyzes an entire dataset by repeatedly fitting a model to randomly selected samples from the dataset. This method is effective for analyzing the PCD because it is fast and robust to distortion caused by outliers in the PCD (Fischler & Bolles, 1981). Specifically, this study used the planar model

$$ax + by + cz + d = 0$$

to detect the PCD representing the ground and then removed the points satisfying the planar model, because such points represent the ground (Jeong et al., 2020). Note that the proposed method eliminates ground PCD not only in flat terrain but also in mountainous regions, because a plane model with a proper distance threshold can distinguish ground points in both flat and complex terrains. This feature ensures that the proposed method covers a broad range of geographical characteristics. Second, Euclidean distance clustering (EDC; Yadav & Sharma, 2013) is used to remove noisy PCD ((b)-② in Figure 3b). EDC is a clustering method in which the Euclidean distance between two points is calculated, and the two points are assigned to the same cluster when the distance is less than a specific threshold. EDC is adopted in the proposed method because it is fast and simple for removing noisy PCD around the transmission tower. Then, a coordinate transformation is executed to convert the 3D LiDAR coordinate system ($x_L$, $y_L$, and $z_L$ in Figure A1 in Appendix A) to the UAV coordinate system ($x_U$, $y_U$, and $z_U$ in Figure A1 in Appendix A), considering the posture of the UAV and the hardware configuration (Figure A1 in Appendix A). This transformation is executed at the end of the second step because transforming only the PCD necessary for autonomous flight optimizes the memory consumption of the embedded computer. Third, the transmission tower is detected using the characteristics of the PCD representing the transmission tower ((b)-③ in Figure 3b). Specifically, the PCD of a transmission tower features a high vertical density compared with that of the surrounding environment and a steep vertical increase, suggesting that these characteristics can be used to localize the tower. The point with the highest number of neighboring points is selected by counting the points within a specific distance $d_{th}$, a predefined threshold chosen considering the dimensions of the transmission tower. Next, a region is created by generating a grid of square cells with a side length of $2d_{th}$, centered on the selected point in the sky view, to select candidates for the PCD representing the transmission tower. The characteristic of steep ascent of a transmission tower is also used to avoid misrecognition, because the PCD of a large tree also shows a high vertical density. A level of ascent $A$ is evaluated from the elevation difference with the surrounding environment and is calculated as

$$A = z_{\max} - \frac{1}{8}\sum_{i=1}^{8} z_{\max,i}$$

where $z_{\max}$ and $z_{\max,i}$ denote the highest elevation among the points located in the region with high vertical density and the highest elevation among the points located in the $i$th of the surrounding eight regions, respectively. The level of ascent $A$ is thus based on the average elevation difference, and the region is considered to contain the PCD of the transmission tower when $A$ exceeds a predefined threshold chosen considering the height of the transmission tower. This approach is effective because transmission towers are typically tens of meters taller than the surrounding trees. Finally, the location of the transmission tower is determined as the center of the PCD designated as the tower. Next, the location of the transmission tower relative to the UAV is calculated again, and the UAV is controlled to move toward the tower. The distance between the UAV and the transmission tower is compared in real time with the safe distance at which the UAV is not affected by the magnetic field generated around the overhead transmission corridor. Hence, the proposed control strategy terminates when the distance between the UAV and the transmission tower converges to the safe distance.
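A minimal sketch of the tower-localization step (assuming the ground and noise points have already been removed by RANSAC and EDC; the brute-force neighbor count and all names are illustrative, not the paper's implementation):

```python
import numpy as np

def ascent_level(points, d_th):
    """Estimate the level of ascent around the densest point of a
    (ground-removed, denoised) point cloud.

    1) pick the point with the most neighbors within d_th in the sky view;
    2) build a 3x3 grid of square cells (side 2*d_th) around it;
    3) return the center cell's max elevation minus the mean of the max
       elevations of the eight surrounding cells.
    """
    xy, z = points[:, :2], points[:, 2]
    # densest point in the horizontal plane (O(n^2) brute force for clarity)
    counts = (np.linalg.norm(xy[:, None] - xy[None, :], axis=2) < d_th).sum(1)
    cx, cy = xy[np.argmax(counts)]
    cell = 2.0 * d_th
    z_center, z_around = 0.0, []
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            mask = (np.abs(xy[:, 0] - (cx + i * cell)) <= cell / 2) & \
                   (np.abs(xy[:, 1] - (cy + j * cell)) <= cell / 2)
            zmax = z[mask].max() if mask.any() else 0.0
            if i == 0 and j == 0:
                z_center = zmax
            else:
                z_around.append(zmax)
    return z_center - np.mean(z_around)
```

A tall, vertically dense cluster (a tower) yields a large positive value, whereas a broad canopy of similar height in neighboring cells (a large tree line) does not, which is the disambiguation the paper relies on.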

Transmission line tracking
The UAV tracks the transmission lines at the same height and at a constant distance from one tower to another by recognizing the transmission lines based on the PCD measured from the 3D LiDAR and by detecting the insulator strings that indicate the start and end points of the transmission lines. Tracking the transmission lines at the same height is crucial because it eliminates the need for gimbal control, reducing weight and thereby ensuring long flight times. This subsection describes the detailed procedure of the transmission line tracking method, which comprises two phases ((4) and (5) in Figure 1) using optical images and PCD from the optical camera and 3D LiDAR in real time.
In the first phase ((4) in Figure 1), the transmission lines of interest are selected by the inspector (electrician), who determines the direction of inspection between the right and left sides of the transmission lines from the tower. The UAV then turns approximately 360° in place to generate a 3D VM of the surrounding environment, including the transmission tower and lines. This phase is executed because extracting the curved features of transmission lines from one PCD frame is difficult, as the inherent characteristics of the 3D LiDAR limit the horizontal FOV to 40°. Hence, turning 360° in place is a simple yet effective way to generate a 3D VM for recognizing transmission lines and thereby extracting their curved features around the tower of interest. Notably, the 3D VM is generated by executing a GPU-oriented environmental cognition method with plane segmentation to increase computational efficiency, because the embedded computer must generate the VM and perform object detection simultaneously with the measured PCD and optical images (S. Kim et al., 2021). Specifically, the GPU-oriented environmental cognition method comprises three steps. First, the PCD representing the ground is removed by applying RANSAC ((a)-① in Figure 4a). This step aims to reduce the computational load required to generate the VM and is consistent with the method used in Section 2.2 ((b)-① in Figure 3b). Second, a coordinate transformation is executed to convert the 3D LiDAR coordinate system ($x_L$, $y_L$, and $z_L$ in Figure A1 in Appendix A) to the global coordinate system using the odometry of the UAV measured by the IMU mounted on the UAV ((a)-② in Figure 4a). This step is executed because the coordinate transformation enables the utilization of the structural features of the transmission facilities and optimizes the memory consumption of the embedded computer. Third, a probabilistic downsampling method is executed to generate the VM because voxelization effectively uses GPU memory by decreasing the computational load ((a)-③ in Figure 4a; S. Kim et al., 2021). Specifically, a VM is generated in a new area, and the occupancy probability of each voxel is estimated based on Bayes' theorem (Wurm et al., 2010) as follows:

$$P(n \mid z_{1:t}) = \left[1 + \frac{1 - P(n \mid z_t)}{P(n \mid z_t)} \cdot \frac{1 - P(n \mid z_{1:t-1})}{P(n \mid z_{1:t-1})} \cdot \frac{P(n)}{1 - P(n)}\right]^{-1} \tag{11}$$

where $P(n \mid z_{1:t})$, $n$, $P(n)$, $P(z_{1:t})$, $z_{1:t}$, and $z_t$ denote the occupancy probability of a voxel given the sensor inputs from the first to the $t$th state, the event that the voxel is occupied, the prior probability that the voxel is occupied, the probability of the sensor inputs from the first to the $t$th state, the sensor inputs from the first to the $t$th state, and the sensor input at the $t$th state, respectively. Dividing the conditional probability $P(\bar{n} \mid z_{1:t})$ by the conditional probability $P(n \mid z_{1:t})$ yields

$$\frac{P(\bar{n} \mid z_{1:t})}{P(n \mid z_{1:t})} = \frac{P(\bar{n} \mid z_t)}{P(n \mid z_t)} \cdot \frac{P(\bar{n} \mid z_{1:t-1})}{P(n \mid z_{1:t-1})} \cdot \frac{P(n)}{P(\bar{n})} \tag{12}$$

Equation (12) can be rewritten logarithmically as

$$L(n \mid z_{1:t}) = L(n \mid z_{1:t-1}) + L(n \mid z_t) \tag{13}$$

with the log-odds $L(n) = \log\left[P(n)/(1 - P(n))\right]$, where a uniform prior $P(n) = 0.5$ (i.e., $L(n) = 0$) is assumed. The occupancy is updated in real time considering the occupancy at the current $t$th state based on Equation (13). The method for assigning $P(n \mid z_{1:t})$ to update the occupancy status of voxels is divided into three classes. First, when a voxel occupied at the current $t$th state exists in the VM at the $(t-1)$th state, a positive value is assigned to update the state $P(n \mid z_{1:t})$, indicating the continued occupancy of the voxel (① in Figure 4b). Second, when a voxel occupied at the current $t$th state does not exist in the VM at the $(t-1)$th state, it is recognized as a newly measured voxel, and an occupancy rate of 0.5 is assigned to it (② in Figure 4b). Finally, when a voxel is not occupied at the current $t$th state but exists in the VM at the $(t-1)$th state, a negative value is assigned to update the state $P(n \mid z_{1:t})$, thereby reflecting the change in the voxel's occupancy status (③ in Figure 4b). This process of updating the occupancy plays a significant role in efficiently removing noise around the transmission lines. The method enables the real-time extraction of transmission lines because it leverages the GPU. The transmission lines, updated in real time, significantly aid the UAV in tracking them effectively.
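A minimal sketch of the log-odds voxel update and the three-class assignment rule (the per-hit and per-miss probabilities and the pruning bound are illustrative values, not the paper's):

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

class VoxelMap:
    """Minimal log-odds occupancy map over hashable voxel keys."""
    HIT, MISS, NEW = logit(0.7), logit(0.4), logit(0.5)  # NEW == 0.0

    def __init__(self):
        self.log_odds = {}

    def update(self, observed_occupied, observed_free):
        for v in observed_occupied:
            if v in self.log_odds:             # class 1: still occupied
                self.log_odds[v] += self.HIT
            else:                              # class 2: newly measured -> 0.5
                self.log_odds[v] = self.NEW
        for v in observed_free:
            if v in self.log_odds:             # class 3: previously occupied,
                self.log_odds[v] += self.MISS  # now free -> decrease
                if self.prob(v) < 0.2:         # prune voxels judged as noise
                    del self.log_odds[v]

    def prob(self, v):
        """Recover the occupancy probability from the accumulated log-odds."""
        return 1.0 / (1.0 + math.exp(-self.log_odds[v]))
```

Because the update in Equation (13) is a simple per-voxel addition, it maps naturally onto the GPU-parallel processing the paper describes; transient noise voxels receive negative increments and are quickly discarded.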
In the second phase, a flight path is generated and tracked by obtaining the curved features of the transmission lines ((5) in Figure 1). Specifically, the cognition method for transmission lines comprises four steps ((b) in Figure 4a). First, voxels representing the transmission tower facing the UAV are eliminated from the VM so that the embedded computer handles only the voxels representing the transmission lines ((b)-① in Figure 4a). Note that the VM then comprises only voxels representing transmission lines because the voxels representing the environment and the transmission towers have already been removed. Second, a coordinate transformation is executed to convert the 3D VM of the transmission lines into a 2D VM in the sky view by eliminating the $z_G$ values of the 3D VM (the $x_G y_G$ plane in Figure A1 in Appendix A). Note that the transmission lines appear straight in this 2D VM because their curvature due to gravity lies in the vertical direction of the 3D VM. Hence, the transmission lines can be recognized through RANSAC with the line model $y = ax + b$ in this map to determine the horizontal control direction of the UAV between two transmission towers ((b)-② in Figure 4a). The parallel and orthogonal directions of the transmission lines are defined by the $x_G$ and $y_G$ coordinates, respectively. Third, the 2D VM is transformed back into a 3D VM by restoring the $z_G$ values, and the transmission lines are then clustered according to their height in the $z$ direction through the EDC. This step aims to obtain the curved features of each transmission line located at the top, middle, and bottom. Note that this step also eliminates noisy voxels because the EDC acts as a noise filter ((b)-③ in Figure 4a). Fourth, a coordinate transformation is executed to convert the 3D VM of the transmission lines into a 2D VM in $x_G z_G$ coordinates using the information obtained in the second step. This step improves the computational efficiency of extracting a pathway along the transmission lines because the voxels representing the transmission lines are curved in $x_G z_G$ coordinates owing to gravity. Next, the voxels representing the transmission lines are subjected to second-order polynomial curve fitting via RANSAC ((b)-④ in Figure 4a) with the reference model $z = ax^2 + bx + c$ to extract the pathway of the transmission lines. Finally, this step is terminated by calculating the tangential direction of the transmission line, $\tan^{-1}(2ax + b)$, to determine the direction of the autonomous flight of the UAV. Hence, the UAV moves along the tangential direction of the transmission line extracted in real time. Simultaneously, the RoMP Transformer detects insulator strings to determine the beginning and end of one span of a transmission line because the insulator strings serve as markers indicating the start and end points of a span. This implies that the images from the optical camera and the PCD from the 3D LiDAR are processed concurrently in real time. The entire procedure of the second phase is repeated three times because one span of the transmission lines generally comprises top, middle, and bottom transmission lines (Jeong et al., 2020). This repetition can change depending on the configuration of the transmission lines. During line tracking, the altitude difference between the UAV and the adjacent point in the PCD of the transmission lines is calculated to maintain the same height by computing the error between the UAV and the transmission line of interest. The error is fed back to adjust the UAV height by controlling the speed along the Z-axis ($z_U$ in Figure A1 in Appendix A). This control strategy improves the accuracy of the transmission line tracking by dynamically adjusting the vertical velocity of the UAV based on the tracking error. The UAV can return to its home point when all transmission lines have been inspected using the aforementioned procedure.
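The sky-view line fit, sag-curve fit, and tangential direction can be sketched as follows (plain least squares stands in for the paper's RANSAC fits, and the names are illustrative):

```python
import numpy as np

def line_tracking_direction(voxels_xyz):
    """Extract the flight direction along a sagging transmission line.

    1) sky view: fit y = a*x + b to the (x, y) voxel coordinates to get
       the horizontal heading between the two towers;
    2) vertical plane: fit z = a*s**2 + b*s + c along the horizontal
       distance s and take the tangent angle atan(2*a*s + b) at the UAV's
       current position along the line.
    """
    x, y, z = voxels_xyz.T
    a_h, _b_h = np.polyfit(x, y, 1)           # horizontal line model
    heading = np.arctan2(a_h, 1.0)            # heading in the x-y plane
    s = np.hypot(x - x[0], y - y[0])          # distance along the line
    a, b, _c = np.polyfit(s, z, 2)            # sag model z = a*s^2 + b*s + c
    def tangent_pitch(s_now):
        return np.arctan(2.0 * a * s_now + b)  # vertical tracking angle
    return heading, tangent_pitch
```

Near mid-span the tangent angle passes through zero (the lowest point of the sag), so the commanded vertical velocity naturally reverses sign as the UAV crosses it.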

EXPERIMENTS
This section describes the three experiments conducted in this study. First, field experiments were conducted to collect data and evaluate the performance of the proposed autonomous flight method. Second, the developed hardware-in-the-loop simulation (HILS) system is described because this study validated the proposed method using a HILS system to avoid accidents before conducting field experiments. Third, the construction of the RoMP Transformer, which plays a critical role in several phases of the proposed autonomous flight method, is described.

Field experiments
Currently, the inspection protocol of KEPCO using the UAV, which was established by the expert system, mandates that operations be carried out only under favorable weather conditions, specifically in the absence of snow, rain, or fog. Furthermore, it must be ensured that the wind speed remains below 10 m/s (Nam, 2020). The rationale behind this strict protocol is grounded in the significant safety risks posed by adverse weather. In bad weather, the flight stability of UAVs is substantially degraded, which elevates the risk of UAV crashes (Gao et al., 2021). Such incidents not only threaten the safety of operating personnel but also pose a serious threat to nearby residential areas. Therefore, all field experiments in this study were carried out under good weather conditions following the protocol. While these requirements may limit UAV inspections, the method proposed in this study offers notable advantages in ensuring the safety of the inspector (electrician). This emphasis aligns with the overarching goal of UAV applications in various sectors, particularly where safety and precision are of utmost importance. The findings of this study underscore the need to consider weather conditions as a crucial factor in planning and executing UAV operations, particularly in contexts where the accuracy of data and the safety of operations are nonnegotiable. Table 1 lists the three types of field experiments. Weather conditions are represented by the amount of cloud cover and the wind speed, with the cloud cover rated on a scale from 0 to 10. As mentioned above, all field experiments were conducted under good weather conditions with a cloud cover of 5 or less and wind speeds of 10 m/s or less. First, several field experiments were conducted for image acquisition during the period 2017-2020 (the first five sites in Table 1). These experiments aimed to acquire sufficient images for the training, validation, and testing of the RoMP Transformer. The image sets of the transmission facilities were recorded using a customized UAV supported by the Korea Electric Power Corporation Research Institute. Specifically, the UAV was positioned 30 m away from the transmission facilities to prevent distortion of the compass deployed in the UAV by the electromagnetic field energized by the transmission lines and to capture high-quality images at 12X zoom. All flights were conducted in autopilot mode using the waypoint method (J.-Y. Park et al., 2020). The measured images were recorded at different resolutions using an FDR-AX100 camera (Sony). Four sites, namely, Asan-Yesan, Asan-Hwasung, Shinseosan-Shinanseong (SS), and Shingosung-Tongyeong (ST), were explored using a customized UAV, resulting in 15,251 images of transmission facilities with a resolution of 1920 × 1080. Another site, namely, Daeduck-Duckjin (DD), was explored using a UAV, resulting in 1004 images of transmission facilities with a resolution of 1440 × 1080. The transmission facilities of interest include five classes: transmission towers, insulator strings, stock bridge (SB) dampers, spacers, and marker balls. The total number of transmission facilities in the entire image set is 28,726: 3909 tower images, 10,181 insulator string images, 8399 SB damper images, 5400 spacer images, and 837 marker ball images.
Second, field experiments were conducted to measure the PCD of transmission facilities in different environments. These PCD aim to build a virtual environment for use in the HILS. Hence, validation in the HILS system ensures high confidence and fidelity to real environments around live-line transmission lines because the virtual environment originates from the scanned PCD of actual power facilities. Specifically, a 3D LiDAR (Velodyne VLP-16C) was deployed on an M600 (DJI; Jeong et al., 2020). The same protocol as for image acquisition was used to measure the 3D PCD around the DD 5-10 transmission line. These measurements were saved as a 3D VM using Octomap voxelization (Wurm et al., 2010) for efficient data storage, resulting in 543,846 voxels in the corridors.
Finally, two types of field experiments were conducted to validate the performance of the proposed autonomous flight method. First, field experiments were conducted at the Gochang Power Test Center (GPTC) without imposing a voltage of 154 kV on the transmission lines (Figure 5a). The experiments at the GPTC aimed to check the applicability of the proposed autonomous flight method under actual field conditions, except for imposing an actual high voltage on the transmission lines. Note that the GPTC is located in flat terrain, implying that the ground condition is moderate. The autonomous flight method was tested five times to ensure the repeatability of the proposed method. These field experiments were executed on a dead line, that is, a line to which no voltage was applied, because distortion of the compass deployed on the UAV might result in a loss of controllability, leading to accidents. Second, field experiments were conducted six times at the DD 6-7 corridor to validate the performance of the proposed method on live transmission lines (Figure 5b). Therefore, the method proposed in Section 2 was fully tested experimentally. Note also that the DD 6-7 corridor is located in a mountainous region, meaning that the ground condition was more complex than that of the GPTC. Hence, the experiments at the DD 6-7 corridor could validate the robustness and generality of the proposed method. All the information regarding the real-time autonomous flight was measured, including 56,193 images and 36,726 PCD frames, to evaluate the performance of the proposed method. Note that a direct comparison of the proposed method with other methods is not feasible because the autonomous flight strategies are fundamentally different: existing UAV-based inspection systems rely on GPS information, whereas the proposed method does not. Instead, the accuracy of the proposed method was compared with that of GPS because the accuracy of the autonomous flight methods of existing UAV-based inspections follows that of GPS. The evaluation was conducted using the acquired images and PCD, with detailed results described in Section 4.3. All autonomous flights were performed in a sequence of flying through the top, middle, and bottom transmission lines at both sites (Figure 5).

HILS system
This subsection describes the construction of the HILS system in detail. The HILS comprises a UAV equipped with a single-board computer (SBC; described in Appendix A) and a personal computer (PC) that hosts the virtual environment. Specifically, the SBC on the UAV and the PC hosting the virtual environment communicate with each other in real time (Figure 6a). The PC transmits images and PCD from the virtual environment to the SBC on the UAV in real time. Based on the received data, the SBC on the UAV generates control signals and transmits them back to the PC. Hence, a virtual UAV flies around the power transmission lines in the virtual environment on the PC based on these control signals. Note that all communicated data are handled as robot operating system-based messages (Stanford Artificial Intelligence Laboratory et al., 2018).
The virtual environment was built on a desktop computer with an Intel Xeon 8-core CPU @ 2.1 GHz and a GeForce GTX 1050 Ti @ 1.3 GHz. This virtual environment was constructed with the PCD measured at the 154 kV DD 5-10 corridors and computer-aided design (CAD) files of the transmission facilities. The detailed procedure for constructing the virtual environment is shown in Figure 6b. First, rendering was executed after removing the voxels representing the complex shapes of the overhead transmission facilities from the 3D VM measured at the DD 5-10 corridors because these voxels distort the rendering. The overhead transmission facilities were then generated from the 3D CAD model and inserted into the rendered map at the same position and posture. Transmission line models were constructed with different sag values to verify the robustness of the proposed autonomous flight method, reflecting the variation in the sag of transmission lines due to changes in environmental temperature. The sags of the constructed transmission lines were set from 12 to 17 m at intervals of 1 m. Finally, the real hardware configuration of the UAV equipped with an optical camera and 3D LiDAR was replicated in the HILS system, implying that the virtual UAV can fly around the transmission lines based on the control signals transmitted from the flight controller of the real UAV.
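As a hedged sketch of how line geometries with prescribed sag values might be generated (the paper builds them from CAD models; the parabolic approximation, the 300 m span, and all names here are my own assumptions):

```python
import numpy as np

def line_profile(span_m, sag_m, n_points=50):
    """Parabolic approximation of a suspended conductor: vertical drop below
    the supports as a function of position x along the span, with the maximum
    drop (the sag) at mid-span: drop(x) = 4*sag*x*(span - x) / span**2."""
    x = np.linspace(0.0, span_m, n_points)
    drop = 4.0 * sag_m * x * (span_m - x) / span_m**2
    return x, drop

# e.g., one profile per sag value used in the HILS environment (12-17 m)
profiles = {sag: line_profile(300.0, float(sag)) for sag in range(12, 18)}
```

The parabola is a common small-sag approximation of the catenary; sweeping the sag parameter reproduces the temperature-driven geometry variation the HILS experiments are meant to cover.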

RoMP Transformer construction
This subsection describes the construction of the RoMP Transformer in detail. A GPU server with two Intel Xeon processors was used for training. The images used for training and validation had a resolution of 1920 × 1080 pixels, whereas those for testing had a resolution of 1440 × 1080 pixels. This difference is deliberate: using images of a different resolution for testing clearly confirms the robustness of the RoMP Transformer. This study addresses two methods for optimizing the accuracy and real-time performance of the RoMP Transformer. First, Bayesian optimization (BO; Frazier, 2018) was used for hyperparameter optimization because the hyperparameters play a critical role in determining the accuracy and robustness of the RoMP Transformer. BO was selected because it outperforms other hyperparameter optimization methods, including grid search and the genetic algorithm, owing to its tendency to converge to optimal hyperparameters quickly. Specifically, the hyperparameters of the RoMP Transformer, including the early stopping epoch, optimizer, learning rate, momentum, patch, autoencoder level, autoencoder scale, and transformer layers, were optimized on the transmission facility image sets (Table 2). The learning rate and momentum are hyperparameters within the optimizer that determine how the weights of a deep learning model are adjusted at each iteration, considering the momentum calculated at the previous iteration. The patch size, autoencoder level, autoencoder scale, and transformer layers are hyperparameters that determine the number of layers and nodes in the RoMP Transformer. The patch specifies the size of the patches into which the feature map is partitioned and determines the number of nodes in a transformer block.
The autoencoder level and scale refer to the levels and scales of the MSML feature pyramid network in the RoMP Transformer. Finally, the transformer layers determine the number of layers in the PVT for utilizing a multi-scale transformer. Note that the maximum scales and levels of the autoencoder were limited to six and three, respectively, because of the specifications of the hardware resources. More details on BO are described in Appendix B. Second, a half tensor was used for testing the object detection model embedded in the flight controller to increase the frames per second (FPS) for real-time applications. Note that double, float, and half tensors perform operations with 64-bit, 32-bit, and 16-bit precision, respectively. Specifically, a tensor value is calculated as

$$(-1)^{\mathrm{sign}} \times 2^{e} \times s$$

where $e$ and $s$ denote the exponent and significand, respectively. The sign determines whether the tensor value is positive or negative, and all tensor types have a 1-bit sign. The exponent $e$ and significand $s$ determine the magnitude and precision of the number, respectively, and their bit widths determine the tensor type. The RoMP Transformer was trained with float tensors during the training and validation phases, whereas half tensors were employed only for the testing phase. This is because float tensors must be employed in the training phase to reach the global minimum; however, such a large number of bits is not necessary during the testing phase.
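A small demonstration of the float-to-half cast used at deployment (NumPy stands in for the actual inference framework, and the names are illustrative):

```python
import numpy as np

def to_half(weights):
    """Cast trained float32 weights to float16 for deployment: memory and
    bandwidth halve, while values are quantized to the 10-bit significand
    (about three decimal digits of precision)."""
    return weights.astype(np.float16)

# 65504.0 is the largest finite float16 value and is exactly representable
w32 = np.array([3.14159265, -0.001234, 65504.0], dtype=np.float32)
w16 = to_half(w32)
```

This mirrors the paper's split: training stays in 32-bit precision to preserve small gradient updates, while inference tolerates the coarser 16-bit quantization in exchange for higher FPS.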

RESULTS AND DISCUSSION
This section presents the results of the field experiments with an in-depth discussion. First, the entire procedure is presented with measurements from the field experiments to demonstrate that the proposed method successfully completes the mission, that is, autonomous flight, in actual fields.
Second, the superiority of the RoMP Transformer is discussed using the test image sets measured for the transmission facilities. Third, the accuracy of the transmission tower approach and transmission line tracking methods is evaluated using the experimental data.

Entire procedure
Figure 7 shows the results of the field experiments at the GPTC with all the phases executed. Supplemental Video A also provides the recorded results of an autonomous flight at the GPTC captured by an extra UAV. Note that an extra UAV was permitted only at the GPTC site because recording images around the live-line transmission facilities at the DD site would raise safety issues. This is why this demonstration is shown at the GPTC.
In the first phase, the relative directions of the transmission towers with respect to true north were calculated by detecting the transmission towers through the RoMP Transformer while the UAV turned approximately 360° in place (Figure 7 (1)). All seven transmission towers around the UAV were detected by the RoMP Transformer. The seventh tower was selected as the transmission tower of interest by the electrician to execute the next mission.
In the second phase, the UAV approached the transmission tower of interest. The UAV was controlled to minimize the error $e$ between the tower detected by the RoMP Transformer and the center of the image frame in real time (Figure 7 (2)). Hence, the UAV could approach the tower of interest while maintaining a constant altitude.
In the third phase, the relative position between the tower of interest and the UAV was calculated in real time based on the 3D PCD when the height of the bounding box of the detected transmission tower exceeded three-fourths of the height of the optical image and the RoMP Transformer failed to detect the transmission tower in three consecutive images. Based on the 3D PCD measurements, the transmission tower of interest was detected (orange box in Figure 7 (3)) by executing a series of steps (Figure 8). First, the 3D LiDAR acquired the PCD of the transmission towers in real time within its measurement range (Figure 8 (1)). Most of the acquired PCD represented the ground, resulting in large memory consumption. Hence, the PCD representing the ground was removed in the second step by applying RANSAC to process the PCD efficiently in real time (Figure 8 (2)). Third, the EDC was used to remove noise in the PCD (Figure 8 (3)). Finally, the relative location of the transmission tower was detected based on its characteristic of having a high vertical density compared with the surrounding environment and a steep increase in the vertical direction (Figure 8 (4)). The UAV approached the transmission tower of interest until it reached a safe distance based on the relative position of the tower. The direction of the UAV was controlled while maintaining a constant height, guided by the tower information measured from the PCD.
In the fourth phase, the transmission lines of interest were selected by the inspector (electrician), who determined the direction of inspection between the right and left sides of the transmission lines from the tower. Subsequently, a 3D VM without ground voxels was generated by turning approximately 360° in place (Figure 7 (4)). Hence, the VM includes only voxels representing the transmission facilities.
In the fifth phase, the UAV was controlled to track the transmission lines in real time based on the pathway extracted from the VM (Figure 7 (5)) by executing several steps (Figure 9). First, the VM of the surrounding environment was eliminated using the GPU-oriented environmental cognition method (Figure 9 (1)). Second, the voxels representing the transmission tower were eliminated using the method proposed in Section 2.3 so that only voxels representing the transmission lines were handled (Figure 9 (2)). This step was executed to decrease memory consumption, similar to the fourth step of the third phase (Figure 8 (4)). Third, a coordinate transformation was executed to convert the 3D VM of the transmission lines into a 2D VM in the sky view (Figure 9 (3)). This step aimed to determine the horizontal control direction of the UAV between the two transmission towers. Fourth, the 2D VM was transformed back into a 3D VM (Figure 9 (4)), and the transmission lines were then clustered according to their altitude in the z direction through the EDC to obtain the curved features of each transmission line located in the top, middle, and bottom regions (Figure 9 (5)). Finally, RANSAC was used to calculate the curve equation to extract the UAV pathway (red-boxed inset in Figure 9). The UAV was controlled to track the tangential direction of the transmission line and initially tracked the top line. Simultaneously, the RoMP Transformer detected the insulator strings indicating the start and end points of a span of the transmission line so that the UAV could move to the next line (Figure 7 (5′)). Specifically, the UAV considered one transmission line terminated when insulator strings were detected, and the UAV was then controlled to align with the insulator string connected to the next transmission line to inspect it.
By repeating the aforementioned procedure, the UAV can scan the top, middle, and bottom transmission lines at the same altitude as the lines, using only the images and PCD. This strategy enables the UAV to inspect transmission facilities without active gimbal control, simplifying the inspection system and decreasing its weight.
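The gimbal-free strategy works because the UAV itself is commanded to the altitude of the tracked line, keeping the line inside a fixed camera FOV. A minimal proportional vertical-rate sketch (the gain `k_p` and rate limit are hypothetical, not the authors' values):

```python
def altitude_command(uav_z, line_z, k_p=0.8, max_rate=1.0):
    """Vertical-velocity command (m/s) that keeps the UAV at the altitude
    of the tracked transmission line, so the line stays in a fixed FOV."""
    error = line_z - uav_z  # positive -> line is above the UAV
    return max(-max_rate, min(max_rate, k_p * error))
```

With the UAV 2 m below the line, the command saturates at the maximum climb rate; small errors produce proportionally small corrections, which is what lets the platform dispense with an actively stabilized gimbal.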

Superiority of the RoMP Transformer
This subsection demonstrates the superiority of the RoMP Transformer using the test image sets. A performance comparison was conducted using the metrics of average precision (AP), mean AP (mAP; Davis & Goadrich, 2006), and FPS for five transmission facilities. Note that these components should be inspected during periodic maintenance (Ferraro, 2015), suggesting that the RoMP Transformer can be used for fault detection in power facilities based on object detection in future work.
Table 3 presents the results of the ablation studies. First, the effectiveness of a rotational bounding box was analyzed in comparison with that of an irrotational bounding box (Table 3 ①). Notably, the RoMP Transformer that utilizes the rotational bounding box outperforms the one that uses an irrotational bounding box. Specifically, the architecture with a rotational bounding box shows a significantly higher prediction accuracy, with a 5.6% increase in mAP, compared to that with an irrotational bounding box. This analysis indicates that adopting a rotational bounding box minimizes distortion from background noise, thereby increasing robustness on the test image sets. Second, the effectiveness of the MSML architecture was evaluated by changing the architecture of the neural network at different levels and scales (Table 3 ②). These results demonstrate that a standalone multi-scale architecture limits the detection of small and complex objects, including insulator strings and SB dampers, whereas the MSML architecture effectively extracts distinct features from the objects of interest, resulting in improved accuracy and robustness. The RoMP Transformer adopts a two-level, six-scale layer architecture when deployed on an SBC because the prediction accuracy of the two-level network is similar to that of the three-level network, whereas its inference speed is faster. This decision ensures real-time inspection while maintaining effectiveness in terms of computational speed and accuracy. Third, the effectiveness of the PVT was evaluated by comparing different scales (Table 3 ③). The scale of the PVT also contributes to improving object detection performance for small and complex objects. The proposed method adopts a four-scale PVT configuration, which is beneficial for identifying objects in such scenarios. Finally, the combined IoU method was compared with other IoU methods to analyze the effects of bounding box optimization (Table 3 ④). The analysis reveals that SKEWIoU outperforms angular related IoU (ARIoU), which is a simpler calculation method (L. Liu et al., 2017), but at the cost of a significant decrease in FPS. In this study, mSKEWIoU was used, which improved the FPS while maintaining the prediction accuracy. Furthermore, the prediction accuracy was significantly improved by combining it with DIoU. In summary, the inherent characteristics of the RoMP Transformer ensure high detection accuracy for the five key facilities and environmental robustness for field applications.
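For reference, the distance-IoU term that the combined loss adds can be written out for axis-aligned boxes as follows; the paper's mSKEWIoU additionally handles rotated boxes, so this sketch only illustrates the DIoU penalty itself:

```python
import numpy as np

def diou(box_a, box_b):
    """Distance-IoU for axis-aligned boxes (x1, y1, x2, y2):
    DIoU = IoU - (center distance)^2 / (enclosing-box diagonal)^2."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))  # intersection width
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))  # intersection height
    inter = iw * ih
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    iou = inter / union
    ca = np.array([(xa1 + xa2) / 2, (ya1 + ya2) / 2])  # center of box_a
    cb = np.array([(xb1 + xb2) / 2, (yb1 + yb2) / 2])  # center of box_b
    d2 = np.sum((ca - cb) ** 2)
    ex1, ey1 = min(xa1, xb1), min(ya1, yb1)  # smallest enclosing box
    ex2, ey2 = max(xa2, xb2), max(ey2 := max(ya2, yb2), ey2)
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return iou - d2 / c2
```

Identical boxes score 1.0, while non-overlapping boxes are still penalized by the normalized center distance, which gives the optimizer a useful gradient even when the plain IoU is zero.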
Table 4 quantitatively compares the performance of the RoMP Transformer with that of baseline one-stage neural networks for object detection. The baselines include single-shot detection (SSD; W. Liu et al., 2016), YOLOv3 (Redmon & Farhadi, 2018), and M2Det (Zhao et al., 2019), which are widely used in many object detection applications. The RoMP Net was also included in this comparison because the RoMP Transformer is its successor. One-stage neural networks are preferred for real-time applications, whereas two-stage neural networks are preferred when high accuracy is required (Lohia et al., 2021); hence, this study considered only one-stage neural networks. Note also that the hyperparameters of all neural networks were optimized through BO (Frazier, 2018) for a fair comparison. Remarkably, the RoMP Transformer exhibited the highest AP and FPS among the one-stage object detection neural networks. This analysis clearly suggests that the architecture of the RoMP Transformer is effective for extracting the features of power facilities, thereby increasing accuracy and robustness. Moreover, adopting the half tensor in the RoMP Transformer enables a fast inference time, resulting in the fastest calculations. Field experiments have also confirmed that it achieves a computational speed of over 3 FPS on an SBC, even though other processes, such as real-time control and the measurement of images and PCD, were executed concurrently on the SBC. In conclusion, the RoMP Transformer outperforms the other one-stage baseline neural networks in terms of accuracy, robustness, and inference speed. Moon et al. (2024) provide additional comparative studies using public image sets. This study focuses on the performance of the RoMP Transformer with image sets for power facilities only because it aims to develop an autonomous flight method for the inspection of power facilities.
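The "half tensor" speed-up mentioned above refers to storing and computing feature maps in 16-bit floating point, which halves memory traffic at a small precision cost. A minimal numpy illustration (the tensor shape is arbitrary, not from the paper):

```python
import numpy as np

# Simulated backbone feature map in single precision
feat32 = np.random.default_rng(0).random((64, 128, 128)).astype(np.float32)
feat16 = feat32.astype(np.float16)  # the "half tensor"

bytes_saved = feat32.nbytes - feat16.nbytes  # exactly half the memory
# Rounding error for values in [0, 1) stays below one fp16 half-ulp
max_err = float(np.abs(feat32 - feat16.astype(np.float32)).max())
```

On GPUs with dedicated FP16 units, such as the Jetson AGX Xavier class of SBCs, halving the data width also roughly doubles arithmetic throughput, which is what makes the sub-second inference budget achievable onboard.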

Accuracy of autonomous flight
This subsection first analyzes the real-time detection accuracy of the RoMP Transformer because detecting transmission towers plays a critical role in autonomous flight. Specifically, the size of the transmission tower was quantified as the pixel inclusion ratio, which denotes the proportion of pixels in one image frame that represent the transmission tower (Table 5 reports this per site in the bins <0.5%, 0.5%–1%, 1%–5%, 5%–10%, and >10%). The analysis was conducted on 272 images acquired at GPTC and DD. These images represent 11% of the total 2483 transmission tower images used for training the RoMP Transformer. The results indicate that the RoMP Transformer achieved a higher AP as the pixel inclusion ratio of the transmission tower increased (Table 5). This observation is explained by the fact that a lower pixel inclusion ratio provides fewer object features to extract accurately, making detection susceptible to external factors such as reflections caused by sunlight and the presence of fog. Furthermore, the results indicated that the RoMP Transformer successfully detected all but one transmission tower; when the pixel inclusion ratio of the transmission towers was 1.0% or higher, the detection probability was 100% at both sites (Table 5), securing flight safety for autonomous inspection. The results suggest that an appropriate distance between the UAV landing station and the transmission tower of interest should be determined. Specifically, selecting a location for the UAV landing station within a 400 m radius of the transmission tower of interest secures the autonomous flight of the UAV for inspecting power facilities; future studies should include several experiments to confirm this hypothesis. Note that the AP is not 100% even though the detection probability is 100% for pixel inclusion ratios of 1%–10%. This observation could be attributed to the influence of external factors such as reflections caused by sunlight and the presence of fog. The same factors would explain why the detection probability for a pixel inclusion ratio of 0.5%–1.0% is 100% at GPTC but 94.1% at DD, suggesting that weather conditions play a critical role in ensuring the safety of autonomous inspection. Second, the accuracy of the cognition of the transmission tower extracted from the PCD was analyzed when approaching the transmission tower with 3D LiDAR only.
TABLE 6 Accuracy of the transmission tower location with PCD measurements.

(Table 6 columns: site, average (m), and standard deviation (m); the sites include the hardware in loop simulation (HILS) system.) Specifically, the difference between the estimated tower center and the real tower center was compared by calculating the average difference and standard deviation (Table 6). The center of the transmission tower was measured using high-accuracy GPS equipment (J.-Y. Park et al., 2020). Remarkably, the overall average and standard deviation of the error were 1.199 and 0.19 m, respectively. This quantitative comparison suggests that the UAV can successfully approach the transmission tower using only the PCD measurements with reasonable accuracy. Note that the accuracy of the GPS embedded in the flight controller is 1.5 m (DJI onboard SDK: Telemetry topics, 2018), suggesting that the proposed method would be more accurate than autonomous flight with GPS because the 3D LiDAR features good accuracy. Note also that the error in the field experiment at the DD site shows a relatively higher average difference of 1.568 m, compared to the overall average difference of 1.199 m, because the DD site is located in a mountainous area, which could result in the PCD of trees being included within the PCD designated for towers. However, this difference is negligible considering the sizes of the transmission towers. This phenomenon was not observed in the experiments at the HILS because the rendering process smoothed the PCD of trees when generating the virtual map.
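The ground-removal step underlying this tower-cognition accuracy can be sketched as a RANSAC plane fit; because the plane is estimated from the data, sloped terrain is handled the same way as flat terrain. This is a minimal sketch with hypothetical thresholds, not the authors' GPU implementation:

```python
import numpy as np

def ransac_ground_plane(pcd, n_iter=300, tol=0.2, seed=0):
    """Remove ground points with a RANSAC plane fit ax + by + cz + d = 0.
    Points far from the best plane (e.g., the tower) are retained."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(pcd), dtype=bool)
    for _ in range(n_iter):
        p0, p1, p2 = pcd[rng.choice(len(pcd), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal /= norm
        dist = np.abs((pcd - p0) @ normal)  # point-to-plane distance
        inliers = dist < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return pcd[~best_inliers]  # keep only non-ground points
```

The retained non-ground points are then clustered (EDC) to isolate the tower, whose horizontal centroid serves as the approach target.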
The slope of the ground in complex terrains might also vary depending on the locations of transmission towers. Hence, virtual experiments were conducted under slope conditions of 0° to 45° with an interval of 5° in the HILS (Table 6). Note that these slope conditions cover most inspection terrains because terrains with slopes greater than 45° are extremely rare. Table 6 shows that the average and standard deviation of the error are 1.147 and 0.15 m, respectively. These values align closely with those obtained in the field experiments, indicating that the proposed method consistently exhibits the same order of magnitude of error across varying slope conditions. Note that the small standard deviation underscores that the proposed method is hardly affected by the slope of the terrain. The robustness of the proposed method is also reinforced by the small coefficient of variation (CV), which is calculated as the standard deviation divided by the average. This metric is useful for comparing the degree of variability between datasets in a sensitivity analysis (Brown, 1998). A lower CV indicates less variability, whereas a higher CV indicates greater variability and potential inconsistency. Specifically, the CV of the error was found to be 0.013 when the transmission towers were located under a variety of slope conditions. Hence, the low CV under a variety of slopes confirms not only the precision of the proposed method but also its adaptability to different environmental conditions. Note that this value is of the same order of magnitude as that in the virtual DD environment, confirming the feasibility of this case study. Note also that RANSAC can, in theory, detect and remove terrain at any slope because the planar model $ax + by + cz + d = 0$ used in RANSAC is a general model (Fischler & Bolles, 1981). These results indicate that the proposed method successfully approaches the transmission tower in complex environments with different slopes in
mountainous regions, confirming again that the proposed method provides a novel inspection platform for overhead transmission facilities. Third, the accuracy of the transmission line tracking phase was analyzed based on the error between the UAV flight trajectory and the transmission lines. The error is defined as the height difference between the UAV and the transmission line; the UAV must fly at the same height as the transmission line of interest so that the overhead transmission facilities remain within the FOV of the inspection sensors, including optical and infrared cameras (J.-Y. Park et al., 2020). This analysis was executed using the results from both the HILS system and the field experiments (Table 7). The initial sag condition was assumed to be 13 m at the HILS. Remarkably, the tracking errors of the transmission lines in the HILS and field experiments were 0.7518 and 0.8149 m, respectively (Figure 10), confirming that the proposed method can be used for accurate tracking of transmission lines. This high accuracy originates from the high accuracy of the 3D LiDAR. Notably, the error of the top line is larger than that of the other lines. This observation can be explained by the fact that an initial positioning error of the UAV occurs because of the difference between the initial take-off height and the height of the top insulator string. The inspector (electrician) sets a relatively accurate initial take-off height by using geographical information provided by KEPCO, but some discrepancies still exist. Consequently, a sensitivity analysis of the initial take-off height should be conducted to elucidate its effect on the transmission line tracking phase because this difference affects the control performance of this phase when tracking the top insulator string. Hence, virtual experiments were executed through the HILS (Table 8) to evaluate the accuracy and robustness of top-line tracking by considering the difference between the initial take-off height and the height of the top insulator string. This difference was varied from −10 to 10 m with an interval of 5 m because the maximum error of the geographical information provided by KEPCO would be in this range. Interestingly, the sensitivity analysis shows that the greater the difference between the initial take-off height and the height of the top insulator string, the larger the average error in top-line tracking. However, large errors are only observed in the initial 5 s because the line control strategy compensates for this error. Specifically, the average error is 2.8678 m during the initial 5 s of top-line tracking and decreases over time through the line control strategy, resulting in an error of 0.7708 m for the remainder of the top-line tracking. These findings suggest that the control strategy deployed in the proposed framework effectively controls the altitude of the UAV along the Z-axis and that the proposed method would therefore be appropriate for inspecting transmission lines. This result also suggests that the control strategy is robust against the environmental variable of take-off height. The results from GPTC (Figure 10b) show a relatively higher accuracy than those from DD (Figure 10c), excluding top-line tracking, because abrupt movements of the UAV were induced at DD by the relatively large sag. Hence, additional virtual experiments were conducted under sag conditions of 12 to 17 m with an interval of 1 m in the HILS (Table 9). Additionally, extra experiments were conducted that involved approaching the right tower and tracking the transmission line to ensure robustness. The results show a trend similar to that of the field experiments. The tracking errors increase proportionally to the sag because significant sag induces large movements of the UAV. However, the difference is negligible because the average error is 0.7656 m, and all errors are within 0.8 m. Also, the difference in tracking errors when approaching the left tower versus the right tower is a negligible 0.0013 m, demonstrating the robustness of the proposed method. This case study confirms that the proposed method can complete all missions regardless of the span, demonstrating performance superior to the GPS accuracy of 1.5 m (DJI onboard SDK: Telemetry topics, 2018).
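The average, standard deviation, and CV reported throughout this subsection can be reproduced from raw tracking errors with a trivial helper (the variable names are illustrative):

```python
import numpy as np

def error_stats(errors):
    """Average error, standard deviation, and coefficient of variation
    (CV = standard deviation / average) for a tracking-error series."""
    e = np.asarray(errors, dtype=float)
    avg = e.mean()
    std = e.std()
    return avg, std, std / avg
```

Because the CV normalizes dispersion by the mean, it allows the slope-sensitivity, sag-sensitivity, and take-off-height experiments to be compared on one dimensionless scale.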
These analyses demonstrate that the proposed method provides a simple yet accurate strategy for autonomous flight. Furthermore, the proposed method enables system lightweighting compared to existing UAV-based inspection systems, which require gimbal control because they lack transmission line tracking technology. This feature results in superior battery efficiency, thereby ensuring extended flight durations. However, several challenges remain. One is the environmental sensitivity of UAV operations, particularly concerning weather conditions: factors such as wind, rain, and visibility can significantly limit the operational envelope of UAVs. Additionally, the absence of an advanced obstacle detection system presents difficulties in managing emergency situations. Future studies will address these challenges, aiming to facilitate the realistic application and advancement of autonomous flight systems.

CONCLUSION
This study proposes a new autonomous flight strategy for UAVs with multimodal information to inspect overhead transmission facilities. The proposed method features a unique deep neural network architecture for object detection, enabling the UAV to approach transmission towers at a consistent altitude. Multimodal information from the optical camera and LiDAR also aids tower recognition, especially in challenging visual conditions. A simple yet accurate control strategy is employed for tracking transmission lines, maintaining the UAV altitude in line with the transmission lines without gimbal control. Extensive experiments in both virtual and real-world environments confirm the method's success in autonomous flight missions. The quantitative comparison also underscores that the accuracy and robustness of the RoMP Transformer outperform those of other neural networks in object detection. The proposed method offers faster inspection of transmission facilities without relying on GPS information or a gimbal control system, ensuring a safer platform for electricians. Future work will explore diagnostic methods for transmission facilities, including the estimation of sag and environmental encroachment, with the proposed platform.

APPENDIX A: AUTONOMOUS CONTROL SYSTEM
The autonomous control system deployed a 3D LiDAR (Velodyne VLP-32C) and an optical camera (See3CAM CU135) for environmental cognition, including transmission towers and lines. A 3D LiDAR with a maximum measurement range of 200 m was deployed because the UAV must maintain a safe distance of dozens of meters from the transmission line to avoid interference from the magnetic field generated by the line.
The range accuracy of the 3D LiDAR is up to ±3 cm, and the measurement angles in the horizontal and vertical directions are 360° and 40°, respectively. The horizontal resolution ranges from 0.1° to 0.4°, and the vertical resolution is 1.25°, considering the number of channels and the measurement angle. The 3D LiDAR was mounted rotated by 90° about the Y-axis of the UAV (Figure A1) to effectively detect thin transmission lines because the horizontal resolution is more precise than the vertical resolution. An optical camera with a resolution of MP was used to detect the overhead transmission facilities over long distances. The 3D LiDAR and optical camera were mounted to measure the same area so that these sensors could be fused. The flight controller was built into the UAV, and its location, posture, and flight status information were measured. An SBC with a GPU (NVIDIA Jetson AGX Xavier) was also deployed for the effective management of the PCD and for object detection through the deep neural network. Two battery packs, comprising three and five 18650 cells in series, were mounted to supply power to the 3D LiDAR and SBC, respectively; the optical camera was powered through the SBC. The gimbals were manufactured through 3D printing using polylactic acid to mount the components of the autonomous control system. The weights of the 3D LiDAR, optical camera, SBC, battery packs, and gimbals were 925, 22, 736, 408, and 460 g, respectively, resulting in a total weight of 2.55 kg for the autonomous control system.
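Two quick checks of the numbers above: the component masses indeed sum to about 2.55 kg, and the 90° mounting is justified by the point spacing each angular resolution produces on a distant line (the 50 m range below is an illustrative value, not from the paper):

```python
import math

# Component masses of the autonomous control system (grams, from the text)
masses = {"3D LiDAR": 925, "optical camera": 22, "SBC": 736,
          "battery packs": 408, "gimbals": 460}
total_g = sum(masses.values())  # 2551 g, i.e., about 2.55 kg

def point_spacing(range_m, resolution_deg):
    """Approximate spacing between adjacent LiDAR returns at a given range."""
    return range_m * math.radians(resolution_deg)

# At 50 m, the fine horizontal resolution samples a thin line far more
# densely than the coarse vertical resolution would.
dense = point_spacing(50, 0.2)    # roughly 0.17 m between returns
coarse = point_spacing(50, 1.25)  # roughly 1.09 m between returns
```

A conductor only a few centimeters thick therefore yields several returns per sweep when scanned with the fine axis, which is why the LiDAR is rotated so that its horizontal resolution crosses the lines.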

APPENDIX B: HYPERPARAMETER OPTIMIZATION
The purpose of hyperparameter optimization through BO is to search for the optimal hyperparameter value $\theta^{*}$ at which the value of the loss function $f(\theta)$ becomes the lowest. This search is represented as

$\theta^{*} = \arg\min_{\theta} f(\theta)$, (B1)

where $f(\theta)$ is defined as the objective function. The acquisition function is used to determine the criteria for selecting the next hyperparameters from the estimated values; BO finds the hyperparameters that allow $f$ to attain the minimum value by iteratively updating this equation. Hence, the RoMP Transformer defined $f(\theta)$ as the validation loss in the hyperparameter optimization process. This hyperparameter optimization process improves the performance of the deep learning-based neural network.
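The Tree Parzen Estimator step described in this appendix can be sketched in one dimension as follows. The kernel bandwidth and the γ split are hypothetical choices, and production implementations (e.g., Hyperopt) use adaptive Parzen windows over tree-structured search spaces:

```python
import numpy as np

def tpe_suggest(thetas, losses, candidates, gamma=0.25, bw=0.1):
    """Suggest the next hyperparameter via the Tree Parzen Estimator idea:
    model p(theta | good) as l(theta) and p(theta | bad) as g(theta) with
    kernel density estimates, then maximize l(theta) / g(theta), which is
    equivalent to maximizing the expected improvement in Equation (B5)."""
    thetas = np.asarray(thetas, dtype=float)
    losses = np.asarray(losses, dtype=float)
    f_star = np.quantile(losses, gamma)          # threshold splitting good/bad
    good = thetas[losses <= f_star]
    bad = thetas[losses > f_star]

    def kde(data, x):
        # Gaussian kernel density estimate; tiny floor avoids division by zero
        k = np.exp(-0.5 * ((x[:, None] - data[None, :]) / bw) ** 2)
        return k.mean(axis=1) + 1e-12

    cand = np.asarray(candidates, dtype=float)
    return cand[np.argmax(kde(good, cand) / kde(bad, cand))]
```

Each suggested value is evaluated (here, one training run measuring validation loss), appended to the history, and the densities are re-estimated, so the search concentrates where low losses have been observed.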

FIGURE Schematic flowchart of the proposed autonomous flight. LiDAR, light detection and ranging; UAV, unmanned aerial vehicle.

FIGURE 3 Transmission tower approaching: (a) calculation of the relative direction of the transmission towers with respect to true north and (b) flow chart of the method of approach to the transmission tower. EDC, Euclidean distance clustering; GPU, graphic processing unit; RANSAC, random sample consensus; PCD, point cloud data; RoMP, rotational bounding box with multi-level feature pyramid; UAV, unmanned aerial vehicle.

FIGURE 4 Transmission line tracking: (a) flow chart of the transmission line tracking method and (b) classes of updating occupancy. EDC, Euclidean distance clustering; GPU, graphic processing unit; RANSAC, random sample consensus; RoMP, rotational bounding box with multi-level feature pyramid; UAV, unmanned aerial vehicle.

FIGURE 5 Autonomous flight path in the field experiment: (a) autonomous flight at Gochang Power Test Center (GPTC) and (b) at Daeduck-Duckjin (DD).

FIGURE Hardware in loop simulation (HILS) system: (a) basic configuration and (b) process of generating virtual environment map. PCD, point cloud data; ROS, robot operating system; SBC, single-board computer; UAV, unmanned aerial vehicle.
Gold 5220R CPUs and eight Tesla V100-SXM2 GPUs were used for training, validation, and testing of the RoMP Transformer and the other networks on the image set described in Section 3.1. The total of 16,255 images was separated into 13,726 optical images (84.4%) for training, 1525 (9.4%) for validation, and 1004 (6.2%) for testing the RoMP Transformer. The optical images for training and

FIGURE Results of the proposed autonomous flight in a field experiment (GPTC). UAV, unmanned aerial vehicle; VM, voxel map.

FIGURE Cognition results of the transmission tower with point cloud data (PCD) measurements.

FIGURE Cognition results of the transmission lines with PCD measurements.

TABLE 5 Accuracy of the RoMP Transformer under real-time operation.
This study was supported by the "Development of Drone System for Diagnosis of Porcelain Insulators in Overhead Transmission Lines (R22TA14)" project supported by 2022 Main R&D Projects performed by KEPCO, Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government (MOTIE) (20213030020260, Development of Fire detection and protection system for wind turbine), and the National Research Foundation of Korea (NRF) funded by the Korean government (MSIT) (Grant Number 2020R1C1C1003829).
A representative acquisition function is the expected improvement (EI), calculated as

$EI_{f^{*}}(\theta) = \int_{-\infty}^{f^{*}} (f^{*} - f)\, p(f \mid \theta)\, df$, (B2)

where $f$ denotes the value of the objective function for the hyperparameter $\theta$ and $f^{*}$ is a threshold on the objective value. However, continuous feedback through a large objective function increases the computational burden. Hence, BO employs a surrogate model to replace the large objective function. Surrogate models include the Gaussian process, the random forest, and the Tree Parzen Estimator. Among these surrogate models, this study utilizes the Tree Parzen Estimator because the Gaussian process and random forest have a high risk of divergence when optimizing multiple hyperparameters. Specifically, the Tree Parzen Estimator-based surrogate model $p(f \mid \theta)$ is calculated using Bayes' rule as

$p(f \mid \theta) = \dfrac{p(\theta \mid f)\, p(f)}{p(\theta)}$, (B3)

and the probability of the hyperparameters given the objective function, $p(\theta \mid f)$, is calculated as

$p(\theta \mid f) = \begin{cases} \ell(\theta), & f < f^{*} \\ g(\theta), & f \geq f^{*} \end{cases}$ (B4)

where $\ell(\theta)$ and $g(\theta)$ denote the probability distributions of the hyperparameters whose objective values are below and above the threshold $f^{*}$, respectively. Using Equations (B1) to (B4), the expected improvement is newly defined as

$EI_{f^{*}}(\theta) \propto \left( \gamma + \dfrac{g(\theta)}{\ell(\theta)}\,(1 - \gamma) \right)^{-1}$, (B5)

where $\gamma = p(f < f^{*})$ denotes the quantile used to split the observed objective values.

Initial ranges and optimal hyperparameters of the rotational bounding box with multi-level feature pyramid (RoMP) Transformer.
TABLE 2

TABLE 3 Effectiveness of key characteristics of the RoMP Transformer (bold indicates the best score for each ablation study).

Columns: key characteristic; average precision (AP; %) for tower, spacer, marker ball, insulator string, and SB damper; mean AP (mAP; %); frames per second (FPS).
Abbreviations: ARIoU, angular related intersection over union; DIoU, distance intersection over union; mSKEWIoU, modified SKEW intersection over union.
TABLE 4 Performance of the RoMP Transformer for object detection.