Computer Vision Tool for Detection, Mapping and Fault Classification of PV Modules in Aerial IR Videos

Increasing deployment of photovoltaic (PV) plants demands cheap and fast inspection. A viable tool for this task is thermographic imaging by unmanned aerial vehicles (UAV). In this work, we develop a computer vision tool for the semi-automatic extraction of PV modules from thermographic UAV videos. We use it to curate a dataset containing 4.3 million IR images of 107842 PV modules from thermographic videos of seven different PV plants. To demonstrate its use for automated PV plant inspection, we train a ResNet-50 to classify ten common module anomalies with more than 90 % test accuracy. Experiments show that our tool generalizes well to different PV plants. It successfully extracts PV modules from 512 out of 561 plant rows. Failures are mostly due to an inappropriate UAV trajectory and erroneous module segmentation. Including all manual steps, our tool enables inspection of 3.5 MWp to 9 MWp of PV installations per day, potentially scaling to multi-gigawatt plants due to its parallel nature. While we present an effective method for automated PV plant inspection, we are also confident that our approach helps to meet the growing demand for large thermographic datasets for machine learning tasks, such as power prediction or unsupervised defect identification.


I. INTRODUCTION
Deployment of solar photovoltaics (PV) has increased exponentially in the past years. At the end of 2019, globally installed capacity reached 586 GWp [1]. Many PV plants contain defective PV modules, which pose safety hazards and reduce power output, yield and, as a consequence, the profitability of the plant. Defects occur during manufacturing, installation or due to aging. To identify defective modules, PV plants need to be inspected regularly.
A valuable tool for defect identification in PV modules is thermographic imaging, which uses a thermal IR camera to visualize defects based on their increased temperature. To speed up the inspection process, thermography is typically performed by unmanned aerial vehicles (UAV) [2][3][4][5]. Many works have explored the use of UAVs for PV plant inspection. A high-level overview of the inspection process and the challenges involved is given in [6,7]. [8] compares available camera and drone technologies and [9] performs an economic analysis. [10,11] analyze the influence of the image resolution on the detectability of defects.
UAV thermography of PV plants with millions of modules produces so many images that manual review is infeasible. This raises the need for image processing tools which automatically detect PV modules in each image and identify thermal anomalies. To enable repair or exchange of defective modules, the automated processing tool further needs to determine the exact location of each module in the plant. Instead of taking individual images at predetermined positions, we simply fly along each row of the PV plant and acquire videos. This renders expensive and time-consuming flight planning unnecessary and allows for faster inspection on-site. However, it increases the amount of data as each PV module occurs in multiple consecutive video frames. It further introduces perspective distortion and other artefacts, such as sun reflections, which need to be handled by the processing tool to make the images usable for downstream anomaly classification and other machine learning algorithms. The large number of acquired thermographic images is key to accurate anomaly classification, as some anomalies are very rare and machine learning algorithms used for anomaly classification require many examples to achieve high accuracy and good generalization.
In this work we develop such an image processing tool for the semi-automatic extraction and localization of PV modules in UAV thermographic videos of large-scale PV plants (see fig. 1). It can be used to automate inspection of PV plants and to curate large datasets for downstream machine-learning tasks. While there are several works on automated PV plant inspection systems [12][13][14][15][16][17][18][19], they rely heavily on classic image processing techniques, such as intensity thresholding (see tab. 1). These techniques are based on heuristics, need extensive manual tuning, do not generalize well and are not very accurate. Further, many of the related works can distinguish at most three different thermal anomalies or perform only a binary classification. First works have shown promising results using deep learning for these tasks [20,21]. Following this recent trend, we use the Mask R-CNN instance segmentation framework [22] to robustly extract PV modules from thermal IR videos. A ResNet-50 deep convolutional classifier [23] is used for fine-grained classification of ten thermal anomalies. Further, we exploit the large redundancy and temporal context present in the video data to efficiently build a large-scale dataset of thermographic images of PV modules for downstream machine learning tasks. To summarize, our contributions are as follows:
• A tool for semi-automatic extraction and localization of PV modules in UAV thermographic videos of large-scale PV plants which can be used for automated plant inspection and to curate large datasets for downstream machine-learning tasks.
• A dataset of 4.3 million thermographic images of 107842 PV modules from seven PV plants with fine-grained labels of ten common thermal anomalies.
• Training and evaluation of a ResNet-50 classifier on our dataset.
• A quantitative analysis of generalization ability, processing time and failure cases of our tool.

II. RELATED WORKS
The following is an overview of related methods for semi-automatic thermographic PV plant inspection by UAVs. We compare them in terms of module detection, thermal anomaly detection and localization of modules in the plant. Tab. 1 summarizes methods and dataset sizes of the related works.

A. PV Module Detection
Most works employ classic computer vision algorithms to detect PV modules in both visual and thermographic images. The most popular method, used by [13-15, 19, 24, 25], is binary thresholding of image intensities to obtain segmentation masks of the PV modules. [21] detects rectangular candidate contours by thresholding, extracts texture features and classifies them with a Support Vector Machine (SVM). Other works find edges of PV modules using morphological operations [26,27] or the Hough transform [12,16]. More exotic techniques are template matching [18] and maximally stable extremal regions [17]. The main issue with all these works is their reliance on classic image processing, which is based on manual priors and heuristics, needs extensive manual tweaking of hyperparameters and generalizes poorly to unseen imagery.
Deep learning overcomes these problems and is applied to PV module detection by [21,28,29]. [28] performs semantic segmentation with a combination of a ResNet-34 [23] and a U-Net [30]. A weakness of semantic segmentation is that it does not distinguish between individual PV modules. [29] employs the YOLO object detector [31], which does not have this problem. However, it suffers from the imprecise representation of PV modules by bounding boxes instead of segmentation masks. Similar to our work, [21] solves both problems by utilizing the Mask R-CNN instance segmentation model. It outputs an individual segmentation mask for each PV module, which allows for accurate localization of PV modules in thermographic images.

B. Thermal Anomaly Detection
Similar to PV module detection, many works [13,14,16,17,32] use binary thresholding to segment hot regions of PV modules in thermographic images; these regions correspond to thermal anomalies. The works in [12,33] iteratively grow segmentation masks of hot spots starting from local intensity maxima. In [18] hot spots are found by template matching. Another approach is to extract features, such as mean and standard deviation, for each PV module and find outliers with statistical tests [19] or by comparing with neighbouring modules [25].
Several recent works explore deep learning for anomaly detection to overcome the limitations of classic image processing [20,34,35]. In [34] a segmentation model based on VGG-16 is used to segment three different anomalies directly in the thermographic image. VGG-16 is also used by [35] to classify whether an image contains an anomalous module or not. A problem of this method is its inability to accurately localize the anomalous module. In [20] four different anomalies are classified using MobileNet and VGG-16. The authors find that both deep learning methods outperform an SVM and a Random Forest classifier using SIFT features.
A problem of the current methods is that the list of classified anomalies is by no means complete. Further, small datasets with only 360 to 3336 images are used.
Similar to [20] we utilize a deep convolutional classifier, in our case ResNet-50. However, we obtain a significantly larger anomaly classification dataset with more than 450000 images and perform a much more fine-grained classification of ten thermal anomalies. In addition, we employ majority voting over subsequent video frames to enhance classification accuracy.

C. Localization of PV Modules in the Plant
To localize PV modules in the PV plant, [13,14,26] create panorama images of each row, detect modules and assign an ID to each module. This way, module locations are defined relative to other modules. [15] uses the same technique and additionally matches each row panorama to a CAD plan by means of GPS positions. A drawback is the need for an accurate flight path with a specified overlap of individual images, which makes UAV operation more complicated. Further, CAD files are not always available and the format can vary for different PV plants.
Several works [36][37][38] create an orthophoto of the entire PV plant from a higher altitude. This requires nadiral images with a suitable overlap, which may not always be feasible in case of nearby power lines, streets or train tracks. The spatial resolution of a high-altitude image is low, making fine-grained anomaly classification of PV modules difficult.
Other works [18,39] use direct georeferencing to estimate the GPS position of each PV module in the image. This requires an expensive Real Time Kinematics system to accurately estimate the UAV's position.
In [40] GPS positions of the video frames containing an anomalous PV module are marked on a map. While this is straightforward it still requires manual localization of the anomalous module within the frame.
Our work uses relative mapping similar to [13,14,26]. Instead of creating a panorama, we encode the spatial relationship of PV modules in a graph that is matched with a standardized plant file containing module identifiers. This allows for easy integration of other data modalities, such as electrical measurements. The plant file needs to be created only once for each plant which saves time when inspecting the same plant multiple times. We further do not require nadiral images or a specific overlap of adjacent frames and a standard GPS receiver is sufficient. This reduces cost and allows for a more flexible operation of the UAV.

III. VIDEO DATASET
For this work we acquire thermographic videos of seven utility-scale PV plants containing a combined 122865 PV modules (ranging from 2850 to 35360 modules per plant). As can be seen in fig. 2, the plants in our dataset cover a variety of row layouts, module sizes, module orientations and module technologies. Plant D consists of thin-film modules while the others use crystalline silicon modules. In total our dataset contains 8 hours of video footage (231172 frames) with on average 21.8 PV modules per frame. Videos were acquired by a UAV of type DJI Matrice 210 and a DJI Zenmuse XT2 camera which has a resolution of 640 × 512 pixels and a frame rate of 8 Hz. Acquisition took place under clear-sky conditions and solar irradiance above 700 W m⁻².

IV. PV MODULE EXTRACTION
This section introduces our tool for semi-automatic extraction of PV modules from thermographic videos. An overview can be found in fig. 3. First, the tool splits thermographic videos into individual frames and extracts their GPS coordinates. Aided by the GPS coordinates the user manually specifies which frames belong to which row of the PV plant. PV modules are segmented by Mask R-CNN, extracted, rectified and stored to disk. A tracking algorithm associates each PV module in subsequent video frames with a unique track ID. This way the extracted patches of each PV module can be grouped together. Finally, track IDs are associated with plant IDs. Plant IDs are specified in a standardized plant file and describe the electrical wiring and the location of each module in the plant. We chose a semi-automatic approach to achieve a high degree of flexibility and good generalization to different PV plants.
The rest of this section explains the tool in detail.

A. Video Acquisition and Preprocessing
Thermographic videos can be captured with any UAV or camera as long as the following requirements are fulfilled:
• Each row of the PV plant is scanned individually.
• The camera moves monotonically along the row, i.e. there is no significant backward movement.
• The current row must be fully visible and always the frontmost (bottommost) one in each frame.
• The row must lie approximately horizontal or vertical in each frame.
Our tool is robust to changes of the flight velocity, altitude and camera angle. This allows the operator to manually track rows with varying elevation (e.g. hillsides) and choose the optimal camera angle to reduce sun reflections. Additional rows which may become visible in the background due to low camera angles are filtered out. After acquisition thermal IR videos are split into individual frames and stored as 16-bit grayscale TIFFs. The GPS position of each frame is extracted and stored in CSV and KML files. They are needed during the manual grouping of frames that follows in the next step. In case the PV rows are vertical we rotate the video frames by 90° to enable equal treatment of both cases in the remaining processing steps.

B. Grouping of Frames into Rows
For maximum flexibility our tool processes each row of the PV plant independently. To this end, the user has to manually specify which video frames belong to which row of the PV plant. Specifically, the user has to provide the plant IDs of the bottom left and top right modules and the index of the first and last frame of each row. A graphical tool (see fig. 4) for browsing frames based on their GPS position simplifies this process. The user can skip parts of the video and rows do not need to be scanned in any particular order. It is also possible to scan rows partially, e.g. when a row contains multiple strings of which only a subset needs to be inspected. Further, single frames can be processed, which is useful for short rows.

C. PV Module Segmentation
To locate PV modules in each video frame we use the Mask R-CNN instance segmentation framework. It outputs an axis-aligned bounding box and a binary segmentation mask for each PV module. We train it to segment only fully visible PV modules. Example outputs are shown in fig. 5.
1) Dataset: For fine-tuning of Mask R-CNN we annotate segmentation masks and bounding boxes of 26612 PV modules in 862 video frames of PV plants A, B, C and D. For this we developed a custom annotation tool, however any annotation tool for instance segmentation can be used. We select 60 frames (15 of each PV plant) with a total of 2104 PV modules for validation and the remaining 802 frames for training. For compatibility with Mask R-CNN we convert the 16-bit grayscale frames to Celsius scale, normalize the values to the interval [0, 255], convert to 8-bit, maximize contrast by means of a histogram equalization, convert to RGB and subtract the channel means estimated from the training set. In addition, each frame is padded with zeros to a square of size 640 × 640 pixels.
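The preprocessing chain described above can be sketched with NumPy alone. This is a minimal illustration, not the tool's implementation: the function name `preprocess_frame`, the linear raw-to-Celsius conversion (raw counts taken as centi-Kelvin) and the bottom/right placement of the zero padding are assumptions that depend on the camera and on details the text does not specify.

```python
import numpy as np

def preprocess_frame(raw, channel_means, size=640):
    """Turn a 16-bit thermal frame into a padded, mean-subtracted RGB array."""
    # ASSUMPTION: raw counts are centi-Kelvin; the real scaling is camera-specific.
    celsius = raw.astype(np.float64) * 0.01 - 273.15
    # Normalize to [0, 255] and convert to 8-bit.
    lo, hi = celsius.min(), celsius.max()
    gray = ((celsius - lo) / max(hi - lo, 1e-12) * 255.0).astype(np.uint8)
    # Histogram equalization via the cumulative distribution function.
    cdf = np.bincount(gray.ravel(), minlength=256).cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = max(cdf[-1] - cdf_min, 1)
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255.0), 0, 255).astype(np.uint8)
    equalized = lut[gray]
    # Replicate to three channels and subtract the training-set channel means.
    rgb = np.stack([equalized] * 3, axis=-1).astype(np.float32)
    rgb -= np.asarray(channel_means, dtype=np.float32)
    # ASSUMPTION: zero padding is appended at the bottom and right.
    h, w = rgb.shape[:2]
    padded = np.zeros((size, size, 3), dtype=np.float32)
    padded[:h, :w] = rgb
    return padded
```

For a 640 × 512 frame the result has shape 640 × 640 × 3, with the lower 128 rows left at zero.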
2) Training: Starting from MS COCO-pretrained weights [41] we train the segmentation and classification heads of Mask R-CNN for 59 epochs using stochastic gradient descent with a batch size of 2, learning rate 0.001, momentum 0.9 and weight decay 0.0001. Subsequently, all weights are fine-tuned for an additional 60 epochs with 1/10th of the previous learning rate. During both training stages frames are augmented by random up-down and left-right flips and (in 50 % of the cases) rotation by a uniform random angle between −10° and 10°. We additionally rotate images by ±90° in 50 % of the cases to reduce differences between landscape and portrait orientation of modules.

Table 1: Comparison of related works on PV module detection and thermal anomaly detection in aerial IR images of PV plants. F1-scores are taken from the original works and are not directly comparable due to different test datasets and different definitions of the F1-score (pixel-based, bounding box-based, choice of IoU threshold). A unification is out of the scope of this work. F1-scores defined in the same way as in our work are marked with a †.

3) Validation Metrics:
We evaluate Mask R-CNN in terms of F1-score and average precision (AP) metric from the MS COCO benchmark [41]. To this end, all pairs of predicted and ground truth module bounding boxes in a validation frame are formed and the intersection over union (IoU) of each pair is computed. Pairs with an IoU larger than a specified threshold are true positives (TP).
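The box matching behind the F1-score can be illustrated as follows. This is a sketch under stated assumptions: the text only defines true positives, so counting unmatched predictions as false positives and unmatched ground truths as false negatives, as well as the greedy one-to-one matching in order of decreasing IoU, are conventions assumed here. The function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def f1_at_threshold(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Return (F1, TP, FP, FN) for one frame at a single IoU threshold."""
    # Form all prediction/ground-truth pairs and sort by decreasing IoU.
    pairs = sorted(
        ((box_iou(p, g), i, j)
         for i, p in enumerate(pred_boxes)
         for j, g in enumerate(gt_boxes)),
        reverse=True,
    )
    used_pred, used_gt, tp = set(), set(), 0
    for iou, i, j in pairs:
        if iou <= iou_threshold:
            break  # all remaining pairs are below the threshold
        if i in used_pred or j in used_gt:
            continue  # ASSUMPTION: each box takes part in at most one match
        used_pred.add(i)
        used_gt.add(j)
        tp += 1
    fp = len(pred_boxes) - tp  # ASSUMPTION: unmatched predictions
    fn = len(gt_boxes) - tp    # ASSUMPTION: unmatched ground truths
    denom = 2 * tp + fp + fn
    return (2 * tp / denom if denom else 0.0), tp, fp, fn
```

Averaging a precision-based score over the thresholds {0.5, 0.55, ..., 0.95} leads to the MS COCO style AP; that integration step is omitted here.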

D. Extraction of Module Patches
This step extracts segmented PV modules from the thermographic frames and stores the resulting patches to disk. Due to perspective distortion and irregular shape of the segmentation masks direct cropping and storing is not possible. Instead, we fit a minimum-perimeter enclosing quadrilateral to each segmentation mask and obtain a homography which maps the quadrilateral to a rectangle. Width and height of this rectangle correspond to the maximum width and height of the quadrilateral. This yields variable-sized patches which retain most of the information of the source frame without wasting storage space. To ensure each pixel within the quadrilateral is valid we restrict it to lie within the frame. If the IoU of a segmentation mask and the fitted quadrilateral is below 0.9 the segmentation mask is most likely incorrect and filtered out.
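The rectification step can be sketched with a direct linear transform in NumPy. The sketch assumes the quadrilateral corners are already available and ordered top-left, top-right, bottom-right, bottom-left; fitting the minimum-perimeter quadrilateral itself is not shown. Interpreting "maximum width and height" as the longer of each pair of opposite sides is also an assumption, and both function names are hypothetical.

```python
import numpy as np

def rectifying_homography(quad):
    """Homography mapping a quadrilateral onto an axis-aligned rectangle.

    quad: four (x, y) corners ordered TL, TR, BR, BL (assumed ordering).
    Returns (H, width, height).
    """
    quad = np.asarray(quad, dtype=np.float64)
    tl, tr, br, bl = quad
    # ASSUMPTION: rectangle size = longer of each pair of opposite sides.
    width = max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl))
    height = max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr))
    rect = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=np.float64)
    # Direct linear transform: two linear equations per corner correspondence.
    rows = []
    for (x, y), (u, v) in zip(quad, rect):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of the 8x9 system (last right singular vector).
    _, _, vt = np.linalg.svd(np.array(rows))
    H = vt[-1].reshape(3, 3)
    return H / np.linalg.norm(H), width, height

def apply_homography(H, points):
    """Map an (n, 2) array of points through H using homogeneous coordinates."""
    points = np.asarray(points, dtype=np.float64)
    homog = np.hstack([points, np.ones((len(points), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]
```

In practice the patch would then be resampled with an image warping routine, for example OpenCV's `cv2.warpPerspective`, using `H` and the output size `(width, height)`.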

E. PV Module Tracking
Multiple object tracking is performed to associate segmentation masks of the same PV module in subsequent video frames. This enables grouping of the extracted patches by their associated PV module. To this end, mask centers are projected from frame t − 1 into frame t using a homography that is estimated by extracting and matching ORB keypoints [42] in both frames. We also tried a Kanade-Lucas-Tomasi tracker but found that it fails due to large motion magnitude whenever the IR camera recalibrates. Each projected mask center is then matched with the nearest segmentation mask center in frame t and its track ID is propagated. If multiple projected mask centers are matched with the same segmentation mask center only the match with the smallest Euclidean distance is considered. The other matches typically correspond to PV modules that left the frame. Whenever a segmentation mask center in frame t is not matched with any of the projected mask centers, a new unique and random track ID is assigned to it. This usually occurs when a new PV module enters the frame.
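The association rule can be sketched as below. The homography `H` from frame t − 1 to frame t is taken as given (the text estimates it from matched ORB keypoints, which is not reproduced here), and the function name and the `new_id` callback are hypothetical.

```python
import numpy as np

def propagate_track_ids(prev_centers, prev_ids, curr_centers, H, new_id):
    """Assign track IDs to mask centers in frame t.

    prev_centers, curr_centers: (n, 2) mask centers in frames t-1 and t.
    prev_ids: track IDs of the centers in frame t-1.
    H: 3x3 homography from frame t-1 to frame t (assumed to be given).
    new_id: callable returning a fresh track ID.
    """
    prev_centers = np.asarray(prev_centers, dtype=float).reshape(-1, 2)
    curr_centers = np.asarray(curr_centers, dtype=float).reshape(-1, 2)
    curr_ids = [None] * len(curr_centers)
    if len(prev_centers) and len(curr_centers):
        # Project the centers of frame t-1 into frame t.
        homog = np.hstack([prev_centers, np.ones((len(prev_centers), 1))]) @ H.T
        projected = homog[:, :2] / homog[:, 2:3]
        # Match each projected center with its nearest center in frame t.
        dists = np.linalg.norm(projected[:, None, :] - curr_centers[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # If several projections pick the same center, keep only the closest one.
        best = {}
        for i, j in enumerate(nearest):
            if j not in best or dists[i, j] < dists[best[j], j]:
                best[j] = i
        for j, i in best.items():
            curr_ids[j] = prev_ids[i]
    # Unmatched centers (typically modules entering the frame) start new tracks.
    return [tid if tid is not None else new_id() for tid in curr_ids]
```

A random identifier source such as `lambda: uuid.uuid4().hex` would be one way to supply `new_id`.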

F. Filtering of the Front Row
For low camera angles additional rows of PV modules may be visible in the background of the frame. We develop a filter which discards these background rows and the corresponding patches. It operates independently on each frame and assumes that the currently processed row is the frontmost row (for nadiral videos the bottommost row) in the frame.
The filter iteratively fits a line into the set of segmentation mask centers using RANSAC, removes the inlier mask centers and repeats until no more lines can be fit. Each line must deviate at most ±20° from the horizontal. During iterative fitting outlier lines can occur which intersect the other lines. We remove them by iteratively removing the line which intersects most other lines until no more intersecting lines are present. Given the number N of vertically stacked PV modules in each row we can retrieve the N lines with largest y-intercept (the image y-axis points downward). The segmentation masks associated with these lines represent the front row and thus are the ones of interest for the further processing steps. Fig. 6 shows some example outputs of the row filter.
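The multi-line fitting idea can be sketched as follows. To keep the example deterministic it enumerates all point pairs as line hypotheses, which is a swapped-in substitute for the random sampling of RANSAC. The inlier tolerance, the minimum inlier count and the rule that two lines "intersect" only if they cross within the frame width are assumptions, and all names are hypothetical.

```python
import numpy as np
from itertools import combinations

def fit_lines(centers, tol=5.0, max_angle_deg=20.0, min_inliers=3):
    """Iteratively fit lines y = m*x + b; returns a list of (m, b, inlier_indices)."""
    centers = np.asarray(centers, dtype=float)
    remaining = list(range(len(centers)))
    max_slope = np.tan(np.radians(max_angle_deg))
    lines = []
    while len(remaining) >= min_inliers:
        pts = centers[remaining]
        best = None
        # Exhaustive pair hypotheses (deterministic stand-in for RANSAC sampling).
        for i, j in combinations(remaining, 2):
            (x1, y1), (x2, y2) = centers[i], centers[j]
            if x1 == x2:
                continue
            m = (y2 - y1) / (x2 - x1)
            if abs(m) > max_slope:
                continue  # line deviates more than the allowed angle from horizontal
            b = y1 - m * x1
            dist = np.abs(m * pts[:, 0] - pts[:, 1] + b) / np.hypot(m, 1.0)
            inliers = [remaining[k] for k in np.flatnonzero(dist <= tol)]
            if best is None or len(inliers) > len(best[2]):
                best = (m, b, inliers)
        if best is None or len(best[2]) < min_inliers:
            break
        lines.append(best)
        remaining = [k for k in remaining if k not in best[2]]
    return lines

def crosses(l1, l2, width):
    """True if two lines cross between x = 0 and x = width (assumed criterion)."""
    d0 = l1[1] - l2[1]
    d1 = (l1[0] * width + l1[1]) - (l2[0] * width + l2[1])
    return d0 * d1 < 0

def front_row(centers, n_lines, width):
    """Indices of the mask centers that belong to the frontmost row."""
    lines = fit_lines(centers)
    while True:
        counts = [sum(crosses(a, b, width) for b in lines if b is not a) for a in lines]
        if not counts or max(counts) == 0:
            break
        lines.pop(int(np.argmax(counts)))  # drop the line crossing most other lines
    # The image y-axis points downward, so the front row has the largest intercepts.
    lines.sort(key=lambda line: line[1], reverse=True)
    return sorted(k for line in lines[:n_lines] for k in line[2])
```

For a frame with one background row and one front row of single-height modules, `front_row(centers, n_lines=1, width=640)` returns the indices of the lower row only.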

G. Association of Track IDs and Plant IDs
In this step the random track IDs of PV modules are mapped to plant IDs which encode the electrical wiring of the modules and their location in the plant. The algorithm involves three steps: i) track graph creation, ii) plant graph creation and iii) graph matching.
Both track graph and plant graph encode the spatial relation of all PV modules in a single row of the PV plant. Nodes contain the track IDs and plant IDs, respectively. Edges connect IDs of adjacent modules.
1) Track Graph Creation: The track graph is built iteratively based on all frames associated with the row. For each new frame previously unseen track IDs are added as nodes to the track graph. However, track IDs of spurious tracks (track ID occurring in fewer than five successive frames) are ignored. Edges are added whenever the overlap, i.e. the number of shared pixels, of two segmentation masks exceeds a threshold. Prior to that all masks are dilated to ensure sufficient overlaps. For PV plants with gaps between module tables adjacent modules are found by additionally searching along a horizontal line passing through the segmentation mask center. In the end, all but the largest connected component of the track graph are removed. The smaller components correspond to background rows resulting from occasional row filtering failures. Additionally, nodes with degree 1 are removed since they correspond to spurious detections.
2) Plant Graph Creation: Plant graphs are created as one-to-one mappings of the rows in the plant file which contain plant IDs and correspond directly to the plant layout.
3) Graph Matching: The final mapping between plant IDs and track IDs of a row is obtained by finding all isomorphisms of the two graphs and selecting the one compatible with a provided seed match between the track ID and plant ID of the bottom left module in the row. The plant ID of this module is provided by the user in an earlier step. Its track ID is found by searching for the bottom left module in the first or last frame of the row using the multiline fitting approach from above. Whether the first or last frame is used depends on the scan direction (leftward or rightward), which is estimated from the horizontal motion of the tracked modules. As the track graph can contain imperfections, an isomorphism cannot always be found and instead a subgraph isomorphism is computed. In the rare case that this also fails, the row cannot be processed further.
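The track graph construction can be sketched as follows, assuming each frame is available as a dictionary mapping track IDs to boolean masks. The square dilation, the default radius and overlap threshold and the function names are illustrative assumptions; the additional horizontal-line search for plants with gaps between module tables is omitted.

```python
import numpy as np
from itertools import combinations

def dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square, NumPy only."""
    h, w = mask.shape
    padded = np.pad(mask, r)
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def build_track_graph(frames, min_track_len=5, radius=3, min_overlap=1):
    """frames: list of {track_id: boolean mask}. Returns {track_id: set of neighbours}."""
    # Longest run of successive frames per track ID, to reject spurious tracks.
    longest, current, prev = {}, {}, set()
    for frame in frames:
        current = {t: (current.get(t, 0) + 1 if t in prev else 1) for t in frame}
        for t, n in current.items():
            longest[t] = max(longest.get(t, 0), n)
        prev = set(frame)
    valid = {t for t, n in longest.items() if n >= min_track_len}
    adj = {t: set() for t in valid}
    for frame in frames:
        ids = [t for t in frame if t in valid]
        grown = {t: dilate(frame[t], radius) for t in ids}
        for a, b in combinations(ids, 2):
            # Edge if the dilated masks share enough pixels.
            if np.count_nonzero(grown[a] & grown[b]) >= min_overlap:
                adj[a].add(b)
                adj[b].add(a)
    return adj

def clean_graph(adj):
    """Keep the largest connected component, then drop nodes of degree 1."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    keep = {n for n in best if len(adj[n] & best) != 1}
    return {n: adj[n] & keep for n in keep}
```

With a square structuring element, diagonal neighbours also overlap slightly after dilation, so the overlap threshold has to be chosen large enough to keep only direct neighbours.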
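Seeded matching can be illustrated with a small backtracking search over adjacency dictionaries. This sketch covers only exact isomorphism; the subgraph-isomorphism fallback mentioned above is not included, and the function name and data layout are assumptions. Graph libraries such as NetworkX provide general-purpose matchers that could be used instead.

```python
def match_graphs(track_adj, plant_adj, seed_track, seed_plant):
    """Find an isomorphism track ID -> plant ID that respects the seed match.

    track_adj, plant_adj: {node: set of neighbours} for one plant row.
    Returns a dict, or None if no isomorphism extends the seed.
    """
    if len(track_adj) != len(plant_adj):
        return None
    if len(track_adj[seed_track]) != len(plant_adj[seed_plant]):
        return None
    # Visit track nodes in breadth-first order so that every node after the
    # seed has at least one neighbour that is already mapped.
    order, queue = [seed_track], [seed_track]
    while queue:
        node = queue.pop(0)
        for nbr in sorted(track_adj[node]):
            if nbr not in order:
                order.append(nbr)
                queue.append(nbr)
    if len(order) != len(track_adj):
        return None  # track graph is not connected

    mapping = {seed_track: seed_plant}
    inverse = {seed_plant: seed_track}

    def extend(k):
        if k == len(order):
            return True
        t = order[k]
        anchors = [mapping[n] for n in track_adj[t] if n in mapping]
        # Candidates must be adjacent to the images of all mapped neighbours of t.
        candidates = set.intersection(*(plant_adj[p] for p in anchors)) - set(inverse)
        for p in sorted(candidates):
            if len(plant_adj[p]) != len(track_adj[t]):
                continue
            # Mapped neighbours of p must correspond to neighbours of t.
            if any(inverse[q] not in track_adj[t] for q in plant_adj[p] if q in inverse):
                continue
            mapping[t], inverse[p] = p, t
            if extend(k + 1):
                return True
            del mapping[t], inverse[p]
        return False

    return dict(mapping) if extend(1) else None
```

For a grid of two by three modules, seeding the bottom left corner leaves exactly one consistent assignment, which is why a single seed match is enough to resolve the symmetry of a rectangular row.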

H. Filtering Patches with Sun Reflections
For some camera angles sun reflections occur which distort the temperature measurement in the thermographic video and the extracted patches (see fig. 7). Due to the non-stationary nature of the reflection typically only a subset of the patches of a given PV module is affected. We need to filter them out to prevent issues in the downstream anomaly classification.
The filter finds the maximum temperature $(T_i)_{i=1,\dots,N}$ and its coordinates $(x_i, y_i)$ in all $N$ subsequent patches of a module. Patches in which $T_i$ and $(x_i, y_i)$ deviate significantly from a reference value most likely contain a sun reflection and are filtered out. More specifically, patch $i$ is filtered out if $|T_i - \bar{T}| > 5\,\mathrm{K}$ and $\|(x_i - \bar{x},\, y_i - \bar{y})\|_2 > 10\,\mathrm{px}$. The reference values $\bar{T}$ and $(\bar{x}, \bar{y})$ are median values computed from a subsequence of the patches which is obtained as follows. First, the discrete difference $p_{i+1} - p_i$ of the Euclidean norm $p_i = \|(x_i, y_i)\|_2$ is binarized at a threshold of 10 px. All zero-subsequences of the binarized difference which are longer than $0.3N$ are obtained (the longest is used if none exceeds $0.3N$). Finally, the zero-subsequence with the smallest variance of the maximum temperature $T_i$ is selected for computation of the reference values. Fig. 7 demonstrates the effectiveness of our filter.
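The filter rule can be sketched in NumPy as follows. Two details are assumptions: the absolute value of the discrete difference is binarized, and run lengths are counted in patches rather than in difference samples. The function name is hypothetical.

```python
import numpy as np

def reflection_mask(temps, coords, t_tol=5.0, d_tol=10.0, min_frac=0.3):
    """Boolean mask marking patches that likely contain a sun reflection.

    temps: maximum temperature T_i of each of the N patches of one module.
    coords: (N, 2) pixel coordinates (x_i, y_i) of that maximum.
    """
    temps = np.asarray(temps, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(temps)
    # Binarize the discrete difference of p_i = ||(x_i, y_i)||_2.
    p = np.linalg.norm(coords, axis=1)
    jumps = np.abs(np.diff(p)) > d_tol  # ASSUMPTION: absolute difference
    # Runs of successive patches without a jump, as (start, stop) with stop exclusive.
    runs, start = [], 0
    for i, jump in enumerate(jumps):
        if jump:
            runs.append((start, i + 1))
            start = i + 1
    runs.append((start, n))
    # Keep runs longer than min_frac * N, or the longest run if there is none.
    long_runs = [r for r in runs if r[1] - r[0] > min_frac * n]
    if not long_runs:
        long_runs = [max(runs, key=lambda r: r[1] - r[0])]
    # Reference run: smallest variance of the maximum temperature.
    s, e = min(long_runs, key=lambda r: np.var(temps[r[0]:r[1]]))
    t_ref = np.median(temps[s:e])
    xy_ref = np.median(coords[s:e], axis=0)
    hot = np.abs(temps - t_ref) > t_tol
    far = np.linalg.norm(coords - xy_ref, axis=1) > d_tol
    return hot & far
```

Patches where the mask is true would be discarded before anomaly classification; requiring both conditions keeps genuinely hot but stationary anomalies in the dataset.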

V. ANALYSIS OF PV MODULE EXTRACTION
In this section we present the dataset created by our PV module extraction tool and analyze failure cases, processing time and generalization ability.

A. Extracted Dataset
We run our PV module extraction tool on the seven PV plants in the video dataset and obtain a large-scale dataset with 4.

B. Generalization of the PV Module Segmentation
In this experiment we analyze how well Mask R-CNN generalizes to new PV plants. This is practically relevant as fine-tuning on a new plant is time and cost intensive.
To this end, we create training and validation datasets for PV plants A, B, C and D. Validation uses 25 video frames of each plant, training around 2380 PV modules per plant. Mask R-CNN is trained on all combinations of the training sets and its AP (mean of IoU thresholds {0.5, 0.55, . . . , 0.95}) is evaluated on each validation set. Training follows sec. IV-C, however, to speed up the experiment we pretrain and fine-tune for at most 25 epochs each and always select the model with lowest validation loss.
While the results in fig. 8 show an increase in validation AP with more training data, they also indicate that plant C differs significantly from plants A, B and D. This is because PV modules are oriented in landscape in plant C and in portrait in plants A, B and D. We validate this by re-running the experiment without randomly rotating frames by ±90° during training. This leads to a lower AP of 2.1 % to 43.7 % on plant C whenever plant C is not in the training set. Thus, to achieve a high AP, Mask R-CNN must be trained on plant C and at least one of the plants A, B or D. At this point we cannot fully explain the low sensitivity of AP for plant  Fig. 8 also reports the mean and standard deviation of all APs when training on one, two, three and four PV plants, respectively. The standard deviation decreases while the mean AP increases with more training data. As the AP asymptotically approaches a saturation value, the benefit of adding more training data decreases. We found that a segmentation model trained on at least three PV plants (of which one is plant C) achieves good results.

C. Failure Cases
Previously, we reported that our tool fails to process 49 out of 561 PV plant rows in our video dataset, corresponding to 12.2 % of all PV modules. We identify four common causes: (1) the UAV flight path violates the requirements from sec. IV-A, (2) the PV module segmentation fails, (3) rows have an irregular layout and (4) the row filtering fails. Fig. 9 shows examples for each failure and tab. 3 contains the relative frequencies. We report missed rows instead of missed modules because rows contain varying numbers of modules and an error in a single frame usually leads to loss of the entire row.
The largest share of rows (22 out of 49) cannot be processed due to an inadequate UAV trajectory. This is because some older videos in our dataset were acquired before we established the requirements on the UAV trajectory. Another 14 rows are missed due to false negatives of the PV module segmentation. They occur mostly in plants F and G, on which Mask R-CNN is not fine-tuned and which contain PV modules in landscape orientation. In a few cases segmentation also fails due to sun reflections or occlusion of modules by vegetation. Fine-tuning Mask R-CNN on more data can mitigate segmentation failures. Irregular row layouts cause failures in six rows. While our tool can handle missing modules, some failures still occur because Mask R-CNN fills gaps in the grid of modules. A further six rows are missed due to failures of the front row filter. They occur only for plants F and G and are related to the lower module segmentation accuracy. A more robust line-fitting method could solve this issue.
For now we tolerate these failures as our extracted dataset is large enough for downstream tasks.

D. Timing Analysis
Processing time is a critical factor for scaling our tool to larger PV plants. Fig. 10 reports timings of both manual and automatic steps of our tool. Automatic steps are timed on a workstation with an Intel Core i9-9900K, 64 GB of DDR4 RAM, a 4 TB Seagate IronWolf HDD and a GeForce RTX 2080 Ti running Ubuntu 20.04 LTS. Manual steps comprise UAV flight, frame grouping and plant file creation. The flight duration is estimated from the number of video frames and the frame rate. This underestimates the true duration slightly as battery changes and row changes of the UAV are not considered. For the manual frame grouping we estimate that the user can configure 30 groups per hour. Due to a lack of accurate measurements fig. 10 omits manual plant file creation. It takes 2 to 8 hours for a 3 MWp plant (10000 modules) depending on the regularity of its layout.
Timing differences between the plants are due to different video file formats, different plant and row layouts and different UAV flight altitudes and velocities. Track graph creation is faster for plants A, B and C because we can deactivate gap handling. In total, extracting 10000 modules from a 3 MWp plant takes 8 to 21.7 hours, depending on the plant layout. Of this, automatic steps account for 3.8 to 12.1 hours, which could be significantly reduced by parallelizing the currently sequential processing of PV plant rows. A further speedup is possible by increasing UAV flight velocity and altitude.

VI. THERMAL ANOMALY CLASSIFICATION
In this section we use the extracted thermographic patches for supervised classification of thermal anomalies in PV modules. To this end, we label the patches and train a ResNet-50 classifier to predict whether a patch is nominal or exhibits one of ten common anomalies. As our dataset contains on average 40 patches per PV module, we choose the majority class across those patches as the final class label for each module.
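Majority voting over the patches of a module can be sketched in a few lines. The input format, a sequence of (module ID, predicted class) pairs, and the function name are assumptions made for illustration.

```python
from collections import Counter

def module_labels(patch_predictions):
    """Aggregate per-patch class predictions into one label per module.

    patch_predictions: iterable of (module_id, predicted_class) pairs.
    Returns {module_id: majority class}.
    """
    votes = {}
    for module_id, label in patch_predictions:
        votes.setdefault(module_id, Counter())[label] += 1
    # Ties are resolved in favour of the class that was seen first.
    return {module_id: counts.most_common(1)[0][0] for module_id, counts in votes.items()}
```

For example, a module with two patches predicted as "Pid" and one as "C" receives the label "Pid".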

A. Dataset
An expert in our group labels each of the PV modules in our thermographic patch dataset with one of the ten thermal anomaly classes shown in fig. 11. The class scheme is based on experience and includes relevant module anomalies encountered in previous studies. It is deliberately not optimized for machine learning as the intention is to see how closely the classification of an expert can be reproduced. The structure of our dataset allows us to label modules instead of individual patches, which speeds up labelling. Note that we ignore modules of plant D because they are thin-film modules which exhibit different thermal anomalies than the crystalline silicon modules in the other plants. We further exclude all patches with sun reflections from the anomaly dataset and ignore sectors S1 and S2 of plant B to reduce the labelling workload. To reduce class imbalance (only 6.91 % of all modules are anomalous) we balance the numbers of healthy and anomalous modules separately for each plant. Finally, we select 70 % of the PV modules for training, 20 % for testing and 10 % for validation. By splitting the data on module-level we ensure that patches of the same module do not occur in multiple splits. The resulting classification dataset (see tab. 4) contains 453511 patches of 11644 PV modules, half of which are anomalous. There are on average 38.95 patches per module which act as different augmented views. Note that the distribution of anomalies differs significantly between the PV plants.

Figure 11: Example patches for the ten anomaly classes in our dataset. Severity decreases from left to right and top to bottom. Temperature ranges from 30 °C (black) to 60 °C (white). All patches except for class Cm+ are taken from plant A.
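Tab. 4 (excerpt): Number of modules and patches per anomaly class. For each class, six per-plant counts are followed by their total. The plant column labels and the remaining rows are not part of this excerpt, and the patch total of class Cs+ is cut off.

Class   Modules per plant (six plants)     Total   Patches per plant (six plants)            Total
Mh         5    87     4     0     1   494    591     212   2636    112      0     38  19968   22966
Mp         2     0     2     5     1     1     11      74      0    151    272     62     26     585
Sh        61    31     1     1     1     4     99    2421    804     43     73     13    145    3499
Sp         9     5     0    33     5    37     89     360    118      0   1802    217   1573    4070
Pid      980   341     0     0     0     0   1321   40422   9143      0      0      0      0   49565
Cm+        1    10     0    11     6     0     28      26    243      0    477    352      0    1098
Cs+       12    25     0    11    27     0     75     468    742      0    582   1348      0   (cut off)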
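The module-level split described above can be illustrated with a short sketch. It assumes that every patch carries the ID of its module. The function names, the fixed random seed and the rounding of split sizes are choices made for this example only, and the per-plant balancing step is not shown.

```python
import random


def split_by_module(module_ids, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split unique module IDs into train/test/validation sets."""
    unique_ids = sorted(set(module_ids))
    random.Random(seed).shuffle(unique_ids)  # reproducible shuffle
    n_train = round(fractions[0] * len(unique_ids))
    n_test = round(fractions[1] * len(unique_ids))
    return {
        "train": set(unique_ids[:n_train]),
        "test": set(unique_ids[n_train:n_train + n_test]),
        "val": set(unique_ids[n_train + n_test:]),
    }


def assign_patches(patches, splits):
    """Route (module_id, patch) pairs to the split of their module."""
    lookup = {m: name for name, ids in splits.items() for m in ids}
    assigned = {name: [] for name in splits}
    for module_id, patch in patches:
        assigned[lookup[module_id]].append(patch)
    return assigned


splits = split_by_module(range(100))
# Hypothetical data: three patches per module, identified by strings.
patches = [(m, f"patch_{m}_{k}") for m in range(100) for k in range(3)]
assigned = assign_patches(patches, splits)
```

Because the split is drawn over module IDs rather than patches, all patches of one module end up in the same split.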

B. Classifier Training
We initialize ResNet-50 with ImageNet 1.4M pretrained weights and replace the original fully connected (FC) classification layer with a randomly initialized FC layer containing 11 neurons, one for the healthy class and one for each of the ten anomaly classes. We freeze the base model and train only the FC layer for 10 epochs using the Adam optimizer with a learning rate of 0.001 and a batch size of 32. Afterwards, we fine-tune all layers starting from layer 101 for another 20 epochs using the RMSprop optimizer with a learning rate of 1e−5. During training, patches are augmented by random left-right and up-down flips. Preprocessing is similar to that used for segmentation (see sec. IV-C); however, histogram equalization is skipped and patches are resized to 224 × 224 pixels without padding and without maintaining the aspect ratio. We do not address class imbalance explicitly during training.
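The flip augmentation and the aspect-ratio-agnostic resizing can be sketched without any image library. The sketch covers only these two steps, not the training itself. The flip probability of 0.5 and the nearest-neighbour interpolation are assumptions for illustration, since neither is specified in the text, and a practical pipeline would use an array or image library instead of nested lists.

```python
import random


def random_flip(patch, rng):
    """Randomly flip a 2D patch (list of rows) left-right and up-down.

    Each flip is applied with probability 0.5 (an assumed value).
    """
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # left-right flip
    if rng.random() < 0.5:
        patch = patch[::-1]  # up-down flip
    return patch


def resize_nearest(patch, size=224):
    """Resize to size x size pixels, ignoring the aspect ratio.

    Nearest-neighbour sampling is an assumption made for this sketch.
    """
    height, width = len(patch), len(patch[0])
    return [
        [patch[i * height // size][j * width // size] for j in range(size)]
        for i in range(size)
    ]


rng = random.Random(0)
patch = [[1, 2], [3, 4]]
augmented = random_flip(patch, rng)
resized = resize_nearest([[0] * 3 for _ in range(5)])  # 5 x 3 -> 224 x 224
```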

1) Validation Metrics: The ResNet-50 classifier is evaluated on the test set by means of accuracy and per-class F1-scores averaged over all classes. Both the unweighted average and the average weighted by class support are reported. We further distinguish patch-level and module-level metrics, which are obtained before and after majority voting, respectively. For all metrics we report mean and standard deviation over three training runs.
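For reference, the difference between the unweighted and the support-weighted average of the per-class F1-scores can be made explicit with a small dependency-free sketch. The function name and the toy labels are made up for illustration. Applying the function to patch labels or to module labels after majority voting would give the patch-level and module-level variants, respectively.

```python
def f1_report(y_true, y_pred):
    """Return accuracy, unweighted (macro) and support-weighted F1."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1, support = {}, {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
        support[c] = tp + fn  # number of true samples of class c
    n = len(pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        "macro_f1": sum(f1.values()) / len(classes),
        "weighted_f1": sum(f1[c] * support[c] for c in classes) / n,
    }


# Toy example with an imbalanced ground truth (three "a", one "b").
report = f1_report(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

With imbalanced classes the two averages differ: the weighted average is dominated by the frequent classes, whereas the unweighted average gives rare classes the same influence as frequent ones.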
2) Test Performance: After fine-tuning, ResNet-50 achieves 89.40 % test accuracy at patch level (see tab. 5). Majority voting improves it to 90.91 %. The results are stable over three independent training runs. Training the classifier only on the first patch of each module instead of all patches reduces test accuracy by 5.4 %. This confirms the benefit of collecting multiple patches per PV module. As can be seen from the per-class metrics in tab. 6 and the confusion matrix in fig. 12, the classifier performs well on most anomaly classes but is less accurate on classes Mp, Cm+, Cs+ and Chs. The reason is the under-representation of these classes in our dataset, which leads to poor generalization from the training set to the test set. Other under-represented classes, such as Sh and Sp, are classified more accurately because the underlying visual patterns are less variable and can be learned from a small number of patches. In some cases, the classifier confuses classes C and D with healthy modules due to the high visual similarity of these classes. Similarly, Pid and C are confused. This is because some Pid modules have comparatively few overheated cells and some C modules comparatively many, so that the two classes overlap. High visual similarity between some classes also makes labelling difficult and may be a source of considerable noise in the ground truth labels.

3) Classifier Visualization: To understand whether the classifier bases its predictions on meaningful features of the patches, we compute class activation maps (CAMs). Fig. 13 shows a selection of CAMs. Each CAM visualizes the contribution of a particular image region to the classifier's final prediction. The high correlation between CAMs and temperature anomalies indicates that the classifier draws its confidence mainly from the hot regions in the patch. This is sensible and confirms that the high accuracy of the classifier is based on meaningful image features.
Figure 13: Class activation maps of the ResNet-50 classifier obtained with Grad-CAM++ [43]. The patches correspond to fig. 11.

To gain additional insight into the classifier we visualize embeddings of the test set patches in fig. 14. A few large clusters can be observed, which correspond to the six PV plants and most of the anomaly classes. For plant A there are two clusters each because modules in the top row are rotated by 180° compared to those in the bottom row. In addition, several smaller clusters occur, which correspond to individual PV modules. Some of them are outliers, others represent classes, such as Cs+ and Sp, which do not form compact clusters due to low sample count and high intra-class variance. The embedding space reflects the classifier's confusion of some classes, e.g. Pid/C and C/D/Healthy, as partial overlap of the respective clusters. Similarly, the low accuracy of some classes, such as Cm+ and Chs, can be explained by the almost complete overlap of the respective clusters with other clusters.
Figure 14: ResNet-50 embeddings of the test dataset after dimensionality reduction with UMAP [44]. Embeddings are obtained from the last convolutional layer. Colors represent the ground truth class. For better visualization we show only 5 % of all data points.

VII. DISCUSSION AND CONCLUSION
1) Summary: In this work, we developed a computer vision tool for semi-automatic processing of UAV thermographic videos. It handles the large amounts of thermographic images acquired during inspection of PV plants, extracts individual PV modules and classifies ten common module anomalies with an accuracy of more than 90 % using a ResNet-50 classifier. It further provides the exact location of defective modules in a plant, allowing for targeted repairs. Videos are used instead of single images for faster inspection and increased flexibility of UAV operation. Our tool can be used for automated inspection of PV plants, superseding expensive and time-consuming manual inspection. This can reduce the cost of PV plant maintenance, ensure safe operation and maximize yield.

Furthermore, our tool efficiently creates large-scale thermographic datasets by exploiting redundancy in the video. We use this capability to curate a dataset with 4.3 million thermographic images of 107842 PV modules from seven PV plants. Modules in the dataset are automatically indexed based on their electrical wiring and location in the plant. This unique index and the large size of the dataset enable research on other downstream machine learning tasks, such as power prediction, which are essential for the safe and profitable operation of future PV plants of ever-growing size.
2) State-of-the-art Improvements: In contrast to many related works, we use deep learning for PV module detection, which improves accuracy and generalization. No hyperparameters had to be adjusted to extract modules from the seven different PV plants. By using a deep convolutional classifier for supervised classification of thermal anomalies we follow a recent trend in the field. However, our dataset is significantly larger and we distinguish ten anomaly classes as opposed to at most four classes in the related works. Distinguishing many anomaly classes is of value not only for research datasets but also for plant operators, as it facilitates more detailed cataloguing of anomalies in a plant. This is important because some anomalies can worsen over time, eventually causing power losses or outages. Despite the larger number of classes, the test accuracy of our classifier is on par with the related works. However, we also found that classification accuracy is lower for some under-represented classes in our dataset. This shows that large-scale datasets are required to detect rare anomalies, which affect only a handful out of thousands of modules. Smaller datasets, as used in many related works, do not sufficiently cover such rare anomalies. To allow for even more accurate and fine-grained classification in the future, we will expand our dataset and explore other deep learning methods that overcome the issue of low accuracy on under-represented classes.
3) Future Relevance: Our work is a first step towards the ultimate goal of automatically characterizing gigawatt-scale PV plants with millions of modules in a day. It shows a way to organize and process the large amounts of data accrued during inspection. However, to achieve full automation and scale up to gigawatt plants, multiple UAVs should be used and UAV operation has to be automated. This leads to a predictable scanning order of plant rows, which renders most of the manual steps of our tool unnecessary. Scaling up also requires reducing processing time. Given full automation, the worst-case throughput of our tool is 19800 modules per day on a single workstation. Processing the 3.5 million modules of a 1 GWp plant in a day requires a 177-fold speedup. This speedup is practically feasible by parallelizing the currently sequential processing of PV plant rows. While this requires a parallel implementation on a small compute cluster, it does not require fundamental changes to the vision algorithms.

4) Future Challenges: Some challenges remain for future work. Examples are the detection of string-level anomalies and of faults in non-module components, such as inverters. To this end, multimodal datasets (imagery and electrical) as produced by our tool can be used in combination with machine learning. Future work should also consider additional image sources, such as visual and electroluminescence imagery. For wider applicability, anomaly classification could be extended to thin-film, bifacial and half-cell modules, and PV module extraction to plants with non-row layouts, as common in floating PV. Furthermore, methods are needed which predict the PV plant's future health state based on historic data. Finally, the dependency of the anomaly classification on ambient conditions should be explored. We have indications for such a dependency but not yet enough data for a systematic analysis.
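The speedup stated under Future Relevance follows from simple arithmetic, reproduced here as a short check. The two input numbers are taken from the text. Interpreting the rounded-up speedup as a number of parallel workers assumes ideal linear scaling, which is an assumption of this sketch and not a measured result.

```python
import math


def required_speedup(modules_in_plant, modules_per_day):
    """Factor by which single-workstation throughput must increase
    to process the whole plant within one day."""
    return modules_in_plant / modules_per_day


# Values from the text: 3.5 million modules in a 1 GWp plant and a
# worst-case throughput of 19800 modules per day on one workstation.
speedup = required_speedup(3_500_000, 19_800)

# Assumption: ideal linear scaling over rows processed in parallel.
parallel_workers = math.ceil(speedup)
```

Here `speedup` is roughly 176.8, which rounds up to the 177-fold figure quoted above.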