Road crack detection interpreting background images by convolutional neural networks and a self‐organizing map

The presence of road cracks is an important indicator of damage. Deep learning is a prevailing method for detecting cracks in road surface images because of its detection ability. Previous research works focused on supervised convolutional neural networks (CNNs) without non‐crack features or unsupervised crack analysis with limited accuracies. The novelty of this study is the addition of background classification. By increasing the number of non‐crack categories, CNNs are driven to learn non‐crack features and improve crack detection performances. Non‐crack images are preprocessed, and their features are extracted in an unsupervised way by a deep convolutional autoencoder. A self‐organizing map clusters features to obtain non‐crack categories. This study focusses on classification though the method can be adopted in parallel with the latest segmentation algorithms. Using common road crack datasets, modified deep CNN models significantly improved accuracy by 1%–4% and f‐measure by 3%–8%, compared to previous models. The modified visual geometry group (VGG) 16 showed the top‐level performance, 96% accuracy and 84%–85% f‐measure. The models drastically reduced false detection cases while maintaining their crack detection abilities.

and inertial positioning systems, which is called a mobile mapping system, is utilized to obtain large-scale top-view road surface imaging data (Mizutani et al., 2022).
Many research works have addressed the development of automatic algorithms for detecting road cracks in images. In this study, concrete and asphalt-type pavements are targeted. Typical cracks in pavements are wider than those in concrete structures. However, pavement images are sometimes noisy with aggregates, dust, and shadows, which poses different difficulties. Nevertheless, in some research works, pavement crack images were mixed with crack images of concrete structures and other materials (e.g., Yahui Liu et al., 2019; Zheng et al., 2022). Therefore, general crack detection algorithms are discussed in this article for a comprehensive review.
Previous crack detection algorithms are categorized into (1) image processing, (2) machine learning (ML), and (3) deep learning (DL)-based algorithms (e.g., Shi et al., 2016; Yamaguchi & Mizutani, 2023; Yang et al., 2020). Considering target tasks, the algorithms can also be divided into (a) classification, (b) localization, and (c) segmentation-type algorithms. Considering the robustness and trend of DL approaches, this study also focusses on DL algorithms. In practice, top-view road surface images are divided into grid areas. The areas are classified into "crack" or "non-crack" to construct a crack map and evaluate the entire condition of the road surface. This study addresses the crack image classification task to offer a realistic crack map.
Most of the recent research works focus on segmentation algorithms. There are two reasons why this study focuses on classification. (1) If perfect segmentation were realized, classification would not be necessary. However, segmentation accuracy decreases when there are shifts in pixels. As will be discussed, the highest classification accuracy realized in this study is around 96%, which is top-level performance compared to recent segmentation algorithms showing at most 60% to 70% f-measure and intersection over union. The problems of localization and classification are thus separated. (2) The interpretation of non-crack images is the core idea of this study. To analyze images, certain areas should be extracted. However, as future work, the method can be adopted in parallel with previous segmentation algorithms by segmenting the images classified by the proposed method (Dorafshan et al., 2018; B. Kim & Cho, 2019). Li et al. (2023) proposed a road crack detection method combining localization and classification networks to efficiently estimate crack extension. Segmentation is important for obtaining detailed crack information such as bifurcations and widths. In the following literature review section, classification and segmentation algorithms and road crack databases are discussed in detail.
The information in background non-crack images is not fully utilized in previous research works. Though there are abundant features in non-crack images, such as textures of asphalt aggregates, shadowed areas, paint marks, and dust, these features are not utilized in the typical "crack" versus "non-crack" two-category classification by existing convolutional neural networks (CNNs). Some noisy patterns may help discriminate cracks from non-crack images. However, they are simply treated as noise, though their features need to be carefully explored.
The approach of this study is different from previous research works. This study incorporates the background analysis of unsupervised convolutional autoencoder (CAE) feature extraction and self-organizing map (SOM) clustering into state-of-the-art supervised deep CNN architectures. The SOM is superior in visualizing the structure of data. Few research works have discussed the features of non-crack images. This approach provides two advantages: (1) The algorithm showed the highest classification accuracy with little increase in calculation cost, compared to previous supervised CNNs and unsupervised crack analysis. (2) As already explained, this approach can be adopted in parallel with the latest DL and ML classification and segmentation algorithms.
The logic of this study is to analyze patterns in non-crack images and label non-crack images into finer categories. The number of categories of non-crack images is increased by the unsupervised CAE and SOM to drive CNN architectures to learn features of non-crack images, with the aim of improving crack detection abilities. Supervised and unsupervised algorithms are combined. This study adopts pseudo-labeling. However, the problem setting is different from the semi-supervised (partially labeled, self-supervised) learning acknowledged in DL research fields. This point is discussed in the following section with reference to previous research works.
The accuracy of semi- and unsupervised learning on crack detection problems presented in previous research works was limited. The validity of the proposed method depends on the performances of the CAE and SOM. As will be shown, three points reinforce the validity of the method: the developed CAE reconstructed crack images; the SOM showed a certain pattern and repeatability; and the comparisons of classification accuracy quantitatively demonstrated the effect of the method.
Calculation costs should be considered for practical application. The proposed method requires only small modifications to existing CNN architectures, that is, increasing the number of categories. Therefore, the calculation time is comparable with previous architectures, which is another advantage of the method.
As a summary of the discussions above, the pros and cons of the proposed algorithm, compared to the state-of-the-art methods, are listed below.
The pros are:
1. This study incorporates the information of background images in the classification task by unsupervised feature extraction and pseudo-labeling. This approach is different from ordinary "crack" versus "non-crack" classification, which does not consider the differences of spatial features in the background images.
2. By utilizing the classification ability of supervised deep CNNs with pseudo-labeling, the highest classification accuracy was achieved, compared to the previous CNNs and unsupervised methods.
3. The state-of-the-art CNN architectures can be used with only small modifications. Training time is almost the same as for the previous CNNs.
The cons are:
1. To analyze the contexts of the background images, this study focuses on the classification of images. The classified crack images can be segmented using up-to-date segmentation algorithms.
2. The improvements by the proposed method are limited by the amount of background images in the considered dataset.
The algorithm consists of two steps: The first step is preprocessing, CAE feature extraction, and SOM clustering. The second step is CNN classification. The outline of this article is as follows: Section 2 discusses related previous research works; Section 3 shows the concept of the proposed method and summarizes the contributions of this study; Section 4 details the configuration of datasets and evaluation metrics; Sections 5 and 6 explain Step 1: preprocessing and comparisons of feature extraction methods (Section 5) and optimization of an SOM (Section 6). In Section 6, clustered features are discussed with example images. Section 7 explains Step 2: comparisons of CNN architectures, the confusion matrix, and classification results are shown. Section 8 discusses the limitations of this study. Section 9 concludes the article.

LITERATURE REVIEW
In this section, state-of-the-art previous research works are reviewed. They concern DL road crack detection; training and data augmentation methods; semi- and unsupervised analysis, including background detection, clustering, and feature extraction for general image and video object detection; and road crack databases for validating the proposed method.
The two research directions in detecting cracks in images using DL are accurate segmentation and detection of thin cracks. In terms of accurate segmentation, Yang et al. (2020) proposed cascaded convolution layers with top-down and bottom-up feature pyramids to consider the contexts of cracks for accurately segmenting crack pixels. Yahui Liu et al. (2019) proposed a deep hierarchical CNN to segment crack images obtained from the Internet. Ji et al. (2020) and Dung and Anh (2019) segmented asphalt road cracks by optimized convolutional encoder-decoder architectures. Bang et al. (2019) proposed a convolutional encoder-decoder architecture to segment cracks in black-box images. Huyan et al. (2020) proposed a U-Net architecture to segment pavement cracks, which was validated using their own datasets. Gopalakrishnan et al. (2017) applied a deep CNN adopting transfer learning (TL) and compared neural networks (NNs) and ML algorithms for the accurate classification of pavement distress images. Chu et al. (2022) introduced an attention mechanism to an NN to efficiently extract thin crack features from large concrete bridge structures. Maeda et al. (2018) addressed the localization of general road damages using smartphones fixed on vehicle dashboards, comparing existing deep CNN architectures. A single-shot multibox detector was used to localize and classify damages. Elaborated architectures have been proposed to accurately segment cracks. After several initial works, deep CNN architectures for the crack classification task may already be optimized. Simple comparisons of existing architectures are not novel.
In terms of thin crack detection, J. Chen and He (2022) proposed a U-Net architecture with an attention mechanism to detect pixel-level thin cracks. Siriborvornratanakul (2023) proposed an optimized U-Net architecture reducing the number of feature maps with an appropriate loss function for developing a sensitive algorithm. The problem of thin crack detection is that increasing sensitivity also increases false detection cases.
Some research works focused on highway pavements using their own vehicle imaging systems, comparing typical image processing filters to delineate crack features (Lee & Lee, 2004; Majidifard et al., 2020; Wu et al., 2014; K. Zhang et al., 2018). Other recent applications of DL to image processing are related to concrete slab track and tunnel lining monitoring (Ye et al., 2023).
The approach of this study is different from the previous research works. Semi-supervised learning is the utilization of unlabeled data to complement small labeled datasets (Engelen & Hoos, 2019). However, in this study, all the crack images are already labeled, while non-crack images are further analyzed. For example, CrackGAN proposed by K. Zhang et al. (2021) adopted a generative adversarial network (GAN) for generating training images and receptive field analysis to consider large crack image contexts. Maeda et al. (2021) proposed a GAN for complementing road damage datasets. Wang and Su (2021) proposed a combination of student and teacher crack detection models with different noise and averaged weights. Unlabeled images were used to evaluate the consistency of the two models. Zheng et al. (2022) adopted active and semi-supervised learning to detect cracks in concrete bridge structures. Active learning is the manual labeling of confusing images suggested by classifiers. The meaning of semi-supervised in that study is that unlabeled data are labeled by a supervised model (pseudo-labeling) to increase training images. These research works utilized the patterns in the unlabeled images, though non-crack features were not directly targeted. The concept of the pseudo-labeling adopted in this study is for classifying non-crack images.
Another possible strategy is to develop completely unsupervised classification algorithms and interpret the classified groups afterward for the crack detection purpose. This study is inspired by previous research works about general unsupervised image and video analysis, especially using CAEs and SOMs (Bishop, 2006; Goodfellow et al., 2016). Unsupervised segmentation is one of the most promising image analysis research fields (Deng & Manjunath, 2001; W. Kim et al., 2020). The algorithms are based on region growing, considering the similarity inside a region and the differences between regions. The algorithms assume salient objects with certain spreading areas. Therefore, detecting thin cracks in noisy asphalt pavements is not easy. Some crack areas are merged with background aggregates and falsely detected. How to select crack areas from the many segmented areas is another problem.
The SOM is a type of NN that maps feature vectors into designated cells, constructing a distribution pattern. Local updating rules are applied to the representative weight vectors of the cells (Kohonen et al., 1990). This algorithm is related to clustering. Obtained SOM cells are further clustered to understand the characteristic features represented by the SOM (Vesanto & Alhoniemi, 2000). Gemignani and Rozza (2016) detected moving objects in videos by a multilayered SOM. This algorithm stores features of background regions in the SOM to detect moving objects as anomalies. Therefore, it is one type of anomaly detection algorithm. As will be compared, unsupervised crack analysis shows limited accuracies, indicating that unique features of cracks should be considered to realize accurate crack detection. Consequently, simply applying an SOM is not useful, though the SOM may have the potential to effectively incorporate background non-crack image features.
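The local updating rule described above can be illustrated with a minimal NumPy sketch (not the authors' implementation; the map size, learning rate, and Gaussian neighborhood schedule are illustrative assumptions):

```python
import numpy as np

def train_som(data, rows=6, cols=6, epochs=10, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM: each cell holds a representative weight vector; the
    best-matching cell and its grid neighbors are pulled toward each
    input sample (Kohonen local updating rule)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            frac = step / n_steps
            lr = lr0 * (1 - frac)                 # decaying learning rate
            sigma = sigma0 * (1 - frac) + 1e-3    # shrinking neighborhood
            # best-matching unit (cell whose weight is closest to x)
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood function on the 2D grid
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1)
                       / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
            step += 1
    return weights

def map_sample(weights, x):
    """Return the grid cell a feature vector is mapped to."""
    d = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)
```

After training, nearby cells hold similar weight vectors, so clusters of input features form contiguous regions on the map, which is what makes the SOM useful for visualizing the structure of the non-crack features.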
How to select feature values is the problem in applying an SOM approach. In addition to typical statistical indices and image features, deep CAEs are a common technique that effectively extracts image features in an unsupervised way (M. Chen et al., 2021; Chow et al., 2020). In those research works, background image features were extracted by CAEs and analyzed by SOMs. TL is also a common technique that can import image features learned in other research fields (Weiss et al., 2016). In this study, CNN architectures trained on ImageNet are transferred to help the architectures learn features from non-crack images and accomplish a multicategory task. The adopted TL method is not novel, compared to previous research works.
In terms of road crack databases, Shi et al. (2016) proposed the CrackForest database (CFD) to validate their segmentation algorithm based on random structured forests and a support vector machine (SVM). The number of images is 118. The image size is 480 by 320 pixels. Corresponding ground truth segmentation data are also provided for each image. Yang et al. (2020) developed a large segmentation database called Crack500. Five hundred images and corresponding segmentation data with sizes of about 2500 by 1500 pixels are provided. This database includes top-view images and several forward-view images with various crack and pavement patterns. NHA12D (Z. Huang et al., 2022) is a complete validation dataset from top- and forward-view cameras consisting of 120 images. RDD2020 (Maeda et al., 2018) captured various road damages using smartphones fixed on vehicle dashboards. SDNET2018 (Maguire et al., 2018) contains concrete pavements, bridge decks, and walls. DeepCrack (Yahui Liu et al., 2019) adopted cracks in various materials obtained by crawling Internet images. Considering the sizes of the datasets and the road crack detection task, CFD and Crack500 are adopted to demonstrate the applicability of the proposed method.
To apply classification algorithms, the images were divided into small patches. Referring to the segmentation ground truth data, crack and non-crack images were prepared. Few previous research works have discussed the performances of classification algorithms for these datasets. The segmentation accuracy by Yang et al. (2020) was evaluated by an aggregated f-measure applying the best threshold for each CFD and Crack500 image (called OIS). The f-measure was 0.6-0.7. The f-measure of existing DL image classification architectures (e.g., VGG16) for a mixture of several datasets by Zheng et al. (2022) was around 0.9. These are the baselines of this study. This study compares existing DL architectures to demonstrate the effectiveness of the method.

Concept
The method of this study is summarized in Figure 1. The method consists of two steps. In the first step, non-crack images are interpreted. To effectively conduct unsupervised analysis, preprocessing is important. This study adopts a Wiener filter and normalization (Haykin, 2001; Oppenheim & Schafer, 2010). The reasons for selecting the algorithms are discussed in Sections 5 and 6, and Section 7 shows validation results. The characteristic features of asphalt pavement patterns and objects on road surfaces, such as shadowed areas and spot-like paints and dusts, are extracted in an unsupervised way using an optimized CAE architecture. An SOM algorithm automatically optimizes a map size and clusters the features output by the developed CAE. Non-crack images are pseudo-labeled referring to the obtained clusters. The labels are denoted as background 1, background 2, ..., background n.
The second step is ordinary DL classification. Considering the increased n background categories and the crack images, a K = n + 1 category network is trained to learn detailed non-crack image features. This approach also improves the discrimination ability between crack and non-crack images, as will be shown in the results section. The architectures are typical deep CNN architectures such as VGG16, adopting TL of ImageNet parameters. The probabilities of the K categories are output. The probability of the crack category is thresholded to judge the existence of cracks. This probability indirectly considers all the remaining non-crack categories.
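Merging the multicategory softmax output back into a binary crack decision can be sketched as follows (a minimal illustration; the function name and the assumption that the crack class sits at output index 0 are not from the paper):

```python
import numpy as np

def crack_decision(probs, crack_index=0, threshold=0.5):
    """Collapse a K-category softmax output (1 crack category plus
    n background categories) into a binary crack / non-crack decision:
    only the crack-category probability is thresholded, while the
    background probabilities are considered indirectly because the
    softmax probabilities sum to one."""
    probs = np.asarray(probs, dtype=float)
    p_crack = probs[..., crack_index]
    return p_crack > threshold

# e.g. K = 4 categories: [crack, background 1, background 2, background 3]
decisions = crack_decision([[0.7, 0.1, 0.1, 0.1],
                            [0.2, 0.5, 0.2, 0.1]], threshold=0.5)
```

This is why the modified networks need no extra inference cost: the only change relative to a two-category CNN is the width of the final layer and this merging step.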
All the CNNs and the CAE adopted in this study are deep, considering their performances. The selections of a preprocessing filter, CAE architecture, SOM size, CNN architectures, and hyperparameters are important. Each process should be optimized to achieve the highest performance. The proposed method is composed of feature extraction and the construction of classification criteria. The method adopts supervised learning of labeled data. However, K − 1 categories are pseudo-labeled by unsupervised feature extraction and clustering. Therefore, completely supervised two-category classification and unsupervised feature extraction and classification algorithms should be compared to highlight the effectiveness of the proposed method.

Contributions
In terms of originality, the problem setting of this study is different from the previous research works. The three contributions are summarized below, corresponding to the two steps of the method.
Effective preprocessing and feature extraction schemes are proposed for the road crack detection purpose. Regarding preprocessing, a Wiener filter and normalization are adopted. For feature extraction, other than the CAE, original images with average pooling, the histogram of oriented gradients (HOG) feature, the Gabor filter, and the Hough transform of edges detected by the Canny method are compared.
F I G U R E 1 Flow chart of the proposed method. The method consists of two steps: top: background analysis by preprocessing, convolutional autoencoder (CAE), and self-organizing map (SOM); bottom: crack detection by convolutional neural network (CNN). The algorithm components are discussed one by one in this article.
The extracted features are quantitatively evaluated and extensively compared using nonlinear SVM classifiers optimized for the crack detection task. The SVMs are used only for comparison and are not adopted as the highest performance algorithm in this study (Section 5).
Unsupervised road surface analysis based on an SOM and common datasets is conducted for the first time. A mapping result and example images are discussed to provide insights into non-crack road surface image features (Section 6).
Multicategory deep CNNs combining unsupervised and supervised learning using the CAE, SOM, and CNNs are proposed for the first time to achieve the highest crack detection performance by incorporating non-crack image features. The performances of the developed deep CNNs are significantly improved, compared to recent CNNs such as VGG16 and MobileNetV3. A confusion matrix is shown to discuss the nature of the developed multicategory CNN. The method is validated using the two datasets, Crack500 and CFD (Section 7).

Dataset configuration
In this study, two datasets, Crack500 and CFD, were used to develop and validate the algorithms. Table 1 summarizes the configuration of the two datasets. Crack500 is detailed in Yang et al. (2020) and CFD in Shi et al. (2016).
In Crack500, cracks on asphalt pavements were captured. The generality of the background information depends on the generality of the crack datasets, provided that an accurate classification algorithm is developed. This study utilizes the two datasets, Crack500 and CFD, to demonstrate the performance of the proposed method on different datasets. Detection results with example images and tables are shown to support the ability of the algorithm qualitatively and quantitatively.
The two datasets were proposed for the segmentation purpose. The images were cropped to 128 by 128 pixel patches and labeled crack or non-crack based on the segmentation masks in the corresponding areas. The size of the image patches is important. Too large a size will lower localization ability; too small a size will decrease classification accuracy. Considering the area resolution required in the practice of road crack evaluation, and confirming that classification accuracy does not decrease, the 128 by 128 pixel size was selected. In some cases, patches included only small fractions of cracks at the corners. If at least 2.5% (128 × 128 × 0.025 = 410 pixels) of an image is crack, the image is labeled as crack. This corresponds to a 3-pixel-width crack, which is the minimum width considered in the crack detection research field. The data volume of Crack500 is close to the storage limit of ordinary personal computers (PCs). Therefore, half of the images were randomly sampled, confirming that training accuracy was not reduced. The total numbers of crack and non-crack images produced from Crack500 are about 8200 and 47,000, respectively. Those from CFD are about 2400 and 15,000. The numbers are large enough not to affect the results of this study. The ratios of crack to non-crack images are 1:6-1:7 for both Crack500 and CFD.
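The patch cropping and the 2.5% labeling rule can be sketched as follows (a hedged NumPy illustration; the function name and the non-overlapping grid traversal are assumptions, not the authors' code):

```python
import numpy as np

def label_patches(mask, patch=128, min_crack_ratio=0.025):
    """Crop a binary segmentation mask into non-overlapping patches and
    label each patch crack (True) if at least 2.5% of its pixels are
    crack pixels (about 410 pixels for a 128 x 128 patch, corresponding
    to a 3-pixel-wide crack crossing the patch)."""
    h, w = mask.shape
    labels = {}
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            ratio = mask[i:i + patch, j:j + patch].mean()
            labels[(i, j)] = bool(ratio >= min_crack_ratio)
    return labels
```

The threshold prevents patches containing only a few crack pixels at a corner from being labeled crack, which would otherwise inject label noise into training.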
The large background datasets are learned by the proposed CNN-SOM algorithms.

Computational environments
High-end computers are of paramount importance for conducting the required training. A graphics processing unit (GPU), NVIDIA GeForce GTX 1080 Ti, and a CPU, Intel Core i7-8700K @ 3.7 GHz, are used (NVIDIA, 2023). All programs were written in TensorFlow (TensorFlow, 2023). Inference time (the time to output the probabilities of categories from one input image) is also important and should be suppressed using simple and fast CNN architectures such as MobileNet, as will be compared in Section 7. The inference time is not increased by the proposed method because most parts of the CNN architectures are the same, and the number of output categories is the only difference.

Training and evaluation metrics
The two datasets are imbalanced; the number of non-crack images is much larger than that of crack images. For evaluation, referring to the previous research works, accuracy (classification accuracy), precision, recall, and f-measure were compared. Accuracy is the ratio of correctly classified images among all the images. Ordinary accuracy is not appropriate for imbalanced datasets. CNN models output a probability p of cracks for each image. Therefore, after training, by adjusting a threshold t for the probability (p > t), the accuracy was optimized for each dataset. The training in this study is thus accuracy-based. Considering the ratios of the two categories, accuracy less than 6/7-7/8 (86%-88%) is meaningless because even a CNN that outputs non-crack for all input images will satisfy this accuracy.
With that threshold t, precision, recall, and f-measure were calculated. Precision P is the ratio of true crack images among the detected crack images, and recall R is the ratio of detected crack images among the true crack images. The f-measure (Dice coefficient) F is defined as F = 2PR/(P + R). P, R, and F are functions of t. Of course, accuracy and f-measure lower than 0.5 (50%) are meaningless. These indices do not necessarily correspond with each other one-to-one. In some cases, P is high; in other cases, R is high. Among the four indices, f-measure is the most robust to the imbalance (Lobo et al., 2008; Popovic et al., 2007). In this study, accuracy and f-measure are compared. Precision and recall are also shown. These indices are tabulated to show the effectiveness of the proposed method. The area under the curve (AUC) is not appropriate because the outputs of the proposed models are multicategory and merged afterward.
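The accuracy-based threshold optimization and the metrics above can be sketched in NumPy (an illustrative sketch; the threshold grid is an assumption, not a value from the paper):

```python
import numpy as np

def f_measure(y_true, p_crack, t):
    """Precision P, recall R, and f-measure F = 2PR/(P + R) obtained by
    thresholding the crack probabilities at t (predict crack if p > t)."""
    y_pred = np.asarray(p_crack) > t
    y_true = np.asarray(y_true).astype(bool)
    tp = np.sum(y_pred & y_true)    # true crack images detected as crack
    fp = np.sum(y_pred & ~y_true)   # non-crack images detected as crack
    fn = np.sum(~y_pred & y_true)   # crack images missed
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

def best_threshold(y_true, p_crack, grid=np.linspace(0.05, 0.95, 19)):
    """After training, pick the threshold t that maximizes accuracy,
    as described above."""
    y_true = np.asarray(y_true).astype(bool)
    p_crack = np.asarray(p_crack)
    accs = [np.mean((p_crack > t) == y_true) for t in grid]
    return grid[int(np.argmax(accs))]
```

P, R, and F are all functions of the chosen t, which is why the threshold is fixed per dataset before the four indices are tabulated.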
This study also compares the time for predicting the category of one image in the given computational environment. This inference time is the most common index for evaluating computational performances in the DL research field. This index is adopted as a reference for comparing the results with those of other research works.

Preprocessing
Figure 2 exhibits the process of the adopted preprocessing method. Preprocessing is applied to the whole image.
Preprocessing is mandatory for valid unsupervised feature extraction. Figure 2a is an original red-green-blue (RGB) color image. It was found that pixel-level speckle-like noise is dominant in pavement images. Preprocessing was applied to each RGB channel. For example, Figure 2b is the colormap of the R channel of Figure 2a. Figure 2c is the result after applying the Wiener filter and normalization (Haykin, 2001; Oppenheim & Schafer, 2010). The size of the filter is 16 by 16 pixels. The Wiener filter is an adaptive filter. Noisy areas are strongly smoothed, while the areas of cracks and characteristic features are weakly smoothed. From Figure 2c, the noise is reduced, while the crack remains clear. The larger the filter size is, the smoother the image is. The features of cracks may be blurred with too large a filter size. However, the results were not sensitive to the size; 16 pixels was enough for the two datasets. Normalization is applied to each RGB channel to adjust the maximum intensity to 1 and the minimum to 0. This process removes the effects of lighting conditions and pavement colors. Without the Wiener filter and normalization, feature extractors may learn the differences in noise levels and colors as important features. After preprocessing, the channel images were stacked up to be input into the feature extractors.
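The per-channel Wiener filtering and min-max normalization described above can be sketched with SciPy's `wiener` function (a hedged illustration, not the authors' code; the small epsilon guarding the division is an added assumption):

```python
import numpy as np
from scipy.signal import wiener

def preprocess(rgb, size=16):
    """Per-channel Wiener filtering with a 16 x 16 window, followed by
    min-max normalization of each RGB channel to [0, 1]. The adaptive
    Wiener filter smooths noisy flat areas strongly while smoothing
    crack-like features only weakly."""
    out = np.empty_like(rgb, dtype=float)
    for c in range(rgb.shape[2]):
        ch = wiener(rgb[:, :, c].astype(float), (size, size))
        ch_min, ch_max = ch.min(), ch.max()
        # normalization removes lighting and pavement-color differences
        out[:, :, c] = (ch - ch_min) / (ch_max - ch_min + 1e-12)
    return out
```

The normalized channels are then stacked back into a three-channel image before being fed to the feature extractors.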
Figure 3 shows the proposed five-layer CAE architecture. The input and output of the CAE are 128 by 128 pixel RGB-color three-channel images. The objective of the CAE is to extract spatial features to compress images. Simultaneously, the images reconstructed from the compressed features should retain crack information. Two important parameters should be considered: the number of layers and the choice of a loss function. There is a tradeoff; the deeper the CAE is, the smaller the compressed data are, and the larger the information loss may be. The CAE is trained to minimize the difference between the input and output images. An appropriate loss function should be assigned to develop an effective autoencoder.
In terms of the number of layers, the dimension of the input image is about 49,000. This number is neither assumed nor feasible for SOM analysis (Kohonen et al., 1990; Vesanto & Alhoniemi, 2000). It was observed that it took several days to analyze the features; calculation time drastically increased around this number. The first encoder layer calculates spatial features from the input image to convert it to 10 feature maps of 64 by 64 size. Max pooling is used to suppress map sizes. However, the dimension is still around 41,000. The second encoder layer further compresses the features to 15 feature maps of 32 by 32 size. The dimension is about 15,000, still large. However, it is one third of that of the original image and relatively lightweight.
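The dimensionality reduction across the encoder can be checked with quick arithmetic (a sketch of the sizes quoted above):

```python
# Input: 128 x 128 RGB three-channel image
input_dim = 128 * 128 * 3   # 49,152, i.e. "about 49,000"

# First encoder layer + max pooling: 10 feature maps of 64 x 64
layer1_dim = 10 * 64 * 64   # 40,960, i.e. "around 41,000"

# Second encoder layer + max pooling: 15 feature maps of 32 x 32
layer2_dim = 15 * 32 * 32   # 15,360, roughly one third of the input
```

Only the second-layer output, roughly a third of the raw image dimension, is flattened and passed to the SOM, which keeps the clustering step tractable.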
Figure 4 compares the reconstructed images, changing the number of layers and the type of loss function. Figure 4a shows the original image, and Figure 4b shows the image reconstructed by the adopted autoencoder. The autoencoder is trained using a mean absolute error (MAE). The MAE is the summation of the absolute differences of the intensities between the original image I_o and reconstructed image I_r at the corresponding pixels (i, j), divided by the number of pixels N:

MAE = (1/N) ∑_{i,j} |I_o(i,j) − I_r(i,j)|

The reconstructed image is slightly blurred. However, the crack and the aggregates inside the crack are recognizable.
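The MAE loss above can be computed directly (a minimal NumPy sketch; the function name is an assumption):

```python
import numpy as np

def mae(original, reconstructed):
    """Mean absolute error between the original and reconstructed images:
    MAE = (1/N) * sum over pixels (i, j) of |I_o(i, j) - I_r(i, j)|."""
    a = np.asarray(original, dtype=float)
    b = np.asarray(reconstructed, dtype=float)
    return np.abs(a - b).mean()
```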
Figure 4c shows the effect of the number of layers: one convolution layer with the same architecture is inserted after the second encoder layer, and one convolution layer with the same architecture is inserted before the first decoder layer (seven-layer CAE). In Figure 4c, with the additional layers, the loss of information is evident. Another typical loss function is a mean squared error (MSE). Figure 4d was trained using the MSE. Figures 4b and 4d are similar. In detail, the overall image in Figure 4d is jaggy, and the outlines of the aggregates are dissolved. The reason may be that the MAE promotes sparsity; a sharp crack line is enhanced by a sparse loss. Considering the performances, the five layers and MAE are adopted.
The number of filters and the filter sizes are also important parameters. These parameters should be large enough for extracting image features. The parameters were selected by confirming that the loss value converges. The numbers of filters of the first encoder layer and the second decoder layer are smaller than those of the second encoder layer and the first decoder layer. This avoids a sudden decrease in the information carried by the maps. Regarding other typical DL techniques, rectified linear unit (ReLU) activation functions are adopted, and batch normalization is used in the encoder part to output stable results while avoiding saturation.
There are few research works applying existing DL architectures as autoencoders because those architectures are optimized for classification or segmentation tasks. Advanced CAEs with higher compression and reconstruction ability are future work; they may further improve the results of this study. The encoder of Figure 3 instead adopts fundamental spatial filters, which automatically learn to extract the features of pavement surface images.

Comparison of unsupervised methods
The authors first tried to develop completely unsupervised crack detection algorithms connecting unsupervised feature extractors with unsupervised clustering algorithms. Afterward, clusters were interpreted based on the clustered images. This strategy naturally integrates the features of crack and non-crack images. The unsupervised feature extractors discussed in this section were combined with the SOMs discussed in the following section. However, the classification accuracy of these classifiers was lower than 60%.
The next strategy is to feed the extracted features into a supervised SVM with a nonlinear kernel function. SVMs were trained on the features and labels of crack and non-crack images to compare the performances of the developed feature extractors. The outputs of the feature extractors, including the encoder shown in Figure 3, were flattened to construct feature vectors. An SVM needs less data than DL methods because a small set of support vectors defines the classification criteria. The adopted kernel function is a radial basis function, which constructs a nonlinear dividing plane. For reproducibility, the other parameters are as follows: 2% of the vectors were removed as outliers, the regularization factor was 0.5, and a logistic function was fitted to output the probability of cracks.
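The SVM stage described above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions: an RBF kernel, the regularization factor of 0.5, and Platt scaling (`probability=True`, a logistic function fitted to the decision values) for the probability output. The toy blobs merely stand in for the flattened CAE feature vectors, and the 2% outlier removal step is omitted.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy stand-in for flattened CAE feature vectors of crack / non-crack patches.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

# The RBF kernel constructs a nonlinear dividing plane; probability=True fits
# a logistic (sigmoid) function so the classifier outputs crack probabilities.
clf = SVC(kernel="rbf", C=0.5, probability=True, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:1])[0]   # [P(class 0), P(class 1)], sums to 1
accuracy = clf.score(X, y)
```

In practice, `X` would hold the flattened encoder outputs and `y` the crack/non-crack labels; the hyperparameters above mirror the values stated in the text.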
For comparison, the HOG feature as a typical image feature value, the Gabor filter as a typical frequency filter, the Hough transform after edge detection by the Canny method as a typical image processing method, and simple average pooling were considered (Dalal & Triggs, 2005; Zalama et al., 2014). Beyond these, an extensive comparison of image processing techniques, such as the Fourier transform and the speeded-up robust features (SURF) feature, was conducted and yielded similar results. The Gabor filter is a set of 2D filters that emphasizes certain frequency components and is related to the Fourier transform. The HOG feature is closely related to the SURF feature. Therefore, the Gabor filter and the HOG feature are compared as representatives of frequency analysis and image feature value methods.
For reproducibility, the parameters of the feature extractors are as follows: the cell size of the HOG feature is 32 by 32; the wavelengths of the Gabor filters are 10, 20, and 30 pixels and the orientations are 0, 30, 60, 90, 120, and 150 degrees, with the filtered images downsampled to one-fourth; the threshold for the Canny method is 0.7, and the resolutions of rho and theta are 2 pixels and 2 degrees, respectively. Average pooling divides each image into four 32-by-32-pixel blocks and calculates the average of each block in each RGB channel. This is the most primitive downsampling approach. The dimension of the original images is too large to be acceptable for SVMs; the dimensions of all the features are comparable to that of the encoder.
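As a concrete reference, the average pooling feature can be reproduced in a few lines of NumPy. The 64-by-64-pixel RGB input size is an assumption implied by the four 32-by-32 blocks; the other extractors (HOG, Gabor, Canny + Hough) are omitted here.

```python
import numpy as np

def average_pooling_feature(img):
    """Divide a 64x64 RGB image into four 32x32 blocks and take the
    per-channel mean of each block, giving a 4 x 3 = 12-dim vector."""
    assert img.shape == (64, 64, 3)
    blocks = [img[r:r + 32, c:c + 32] for r in (0, 32) for c in (0, 32)]
    return np.array([b.mean(axis=(0, 1)) for b in blocks]).ravel()

# Synthetic image: bright top-left block in the R channel, all else dark.
img = np.zeros((64, 64, 3))
img[:32, :32, 0] = 1.0
feat = average_pooling_feature(img)   # only the first component is nonzero
```

A dark crack crossing a block lowers that block's averages, which is why even this primitive feature carries some crack information, as the SVM comparison suggests.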
Table 2 summarizes the SVM classification results. The proposed encoder achieved the highest accuracy, 92%, and f-measure, 69%. The accuracies of the HOG feature, the Gabor filter, and Hough + Canny are lower than that of the CAE by 20%, 10%, and 26%, respectively; the f-measures decreased by 20%, 12%, and 26%. The extent of the decrease may depend on datasets. For the considered dataset, these features are not appropriate for extracting thin cracks from noisy asphalt pavement patterns and are even inferior to simple average pooling, which considers the dark areas of the cracks and their combinations. On the other hand, the CAE effectively extracted the spatial features of the cracks and pavement patterns; its accuracy and f-measure are higher than those of average pooling by 6%. The conclusion is that the CAE is the most effective unsupervised feature extraction method for the cracks.
However, the performance of the best CAE does not reach that of state-of-the-art deep CNNs, as will be discussed in Section 7.1. This is because the supervised DL architectures with TL accurately learn crack features and construct precise classification criteria inside the networks. However, supervised CNNs may ignore the details of the non-crack images during the training process. Therefore, the proposed method adopts an SOM to construct finer non-crack categories that motivate CNNs to learn the features of the non-crack images, utilizing the advantages of both unsupervised and supervised methods.
It is not possible to quantitatively discuss the effect of misclassified pseudo-labels by unsupervised methods on CNNs because the true structure of the non-crack image data is unknown.

TABLE 2 Comparison of the performances of the feature extractors combined with nonlinear support vector machines (SVMs) on Crack500.

Theories
The SOM was proposed by Kohonen (1990). A typical map is a hexagonal honeycomb structure, and a weight vector is assigned to each cell of the structure. An input feature vector is classified into the cell whose weight vector best matches the input. After the matching, the corresponding weight vector and the weight vectors of cells close to the corresponding cell are updated to increase the matching coefficient. This is a mutual optimization process. After repeated classification of the input vectors and updates of the weight vectors, similar input vectors are stored in close regions. The distribution of the grouped input vectors in the map visualizes the structure of the data, and the distances between neighboring weight vectors imply the similarity of neighboring cells.
The two important points of an SOM are how to select appropriate feature vectors and the map size. The discussions of the previous section concern appropriate feature vectors. Regarding the map size, the numbers of vectors per cell become too small or too large when the size is too large or too small, making it difficult to find a meaningful structure in the map. The map size was therefore automatically determined, and the cells were automatically clustered, following the methodology and parameters provided by Vesanto and Alhoniemi (2000), which picks up important eigenvalues by setting a threshold. A MATLAB library provided by the research group was utilized (Helsinki University of Technology, the Laboratory of Computer and Information Science, 2023).
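The matching and neighborhood update described above can be sketched in NumPy. This is a deliberately minimal rectangular-grid SOM (the study uses a hexagonal map via the MATLAB SOM Toolbox); the grid size, learning rate, neighborhood width, and toy two-cluster data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, rows=4, cols=4, epochs=200, lr=0.5, sigma=1.0):
    """Minimal rectangular SOM: find the best-matching cell for a sample,
    then pull that cell's weight vector and its neighbors toward the input."""
    dim = data.shape[1]
    w = rng.random((rows, cols, dim))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for _ in range(epochs):
        x = data[rng.integers(len(data))]
        # Best-matching unit: the cell whose weight is closest to the input.
        d = np.linalg.norm(w - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # Gaussian neighborhood: nearby cells are updated more strongly.
        g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2)
                   / (2 * sigma ** 2))
        w += lr * g[..., None] * (x - w)
    return w

def quantization_error(data, w):
    """Mean distance from each sample to its nearest weight vector."""
    flat = w.reshape(-1, w.shape[-1])
    return np.mean([np.linalg.norm(flat - x, axis=1).min() for x in data])

# Toy stand-in for CAE feature vectors: two clusters of non-crack features.
data = np.vstack([rng.normal(0.2, 0.05, (50, 3)),
                  rng.normal(0.8, 0.05, (50, 3))])
w_random = rng.random((4, 4, 3))      # untrained baseline
w_trained = train_som(data)           # weights migrate toward the clusters
```

After training, similar inputs map to nearby cells and the quantization error drops well below that of the random baseline, which is the "mutual optimization" the text describes.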

Mapping results
Figure 5a shows the SOM constructed by inputting the feature vectors of the non-crack images of Crack500 extracted by the developed CAE. The estimated map size was (28, 11).
The numbers are the numbers of classified vectors in each cell. To reduce calculation time, the number of feature vectors was reduced to one-tenth; calculation time was then 2-3 h. Without the reduction, the calculation would have taken several days and was infeasible. It was confirmed that, with a certain number of vectors, the size and color (the distribution of distances between cells, U-mat) of the SOM converged. The colors inside the cells correspond to the averages of the distances to the neighboring cells. Note that the size, distribution, and color were automatically estimated during SOM development.
Figure 5a suggests the structure of the non-crack images. Some cells include 40-50 vectors, while others include 5-10 vectors; similar vectors are grouped into the same cells. As a qualitative observation, two salient cells are visible in Figure 5a, salient 1 with 42 vectors and salient 2 with 51 vectors, because the two cells contain large numbers of images and are surrounded by yellow areas. These features were manually extracted, along with other features, for discussion purposes.
For labeling, the SOM in Figure 5a was further automatically clustered following the previous research work, considering the distances between the weight vectors (Vesanto & Alhoniemi, 2000; Figure 5b). Common features are thereby grouped. To address various features, the number of clusters should be large; however, a large number results in less training data per category. The numbers of non-crack images are six to seven times larger than those of the crack images, as shown in Table 1, and the largest clusters are twice as large as the other clusters. Considering this situation, five clusters were adopted.
The numbers in Figure 5b indicate cluster membership; five clusters are shown, from "background 1" to "background 5." As explained, the discrimination ability of the SOM is not as high as that of DL models. A crack versus non-crack classification model using the SOM was developed but showed low accuracy. The two likely reasons the SOM is less accurate are that CNNs consist of deep convolution layers, which have a high representation ability, and that CNNs can effectively learn crack features from crack labels. For future work, advanced autoencoders and unsupervised clustering methods may be developed in DL research fields; they have the potential to solve this problem of unsupervised learning.

Interpretation
An SOM is a black-box algorithm; which features are considered in the learning process is not known. In this section, the characteristics of pavement background images are qualitatively discussed with example images to facilitate unsupervised road surface analysis.
Figure 6 shows three images from each cluster of Figure 5, 15 images in total. The R channel images after preprocessing are shown; no difference was observed among the RGB channels. Comparing backgrounds 1 and 3 in Figure 6, the difference between the brighter images of background 1 and the darker images of background 3 is evident. The images of both clusters are relatively smooth. Some images of background 1 are white pavements with small black aggregates, and some images in background 3 are the opposite. However, other images in background 3, such as salient 2, show only black shadowed areas. These results may seem to contradict the fact that the RGB channels are normalized to eliminate the effect of lighting conditions. The reason may be that the SOM considers the contrast of the images; it may focus on the blurriness of edges in shadowed images. Background 2 in Figure 6 shows noisy pavement patterns with large aggregates; salient 1 belongs to these noisy patterns, which may be confusing for crack detection algorithms. Backgrounds 4 and 5 in Figure 6 are difficult to interpret. They appear to include finer textures and brighter and darker spot-like paints and dusts. These patterns are less noisy than background 2 and of medium intensity compared to backgrounds 1 and 3. No difference between the two categories was found; the categories do not necessarily follow human intuition. However, they are expected to share common characteristic features because the SOM is based on the spatial image features of the CAE. This expectation is supported by the fact that CNNs differentiate the two categories, as explained in the following section. Road features such as manhole covers and white lines are not included in the Crack500 and CFD datasets. If the areas of such features are large enough, they are automatically considered in the SOM, which may be one advantage of the proposed method.
Figure 7 displays the maps of the five clusters for four example images of Crack500, from which the targeted features can be inferred. Crack #1 is a smooth asphalt pavement. The left half is recognized as the bright and smooth patterns of background 1. The right bottom area is dark because of a shadow and black dusts and is classified as background 3. The transition region is classified as backgrounds 4 and 5 because some areas are dusted while others are not. Crack #4 has moderate brightness with finer textures. The majority of Crack #6 is an alligator crack, though smooth and fine patterns are also observed. Crack #88 is a complicated pavement pattern, and the corresponding map is messy: the majority of the map consists of the noisy areas of background 2 and the bright areas of background 1, and some larger black aggregates, black paints, and dusts are classified as backgrounds 4 and 5. The differentiation of the latter two is not clear, but they are at least recognized as confusing features.
Table 3 summarizes the numbers of images in the five clusters of Figure 5 and their characteristic features. The imbalanced numbers suggest the structure of the non-crack image data. From the previous discussions, brightness, texture smoothness, and spot-like features of aggregates, paints, and dusts may be considered in the SOM. This is a completely automatic process; using different datasets, the SOM may focus on different features.

Configuration of the training
To demonstrate the effect of the SOM-based method, the two common datasets, Crack500 and CFD, were used. The images were cropped to produce crack and non-crack two-category datasets. Common DL architectures trained on the two-category datasets (previous) were compared with those trained by the SOM-based method (proposed), which increases the number of non-crack categories. In the proposed method, non-crack images are classified into backgrounds 1-5 by the developed preprocessing, CAE, and SOM algorithms. The images in the largest non-crack category are twice as many as those in the smallest non-crack category; to avoid this imbalance among the non-crack categories, weight factors were applied to each category as explained. The same two-category test datasets were used for evaluating the previous and proposed methods. The training was accuracy-based. For evaluation purposes, the five non-crack categories were merged into the category "non-crack." Accuracy, precision, recall, and f-measure were compared, along with inference time; for reference, training time is also discussed. To confirm the stability of the training results, the training was conducted three times and the indices were averaged. The variations of the four indices were at most 1%; therefore, a difference larger than 1% is significant.
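The per-category weighting against imbalance can be illustrated with inverse-frequency factors. The category counts below are hypothetical placeholders, not the actual counts of Table 3, and the exact weighting scheme of the study is not specified beyond "weight factors applied to each category"; this is one common choice.

```python
# Hypothetical per-category image counts (placeholders, not Table 3 values).
counts = {"crack": 300, "background 1": 400, "background 2": 350,
          "background 3": 300, "background 4": 250, "background 5": 200}

total = sum(counts.values())          # 1800 images in this toy example
n_cat = len(counts)                   # 6 categories
# weight_i = total / (n_cat * count_i): the rarer a category, the larger
# its weight, so each category contributes comparably to the training loss.
weights = {k: total / (n_cat * v) for k, v in counts.items()}
ratio = weights["background 5"] / weights["background 1"]   # 400/200 = 2.0
```

With the largest non-crack category twice the size of the smallest, as stated above, this scheme gives the smallest category exactly twice the weight of the largest.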
A custom CNN (custom-net) designed by the author and three common DL architectures, VGG16, DenseNet121, and MobileNetV3(-large), were compared with and without the proposed method (Howard et al., 2019; G. Huang et al., 2017; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015). Representative networks were selected considering the characteristics of architectures along the history of deep CNNs: a simple custom network, a deep plain network (VGG16), a densely connected network (DenseNet121), and a light-weight network (MobileNetV3). Each model has its own advantages.
As shown in Figure 8a, the custom-net consists of two convolution layers and one fully connected layer. ReLU and softmax activation functions, max pooling with strides, and dropout are adopted. The N perceptrons of the final fully connected layer correspond to the N categories: for the previous method, N = 2, and for the proposed method, N = 6, as explained in Figure 1. The numbers of convolution filters (N1, N2) are the important parameters and were optimized for each case. Figure 8b shows the three DL architectures. The convolution layers of the existing architectures were connected with two fully connected layers; the numbers of layers and perceptrons are large enough, as confirmed by the accuracy convergence. The same activation functions were used, and the outputs are N-category probabilities. The TL adopted here imports the structures and learned parameters of the existing convolution architectures as initial values and adds and trains fully connected layers to adapt to the target classification task. ImageNet parameters were used as initial values, and the two fully connected layers were trained from scratch. The adopted training method was stochastic gradient descent with a learning rate of 0.01, and early stopping was adopted after confirming the convergence of the categorical cross-entropy loss.

7.2 Results of Crack500

Compared with the nonlinear SVM, the accuracy and f-measure of the CAE are higher than those of the custom CNN and lower than those of the custom-CNN-SOM. The CAE learns features from non-crack images and thereby exceeds the performance of the custom two-category CNN; however, because supervised deep CNNs are highly effective, the performance of the custom-CNN-SOM is higher still. The numbers of convolution filters of the optimized custom CNN are (N1, N2) = (15, 8) and those of the custom-CNN-SOM are (45, 15). The numbers increased because the model must differentiate five background categories; however, this does not necessarily mean that the additional filters contribute to the discrimination between crack and non-crack images.

Classification results
Comparing the custom CNN and the custom-CNN-SOM, the improvement is significant relative to previous DL research works: accuracy increased by 4% and f-measure by 5%. The features of the non-crack images were introduced to the custom CNN by the proposed SOM framework. A characteristic property is that the CNN models are high-recall models, while the three SOM-based models are high-precision models; the differences between precision and recall were decreased by the proposed method. The CNNs caused many false detection cases, which is unavoidable when test data are imbalanced. The SOM-based CNNs adopt five categories for non-crack images, which decreases the effect of the imbalance problem; this effect is a by-product. The accuracy and f-measure demonstrate the improvement of the performances of the models. For a practical application, the 0.57 precision of the custom CNN and the 0.62 recall of the DenseNet121 are too low and may not be acceptable.
Comparing the custom CNN and the other three DL architectures without the SOM, the VGG16 shows the highest accuracy and f-measure. VGG16 has 16 layers and about 16 million trainable parameters; DenseNet121 has 121 layers and 9 million parameters; MobileNetV3(-large) has 27 convolution layers and 5 million parameters. The latest architectures adopt deep convolutional layers with relatively small filter sizes and numbers of parameters, incorporating special structures such as dense blocks (DenseNet121) and inverted residual blocks (MobileNetV3). From the results, the accuracy of the VGG16 increased by 6% and the f-measure by 15%, compared to the custom CNN, because more abstract features were obtained by the deep architecture. The accuracy and f-measure of the MobileNetV3 are not as high as those of the VGG16; however, its inference time is one-third of that of the VGG16. If computational resources are limited or application to real-time video data is considered, MobileNetV3 may be the first choice. These conclusions agree with those of previous DL research works.
Comparing the four CNN architectures with and without the proposed method, the performances significantly improved for all the architectures. The accuracy increased by 4% and the f-measure by 8% in the case of the DenseNet121; the accuracy increased by 3% and the f-measure by 6% in the case of the MobileNetV3. Considering the inference time and the improved performance, 95% accuracy and 82% f-measure, the MobileNetV3 is a strong candidate. The improvement of the VGG16-SOM is not as large as those of the other architectures but is still significant: the accuracy improved by 1% and the f-measure by 3%, compared to the previous VGG16. The proposed VGG16-SOM achieved the highest performance, 96% accuracy and 84% f-measure. The prepared datasets as such were not compared in previous research works; nevertheless, these performances are top-level among previous DL road crack detection research works. The high performance of the VGG16-SOM stems from the deep architecture of VGG16 with imported parameters optimized for general image classification tasks and from the introduction of the information of the non-crack images by the proposed SOM framework. Table 4 lists the averages of the indices.

TABLE 5 Standard deviations of the indices in the case of the custom CNN.

Statistics           Accuracy (-)  Precision (-)  Recall (-)  F-measure (-)
Standard deviation   0.005         0.007          0.006       0.005

The proposed method does not increase inference time; the increase was lower than 0.001 s per image. In deep architectures, the latter part of the convolution layers and the former part of the fully connected layers account for most of the parameters; therefore, increasing the number of categories did not affect inference time. Training took 10-50 epochs and 5 to 30 min on a GPU, depending on the architecture. The three existing architectures needed fewer epochs because of the TL effect, and for the same reason there was little increase in training time when adopting the proposed method.
Table 5 lists the standard deviations of the indices in the case of the custom CNN. The training was conducted three times, and the averages are shown in Table 4. The standard deviations were lower than 1%, indicating that a 1% difference is significant.
Summarizing the above discussions, the proposed method improved the performances of the existing DL architectures and the custom CNN for the road crack detection task. The highest accuracy, 96%, and f-measure, 84%, were achieved by the proposed VGG16-SOM; the accuracy improved by 1% and the f-measure by 3%, compared to the previous VGG16. The MobileNetV3-SOM is also attractive considering that its inference time is one-third of that of the VGG16-SOM, with 95% accuracy and 82% f-measure.

Confusion matrix
A confusion matrix for the developed VGG16-SOM is shown in Table 6 to discuss the characteristics of the false detection and missed crack cases. The precision and recall do not correspond to the results shown in Table 4 because the five non-crack categories were merged and the threshold for the crack probability was applied afterward.
From the matrix, the errors are distributed. The developed VGG16-SOM clearly distinguished the five non-crack categories; it regarded the differences among them as evident as those between the crack and non-crack categories. The number of cracks falsely detected as background 1 was smaller than those for backgrounds 2, 4, and 5, possibly because background 1 is brighter than the other non-crack categories. The errors between backgrounds 1 and 3 were zero, which may be because background 1 is bright and background 3 is dark, whereas the errors between backgrounds 1 and 4 were large. In a practical sense, mistakes among the non-crack categories are not important because the crack or non-crack decision is what matters.
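The category-merging step described above can be made concrete with NumPy. The 6-by-6 matrix below is a hypothetical example (not the values of Table 6), with row/column 0 as "crack" and rows/columns 1-5 as backgrounds 1-5; merging collapses it to the 2-by-2 crack versus non-crack matrix used for precision and recall.

```python
import numpy as np

# Hypothetical 6x6 confusion matrix (rows = true class, cols = predicted).
cm = np.array([
    [90,  2,  3,  1,  2,  2],   # crack
    [ 1, 50,  0,  4,  1,  0],   # background 1
    [ 3,  1, 45,  0,  2,  1],   # background 2
    [ 1,  0,  0, 48,  2,  1],   # background 3
    [ 2,  3,  1,  1, 40,  5],   # background 4
    [ 2,  0,  1,  1,  6, 42],   # background 5
])

# Merge the five background rows/columns into a single "non-crack" entry.
merged = np.array([
    [cm[0, 0],        cm[0, 1:].sum()],    # [TP, FN]
    [cm[1:, 0].sum(), cm[1:, 1:].sum()],   # [FP, TN]
])

tp, fn = merged[0]
fp, tn = merged[1]
precision = tp / (tp + fp)   # crack predictions that are truly cracks
recall = tp / (tp + fn)      # true cracks that were detected
```

Note that mistakes among the five background rows/columns vanish into the TN entry after merging, which is why, as the text says, they do not affect the practical crack/non-crack evaluation.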

Crack detection results
Detection results using the developed VGG16 (previous model) and VGG16-SOM (proposed model) are compared in Figure 9 and the continued figures on the next page. Images were divided into 128-by-128-pixel patches, and each patch was classified as crack or non-crack. The boundaries of the images were handled by adjusting the positions of the patches.
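The patch tiling can be sketched as follows. The boundary adjustment is interpreted here as shifting the last patch back so that it ends exactly at the image border (an assumption; the text does not detail the adjustment), and the example image size is illustrative.

```python
def patch_origins(length, patch=128):
    """Top-left coordinates of patches covering `length` pixels.
    The last patch is shifted back so it ends exactly at the border,
    overlapping its neighbor instead of spilling past the edge.
    Assumes length >= patch."""
    origins = list(range(0, length - patch + 1, patch))
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins

# Illustrative image size; every pixel is covered by exactly one pass
# of non-overlapping tiles plus boundary-adjusted edge tiles.
h, w = 360, 640
tiles = [(r, c) for r in patch_origins(h) for c in patch_origins(w)]
```

For a 360-by-640 image this yields 3 x 5 = 15 patches, with the bottom row of patches starting at row 232 so that 232 + 128 = 360.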
There was a tendency that the proposed model showed a higher probability of cracks than the previous model.

Crack #1 in Figure 9 is an easy case for humans: one crack on a smooth pavement surface. However, even images such as crack #1 are not easy for automatic detection algorithms. The research proposing this dataset and following works indicated that image processing and some DL algorithms falsely detected spots, dusts, and aggregates and missed thin cracks (Shi et al., 2016; Yang et al., 2020). The previous and proposed models adopted in this study successfully detected the crack without false detections. In most cases, the two models showed similar results because even the previous model performed well, as shown in Table 4. One point that should be remembered is that all four cases in Figure 9 were output by the same CNN model with the same parameters. Table 4 demonstrated that the SOM-incorporated CNNs show higher performances than the existing CNN architectures on the whole dataset. The model is effective for both easier and more complicated cases because of its robustness to background noise, considering the contexts of the non-crack images.

Results of CFD
The number of CFD images is about one-third of that of Crack500, as shown in Table 1. Furthermore, the pavement surfaces of CFD are more homogeneous than those of Crack500, so the effect of the proposed method could be marginal; however, the same tendency as for Crack500 was observed. Images that are clear to humans are sometimes not easy for automatic detection algorithms. Detection results are shown in Figure 10, comparing the previous VGG16 and proposed VGG16-SOM models. The CFD images include relatively smooth pavement surfaces with simple crack patterns, compared to the Crack500 cases. Crack #95 is a crack with moderate thickness; the previous and proposed models both successfully detected it. Crack #112 contains paints and dusts on a pavement surface around a crack. The previous model falsely detected some of the paint and dust as cracks, while the proposed model detected none of them and maintained the shape of the detected crack. Crack #98 is a thin crack compared to Crack #95. In some cases, the proposed model detected thin cracks that the previous model missed, indicating that the sensitivity of the proposed model to thin cracks is not inferior to that of the previous model. The proposed model prioritizes reducing false detections by considering the features of the non-crack images.

LIMITATIONS AND DISCUSSIONS
The logics, results, and remaining problems are discussed in this section to show future research directions. In the proposed method, fully connected layers were added after the last convolution layer of the existing DL architectures in the TL phase. These fully connected layers effectively utilized the features output from the convolution layers by connecting the flattened feature maps with the perceptrons. Therefore, the fully connected layers are not a minor part of the architectures. For example, the number of parameters in the first convolution layer of VGG16 is three channels of 64 filters with a 3-by-3 size, that is, 1728 weights, while the first fully connected layer has 100 perceptrons with 8192 inputs each, that is, 819,200 weights. However, the convolution filters are not minor either, because they extract meaningful spatial features, as discussed in Weiss et al. (2016).
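The parameter counts quoted above can be verified directly (weight counts only; bias terms are omitted, as in the text):

```python
def conv_weights(in_channels, filters, kernel):
    """Weight count of a convolution layer, biases omitted."""
    return in_channels * filters * kernel * kernel

def fc_weights(inputs, perceptrons):
    """Weight count of a fully connected layer, biases omitted."""
    return inputs * perceptrons

conv1 = conv_weights(3, 64, 3)    # first conv layer of VGG16: 1728 weights
fc1 = fc_weights(8192, 100)       # first added FC layer: 819,200 weights
```

The roughly 470-fold difference illustrates why the added fully connected layers are "not minor" in the transfer-learned architectures.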
Summarizing the results, the proposed method improved the performances of all the existing DL architectures and was validated on both the Crack500 and CFD datasets. The highest performance was achieved by the proposed VGG16-SOM model; considering inference time, the proposed MobileNetV3-SOM is also attractive. The proposed models show high precision, while the previous models exhibit many false detection cases. The proposed models accurately detect thin cracks in noisy pavement surfaces while maintaining sensitivity to cracks.
Which features the proposed models focus on is not clear, which is a drawback of DL approaches. Two factors may contribute to the improvements. The first factor is related to the features obtained by the convolutional layers. The proposed SOM framework drives the convolution filters to learn non-crack image features by augmenting the background categories. The previous models may learn some features of the non-crack images, but only implicitly; the proposed method learns these features explicitly to minimize the loss function. VGG16 has many parameters, while the other three architectures have fewer parameters to be tuned. The improvements of the VGG16-based models are the same as or smaller than those of the other models because the imported parameters may already include some features of the non-crack images. The second factor is related to the fully connected layers and the final output of the crack probability. Fully connected layers assemble features and judge categories based on the criteria they construct. Some non-crack categories may be confusing, and the fully connected layers may apply different correction coefficients to the probabilities of the five background categories. The confusion matrix implies that there are abundant salient features not only in crack images but also in the large set of non-crack images, which should be considered for accurate road crack detection.
Regarding remaining problems, the recall of the proposed models is lower than that of the previous models. If the target is detecting as many cracks as possible while allowing false detections, the previous models are useful. There are long stretches of road surface yet to be inspected; for screening purposes, the proposed models accurately evaluate the conditions of the road surfaces. Road sections with low ratings may then be prioritized and further analyzed by accurate segmentation algorithms that are sensitive to thin cracks.
This study does not adopt the AUC because the SOM-based models are multicategory CNNs; the precision-recall curves showed complicated and distorted shapes in some probability ranges. Therefore, the probability thresholds were fixed to compare the indices listed in Tables 4 and 7 and to show the crack detection results in Figures 9 and 10. This nature does not limit the applicability of the proposed method in practical situations.
There are many advanced DL architectures optimized for road crack detection, as discussed in the literature review section. For example, attention mechanisms and dynamic learning techniques may help efficiently learn subtle crack features. Again, it should be emphasized that such techniques can be adopted in parallel with the proposed method because its approach differs from the previous research works: the proposed method can be applied simply by increasing the number of categories. The method of this study incorporates the information of non-crack images to improve the performances of existing architectures. Road crack segmentation is an important task, and segmentation algorithms can also be adopted in parallel by applying them to the regions detected by the proposed method.
The performance of the proposed method may be limited by the inaccuracy of the SOM pseudo-labeling, though this inaccuracy cannot be quantified. The latest CAE and SOM algorithms may mitigate the problem (Kohonen et al., 1996; Vesanto & Alhoniemi, 2000). The two common road crack datasets were used to validate the algorithm, and the number of non-crack images is limited. There are many road features and obstacles in actual road surface images, such as manhole covers, joints, road signs, and fallen objects. Increasing the number of non-crack images may further improve the crack detection performances.
The algorithm has not been validated on, but may be applicable to, cracks in concrete and steel structures. However, concrete and steel surfaces are relatively smooth compared to road surfaces, and the detection of thin cracks is the main problem there; therefore, the algorithm may not have a large advantage. Non-destructive techniques such as radar and seismic wave imaging are other possible applications of the proposed framework (Yamaguchi et al., 2019).

CONCLUSION
A novel method was proposed for accurately classifying road crack images based on common DL architectures. Non-crack images are preprocessed, their features are extracted by the developed CAE, and the features are clustered by the SOM. Abundant features were observed in the non-crack images, yielding a characteristic SOM. The number of non-crack categories was augmented by referring to the developed SOM, and modified DL architectures were trained on the crack and pseudo-labeled non-crack images. The DL architectures were thereby driven to learn non-crack features, increasing their ability to discriminate cracks from non-crack features. The proposed method was validated on the common Crack500 and CFD datasets. The accuracy of the modified DLs increased by 1%-4% and the f-measure by 3%-8%, compared to the previous models; these improvements are significant. The modified VGG16-SOM achieved the highest performance, 96% accuracy and 84%-85% f-measure. The crack regions in the two datasets were accurately detected with fewer false detections, and thin cracks were also detected, indicating sensitivity to thin cracks. Top-level classification results and realistic crack maps were provided by the proposed models for both datasets.

Possible directions of future work include the application to large-scale road surface imaging data with large numbers of non-crack images. Cracks in damaged sections may be further segmented by accurate segmentation algorithms. Advanced architectures such as attention mechanisms may be incorporated into the proposed method to further improve the performances. The algorithm may also be applicable to cracks in concrete and steel structure surfaces and to images obtained by non-destructive techniques such as radar and seismic wave methods.
The effectiveness of the proposed method was demonstrated by training existing deep CNN architectures. In terms of recent dynamic learning methods, Rafiei and Adeli (2017) proposed a method to automatically reorganize effective features in NNs. Pereira et al. (2020) proposed a fast method for developing a classification algorithm inspired by the finite element method. Alam et al. (2020) proposed an automatic method for selecting a set of NNs for ensemble learning. Rafiei et al. (2023) proposed self-supervised learning robust to limited training data of EEG records. These methods may facilitate the training process and enhance the performances of the trained architectures, which remains as future work.
segmented cracks on slab tracks adopting a network with skip connections and estimated crack sizes. Rosso et al. (2023) compared deep CNNs using the Fourier transform as a preprocessing algorithm to detect anomalies in tunnel lining sensing data. Recent advancements in the integration of computer vision, augmented reality, and DL have produced a new research field. Malek et al. (2023) developed an interface to connect inspectors with automatic crack detection results. Meng et al. (2023) developed a real-time crack detection system using a drone. Jang et al. (2021) adopted an encoder-decoder segmentation network and developed a climbing robot to detect cracks in concrete bridge piers. Structural health monitoring is a traditional research field; however, by adopting ML and DL techniques, monitoring and control systems are advancing rapidly. Pan et al. (2023) applied deep CNNs for tracking targets in videos to monitor vibrations of multistory buildings. Javadinasab Hormozabad et al. (2021) proposed a system integrating structural control, health monitoring, and energy harvesting. Pezeshki et al. (2023a, 2023b) proposed modal analysis and structural health monitoring methods for an offshore wind turbine. Oh et al. (2023) presented a measured-data-based strain estimation technique for building monitoring using a CNN. Urdiales et al. (2023) combined a Kalman filter and DL algorithms to track multiple objects in videos. Rafiei et al. (2017) proposed a deep restricted Boltzmann machine for estimating concrete compressive strength. Hassanpour et al. (2019) proposed generative DL for motor imagery classification using electroencephalography (EEG) signals. Martins et al. (2020) reviewed recent recommendation systems that adopt DL techniques. In terms of training and data augmentation methods for cracks, Y. Zhang and Yuen (2021) proposed dynamic learning, which adaptively modifies CNN architectures to reduce training time. Żarski et al. (2022) proposed an efficient training and architecture optimization framework based on pruning and TL. Çelik and König (2022) proposed a copy-edit-paste TL methodology that transfers the geometries of cracks to other concrete background images to improve the performances of CNN architectures. Their work is related to this study in terms of background image utilization; however, their objective is complementing small datasets by TL. Training methods are not the target of this study, though they may be combined with it to facilitate the training process and further improve the performances of CNNs.
and Correia (2013) calculated the mean and standard deviation of image patches as feature values to cluster them, detect cracks, and identify the directions and widths of cracks. Zalama et al. (2014) extracted crack features by Gabor filters and detected cracks by combining weak classifiers using AdaBoost. Yang Liu and Gao (2022) introduced a baseline model of the visual characteristics of images based on Gaussian convolutions as an index for detecting concrete cracks. Rodriguez-Lozano et al. (2023) proposed an accumulated-pixel-value method for fast crack classification. These works are thought-provoking and related to this study, though the trend toward DL indicates that manually designed feature values are problematic in terms of robustness to different image datasets. Unsupervised crack feature extraction methods are compared in this study.
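The mean/standard-deviation patch features used by these classical detectors are easy to reproduce. The sketch below (synthetic data; the patch size and pixel values are illustrative assumptions, not settings from the cited works) computes such features in numpy:

```python
import numpy as np

def patch_features(img, patch=16):
    """Mean/std features of non-overlapping patches, in the spirit of
    classical crack detectors that cluster simple patch statistics."""
    h, w = img.shape
    feats = [(img[i:i + patch, j:j + patch].mean(),
              img[i:i + patch, j:j + patch].std())
             for i in range(0, h - patch + 1, patch)
             for j in range(0, w - patch + 1, patch)]
    return np.array(feats)

rng = np.random.default_rng(0)
img = rng.uniform(0.6, 0.9, (64, 64))  # bright pavement-like background
img[30:34, :] = 0.1                    # dark band standing in for a crack
feats = patch_features(img)
print(feats.shape)  # (16, 2): a 4 x 4 grid of 16 x 16 patches
```

Patches crossed by the dark band show a lower mean and a much higher standard deviation, which is exactly the separation these hand-crafted features rely on.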
versity campus were photographed by cell phones. CFD reflects typical urban roadway surface conditions captured by a smartphone. Crack500 includes 500 images of about 1500 by 2500 pixels. CFD includes 118 images of 480 by 320 pixels and contains shadows, oil spots, and water stains. As the authors observe, Crack500 is the larger dataset, with various pavement patterns, shadows, marks, and dirt. The resolutions of the CFD images are relatively low; therefore, to apply common CNN architectures, the CFD images were upsampled four times. The training and test images of Crack500 were used. The CFD images were divided into training and test images at a ratio of about 4:1; cracks #1-#94 were used for training and #95-#118 for testing.
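The four-times upsampling of the CFD images and the cropping into fixed-size patches can be sketched as follows. Nearest-neighbour interpolation is an assumption (the paper only states the images were upsampled four times), and the 128 by 128 patch size comes from Table 1:

```python
import numpy as np

def upsample4x(gray):
    # nearest-neighbour 4x upsampling (assumed interpolation method)
    return np.kron(gray, np.ones((4, 4), dtype=gray.dtype))

def crop_patches(img, size=128):
    """Crop non-overlapping size x size patches from a grayscale image."""
    h, w = img.shape
    return [img[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

cfd = np.zeros((320, 480), dtype=np.float32)  # a CFD frame: 480 by 320 pixels
big = upsample4x(cfd)                         # -> 1280 by 1920
patches = crop_patches(big)                   # 10 x 15 = 150 patches of 128 x 128
print(big.shape, len(patches))
```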

FIGURE 2 Proposed preprocessing steps. (a) Original RGB color pavement image with a crack. (b) R channel of the image. (c) After the Wiener filter and normalization. The image was effectively smoothed by the Wiener filter while preserving the crack shape.

compressed features should be as similar as possible to the input image to confirm that the extracted features are effective. The CAE consists of encoder and decoder parts. The optimized encoder corresponds to a feature extractor. The following symmetric-architecture decoder layers are also trained to decompress the feature maps and reconstruct the image.
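A minimal sketch of the Figure 2 preprocessing, assuming SciPy's Wiener filter and min-max normalization; the paper does not specify the filter window, so `mysize=5` is an assumption:

```python
import numpy as np
from scipy.signal import wiener

rng = np.random.default_rng(1)
rgb = rng.uniform(0.5, 0.8, (128, 128, 3))  # synthetic pavement patch
rgb[60:63, :, :] = 0.1                      # synthetic dark crack line

r = rgb[:, :, 0]                            # (b) keep the R channel only
smoothed = wiener(r, mysize=5)              # (c) adaptive noise smoothing
norm = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min())
print(norm.shape, float(norm.min()), float(norm.max()))
```

The Wiener filter adapts its smoothing to the local variance, which is why the noisy background is flattened while the sharp, low-intensity crack band survives.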

FIGURE 3 Proposed CAE architecture, consisting of an encoder and a decoder. Input images of 128 by 128 pixels with three channels are compressed by the encoder into 15 feature maps of size 32 by 32. The decoder decompresses the feature maps to output images that reconstruct the inputs. The optimized encoder serves as a feature extractor.
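The compression of the Figure 3 architecture can be checked with a shape-level sketch. Average pooling and random 1 x 1 projections stand in for the learned convolutions, so this only illustrates the 128 x 128 x 3 to 32 x 32 x 15 encoding and the symmetric decoding, not the trained CAE:

```python
import numpy as np

rng = np.random.default_rng(0)

def pool2(x):       # 2x2 average pooling (halves height and width)
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def up2(x):         # 2x nearest-neighbour upsampling
    return np.kron(x, np.ones((2, 2, 1)))

def conv1x1(x, w):  # 1x1 convolution = per-pixel channel projection
    return np.tensordot(x, w, axes=([2], [0]))

W_enc = rng.normal(size=(3, 15))   # random stand-in weights
W_dec = rng.normal(size=(15, 3))

img = rng.uniform(size=(128, 128, 3))
feat = conv1x1(pool2(pool2(img)), W_enc)   # encoder output: 32 x 32 x 15
recon = conv1x1(up2(up2(feat)), W_dec)     # decoder output: 128 x 128 x 3
print(feat.shape, recon.shape)
```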
is blurred. The MAE of Figure 4b is 0.018 and that of Figure 4d is 0.056. By adding

FIGURE 4 Comparison of the original image and the images reconstructed by the CAE. (a) Original image. (b) Image reconstructed by the proposed CAE using the mean absolute error (MAE). (c) Image reconstructed by the seven-layer CAE. (d) Image reconstructed using the MSE. The proposed CAE using the MAE shows the clearest image.
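The sharper MAE reconstruction is consistent with the shape of the two loss gradients: the MSE gradient scales with the residual and nearly vanishes on small residuals such as fine crack edges, while the MAE gradient keeps full strength:

```python
import numpy as np

err = np.array([0.01, 0.1, 0.5])  # per-pixel reconstruction errors
grad_mse = 2 * err                # MSE gradient shrinks with the residual
grad_mae = np.sign(err)           # MAE gradient stays constant

print(grad_mse)  # [0.02 0.2  1.  ]
print(grad_mae)  # [1. 1. 1.]
```

Small residuals are therefore corrected 50 times more strongly under MAE in this example, which is one informal reason MAE-trained autoencoders tend to preserve fine detail.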

FIGURE 5 Constructed SOM. The map size of 11 by 28 was automatically estimated. (a) Map with the number of images stored in each cell. The distances between the weight vectors of neighboring cells (U-matrix) are shown by a color map. Two salient features, surrounded by yellow areas, are observed. Some features are concentrated and show unique distributions. (b) Clustered background categories. The number of categories is five, and the five map areas are delineated. In Figure 5b, salient 1 is merged into background 2 and salient 2 into background 3. The top-left area of background 2 and the bottom-right area of background 3 are yellow, implying that these two clusters include characteristic features; the other three clusters are less characteristic. The map suggests that the non-crack images are rich in features. If more non-crack images were collected, the number of clusters might increase.
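A minimal SOM can be written in a few lines. The grid size, learning rate, and neighborhood schedule below are illustrative assumptions, not the settings behind the 11 by 28 map of Figure 5:

```python
import numpy as np

def train_som(data, rows, cols, iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal rectangular SOM trained by on-line winner-take-most updates."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(size=(rows, cols, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1).astype(float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        d = ((w - x) ** 2).sum(axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)   # best matching unit
        frac = t / iters
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 0.5             # shrinking neighborhood
        h = np.exp(-((grid - np.array(bmu)) ** 2).sum(axis=2) / (2 * sigma**2))
        w += lr * h[..., None] * (x - w)
    return w

rng = np.random.default_rng(1)
# two synthetic "background feature" clusters
data = np.vstack([rng.normal(0.2, 0.05, (100, 4)),
                  rng.normal(0.8, 0.05, (100, 4))])
som = train_som(data, 5, 7)
print(som.shape)  # (5, 7, 4)
```

After training, distances between neighboring weight vectors give the U-matrix of Figure 5a, and clustering the weight vectors gives the background categories of Figure 5b.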

FIGURE 6 Example images of the five clusters in Figure 5. R-channel images after preprocessing are shown. The top three images belong to background 1, the next to background 2, and so on. Salient features 1 and 2 of backgrounds 2 and 3 in Figure 5 are also shown.

FIGURE 7 Maps of the five clusters of Figure 5 for the four example images of Crack500. The first and third rows show the images of Cracks #1, #4, #6, and #88. The second and fourth rows show the distributions of the clusters; the crack regions are eliminated, and the colors correspond to the background categories. The four images show different maps with characteristic tendencies.

FIGURE 8 Adopted deep learning (DL) model architectures, with novel points in red. (a) Custom-net, a simple model with two convolution layers and one fully connected layer. (b) Common DL architectures. The convolution layers of VGG16, DenseNet121, and MobileNetV3 were imported with their ImageNet parameters as initial values. Two fully connected layers were appended and trained to adapt to the target problem.
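The modification in Figure 8b, a pretrained convolutional base followed by a newly trained two-layer fully connected head over six categories (one crack plus five background categories), can be sketched in numpy. A fixed random projection stands in for the ImageNet-pretrained base, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_feat, d_hid, n_cls = 300, 64, 32, 6  # 6 = crack + 5 background classes

# frozen "convolutional base": a fixed random projection stands in for the
# ImageNet-pretrained VGG/DenseNet/MobileNet feature extractor
W_frozen = rng.normal(size=(16, d_feat))
y = rng.integers(0, n_cls, n)
x = np.maximum(rng.normal(size=(n, 16)) @ W_frozen, 0)    # frozen features
x += np.eye(n_cls)[y] @ rng.normal(size=(n_cls, d_feat))  # separable signal
x = (x - x.mean(0)) / (x.std(0) + 1e-9)

# trainable head: two fully connected layers adapted to the task
W1 = rng.normal(scale=0.1, size=(d_feat, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(d_hid, n_cls)); b2 = np.zeros(n_cls)
onehot = np.eye(n_cls)[y]

losses = []
for _ in range(300):                       # full-batch gradient descent
    h = np.maximum(x @ W1 + b1, 0)
    z = h @ W2 + b2
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    losses.append(-np.log(p[np.arange(n), y] + 1e-12).mean())
    g = (p - onehot) / n                   # softmax cross-entropy gradient
    gh = (g @ W2.T) * (h > 0)
    W2 -= 0.1 * (h.T @ g); b2 -= 0.1 * g.sum(0)
    W1 -= 0.1 * (x.T @ gh); b1 -= 0.1 * gh.sum(0)

print(losses[0] > losses[-1])  # only the head is updated; the base is frozen
```

Only `W1`, `b1`, `W2`, and `b2` are updated, mirroring the paper's scheme of importing frozen ImageNet convolution weights as initial values and training the appended fully connected layers.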

TABLE 5 Standard deviations of the indices in the case of the custom CNN running three training cases. The custom CNN shown in Table

Therefore, the threshold was set to 0.95 for the previous model and to 0.99 for the proposed model. The top figures show the images. The second figures from the top are the manually annotated crack pixels. The second figures from the bottom show the previous model, and the bottom figures show the proposed model.

FIGURE Crack detection results on Crack500, continued on the next page. The top figures show the images. The second figures from the top are the ground truth crack pixels. The third and bottom figures are the prediction results of the previous VGG16 and the proposed VGG16-SOM models, respectively. Thresholds were applied to the crack probabilities to detect crack patches, shown as black areas.

FIGURE Crack detection results on the CrackForest database. The second-from-bottom and bottom figures are the previous VGG16 and the proposed VGG16-SOM, respectively. Black areas are detected crack patches. Thresholds are applied, and the boundary regions are also processed.

TABLE 1 Prepared crack and non-crack image datasets from the two common datasets. Cropped image sizes and numbers of training and test images for the crack and non-crack categories are listed.

Dataset | Image size (pixel by pixel) | Training image number | Test image number
Crack500 | 128 by 128 | Crack: 4311, non-crack: 23321 | Crack: 3921, non-crack: 23765
CrackForest database (CFD) | 128 by 128 | Crack: 1893, non-crack: 12207 | Crack: 531, non-crack: 3069

The authors extensively investigated training methods for imbalanced datasets, such as setting weight factors for each category, a loss function based on the f-measure, and sampling background images to adjust the numbers of images in different categories. However, the achieved classification accuracy did not differ among the training methods; the differences were smaller than the fluctuation of validation accuracy in the training process, about 1%. Therefore, weight factors were applied according to the ratio of the numbers of crack and non-crack images in every case for comparison. Without the weight factors, in the case of two-category classification, training fails or the achieved accuracies are low. In the case of the proposed method, the imbalance problem did not occur because the number of non-crack categories is increased. Categorical cross-entropy is used as the loss function; of course, the numbers of categories in the loss functions differ between the existing and proposed CNNs.

TABLE 3 Image numbers in clusters and interpretation of clusters.
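The class weight factors and weighted loss described above can be sketched as follows, using the Crack500 training counts from Table 1. The inverse-frequency formula is an assumption; the paper only states that the weights follow the ratio of the crack and non-crack image numbers:

```python
import numpy as np

counts = np.array([4311, 23321])   # Crack500 training: crack vs. non-crack
# inverse-frequency weighting (assumed formula): total / (n_classes * count)
weights = counts.sum() / (len(counts) * counts)
print(weights.round(3))            # the rare crack class gets the larger weight

def weighted_cross_entropy(p, y, w):
    """Categorical cross-entropy with per-class weight factors."""
    return -(w[y] * np.log(p[np.arange(len(y)), y] + 1e-12)).mean()

p = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3]])         # predicted class probabilities
y = np.array([0, 1, 0])            # true labels (0 = crack)
print(round(weighted_cross_entropy(p, y, weights), 4))
```

With this weighting, the total contribution of each class to the loss is equalized, which is why two-category training stops collapsing onto the dominant non-crack class.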

TABLE 4 Comparison of the performances of the four deep learning (DL) architectures and their combinations with the proposed method on Crack500. Evaluation indices and inference time are listed.

Table 4 summarizes the comparison of the previous and proposed (suffix -SOM) methods with the four DL architectures, a total of eight cases. Evaluation indices and inference time per image are listed. As explained, comparing the custom CNN, CNN-SOM, and CAE with a

TABLE 6 Confusion matrix of the VGG16-SOM. The six categories, recall, precision, and accuracy (the sum of the diagonal terms) are shown. Note: The precision and recall do not correspond to Table 4 because the five non-crack categories were merged and the threshold for the crack probability was applied afterward.
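The recall, precision, and accuracy reported with the confusion matrix can be computed as below; the toy labels are illustrative, not the paper's results:

```python
import numpy as np

def metrics(y_true, y_pred, n_cls):
    """Confusion matrix plus per-class recall/precision/f1 and accuracy."""
    cm = np.zeros((n_cls, n_cls), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)          # rows: true, cols: predicted
    recall = cm.diagonal() / cm.sum(axis=1)
    precision = cm.diagonal() / cm.sum(axis=0)
    accuracy = cm.diagonal().sum() / cm.sum()   # sum of the diagonal terms
    f1 = 2 * precision * recall / (precision + recall)
    return cm, recall, precision, accuracy, f1

# toy 2-class example after merging: class 0 = crack, class 1 = non-crack
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])
cm, rec, prec, acc, f1 = metrics(y_true, y_pred, 2)
print(acc)      # 0.8
print(rec[0])   # crack recall: 3/4 = 0.75
print(prec[0])  # crack precision: 3/4 = 0.75
```

The same function with `n_cls=6` reproduces the six-category layout of the confusion matrix; merging the five non-crack columns and rows gives the two-class indices discussed in the note.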

TABLE 7 Comparison of the DL architectures and proposed method on CFD. Evaluation indices and inference time are shown.

Crack #4: The crack detection abilities of the two models are similar. However, the previous model falsely detected the top-right part of the pavement as cracks, while the proposed model did not show any false detection. This false detection is possibly caused by the noisy small black aggregates in the top-right part and the relatively dark right side of Crack #4. Even the latest deep CNN architectures sometimes output false detections when trained to detect thin crack patterns. The proposed model persistently suppresses such misleading features, as shown by the results in this section. Crack #6 is a smooth pavement surface with an alligator crack. Both models successfully detected the complicated cracks even though the cracks have various widths. Crack #88 shows a noisy surface of large aggregates with a bifurcated crack. The previous model falsely detected many large black aggregates as cracks, while the proposed model drastically reduced the false detections and preserved the shapes of the cracks. The results are similar for the other CNN architectures compared in Table 4, such as DenseNet121-SOM. The conclusion is that the proposed model outputs more accurate results by considering the textures of the non-crack images.

Table 7 summarizes the classification results. Inference time was the same as that of Crack500. Training time was halved,