Identification of Martensite Bands in Dual‐Phase Steels: A Deep Learning Object Detection Approach Using Faster Region‐Based‐Convolutional Neural Network

Martensite banding in dual-phase steels is an important research topic in the field of materials design, since it strongly affects the local damage properties of the material. Therefore, it is necessary to quantify the amount and the geometrical details of the bands in a specific microstructure, for example, for simulative approaches. A convolutional neural network is trained on manually labeled scanning electron microscopy images of DP800 steel, and a subsequent effort is made to transfer these results to statistical quantities for the generation of representative volume elements (RVEs). As exact geometric definitions of martensite bands in 2D are difficult, the influence of different band definitions is investigated. The result of the training shows good prediction accuracy but is strongly dependent on the chosen band definition and the underlying human bias from the labeling process. A statistical analysis using cross-validation shows that reliable results can already be achieved with small datasets of only around 50–100 training images due to the transfer learning approach. This is an important outcome, as it eliminates the need to generate a large dataset, which could only be obtained from time-consuming microscopy work and manual labeling of the images.

DOI: 10.1002/srin.202200836
which are then used to generate input to build so-called statistical RVEs (S-RVEs), which rely on averaged quantities (e.g., as used by Liu et al. [13] and Vajaragupta et al. [14]). While models built from the real microstructure might be slightly more precise in displaying the exact morphology of the constituents, a statistical description of the microstructure makes it easier to vary specific features of the microstructure. This is indispensable for the use of RVEs for microstructure design and the quantification of the effects of single constituents of the microstructure.
Various open-source software packages can be used to generate RVEs (so-called RVE generators). These include DREAM.3D, [15] Neper, [16] or Kanapy, [17] which utilize different approaches to build the microstructure model and map different properties. Recently, an RVE generator has been developed in the authors' group, which uses discrete representations of ellipsoids to represent grains. For more details, the reader is referred to Henrich et al. [18]. In this approach, the focus was put on the correct geometrical description of the microstructural constituents. Therefore, ellipsoids were chosen as a mapping for the grains; they are generally better suited as a representation of the grains than spheres, since geometrically preferred directions (e.g., due to rolling processes) can be described better. In order to map the distributions of the necessary grain parameters (e.g., the semiaxes and inclination angles of the ellipsoids), a log-normal or Gamma distribution can be fitted to the individual data series. [19] However, previous studies have shown that there are interdependencies between the individual grain parameters that cannot be represented by univariate distribution functions. Therefore, a so-called Wasserstein generative adversarial network is used in the RVE generator, which is able to serve as a multivariate distribution function for synthetic grains. More details about this approach have been described in previous studies. [20,21] This approach is suitable for ellipsoidal or sphere-like structures such as ferrite grains, martensite islands, voids, inclusions, etc., but not for more complicated morphologies such as martensite or pearlite bands.
Martensite banding, especially in DP steels, is an active research topic, since bands of a secondary hard phase lead to numerous changes in the local properties, most of which influence the local properties in a negative way. [22,23] However, to our knowledge, there is no method to describe the individual bands of a microstructure quantitatively while taking into account the geometric features of the bands. While there are some approaches to quantify the amount of banding in a microstructure, [24,25] this description is not sufficient for the representation in S-RVEs: only if bands and islands can be described independently of each other and of the rest of the microstructure can the individual effect of the bands on the microstructure and the mechanical properties be analyzed. However, this leads to the question of how exactly bands and other representations of the same phase (e.g., martensite in the form of islands) can be distinguished. In dual-phase steels, where the microstructure has a generally elongated structure due to the rolling process, this question is not trivial to answer and requires time-consuming, manual classification by experts for the respective material.
One way to accelerate the solution of such a problem is to use machine learning methods, especially from the field of supervised learning. Supervised learning is already widely and successfully applied in the field of material characterization, for example, for the classification of damage mechanisms, [26,27] grain structures, [28] or microstructural constituents. [29,30] The advantage is that, after experts have manually created a dataset for training, the model can automatically process new data without human intervention.
In this study, we present a machine learning approach for the automatic detection and localization of martensite bands in scanning electron microscopy (SEM) images from a DP800 steel and the ensuing transfer to statistical quantities for RVE generation. For this purpose, a state-of-the-art object detection algorithm, the Faster region-based convolutional neural network (Faster R-CNN), is implemented and used. We show that this algorithm is able to learn and subsequently apply different definitions of martensite bands even when only relatively little data is available. The article is structured as follows. Section 2 describes two different strategies to separate the bands from the islands (so-called labeling) to generate training data. In Section 3, the network architecture used is explained. In Section 4, the results for the two different definitions are presented. Section 5 contains a discussion and comparison of the results. In Section 6, we subsequently show how the detected bands are transferred into a statistical representation for use in RVEs and which peculiarities result from the varying definitions. Thus, we provide an end-to-end approach for data generation for RVEs, ranging from band detection to the statistical representation.

Labeling Approach for Martensite Bands
The material examined in this study is a commercial dual-phase steel DP800. The material has a martensite phase fraction of around 32% and shows a significant martensite banding of the microstructure (oriented along the rolling direction (RD)). Other phases such as bainite play only a very minor role for the important properties of this specific steel and are therefore omitted from this study. The yield stress is 460 MPa, the ultimate tensile strength is around 823 MPa, and the fracture strain is 21%. There is some scatter in the fracture strain for tensile loading, with fracture strains of 19.84%, 19.6%, and 23.5% for three tensile tests. In a previous study, the pronounced banding was found responsible for this scatter in the fracture strain. [6] Figure 1a shows a panoramic SEM image of the microstructure with a height of 320 μm and a width of 1153 μm. The data used in this study was originally generated for another study, in which the effect of local strains and stress triaxiality on the void nucleation and evolution of this DP800 was investigated. [31] The image was taken by means of an SEM Inlens detector, which allows a very clear delimitation of martensite and ferrite. The image was originally taken as a combination of several 100 μm × 100 μm images, which were then stitched together using an algorithm. The resolution of the original images corresponds to 3072 pixels per 100 μm. [31] The x-axis is oriented along the RD and the y-axis along the sheet normal (RD × SN). A closer look at an exemplary 50 μm × 50 μm excerpt (Figure 1b) from the panoramic SEM image reveals that the banding and the clear preferred direction along the RD come from elongated islands and more pronounced bands with partially blocky dimensions, extending over a larger area (in this example over the whole 50 μm).
Images taken at 90° to those shown in Figure 1 (transverse direction × SN) show a very similar microstructure. From this, it can be concluded that the (larger) bands are more likely to be 3D "plates", which have a greater extension in the RD and in the transverse direction.
For the construction of S-RVEs, the statistics of the elongated islands are captured using ellipses or ellipsoids, but these are unsuitable for the blocky bands/plates. The representation of a band as a centered oblate ellipsoid in the RVE cannot accurately reproduce the geometric shape of a band/plate; for example, enclosed ferrite grains cannot be represented in this way. In addition, a large ellipsoid would lead to a decrease in the thickness of the band from the center of the RVE to the edges. Therefore, the bands must first be separated from the islands in order to then be displayed in the RVE. Thus, it is necessary to define what is a band, and therefore needs special treatment in the RVE generation process, and what can be represented with the standard ellipsoids. To achieve this, different definitions were made to label the bands according to different criteria. With these, training data for the object detection approach described in the next section is generated. To our knowledge, there is neither a benchmark nor a "right" or "wrong" answer to the question "what is a martensite band?" In general, the answer to this question depends on the goal of the investigations (i.e., here, the accurate representation of the martensitic morphology in S-RVEs and the investigation of the influence of the bands on the damage behavior).
Hence, this study examines two different definitions of bands and tests and discusses them in terms of usability. The goal is to find out whether an object detection approach is capable of handling different definitions of martensite bands. The definitions are applied to 50 μm × 50 μm excerpts of the microstructure, with the size chosen according to the edge length of the RVE, which lies in the range of 30-60 μm.
During the labeling process, we mark an area as a band by drawing a rectangle (a so-called bounding box) around the area, enclosing the band. This was done using the "Image Labeler app" from the "Image Processing and Computer Vision Toolbox" in MATLAB. [32] The following two definitions were used to label all images: 1) Definition 1 (Def 1): A band is an elongated area of martensite that extends from the left side of the image to the right side of the image. The bands should have no or only very small breaks (≤10 pixels). Usually, the bands are parallel to the x-axis, but slight deviations (5°) are allowed. The height of the band does not need to be constant throughout the length, so the band may be very thin at some points. Bands also include blocks of martensite that have enclosed areas of ferrite, as well as curved or branched sequences. 2) Definition 2 (Def 2): As opposed to Def 1, bands can be shorter than the width of the image. Breaks in the band are allowed if the band-like structure remains "recognizable" for a human with the respective domain knowledge and experience. Contrary to the first definition, the bands should have an almost constant height and be parallel to the x-axis. The bands must be distinct from the rest of the martensite islands surrounding the band. Very thin band-like structures are not labeled as bands.
These definitions are not strict, mathematical definitions but are rather intended as guidelines for a human expert for the manual labeling of martensite bands. In doing so, a certain degree of human bias will automatically be introduced into the data. The effects of this will be discussed in Section 5. Additionally, both definitions implicitly assume that all images have the same size and are not scaled. Figure 2 shows the resulting bounding boxes for both definitions. Here, it can be seen that Def 1 focuses on the coherence of the martensite structure across the entire image width and all associated side arms. In contrast, Def 2 focuses more on the blockiness of the martensite. The constant thickness is more important than the length and possible side arms or branched sequences. In both definitions, the martensite content in the band area is significantly increased compared to the average value of the material. According to the definitions, the two structures above the big band at ≈1/3 of the height are labeled as bands according to Def 1 because they span the whole image, but not according to Def 2, since they do not show a constant height and are relatively thin.
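Although the labeling in this study was done manually, Def 1's continuity criterion (the band extends from the left to the right image border, with breaks of at most 10 pixels) could in principle be checked programmatically on a binarized excerpt. The following sketch is purely illustrative; the function name and the column-projection heuristic are our own assumptions, not part of the actual labeling workflow:

```python
def spans_full_width(strip, max_break=10):
    """Check Def 1's continuity criterion on a binarized image strip.

    strip: 2D list (rows x columns) of 0/1 martensite pixels inside a
    candidate bounding box. A column "contains" the band if any pixel
    in it is martensite; Def 1 allows gaps of at most `max_break`
    consecutive empty columns. (Illustrative heuristic only.)
    """
    n_cols = len(strip[0])
    # Project onto the x-axis: does each column contain martensite?
    occupied = [any(row[c] for row in strip) for c in range(n_cols)]
    # The band must touch both the left and the right image border.
    if not (occupied[0] and occupied[-1]):
        return False
    # Measure the longest run of empty columns (a "break").
    longest_gap, gap = 0, 0
    for col in occupied:
        gap = 0 if col else gap + 1
        longest_gap = max(longest_gap, gap)
    return longest_gap <= max_break
```

Def 2, by contrast, relies on human judgment ("recognizable" band-like structure) and resists such a simple formalization, which is precisely why a learned detector is attractive here.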
The panoramic SEM-Inlens image (Figure 1) is split into 138 excerpts. For Def 1, 97 bands were labeled in 69 of the 138 images. For Def 2, 51 bands were labeled in 40 of the 138 images; the remaining images without bands are not used. These data are used as the ground truth. The nearly doubled number of instances for Def 1 compared to Def 2 indicates that Def 2 is a stricter definition.
In addition to the SEM-Inlens images, labeling experiments were also performed on images from a standard SE detector. Here, the problem arises that martensite and ferrite are only well distinguishable on the basis of the grain boundaries, as there is no clear contrast between the phases. This complicates both the labeling and the automatic binarization, which is important for the separation of martensite and ferrite when generating RVE input data. Therefore, we stick to the Inlens images for all further analysis.

Theoretical Foundations
This article deals with the localization and classification of objects in an image (even though the classification is somewhat needless in our case, since there are only bands and no other components to be recognized). There are two main ways to do this: image segmentation and object detection. Currently, all state-of-the-art algorithms of both model classes are based on deep learning (DL) methods. When processing images or other grid-like data (e.g., videos or time series), a special form of multilayer neural network is used, the so-called CNN. For a detailed introduction to CNNs and the convolution operation, the reader is referred to Goodfellow et al. [33]. For image segmentation and object detection tasks, CNNs show superior results compared to many classical image analysis algorithms and recipes, making them the basis for most state-of-the-art image analysis methods.
Image segmentation for a binary classification task combines the pixels belonging to an object in an image or a volume, segmenting them from the background. A well-known network architecture for such tasks is the U-Net, developed by Ronneberger et al. [34]. In contrast, object detection deals with the localization and subsequent classification of objects in an image by drawing a bounding box around each object. [35] A disadvantage of object detection is that the exact shape of the object cannot be detected. In this work, however, an object detection approach was chosen, since the exact shape of the bands is not crucial for the representation in the RVE (e.g., due to the lower resolution of the RVE compared to the image), and complete masks with all shapes introduce additional, unwanted complexity into the labeling problem. For object detection algorithms, a distinction is made between two-stage models, which have separate models for extracting the so-called region proposals (RPs) (regions in an image where objects may or may not lie) and for classifying the objects, and one-stage models, which skip the RPs. [35] Popular one-stage architectures are presented in the studies by Redmon et al. [36] and Lin et al. [37]. The benefits of one-stage models are the simpler architecture and the faster prediction speed. [37] Since the prediction speed is not crucial for the detection of microstructural features (since, unlike, e.g., autonomous driving, this is not a real-time application), a two-stage architecture was chosen.
A popular family of two-stage algorithms is the R-CNN model family, where R-CNN stands for "region-based convolutional neural networks." The family consists of three successive models: the R-CNN from 2014, [38] the Fast R-CNN, [39] and the Faster R-CNN, [40] the latter two published in 2015.
The initial architecture is the R-CNN, proposed by Girshick et al. [38]. This architecture is divided into three parts. In the first step, a selective search (SS) algorithm is used to generate 2000 RPs for each image. An RP is a location in the image where an object may or may not lie. These proposals are resized to squares and subsequently fed to a CNN, which serves as a feature extractor and passes a feature vector with 4096 entries to a support vector machine algorithm, which classifies the objects and calculates the bounding box (given by the four corner coordinates). However, different problems arise from this architecture: first, it is extremely inefficient in terms of time and storage space. Since features are computed and stored for each RP, the R-CNN with very deep CNN architectures as a backbone (up to several hundred layers) takes a very long time for training and predictions. In addition, several hundred gigabytes of memory are required. [39] Furthermore, the training process consists of several steps (CNN training, support vector machine training for classification, and support vector machine training for bounding box regression), which are also inefficient and prohibit training in one single loop. [39] To deal with these drawbacks, the Fast R-CNN was developed. With this, the time and memory problem is contained by no longer computing a feature map for each of the 2000 RPs, but only once for the entire image. The RPs are then computed directly from the feature map. This saves both memory and time by reducing unnecessary computations. [39] The next design improvement, the Faster R-CNN (Figure 3), eliminates the last bottleneck, which is the generation of RPs using SS. This is illustrated by the fact that, if the time spent on the generation of RPs is ignored, Fast R-CNN already enables almost real-time object detection. [40] To get closer to true real-time object detection, the SS is replaced by the so-called RP network in the Faster R-CNN.
This RP network shares the convolutional layers with the detection network, making the generation of RPs very efficient. [40] Figure 3 makes clear that there is only one single fully convolutional layer in the Faster R-CNN. This CNN with shared computational layers can be trained with so-called approximate joint training. This training style can speed up the training by up to 50% compared to other training styles. [40] The Faster R-CNN is already being used in materials science, for example, for the detection of defects on the surface of steel sheets [41] or in weld seams. [42] However, these papers are more concerned with technical modifications to the structure of the Faster R-CNN and not with the effects of different labeling approaches on the statistical properties of the microstructural components to be detected. Here, an effort is made to use the Faster R-CNN in a setting where the properties to be detected are not exactly defined.

Implementation Details
For this study, the Faster R-CNN is implemented in Python using PyTorch version 1.11 and torchvision version 0.12. [43] With the PyTorch implementation of the model, we use a pretrained version of the network with a ResNet50 backbone (for more details on the ResNet neural network architecture for computer vision, the reader is referred to He et al. [44]). This significantly speeds up training and allows good predictions to be made from relatively few images. This is an advantage because the generation of good-quality SEM or EBSD images with reproducible properties in high quantities is much more time-consuming than, for example, the generation of a large number of simple finite-element simulations. The concept of fine-tuning only the weights of the most specialized layers of a pretrained model is called transfer learning and additionally accelerated our training process significantly. During training, the model predicts bounding boxes with a so-called confidence score. The bounding boxes are recorded if the confidence score is higher than 0.8. All bounding boxes are stored in the format (x_min, x_max, y_min, y_max), and the images were resized from 1600 × 1600 pixels to 512 × 512 pixels. The smaller size increases the training speed and prevents problems with the RAM during training. See Section 4 for a comparison of the results for different image sizes. As introduced in Section 2, two definitions were evaluated in this article. For both, the detector was trained for 50 epochs. The training was run on an NVIDIA V100-SXM2 GPU provided by the RWTH High Performance Computing Cluster and takes ≈75 min for the 69 images of Def 1 and 45 min for the 40 images of Def 2. To compensate for the small size of the training dataset, data augmentation techniques were applied to the images. For this, the Albumentations library [45] was used.
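Since the images are resized from 1600 × 1600 to 512 × 512 pixels, the stored (x_min, x_max, y_min, y_max) bounding boxes must be rescaled by the same factor. A minimal sketch of this bookkeeping step (the function name is our own):

```python
def scale_boxes(boxes, src=1600, dst=512):
    """Rescale (x_min, x_max, y_min, y_max) boxes when a square image
    is resized from src x src to dst x dst pixels (1600 -> 512 in this
    study). Coordinates scale linearly with the image edge length."""
    s = dst / src
    return [tuple(round(v * s) for v in box) for box in boxes]
```

The same transform applies in reverse (dst and src swapped) when predicted boxes are mapped back onto the original-resolution micrograph.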
We evaluated different augmentation techniques and list the best augmentations below (p = xy% denotes the percentage of training instances to which the augmentation was applied): 1) Flipping (p = 20%): flips the image vertically, horizontally, or both ways; 2) Affine transformation (p = 50%): applies translation, shearing, and rotation to the image (within defined boundaries); 3) MotionBlur (p = 20%): applies motion blurring to the image; and 4) Blur (p = 10%): blurs the image with a randomly sized kernel. The choice of augmentation methods and the respective parameter ranges resulted from the fact that the microscopy images were mainly aligned such that bands are always horizontal. Therefore, rotational transformation of the images by arbitrary angles is not physically reasonable and was excluded.
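Geometric augmentations such as flipping must transform the bounding boxes consistently with the image: for a horizontal flip of an image of width W, a box's x-range (x_min, x_max) maps to (W − x_max, W − x_min), while the y-range is unchanged. The Albumentations library performs such transforms internally when bounding-box parameters are configured; the following sketch (function name ours) only illustrates the underlying coordinate mapping:

```python
def hflip_box(box, width):
    """Mirror a (x_min, x_max, y_min, y_max) box for a horizontal
    image flip: the x-coordinates reflect about the image width and
    swap roles (the old right edge becomes the new left edge), while
    the y-range is unchanged."""
    x_min, x_max, y_min, y_max = box
    return (width - x_max, width - x_min, y_min, y_max)
```

A vertical flip works analogously on the y-coordinates; affine transforms additionally require transforming all four box corners and taking their axis-aligned hull.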
The original training dataset contains 69 images with 97 labeled bands (Def 1) and 40 images with 51 labeled bands (Def 2), which is, compared with the training datasets typically used in computer science applications, an extremely small number. In order to check whether the results strongly depend on the selected train-test split and whether overfitting might take place, a fivefold cross-validation was performed. In a k-fold cross-validation, the dataset is divided into k equally large subsets. Each of these k parts is then used once as a test dataset, while the remaining k − 1 parts are used as training data. Therefore, in a k-fold cross-validation, k trainings are performed and then statistically analyzed. Due to the pretrained backbone, this is possible in a reasonable time.
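The k-fold splitting described above can be sketched as follows (an illustrative implementation with names of our own choosing; in practice, any standard cross-validation utility does the same):

```python
def kfold_indices(n_samples, k=5):
    """Partition sample indices into k (nearly) equal folds and yield
    (train, test) index lists: each fold serves once as the test set,
    while the remaining k-1 folds form the training set, as in the
    fivefold cross-validation used in this study."""
    indices = list(range(n_samples))
    # Distribute any remainder over the first folds so that fold
    # sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test
```

For the 69 Def 1 images and k = 5, this yields four test folds of 14 images and one of 13, so each image is tested exactly once across the five trainings.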
Note that cross-validation is only used to detect certain problems in the training data; the final model is then trained and evaluated separately based on all available data.

Evaluation Metrics
To quantify the success of the trained models, the results are evaluated based on the "recall" and "precision" metrics. These are defined as

Recall = T+ / (T+ + F−)   (1)

Precision = T+ / (T+ + F+)   (2)

where T+ means true positive (band correctly located and classified as band), F+ is false positive (fraction of the image incorrectly defined as band by the model), and F− is false negative (missed to identify an image fraction as band). Another metric that is often used together with the recall is the specificity T− / (T− + F+). However, this cannot be clearly calculated here, as there is no definition of true negatives. Theoretically, any island or martensite clustering that is correctly identified as background (everything that is not a martensite band counts as background in this work, i.e., ferrite matrix, martensite islands, and possible pores) is considered a true negative, but it is difficult to draw the line here, so this value is not used in this study. A prediction is assumed to be true positive if the predicted and the ground truth bounding boxes match to at least some fractional value. For this purpose, the so-called Intersection-over-Union (IoU) threshold is used. It is defined according to Equation (3):

IoU = Area of Overlap / Area of Union   (3)
In this equation, the Area of Overlap is the common area between the predicted and the ground truth bounding box. The Area of Union is the area enclosed by both boxes together. In general, a true positive is recorded if the IoU is greater than 0.5. In this study, the detectors are evaluated at IoU thresholds of 0.5 and 0.75. The precision and recall scores are calculated for both thresholds, and the mean value over the test set is used to evaluate the detector. As an additional metric, we introduce the "accuracy", which is simply the mean of recall and precision.
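The evaluation described in this section, IoU-based matching followed by precision, recall, and their mean ("accuracy"), can be sketched as follows. The greedy one-to-one matching shown here is a simplification of typical detector evaluation (it ignores confidence-ordered matching), and the function names are our own:

```python
def iou(a, b):
    """Intersection-over-Union of two bounding boxes stored in the
    (x_min, x_max, y_min, y_max) format used in this study: the area
    of overlap divided by the area of union."""
    overlap_x = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    overlap_y = max(0, min(a[3], b[3]) - max(a[2], b[2]))
    overlap = overlap_x * overlap_y
    area = lambda box: (box[1] - box[0]) * (box[3] - box[2])
    union = area(a) + area(b) - overlap
    return overlap / union if union else 0.0

def detection_metrics(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predicted boxes to ground-truth boxes at a given
    IoU threshold and return (precision, recall, accuracy), where the
    accuracy is the mean of precision and recall as defined here."""
    matched, tp = set(), 0
    for pred in predictions:
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(pred, gt) >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(predictions) - tp   # predictions without a matching band
    fn = len(ground_truth) - tp  # labeled bands missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, (precision + recall) / 2
```

Raising the threshold from 0.5 to 0.75 leaves the matching logic unchanged but demands tighter box placement, which is why the metrics drop at the higher threshold in the results below.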

Results of the Cross-Validation for Def 1
The results for the fivefold cross-validation using training instances according to Def 1 are shown in Figure 4. Here, the accuracy (see Section 3.3) is shown for the respective validation part of the dataset, without data augmentation (right part, Figure 4b) and with data augmentation (left part, Figure 4a). The different lines indicate the different folds of the cross-validation, and the mean of all folds is shown with the dashed black line. It becomes clear that, although there is some scatter in the curves, all five folds, that is, all validation datasets, behave approximately in the same way. The shapes of the curves, as well as the final accuracy level, are approximately similar for the data with and without augmentation. The only outliers are two folds for the training without data augmentation, which tend to fall again during the training process. This also indicates that the dataset without data augmentation is statistically not (sufficiently) representative; hence, data augmentation is important in this case.
Slightly different results can be seen for the training parts of the dataset (Figure 5). When comparing the results without data augmentation (right sides of Figure 4 and 5), it becomes clear that without data augmentation the dataset is too small, so that the trained model suffers from overfitting. While the training curves increase constantly and are almost all at 100% accuracy from epoch 40 on, the validation curves reach only around 75% accuracy, with the already described drop for two particular folds. For the training with data augmentation (left parts of Figure 4 and 5), no such behavior is evident. All curves reach values around 80% accuracy, with slightly smaller oscillations in the training curves due to the higher number of samples. Table 1 shows the summarized results for the cross-validation. For a more complete analysis, we also analyzed the effect of different image resolutions on the prediction quality: the chosen sizes were 100 × 100 (meaning that 1 μm corresponds to 2 pixels), 512 × 512 (the standard size), and 1024 × 1024. From the results it can be deduced that even the very low resolution of 2 pixels per μm leads to good results. The image size of 512 × 512 has the best precision and accuracy, while the image size of 1024 × 1024 has the highest recall. These results show that an image size of 512 × 512 pixels yields the highest overall accuracy, so this size is used for further analysis, although the differences are minor.

Results of the Cross-Validation for Def 2
As in the previous section, the results of the cross-validation are shown for Def 2. These can be seen in Figure 7. The results are comparable to those for Def 1: there is only minor scatter between the different folds, and no unexpected progressions in the curves can be observed. For the results with the non-augmented data (Figure 7b), there is a sharp drop in accuracy for one of the folds, down to 60% accuracy. From both sets of curves, it becomes clear that the results are on average better than those for Def 1, with accuracy values oscillating around 90% rather than 80%. As described in the previous section, the model for Def 1 shows problems with overfitting. When comparing the accuracy values for the validation and training datasets without data augmentation for Def 2 (Figure 6b and 7b), a similar behavior is observed, with the values for the training sets being extremely close to 100% from epoch 30 on. However, the difference between the validation and the training values is less pronounced than for Def 1, indicating less problematic overfitting for Def 2. Table 2 shows the mean results for the cross-validation of Def 2. Similar results as for Def 1 can be observed. Again, the accuracy is highest at an image size of 512 × 512, but the differences are very small (less than one percentage point). Overall, the results shown in Tables 1 and 2 confirm the choice of an image size of 512 × 512 as a good one; no other image size clearly produces better results.

Results for the Final Model (Def 1 and 2)
Due to the evidence of overfitting, we discarded the results without data augmentation and consider only the results with data augmentation for a more detailed analysis of the two definitions. For the final model, the respective dataset is split into a 75%/25% train-test split. The resulting metrics for Def 1 are shown in Table 3. The results are calculated for IoU thresholds of 0.5 and 0.75, respectively. The accuracy values are in good agreement with the final values of the cross-validation curves in Figure 4a, indicating no unexpected abnormal behavior in the final model.
Both rows of Table 3 show that the recall for this model is higher than the precision, by about 12 percentage points for both thresholds. This indicates that the model generates too many false positives for this dataset but recognizes the majority of the correct bands (at least for an IoU threshold of 0.5); thus, the recall is relatively high. For the IoU threshold of 0.75, similar results are observed, with all three metrics around 14 percentage points lower than for an IoU threshold of 0.5. The precision of only around 55% indicates that there are nearly as many false positives as true positives at this threshold.
In Figure 8, some examples from the test dataset for Def 1 are shown. The ground truth bounding boxes are drawn in light blue, while the predicted boxes are drawn in orange. For a very pronounced banded structure with only small martensite islands, as in Figure 8a, the prediction is almost perfect. All metrics are 1, even at the higher IoU threshold of 0.75. For excerpts that are more indistinct in terms of the martensite morphology (Figure 8b-d), the prediction quality differs: for the excerpt in Figure 8b, the band in the upper half of the image is correctly identified at IoU 0.5 (at IoU 0.75, the deviation is too big). In addition, a big bounding box is predicted in the lower half of the image, enclosing a banded structure that was not labeled according to Def 1 because it does not span completely over the whole image. For the excerpts in Figure 8c,d, it is clear that the bounding boxes at the top of the images are basically correctly placed but incorrectly sized. In Figure 8c, the prediction is much bigger than the ground truth bounding box, enclosing more of the morphology in the upper 1/3 of the image than was labeled. In Figure 8d, the two ground truth boxes are wrongly divided by the prediction, leading to precision and recall of 0.0 in this excerpt. Additionally, there is a false positive in this excerpt, which is, similar to Figure 8b, a banded structure that does not extend over the whole image.
In order to allow a better comparison with the results of Def 1, we also trained the final model for Def 2 with data augmentation. This is supported by the fact that for Def 2 the results without data augmentation show slight signs of undesirable behavior, for example, the aforementioned sharp drop for one of the folds. For the final model, a 75%/25% train/test split is also applied. The results are displayed in Table 4. It becomes clear that the metrics are overall better for Def 2 than for Def 1 and that the model is able to produce very accurate predictions for the test dataset. The results are also well in line with the cross-validation curves in Figure 6a.

Comparison and Discussion
The results show that our proposed object detection approach is capable of separating martensite bands from the rest of the martensitic morphology, with accuracy values of around 75-100%. The cross-validation also shows that the number of samples in our datasets appears to be sufficient, and there is no evidence of major problems caused by the scarcity of the data (which would, for example, be indicated by strongly diverging curves for the folds in Figure 4 and 6). This is a big advantage of the chosen transfer learning procedure (using a pretrained ResNet in the Faster R-CNN), enabling us to work with a relatively small amount of data, which is further enhanced by data augmentation techniques. Given that images from costly electron microscopy are needed to distinguish martensite and ferrite properly, the small amount of data required is a clear advantage of our approach.
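The cross-validation setup underlying this assessment can be sketched in a few lines (the fold count and dataset size here are illustrative; the exact fold configuration of the study may differ). Each of the roughly 50-100 labeled images lands in exactly one validation fold, so every image contributes to both training and validation across the k runs:

```python
import random

def kfold_splits(image_ids, k=4, seed=0):
    """Yield (train, validation) id lists for k-fold cross-validation."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)        # fixed seed for reproducibility
    folds = [ids[i::k] for i in range(k)]   # round-robin fold assignment
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val

# Example: 100 labeled SEM images split into 4 folds of 25 images each
for train, val in kfold_splits(range(100), k=4):
    print(len(train), len(val))  # 75 25
```

Strongly diverging metric curves between such folds would signal that the dataset is too small or too inconsistent; the relatively homogeneous fold curves reported here suggest the opposite.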
However, the prediction quality differs strongly between the two definitions, even with the same configuration of hyperparameters (the chosen augmentations, number of epochs, batch size, etc.). While Def 2 is mapped nearly perfectly at around 100%, with no big differences between precision and recall, the model only achieves a prediction accuracy of around 75% with respect to Def 1. This model also shows a skew in the metrics toward recall over precision, indicating more false positives than false negatives. Additionally, the dropoff in the prediction metrics from an IoU threshold of 0.5 to one of 0.75 is higher for Def 1 (around 14 percentage points for all three metrics) than for Def 2 (10 percentage points for precision and 5 percentage points for recall). This shows that the boxes are overall more precisely placed for Def 2 than for Def 1.
Since these differences in the results can hardly be explained by "technical" reasons (the basic structure of the two final models is, as already mentioned, the same), they can most likely be attributed to some kind of human bias introduced into the data during the labeling process. As mentioned in Section 2, the definitions are only guidelines for a human expert to separate the bands from the rest of the microstructure, not strict and universally valid criteria. Therefore, it seems reasonable that all or some of the rules of Def 1 tend to introduce more inconsistencies into the respective dataset than Def 2. An additional hint toward these inconsistencies is the more pronounced overfitting for Def 1, observable in Figure 4 and 5. Although it is not clear which part of Def 1 introduces most of the inconsistency into the dataset, some of it can be traced back to the "continuity condition", the rule that the bands should span the whole image. This condition, although being the most objective criterion in both definitions, can be problematic, as shown in Figure 8b,d: in both images, a false positive is generated on a structure with a banded shape that does not span the whole image and was therefore not labeled according to Def 1. Evidently, the exclusion of such structures is not comprehensible, and therefore not learnable, for the model. Additional inconsistencies can be suspected from the box dimensions of the predictions in Figure 8c,d. In these images, the model finds the general position of the band(s) but not the correct dimensions or allocations of the bands.
In contrast to this, the final model for Def 2 does not show these kinds of inconsistencies. One reason may be that Def 2 is stricter than Def 1, which is also underlined by the fact that there are nearly twice as many bands for the first definition (97 vs 51). Thus, it seems that a rather objective criterion like the "continuity condition" is not necessarily an indicator of stricter labeling. In general, it can be concluded that the "blockiness" of the latter definition is an easier criterion to learn than the "continuity condition".

Generating Data for Representative Volume Elements
(Note: For more detailed information about the DRAGen-RVE generator, which is used for the creation of the RVE shown here, the reader is referred to ref. [18]. The general procedure of the band generation is already described in Pütz et al.,[22] and we briefly review it here and explain the link to the Faster R-CNN predictions.)
To display the detected bands in DRAGen-RVE, several additional processing steps have to be undertaken, which are schematically displayed in Figure 9. The procedure is as follows: in the first step, the trained model is used to predict the martensite bands in an arbitrary number of images. In the second step, the band(s) is/are cropped out of the images based on the bounding box predictions. A combined treatment of despeckling and small-hole filling is applied to these images to facilitate the statistical characterization of the bands; thresholds of 1024 and 2048 px are used for this. These values are motivated by the fact that the later RVE has a much lower resolution than the real image (a common resolution for a DRAGen-RVE is 2), and microstructural components (holes, small islands) with sizes of about 1 μm are not represented at the low resolution of the RVE. After these steps, an image is obtained which contains only the martensite band. To get a statistical representation of the bands, a method is used which is adapted from ASTM E1268-19, which deals with the "bandiness" of microstructural morphology.[24] In this method, lines are drawn in the y-direction (ND, sheet thickness direction) of the images, and the intercepts are counted and used for the thickness analysis of the bands. Figure 9c shows this procedure. The result of this analysis is a table containing the band thickness values based on the Faster R-CNN model, which can then be used for the creation of the RVE.
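The cleaning and intercept-counting steps can be sketched as follows. This is a simplified illustration using scipy: the pixel thresholds follow the values given in the text, while the function names and the default 4-connectivity of the component labeling are assumptions, not the study's actual implementation:

```python
import numpy as np
from scipy import ndimage

def drop_small(mask, limit):
    """Remove connected components smaller than `limit` pixels."""
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    return np.isin(labels, 1 + np.flatnonzero(sizes >= limit))

def clean_band_mask(mask, speckle_px=1024, hole_px=2048):
    """Despeckle small martensite islands, then fill small holes in the band."""
    cleaned = drop_small(mask.astype(bool), speckle_px)
    # Filling small holes = removing small *background* components, then inverting.
    return ~drop_small(~cleaned, hole_px)

def column_intercepts(mask, step=1):
    """ASTM E1268-style intercepts: run lengths of the band along each
    vertical scan line (ND, sheet thickness direction)."""
    lengths = []
    for x in range(0, mask.shape[1], step):
        col = np.concatenate(([0], mask[:, x].astype(np.int8), [0]))
        d = np.diff(col)
        lengths.extend(np.flatnonzero(d == -1) - np.flatnonzero(d == 1))
    return lengths
```

Multiplying the intercept lengths by the pixel size then yields the band thickness values in μm that enter the RVE input table.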
The different definitions lead to different statistical thickness distributions (see Section 2). The resulting distributions, displayed using a kernel density estimation (KDE), can be found in Figure 10. It can be seen that Def 1 skews the calculated band thicknesses to lower values than Def 2: the mean thickness for the bands calculated according to Def 1 is 2.96 μm, while the calculated mean thickness for Def 2 is 4.0 μm. Furthermore, Def 1 predicts more bands per 50 μm edge length on average, namely, 1.4 bands compared with 1.12 bands for Def 2. Another difference is the percentage of martensite inside the bounding boxes which contain the bands (for the whole microstructure, the martensite fraction is around 32%): this value is 44% for Def 1 and 54% for Def 2 (both calculated before the despeckling and hole filling operations). In general, these values indicate that Def 2 is the stricter definition, labeling bands that differ more clearly from the rest of the microstructure than those of Def 1. This was already discussed in Section 5 in relation to the results of the object detection.
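A KDE comparison of this kind takes only a few lines. The samples below are synthetic lognormal stand-ins chosen to mimic the reported means (an assumption purely for illustration); the real inputs are the intercept lengths from the analysis described above:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic thickness samples (in μm), standing in for measured intercepts.
def_1 = rng.lognormal(mean=np.log(2.8), sigma=0.35, size=300)
def_2 = rng.lognormal(mean=np.log(3.8), sigma=0.30, size=300)

grid = np.linspace(0.1, 15.0, 600)
for name, sample in (("Def 1", def_1), ("Def 2", def_2)):
    density = gaussian_kde(sample)(grid)
    print(f"{name}: mean {sample.mean():.2f} um, "
          f"density peak near {grid[np.argmax(density)]:.2f} um")
```

The KDE curves make the shift between the two definitions directly visible without binning artifacts, which is why they are a convenient display for the thickness distributions.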

Conclusions
In this article, a machine learning approach was developed to separate martensite bands from martensite islands in a DP800 microstructure. For this, we fine-tuned a Faster R-CNN object detector on SEM-Inlens images using two different criteria to define what a martensite band is. The main results can be summarized as follows. 1) A cross-validation shows that there are no major problems with the scarcity of the data. Data augmentation techniques help to tackle remaining issues and to contain tendencies of overfitting. 2) The results are highly dependent on the definition, which should be chosen with great care. The definition is best chosen based on the goal of the investigation; for example, if the bands are to be displayed in an RVE of a specific size, the required length of the bands should be chosen accordingly. In addition, there is a difference of about 35% on average in the statistically extracted band thickness. 3) The human bias is the biggest problem in this kind of task and also depends on the chosen definition.
In general, the results show that the proposed approach is capable of separating the martensite bands from the remaining microstructure according to arbitrary definitions. The approach presented here also allows the generation of statistical input data for RVE generation. In doing so, it enables a separate consideration of the bands and the rest of the microstructure and of the resulting influences. Currently available approaches only allow the general representation of microstructures with different degrees of bandiness. With our method, bands of different definitions can be classified and described, and their influence on mechanical properties can be investigated.
Based on this, further improvements can be made to strengthen the results of this study. On the one hand, labeling of the respective microstructure by several different experts can be tested. The goal would be to exclude "controversial" structures from the dataset and thereby contain the human bias which arises if one controversial structure is labeled as a band and another, similar structure is not. However, expert labeling should also be accompanied by some kind of definition to avoid highly inconsistent data arising if each expert labels according to his or her own definition.
Second, an attempt can be made to generalize the model, for example, by examining different DP steel microstructures. With this, perhaps clearer criteria can be deduced from the underlying data.
Finally, for the sake of simplicity, we concentrated only on 2D SEM images in this study. In doing so, we ignored the nature of the different bands in the third dimension (the transverse direction of the sheet). If the study is extended to 3D microstructural data, additional information can be deduced to improve the consistency of the labeling. However, obtaining large amounts of 3D microstructure data is very costly.