On the interaction between the search parameters and the nature of the search problems in search‐based model‐driven engineering

The use of search-based software engineering to address model-driven engineering activities (SBMDE) is becoming more popular. Many maintenance tasks can be reformulated as search problems, and, when those tasks are applied to software models, the search strategy has to retrieve a model fragment. There are no studies on the influence of the search parameters when applied to software models. This article evaluates the impact of different search parameter values on the performance of an evolutionary algorithm whose population is in the form of software models. Our study takes into account the nature of the model fragment location problems (MFLPs) to which the evolutionary algorithm is applied. The evaluation solves 1895 MFLPs (characterized through five measures that define MFLPs) from two industrial case studies and uses 625 different combinations of search parameter values. The results show that varying the population size, the replacement percentage, or the crossover rate produces changes of around 30% in performance. With regard to the nature of the problems, the size of the search space has the largest impact. Search parameter values and the nature of the MFLPs influence the performance when applying an evolutionary algorithm to perform fragment location on models. Search parameter values have a greater effect on precision values, and the nature of the MFLPs has a greater effect on recall values. Our results should raise awareness of the relevance of the search parameters and the nature of the problems for the SBMDE community.


INTRODUCTION
There is a growing body of research on the use of search-based software engineering (SBSE) to address model-driven engineering (MDE) activities. 1 In the intersection between MDE 2 and SBSE, 3 activities related to model maintenance are reformulated as search problems. 5,6 These works use search-based optimization techniques (mainly those from the evolutionary computation literature) to automate the search for optimal and near-optimal solutions. In this work, we apply search-based model-driven engineering (SBMDE) to address location problems in models. This activity aims to identify the parts of the models that are relevant to a specific task, and the result comes in the form of a model fragment. Model fragment location is one of the most important search problems in models. For example, in a model with 500 elements, the number of model fragments that can be generated can reach 10^29. 7 Since the search space is so large, it is not practical to exhaustively explore the space of possibilities.
In our previous works, 8,9 we proposed an approach for locating model fragments using an evolutionary algorithm. To achieve this, the evolutionary algorithm keeps a population of candidate solutions (in the form of model fragments) and evolves them using genetic operators that are designed to work with model fragments.
However, in current SBMDE practice, key search parameter values (population size, replacement percentage, mutation rate, and crossover rate) are selected by convention and ad hoc choices. Moreover, the default values used by the SBMDE community 1 are borrowed from other domains, and there are no specific studies about which search parameter values should be used when working with software models.
In addition, works on SBMDE provide many details about the techniques that are used to perform the search; however, there is a lack of detail about the nature of the problems used. Proper reporting of the nature of the problems is important so that the results obtained can be useful to other practitioners. Therefore, we characterize the nature of the problems by means of five measures that define model fragment location problems (MD-MFLP): size, volume, density, multiplicity, and dispersion. 10 In this work, we perform an evaluation to determine which search parameter has the highest impact on the performance of the search depending on the nature of the location problems. To do so, we apply the search strategy to 1895 model fragment location problems (MFLPs) from industrial case studies using different sets of search parameter values (625 different combinations). Then, we calculate the performance by means of precision, recall, and F-measure values, 11 comparing the results provided in each case with the oracle extracted from the case studies (considered the ground truth). Finally, we perform a statistical analysis of the results obtained in order to determine whether the different search parameter values applied and the nature of the MFLPs have an impact on the performance or whether the differences are obtained by mere chance.
The results show that the choice of good search parameters can provide a boost in performance (precision mean values range from 6.48% for non-optimal parameters up to 73.08% for optimal parameters, while mean recall values range from 62.02% up to 75.45%). Moreover, when varying the population size, replacement percentage, or crossover rate parameters, the differences in performance are around 30%, but, when varying the mutation rate parameter, the differences in performance remain below 5%. Our statistical analysis shows that the crossover rate parameter has the largest impact on our case studies on industrial models. In addition, the nature of the MFLPs also influences the precision and recall values. MFLPs with low values in the measures related to the search space achieve better performance, while MFLPs with high values in the measures related to the model fragment outperform those with low values. In this case, the size of the search space has the greatest impact. It is important to highlight that, while the search parameter values have a greater impact on precision values, the nature of the MFLPs has a greater impact on recall values.
The evaluation presented in this article is an initial work on the optimization that can be achieved when applying an evolutionary algorithm to perform fragment location on models. We want to make the SBMDE community aware of the relevance of the search parameters and of the nature of the case studies for SBMDE approaches. We present this evaluation with the hope that it can be useful for practitioners from the SBMDE community when looking for default values or advice on how to balance their search strategy to boost performance.
The rest of the article is organized as follows. Section 2 presents some background about model fragment location using an evolutionary algorithm. Section 3 shows the process that is followed to evaluate the impact of the different parameter values taking into account the measures for reporting MFLPs. Section 4 presents and discusses the results of the evaluation. Section 5 examines some related work. Finally, we conclude the article in Section 6.

MODEL FRAGMENT LOCATION USING AN EVOLUTIONARY ALGORITHM
In this section, we present the model fragment location problem based on the products of one of our industrial partners, the search strategy used to address the MFLPs, the measures that define MFLPs, and the search parameters of the evolutionary algorithm.

Model fragment location problem (MFLP)
Figure 1 shows the domain-specific language (DSL) used by one of our industrial partners, BSH, to formalize its products. This DSL is used to describe the models of the induction hobs (IHs) that will be part of the evaluation. The firmware of the BSH products is generated from the DSL models. The DSL used by our industrial partner to specify the induction hobs (IHDSL) is composed of 46 metaclasses, 47 relationships among them, and more than 180 properties. For legibility reasons and due to intellectual property rights concerns, in this section, we show a simplified subset of the IHDSL (see Figure 1, IHDSL metamodel). However, the evaluation was performed using the full IHDSL that is used in BSH.
The parent model in Figure 1 depicts an example of a model that is specified with the IHDSL. In this example, the IH aggregates a 150-power inverter that is connected to a single power manager through a provider channel. The power manager is connected to a double inductor through two consumer channels. The dotted lines show the metamodel concepts that are related to the model. The bottom part shows the encoding of the model. This encoding will be used to define model fragments on the parent model, which is explained in the next subsection.

Search strategy: An evolutionary algorithm for MFLP
To develop an evolutionary algorithm, the following elements must be defined: (1) an encoding of the problem that can be used to represent the individuals; (2) a fitness function that can be used to evaluate how good each individual is as a solution to the problem; and (3) a set of genetic operators that can be applied to modify and evolve the population of individuals.
Figure 2 shows an overview of the evolutionary algorithm used. The left part shows the inputs for the approach: an MFLP description and a set of product models. The center shows a simplified representation of the main steps. The "initialize population" step calculates an initial population of model fragments from the input set of product models.

[Figure 1 excerpt — textual description of the "Double hotplate" MFLP: group of two inductors that can work in conjunction to heat the cookware. Each hotplate is controlled by a power level that is then translated to different power outputs for each inductor depending on their size and position. Inductors are activated depending on the detection of cookware.]

The "genetic operations" step produces a new generation of model fragments. Finally, the "fitness" step assigns values that assess how good each model fragment is based on the description. As output, the approach provides a list of model fragments that might be relevant as a solution for the location problem.

Encoding
In this case, the candidates are model fragments. To represent a candidate solution (individual), we use a binary string. Each position of the binary string represents a model element (concept, property, or relationship) of a parent model. Thus, a candidate solution is defined as a set of model elements that are associated with a parent model, and the size of the solution is the number of model elements in the string. This representation defines model fragments by indicating the presence or absence of each of the model elements of the parent: when a position is equal to 0, the element is not part of the model fragment; when it is equal to 1, the element is part of the model fragment. All model fragments are defined with respect to a parent model. For instance, Model Fragment 1 in Figure 1 is a model fragment defined on the parent model that contains the power manager, the two consumer channels, and the two inductors of the parent model.

Fitness function
The search strategy is guided by a fitness function that is based on textual comparisons. 9 In this case, the algorithm assesses the relevance of each model fragment in relation to the description provided by the software engineers of our industrial partners (see the upper-right part of Figure 1). The description uses natural language. To assess the relevance of each model fragment to the description, we apply a technique that is based on information retrieval. First, the text from the textual description and the models is homogenized through the use of natural language processing techniques. 12 Then, we apply latent semantic indexing (LSI) 13 to analyze the relationship between the description and the candidate solutions generated. The result is a ranking of model fragments ordered by their similarity to the description.
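The ranking idea can be sketched as follows. Note that this is a hedged simplification: the article uses LSI (which involves an SVD of a term-document matrix), whereas this sketch ranks fragments by plain bag-of-words cosine similarity, which conveys the same "textual similarity to the description" principle without the SVD step. The fragment texts are hypothetical.

```python
# Minimal stand-in for the fitness step: rank candidate fragments by
# cosine similarity between their harvested text and the description.
# (The actual approach applies LSI; this skips the SVD for brevity.)
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_fragments(description: str, fragments: dict) -> list:
    """Return fragment ids ordered by textual similarity to the description."""
    d = Counter(description.lower().split())
    scores = {fid: cosine(d, Counter(text.lower().split()))
              for fid, text in fragments.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical fragment texts (terms harvested from each fragment's elements).
fragments = {
    "mf1": "power manager consumer channel inductor inductor",
    "mf2": "inverter provider channel",
}
print(rank_fragments("two inductors controlled by a power level", fragments))
```

In the real pipeline, the texts would first be homogenized with the NLP preprocessing mentioned above before any similarity is computed.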

Genetic operators
The genetic manipulation of the individuals is performed using four different operators. These operators were defined to work on model fragments 9,14 :
• Parent selection: This operator is responsible for selecting the parent individuals that will serve as the base for the new offspring. In this approach, we use a generic roulette wheel selection, which assigns each individual a probability of being selected that is proportional to its fitness value. All individuals can be selected as parents, but the higher the fitness value, the higher the probability of being selected as a parent for the next generation.
• Crossover: The crossover operator aims to combine the genetic material from the parents into new individuals. In this approach, we use a multiple-point crossover operator that is based on a mask 14 that combines two model fragments into two new individuals. The mask determines how the combination is done: for each element of the model fragments, it indicates whether the offspring should inherit from one parent or the other (including or excluding the element depending on whether or not the element is present in the parent). As a result, two individuals are generated, one by applying the mask directly and another by applying the inverse of the mask. The crossover operator is not always applied to the new offspring; the crossover rate parameter (p c ) determines whether or not it is applied.
• Mutation: The mutation operator aims to imitate the spontaneous mutations that occur in nature. The mutation can turn into an advantage or a disadvantage for the survival of the individual. In this approach, we use an evenly distributed mutation across the genes of the individual. The mutation operator can perform two kinds of modifications: the addition of elements to the fragment (by inverting a 0 to a 1 in the binary string) or the removal of elements from the model fragment (by inverting a 1 to a 0 in the binary string). Again, the mutation is not always applied to the new offspring; the mutation rate parameter (p m ) determines whether or not it is applied.
• Replacement: The replacement operator aims to modify the current population by combining it with the new offspring generated by the previous operators. In this approach, we replace the less fit part of the current population with new offspring. Two parameters determine the outcome of the operator: the population size, which is kept constant throughout the entire execution of the search, and the replacement factor, which determines the number of individuals that are replaced in each generation.
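The four operators above can be sketched compactly over the binary-string encoding. This is an illustrative sketch, not the authors' exact implementation; the parameter names (`pc`, `pm`, `r`) mirror the crossover rate, mutation rate, and replacement factor described in the text.

```python
# Sketch of the four genetic operators over binary-string individuals.
import random

def roulette_select(pop, fitness):
    """Fitness-proportionate (roulette wheel) parent selection."""
    total = sum(fitness(ind) for ind in pop)
    pick = random.uniform(0, total)
    acc = 0.0
    for ind in pop:
        acc += fitness(ind)
        if acc >= pick:
            return ind
    return pop[-1]

def mask_crossover(p1, p2, pc):
    """Mask-based crossover: one child from the mask, one from its inverse."""
    if random.random() > pc:                # crossover rate p_c
        return p1[:], p2[:]
    mask = [random.randint(0, 1) for _ in p1]
    c1 = [a if m else b for a, b, m in zip(p1, p2, mask)]
    c2 = [b if m else a for a, b, m in zip(p1, p2, mask)]  # inverse mask
    return c1, c2

def mutate(ind, pm):
    """Evenly distributed bit-flip: adds (0->1) or removes (1->0) an element."""
    if random.random() > pm:                # mutation rate p_m
        return ind[:]
    i = random.randrange(len(ind))
    out = ind[:]
    out[i] = 1 - out[i]
    return out

def replace(pop, offspring, fitness, r):
    """Replace the least-fit fraction r of the population with offspring."""
    survivors = sorted(pop, key=fitness, reverse=True)
    keep = len(pop) - int(r * len(pop))
    return survivors[:keep] + offspring[:len(pop) - keep]

random.seed(42)
child1, child2 = mask_crossover([1, 1, 0, 0], [0, 0, 1, 1], pc=1.0)
print(child1, child2)
```

Note how the two crossover children are complementary: every parent bit ends up in exactly one of them, so no genetic material is lost in the combination.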

Measures that define model fragment location problems (MD-MFLP)
Some research works 6,10 use a set of five measures that define and characterize MFLPs. Two of the five measures characterize the parent model where the MFLP is going to be located (search space), and the other three measures characterize the model fragment that realizes the MFLP (solution). These measures are the following:

Search space size (SS-size) measures the number of elements of the model where the MFLP is located. The elements are concepts, properties, or relationships that are present in the parent model. The SS-size value of the model presented in Figure 1 is 14, which matches the number of bits in the encoding.

Search space volume (SS-volume) measures the number of models that compose the search space. The location of an MFLP may not be limited to a single model and sometimes has to be performed in more than one model. In the example from Figure 1, the SS-volume value is 1 because the search is performed in one model. If it is necessary to locate the same MFLP in more than one model, one model fragment is obtained for each model used in the location.

Model fragment density (MF-density) measures the percentage of model elements that realize a solution, in other words, the ratio between the model elements in the model fragment and the model elements of the parent model. In the example from Figure 1, the MF-density value is 0.64 (9 model elements in the model fragment divided by 14 model elements in the parent model).

Model fragment multiplicity (MF-multiplicity) measures the number of times the solution appears in the search space. In the example from Figure 1, the MF-multiplicity value is 1 because the MFLP only appears once.

Model fragment dispersion (MF-dispersion) measures the ratio of connected elements in the solution. It is computed as the ratio between the number of groups and the number of model elements of the model fragment. In the example from Figure 1, the MF-dispersion value is 0.07.

In this work, we characterize the case studies and the results obtained with these specific measures that define MFLPs. In addition, we focus on the search parameters of the genetic operators that can be tuned to boost the performance of the approach.

Search parameters of the evolutionary algorithm for model fragment location
Evolutionary algorithm researchers acknowledge that good search parameter values are essential for good evolutionary algorithm performance. 15 In addition, the literature distinguishes between qualitative parameters (finite domain with no sensible distance or ordering) and quantitative parameters (infinite domain with structure or order). 15 For example, in our approach, we define our own qualitative parameter for the crossover operator, which describes how the crossover will be done (the operator itself). However, since the crossover rate expresses the probability of the crossover operator occurring and can take any value between 0 and 1, it is considered a quantitative parameter. For both types of parameters, the elements of the parameter's domain are called parameter values, and we instantiate a parameter by allocating a value to it. In this case study, the qualitative parameters (the operators indicated in the previous sections) are fixed from the beginning; we evaluate the search approach using different parameter values for the quantitative parameters.
Table 1 shows the most common quantitative search parameters that are used in current practice and that will be used to drive our approach. The population size parameter describes the number of model fragments that are maintained as candidate solutions in each generation. The replacement parameter indicates the percentage of the population that is going to be discarded in the next iteration; therefore, it also determines the number of new individuals that will be generated through the application of the genetic operators. The mutation rate (p m ) and the crossover rate (p c ) define the probability of applying the mutation operator or the crossover operator, respectively.
A set of values for each one of the parameters is known as a parameter vector. In our case, the parameter vector p = {population size, replacement, p m , p c } can take the values defined in Table 1 for each one of the search parameters. For each search parameter, we included both high and low values to properly explore the alternatives. During the evaluation, 625 different parameter vectors are used, so all of the combinations of values are explored (e.g., p1 = {100, 20%, 0.2, 0.8}, p2 = {150, 10%, 0.4, 0.6}).
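The count of 625 vectors follows from taking the Cartesian product of five candidate values per parameter (5^4 = 625). The value lists below are illustrative placeholders; the actual five values for each parameter are the ones listed in Table 1.

```python
# Sketch of how the 625 parameter vectors arise: the Cartesian product
# of five candidate values for each of the four quantitative parameters.
from itertools import product

population_sizes = [50, 100, 150, 200, 250]        # illustrative values
replacements     = [0.10, 0.20, 0.40, 0.50, 0.60]  # illustrative values
mutation_rates   = [0.20, 0.40, 0.60, 0.80, 1.00]  # illustrative values
crossover_rates  = [0.20, 0.40, 0.60, 0.80, 1.00]  # illustrative values

parameter_vectors = list(product(population_sizes, replacements,
                                 mutation_rates, crossover_rates))
print(len(parameter_vectors))  # 625
```

Enumerating the full grid rather than sampling it is what allows the later statistical analysis to attribute performance differences to individual parameters.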
The parameters and values taken into account for the experimentation are accepted parameters and values in the literature. 16,17 Table 2 shows the similarities and differences between previous works on parameter tuning (Arcuri and Fraser 16 and Sayyad et al. 17 ) and this work.
There are two atomic performance measures for evolutionary algorithms. The first one measures the solution quality (already evaluated in our previous work 14,18 ), and the second one measures the algorithm speed or search effort (evaluated in Reference 8). In this work, we try to maximize the quality of the solution obtained by the approach by varying the parameter vectors. The approach is executed for a fixed amount of time using one of the parameter vectors. Then, the quality of the solutions achieved is compared to determine which parameter vector provides the best quality.

EXPERIMENTAL SETUP
This section presents the process followed to evaluate the impact of the different parameter values taking into account the measures that define MFLPs.
To guide the evaluation, we pose the following research questions:
RQ1: Is there any impact on performance regarding the search parameter vector for MFLPs?
RQ2: Is there any impact on performance regarding the measures that define MFLPs?
RQ3: Are there any relations between the measures that define the MFLPs and the search parameter vector used that affect the performance?
To evaluate the impact of the different parameter values on the results, we execute the approach using a different set of parameter values each time to solve 1895 MFLPs from two of our industrial partners: BSH, the leading manufacturer of home appliances in Europe, and CAF, an international provider of railway solutions worldwide. In addition, we perform a statistical analysis to ensure the validity of the results.
We use the product models from our industrial partners as an oracle to evaluate the results.In other words, we make use of a set of product models whose MFLP realizations are known beforehand and that are considered to be the ground truth, thus allowing us to compare the results provided by our approach with the oracles.

FIGURE 3 Overview of the evaluation with the oracle.

Setup of the case studies
Figure 3 shows an overview of the process that we have followed in the evaluation. The top-right part presents the oracle (a set of product models with the MFLPs located and formalized). First, we construct a test case for each MFLP that is present in the oracle. In addition, we generate the 625 different parameter vectors with values from Table 1 to configure the algorithm (left part of Figure 3). We run each test case for each one of the possible parameter vectors for a fixed time. The allocated time is 10 s (a prior test showed that the search converges in less than the allocated time). This results in a model fragment for each of the test cases for each parameter vector. Then, the solutions are compared with the model fragments from the oracle (considered the ground truth) in order to obtain the precision, recall, and F-measure values. The operation is repeated 100 times for each combination of parameter vector and test case to reduce the stochastic component that algorithms of this type have. Finally, the data is aggregated into a report containing the results of the executions (shown in Section 4.1).
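The overall experimental loop can be sketched at a high level as below. The hooks `run_search` and `score_against_oracle` are hypothetical placeholders for the actual search tool and the oracle comparison; only the loop structure (every test case, under every parameter vector, repeated 100 times within a 10 s budget) comes from the text.

```python
# High-level sketch of the evaluation loop: test cases x parameter
# vectors x 100 repetitions, each run scored against the oracle.
REPETITIONS = 100
TIME_BUDGET_S = 10

def evaluate(test_cases, parameter_vectors, run_search, score_against_oracle):
    report = []
    for case in test_cases:
        for vector in parameter_vectors:
            for _ in range(REPETITIONS):  # damp the stochastic component
                fragment = run_search(case, vector, TIME_BUDGET_S)
                report.append((case, vector, score_against_oracle(case, fragment)))
    return report

# Stub demo with placeholder hooks:
report = evaluate(
    test_cases=["tc1", "tc2"],
    parameter_vectors=[(50, 0.2, 0.4, 0.8), (100, 0.4, 0.6, 1.0)],
    run_search=lambda case, vec, t: "fragment",
    score_against_oracle=lambda case, frag: 1.0,
)
print(len(report))  # 2 cases x 2 vectors x 100 repetitions = 400
```

At full scale this is 1895 test cases times 625 vectors times 100 repetitions, which is why the executions were distributed across an array of machines (Section 3.2).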
The MFLPs used in this evaluation were obtained from two of our industrial partners, BSH and CAF:

BSH
BSH is one of the leading companies in the home appliances sector. We collaborated with the induction division, which has been creating the firmware of IHs for brands like Bosch and Siemens for the last 15 years. The latest firmware produced includes full cooking surfaces, where heating areas are dynamically created and activated or deactivated depending on the characteristics (size, position, material, etc.) of the cookware placed on top.
There has also been an increase in the feedback that the hob provides to the cook, such as the temperature of the food being cooked in the cookware or even real-time values of the actual consumption of the IH. In this evaluation, we use 608 MFLPs extracted from the products that they develop. The oracle is composed of the description of each MFLP, the models of the products, and the model fragment that corresponds to each of the MFLPs.*

CAF
We also use the models from CAF in our evaluation. CAF is a constructor of railway solutions. They produce trains in many different forms (regular trains, subways, light rail, monorail, etc.) that are distributed worldwide. A train includes different pieces of specific equipment to carry out specific tasks for the train. These are located in vehicles and cabins and are usually designed and manufactured by different providers. The DSL used by CAF can describe the interaction between the different pieces of equipment on the train. Moreover, the DSL allows specifying non-functional aspects that are related to regulations, such as the different levels of redundancy present in the system or the quality of signals from the equipment. This results in a DSL that is composed of around 1000 different elements.
In this evaluation, we use 1287 MFLPs extracted from the products that they develop. The oracle is composed of the description of each MFLP, the models of the products, and the model fragment that corresponds to each of the MFLPs.† Therefore, in this evaluation, we have 1895 different MFLPs provided by our industrial partners. The use of two different domains with a wide variety of MFLPs and their characteristics improves the generalizability of our assessment.
We classified each of the MFLPs by means of the MD-MFLP measures (Section 2.3). For each MD-MFLP, we define two groups: HIGH and LOW. To do this, we use a median-based discretization by splitting each MD-MFLP at the value of the sample median. 19,20 Figure 4 shows the values for each of the five MD-MFLPs of the MFLPs used in this evaluation. Values above the median are considered to be HIGH, while values below the median are considered to be LOW. For instance, the median of the MF-multiplicity of all of the MFLPs is 4; all MFLPs whose MF-multiplicity value is above 4 are defined as HIGH for MF-multiplicity; in contrast, all MFLPs whose MF-multiplicity value is below 4 are defined as LOW for MF-multiplicity.

† The following video shows the train models and model fragments used by CAF: http://www.youtube.com/watch?v=Ypcl2evEQB8.
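The median-based discretization above can be sketched with the standard library. The input values are hypothetical; note that the treatment of values exactly at the median is not specified in the text, so this sketch assumes "strictly above the median is HIGH, the rest is LOW".

```python
# Median-based discretization sketch: split each MD-MFLP measure into
# HIGH/LOW at the sample median (assumption: ties at the median go LOW).
from statistics import median

def discretize(values):
    m = median(values)
    return ["HIGH" if v > m else "LOW" for v in values]

# Hypothetical MF-multiplicity sample whose median is 4, as in the example.
mf_multiplicity = [1, 2, 4, 6, 9]
print(discretize(mf_multiplicity))  # ['LOW', 'LOW', 'LOW', 'HIGH', 'HIGH']
```

Applying this per measure yields the two-level factors (HIGH/LOW) used in the statistical analysis of Section 4.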

Performance measurements
Once the results from applying the approach to the test cases are obtained, we compare them with the oracle and measure them in terms of some software quality properties. Figure 5 shows an example of a model fragment from the oracle (left), a model fragment candidate obtained from the application of the approach (right), and the confusion matrix 21 used to compare both (middle).
A confusion matrix is a table that is often used to describe the performance of a classification model (our approach under evaluation) on a set of test data (the resulting model fragments) for which the true values are known (from the oracle). In this case, each MFLP realization returned by the approach is a model fragment that is composed of a subset of the model elements that are part of the product model (where the MFLP is located). Since the granularity is at the level of model elements, the presence or absence of each model element is considered as a classification. Therefore, our confusion matrices distinguish between two values: TRUE (presence) and FALSE (absence). Figure 5 shows an example of the comparison between a result from one of the evaluated approaches and the ground truth from the oracle, along with the resulting confusion matrix. The left part shows the actual realization of MFLP #1 (obtained from the oracle and considered the ground truth), while the right part shows the predicted realization of MFLP #1 output by the approach. The confusion matrix arranges the results of the comparison into four categories:

True Positive (TP): A model element present in the predicted realization that is also present in the actual realization (e.g., model element B is a TP).
True Negative (TN): A model element not present in the predicted realization that is also not present in the actual realization (e.g., model element H is a TN).
False Positive (FP): A model element present in the predicted realization that is not present in the actual realization (e.g., model element A is an FP).
False Negative (FN): A model element not present in the predicted realization that is present in the actual realization (e.g., model element D is an FN).
The confusion matrix holds the results of the comparison between the predicted results and the actual results; it is just a specific table layout that helps visualize the performance of a classifier. However, to evaluate the performance of the approach, it is necessary to derive some measurements from the values of the confusion matrix, which in this case are the following three: precision, recall, and F-measure.
FIGURE 5 Example of a confusion matrix for two model fragments.
Precision measures the number of elements from the prediction (the result of the approach) that are correct according to the ground truth (the oracle):

Precision = TP / (TP + FP).

Recall measures the number of elements of the ground truth (the oracle) that are correctly retrieved by the prediction (the result of the approach):

Recall = TP / (TP + FN).

F-measure combines both recall and precision as the harmonic mean of precision and recall:

F-measure = 2 · (Precision · Recall) / (Precision + Recall).

Precision and recall values can range between 0% and 100%. Following up with the example of the confusion matrix in Figure 5, we can calculate the precision, recall, and F-measure for the model fragment (see Figure 5). The model fragment has a precision of 66.7% (two out of the three elements included in the candidate model fragment are present in the model fragment from the oracle) and a recall of 50% (two out of the four elements that are present in the oracle are also present in the model fragment). This results in a combined F-measure of 57%.
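The Figure 5 example can be reproduced with a short computation over element sets. The letters used for the elements are the ones named in the TP/TN/FP/FN examples above; the exact set contents beyond those named elements are assumed for illustration.

```python
# Worked version of the confusion-matrix measurements: reproduces the
# Figure 5 example (TP=2, FP=1, FN=2): precision 66.7%, recall 50%,
# F-measure ~57%. The element sets are illustrative.
def scores(actual: set, predicted: set):
    tp = len(actual & predicted)          # in both fragments
    fp = len(predicted - actual)          # predicted but not in the oracle
    fn = len(actual - predicted)          # in the oracle but not predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

actual = {"B", "C", "D", "E"}    # oracle fragment (4 elements; D is an FN)
predicted = {"A", "B", "C"}      # candidate fragment (3 elements; A is an FP)
p, r, f = scores(actual, predicted)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```

True negatives (like element H, absent from both sets) do not enter any of the three measures, which is why they can be left out of the computation entirely.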
Details about the implementation of the search strategy can be found in References 8 and 14. We executed the approach using an array of computers with 8-core processors, clock speeds of 4 GHz, and 16 GB of RAM. All of them were running Windows 10 Pro N 64 bits as the host operating system and the Java(TM) SE runtime environment (build 1.8.0_73-b02).
Since the software models of the case studies are currently operating or will be released in the near future, this information is limited by confidentiality agreements with our industrial partners. Nevertheless, for purposes of replicability, the CSV files with the results of this evaluation that were used as input in the statistical analysis are published online at: https://svit.usj.es/SPE_Roca_data/.

RESULTS
This section presents the results obtained in the evaluation and the statistical analysis that answers each research question, the findings, the discussion of the results, and the threats to validity.

Evaluation results
This section is divided into three parts. The first part shows the performance of the four search parameters at each of their five possible values. The second part shows the performance of the MD-MFLP measures for the two levels established (LOW and HIGH). The third part shows the performance taking into account both the search parameters and the measures.

Performance by search parameter value
The bottom part of Figure 6 shows the mean values and standard deviations that were calculated for each search parameter value. We also show the F-measure value, which combines precision and recall into a single performance indicator.
For the population size, the value of 50 achieved the best results, reaching mean values of 68.45% in precision, 73.26% in recall, and 69.67% in F-measure. However, all of the values provided similar results in terms of recall, with differences below 5%. For the replacement percentage, the value of 60% achieved the best results, reaching mean values of 66.84% in precision, 73.76% in recall, and 68.46% in F-measure. Again, the differences in terms of recall when using each of the parameter values were small (below 5%).
For the mutation rate, the value p m = 0.4 achieved the best results in terms of precision with a mean of 58.00%, while the value p m = 1.0 achieved the best results in terms of recall with a mean of 73.01%. However, the performance differences when using each parameter value were small (below 5%).
For the crossover rate, the value p c = 1.0 achieved the best results, reaching mean values of 67.32% in precision, 74.30% in recall, and 68.85% in F-measure. Again, the differences in terms of recall when using each of the parameter values were small (below 5%).
Table 3 shows the best and worst performance means for each search parameter vector, ordered by F-measure. The top five search parameter vectors have the values of the mutation rate and the crossover rate in common (i.e., both are equal to 1.0). The bottom five search parameter vectors have the values of the population size (250), the replacement percentage (20%), and the crossover rate (p c = 0.2) in common. However, the mutation rate takes all possible values.

Performance by MD-MFLP
The bottom part of Figure 7 shows the mean values and standard deviations of the precision, recall, and F-measure values obtained for each MD-MFLP. For the measures related to the search space, LOW values obtained better performance, while for the measures related to the model fragment, HIGH values obtained better performance.
For the SS-size, the differences between HIGH and LOW values were above 22 points, with a mean of 70.55% in F-measure for LOW values. For the SS-volume, the differences between HIGH and LOW values were above 6 points, with a mean of 63.17% in F-measure for LOW values. For the MF-density, the differences between HIGH and LOW values were above 14 points, with a mean of 67.41% in F-measure for HIGH values. For the MF-multiplicity, the differences between HIGH and LOW values were above 13 points, with a mean of 67.6% in F-measure for HIGH values. Finally, for the MF-dispersion, the differences between HIGH and LOW values were above 11 points, with a mean of 66.91% in F-measure for HIGH values.
Table 4 shows the ranking of performance means for each MD-MFLP combination, ordered by F-measure. Each measure is reported as a two-value factor (HIGH and LOW). Some of the 32 scenarios are missing due to the lack of test cases with those combinations of MD-MFLPs. The top five results were achieved when the SS-size was LOW and the MF-density was HIGH. The bottom five results coincide in the opposite SS-size value (HIGH) and in an MF-multiplicity value of LOW.
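Each measure is turned into a two-value factor (HIGH and LOW) before the comparison. The cut point used in the paper is not stated in this section; a median split is one plausible choice, sketched here as an assumption:

```python
def dichotomize(values, threshold=None):
    """Map a list of measure values to LOW/HIGH factor levels.
    If no threshold is given, a median split is used (an assumption;
    the paper may use a different cut point)."""
    cut = sorted(values)[len(values) // 2] if threshold is None else threshold
    return ["HIGH" if v >= cut else "LOW" for v in values]
```

For instance, with an explicit threshold of 2.5, the values [1, 2, 3, 4] map to ["LOW", "LOW", "HIGH", "HIGH"].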

Performance by search parameter value and MD-MFLP
Table 5 presents the mean values of precision, recall, and F-measure achieved for each search parameter value and each MD-MFLP. Similarly, Figure 8 shows the graphs obtained from those values. The results show that the performance by MD-MFLP remained the same as in the previous subsection: for the measures related to the search space, LOW values obtained better performance, while, for the measures related to the model fragment, HIGH values obtained better performance.
The same occurred with the population size, replacement percentage, and crossover rate search parameters: the values that obtained the best performance are the ones described in Section 4.1.1. However, the best-performing values for the mutation rate were not those described in Section 4.1.1 in all cases. The p_m1 = 0.2 value achieved the best results in terms of precision for SS-size, MF-multiplicity, and MF-dispersion.
Table 6 shows the best and worst performance means for each combination of search parameter vector and MD-MFLP, ordered by F-measure. The top five performance values correspond to the same scenario, which again shares an SS-size value of LOW and an MF-density value of HIGH. With regard to the search parameter values, the top five had the same crossover rate value (p_c5 = 1.0). Similarly, the bottom five values correspond to the same scenario; in contrast to the previous one, the SS-size value was HIGH and the MF-density value was LOW. Moreover, the bottom five values shared the same crossover rate value (p_c1 = 0.2), the same population size value (250), and the same replacement percentage value (20%). Note that this crossover rate value is the complete opposite of the previous one (p_c1 = 0.2 for the bottom values vs. p_c5 = 1.0 for the top values).

Statistical analysis
The data resulting from the empirical analysis has been analyzed using statistical methods following the guidelines in Reference 22. The goals of the analysis are: (1) to provide formal and quantitative evidence (statistical significance) regarding whether or not the different search parameter vectors and the different metrics have an impact on the performance; and (2) to show that those differences are significant in practice (effect size).
To enable statistical analysis, the algorithm must be run a large enough number of times (in an independent way) to collect information on the probability distribution for each search parameter vector. Then, a statistical test is run to assess whether there is enough empirical evidence to claim (with a high level of confidence) that there is a difference in the performance.

Statistical significance
To analyze the statistical significance of the different search parameter vectors and the different measures, we used the ANOVA test. To apply this test, it is important to comply with normality and homogeneity of variance. The box plots depicted in the previous sections show that the data does not follow a normal distribution. However, the central limit theorem states that the sampling distribution of means is normally distributed for large enough samples. 23 In our case, the normality of sampling distributions is ensured by having sufficiently large sample sizes. The result of the ANOVA test showed that there were statistically significant differences between groups: the p-values obtained for the precision, recall, and F-measure values were lower than 0.05 for all of the search parameter values and MD-MFLPs.
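As an illustration of the test, the one-way ANOVA F statistic can be computed directly from the per-group samples (e.g., the F-measure values of the runs grouped by search parameter value). This is a self-contained sketch; the paper's analysis was presumably run with a statistical package:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of sample groups.
    F = (between-group mean square) / (within-group mean square);
    large F values indicate that group means differ more than
    expected from within-group variation alone."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total sample size
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)
```

The p-value is then obtained from the F distribution with (k - 1, n - k) degrees of freedom, e.g., via `scipy.stats.f.sf`.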
As we detected significant differences between the groups, we tested which groups were significantly different. We performed an additional post hoc analysis, which consisted of a pair-wise comparison among the results for each search parameter and for each MD-MFLP in order to determine the statistically significant differences among them. The population size, replacement percentage, and crossover rate parameters have large effects on the precision value, while the mutation rate parameter has a small effect. The population size parameter and the precision value are inversely proportional; in other words, the precision value increases as the population size decreases. The replacement percentage and crossover rate parameters are directly proportional to the precision value; that is, the precision increases as they increase. However, all search parameters had small effects on the recall value.
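The effect sizes reported here are not named in this section; a common choice in SBSE guidelines is the Vargha-Delaney Â12 statistic, sketched below as an assumption about the kind of measure involved:

```python
def a12(group_a, group_b):
    """Vargha-Delaney A12 effect size: the probability that a value
    drawn from group_a is larger than one drawn from group_b
    (0.5 = no effect; values near 0 or 1 = large effect)."""
    gt = sum(1 for x in group_a for y in group_b if x > y)
    eq = sum(1 for x in group_a for y in group_b if x == y)
    return (gt + 0.5 * eq) / (len(group_a) * len(group_b))
```

For example, if every precision value obtained with one crossover rate exceeds every value obtained with another, Â12 = 1.0, a maximal effect.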
The results show that the best mean performance was obtained by a population size of 150, a replacement percentage of 50%, a mutation rate of p_m5 = 1.0, and a crossover rate of p_c5 = 1.0. If we consider the search parameters separately, the best-performing population size, replacement percentage, and mutation rate values vary, but the crossover rate value remained at p_c5 = 1.0.
Our statistical analysis of the results confirms that the crossover rate is the parameter that presents the largest differences and the highest effect size. The differences observed in the results when switching the value of the crossover rate are significant and yield large differences in performance: higher crossover rate values lead to higher performance (around a 25% improvement in F-measure).
We analyzed the results to understand why the influence of the crossover rate on the performance of the algorithm is significant while the influence of the mutation rate is negligible and not statistically significant. It turns out that crossover is more beneficial than mutation due to the characteristics of the model fragments. Our research shows that the model fragments involve elements that are not always connected to each other; in other words, locating a relevant model fragment requires identifying clusters of elements distributed throughout the model. For instance, the functionality related to braking in the CAF trains is not restricted to a specific part of the model and, therefore, affects numerous elements throughout it. While the mutation operation can incorporate elements from any part of the model, the crossover operation significantly favors gathering the relevant model elements. The use of real-world industry models enabled the identification of this finding; it may be challenging to identify in academic instances, which are simpler and whose model elements are interconnected, partly because academic research models and model fragments are smaller than industrial models.
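The intuition can be sketched with bit-encoded model fragments, one bit per model element (a hypothetical encoding for illustration, not the paper's implementation):

```python
import random

def uniform_crossover(parent_a, parent_b, rng):
    """Each model element is inherited from either parent, so relevant
    clusters discovered by different individuals can be gathered into
    a single offspring fragment in one step."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]

def bit_flip_mutation(fragment, p_m, rng):
    """Each element enters or leaves the fragment with probability p_m,
    exploring the model one element at a time."""
    return [1 - bit if rng.random() < p_m else bit for bit in fragment]
```

If parent_a = [1, 1, 0, 0] covers one relevant cluster and parent_b = [0, 0, 1, 1] covers another, a single crossover can produce [1, 1, 1, 1], whereas mutation alone would need several independent lucky flips to assemble the same fragment.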
RQ2.
Is there any impact on performance regarding the measures that define the MFLPs?
There were differences in the performance metrics between the HIGH and LOW values of the MD-MFLPs. The SS-size measure obtained the highest differences for precision and recall. The MF-multiplicity had a large effect on the recall value but a medium effect on the precision value. MF-density and MF-dispersion had medium effects on the precision and recall values. The SS-volume measure obtained the smallest differences for precision and recall.
The measures related to the search space (SS-size and SS-volume) obtained their best performance with the LOW values, whereas the measures related to the solution (MF-density, MF-multiplicity, and MF-dispersion) obtained their best performance with the HIGH values. The location problems used in the evaluation have been classified using the five MD-MFLP measures analyzed. This information can be very useful for other software engineers facing search problems when deciding which parameters to use in their evolutionary algorithms.

RQ3.
Are there any relations between the measures that define the MFLPs and the search parameter vector used that affect the performance?
There are interactions between the search parameter vector and the MD-MFLP that affect the performance metrics. The highest interaction between the MD-MFLP and the search parameter vector is produced by the different values of SS-size. The crossover rate is the most influential parameter, and its best-performing value is maintained across many scenarios.
In addition, this evaluation allowed us to realize that the nature of the search problems has a greater effect on recall, while the search strategy has a greater effect on the precision of the solutions found.
Regarding generalization, we have performed an evaluation of 1895 location problems from two different industrial domains, classified using the five MD-MFLP measures. Although the general trends are maintained, the parameter vector that offers better results depends on each scenario. The classification of location problems can serve as a practical guide, enabling more informed decisions in evolutionary algorithm design for other model domains.

Threats to validity
This section presents some threats to the validity of the results presented. We have followed the guidelines suggested by De Oliveira et al. 26

Conclusion validity:
We identified three threats of this type.The first threat is not accounting for random variation.
To address this threat, we considered 100 independent runs for each of the test cases and each of the search parameter vectors. The second threat is the lack of formal hypotheses and statistical tests. In this work, we employed standard statistical analysis following accepted guidelines 22 to avoid this threat. The third threat is the lack of good descriptive statistics. In this work, we have used the precision, recall, and F-measure metrics to analyze the confusion matrix obtained from the experiments; however, other metrics could be applied.

Internal validity:
We identified four threats of this type. The first identified threat is poor search parameter settings. In this work, we evaluated which search parameter values worked best when performing model fragment location with our algorithm. In addition, the choice of the k value in the application of SVD can produce sub-optimal accuracy when using LSI for software artifacts. 27 The second threat is the lack of real problem instances. The evaluation of this article was applied to 1895 location problems from two industrial case studies, BSH and CAF. The third threat is the lack of clear data collection tools and procedures. The set of 1895 feature location problems used in the evaluation was provided by our industrial partners BSH and CAF. The test cases provided are representative of their respective domains: the IH and railway domains. The fourth threat is the lack of discussion on code instrumentation. The evolutionary algorithm used in our evaluation is not a contribution of this article; it was presented in Reference 8 and improved in Reference 9, where the source code of the algorithm was made public.

Construct validity:
We identified one threat of this type. The identified threat is the lack of assessing the validity of cost measures. To address this threat, we performed a fair comparison among the algorithms with different search parameter values by generating the same number of model fragments and allocating the same budget time. Furthermore, the precision, recall, and F-measure measures used for solution quality are widely used in the information retrieval field. 28

External validity:
We identified three threats of this type. The first is the lack of a clear object selection strategy, and the second is the lack of evaluations for instances of growing size and complexity. To mitigate these threats, we used a large number of case studies from two industrial partners (BSH and CAF). Our instances are extracted from real-world problems. Also, the approach was evaluated in two different domains that vary in size and complexity. The third threat is the lack of a clear definition of target instances. We are concerned with the generalization of our findings; hence, we classify our location problems using the MD-MFLP presented in Reference 10. However, generalization could be affected since we only address location problems in models, neglecting other kinds of artifacts.

RELATED WORK
Several works study the effect of genetic operators in search-based design. [29][30][31][32] For instance, Reference 31 extends previous work on software architectures 33 to analyze the effect of the crossover operator in genetic algorithms that are used to synthesize software architecture designs. The authors compare sexual and asexual crossover operators, applying them to two test cases, and conclude that asexual crossover (i.e., no crossover) provides better results for that domain than a regular sexual crossover. The work was further refined 32 to propose a complementary crossover that is capable of yielding solutions of better quality than the asexual crossover: the proposed operator can find complementary parents and produce offspring that combine the best from both parents. Similarly, Reference 29 introduces a feature-driven crossover operator that provides better results when applied to optimize a product line architecture. The study compares a multi-objective evolutionary algorithm applied to two case studies using their crossover operator or no crossover operator at all.
Harman et al. 30 propose a new crossover operator to be applied in the context of automated software re-modularization. The new crossover operator aims to preserve building blocks from parents, which are transferred without modifications to the new offspring. This leads to better results than a regular crossover operator.
These works improve the results by using a different operator rather than by changing the values of the operator's parameters. The different operators applied by the evolutionary algorithms in those works correspond to qualitative parameters. In contrast, the focus of our work is on the quantitative parameters of the evolutionary algorithm, looking for the values that provide the best results with the current genetic operators, which leads to great improvement. Some works focus on the tuning of the parameters of evolutionary algorithms. 16,34,35 For instance, Reference 34 presents a survey on parameter tuning and parameter control and discusses several techniques to achieve them. The authors also make the distinction between parameter tuning (how to choose parameters before running the search algorithm) and parameter control (how to change parameters while the search is being performed). However, it is important to note that parameter control does not fully address the problems of parameter tuning, as the introduction of control mechanisms usually leads to more parameters that need to be set.
The work from Reference 16 deals directly with the No Free Lunch theorem 36 (it is impossible to tune a search algorithm so that it will have optimal parameter values for all problem instances). The authors perform a large empirical analysis in the context of test data generation for object-oriented software to determine the impact of the tuned parameters on the searches. They conclude that parameter tuning affects the performance of search algorithms; however, well-tuned parameters are complex to find, and default values may be enough.
In Reference 35, the authors carry out a more general study of the parameter tuning of evolutionary algorithms. Contrary to Reference 16, they conclude that by using tuning algorithms one can not only obtain superior parameter values, but also a lot of information about problem instances, parameter values, and algorithm performance.
Nevertheless, those works are applied to the (more general) field of SBSE and therefore do not take the particularities of the SBMDE field into account. It is not clear whether search parameter values that provide good results in problem instances that do not deal with models will behave similarly with SBMDE problem instances.

CONCLUSION
More and more, researchers are reformulating MDE activities as search problems. These works use search-based optimization techniques (mainly those from the evolutionary computation literature) to automate the search for optimal and near-optimal solutions. However, these works neglect the influence of the selected search parameter values and of the nature of their problems on their results. In this work, we have performed an evaluation to determine the impact of different search parameter values when performing model fragment location with an evolutionary algorithm, taking into account the nature of the problems to locate. It turns out that there are interactions between the search parameter vector and the MD-MFLP that affect the performance metrics. Different values of the population size, replacement percentage, or crossover rate parameters produce variations of around 30% in performance, whereas the mutation rate parameter produces differences of less than 5%. In addition, LOW values of the MD-MFLP measures related to the search space and HIGH values of those related to the model fragment obtain better performance. To achieve these results, the approach has been tested using 625 different search parameter vectors applied to 1895 different MFLPs. In addition, the results have been supported with a statistical analysis that determines that the results are significant and are not due to mere chance.

F I G U R E 4
Box plots with the value for each MD-MFLP obtained from the 1895 MFLPs and report of the search problems used in the evaluation.

F I G U R E 6
Box plots, mean values, and standard deviation of the precision, recall, and F-measure values obtained by each search parameter.

F I G U R E 7
Box plots, mean values, and standard deviation of the precision, recall, and F-measure values obtained by each MD-MFLP.

TA B L E 5
Mean values for the precision (P), recall (R), and F-measure (F) by each search parameter value and each MD-MFLP.
Overview of the search strategy.
(1 group in the model fragment divided by 14 model elements in the model fragment). This value can range from 0 to 1. Values around 0 indicate a strong connection among the model fragment elements, while values around 1 indicate a strong dispersion among them.
Comparison to other similar works.
MD-MFLP performance ranking ordered by F-measure.