Model-based source code refactoring with interaction and visual cues

Refactoring source code involves the developer in a myriad of program detail that can obscure the design changes that they actually wish to bring about. On the other hand, refactoring a UML model of the code makes it easier to focus on the program design, but the burdensome task of applying the refactorings to the source code is left to the developer. In an attempt to obtain the advantages of both approaches, we propose a refactoring approach where the interaction with the developer takes place at the model level, but the actual refactoring occurs on the source code itself. We call this approach model-based source code refactoring and implement it in this paper using two tools: (1) Design-Imp enables the developer to use interactive search-based design exploration to create a UML-based desired design from an initial design extracted from the source code. It also provides visual cues to improve developer comprehension during the design-level refactoring process and to help the developer to discern between promising and poor refactoring solutions. (2) Code-Imp then refactors the original source so that it has the same functional behavior as the original program, and a design close to the one produced in the design exploration phase, that is, a design that has been confirmed as “desirable” by the developer. We evaluated our approach involving interaction and visual cues with industrial developers refactoring three Java projects, comparing it with an approach using interaction without visual cues and a fully automated approach. The results show that our approach yields refactoring sequences that are more acceptable both to the individual developer and to a set of independent expert refactoring evaluators. Furthermore, our approach removed more code smells and was evaluated very positively by the experiment participants.

The necessity to refactor a software system in order to prevent design erosion as the system evolves has been widely accepted. The preferred approach is floss refactoring, where the system is refactored in small steps on a continual basis.1 However, practice shows that there is also a need for root canal refactoring,2 where development is postponed for a short period while a deeper, more radical refactoring process is carried out.3,4 Current tool support for refactoring is limited, comprising mainly code smell detectors that highlight areas that may be in need of refactoring, and refactoring tools themselves that automate the application of a range of well-known software refactorings. In the case of the latter tools, it is only the refactoring transformation that is automated: the developer must know that the refactoring exists, what its name is, that it is applicable in the current context, and how to invoke it.5,6 Clearly, there is scope for further automated refactoring support.
In this paper we are concerned with the root canal refactoring scenario, where the design has eroded to some degree and the developer is willing to undertake a more radical refactoring of the system.7 Previous attempts at automated support for this type of refactoring have focused on using a search-based process to apply a sequence of refactorings to achieve certain goals, for example, to maximize some combination of software quality metrics,8-10 to remove code smells,11,12 or to improve the semantic coherence of the code.13 Although these approaches have had some success, they ignore developer perspective during the refactoring process.16 An interactive evolutionary process can serve as a solution to integrate developer perspective into the refactoring process and in this way guide the refactoring process using both quantitative and qualitative criteria.17,18 Current studies show that incorporating expert knowledge during the code refactoring process can yield a significant improvement when compared with a fully automated process.16,19,20 However, these approaches employ source code as the main artifact to communicate refactoring effects, and this focus on low-level implementation detail can reduce the ability of the refactoring process to effect radical design change.
Another approach is to support refactoring at a design level, usually based on a UML class model. This can benefit from the advantages of models, such as improving source code comprehensibility21 and aiding understanding of the software design and its issues.22,23 However, the refactorings identified at the model level must then be applied to the source code by the developer.24-28 At best, this process is tedious and error prone; at worst, refactorings that appear beneficial when looking at a UML model may not be legally applicable to the actual source code.
The approach we take in this paper is to try to combine design-level refactoring with source-level refactoring in order to reap the benefits of both approaches. Source code refactoring is encumbered by the minutiae of programming language detail. On the other hand, design exploration can be a freer process, but it leaves the burden of applying the refactorings to the source code unaddressed. Our proposed model-based source code refactoring approach tries to achieve the "best of both worlds" by providing a design exploration phase followed by an automated source code refactoring phase that is driven by the output from the design exploration phase.
We have developed an interactive, model-based source code refactoring approach that is embodied in two tools: Design-Imp and Code-Imp. Design-Imp is an interactive, graphical UML-level tool that allows the developer to look at the existing software design and guide a search-based refactoring exploration of the design space, leading to the desired design the developer envisages. In the second phase of our refactoring approach, this desired design is used by Code-Imp to direct an automated source-level refactoring process that refactors the source code so that its design reflects the desired design the developer created. When the source code refactoring has completed, the developer is presented with the new source code, the sequence of refactorings applied, metrics values, and so on. This will be comprehensible to them because it is a design that closely resembles the desired design that they themselves created, rather than a computer-generated design.
To aid developers to better understand the original design and the refactoring process, we provide various types of visual cue during the design exploration phase as follows: (1) colors are used to convey the cohesion of a class, (2) arrows between classes vary in thickness to represent the level of coupling between the classes, and (3) the effect of a proposed refactoring is represented by changes to the color of affected entities and the thickness of the interentity arrows. Cues (1) and (2) help the developer to understand the design as it stands and where problems may lie (e.g., a class with poor cohesion), while (3) helps the developer to understand the effect of a refactoring and to discern between promising and poor refactoring solutions.
We evaluate our approach on three Java applications and compare it with two other approaches: our approach without visual cues and a fully automated search-based refactoring approach.10,29 Note that experimentally comparing our interactive approach with other interactive search-based refactoring approaches16,20,30 was not possible as these tools were not publicly available; however, we discuss these works in detail in Section 5. In our qualitative analysis, we found that the visual cues used in our approach can effectively improve developer comprehension during the interactive refactoring process and help the developers to easily find refactoring opportunities and more deeply understand the effect of the refactorings. Furthermore, we found that the refactorings proposed by our approach are more relevant to developers than those proposed by the fully automated approach. This paper makes the following contributions: (1) We present a novel interactive model-based source code refactoring approach that combines qualitative and quantitative evaluation criteria. The proposed technique uses visual cues to improve the identification of refactoring opportunities as well as enhance developer assessment of refactoring value.
(2) We implement our proposed approach in two tools called Design-Imp and Code-Imp. Design-Imp enables the developer to interactively create a desired design for the software being refactored, while Code-Imp refactors the source code of the software so that its design comes close to the desired design created with Design-Imp.
(3) We evaluate our approach on a set of three Java applications using professional developers as participants. The results show that our proposed approach outperforms a fully automated refactoring tool and our interactive tool without visual cues in several ways; most importantly, (i) it yields refactorings that are more acceptable both to the individual developer and to a set of independent expert refactoring evaluators, (ii) it succeeds in removing more code smells, and (iii) it was evaluated very positively by the experiment participants.
The remainder of this paper is organized as follows. In Section 2, we provide an overview of our approach to design exploration and code refactoring and discuss the advantages that can accrue from this approach using a worked example. The results obtained from our experiments are presented and discussed in Section 3, while the limitations of the proposed approach and threats to the validity of our study are discussed in Section 4. A survey of related work is presented in Section 5, and finally, in Section 6, we conclude and discuss future work.

| PROPOSED APPROACH
This section describes the approach developed in order to improve design quality by means of a model-based source code refactoring process. The idea is to create a desired design from an initial one using an interactive process and then to refactor, in a fully automated fashion, the source code so that its design matches the desired design as closely as possible. The approach involves two phases: design exploration, provided by the Design-Imp tool, and source code refactoring, provided by the Code-Imp tool. Figure 1 provides an overview of our approach. We describe these phases in more detail in Sections 2.1 and 2.2 and then provide a complete worked scenario in Section 2.3. In Section 2.4, we elaborate on a number of more detailed aspects of our approach.
FIGURE 1 Approach overview: Initially, a UML-like design model is extracted from the source code and used by the evolutionary algorithm to produce a refactored UML model. Proposed refactorings are highlighted using visual cues and presented to the developer for their evaluation. Developer feedback is included in the following iterations of the algorithm. When the developer is satisfied, the source code is refactored based on the refactorings accepted by the developer.

| Phase 1: Design exploration
Initially, a UML-like design model is automatically extracted from the program source code, as explained further in Section 2.4.1. Note that Design-Imp extracts the model from the source code only once; after that, all design-level precondition checking, fitness function evaluation, and refactoring take place on this model.
To help software developers to gain a better and quicker understanding of the software, the design model is presented to the developer as a UML class model. Classes in the UML model are depicted using different colors to represent varying levels of cohesion and are connected to each other using arrows of varying thickness to represent different levels of interclass coupling.31 In the second step of the design exploration phase, a metaheuristic is employed to refactor the design to a better one in terms of the employed fitness function.* This refactoring process is very fast as it operates only on the design model (Figure 2), unencumbered by the program detail that would normally be stored in a complete AST. There are two outputs to this step: the new design model and the sequence of candidate refactorings that led to this model.
The design exploration process can now start in earnest. The developer directs evolution by accepting promising candidate refactorings, rejecting unfavorable ones, editing refactorings that do not fully suit their intention, or simply ignoring refactorings that they are unsure about.
Note that a rejected refactoring will not be recommended again, while an ignored refactoring may be presented to the developer again on a later iteration. The developer can also manually add any refactoring they wish. This augmented refactoring sequence is now used to refine the fitness function,† and the initial design model is again refactored using this refined fitness function. The developer repeats the same review process with the new refactoring sequence, and this iterative process continues until they are happy to accept the proposed refactoring sequence.
Note that when a refactoring that has already been accepted by the developer appears again in a later list of proposed refactorings, it is marked as "accepted" by default.
The result of the design exploration phase is a design that has evolved from the initial design based on the employed fitness function, which combines both the initial metrics-based fitness function and, most importantly, the developer perspective accrued through several iterations of the "generate and review" cycle described above. We call this the "desired design" as it represents the design that the developer desires the software to have. In this paper, the design exploration phase is implemented using Design-Imp.

| Phase 2: Source code refactoring
After the design exploration phase, in the source code refactoring phase, the program source code is automatically refactored with the desired design as the ultimate goal. When the process completes, the resulting program will have the same functional behavior as the original and a design
*See Section 2.4.5 for more detail on the fitness function we use.
† A detailed description of developer interaction and how developer feedback is worked into the search process is presented in Section 2.4.7.
FIGURE 2 UML meta-model used by Design-Imp to represent a Java program at design level. As well as the usual high-level elements, the model contains information about method-method and method-field interactions extracted from method bodies.
close to the one produced in the design exploration phase and confirmed by the developer. Functional equivalence is guaranteed because only behavior-preserving refactorings are applied to the program. The source code refactoring phase is implemented using Code-Imp.
The main challenge here is how to determine whether a candidate source code refactoring helps to move the program design closer to the desired design. How this is achieved is explained through an example in Section 2.3. Also, the preconditions for design-level refactorings are not as strong as preconditions defined at source level, which allows the design exploration phase to proceed in a freer fashion than source code refactoring. However, it means that some design-level refactorings will not be directly applicable to the source code. How this is handled is also explained through an example in Section 2.3.
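As a rough illustration of the challenge of steering source-level refactoring toward a desired design, the following sketch models a design simply as a mapping from program elements to their containing classes and accepts a candidate move only if it reduces the distance to the desired design. This representation and the distance measure are our own illustrative assumptions, not Code-Imp's actual mechanism.

```python
# Hypothetical sketch: a design is a mapping from program elements (methods,
# fields) to containing classes; a candidate source-level refactoring is kept
# only if it moves the design closer to the desired one.

def design_distance(design, desired):
    """Count elements whose containing class differs from the desired design."""
    return sum(1 for elem, cls in design.items() if desired.get(elem) != cls)

def moves_closer(design, desired, refactoring):
    """Apply an (element, target_class) move on a copy and compare distances."""
    elem, target = refactoring
    candidate = dict(design)
    candidate[elem] = target
    return design_distance(candidate, desired) < design_distance(design, desired)

current = {"foo()": "A", "bar()": "A", "baz": "A"}
desired = {"foo()": "B", "bar()": "B", "baz": "A"}

helps = moves_closer(current, desired, ("foo()", "B"))   # True: moves toward goal
hurts = moves_closer(current, desired, ("baz", "B"))     # False: moves away
```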

| A usage scenario in detail
In this section, we provide a worked example by applying the model-based source code refactoring approach to a software application, namely, the Questionnaire Management System (QMS), which is also used in Section 3.
Design-Imp first presents a graphical package view of the application to be refactored, as seen in Figure 3. This view shows how related classes are grouped and how packages depend on each other. The developer can collapse or expand a package to better understand its nature and inner logic. The package view can be used by a developer unfamiliar with the application to gain an initial understanding of the application and to uncover quality issues related to its modular decomposition. Later, in Section 3, we measure the time that the developer spends on this step and refer to it as the "analyze time." To aid developers in gaining a better understanding of the original design, the design is enriched with different colors and arrows of varying thickness to visualize intraclass and interclass relationships (representing cohesion and coupling metrics, respectively), a technique that has been shown to be beneficial in improving developer comprehension in designing object-oriented classes.26,31 We anticipate that this improvement in understandability will enhance the developer's ability to identify refactoring opportunities (see Figure 4).
The level of cohesion is depicted using the following colors (indicating the spectrum from high cohesion to low cohesion): red, orange, pink, green, yellow, and gray, in keeping with the use of colors to represent class cohesion during the refactoring process employed by Simons et al.31 When the developer gains a sufficient understanding of the system, she/he can request that the tool generate refactoring solutions based on an objective, machine-calculated fitness function. The output of the automated refactoring process is a refactored version of the program expressed as a UML class model, along with a list of applied refactorings and metrics information gathered during the refactoring process. Figure 5 illustrates a screenshot of Design-Imp where the list of proposed refactorings is presented on the right-hand side and the refactored version of the QMS is displayed in the middle of the screen. The package explorer view, on the left-hand side, is especially useful when the project contains many packages and classes, as the developer can easily select the package or class they wish to focus on. As a next step, to include developer intuition in the refactoring process, the developer can direct evolution by accepting promising refactorings or rejecting unfavorable ones. To aid the developer in making correct decisions, the effect of each refactoring is shown using visual cues. When the developer selects any refactoring from the proposed solution presented on the right side of the screen (as shown in Figure 5), its effect on the class diagram is highlighted. This information can help the developer to understand what part of the design was changed and why. To improve clarity, the developer can request that irrelevant classes and relationships be blurred; see Figure 6 for an example.
FIGURE 3 A package view of the Questionnaire Management System as depicted by Design-Imp. The variation in arrow sizes conveys how strongly packages are coupled to each other.
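The six-color cohesion scale described above could be realized as a simple bucketing function; the sketch below maps a cohesion score in [0, 1] onto the listed spectrum. The threshold scheme (six equal-width buckets) is an assumption for illustration, not Design-Imp's actual mapping.

```python
# Illustrative sketch (not Design-Imp's implementation): bucket a class
# cohesion score in [0, 1] into the six colors the paper lists, from
# high cohesion (red) down to low cohesion (gray).

COLORS = ["red", "orange", "pink", "green", "yellow", "gray"]  # high -> low

def cohesion_color(score):
    """Map a cohesion score in [0, 1] to one of six colors."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("cohesion score must lie in [0, 1]")
    # Six equal-width buckets; higher scores map to 'hotter' colors.
    index = min(int((1.0 - score) * len(COLORS)), len(COLORS) - 1)
    return COLORS[index]
```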
The tool also allows the developer to modify an applied refactoring (inspired by the work of Alizadeh et al16) or to directly request a refactoring by selecting an entity (class, method, or field) in the model and choosing the desired refactoring from a context menu. The developer's decisions are included in the fitness function and considered during the next iteration of the algorithm (more details in this regard are provided in Section 2.4.7).
FIGURE 6 Visualization of two Pull Up Method and two Pull Up Field refactorings. The developer has requested that irrelevant classes and relationships be blurred. While Design-Imp can show inheritance and call relationships between classes, the only interclass relationship depicted here is inheritance.
The interactive process repeats until no further refactoring can be recommended by the tool (e.g., where all valid refactorings have been rejected) or the developer is satisfied with the design created. At this point, the developer requests source-level refactoring to commence, and the source code is automatically refactored to comply with the created desired design. This process of code generation is described further in Section 2.4.2.

| Approach details
The previous subsection provided a high-level scenario of how our approach might be used. In this subsection, we elaborate further on several aspects of the process that require a more detailed explanation. Section 2.4.1 elaborates on the UML-like model that is extracted from the source code, while Section 2.4.2 discusses how code is generated from this model. In Section 2.4.3, we describe the type of evolutionary search employed. Section 2.4.4 explains how solutions are represented in our evolutionary approach, Section 2.4.5 explains the fitness function employed, while in Section 2.4.6, we explain how diversity is encouraged in the population. Section 2.4.7 finally describes how feedback from the developer is incorporated into the fitness function.

| Model extraction
The design model extracted from the source code contains the design elements depicted in Figure 2, represented as an attributed type graph. While much of the program detail is omitted in the extraction phase, the model still contains more detailed information than a standard UML class model. For example, the design model stores method-method and method-field interactions, which are not normally part of a UML class model but which are required to maintain consistency between a design and its corresponding source code when the code is being refactored.
As shown in Figure 2, the extracted model does not include the UML association, aggregation, and composition relationships that are part of the standard UML model. These relationships are not explicitly defined in the source code, and a reverse engineering algorithm that relies only on source code may produce erroneous or inconsistent relationships. Therefore, the model extracted by Design-Imp differs from those extracted by CASE tools, such as Visual Paradigm,‡ UML Designer,§ or Pynsource,¶ that offer round-trip engineering between UML and source code.
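The kind of model described above, standard class-level elements plus the method-method and method-field interactions that a plain UML class model omits, might be sketched as follows. The class and field names here are our own illustrative assumptions, not the actual Design-Imp data structures.

```python
# A minimal sketch of the kind of design model Design-Imp extracts: classes,
# their members, an optional superclass, and the method-method (calls) and
# method-field (accesses) interactions recovered from method bodies.

from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

@dataclass
class ClassNode:
    name: str
    superclass: Optional[str] = None
    methods: Set[str] = field(default_factory=set)
    fields: Set[str] = field(default_factory=set)

@dataclass
class DesignModel:
    classes: Dict[str, ClassNode] = field(default_factory=dict)
    calls: Set[Tuple[str, str]] = field(default_factory=set)     # caller -> callee
    accesses: Set[Tuple[str, str]] = field(default_factory=set)  # method -> field

model = DesignModel()
model.classes["Shape"] = ClassNode("Shape", methods={"area()"}, fields={"size"})
model.classes["Circle"] = ClassNode("Circle", superclass="Shape", methods={"draw()"})
model.calls.add(("draw()", "area()"))
model.accesses.add(("area()", "size"))
```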

| Code regeneration
During source-level refactoring, a refactoring is applied if its preconditions are true and it does not create any compiler errors. However, during precondition checking, it is possible that a design-level refactoring cannot be mapped easily to source-level refactorings, or that it cannot be executed because of a failing precondition. In this case, Code-Imp tries to find an equivalent sequence of refactorings that is applicable to the existing source code, and whose execution has the same effect as the candidate design-level refactoring.32 As an example of design-level refactorings that cannot be easily mapped to source code refactorings, consider a case where, during the design exploration phase, two methods named foo() and bar() along with a field baz are moved to their subclass using two Push Down Method and one Push Down Field refactorings. However, neither the private methods foo() and bar() nor the private field baz can be moved to the subclass, because both methods use the field baz and method foo() calls method bar(). Hence, as soon as one of these elements is moved to the subclass, its reference in the superclass is no longer valid, leading to a compiler error. Therefore, these refactorings are all rejected during precondition checking.# In this case, our solution is to add new refactorings to the refactoring sequence.32 Initially, two new refactorings, Increase Field Accessibility and Increase Method Accessibility, are added to the refactoring sequence to increase the accessibility of field baz and method bar() to protected. Then, the required move refactorings are performed, followed by reverting the accessibility of the field and method to private using Decrease Field Accessibility and Decrease Method Accessibility refactorings.32
Developer comprehension of the refactored code is a concern. It is central to our approach that the refactoring sequence is generated with the direct involvement of the developer, so they have agreed to every refactoring in the sequence. We anticipate therefore that the refactored code will not look too surprising to the developer. There is a risk that, for example, comments may get misplaced during the refactoring process.
This problem affects all automated refactoring research, and we view it as outside the scope of our work.
‡ https://www.visual-paradigm.com/.
§ https://www.umldesigner.org.
¶ https://pynsource.com.
# As the design model contains less detailed information than the source code, design-level preconditions are weaker than source-level ones. It is not uncommon for a refactoring to be appealing at the design level, but for it to fail precondition checking at source code level.
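The accessibility workaround in the Push Down example above can be pictured as a small planner that wraps the blocked moves with accessibility changes. The representation of refactorings as (name, target) tuples is a simplifying assumption; Code-Imp's actual refactoring objects carry far more detail.

```python
# Hedged sketch of the workaround described above: when Push Down refactorings
# on private members would break superclass references, widen the accessibility
# of the referenced members first, perform the moves, then restore it.

def expand_push_downs(referenced, moves):
    """Build an equivalent sequence: Increase ... Accessibility refactorings
    for the referenced private members, the original moves, then the matching
    Decrease ... Accessibility refactorings."""
    widen, restore = [], []
    for member, kind in referenced.items():
        label = "Field" if kind == "field" else "Method"
        widen.append(("Increase %s Accessibility" % label, member))
        restore.append(("Decrease %s Accessibility" % label, member))
    return widen + moves + restore

# baz is used by both methods, bar() is called by foo(); both block the moves.
referenced = {"baz": "field", "bar()": "method"}
moves = [("Push Down Method", "foo()"),
         ("Push Down Method", "bar()"),
         ("Push Down Field", "baz")]
sequence = expand_push_downs(referenced, moves)
```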

| Evolutionary search
In this paper, we use the Non-dominated Sorting Genetic Algorithm-III (NSGA-IIIk)33 to create a Pareto-optimal front comprising sets of refactoring operations that improve an employed fitness function defined on four conflicting objectives. NSGA-III is an enhanced version of NSGA-II34 that uses reference points to maintain the diversity of the population.33 While the efficiency of most multiobjective evolutionary algorithms decreases with an increasing number of objectives, NSGA-III has been shown to perform efficiently on some unconstrained problems with 3 to 15 objectives.33 The efficiency of NSGA-III has also been investigated in the field of search-based refactoring, where NSGA-III performs significantly better than several mono-objective and multiobjective evolutionary algorithms, including NSGA-II.35,36 Full details of the NSGA-III algorithm are provided in the original paper cited above. In the following subsections, the details of each component of the implemented algorithm are discussed.

| Solution representation and initial population generation
Each solution in our problem is represented in the form of a one-dimensional array, where each cell in the array contains a refactoring to be applied to the program. The size of a solution is randomly selected between lower and upper limit values. At the start, each solution is initialized randomly with concrete refactorings of the types shown in Table 1. Two constraints are taken into account in creating this refactoring sequence:
• Refactoring preconditions are taken into consideration. As an example, a method is not renamed to the name of an existing method in its class.
• Refactorings may be added based on previously added refactorings. As an example, when an Extract Class refactoring is generated, a new class is added to the program, and it then becomes possible to generate refactorings that move methods or fields to this newly created class.
These two conditions are vital; otherwise, the majority of generated refactorings would not be applicable to the program due to their preconditions failing, or else the generated refactorings would have no relationship to each other. This latter case is unrealistic: in a real-world scenario, developers refactor the program as a set of interrelated transformations, rather than by applying refactorings in isolation.37,38

| Fitness function employed
To measure the quality of the proposed refactoring sequences, we define a fitness function, to be evaluated at the model level, based on four different objectives. A refactoring sequence has high merit if it (i) maximizes the quality of the refactored design, (ii) minimizes the number of code smells, (iii) minimizes code changes, and (iv) is acceptable from the developer's point of view. We use a combination of different objective metrics to prevent the search being too biased and creating designs that, while optimal in terms of the employed metric, appear strange to a human software developer.10 It is also worth mentioning that the effectiveness of the first three objectives has been demonstrated in previous search-based refactoring studies.16,20,36,39,40 To measure the effect of the proposed refactoring sequences in terms of design quality, we used the Quality Model for Object-Oriented Design (QMOOD).41 The QMOOD quality model measures six design attributes, namely, reusability, flexibility, understandability, functionality, extendibility, and effectiveness.41 To combine these six attributes into a single objective in the fitness function, we first normalize them by dividing each one by its value in the initial design,41 and then use their average as a measure of the quality of the software.
However, it has been established that some design attributes of the QMOOD model are not suitable for search-based software refactoring.10 Employing the QMOOD reusability, effectiveness, and extendibility functions to guide search-based refactoring tends to create solutions with empty classes.10 This happens mainly because of the positive effect of the design size property used in the reusability function, and of the hierarchies and abstraction properties used in the effectiveness and extendibility functions. In fact, a newly created class in the hierarchical structure can positively affect these three design attributes and thus has a high chance of being accepted by the search algorithm. To avoid creating any empty classes in the refactored design, we defined one additional constraint that informs the search process that solutions involving empty classes are to be avoided.** There is a relationship between refactorings and code smells, and both affect quality attributes.42 Therefore, to consider this in the refactoring process, we define another objective in the fitness function that rewards minimizing the number of code smells in the program.† We also aim to minimize the changes applied to the initial design, measured through the number of refactorings applied to the design. The main reason is to simplify the evaluation of refactorings by the developer. Furthermore, it has been shown that most developers prefer refactoring solutions that minimize the number of design changes they make.16 Developer feedback is also included in the fitness function from the second iteration of the algorithm. Further details of this are given in Section 2.4.7.
k The algorithm is provided by the MOEA framework, and it is optimized for performance.
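The normalize-then-average combination of the six QMOOD attributes can be sketched directly; the attribute values below are invented for illustration only, and the sketch does not reproduce the QMOOD attribute formulas themselves.

```python
# Sketch of the QMOOD-based objective: each of the six attribute values of the
# refactored design is normalized by its value in the initial design, and the
# normalized values are averaged. A score above 1.0 indicates improvement.

def qmood_objective(refactored, initial):
    """Average of refactored attribute values, each divided by the
    corresponding value in the initial design."""
    normalized = [refactored[attr] / initial[attr] for attr in initial]
    return sum(normalized) / len(normalized)

# Invented example values for the six QMOOD attributes.
initial = {"reusability": 0.40, "flexibility": 0.25, "understandability": 0.50,
           "functionality": 0.60, "extendibility": 0.30, "effectiveness": 0.45}
refactored = {"reusability": 0.44, "flexibility": 0.30, "understandability": 0.50,
              "functionality": 0.60, "extendibility": 0.33, "effectiveness": 0.45}

score = qmood_objective(refactored, initial)
```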

| Diversity promoting operators
We used crossover and mutation as two genetic operators to promote diversity in the solutions. To select the best solutions on which to perform crossover, we used Stochastic Universal Sampling (SUS) selection. For crossover, for each pair of selected parents, a crossover point is randomly determined, and all refactorings from that point are swapped between the two parents to produce two offspring (single-point crossover). It is possible that crossover will render a swapped refactoring inapplicable, as it may depend on one or more earlier refactorings in its original parent. To resolve this, when a refactoring is copied to a child, any refactorings it depends on are also copied along with it, even though they are before the crossover point. This process can result in refactoring solutions of different lengths. Duplication of refactorings is not allowed during the crossover operation.
Using this technique, refactorings are guaranteed to come after any refactorings on which they depend.Note that a solution containing interrelated refactorings is usually easier for the developer to validate, as the dependencies grant the sequence a certain semantic coherence.
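The dependency-aware single-point crossover described above can be sketched as follows. Refactorings are reduced to plain identifiers, and the `deps` map, which records which earlier refactorings each one depends on, is an illustrative stand-in for the dependency analysis the tool performs.

```python
# Sketch of dependency-aware single-point crossover: when a refactoring from
# the tail of one parent is copied into a child, any refactorings it depends
# on are copied first (even if they precede the crossover point), and
# duplicates are skipped, so children may differ in length from the parents.

def crossover_half(head, tail, deps):
    """Combine one parent's head with the other parent's tail, pulling in
    dependencies and skipping duplicates."""
    child = list(head)
    for r in tail:
        for d in deps.get(r, []):
            if d not in child:
                child.append(d)  # dependency is placed before its dependent
        if r not in child:
            child.append(r)
    return child

p1 = ["a1", "a2", "a3", "a4"]
p2 = ["b1", "b2", "b3", "b4"]
deps = {"b3": ["b1"]}          # b3 depends on b1, which precedes the cut
point = 2
child1 = crossover_half(p1[:point], p2[point:], deps)
child2 = crossover_half(p2[:point], p1[point:], deps)
```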
In the next step, we apply the mutation operator to the resulting offspring to slightly modify the solution, thereby improving solution variability and possibly improving fitness. To do so, in each resulting offspring, we randomly select a refactoring that was accepted by the developer during the interactive process and then generate a new refactoring that is related to the selected refactoring. The newly generated refactoring is then added to the solution.
For example, if an Extract Class refactoring is selected, we consider moving fields and methods defined in the original class but related to elements defined in the extracted class. In this case, the related field or method is moved to the newly extracted class. As another example, when an Extract Method refactoring is selected, we consider moving the extracted method to classes that use it more than its original class. Note that in real-world scenarios, developers usually perform composite refactorings rather than unrelated ones.37,38 It is also worthy of note, as confirmed by Rebai et al20 and Alizadeh and Kessentini,30 that including developer preferences on where the refactorings should be applied (code location) can reduce interactive refactoring effort.20,30 Therefore, the proposed mutation operator can help developers focus more on the code locations they are most interested in and in this way prevent refactorings being applied incoherently to a large number of code locations.
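The mutation operator described above might be sketched as follows: pick a developer-accepted refactoring from the offspring at random and append a new refactoring related to it. The rules for choosing the related refactoring are simplified stand-ins for the tool's real ones, and the refactoring names are illustrative.

```python
# Sketch of the mutation operator: anchor on a randomly chosen refactoring
# that the developer accepted, and append a related refactoring so that the
# change stays coherent with locations the developer cares about.

import random

def mutate(solution, accepted, rng=None):
    """Append a refactoring related to a randomly chosen accepted one."""
    rng = rng or random.Random()
    anchors = [r for r in solution if r in accepted]
    if not anchors:
        return solution  # nothing accepted yet; leave the offspring unchanged
    kind, target = rng.choice(anchors)
    if kind in ("Extract Class", "Extract Method"):
        related = ("Move Method", target)  # move a related method toward it
    else:
        related = ("Move Field", target)
    return solution + [related]

solution = [("Extract Class", "Order"), ("Rename Method", "pay()")]
accepted = {("Extract Class", "Order")}
mutated = mutate(solution, accepted)
```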

| Incorporating developer interaction into the fitness function
This section provides a detailed description of how developer interaction is included in the fitness function. It should be emphasized that, while we propose a novel way to improve the accuracy of developer feedback during the refactoring process, the approach we employ to include that feedback in the fitness function has already been used successfully in earlier research works such as those of Alizadeh et al 16 and Kessentini et al. 44 To avoid presenting a diverse population of Pareto front solutions to the developer, a single Pareto front solution is selected at random by the MOEA after each iteration of the algorithm and presented to the developer. The developer can then express her/his opinion about the proposed refactoring sequence and the refactorings it contains. While accepting/rejecting the entire proposed refactoring sequence as a whole is quick, it is recommended that the developer provide an opinion about each individual refactoring in the solution. This is very valuable in the early iterations of the algorithm to guide the search process more efficiently.
**Note that we do not change the QMOOD model to prevent empty classes. If a proposed refactoring (e.g., Extract Class) were to create an empty class, and no fields or methods were moved to this newly created class by the subsequent refactorings in the sequence, then the Extract Class refactoring would be deleted from the refactoring sequence.
† It has been shown that refactoring tools may introduce new code smells. 43 To alleviate this issue, the refactoring process introduced in this paper is supervised by the developer.
The developer may have no preference and so provide no opinion about the refactoring sequence and its refactorings; alternatively, she/he can accept/reject the refactoring sequence or any of its refactorings. The developer can also modify a proposed refactoring according to her/his preference (e.g., changing the target class of a proposed Move Field refactoring). This technique has been used before by Alizadeh et al 16 and Kessentini et al, 44 where a set of recommendations related to the selected refactoring is provided to the developer. If a refactoring is modified by the developer, the original refactoring in the solution is replaced with the modified one, and it is marked as an accepted refactoring from the developer's point of view.
At the end of each iteration, developer feedback is incorporated into the fitness function as follows: (i) the score of each accepted/modified/added refactoring in the refactoring sequence is increased, and (ii) explicitly rejected refactorings are removed. In this way, we increase the chance of accepted refactorings appearing in the next generation while preventing rejected refactorings from being proposed again by the algorithm. 16,44
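The two feedback rules above can be sketched as follows. This is a hedged sketch, not the tool's actual implementation: the score bonus and the string-based refactoring IDs are illustrative assumptions.

```python
# Sketch of folding per-refactoring feedback into the fitness function:
# accepted/modified/added refactorings gain score; rejected ones are removed
# from the solution and blacklisted so they are never proposed again.
ACCEPT_BONUS = 1.0  # illustrative weight, not the tool's actual value

def apply_feedback(solution, scores, feedback, rejected_pool):
    surviving = []
    for ref in solution:
        verdict = feedback.get(ref, "ignored")
        if verdict == "rejected":
            rejected_pool.add(ref)  # prevent this refactoring reappearing
            continue
        if verdict in ("accepted", "modified", "added"):
            scores[ref] = scores.get(ref, 0.0) + ACCEPT_BONUS
        surviving.append(ref)
    return surviving

def interaction_score(solution, scores):
    """Feedback component of the fitness: favors developer-endorsed refactorings."""
    return sum(scores.get(ref, 0.0) for ref in solution)
```

Solutions containing previously accepted refactorings then score higher, biasing the next generation toward the developer's preferences.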

| EVALUATION
To evaluate the feasibility of our approach in a practical environment, we conducted a series of experiments on Java applications with industrial developers. We compare three approaches: (1) a fully automated refactoring tool, 10 (2) our interactive tool without visual cues, and (3) our interactive tool with visual cues. In terms of Stol and Fitzgerald's Actors, Behaviors, Context (ABC) model of software engineering research, our approach can be categorized as an Experimental Simulation. Our participants (Actors) are professional software developers, and we measure their refactoring behavior using accepted techniques such as the number of code smells fixed. The Context in which they perform the work is not their regular work environment, but a set of three pseudo-realistic exercises drawn from open-source applications.
In this section, we first present our research questions and validation methodology, followed by the evaluation setup. We then evaluate the three approaches and discuss the results. The complete results of our experiments are publicly available. 45

| Research questions
We defined four main research questions to assess the value of our approach, as compared with a fully automated refactoring approach and our interactive approach without visual cues. More specifically, our experiment addresses the following research questions:
RQ1: Does our approach improve the appropriateness of the proposed refactorings compared with the other two approaches?
RQ2: Does our approach improve the correctness of the proposed refactorings compared with the other two approaches?
RQ3: Does our approach improve developer comprehension during the refactoring process compared with the other two approaches?
RQ4: How does our approach perform in terms of execution time and memory consumption?
Appropriateness is a measure of how satisfied the developer is with the proposed refactoring sequence. Correctness is a combination of how well the proposed refactorings match the refactorings verified by external experts and the number of code smells removed by the refactoring sequence. Developer comprehension is measured using a poststudy questionnaire.

| Evaluation setup
To investigate these research questions, we conducted experiments with nine developers from the software industry and empirically assessed the benefits of our proposed approach on three small- to medium-sized Java projects. Of the nine developers, seven were former Masters or Bachelors students of the first author and were known to him as strong programmers. The other two were unknown to the authors but were invited by the other developers because of their programming ability and interest in refactoring.
At the beginning of the study, a 45-min presentation was given to the developers to familiarize them with the tools employed and the experimental setup. We also provided a brief overview of the three Java projects used in the experiments. We then asked the developers to fill in a prestudy questionnaire to collect their background information (see this link 45 for more details).
Before conducting the experiments, we divided the developers into three skill-equivalent groups based on the information they provided in the prestudy questionnaire. Each group thus consisted of three developers, and 27 experiments were performed in total, where each approach was evaluated by nine developers on different Java projects. We used a Latin square design 46,47 to assign the approaches randomly and minimize learning bias. Our goal is to compare three treatments (i.e., interactive with cues, interactive without cues, and fully automated) and block two variables: (i) the three groups of developers and (ii) the three different Java applications. Table 2 illustrates the design of our experiment. The size of the Latin square is 3 × 3, in which the x axis is the group of developers and the y axis is the Java application used in our experiments. As shown in Table 2, each treatment appears only once in every row and column. As a result, each group of developers uses all three approaches on different Java applications. In other words, the approaches are allocated randomly in such a way that each one is used once for each group of developers (row) and once for each Java application (column).
In the interactive approaches, a UML class model representing the design of the Java project under investigation is extracted by the tool and shown to the developer. The developer can review the presented design using the provided graphical facilities, and can also review the source code of the project. The time spent in this step is measured and referred to as the analyze time. No time limit is set for this step, and the developers start the refactoring process whenever they feel ready.
When the developer starts the refactoring process, a refactoring sequence based on the employed fitness function is created and presented to her. At this stage, the developer can accept/reject/ignore or modify the proposed refactorings. She can also apply manual refactorings based on the current software design and her understanding of how the design may need to evolve; this is achieved simply by requesting the desired refactorings using a context menu. The developer can request another iteration, in which case the tool produces another sequence of refactorings based on the employed fitness function, but this time the developer's feedback on the previously generated refactorings is included in the fitness function. This process repeats until the developer is satisfied with the results. In the experiments performed with the interactive approaches, we collect the number of interactions the developer had with the tool and the time spent in each iteration. We also collect the number of accepted/rejected/ignored and manually applied refactorings.
In the fully automated approach we employ, the fitness function is based on (i) maximizing the quality of the refactored design, (ii) minimizing the number of code smells, and (iii) minimizing the number of code changes. Each component metric is normalized to the same range, and the fitness function is defined as the unweighted sum of these metrics. This is a standard fitness function for fully automated search-based refactoring, and its effectiveness has been demonstrated in several existing studies. 16,20,36,39,40 The fully automated refactoring process was applied five times ‡‡ to each application for each developer, and the best refactoring sequence was chosen to be presented to the developer, who then evaluated the appropriateness of each of the proposed refactorings.
At the end of the experiments, we asked the developers to fill in a poststudy questionnaire (see this link 45 for more details) to elicit their opinions about our approach compared with the other two approaches.
The functions we use in our evaluation are defined in the three subsections below. Note that the notion of correctness used in RQ2 is a combination of precision (Section 3.2.2) and the ability of the approach to remove code smells from the application being refactored (Section 3.2.3).

| Appropriateness function
Regarding RQ1, to measure the appropriateness of the refactorings proposed by each tool, we employed a formula used in many similar research works. 16,20,44,48 More specifically, as expressed in Equation (1), we divide the number of refactorings accepted by the developers by the total number of refactorings in the recommended refactoring solution. 16,20,44,48

Appropriateness = # Accepted Refactorings / # Recommended Refactorings ∈ [0,1]. (1)
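For concreteness, Equation (1) transcribes directly into code; the zero-denominator guard is our own addition for the degenerate case of an empty recommendation.

```python
# Equation (1): fraction of recommended refactorings the developer accepted.
def appropriateness(n_accepted, n_recommended):
    if n_recommended == 0:
        return 0.0  # no recommendations: define appropriateness as 0
    return n_accepted / n_recommended  # value in [0, 1]
```

For example, a solution in which the developer accepts 8 of 10 recommended refactorings has an appropriateness of 0.8.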

| Precision
Regarding RQ2, one measure we use to assess correctness is the precision of the refactoring sequence. For this, we use the precision formula expressed in Equation (2).
‡‡ In our pilot experiments, we observed that there was no significant difference in the fitness value of the selected solutions after 10 runs. We therefore decided that five executions of the algorithm would be sufficient to obtain a "good" solution to represent the fully automated approach.
TABLE 2 Experimental design based on a Latin square.
To measure precision, we first created a dataset containing all the refactorings accepted by the developers in all experiments. These refactorings were then manually evaluated by a team of three professional software developers and the first author of this paper. These developers did not take part in the experiments, but they are all very familiar with the three Java projects under consideration. The first author of this paper was involved in the validation process as he is the main programmer of one of the Java projects used in the experiments, namely RefDetect. In the end, the validated dataset contains the refactorings accepted by at least one of the developers and subsequently categorized as valid by all of the evaluators. While refactoring is a subjective process, the requirement that all evaluators must independently agree that a refactoring is valid is a rigorous test and gives us confidence that the validated refactorings are indeed ones that should be applied. Note that recall is not calculated, as we cannot determine all possible valid refactoring opportunities that exist in an application.

| Percentage of fixed code smells
For RQ2, we also measure the percentage of code smells fixed (PF) by the accepted refactoring solution proposed by each tool. We employ a formula used in similar research works. 16,19 More specifically, as expressed in Equation (3), we divide the number of code smells fixed by the employed approach by the total number of code smells that exist in the initial version of the Java project under consideration. 16,19

PF = (# fixed code smells / # total code smells) * 100 ∈ [0,100]. (3)

In our experiments, we consider five types of code smells, reported along with their descriptions in Table 3. These code smells were chosen as they are the ones most likely to be resolved by the refactoring types supported by our approach, as described in Table 1. Note that for some code smells, we only consider specific cases, depending on the information that exists in the model. For example, for the Duplicate Code smell, we do not support the Extract Method refactoring, as the information necessary for this refactoring does not exist in the model. As another example, for the Refused Parent Bequest smell, we only support pushing down methods and fields to the subclasses that use them; we do not support cases where the Replace Inheritance with Delegation refactoring is the solution, because this refactoring is not supported by Design-Imp.
To create a dataset of code smells in the applications, we employed three code smell detection tools, namely JDeodorant, §§ JSpIRIT, ¶¶ and a version of the Organic tool ## that we augmented to detect the Refused Parent Bequest smell, where a field in a superclass is only used in certain subclasses. We added this detection ability because, while we observed several instances of this smell in the ATM application (see Section 3.3), it is not detected by JDeodorant, JSpIRIT, or the original version of Organic. Using these tools, we created a tentative set of code smells.
The completeness of this dataset cannot be guaranteed, of course, because there may be code smells not detected by the employed code smell detection tools. However, all detected instances were rigorously validated by hand by four professional developers familiar with the systems under investigation, which gives us reason to have confidence in this dataset.
TABLE 3 Code smells considered in our study.

| Studied projects
We conducted our empirical study on three small- to medium-sized Java projects, summarized in Table 4 and elaborated upon in the following paragraphs.
The first project is a questionnaire management system (QMS) that can be used by teachers to create, manage, and publish questionnaires, and by students to complete a given questionnaire. 49 It has a simple design but contains some design defects, as discussed by Lin et al, 49 which made it an excellent choice for refactoring.
The second project is a simple Automatic Teller Machine (ATM) simulation application that was developed as an example of object-oriented programming. 50 Its exemplary design meant that there were few, if any, opportunities to improve it. Therefore, for this application, we manually changed the code structure to introduce some code smells into the design. For example, in several cases where a method uses some fields or methods of another class, we moved the method to the other class, thereby introducing the Feature Envy code smell. The changed project was then used in the experiments to provide more refactoring opportunities.
The final project is the RefDetect application, developed by the first author of this paper as a multilanguage refactoring detection tool. 51 We chose to study RefDetect as it is of medium size and the authors know the application very well, which is useful when it comes to assessing the various designs proposed by Design-Imp. A further benefit of using RefDetect is that our participants are familiar with, and interested in, refactoring, which helped them become familiar with RefDetect's design.
We avoided using large projects because the developers would not have had time to develop an understanding of them during the experiments, which would reduce the accuracy of their assessments of the appropriateness of the applied refactorings. However, we acknowledge that the scale of the projects used in the experiments is small compared with industrial projects. Therefore, a more complete analysis of how well the proposed approach works on industrial applications is left for future work.

| Study participants
We recruited nine participants with industrial backgrounds from different small and medium-sized software companies. All the participants have experience in Java programming, with between 1 and 4 years of programming experience in industry. None of the participants had prior experience using our approach, and none was the main developer of any of the Java projects used in the experiments. This information was gathered using a prestudy questionnaire with questions about the participants' background. The participants were not paid and took part in the experiments voluntarily. Note that we generally refer to the participants in our experiments simply as "developers."

| Parameter settings
As mentioned, we used NSGA-III as implemented in the MOEA framework. We changed some of the initial parameter values of the algorithm as follows: population size = 500, crossover probability = 0.4, mutation probability = 0.8, and maximum fitness evaluations = 100,000.
All these values were obtained by trial and error. In each interaction step, one refactoring sequence chosen at random from the Pareto front is shown to the developer for her/his evaluation. Other approaches are possible here, for example, choosing the solution that best matches the personal preferences of the developer or choosing a "knee point" on the Pareto front. 52

| Results and discussion
This section reports results for the four research questions originally presented in Section 3.1.
TABLE 4 Java projects used in this study.

RQ1 refers to the appropriateness of the proposed refactorings, that is, to what extent do they satisfy the developer? Table 5 shows descriptive statistics for the appropriateness of the refactoring sequences suggested by our interactive approach for each Java project. As shown, on average, the refactoring sequences proposed by our approach have an appropriateness value of over 0.8; that is, over 80% of the refactorings in each sequence are deemed acceptable by the developer. A degree of left skew is observed in each case. On inspection, the tail of relatively inappropriate refactoring solutions all occur in the early iterations of the refactoring process, indicating that our interactive approach successfully leads developers to more appropriate solutions.
Figure 7 shows a breakdown of all the refactorings, from the developer's point of view, across all experiments. Our interactive approach with visual cues resulted in fewer accepted refactorings than the other two approaches. In fact, only around 44% of the proposed refactorings were accepted by the developers using our interactive approach, ‡‡‡ which is less than the other two approaches.
A naive interpretation of this data suggests that our feedback to developers and visual cues are simply not helping. However, we now explore this further and show that our approach uses developer feedback to produce refactorings that are more appropriate from the developer's perspective. Also, when we consider RQ2, we will see that the results are more precise (i.e., unanimously agreed as valid by the evaluators), which implies that some of the refactorings accepted in the other two approaches are not really valid refactorings. This demonstrates that the visual cues provided by our tool helped the developers to decide better about the correctness of the proposed refactorings.
Figure 8 shows the average appropriateness of the refactorings proposed by our interactive approach throughout the execution of the algorithm. While there are some fluctuations in the graph for all three projects, the suitability of the proposed refactorings rises steadily in all projects.
This indicates that the employed search-based technique has been effective in incorporating developer feedback during the refactoring process. We also asked the developers in a poststudy questionnaire, "How did you find the ability of our interactive approach to learn from your feedback to suggest more appropriate refactorings?", and, based on a 5-point Likert-type scale, the average agreement of the developers with this question was 4.2, which is very positive.
To gain a better understanding of the refactoring process, Figure 9 shows the average number of refactorings rejected by the developers during the execution of the algorithm in all the experiments performed with both interactive approaches. As reflected in the figure, the number of rejected refactorings progressively decreases, which strongly suggests that the algorithm produces more appropriate refactorings as the refactoring process moves forward. The natural explanation for this phenomenon is that, in the early stages of the algorithm, not enough feedback has been provided by the developer for the algorithm to suggest appropriate refactorings. Over time, however, the algorithm uses the developer's feedback to direct the search process more efficiently and find relevant refactorings based on developer preference.
Other valuable information that can be extracted from Figure 7 relates to the refactorings manually applied by the developers (shown in black). As shown in Figure 7, far more refactorings were manually applied by developers using our tool than by developers using the other interactive approach (66 vs. 14). A careful examination of these manually applied refactorings showed that around 50% of them were proposed by the algorithm in other experiments. It is therefore very likely that, had they not been manually applied by the developers, these refactorings would have been proposed by the algorithm in subsequent iterations and, if approved by the developer, the number of refactorings accepted by the developer would have increased, thus increasing the measured ability of our tool to propose appropriate refactorings. Put another way, the fact that interacting with our tooling led developers to manually apply useful refactorings shows that the interactive refactoring process really helps them get to grips with refactoring the code.
Finding 1: Over the nine experiments performed with our interactive tool, on average 44% of the proposed refactorings are accepted by the developers.As the refactoring proceeds and developers provide more feedback, the tool is able to propose more appropriate refactorings.
Finding 2: More manual refactorings were applied by the developers who used our approach. This can be attributed to the help of the visual cues provided by our approach, as confirmed by the developers who applied the most manual refactorings.
‡‡‡ Note that Figure 7 looks at individual refactorings, while Table 5 reports on refactoring sequences. An individual accepted refactoring usually occurs in several refactoring sequences, hence the apparent anomaly in the percentage of accepted refactorings reported between Figure 7 and Table 5.

| Results for RQ2
RQ2 refers to the correctness of the proposed refactorings, that is, to what extent do they (1) correspond to the set of externally validated refactorings and (2) succeed in removing code smells? To answer this research question, the results were analyzed using two metrics: (1) the precision of the accepted refactorings (measured by Equation 2) and (2) the ability of the approach to fix code smells (measured by Equation 3).
Figures 10 and 11, as well as Table 6, summarize the results in this regard.
According to the results in Figure 10, the solutions provided by our approach have the highest average precision on all three Java projects. In fact, the average precision value for our approach is 89%, which is around 14% better than the second approach and 18% better than the fully automated approach. We also applied Cliff's delta (δ) 53 as an effect size measure to quantify the difference between the approaches. In our context, a positive value means that our approach yields better results than the other approach. As shown in Table 6, our interactive approach outperforms the other two approaches with an average δ value of 0.7, which corresponds to a large effect size.
The average appropriateness of the refactorings proposed by our interactive approach throughout the execution of the algorithm in the three Java projects.
FIGURE 7 Breakdown of all refactorings across all applications from the developer's point of view. In this breakdown, a refactoring is only counted once for each approach, regardless of how many iterations it appears in.
TABLE 5 Descriptive statistics regarding the appropriateness of the refactoring sequences suggested by our interactive approach. Note: In this breakdown, a refactoring may be counted many times, once for each refactoring sequence in which it appears.
FIGURE 10 The precision value of our approach compared with the other two approaches on the three Java projects. Each box represents three values, one from each developer. "x" marks their mean, while the middle value also represents the median.
The percentage of code smells fixed by our approach compared with the other two approaches on three Java projects.
The average number of refactorings rejected by the developers throughout the execution of the algorithm in the 18 experiments performed in both interactive approaches.
One point to note is that not only is the number of refactorings evaluated by the developers in our interactive approach nearly double the number evaluated in the fully automated approach (792 vs. 408, as shown in Figure 7), but the average precision of the refactorings in the fully automated approach is also much lower, as shown in Figure 10. Possible reasons for the superior performance of our approach are explored in Section 3.6.3 on RQ3.
We observe that the precision score of our approach is not consistently high across all Java projects. It is apparent from Figure 10 that precision declines with increasing project size, particularly in the case of the RefDetect application. We calculated the Pearson correlation coefficient between average precision and project size and found a moderate negative correlation of −0.67, with a p value of 0.047. This is attributable not just to RefDetect being the largest application, but also to its application domain being one with which the professional developers had little experience.

According to the results in Figure 11, the percentage of code smells fixed by our approach for all three Java projects is higher than for the other two approaches. In fact, using Equation (3), on average 58% of the detected code smells are fixed by our approach, which is around 12% better than the second approach and 53% better than the fully automated approach. The results in Table 6 also confirm this superiority: our approach works better than the other approaches, especially compared with the fully automated approach.
The lowest score in fixing code smells is obtained by the fully automated approach. This can be partially explained by the fact that fewer refactorings were accepted by the developers using the fully automated approach, as illustrated in Figure 7. We can also see that, even with our approach, on average only slightly more than half of the code smells are fixed. This is mainly because the fitness function that guides the refactoring process is defined as a combination of objective and subjective metrics, which do not necessarily correspond to each other. In fact, fixing code smells does not necessarily correlate with improving software quality metrics at all. 36 Furthermore, finding suitable quality metrics to detect all code smells is not a straightforward task. 54 On the other hand, developers may reject refactorings that fix certain code smells simply because they find those code smells irrelevant. 16 Indeed, as shown by previous studies, agreement on smell detection and correction is low. 55,56

Figure 11 also reveals another interesting observation: there is no correlation between project size and the percentage of code smells fixed using the interactive approaches. As illustrated, the weakest results were obtained for RefDetect, the largest system used in our experiments. On the other hand, more code smells were fixed in the ATM application than in the QMS application, although QMS is the smaller application. We identify three reasons for this inconsistency between application size and the percentage of fixed code smells in our experiments. First of all, RefDetect contains the most code smells of the applications studied, and it is simply not possible to detect and fix a large percentage of these smells in one refactoring session. Furthermore, as mentioned earlier, RefDetect is not the type of application that professional developers usually work with, and so their understanding of the system was lower than for the other two systems. This fact, along with the size of the system, caused some of the correctly proposed refactorings to be ignored or rejected by the developers. Finally, as described in Section 3.3, the ATM application was developed based on accepted object-oriented principles, and therefore, for this project alone, we reworked its code structure and manually injected a number of code smells. It is striking in Figure 11 that, for the ATM application, the interactive approaches stand out in removing a far higher fraction of the code smells. The obvious explanation is that human-injected code smells are very likely to be spotted and fixed by a human developer. By contrast, the fully automated approach, which of course lacks human insight, performs only slightly better on the ATM application than it does on the other applications.
To further investigate the refactoring types proposed by the three approaches, Figure 12 presents an overview of the refactorings proposed across the three approaches. As illustrated, 10 different types of refactoring were proposed overall, with Move Method and Move Field being the most common. The large number of these refactoring types is mainly due to their positive effect on fixing the Large Class and Feature Envy code smells. In addition, these types of refactoring have a positive effect on quality attributes such as QMOOD Functionality and Reusability, which form part of our fitness function.
We also investigated what impact developer experience had on the results. The developers were initially divided into three skill-equivalent groups, so the groups were almost the same in terms of experience. However, we analyzed the results for each developer to determine how their experience affected their performance. Figure 13 shows the results for each developer with respect to RQ2.
TABLE 6 Cliff's delta (δ) effect size for two metrics: precision and fixed code smells. Note: Labels of the approaches: A1 (our interactive approach), A2 (interactive approach without visual cues), and A3 (fully automated approach). Cliff's delta ranges from −1 to 1, where a value of −1 or +1 indicates the absence of overlap between the two groups, and 0 indicates that the group distributions completely overlap. 53

As shown, each group contains three developers, depicted here in order of increasing experience. Interestingly, we observe that, in general, the more experienced developers achieve better correctness. On closer examination, we discovered a few individual cases where a less experienced developer achieved better correctness than a more experienced developer in that group; for example, Developer 8 had a better result than Developer 9 in the interactive approach without visual cues (0.67 vs. 0.45). This corresponds to the intuition that, while more experienced developers are better overall, less experienced developers may well perform better on specific tasks.
Finding 3: Our approach based on visual cues helped developers to select correct refactorings, that is, those that were externally validated as correct, with precision values sometimes close to 100%. This is a key factor in developing trust in our approach. In addition, our approach was found to be more successful in fixing code smells than the other two approaches.

| Results for RQ3
RQ3 relates to the ability of our approach to improve developer comprehension during the refactoring process. To answer this research question, we used a poststudy questionnaire to collect the opinions of developers about their experience using the refactoring tool. The answers to the questionnaire questions can be seen at this link.45 In this section, we summarize developer feedback for these four questions.
FIGURE 12 Distribution of the refactoring types proposed by the three employed approaches.
FIGURE 13 The average correctness of refactorings accepted by each developer, averaged across all the experiments the developer took part in.
1. How would you rate our tool regarding its usefulness for understanding software design?
2. How would you rate our tool regarding its help in locating refactoring opportunities?
3. How would you rate our tool regarding its usefulness for showing the effect of refactorings?
4. How would you rate your overall experience with our interactive tool?
Findings based on a 5-point Likert-type scale are presented graphically in Figure 14. They show that the average agreement of the developers for the first, third, and fourth questions is 3.9, 4.7, and 4.8, respectively. We note that overall eight out of nine developers had an excellent experience with our interactive tool, which is a clear message that our proposed approach is a step forward in improving the usability of refactoring tools. However, the average agreement of the developers for the second question is only 3, which requires further exploration.
Unsurprisingly, we find that the developers who applied most refactorings manually answered the second question more positively than those who applied no or few manual refactorings. One of these developers stated that displaying the effect of a refactoring helped them to understand the reason for the refactoring very quickly, and that the colors used to display the refactoring opportunities in the classes related to the applied refactoring helped them to identify other related refactorings and apply them manually.
Examination of the data collected during the experiments shows that the developers who spent more time in the analysis phase (the phase before starting the iterative refactoring process proper) for the interactive approaches gave a higher score to the first question and also had excellent results for the precision of their applied refactorings. For example, Developers 2, 5, and 9 found our tool very useful in understanding software design. These developers, as shown in Table 7, also spent the most time in the analysis phase. Collectively, they also achieved better overall results in terms of precision§§§ than the other two developers who were using the same tool on the same Java project. One possible explanation for this observation is that the developers who spent more time in the analysis phase are probably those who have a "better eye" for software design. Their being better software designers could explain why they appreciated the tool's ability to help them engage with the software, which in turn improved their ability to refactor the software.
The only negative responses apparent in Figure 14 are those to Q2 regarding help in locating refactoring opportunities. It may be that while our approach proposes refactorings, it does not direct developers to the specific locations in the source code, and this may have been seen by some developers as an omission. It is worth noting that the two developers who ranked our approach as "little useful" (Developers 4 and 7) were relatively inexperienced and generally did more poorly in the experiments, as detailed in the next paragraph.
An interesting observation can be made when examining developer responses individually. In the questionnaire we asked "How difficult was it to interact with our interactive tool?" Five developers said it was "very easy," two developers said it was "easy," and two developers selected "neutral." These last two developers (Developers 4 and 7) also had the most ignored refactorings and the most rejected refactorings. From Figure 13, we can see that these developers had the lowest value for correctness (RQ2). Based on Table 7, these two developers spent the minimum time in the analysis phase when using our approach. They are also the least experienced developers in their group and gave the lowest marks to the questions in the poststudy questionnaire compared with the other developers.
Finding 4: The poststudy questionnaire results show that eight out of nine developers rated their overall experience with using our interactive search-based refactoring approach very positively. More specifically, there was high agreement on the benefit of the approach in helping developers to better understand the program design and also to understand the effect of refactorings in depth.

| Results for RQ4
RQ4 relates to the performance of our approach in terms of execution time and memory consumption. In our experiments, we measured the time taken by NSGA-III to find a sequence of refactorings based on developer feedback and also measured the time spent by the developer in evaluating the proposed refactorings in interactive sessions. All experiments were run on three near-identical desktop computers with the following hardware and software specification: Intel Core i5-3470 3.2 GHz with 8 GB of DDR3 memory and a 500 GB 7200 RPM HDD, running Windows 10 64-bit OS and Java SE 11 x64.
§§§ Developers 2 and 9 were the best in their group, while Developer 5 achieved a precision just 0.03 short of the best developer in their group.
Table 8 shows details of the time used by the employed search-based algorithm (NSGA-III) to find a sequence of refactorings and present it to the developer for evaluation. During this time, the developer has to wait to receive the results, so a short response time is very desirable. The results in Table 8 are based on 79 iterations performed in nine experiments using our interactive approach (as shown by the green box in Figure 16). As shown in Table 8, it took an average of 5.4 s for the search-based algorithm to find a sequence of refactorings based on the employed fitness function. This time is very small compared with the average time spent by the developer in an interactive session (5.4 s vs. 1740 s). Across all experiments performed using our interactive approach, the search-based algorithm took a total of 7 min, which is only 2% of the total time spent in those experiments (7 min vs. 300 min). These results indicate that our interactive approach is efficient in terms of execution time.
We also measured the memory consumed by our interactive approach, using methods from the Runtime Java library. We estimated memory consumption in a crude fashion by simply subtracting the memory used at the start of the process from the memory in use when the process terminates. Table 9 shows details of the memory consumed for all experiments performed with our interactive approach. On average our tool used 394 megabytes of memory, which is reasonable given that an average laptop will currently contain eight gigabytes of RAM.
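The crude measurement described above can be reproduced with the standard `Runtime` API: sample the used heap before and after the work and take the difference. The sketch below is our own illustration of this approach; the class name and the stand-in workload are hypothetical:

```java
public class MemoryProbe {
    // Used heap = heap currently claimed by the JVM minus its free portion.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedBytes();
        int[] work = new int[1_000_000]; // stand-in for the refactoring run
        work[0] = 1;                     // keep the allocation live
        long after = usedBytes();
        // Crude, GC-sensitive estimate, as in the measurement described above
        System.out.printf("approx. consumed: %d MB%n",
                (after - before) / (1024 * 1024));
    }
}
```

Note that this estimate ignores any garbage collection that occurs during the run, which is why the measurement is described as crude.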
While a fast response time can be effective in preventing the developer from getting tired and reluctant to continue the refactoring process,57 repetitive and time-consuming interactive sessions can also result in developer fatigue and quality reduction.30 To show that our proposed interactive approach is also efficient in this regard, we measured the number of "generate and review" iterations the developer engaged in and also the time the developer spent in each iteration evaluating the proposed refactorings.
As illustrated in Figure 15, on average the developers who used the fully automated approach spent less time evaluating the proposed refactoring solution. This was predictable because fewer refactorings were investigated by them, as shown in Figure 7. However, on average, the developers who used our proposed approach spent only 14% more time than those using the fully automated approach. This clearly means our interactive approach is also time efficient. Looking at the overall results also reveals that, on average, the interactive approach
TABLE 7 Time spent by each developer in the analysis phase, both as a duration in minutes and as a percentage of the total time spent in the refactoring process.

Note:
The times marked in bold correspond to our interactive approach with visual cues, while those marked in normal font correspond to our interactive approach without visual cues.The analysis time is not applicable to the fully automated approach.
FIGURE 14 The developers' answers based on a 5-point Likert-type scale to four out of nine questions of the poststudy questionnaire.
without cues tended to cause time-consuming interactive sessions. However, the average Cliff's delta (δ) value for the two interactive approaches is 0, which means there is no statistical difference between the execution times of the two interactive approaches. The same happens for the number of interactive sessions, as shown in Figure 16. This is a positive result: our approach with visual cues yields benefit while not putting a measurable additional burden on the developer.
Findings 5 and 6: The performance of our approach in terms of run time and memory consumption is perfectly acceptable from a pragmatic point of view. In addition, on average, the interactive time for our approach is only a little more than for the fully automated approach (30 min vs. 26 min). However, this small increase in time is more than outweighed by the higher precision achieved.

| THREATS TO VALIDITY
We consider four principal types of threat that can affect the validity of our experiments, namely, internal validity, construct validity, external validity, and conclusion validity.
Internal validity is concerned with the relationship between the treatment and its outcome. This must be a causal relationship, that is, the outcome must be a result of the applied treatment and not of some other factor that we have not taken into account. In our experiments the treatment is the refactoring approach: either fully automated, interactive, or interactive with visual cues. Other factors that may have caused the observed effects are the nature of the applications (one may be easier to refactor than another) or the developers themselves (one may be better at refactoring than another). We ameliorated this threat by designing the experiments as a Latin square so that each developer applied each refactoring approach to each application; thus, observed effects cannot be purely attributable to either developer or application alone.
Construct validity is concerned with the relationship between theory and observation, both in the treatment and the outcome. There is little concern regarding the treatment in our experiments, but matters are not so clear when it comes to determining what is a correct refactoring and how to measure the effect of refactoring. To reduce this threat, we regard a refactoring as correct only if four independent developers not involved in the experiments assess it as correct. This sets a high bar for acceptance and gives confidence that the refactorings we regard as correct truly are correct. While we use software metrics to measure the effect of a refactoring, this is only to give the developer a hint as to whether the refactoring is likely to be of benefit. The key effect we measure is whether the developer accepts the refactoring or not. We also use code smells to measure refactoring effect, although it is known that code smells are frequently ignored by developers in practice. We reduce this threat by combining this measure with other more reliable measures (appropriateness and correctness).
External validity refers to the generalizability of our findings, and there are several concerns here. The software applications used in our study were smaller than the applications usually developed in industry. However, it would not have been practical for developers to refactor a large application of which they have no experience. An observation that ameliorates this threat is that in practice developers do not usually refactor an entire application; they refactor only part of it, which resembles refactoring a smaller application. Another threat to generalizability is the participant cohort itself. Each participant had between 1 and 4 years of experience in the software industry. Results may have been different with a more experienced cohort. As explained in Section 3, seven participants were known to the first author, and this may have led to a bias in their responses. While this may be an issue for RQ3, which is based on a questionnaire, RQ1 and RQ2 are both based on objective measures that would not be prone to response bias. The number of participants, nine, is also a threat to generalizability. Although the number of participants seems small, it is close to the mean number of participants taking part in experiments reported in search-based software refactoring papers.18
Conclusion validity is the degree to which the conclusions we reach are warranted. The numbers of participants and software applications, nine and three, respectively, were too small for formal hypothesis testing to be used. We instead employed a variety of tables, graphs, boxplots, and bar charts to visualize the data, observe what effects were occurring, and determine possible explanations for these effects. Cliff's delta was used to measure an effect size where appropriate.

| RELATED WORK
The existing work related to this paper can be divided broadly into two research areas: refactoring of UML class models and interactive software refactoring.These topics are discussed in the subsections below.

| Refactoring of UML class models
The earliest work in the area of refactoring of UML models is that of Sunyé et al,58 who present a set of design-level refactorings and show how they can be designed so as to preserve the behavior of the UML model to which they are applied. Since then, a large body of work has addressed various aspects of design-level refactoring, as can be seen in recent surveys of the field.59,60 According to Misbhauddin and Alshayeb,59 the class diagram, due to its close similarity to the structure of object-oriented programs, is the most frequently used UML model for model refactoring.
Fourati et al61 propose an approach that detects antipatterns in UML models through the use of existing and newly defined design-level quality metrics defined on class and sequence diagrams, and show how their approach can be used to identify a number of well-known antipatterns. This work can be used to locate areas where design-level refactoring should be applied. In contrast, Jensen and Cheng25 use refactoring in a genetic programming environment to improve the quality of the design by introducing design patterns. Their idea of applying refactorings to the design extracted from source code to facilitate radical changes is similar to ours, but the authors do not provide a method to apply the proposed design-level refactorings to the source code, which is a key part of our contribution.
Enckevort62 built a prototype tool that could detect undesirable UML model features based on a selection of syntactic and semantic rules and metrics, and on this basis could propose repair actions to improve the model. Although not couched as such, this is essentially a smell detector that can suggest useful refactorings to the user. In their experiments with four industry models, they found that this prototype was able to improve model quality both in terms of measurable metrics and, more importantly, subjective comparison. Ghannem et al27 used an interactive genetic algorithm (GA) that interacts with users to add their feedback to a normal GA. The implemented tool was used to suggest sequences of refactorings that could be applied to models in the form of UML class diagrams. The extent to which an interactive approach could be applied, and validation of the correctness of proposed refactorings, were explored, and both showed promise.
An approach closely related to ours is that of Mansoor et al,28 who propose a multiview refactoring approach based on a multiobjective evolutionary algorithm to improve the design quality of a class diagram and its related activity diagrams. While the class diagram is the main artifact used during the design exploration phase, the activity diagrams are mainly used to minimize the number of violated behavior-preservation constraints defined on the activity diagrams. The implemented approach is compared with a mono-objective genetic algorithm63 and with one of our previous works not based on a heuristic search technique,32 and the authors report that the proposed approach produces more meaningful refactorings than the others. As in the approach presented in this paper, a multiobjective evolutionary algorithm is employed to transform the design into a better one; however, our approach is supervised by the developer to prevent the problems associated with the difficulty of comprehending the consequences of applied refactorings. Furthermore, we refactor the source code to match the desired design created in the design exploration phase, an important feature not implemented by Mansoor et al.28
Another closely related work is that of Simons and Parmee,24,26 who use an interactive, multiobjective evolutionary search technique to refactor a UML class model. Their aim is to optimize the class structure through the appropriate assignment of class features (methods and fields), based on subjective designer preferences and an objective, machine-calculated fitness function based on design coupling and cohesion. They also define four novel subjective metrics to study the role of symmetry and elegance during the design process. However, they consider neither interfaces nor inheritance between classes during the design process, even though these programming constructs are usually involved in a typical class model. In addition, the implemented approach only supports Move Method and Move Field refactorings, so the class structure cannot be changed through the addition or removal of classes during the design process. Our work differs from that of Simons et al. in that their work takes place during upstream software design before the design has been implemented, while our approach is applied to source code and so can be used in an Agile context where no formal design documentation is produced. Furthermore, our work takes the newly created UML design and uses it to refactor the source code, something that is not addressed in any of the existing literature.
Improving the program design towards a high-quality desired design, expressed as a class diagram, has been partially explored in one of our previous works.32 In that work, the proposed approach is divided into two phases: a detection phase and a reification phase. The detection phase involves activities related to detecting structural differences between the original and desired designs and expressing the differences as refactoring instances, while the reification phase includes activities related to refactoring the original program source code based on the detected refactorings. However, the method presented in this paper is fundamentally different from the previous one. First of all, in the previous paper, the desired design is manually created by the programmer based on the current software design and their understanding of how it may be required to evolve. In the current paper, by contrast, we employ an interactive search-based design exploration phase to create a UML-based desired design from the initial design extracted from the source code. We also provide visual cues to improve the developer's ability to uncover refactoring opportunities and enhance their assessment of refactoring value during the design exploration phase.

| Interactive software refactoring
The automation of the refactoring process is an active area,64 and impressive performance has been achieved to date.65 In this regard, the use of search-based refactoring has been growing over the years.66,67 However, there are still some barriers and challenges that should be addressed.17,68,69 Interactive multiobjective search has been proposed to improve the refactoring process by including developer feedback in the loop.17 However, this brings its own challenges. For instance, the possibility of developer error, and how to determine the number of interactions with the developers and the amount of feedback they should receive, are issues that need to be addressed properly.18
The idea of considering object-oriented source code improvement as a combinatorial optimization problem was initially studied by O'Keeffe and Ó Cinnéide,70 who proposed an automated search-based refactoring approach for improving the quality of object-oriented programs based on standard software quality metrics. However, it has been shown that an automated search-based refactoring technique does not necessarily lead to an improved version of the program.10,71,72 Including developers in the refactoring process, as suggested,20,72,73 may be a way to refactor software so that it is more helpful to the developers.
As one of the first works in this regard, Seng et al8 formulate the refactoring process as an optimization problem. However, contrary to O'Keeffe and Ó Cinnéide,70 it is the designer's responsibility to decide whether or not a proposed refactoring should be applied to the program. While this helps to capture the qualitative aspects of software quality, if the designer rejects some of the proposed refactorings, the result might be a refactoring sequence that is worse in terms of quality than other sequences that were not suggested as good solutions to the designer.74
Alizadeh et al16 propose an interactive multiobjective refactoring recommendation approach, based on NSGA-II, where the fitness function that guides the search is defined as a combination of (i) reducing the number of code smells, (ii) improving the semantic coherence of the program, (iii) improving the quality metrics in use, and (iv) minimizing the number of proposed refactorings. To increase the acceptability of the proposed refactorings from the developer's point of view, the proposed refactorings are ranked based on their occurrence in the nondominated solutions.
The hypothesis is that the refactorings that appear most frequently in the nondominated solutions are likely to be the good ones.16 The highly ranked refactorings are presented one by one to the developer, who can accept, reject, or modify the proposed refactorings. The ranking of refactorings is dynamically updated based on the developer's feedback on each refactoring. In order to update the nondominated refactoring solutions, the refactoring recommendation algorithm is rerun periodically after a number of interactions with the developer. This time, however, apart from the objectives mentioned above, the feedback from the developer is also included in the fitness function. The authors evaluate their approach against five other multiobjective refactoring detection approaches, including one based on NSGA-II with similar objectives to those mentioned above, but without interaction. The results show the effectiveness of developer interactions in improving the proposed refactorings, with more appropriate refactorings from the developer's point of view being suggested by the proposed approach.
In order to reduce the search space, and in this way reduce the developer's interaction during the refactoring process, Alizadeh and Kessentini30 extend their previous approach16 and use a clustering technique to group the solutions based on their similarity in the objective space. The resulting clusters are shown to the developer using colored graphical line charts, and the developer can select a cluster for further investigation. The developer's feedback on each cluster and the refactorings included in the investigated clusters are included in the fitness function to produce more appropriate solutions in the next iterations. In effect, when a cluster is confirmed by a developer as an appropriate one, the algorithm focuses more on solutions in that cluster, in this way removing solutions not relevant to the developer. The proposed approach is compared with three other existing approaches, including the one presented in their previous work16 (without the clustering feature), and the results show the ability of the clustering-based multiobjective technique to reduce the search space and propose more appropriate refactorings to the developer.
In a recent work, Rebai et al20 extend the approach proposed by Alizadeh and Kessentini30 by introducing another level of clustering to the refactoring process in order to group the solutions based on their code locations as well. The new clustering algorithm is applied to the clusters of the objective space that are selected by the developer as appropriate in the first level of interaction. The added clustering level allows developers to focus more on code locations that are relevant to them. The proposed approach is compared with four other existing approaches, including the one presented by Alizadeh and Kessentini,30 and the results show the ability of the proposed approach to reduce interactive refactoring effort.
Lin et al49 extend the idea of refactoring towards a desired design15,32 and provide an interactive tool, Refactoring Navigator, capable of recommending a sequence of refactorings to increase the consistency between the initial and desired designs. In the proposed approach, as a first step, a reflexion model75 is extracted based on discrepancies between the initial design and the desired one depicted manually by the developer. As the next step, Refactoring Navigator uses a search-based technique to identify refactorings that can reduce the discrepancies between the two designs. The recommended refactorings are presented to the developer, who can accept, reject, or ignore them. The reflexion model is updated according to the accepted refactorings, and the developer feedback is included in the search process. The process repeats until the desired design is achieved or no refactoring can be recommended by the tool. The authors evaluate their tool in a controlled experiment with 18 students, where the students who used Refactoring Navigator accomplished their refactoring tasks more effectively and efficiently.49
In summary, the similarity between the search-based refactoring approach presented in this paper and the aforementioned works, aside from the fact that all take place after the software has been implemented as source code, is that they all consider design improvement as an interactive optimization problem. However, our work is different in that it proposes a model-based source code refactoring approach to refactor the program based both on its desired design and on its source code. The desired design is created through an interactive multiobjective evolutionary search technique, and the source code is then automatically refactored to the desired design.

| CONCLUSIONS AND FUTURE WORK
We have presented a novel interactive refactoring approach that involves both design and source code in the refactoring process. Initially, a UML-based model is extracted from the source code, enhanced with visual cues to highlight areas of poor quality, and presented to the developer. Then, an iterative refactoring process commences that allows the developer to guide the software design from its current state to a more desirable design. Finally, the source code is automatically refactored to comply, as closely as possible, with the desired design. To reify these ideas, we have constructed two software refactoring tools called Design-Imp and Code-Imp as design-level and source-level refactoring tools, respectively.
We evaluated our interactive tool on a set of three Java applications using nine professional developers as participants. The results show that our interactive approach is more effective than a fully automated refactoring tool and than our interactive tool without visual cues. More specifically, the results demonstrate that our approach yields refactoring sequences that are more acceptable both to the individual developer and to a set of expert refactoring evaluators, as well as being more successful at removing code smells.
There are several directions for future work in this area. Extending the tools with further refactorings and evaluating them with other design-level and source-level metrics may yield better results in terms of design exploration and subsequent source code refactoring. There is also a tension between how free and unfettered the design exploration process is and the ability of the source code refactoring to reach the desired design. Further experimentation is required to determine where the "sweet spot" lies. It is also important to test how our approach would work on larger systems. In Section 3.6.2, we noted that as the interaction with the developer progresses, our algorithm is able to make more and more acceptable refactoring proposals. This is reassuring and provides some cause to be optimistic that in refactoring a larger application over a longer period of time with many iterations, the percentage of accepted refactorings would also improve.
Probably the most important area for future work is that of developer interaction in the course of the design exploration phase. Issues such as the number of developer interactions during the refactoring process, the number of refactorings shown to the developer in each interactive session, and the manner in which developer feedback is incorporated into the fitness function are all underexplored areas that require further research.18
We claim that model-based source code refactoring with visual cues as proposed in this paper is a natural extension of the dominant practice of source-level refactoring, and that the experiments provide solid evidence that the approach helps developers to create appropriate refactoring sequences. We anticipate that it can serve a role in aiding developers in performing radical refactoring in a practical setting.

FIGURE 4 Part of the class diagram of the Questionnaire Management System. Cohesion is represented here using three colors: red = high cohesion, yellow = intermediate cohesion, and gray = poor cohesion. A poor quality class is a likely candidate for refactoring. While Design-Imp can show inheritance and call relationships between classes, the only interclass relationship depicted here is inheritance.
FIGURE 5 Screenshot of Design-Imp.

TABLE 1 A list of refactorings provided by Code-Imp and Design-Imp.
Method-level refactorings: Push Down Method, Pull Up Method, Move Method, Rename Method, Decrease/Increase Method Accessibility.
Field-level refactorings: Push Down Field, Pull Up Field, Move Field, Rename Field, Encapsulate Field, Decrease/Increase Field Accessibility.
Class-level refactorings: Make Class Concrete/Abstract, Rename Class, Extract Hierarchy, Collapse Hierarchy, Extract Subclass/Superclass, Extract Class, Inline Class, Replace Inheritance with Delegation, Replace Delegation with Inheritance.
Note: The refactorings Replace Inheritance with Delegation (and vice versa) and Make Class Concrete/Abstract are provided by Code-Imp but not by Design-Imp.

TABLE 9 Memory consumption in megabytes.
FIGURE 15 Time spent by the developers interacting with Design-Imp. Each box represents three values, one from each developer, and "x" marks their average.
FIGURE 16 Number of iterations in the two interactive approaches (with and without cues). Each box represents three values, one from each developer, and "x" marks their average.
TABLE 8 Search-based execution time in seconds.
Groups 1-3 indicate the three groups of developers. Abbreviations: AUT, fully automated approach; CUE, interactive with visual cues; INT, interactive without visual cues.
Large Class: A class having huge dimensions and implementing different responsibilities. Fixed by Extract Class/Subclass, Move Method/Field.
Feature Envy: A method making too many calls to methods or fields of another class. Fixed by Move Method/Field.
Refused Parent Bequest: A subclass does not use the protected methods/fields of its superclass.