A new perspective on the competent programmer hypothesis through the reproduction of real faults with repeated mutations

The competent programmer hypothesis is one of the fundamental assumptions of mutation testing, which claims that most programmers are competent enough to create correct or almost correct source code. This implies that faults should usually manifest through small variations of the correct code. Consequently, researchers assumed that the synthetic faults injected in source code through the mutation operators closely resemble the real faults. Unfortunately, it is still unclear whether the competent programmer hypothesis holds, as past research presents contradictory claims. Within this article, we provide a new perspective on the competent programmer hypothesis and its relation to mutation testing. We try to re‐create real‐world faults through chains of mutations to understand if there is a direct link between mutation testing and faults. The lengths of these paths help us to understand if the source code is really almost correct, or if large variations are required. Our experiments used a state‐of‐the‐art benchmark database of real faults named Defects4J 2.0.0. It contains 835 reproducible real‐world faults in 17 open‐source projects that comprise a total of 1044 bug‐fix pairs of files. Our results indicate that while the competent programmer hypothesis seems to be true, mutation testing is missing important operators to generate representative real‐world faults.

The competent programmer hypothesis claims that most programmers are competent enough to produce source code that is either correct or that only differs slightly from the correct code. Conversely, the small variations introduced by a mutation testing framework would then be able to mimic most real types of faults. Therefore, if tests are not able to distinguish between the original code and mutants that mimic real faults, then the tests are not able to distinguish between correct and defective code either [5][6][7][8].
There are multiple extensive and rigorous studies on the coupling effect (e.g. [9][10][11][12][13]) that come to the same conclusion: the coupling effect exists and, therefore, the capability of a test suite to detect mutants is correlated with its capability to detect real faults.
The evidence regarding the competent programmer hypothesis is not conclusive. For example, Andrews et al. [9] found that mutants are similar to real faults, while Gopinath et al. [7] found that real faults differed significantly from mutants. Papadakis et al. [1] note that there is a lack of new evidence regarding the competent programmer hypothesis in recent years.
Due to the importance of the competent programmer hypothesis for the foundations of mutation testing, we believe that sound evidence is important to understand the underlying principles of mutation testing and to define effective and efficient mutation testing strategies. Within this article, we provide new evidence regarding the competent programmer hypothesis and use a novel approach to study the relationship between real-world faults and mutation testing. Instead of directly comparing real-world faults with mutants, we rather try to determine if we can reproduce faults through the repeated application of mutation operators. Such repeated mutations are typically referred to as higher-order mutants [1]. We apply graph-based pathfinding to the Abstract Syntax Trees (ASTs) of the fixed and faulty programmes. The paths themselves consist of chains of mutation operations. If we can find a chain of mutation operators that transforms the AST of a fixed fault back to the original fault, this means that there is a higher-order mutant that can exactly reproduce the fault. Otherwise, the fault cannot be reproduced by the available mutation operators. Hence, the existence of the path allows us to directly study the similarity of faults with higher-order mutants. Consequently, the research question we try to answer within this article is as follows: RQ: Can real software faults be recreated by higher-order mutants?
The answer to this question is directly related to the competent programmer hypothesis. If we can recreate the faults, the small mistakes that are introduced by mutations are not only correlated with but also representative of real-world faults, which would support the hypothesis. Otherwise, we would have evidence against the competent programmer hypothesis, as not only single mutations but also their combinations would not be representative of real faults.
The contributions of our study are the following:
1. We found that while the competent programmer hypothesis seems to be true, commonly used mutation operators are not representative of real-world faults and cannot be reliably used to reproduce these faults.
2. We identified that one significant aspect lacking in mutants is the introduction of new method calls and code blocks, which would make them more closely resemble real faults. To address this, one feasible solution could be learning application-specific mutation operators, since the generic mutation operators designed for a programming language are likely not effective.
The remainder of the article is structured as follows: We discuss the related work in Section 2, followed by our approach in Section 3. Section 4 describes our experiments, including the data, mutation operators, measurements and results. We discuss our results in Section 5 and consider threats to the validity of our work in Section 6. Finally, we conclude in Section 7.

2 | RELATED WORK
The focus of the discussion of the related work is on the competent programmer hypothesis. For a general discussion of the literature on mutation testing, we refer readers to the recent review by Papadakis et al. [1].
While the competent programmer hypothesis was studied in the past, there are still open questions. When DeMillo et al. [3] proposed mutation testing, they recognised that it relies on the assumption that 'Programmers have one great advantage that is almost never exploited: they create programs that are close to being correct!'. However, DeMillo et al. [3] provided no proof that this assumption is true. Initial empirical works to provide evidence for the correctness of the competent programmer hypothesis were conducted by DeMillo and Mathur [4] and Daran and Thévenod-Fosse [14]. In their studies, they manually evaluated sets of faults from single projects and came to different results. While DeMillo and Mathur [4] found that about 20% of faults were simple, Daran and Thévenod-Fosse [14] found that about 85% of faults were simple and in line with the competent programmer hypothesis. Later, Andrews et al. [9] came to the conclusion that real faults are similar to mutants generated by mutation operators, but that mutants were harder to detect. However, Namin and Kakarla [15] raised several concerns regarding the study by Andrews et al. [9], for example, due to the choice of mutation operators and the selection of seeded faults for a single project. Therefore, while most early work seems to support the competent programmer hypothesis, it is unclear how the results generalise due to the limited scope of the studies and the problems with validity that were raised.
To the best of our knowledge, the only large-scale analysis of the competent programmer hypothesis was conducted by Gopinath et al. [7]. They evaluated over 4000 open source projects in C, Java, Python and Haskell and analysed the number of tokens changed by bug fixes tracked through the issue tracker on GitHub. They found that most real faults differed significantly from the correct programme version and concluded that: "… our understanding of the competent programmer hypothesis, at least as suggested by typical mutation operators, may be incorrect" ([7], p. 197).
However, there are two potential issues with the validity of these results. First, just because something is marked as a bug in an issue tracker does not mean that it is really a bug. Research shows that about 40% of bugs are mislabeled and are actually requests for improvements or other changes [16,17]. Second, the analysis through tokens may overestimate the differences. A single mutation operator can modify multiple tokens. Additionally, tangled changes could lead to a further overestimation of the difference, because there is a non-trivial amount of unrelated changes within bug fixing commits [18,19]. Finally, we believe that Gopinath et al. [7] use a very strict interpretation of the competent programmer hypothesis to come to their conclusion. They assume that the competent programmer hypothesis requires single mutations, that is, first-order mutants, to be sufficient to reproduce faults. We disagree with such a strict reading of the hypothesis, because this would basically mean that competency and small variations are only achieved if faults can be fixed by touching a single line of code, usually without modifying the complete line. This would also mean that higher-order mutants are not in line with the competent programmer hypothesis, which we also believe is too strict.
We want to overcome the weaknesses of the study by Gopinath et al. [7] in two ways. First, we use a set of validated bug fixes as the foundation for our analysis. This avoids noise due to mislabeled issues and tangling. In comparison to other work with validated data [4,9,14], our analysis is not limited to a single project. Second, we do not measure the difference in tokens, but rather the order of mutation required to reproduce the fault. This way, we determine not only how many mutations are required but also if the mutation operators we use are sufficient to reproduce faults. The drawback of our work in comparison to Gopinath et al. [7] is that the scope is smaller, that is, we consider only 17 Java projects. Thus, our study should be seen as complementary to the work by Gopinath et al. [7], with greater construct and internal validity at the cost of external validity.

3 | APPROACH
In this section, we present our approach to study the relation between mutation operators and real-world faults. Our study explicitly does not cover whether the mutation operators we use are 'good' or 'bad' in the sense that 'if these mutations are killed, we will likely also be able to find real faults', as this would be a study of the coupling effect. Instead, the main question is whether it is possible to find a sequence of mutations that can recreate real faults from corrected source code. Through this, we want not only to study if this is possible with a given set of mutation operators but also which additional mutation operators may be useful to create mutations that mimic real faults. Additionally, our approach should enable us to analyse the necessity for higher-order mutation operators or whether first-order mutation operators are sufficient. This means that we are interested in the shortest sequence of mutations that can recreate a fault.
To achieve this, we propose an algorithm that takes a pair of source code files and returns a sequence of mutation operators, together with the locations where they were applied and additional information about the values changed. Figure 1 summarises our approach. Based on a database that contains bugs and the associated bug fixes, we create the ASTs of the fixed and faulty files. Then, we use a pathfinding algorithm to find a sequence of mutations that transforms the AST of the fixed file into the AST of the faulty file. We now describe this approach in greater detail.

3.1 | AST transformations
Instead of trying to find a set of mutation operators that would mutate a source code file from a fixed to a defective state directly, we first transform the source code into ASTs. Applying mutation operators on the AST instead of the source code directly has several benefits. We avoid issues due to whitespace or comments. Furthermore, with an AST parser in place, the implementation of mutation operators is a lot simpler, because ASTs are easier to reason about. In contrast, the direct comparison of source code can be challenging, because string matching is prone to noise [20]. Moreover, the usage of ASTs allows us to determine sequences of mutation operators that could reproduce faults without actually mutating the source code. Instead, we have a graph transformation problem, where the mutation operators become modifications of the AST, for example, the addition, deletion, replacement or movement of AST nodes. Through the repeated application of mutation operators, we can achieve more complex transformations of the AST.
We can formally describe this as follows: Let S_fix be the corrected source file and S_bug the source file that contains the real-world bug, with ast(S_fix) and ast(S_bug) the corresponding ASTs. Moreover, we define a set of mutation operators M = {m_1, …, m_n} as functions m_i : A → A, where A is the space of all possible ASTs and i = 1, …, n. A successful transformation of S_fix into S_bug is possible if there is a finite sequence p = (m_1, …, m_k) ∈ M^k such that

p(ast(S_fix)) = (m_k ∘ … ∘ m_1)(ast(S_fix)) = ast(S_bug). (1)

We also refer to a sequence of mutations as a path. Please note that we use p to refer to the path both as a sequence and as the concatenation of the mutation operators. The length k of the path is then the order of the mutation, that is, if we need a path of k mutations, we have a k-th order mutant.
For example, in the case of Figure 1, consider a set of two mutation operators M = {m_1, m_2}. The first operator m_1 can replace the AST node Token '+' in Figure 1a with the node Token '-'. The second operator m_2 can replace the identifier in the last node Identifier 'b' in Figure 1b with Identifier 'a'. The path (m_1, m_2) can then mutate the AST in Figure 1a such that the fault is reproduced, as illustrated in Figure 1c. Thus, there is a second-order mutant that can reproduce this fault.
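The two mutation steps above can be sketched with Python's built-in `ast` module standing in for a Java AST parser (the operator classes are our own illustrative names, not part of any mutation tool):

```python
import ast

class PlusToMinus(ast.NodeTransformer):
    """m1: replaces the binary operator '+' with '-'."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

class RenameBToA(ast.NodeTransformer):
    """m2: replaces the identifier 'b' with 'a'."""
    def visit_Name(self, node):
        if node.id == "b":
            node.id = "a"
        return node

fixed = ast.parse("a + b", mode="eval")        # ast(S_fix)
first_order = PlusToMinus().visit(fixed)       # first-order mutant: a - b
second_order = RenameBToA().visit(first_order) # second-order mutant
print(ast.unparse(second_order))               # a - a, i.e. ast(S_bug)
```

Applying only one of the two transformers yields a first-order mutant; only their composition reproduces the fault exactly.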

3.2 | Finding mutation sequences
The existence of the path tells us if the error made by the developer that led to the fault can be reproduced by the artificially defined subset of errors that are the mutation operators. The length of the path indicates the number of simple errors required by the mutation model to reproduce real faults, that is, which order of mutations is required to generate mutants that closely resemble real faults.
FIGURE 1 The execution of the general approach on example Abstract Syntax Trees (ASTs).
The naive approach would be to exhaustively search the space of possible paths, starting from the shortest path with a breadth-first search. This means that we would first consider all mutation paths of length k = 1, then of length k = 2, and so on. However, the size of the search space grows exponentially with the number of mutation operators. Additionally, mutation operators can usually be applied not only to a single AST node but to many AST nodes. Even in the short example from Figure 1, the replace identifier mutation could be applied to two nodes. Thus, such an exhaustive search is not feasible.
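A back-of-the-envelope calculation illustrates the problem. If we assume, purely hypothetically, 10 mutation operators that are each applicable at 20 AST nodes, the number of candidate paths of length k already grows as (10 × 20)^k:

```python
operators, sites = 10, 20      # hypothetical numbers for illustration
branching = operators * sites  # candidate mutations per search step
for k in range(1, 5):
    print(f"candidate paths of length {k}: {branching ** k:,}")
# level 4 alone already contains 1,600,000,000 candidate paths
```

Real programs have far more than 20 mutable AST nodes, so the search space explodes even faster in practice.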
Instead, we decided to use the A* algorithm [21] to search the space of possible mutation paths efficiently. The A* algorithm searches the possible paths with a best-first approach. Paths are generated based on the costs g of the path so far and an estimation h of the remaining costs to reach the target. With suitable functions to estimate the costs g and h, the A* algorithm is asymptotically optimal [22].
To formulate our problem as a path search problem, we define a directed graph G = (V, E) with the set of nodes V = A, that is, all possible ASTs. The set of edges E ⊆ A × A represents the mutations. For two nodes ast_1, ast_2 ∈ V, a directed edge (ast_1, ast_2) exists if and only if m(ast_1) = ast_2 for any mutation operator m ∈ M. We define the costs of a path as the number of mutations required, which is equivalent to the number of edges of the path in the graph problem, that is,

g(p) = k. (2)

For instance, the number of mutations k needed to transform the fixed AST shown in Figure 1a into each of the ASTs given in Figure 1a-c is 0, 1 and 2, respectively.
The estimation of the remaining costs is more difficult, as this requires us to estimate how many more mutations are required to reach the target AST. The exact value can, by definition, only be known if we already know the minimal path. Thus, we need an indirect way to estimate these costs. In our case, we rely on GumTree [23], a tool that can determine a path of transformations between ASTs to perform AST differencing. In comparison to our approach, GumTree can freely manipulate the ASTs, whereas we are limited to the modifications supported by the set of mutation operators M. Due to this, the paths of GumTree may be completely different from our mutation paths. Regardless, the length of the GumTree path is a good measure of the AST similarity, as a shorter GumTree path means that fewer modifications are required to finish a mutation path. Thus, we can estimate the remaining effort for a path p as

h(p) = ast_diff(p(ast(S_fix)), ast(S_bug)), (3)

where ast_diff : A × A → ℕ is the number of edit operations determined by the AST differencing algorithm of GumTree. For example, consider Figure 1a as S_fix and Figure 1c as S_bug to compute the remaining cost of the path by using Equation (3). The GumTree AST differencing algorithm generates four edit operations, that is: (1) insert node Token '-', (2) insert node Identifier 'a', (3) delete node Token '+' and (4) delete node Identifier 'b'. Figure 1b depicts the situation after the application of two of these edit operations, that is, 1 and 3, to Figure 1a. Now, the remaining cost of the path from the AST given in Figure 1b to the AST in Figure 1c is two edit operations. Hence, the values of h(p) are 4, 2 and 0 when considering the ASTs given in Figure 1a-c as S_fix, respectively, while maintaining the AST given in Figure 1c as S_bug.
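The complete search can be sketched as follows. Single-character strings stand in for ASTs, two toy operators stand in for M, and the number of differing characters stands in for GumTree's ast_diff; all names are illustrative assumptions, not our actual implementation:

```python
import heapq

def find_mutation_path(start, goal, operators, h):
    """A* search: g = number of mutations so far, h = estimated remaining edits."""
    frontier = [(h(start, goal), 0, start, [])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path  # a shortest sequence of mutation operators
        for name, op in operators:
            succ = op(node)
            if succ != node and best_g.get(succ, float("inf")) > g + 1:
                best_g[succ] = g + 1
                f = g + 1 + h(succ, goal)
                heapq.heappush(frontier, (f, g + 1, succ, path + [name]))
    return None  # fault not reproducible with these operators

# stands in for GumTree's ast_diff on equal-length strings
diff = lambda s, t: sum(a != b for a, b in zip(s, t))
ops = [("replace '+' with '-'", lambda s: s.replace("+", "-")),
       ("replace 'b' with 'a'", lambda s: s.replace("b", "a"))]
path = find_mutation_path("a+b", "a-a", ops, diff)
print(path)  # two mutations, i.e. a second-order mutant
```

The heuristic guides the search towards states that GumTree considers closer to the faulty AST, so only a fraction of the exponential search space is expanded.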

3.3 | Partial reproduction of faults
Because the mutations of the AST are limited by the expressiveness of the mutation operators in M, it is likely that not all bugs S_bug can be reproduced through a mutation path p starting from ast(S_fix). However, how much of the difference between ast(S_fix) and ast(S_bug) can be reduced by a path p is also important information for our study. We use the AST differencing of GumTree to measure the progress of the path p towards a complete reproduction of a fault as

progress(p) = 1 - h(p) / ast_diff(ast(S_fix), ast(S_bug)). (4)

To compute the progress of the path p for the ASTs given in Figure 1, consider the values of h(p) calculated for each AST depicted in Figure 1a-c below Equation (3). The GumTree AST differencing algorithm generates four edit operations when we consider the fixed AST given in Figure 1a as S_fix and the faulty AST illustrated in Figure 1c as S_bug. The values of progress(p) from S_fix towards S_bug computed at each AST shown in Figure 1a-c are as follows: 1 - 4/4 = 0, 1 - 2/4 = 0.5 and 1 - 0/4 = 1, respectively.
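Under these definitions, progress is simply the share of the initial AST difference that a path has already removed. A minimal sketch with the numbers from the worked example (the function name is ours):

```python
def progress(remaining_edits, initial_edits):
    """Share of the fixed-to-faulty AST difference covered by the path so far."""
    return 1 - remaining_edits / initial_edits

# Figure 1 example: 4 GumTree edit operations initially,
# 2 after the first mutation, 0 after the second mutation.
print([progress(h, 4) for h in (4, 2, 0)])  # [0.0, 0.5, 1.0]
```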
Since the A* algorithm minimises h(p), we can still use the algorithm to find a mutation path with the best progress towards reproducing the bug, even if the algorithm does not converge and find a complete path. We modified the algorithm to store partial paths in a cache once the path search fails to find a successor that is a better solution.

4 | EXPERIMENTS
We conducted an experiment based on the Defects4J data [10] to gain insights into the competent programmer hypothesis. The nature of our experiment is exploratory. The reason for this is that the prior work on the competent programmer hypothesis is inconclusive and does not allow for the derivation of a clear hypothesis that we could confirm. Instead, our goal is to provide additional evidence regarding the competent programmer hypothesis that could be confirmed by future work. In the following, we describe the data, mutation operators, evaluation measures, methodology of the experiments and results. Our results and implementation are available online.* †

4.1 | Data
The foundation of our experiments is the Defects4J dataset first published by Just et al. [10]. Table 1 gives an overview of the data of the 17 projects provided in the currently available version of Defects4J (v2.0.0).‡ Overall, we used 835 bugs in our experiments. Unfortunately, we failed to execute our approach on some of the available bug-fix pairs due to instrumentation issues, that is, because GumTree failed to find a path between two ASTs as the change size or the programme size was too large. Due to this, we had to exclude 33 bug-fix pairs from the dataset for our experiments.
Since our approach is limited to the modification of a single AST, we cannot directly apply it to the reproduction of faults that affect multiple files. There are three possible solutions:
1. Exclude all faults that affect multiple files;
2. Consider a fault as reproduced if and only if we determine that the ASTs of all affected files can be transformed; or
3. Consider all files individually and consequently treat faults that affect multiple files as multiple faults.
The first approach would reduce our data and also skew the data towards possibly simpler faults, further affecting the validity of the results. The second approach has the advantage that we would still consider faults as the atomic units of our analysis. However, the results would be harder to interpret and possibly hide valuable information. For example, if a fault affected three files and two could be fully reproduced, but the third file could not be reproduced, this valuable information would be hidden. This is the advantage of the third approach, that is, we see for each file change if it can be reproduced. The drawback is that our analysis unit effectively becomes partial faults, because we do not consider the complete bug at once, but only the changes to an individual file. However, in most cases, exactly one file was changed, so there is only a small amount of noise: we only have 25% more file changes than faults. Consequently, we decided on the third approach and report the results for each file change that is part of a bug fix. Overall, we have 1044 file changes, of which 1011 were included in the study.

4.2 | Mutation operators
The choice of mutation operators directly affects our analysis. A large set of mutation operators is more expressive and could reproduce a larger number of faults. However, our pathfinding algorithm A* has a worst-case exponential memory complexity of O(b^d), where b is the branching factor, that is, the average number of successors of a step in the path search, and d is the length of the path. Thus, we must find a middle ground: a sufficiently large set of representative mutation operators for meaningful results, while not using every possible operator, to bound the exponential nature of the A* algorithm by controlling the branching factor.
We decided to use a tool-driven approach for the selection of a suitable set of mutation operators. Our rationale was that mutation operators used by tools should be designed with the competent programmer hypothesis in mind and, therefore, be representative of the small variations that should be expected based on the competent programmer hypothesis for a given technology, in our case, Java.
MuJava is a well-known mutation testing framework for Java [24] and has many operators from a variety of classes. The main focus of MuJava lies on mutation operators for object-oriented programming and on extensibility with additional mutations. However, MuJava also provides a set of method-level operators. These operators are, in principle, similar to the operators of other frameworks, in the sense that they modify or delete existing arithmetic operations or logical statements.
Major is another mutation testing framework with a strong focus on extensibility and customisability. Instead of applying the mutations at the source code level, Major can integrate itself into the compiled byte code at compilation time [25]. One big difference between Major and other mutation testing frameworks is the ability to define and configure mutations through the domain-specific language Mml [26]. However, Major does not provide guidance regarding a subset of suitable mutation operators that are recommended for practical use. Due to the exponential nature of our approach, it is vital that we have a reasonably small set of mutations. While we could determine a subset from Major on our own, we rather want to re-use an existing set of mutations without selections on our part, which could lead to problems with the reliability of our research.
Pitest (or PIT) was originally intended to extend JUnit tests to allow them to be run in parallel. After that problem was solved, the authors decided to use their enhancement to apply mutations during the test execution and, thereby, developed their project into a fully functional mutation testing framework [27]. Due to its easy integration into the build process, Pitest is one of the drivers of the practical relevance of mutation testing, highlighted by frequent mentions in developer forums and blogs. Pitest has four sets of mutation operators, with a default set of 11 operators [28].
We also evaluated other Java mutation testing frameworks (e.g. Javalanche [29], Jumble [30] and Jester [31]) but discarded them as options because they are no longer actively developed.
Based on our assessment of the tools, we decided to use the state-of-the-art default and optional (i.e. old defaults, stronger and all) mutation operator sets of Pitest§ as the foundation for our work, which we summarise in Table 2. These mutation operators are widely used and very expressive for logical changes that are conducted within methods, that is, mutations that do not change the interface of methods and, consequently, classes. We also note that these operators or similar operators are also available in Major and MuJava. In the following, we refer to the default set of mutation operators as M_default and the combination of M_default and the optional operators as M_optional. We defined five additional mutation operators, given in Table 3, that should help to fix some shortcomings of M_default and M_optional, especially with respect to finding mutations that require specific values of literals or where references should be

§ http://pitest.org/quickstart/mutators/

TABLE 2 Pitest mutation operators we used. The descriptions of the default and optional sets of Pitest [28] are separated by the line.

Conditionals Boundary
Replaces relational operators such as >, >=, < and <= with a different relational operator.

Increments
Replaces increments with decrements and vice versa.

Invert Negatives
Inverts the sign of floating point and integer values, either applied to a variable or directly to hard-coded values.

Math
Replaces a mathematical operator (+, *, %, |, …) with another mathematical operator. The plus operator in string concatenations ("A" + "B") is excluded because it is not considered to be a mathematical operator.

Negate Conditionals
Similar to the conditionals boundary operator, this operator replaces relational operators by inverting them: == becomes != and so forth.
Void Method Calls Deletes calls to methods that do not return any values.

Empty Returns
Replaces the value in a return statement with the default value for that type. For example, strings become empty strings, and integers and floating point numbers become 0.
False Returns Replaces Boolean return statements to always return false.
True Returns Replaces Boolean return statements to always return true.

Null Returns
Replaces reference type return statements to always return null.
Primitive Returns Replaces numeric return statements to always return 0.

Return Values
In the new default group, this operator has been replaced by another set of return mutators, that is, the empty, false, true, null and primitive returns operators. Depending on the return type of a method, it mutates the return value.

Inline Constant
Depending on the type of a non-final variable, it replaces literal values with another value.

Constructor Calls
Places null values as a substitute for constructor calls.
Non Void Method Calls Replaces non-void method calls with the Java default values according to the method return type.

Remove Increments
Removes increment (++) and decrement (--) operators from local variables.

Negation
Replaces any numeric variable with its negation.

Arithmetic Operator Replacement
Similar to the math operator, but it replaces one arithmetic operator (e.g. +) with all other operators from this set (e.g. -, *, /, %).

Arithmetic Operator Deletion
Removes a binary arithmetic operator along with one of its operands.

Constant Replacement
Similar to the inline constant operator, it substitutes a constant c with 1, 0, -1, -c, c + 1 and c - 1.

Bitwise Operator
Removes a bitwise operator (e.g. &, |) along with one of its operands. It also replaces & with | and vice versa.

Unary Operator Insertion
Inserts an increment (++) or decrement (--) operator to a local variable.
TABLE 3 Extended mutation operators we used for additional experimentation.

Method Calls
The default set of Pitest can only mutate code to remove void method calls. We added an additional operator that can remove any method call, as otherwise all faults where non-void method calls are removed would not be reproducible. This approach is similar to an experimental mutation operator of Pitest for non-void method calls.

Relaxed Empty Returns
The empty returns operator uses a fixed default value as a replacement. While such a replacement may be a reasonable mutant, it is unlikely that such a mutant matches a real-world fault exactly, because the replacement value may be different. We modified the empty returns operator to allow any possible replacement value to be able to better reproduce faults.

Relaxed Inline Constants
Same as relaxed empty returns, but for the inlining of constants.

Relaxed Return Values
Similar to relaxed empty returns. Additionally, we allow not only all possible values of the same literal type but also the replacement of object references. Otherwise, coding mistakes where the wrong object was returned could not be reproduced.

Rename
A generic operator that can replace valid identifier names with other valid identifier names, for example, the name of a method with the name of another method. We added this operator because otherwise all faults where a method call is replaced could not be reproduced.
replaced. These are also simple mistakes that are in line with the competent programmer hypothesis (e.g. failing to rename variables after copy and paste) that would otherwise not be covered. We refer to the combination of M_default, M_optional and the five new operators as M_extended. We note that there is sometimes an overlap between the operators in the set; for example, the operator 'Relaxed Empty Returns' could always be used instead of the simpler 'Empty Returns'. Due to this, we bias the A* heuristic to first use operators from the default set, then the optional set, and only then from the extended set. We actively decided against additional operators, especially those that modify the interface of methods and classes. Our rationale was that such mistakes would not be in line with the competent programmer hypothesis, as changes that affect the design of a software system cannot reasonably be characterised as small differences to the correct code. Our decision not to use such operators was also a factor in favour of Pitest over MuJava as the foundation for our work.
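To illustrate the kind of mistake the Rename operator targets, a wrong method name can be emulated as an AST transformation (Python's `ast` module stands in for a Java parser; the identifier pair is hypothetical):

```python
import ast

class Rename(ast.NodeTransformer):
    """Replaces one valid identifier with another, e.g. a wrong method name."""
    def __init__(self, old, new):
        self.old, self.new = old, new
    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

tree = ast.parse("total = compute_sum(values)")
mutant = Rename("compute_sum", "compute_mean").visit(tree)
print(ast.unparse(mutant))  # total = compute_mean(values)
```

A single application of such an operator already reproduces the copy-and-paste mistakes described above, which none of the Pitest operators can express.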

4.3 | Measurements
Let B be the set of bugs included in our study, with b = (S_fix, S_bug) ∈ B as a pair of fixed and defective source code. We define our performance metrics based on three subsets of the bugs, R, P and U, that are defined as

R = {b ∈ B : there is a p ∈ M* with p(ast(S_fix)) = ast(S_bug)},
P = {b ∈ B \ R : there is a p ∈ M* with progress(p) > 0},
U = B \ (R ∪ P), (5)

where M* is the set of all possible mutation paths. These sets define the faults that we can fully reproduce (R), that we can partially reproduce (P), and that we are unable to reproduce (U) using the mutation operators. The size of these sets is our measure to explore the competent programmer hypothesis: if we can reproduce or at least partially reproduce faults, this indicates support for the competent programmer hypothesis. We use the progress of partial paths as defined in Equation (4) to gain further insights into how well the partially reproduced faults support the competent programmer hypothesis. Faults that cannot be reproduced at all contradict the competent programmer hypothesis.
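The classification of a file change into R, P or U can be sketched as follows, where `best_progress` stands for the highest progress achieved by any mutation path found by the search (a simplification of our actual bookkeeping):

```python
def classify(best_progress):
    """Maps the best progress achieved for a file change to R, P or U."""
    if best_progress == 1.0:
        return "R"  # fully reproduced
    if best_progress > 0.0:
        return "P"  # partially reproduced
    return "U"      # not reproduced at all

print([classify(p) for p in (1.0, 0.5, 0.0)])  # ['R', 'P', 'U']
```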
The second measure we use is the length of the mutation paths, that is, the order of the resulting mutants for faults that we can reproduce. Short mutation paths would provide strong support for the competent programmer hypothesis for these faults. Longer paths indicate that the faults are the sum of many small mistakes, which means that the code was actually not almost correct, which would not be in line with the competent programmer hypothesis.
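Given the definitions above, the partition of the bugs into R, P, and U can be sketched as follows. This is an illustrative skeleton only: `find_full_path` and `find_best_progress` stand in for the A* search and the progress metric of Equation (4) and are not part of the original tooling.

```python
def classify(bugs, find_full_path, find_best_progress):
    """Partition bugs into fully (R), partially (P), and not (U) reproduced.

    `find_full_path(bug)` returns a mutation path that reaches S_bug, or
    None if no full path exists; `find_best_progress(bug)` returns the
    progress in [0, 1] of the best partial path (0 means no mutation
    matched any of the required AST changes).
    """
    R, P, U = [], [], []
    for bug in bugs:
        if find_full_path(bug) is not None:
            R.append(bug)  # fully reproduced: a complete mutation path exists
        elif find_best_progress(bug) > 0:
            P.append(bug)  # partially reproduced: some required changes covered
        else:
            U.append(bug)  # not reproduced at all
    return R, P, U
```

The sizes of the three returned lists correspond to the |R|, |P|, and |U| counts reported in the results.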

| Methodology
Our experiment methodology is straightforward and consists of three phases. In all phases, we apply our approach for the reproduction of faults through mutations, described in Section 3, to the Defects4J data. In the first phase of the experiments, we use M_default to determine how many faults from Defects4J we can reproduce; in the second phase, we use M_optional. Thus, the first phase evaluates the relationship between a standard set of mutations that is already used in practice and the support for the competent programmer hypothesis. The second phase analyses how the Pitest optional operators influence the ratio of recreated faults when used along with the M_default set. The third phase evaluates how our extension of the allowed mutations, M_extended, impacts the results. We used insights from the first two phases to extend the set of mutation operators. Through this, we mitigate a potential impact on our results due to a small and restricted set of mutation operators. In all phases, we measure the relation between the competent programmer hypothesis and the set of mutations through the measurements defined above.

| Results
Table 4 shows the absolute numbers for our reproduction of faults. Overall, we could fully reproduce 24 faults with the M_default operators, 29 faults with the M_optional operators, and 75 faults with the M_extended operators. Moreover, 287 faults could at least be partially reproduced with the M_default operators, 300 with the M_optional, and 717 with the M_extended operators. We could not reproduce 700 faults with the M_default, 682 faults with the M_optional, and 219 faults with the M_extended operators. Figure 2 provides a relative view of the data, that is, the percentage of faults per project that could be reproduced, partially reproduced, or not reproduced at all. Figure 3 shows the differences among the percentages of faults reproduced using the M_default, M_optional, and M_extended sets. The data is relatively stable and shows that between 50% and 81% of the faults could not be reproduced at all with the M_default mutation operators. This changes with the M_extended operators, which instead allow for at least the partial reproduction of 54% to 82% of the faults.
Figure 4 allows us a closer look at the partial reproduction of faults and shows which percentage of the required AST changes we were able to reproduce. When considering the different groups of mutation operators, that is, M_default, M_optional, and M_extended, a consistent trend emerges in the results. Notably, the number of faults fully reproduced by these operator groups follows the order M_extended > M_optional > M_default. For partially reproducible faults, we observe a similar trend in their distribution. With the M_extended set, few faults have less than 10% partial reproduction, while most peak at around 30% partial reproduction. As the percentage increases, the number of partially reproduced faults steadily declines, with very few exceeding 75% partial reproduction. The peak and decline in partial reproduction are more rapid for the M_default and M_optional operators, in line with the small number of faults that can be partially reproduced using these operators.
Figure 5 shows the lengths of the mutation paths, that is, the order of the mutations we determined for the reproduction of faults. With the M_default and M_optional operators, we observe that full reproduction was only achieved through first-order mutants, except for one second-order mutant with M_optional. This is different for the M_extended operators, where we also see full reproductions with up to fourth-order mutants, with one outlier, a sixteenth-order mutant. For partial reproduction, we observe many cases where longer mutation paths are considered. The longest mutation path found consists of 55 mutations. However, we also observe a strong decay in the lengths of the mutation paths: while we observe about the same number of first-order and second-order mutants for partial reproduction with M_extended, the number of mutants steeply declines for higher mutation orders. With the M_default and M_optional sets, we observe this steep decline already for second-order mutants.
Finally, Table 5 reports which mutation operators were selected for the mutation paths. The operators 'Increments', 'Empty Returns', 'Remove Conditionals', 'Inline Constant', 'Constructor Calls', 'Relaxed Empty Returns', 'Relaxed Inline Constants', and 'Relaxed Return Values' are never used. The most effective operators for the reproduction of faults are the 'Void Method Calls' and 'Method Calls' operators, that is, the removal of method calls. The 'Rename' operator was also often used to switch identifier names. We further find that 'Relational Operator Replacement' seems to be important, but is only frequently used with the M_extended set. We observe the same for the 'Negate Conditional', 'Non Void Method Calls', and 'Arithmetic Operator Deletion' operators. The 'Negation' operator is only used with the M_extended set. 'Conditional Boundary', 'Math', 'False Returns', and 'Null Returns' are used equally in all mutation operator sets. The 'Constant Replacement' and 'Bitwise Operator' operators are used similarly in both the M_optional and M_extended sets. 'Invert Negatives', 'True Returns', and 'Remove Increments' are valuable, but less important than the other operators. The 'Primitive Returns', 'Arithmetic Operator Replacement', 'Return Values', and 'Unary Operator Insertion' operators are also helpful, but not for many faults. Relaxing the return operator did not make a difference, as this only merged the results for 'False Returns' and 'True Returns' into a single operator.

TABLE 4 Absolute numbers of bugs that we recreated through mutations with the M_default, M_optional, and M_extended operator sets. |R| are fully reproduced, |P| partially reproduced, and |U| not reproduced at all.

FIGURE 2 Percentages of bugs that were reproduced (R), partially reproduced (P), and not reproduced (U).

FIGURE 3 Difference: Percentages of bugs that were reproduced (R), partially reproduced (P), and not reproduced (U).

FIGURE 4 Percentage of required Abstract Syntax Tree (AST) changes that were reproduced through mutations.
FIGURE 5 Lengths of the mutation paths we found to reproduce bugs. The maximum path length for M_default and M_optional is 44, while for M_extended it is 55.
The general message of our results regarding the research question is clear: we fail to recreate most real-world faults through mutations, regardless of the order of the mutants. However, we can at least partially reproduce a large proportion of real-world faults. Within this section, we dive deeper into our results, identify reasons why full reproduction was not possible, discuss what this means for the competent programmer hypothesis, and consider the implications of these results for the future of mutation testing.

| Reasons for failed fault reproductions
The main reason for failed reproductions is that the mutation operators are not in line with how software is often modified as part of bug fixes. One aspect is that bug fixes are often related to changes in method calls. The value of more flexible deletion of method calls is already evident in our results, demonstrated by the relaxed 'Method Calls' operator. The 'Rename' operator also covers cases in which a method was replaced with a different locally available method with the same signature. However, this still does not account for the addition of new method calls. Thus, all faults that contained a method call that was not part of the fixed source code could not be reproduced. We note that this also means that our decision to use the Pitest operators [32] for this study does not have a strong impact on the results, as these faults could also not be reproduced by the operators of any other mutation testing framework we are aware of. This problem can be generalised as the key reason for failed reproductions: whenever there is a change to an external dependency that is outside the scope of the method that is modified as part of a bug fix, mutation testing is likely unable to reproduce such faults. Unfortunately, the design of generic mutation operators that could solve this problem in a meaningful way is very hard, maybe even impossible. Consider what this means: mutation operators would have to pick a suitable (!) candidate from all (!) possible method calls that could be inserted at any (!) possible location. The number of possibilities is, for all practical purposes, infinite for non-trivial applications. Since the number of suitable candidates is likely very small, the chance of randomly selecting a good method call to insert is essentially zero. However, this is how normal mutation operators work: they define a logic that should be mutated and then insert this mutation in the source code. Bounding this problem can be considered the opposite of programme repair, which indicates that learning common mistakes within an application may be a solution. This implies that the solution could be the application of a specific set of mutations that is learned from prior faults. Interestingly, this seems to be exactly the approach suggested by Beller et al. [33] in a recent paper on the adoption of mutation testing in practice. The first example of a learned mutation that Beller et al. [33] present is a complex replacement of a method call.
A second pattern we found that would be hard to define with mutation operators is the addition of conditional blocks, for example, if statements with null checks. The inability of the mutations to add complete blocks hinders the reproduction of such faults. Similar to the addition of method calls, there is a huge number of possible locations where new blocks could be inserted randomly. Moreover, such operators would also have to decide which statements should become part of the new block, while also generating meaningful conditions. Thus, random mutation operators are, again, likely not possible, while learned operators could also solve this problem.
Thus, while we were initially surprised that the number of faults we could fully reproduce is relatively low, a deeper inspection of the problem shows that this is not surprising and that a fixed set of mutation operators is unlikely to overcome this problem. Moreover, our analysis showed that this inability is not due to the complexity of the faults.

| Competent programmer hypothesis
Our evidence for the competent programmer hypothesis is not straightforward, and the actual implications are a matter of interpretation, both of our findings and of the meaning of the competent programmer hypothesis.
On the one hand, we found that many faults cannot be reproduced using mutation operators. However, as we discussed in the previous section, in many cases the root cause was not the complexity of the fault, but rather that the mutation operators were not sufficiently powerful. On the other hand, we found that when we could fully reproduce faults, the mutation paths were relatively short. The partial reproductions also contained longer mutation paths and mostly covered between 5% and 75% of the required mutations. For all faults for which we have at least a partial mutation path p, we can estimate the required order of mutations as

ĝ(p) = g(p) / progress(p),

where g(p) and progress(p) are defined in Equations (2) and (4), respectively. This provides the expected mutation path lengths, assuming that the same ratio of the required mutations is missing. We then add the data for the fully reproduced faults. Figure 6 shows the result. We observe that the sequence lengths seem to grow exponentially with the percentiles. In other words, we have an exponential decay of the lengths of mutation paths: 40% of the faults should be reproducible by 7 or fewer mutations, 50% by 9 or fewer, 60% by 14 or fewer, and 70% by 21 or fewer mutations.
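The extrapolation above can be sketched in a few lines. This is an illustrative computation with made-up inputs, not data from the study; the rounding of the estimate and the nearest-rank percentile are assumptions.

```python
import math

def estimate_order(path_length, progress):
    """Estimate the full mutation order g^(p) = g(p) / progress(p) for a
    fault with only a partial mutation path, rounded to the nearest
    integer, assuming the missing AST changes require mutations at the
    same rate as the reproduced ones."""
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    return round(path_length / progress)

def percentile(values, q):
    """Nearest-rank percentile: the smallest estimate v such that at least
    a fraction q of all estimates are <= v."""
    ordered = sorted(values)
    idx = math.ceil(q * len(ordered)) - 1
    return ordered[max(idx, 0)]
```

For example, a partial path of 3 mutations that covers 30% of the required AST changes yields an estimated order of 10; applying `percentile` to all estimates gives curves like the one in Figure 6.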
Thus, while we find that 30% of the faults are non-trivial and require more than 21 mutations to reproduce, we still believe that our results support the competent programmer hypothesis. Most faults can be replicated with only a few mutations, which is in line with the competent programmer hypothesis. There is also a grey area of faults that require more, but not many, mutations. We do not want to judge whether these are still small variations or already larger deviations. However, the competent programmer hypothesis does not state that mutation testing should be able to mimic all real faults, but only most real faults. Thus, we do not believe that the more complex faults we found contradict the hypothesis; instead, we postulate that these are the 'few' faults that are not among the 'most' mistakes. However, as we found that standard mutation operators do not cover all important cases, we also find that first-order mutations are too simple to represent real-world faults.
We note that our discussion is based on our interpretation of the competent programmer hypothesis. If we were to assume a rigid stance regarding the competent programmer hypothesis, similar to Gopinath et al. [7], our conclusion would be strongly against the competent programmer hypothesis. This reveals a vital research question that future studies should address: what do developers interpret as a 'small variation' that is in line with the competent programmer hypothesis? Without such work, we can only provide evidence regarding the data, but cannot yet achieve a universally accepted interpretation of this evidence.

| Implications
While we support the competent programmer hypothesis, our results rather indicate that the faults introduced by currently used mutation operators are not similar to the majority of faults. However, this is not because the faults are complex, but rather because the mutation operators lack the expressiveness to cover the relevant AST modifications.
We note that our results should not be seen as an implication that the mutation operators, for example, of Pitest, are bad. However, the fact that they work and can be used to improve test suites cannot, to our mind, be explained by the competent programmer hypothesis, as the faults these operators mimic are not realistic. Instead, we can only speculate why current mutation testing is effective. One explanation could be that the coupling effect is sufficiently effective on its own. Another aspect could be that even unrealistic mutations are sufficient to uncover a lack of assertions in tests. Further research on the foundations of mutation testing is required to better understand why this works, for example, by analysing how test suites are improved as a consequence of mutation testing.
The second, possibly more important, implication of our results is that research into the targeted generation of higher-order mutants could be valuable, as these seem to better mimic real-world faults. Such mutations should especially cover how logic may change through additional method calls, which may be particularly hard to solve due to the huge number of possibilities. The challenge that must be solved is to identify new method calls such that (1) the mutants are not live mutants because the additional method calls have no effect; and (2) the mutants are not trivial to kill, for example, because they crash the application.
Recent research by Beller et al. [33] shows the promise of using real-world faults to learn mutation operators, instead of specifying them manually. Such learned mutations effectively generate higher-order mutants, as they provide more complex modifications of the AST. Since the patterns could even include how new methods are selected for mutation, we believe that such approaches may lead to a major advancement of mutation testing, both with respect to effectiveness and adoption.
Another interesting result of our study is that certain mutations were never used for the reproduction of faults. While this does not automatically mean that such mutations cannot help, for example, to improve test suites, it is certainly an indicator that not all mutation operators mimic real-world faults that occur regularly. This is another indication that learning mutations could lead to more effective mutation testing.

FIGURE 6 Expected lengths of mutation paths required for the reproduction of bugs.

| THREATS TO VALIDITY
There are several threats to the validity of our work, which we report following the classification by Wohlin et al. [34].

| Construct validity
The core of our construct is the selection of mutation operators. An unsuitable set of mutation operators could alter our results towards finding fewer or longer mutation paths and, thereby, bias our results against support for the competent programmer hypothesis. We countered this threat by extending our set of mutation operators with additional operators that resolve restrictions and make the approach more general. Our data shows that this was effective, as we were much better able to reproduce faults with the extended set of mutation operators. Since our results are sufficient to estimate the expected order of mutations to fully reproduce the faults in our sample, and the data also shows the limitations of current mutation testing operators, we do not believe that the usage of other mutation operators would substantially alter our results. There could also be a threat to our results because we first use the default operators, then the optional operators, and then the extended operators. However, this should not limit the capability to find mutation paths, but rather only lead to simpler operators being used first. Since our approach is limited to the modification of a single AST, we cannot directly apply it to the reproduction of faults that affect multiple files. However, exactly one file was changed in most instances, so there is only a small amount of noise: we have only 25% more file changes than faults.

| Internal validity
Our conclusion in support of the competent programmer hypothesis is, in part, based on the extrapolation of our results, that is, the combination of our findings regarding the observed path lengths and the completion percentage. While we made the reasonable assumption that a similar number of mutations is required to cover the remaining segment of AST modifications, this is not necessarily true. Instead, the remaining changes could be covered by more complex mutations that affect more AST nodes at once, therefore requiring fewer mutations. Regardless, while this means we may overestimate the order of mutations required to reproduce real faults, this would not impact our findings: there would still be a non-trivial number of faults that require a large number of mutations, and most faults would still be reproducible with relatively few mutations.

| External validity
Our experimentation is limited to data from Defects4J, which is not an unbiased sample of faults in general, or even of faults in Java software. Therefore, our results may not translate to other settings, for example, other Java projects or projects written in other programming languages. However, the generalisation of our results would only be affected if the faults from Defects4J were particularly simple or particularly difficult. If the faults are simpler than faults on average, our conclusion in support of the competent programmer hypothesis may be wrong, as we would underestimate the number of mutations required. If the faults are more difficult than faults on average, our findings with respect to the competent programmer hypothesis would hold, but our extrapolation that learning mutation operators can overcome difficulties may not, as simple pre-defined operators may be sufficient. While we cannot mitigate this threat, we are also not aware of any research on the Defects4J data that indicates that these faults are particularly simple or difficult, nor did we see any indications regarding this in our study.
The size of our sample of faults is another threat to the generalisability of our results. However, the trends we observe in our data are very clear and, with a sample size of 835 unique faults, the sample is not small. Therefore, we do not see any indication that these trends would change with a larger sample from the same population, e.g., more faults sampled in the same way as Defects4J.
| Conclusion
Within this article, we considered the competent programmer hypothesis through the lens of our ability to reproduce faults through mutation operators. This gives us a different perspective on the competent programmer hypothesis than prior work that, for example, only considered how many tokens are changed by bugs or manually compared mutations with faults. We re-framed the problem of transforming a correct AST into a buggy one as a path search problem, in which each step of a path is a mutation. This allows us to evaluate not only whether mutation operators are directly related to faults, but also the order of the mutations required to fully reproduce faults. We found that our data supports the competent programmer hypothesis. However, we were often not able to fully reproduce faults because of the limited expressiveness of mutation operators. We found that especially the addition of new blocks and method calls is a large difference between real-world faults and mutation operators that can only delete or modify AST nodes. Thus, while our results indicate that the competent programmer hypothesis is true, mutation operators are often not in line with the slight differences from correct code introduced by developers.
In the future, we plan to investigate whether automatically learned mutation operators are better suited to reproduce real-world faults and, thereby, demonstrate that mutation testing is not only based on the competent programmer hypothesis but actually fully in line with real-world faults. These investigations will help us to further understand why mutation testing works, with the goal of, at some point, not just demonstrating a correlation between mutations and real-world faults, but finding causal links between them that can be exploited to improve the effectiveness and efficiency of mutation testing.
TABLE Number of analysed pairs of bugs and fixes per project.

TABLE 5 Mutation operators that were selected for the mutation paths.