The relationship between evolutionary coupling and defects in large industrial software

Evolutionary coupling (EC) is defined as the implicit relationship between 2 or more software artifacts that are frequently changed together. Changing software is widely reported to be defect‐prone. In this study, we investigate the effect of EC on the defect proneness of large industrial software systems and explain why the effects vary. We analysed 2 large industrial systems: a legacy financial system and a modern telecommunications system. We collected historical data for 7 years from 5 different software repositories containing 176 thousand files. We applied correlation and regression analysis to explore the relationship between EC and software defects, and we analysed defect types, size, and process metrics to explain different effects of EC on defects through correlation. Our results indicate that there is generally a positive correlation between EC and defects, but the correlation strength varies. Evolutionary coupling is less likely to have a relationship to software defects for parts of the software with fewer files and where fewer developers contributed. Evolutionary coupling measures showed higher correlation with some types of defects (based on root causes) such as code implementation and acceptance criteria. Although EC measures may be useful to explain defects, the explanatory power of such measures depends on defect types, size, and process metrics.

and blurred interfaces between modules and submodules. Breu and Zimmermann 17 showed that EC information and data mining techniques could detect crosscutting concerns in software systems. Such crosscutting concerns emerging over time may contain functionality that does not align with the system's architecture. Furthermore, Eaddy et al 18 argued that crosscutting concerns were harder to implement and change consistently because multiple (possibly unrelated) locations in the code have to be found and updated simultaneously. Their study suggested that increased crosscutting concerns may actually cause or contribute to defects. Our own previous study of EC in a banking system 19 suggested that EC does impact defects.
Conversely, Graves et al 8  In this study, we analysed the correlation between EC measures and the number of defects and defect density in 2 large software systems in industrial software development environments. Correlation analysis is performed separately for each module.* We also built logistic regression models. In this study, multivariate regression analysis is used to explore the relationship between EC (independent variable) and defects (dependent variable) to understand how helpful EC measures are in defect analysis compared with other process metrics (we build correlation models rather than prediction models). We also analysed the relationship between EC and defect types. Our research questions are as follows:

• (RQ1) What is the relationship between EC and software defects?
The results of our study showed that there was, in general, a relationship between EC and software defects in the software maintenance/evolution phase of the industrial software systems under study.
We detected a positive correlation between EC measures and defects.
Compared with other process measures such as the number of commits and the number of developers, EC measures seem to contain additional, sometimes important, information about defects: for every additional EC, the module is 8% more likely to be defective. However, correlation strength varied across modules and in some modules EC and defects were not correlated. On the basis of these findings, we added the following research question to our study:

• (RQ2) What factors explain why the relationship between EC and software defects is different for different modules?
Modules that were small in Lines of Code (LOC) and developer numbers tended to show weaker correlation between EC and defects. Fewer defects due to EC seem to occur in small modules. Evolutionary coupling also appeared to be more highly correlated with some types of defects such as code implementation, acceptance criteria, and analysis problems. Overall, regression analysis showed that EC may be useful for explaining defects in industrial systems.
*A module is part of a software system. A software system is composed of one or more independently developed modules. Similar functionality is contained within the same module, and a module is generally composed of many source files. A module is generally owned by a specific team, and the team members are responsible for its development and maintenance. In the systems analysed in this study, modules are also part of subsystems. There is a one-to-many relationship between subsystems and modules. A module can be part of only one subsystem, and a subsystem may have many modules. But subsystems are not covered in the scope of this work.
We make the following contributions in this paper. Firstly, we analyse large commercial systems, which have rarely been empirically studied to understand the relation between EC and defects. Secondly, we show that the effect of EC on defects varies depending on the module. Thirdly, the explanatory power of EC measures varies depending on defect types and module features such as size and developer activity. This paper is organised as follows: In the next section, we summarise related work. In Section 3, we present our methodology including measures, data extraction, and analysis methods. Section 4 shows the results of applying our methodology to 2 industrial systems. The discussions and threats to validity of this study are then addressed in Sections 5 and 6, respectively.
Finally, in Section 7, we summarise and present our conclusions.

Evolutionary coupling
Evolutionary coupling was first identified in 1997 by Ball et al. 21 Early studies on EC focused on the relationship between EC and architectural problems with EC used as an indicator of architectural weaknesses and modularity problems. Classes that were frequently changed together during the evolution of a system were presented visually using EC information by Ball et al. 21

Relationship between EC and defects
Evolutionary coupling measures have also been used in defect prediction studies. These studies are related to our first research question (RQ1). First, we focus on the studies that reported a relation between EC and defects. Steff and Russo created sequential commit graphs of evolutionary coupled classes. 28

METHODOLOGY
This section explains the setting and data sources that we used as well as how we extracted and analysed the data.

Study context
We performed our study on 2 large industrial systems. One of the systems was a large financial legacy system that had evolved for over 25 years to support the back-end business processes of a large financial institution (henceforward known as "Company 1"). Much of the code was written in PL/I and COBOL, but there were also files written in job control language, a scripting language used on mainframes to develop a batch job. The system consisted of 20 subsystems and 274 modules.  Table 1.

Data collection
We collected source code data from SVN and CA SCM VCSs, defect data from JIRA and in-house developed defect repositories, and the link between source code and defects from configuration management database (CMDB) at Company 1. Figure 1 presents an overview of our approach to mine the data.
We developed adapters for the 5 different data sources. The output of adapters containing the data retrieved from the data source for the specified period is stored in a database. For source code data, we fetch all versions created during the specified period in the VCS and store them in the database. We applied static code analysis on each file revision providing method-level and program-level static metrics (discussed in the next subsection). Commit information such as the developer ID that created the version, date of creation, and the related problem/request/project ID were available from the source code repository. We applied filtering to remove large commits that may have contained logically irrelevant changes. Commits containing more than 30 files were ignored and were not considered while calculating EC measures.
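The large-commit filter described above can be sketched as follows. This is a minimal illustration, not the adapters used in the study: the `(commit_id, file_path)` record layout and the function name are assumptions, and only the 30-file threshold comes from the text.

```python
from collections import defaultdict

MAX_FILES_PER_COMMIT = 30  # threshold stated in the text


def filter_large_commits(changes, max_files=MAX_FILES_PER_COMMIT):
    """Group change records by commit and drop commits touching more than max_files files.

    changes: iterable of (commit_id, file_path) pairs (hypothetical record layout).
    Returns a dict mapping each kept commit_id to the set of files it changed.
    """
    files_per_commit = defaultdict(set)
    for commit_id, file_path in changes:
        files_per_commit[commit_id].add(file_path)
    # A commit with exactly max_files files is kept; only larger ones are ignored.
    return {
        commit_id: files
        for commit_id, files in files_per_commit.items()
        if len(files) <= max_files
    }
```

Only the commits that survive this filter would then feed into the EC calculation.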
In CMDB, each software product is defined as a separate configuration item (CI) and each change is recorded and linked to the corresponding CI. In our study, we collected all source code-related changes performed in the scope of defect fixing or enhancement on the software product analysed over the defined period. In CMDB and JIRA, 2 different sources for a change were defined: Problem or Request. We could therefore distinguish between bug fixing and enhancement.

Code repositories
Source code repositories are primarily used for storing and managing changes to source code artifacts. The full history of changes, the owner of the change, date of the change, and even the corresponding require-

Defect repositories: Company 1 (Finance)
We mined the defect repository to collect defect data reported.
The defect repository at Company 1 was developed in-house by the company.
Mapping between defects and source code: We followed different approaches for the 2 companies for finding a mapping between defects and source code. For Company 1, we used the CMDB for this purpose. For both companies, we assumed that files involved in a defect fix contained the defect.
Configuration Management Database: Many companies store information related to components of their information system in a CMDB, which contains data describing the following entities 35 : • managed resources such as computer systems and application software, • process artifacts such as incident, problem and change records, • relationships among managed resources and process artifacts.
In our study, we used the CMDB system at Company 1 to extract data about the relationship between MR and source code. The CMDB system was developed in-house by the company.

Defect repository: Company 2 (Telecommunications)
Company 2 used JIRA, 36 a proprietary defect tracking product, developed by Atlassian.
Mapping between defects and source code: For Company 2, we used the defect IDs provided in the SVN commit comments by developers and the revision numbers provided in JIRA issues. For both companies, we assumed that files involved in a defect fix contained the defect. The percentage of fixed bugs linked to version control varied between 73% and 79% from year to year. The bugs that are not linked to version control include defect fixes that do not require a source code change and a version control commit, such as database-related fixes. The mappings from SVN commit to defect and from JIRA issue to SVN commit were generally consistent. Table 2 lists all the measures used in this study. The following sections will provide the details of these measures.
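As an illustration of the commit-comment direction of this mapping, the sketch below extracts issue keys from commit comments and collects the files changed for each known defect. The key pattern (uppercase project prefix, hyphen, number, such as "ABC-12"), the dictionary layout of a commit, and the function names are assumptions made for the example; the study does not describe its extraction rules at this level of detail.

```python
import re

# Assumed shape of an issue key, e.g. "ABC-12"; the real key format is not given in the text.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def defects_in_comment(comment):
    """Return the set of issue keys mentioned in a commit comment."""
    return set(ISSUE_KEY.findall(comment))


def link_defects_to_files(commits, defect_ids):
    """Map each known defect ID to the files changed by commits that mention it.

    commits: iterable of dicts with 'comment' and 'files' keys (hypothetical layout).
    defect_ids: set of issue keys that the defect tracker classifies as defects.
    """
    linked = {}
    for commit in commits:
        for key in defects_in_comment(commit["comment"]) & defect_ids:
            linked.setdefault(key, set()).update(commit["files"])
    return linked
```

Keys that appear in a comment but are not known defects (for example, enhancement requests) are ignored, which mirrors the distinction between bug fixing and enhancement made earlier.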

EC measures
In the companies under study, any changes made to the source code were made based on MRs. An MR represents a conceptual software change, which includes modification of one or more source code files by one or more software developers. These changes can be defect fixes or enhancements. We used an MR-based approach to calculate EC and formalise our approach as follows.
Let MR denote the set of MRs, mr denote a specific MR in MR, and f denote a source code file changed in the scope of mr. On the basis of these definitions, we calculate evolutionary coupled files and EC measures as follows:
• The set of evolutionary coupled files of a file f
• The total number of evolutionary coupled files of a file f
• The set of evolutionary coupled files of a file f in the scope of an MR mr
• The sum of the number of evolutionary coupled files of a file f for all mr's in MR
are the unique commit operations of a developer. In this approach, it is assumed that developers commit logically coupled files within a transaction. The system at Company 1 was a legacy system, and developers rarely committed more than 1 file in one transaction. Therefore, we found that a transaction-based approach was not appropriate to detect EC. We followed an MR-based approach and grouped the file changes according to the associated MR numbers. 8 In our approach, file changes spanning multiple transactions were grouped together if they were associated with the same MR.
The third issue considered for EC calculation was the boundary for finding coupled files. We chose module level to find coupled files that resided in the same module. We consider EC only within module boundaries. Alternative boundaries could be the subsystem or system level, which would also consider cross-module couplings.
In this study, we ignore any cross-module ECs.
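The MR-based measures described in this section can be sketched in code as follows. This is an interpretation of the verbal definitions, not the authors' implementation: it assumes that any co-change within the same MR counts as coupling (no frequency threshold), that the total number of distinct coupled files corresponds to the NoECF measure, and that the per-MR sum corresponds to NoECFMR. The input layout and function name are hypothetical.

```python
from collections import defaultdict


def ec_measures(mr_files, module_of):
    """Compute two EC counts per file from MR-grouped changes.

    mr_files: dict mapping an MR id to the set of files changed in that MR.
    module_of: dict mapping each file to its module name.
    Returns (noecf, noecfmr):
      noecf[f]   = number of distinct files co-changed with f in any MR
      noecfmr[f] = sum over MRs of the number of files co-changed with f in that MR
    Only files in the same module as f are counted (cross-module EC is ignored).
    """
    coupled = defaultdict(set)
    noecfmr = defaultdict(int)
    for files in mr_files.values():
        for f in files:
            partners = {
                g for g in files
                if g != f and module_of[g] == module_of[f]
            }
            coupled[f] |= partners
            noecfmr[f] += len(partners)
    noecf = {f: len(partners) for f, partners in coupled.items()}
    return noecf, dict(noecfmr)
```

For example, a file changed with two same-module files in one MR and with one of them again in a second MR would have a distinct count of 2 and a per-MR sum of 3 under these assumptions.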

Size measures
Lines of Code was chosen for size measurement, and this is also used for normalising derived measures. We also used LOC to detect outliers in the data. To this end, we identified files whose size was greater than 10 K LOC (0.4% of all files). These files were removed from the analysis as they were interpreted as outliers. Lines of Code is also used to investigate file size as a possible confounding factor. We check for correlation between LOC and other measures. Using defect density as a normalised measure in our study mitigates the risk of size as a possible confounding factor. We use the following measures for defects: number of defects reported for a file (NoD) and defect density (DD). We use the following formula for calculating defect density:

Defect types
We used the defect types listed in the Appendix (Table A1) and provide their descriptions. This defect type classification was used by Company 2 and each defect reported was tagged by one or more defect types (the defect repository stored the defect type data for each defect). The defect types in Table 13 are ordered on the basis of the defect type codes used by the company.

Analysis method for answering RQ1
Spearman correlation analysis was used to find the relationship between EC and defect measures. Since the data is not normally distributed, we apply Spearman rank correlation analysis. Spearman rank correlation analysis is a nonparametric test of correlation and assesses how well a monotonic function describes the association between variables. This is done by ranking the sample data separately for each variable. We used the Shapiro-Wilk test 37  0.5 and 0.7 as high, between 0.7 and 0.9 as very high, and above 0.9 as almost perfect. 39,40 Correlation analysis was applied on each module separately to obtain ρ, P, and StdErr values for each. We used histograms to summarise the correlation results and the SPSS 41 tool was used for the statistical analysis.
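To make the ranking step concrete, the following sketch computes Spearman's ρ as the Pearson correlation of average ranks and applies it per module. The study itself used SPSS; this pure-Python version is only an illustration, the `(module, ec_value, defect_count)` row layout is an assumption, and P and StdErr values are not computed here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_per_module(rows):
    """rows: iterable of (module, ec_value, defect_count) tuples (hypothetical layout)."""
    by_module = {}
    for module, ec, defects in rows:
        xs, ys = by_module.setdefault(module, ([], []))
        xs.append(ec)
        ys.append(defects)
    return {m: spearman(xs, ys) for m, (xs, ys) in by_module.items()}
```

Because only ranks are used, any monotonic relationship between an EC measure and the defect count yields a coefficient of 1 or -1, regardless of the shape of the curve.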
After correlation analysis was performed, we applied multivariate logistic regression and multicollinearity analysis with basic process metrics such as number of commits, number of developers, and prior number of defects as well as EC metrics. With this analysis, we aim to identify the relationships among the metrics and to find metrics that do not add any new knowledge about defects.
The following describes the steps taken to build a logistic regression model for the EC metrics, process metrics, and the presence or absence of defects. The first step is to binarise the defect count such that a data point is labelled defective if the defect count is greater than 0. Then we build a logistic regression model using all terms and no interactions.
Having built the model, we test for multicollinearity to find any independent variables that are correlated. Then we build a model that includes interaction terms and identify terms that are correlated.
Finally, we build an interaction model without correlated terms and apply stepwise reduction to remove terms that are not significant.
By using regression models, we aim to determine whether a particular independent variable really affects the dependent variable and to estimate the magnitude of that effect, if any.
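Two small pieces of this procedure can be illustrated directly: the binarisation of defect counts, and the conversion of a fitted logistic regression coefficient into an odds ratio, which is how the magnitude of an effect is usually read. The sketch assumes the coefficient has already been estimated by some statistics package; the function names are hypothetical.

```python
from math import exp, log


def binarise(defect_counts):
    """Label a data point 1 (defective) if its defect count is greater than 0, else 0."""
    return [1 if count > 0 else 0 for count in defect_counts]


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a predictor of a logistic regression model."""
    return exp(coefficient)


def percent_change_in_odds(coefficient):
    """Percentage change in the odds of being defective per one-unit increase."""
    return (odds_ratio(coefficient) - 1.0) * 100.0
```

Under this reading, a coefficient of about log(1.08) corresponds to an odds ratio of 1.08, that is, roughly 8% higher odds of being defective for each additional unit of the predictor.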
We diagnose collinearity through variance inflation factor (VIF) analysis. 42 We used 2.5 as the cutoff value for the simple model and 10 for the interaction model where collinearity naturally occurs by default.
If a VIF value is greater than the cutoff value, the metric with the largest VIF is removed and the model rebuilt until all VIF values are less than the cutoff value.
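The iterative VIF-based removal can be sketched as below. This is an illustration rather than the SPSS procedure used in the study: it relies on the standard identity that the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix, and it assumes no predictor is constant or perfectly collinear with another (either case would cause a division by zero). The column names and function names are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix of the predictor columns (dict of name -> list of values)."""
    standardised = []
    for values in columns.values():
        mean = sum(values) / len(values)
        norm = sqrt(sum((v - mean) ** 2 for v in values))
        standardised.append([(v - mean) / norm for v in values])
    return [
        [sum(a * b for a, b in zip(p, q)) for q in standardised]
        for p in standardised
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return {name: inverse[i][i] for i, name in enumerate(columns)}


def drop_collinear(columns, cutoff=2.5):
    """Remove the predictor with the largest VIF until no VIF exceeds the cutoff."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vifs(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= cutoff:
            break
        del kept[worst]
    return list(kept)
```

Calling `drop_collinear(columns, cutoff=10)` would correspond to the looser threshold used for the interaction model.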

Analysis method for answering RQ2
We used box plots to determine differences between the modules where significant correlation was or was not observed. We drew box plots for the following measures: To check the role of defect types, we repeated the correlation analysis between EC and defect measures, but this time for each defect type.
We aimed to find defect types that were likely to be related to EC, and we checked the distribution of defect types for each module.  Figure 3A,B shows the distribution of values on the histogram for Company 2. The correlation values do not seem to be high but while interpreting these results, we need to consider that we are only analysing one factor among many, which can have a relationship with defects. From this perspective, having 59% of modules with significant correlation and low to moderate correlation strength is an important result.

RQ1: What is the relationship between EC and software defects? Correlation analysis results
If we compare the analysis results of the 2 companies, we observe that Company 2 has relatively fewer modules with high correlation values. The practices such as Agile and TDD used by Company 2 may have affected this result. Such practices may lead to lower coupling in systems. This result may also be due to the different architectures used by these 2 systems. Company 2 used the Model-View-Controller architectural pattern in its projects, which divides a software application into 3 interconnected parts, so as to separate internal representations of information from the ways that information is presented to, or accepted from, the user. In contrast, the architecture of the Company 1 systems is more ad hoc since these legacy systems have evolved over a long period. The organizational structure of the companies may also have an impact on the design and coupling of the systems analysed, as suggested by Conway's law. 43 However, this should be investigated further. We have also applied Spearman correlation analysis for basic process metrics such as number of commits, number of developers, and prior number of defects for comparison purposes. Table A3 summarises the results.
‡ This figure only shows the histogram of Spearman values for correlation between NoECF and NoD. The histogram for correlation between NoECFMR and NoD is not shown in the main text, as it is very similar to the former one. However, it can be seen in Figure A4 in the Appendix.

RQ1: What is the relationship between EC and software defects? Regression analysis results
After correlation analysis, we applied multivariate logistic regression to build models that indicate which files are likely to be defective. First, we built a logistic regression model using all terms and no interactions (Table 3).
Having built the model, we test for multicollinearity to find any independent variables that are correlated (Table 4). We assess the VIF. A VIF > 2.5 is considered problematic, requiring one or more variables to be removed. "NoECFMR" and "NoECF" are identified as being correlated, and therefore, we remove "NoECFMR" from the model (Table 5).
Multicollinearity analysis results and odds ratio (OR) § effect sizes after removing "NoECFMR" are provided in Table 6. The OR results suggest a rather low relation between EC and defects, although slightly higher than that of the number of commits.
Having identified individual variables that make a significant contribution to the logistic regression model, we built a model that includes interaction terms (Table 7) and identified terms that are correlated (Table 8). Again, the terms are highly likely to be correlated because we are using interaction terms; therefore, VIF > 10 is considered problematic (Table 8). Odds ratio effect sizes for this model are provided in Table 9.
Next, we built an interaction model without correlated terms and applied stepwise reduction to remove terms, which were not significant ( is slightly less than 1.0 indicating that as both increase together, the linear model is adjusted to marginally decrease the increasing propensity of the model to predict a file as being defective. To check the relationship between EC measures and defect proneness of files from a different perspective, we drew box plots for EC measures of files with and without defects. A separate box plot for each module was created, and for some of the modules, these can be seen in Figure A2 in the Appendix (1: represents files with defects and 2: represents files without any defects).
§ An odds ratio greater than 1.0 indicates that an increase in the variable will increase the propensity for the file to be defective.
We also performed manual analysis for some highly evolutionary coupled files and their defects to show how software defects were influenced by EC. In some defect instances, a highly evolutionary coupled file was changed, but this change was not propagated to all coupled files correctly. This was the root cause of the fault. There was no structural or dynamic coupling between these files. We also observed similar instances but across different modules managed by different teams. A change made in a module was not propagated to the evolutionary coupled modules. For some defect instances, a previous modification to a highly evolutionary coupled file caused some unanticipated behaviour in the coupled files. test. We also checked how balanced the set of modules were in defects with and without correlation, and they were mostly unbalanced. There are generally more files without defects than those with defects. We also analysed the relationship between module size and Spearman values for the correlation between EC (NoECF measure) and number of defects. The results can be seen in Figure A3 and Table A5 in the Appendix. The correlation analysis showed a significant negative correlation (p = .005 < 0.05 and ρ = −0.218) between module size and the ρ value.

RQ2: What factors explain why the relationship between EC and software defects is different for different modules? Defect type analysis results
The results of correlation analysis for each defect type are summarised in Table 12. Code Implementation has the highest correlation with EC, and moderate correlation was observed here. One interpretation is that developers tend to make coding errors while they work on source files that are highly evolutionary coupled, because they have to take into account more relations with more files when coding these files. For the defect types in the table, we observed low correlation, although they are significant. Defect types such as Acceptance Criteria and Analysis can be associated with external EC to other modules and applications. Involvement of more modules and applications may make analysis and defining acceptance test criteria more difficult. We can interpret the correlation with Test Implementation type in a similar way to Code Implementation. We have checked the defects of Not An Issue type with the project members. They explained that this defect type was generally used for deployment problems. The correlation between defects of this type and EC may be explained by the fact that deploying highly evolutionary coupled files and modules may be more error-prone, since more dependencies have to be considered and deployed together.

DISCUSSION
Our findings give insights to future researchers and practitioners on the effect of EC on defects.
(RQ1) What is the relationship between EC and software defects?
Our results suggest that there is, in general, a significant positive correlation between EC measures and defects. This finding is consistent with the general opinion that low coupling is an important principle to follow for a high-quality software design and that high coupling can be related to defects. 12,44,45 Fewer interconnections between elements reduce the chance that changes in one element cause problems in other elements. Fewer interconnections between elements are also reported to reduce programmer time. 46  equally for every module in a system, so EC use is not consistently helpful. We recommend that practitioners use EC for assessing the quality of their software design but also in conjunction with other module characteristics. That way, practitioners will get the best of both worlds.
TABLE 12 Spearman correlation analysis results between evolutionary coupling measures and different defect types (Appendix 9.1 provides more details of these defect types)

(RQ2) What factors explain why the relationship between EC and software defects is different for different modules?
We also tried to explain possible reasons for the different effects of EC on software defects. We considered this issue from 2 perspectives: module characteristics and defect types. We found that EC was less likely to have an effect on software defects for modules with fewer files and where fewer developers contributed. This may be explained by fewer defects being caused by EC in relatively small modules. Potentially, there are fewer interconnections between elements in a small module. Let n denote the number of files in a module. The potential number of interconnections in a module is calculated as $\frac{n(n-1)}{2} = \frac{n^2 - n}{2}$. Interconnections between files in a module can grow quadratically with the number of files. The more interrelated the files are, the more difficult these modules are to understand, change, and correct and thus the more complex the resulting software system. This may eventually lead to defects. An alternative explanation at least for the nondensity models would be that such files typically have fewer defects.
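A short worked example of this quadratic growth, assuming every pair of files in a module is a potential interconnection (the function name is hypothetical):

```python
def potential_interconnections(n):
    """Number of distinct file pairs in a module with n files: n * (n - 1) / 2."""
    return n * (n - 1) // 2
```

A module with 10 files has 45 potential pairs, whereas a module with 100 files has 4950, so a tenfold increase in files gives roughly a hundredfold increase in potential interconnections.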
We also recommend that practitioners add EC measures to their metric suite for software design evaluation. We recommend that researchers report process and size metrics of modules in their EC studies to account for the possible effect of context in their results. Furthermore, we found that EC may be more related to some defect types such as Code Implementation, Acceptance Criteria, and Test Implementation and less related to others such as Unexpected Functionality, Infrastructure Issues, Missing or Incomplete Data Migration, Incorrect Environment, and User Error.
We believe that defect types may be used to explain the contradictory findings reported by previous EC studies in the literature. The different systems and modules used in these studies have different defect types, and EC has different relationships with different defect types.
It is more likely that high EC will cause Code Implementation and Test Implementation defects, because a high number of changes must be made to related parts of the system when code with high EC is changed.
The locations of these related changes may be scattered within the application or even across applications in a software ecosystem; making related changes across these locations is likely to be challenging, and this can increase the cognitive load of developers. 48 Moreover, developers may miss some locations, which should be cochanged, and this may cause unforeseen code and test implementation problems. On the other hand, EC is unlikely to contribute to defects whose root cause is user error or infrastructure issues. If a module has defects caused mostly by user error or infrastructure issues, EC measures will not be useful for detecting defects and hot spots.

THREATS TO VALIDITY

Internal validity
In this study, we used CIs from the CMDB (for Company 1 only) attached to problem records and related requests (move to production, code review, move to test, etc) with which to match defects to source code files. Two assumptions were made at this stage:
1. Configuration items defined at CMDB correspond to source files changed in the scope of the resolution of a defect.
2. Configuration items of the source files changed in the scope of the resolution of a defect are linked to the problem record of the defect.
The validity of these 2 assumptions can be guaranteed for certain record types (move to production and code review), but in general, it cannot be guaranteed that all related source files are detected for each defect.
Another assumption is that developers commit source files changed in the scope of the same MR to the same package in the code repository. This assumption is used in the calculation of EC measures. We rely on the data collected from versioning systems, and any project that is not managed in the versioning system (or any file that is not committed to versioning systems) is not considered in our study.
The measures and defect types chosen for answering RQ2 are not exhaustive and do not cover all characteristics of a module and all defect types, which can exist. An exhaustive examination may have revealed other factors that have a greater effect on defects and which may be confounding our results. In our study, we investigated file size (LOC) as a possible confounding factor. We observed that code size correlated with number of defects in some modules. Defect density however had either no significant correlation or only minor negative correlation. Using defect density in our study mitigates the risk of size as a possible confounding factor. We are planning future investigations to explore the effect size of a large number of factors related to defects.

External validity
External validity relates to the generalisation of our study results. We only studied 2 industrial software systems. These systems may not be representative of the way developers develop systems more generally.
We mitigate this risk by choosing 2 systems from different domains and with different technologies. In future work, we would like to extend this study by including more commercial systems and projects.

CONCLUSIONS
In this paper, we presented a study on the relationship between EC and software defects in 2 large industrial software systems. We reported a positive correlation between EC and defect measures in the software maintenance/evolution phase of systems from 2 different companies.
Our results indicated low-level, moderate-level, and high-level correlation, with varied correlation strength across modules. Our regression analysis results indicated that EC measures could be useful for explaining defects.
The box plots drawn for each module separately showed the potential of EC measures to distinguish defective and nondefective files. We also observed that the company using practices such as Agile and TDD had relatively fewer modules with high EC-defect correlation values.
However, this finding needs to be further investigated on more companies for generalisable conclusions.
We also tried to understand the reasons for variation of the observed effect of EC on software defects for different modules. We found that modules that were small in file and developer numbers tended to show weaker correlation between EC and defects. Interconnections between files in a module can grow quadratically with the number of files. The more interrelated the files are, the more difficult these modules are to understand, change, and correct and thus the more complex the resulting software system. This complexity may eventually lead to defects, and this may be one of the reasons for variation across modules. Furthermore, we observed that EC measures showed higher correlation with some types of defects (based on root causes) such as code implementation, acceptance criteria, and analysis problems. The dispersion of these defect types could be another reason for these varying effects. Different modules have different defect types, and EC has different relationships with different defect types.
Module characteristics and defect types may also explain why differ-