Keywords:

  • domain-based coupling;
  • architectural dependences;
  • database dependences;
  • source code analysis;
  • program comprehension

SUMMARY

  1. Top of page
  2. SUMMARY
  3. INTRODUCTION
  4. DOMAIN-BASED COUPLING ANALYSIS
  5. DEPENDENCE ANALYSIS
  6. EVALUATION
  7. APPLICABILITY
  8. THREATS TO VALIDITY
  9. RELATED WORK
  10. CONCLUSION AND FUTURE WORK
  11. ACKNOWLEDGEMENTS
  12. REFERENCES
  13. Biographies

Software dependences play a vital role in program comprehension, change impact analysis and other software maintenance activities. Traditionally, these activities are supported by source code analysis; however, the source code is sometimes inaccessible or difficult to analyse, as in hybrid systems composed of source code in multiple languages using various paradigms (e.g. object-oriented programming and relational databases). Moreover, not all stakeholders have adequate knowledge to perform such analyses. For example, non-technical domain experts and consultants raise most maintenance requests; however, they cannot predict the cost and impact of the requested changes without the support of the developers. We propose a novel approach to predicting software dependences by exploiting the coupling present in domain-level information. Our approach is independent of the software implementation; hence, it can be used to approximate architectural dependences without access to the source code or the database. As such, it can be applied to hybrid systems with heterogeneous source code or legacy systems with missing source code. In addition, this approach is based solely on information visible and understandable to domain users; therefore, it can be efficiently used by domain experts without the support of software developers. We evaluate our approach with a case study on a large-scale enterprise system, in which we demonstrate how up to 65% of the source code dependences and 77% of the database dependences are predicted solely based on domain information. Copyright © 2013 John Wiley & Sons, Ltd.

INTRODUCTION

When software maintainers change a software entity, they have to search for other related entities and update them accordingly. This is not a trivial task, and many bugs are introduced by programmers who fail to properly propagate changes [1]. Knowledge of software dependences is vital to many change impact analysis methods and other maintenance activities [2-5].

Source code analysis can be used to trace dependences [6]; however, it is not an easy approach to apply in many situations. As software systems become more interoperable, it is common to see hybrid systems written in multiple programming languages (e.g. C++ and Python). It is often impractical to trace source code dependences within these systems using conventional code analysis tools that target a single language. A further difficulty in applying existing code analysis tools is that they require a level of technical expertise beyond that of typical programmers. Therefore, it is common practice in enterprise software environments for developers to trace dependences and change propagation by manually searching the source code.

A large majority of enterprise software systems are derived from domains where requirements are uncertain and are likely to change during the software's lifetime [7]. In these domains, the domain experts are the primary source of information for evaluating requirements [8]. These domain experts drive software evolution by continuously asking for new functionality or requesting changes to existing ones. Unfortunately, domain experts are in a poor position to estimate the impact of the changes that they request because they do not have inside knowledge of the software dependences.

Enterprise software systems are constructed to model business domains [7]. Therefore, it is reasonable to expect that real-world dependences be reflected in the software itself. Consequently, we hypothesise that software dependences can be predicted by exploiting domain information.

In this paper, we propose a novel approach to predicting software dependences based on the notion of domain-based coupling [9], which is derived from the domain-level relationships between software components. Although the proposed method returns the probability of dependences existing between components rather than the actual dependences, it offers software maintainers the following benefits:

  1. It is source code-independent, so it can be used where the software source code is not available or not supported by code analysis tools. For this reason, it can also assist in tracing inter-system dependences in hybrid systems with heterogeneous source code.
  2. It solely relies on domain information, thus allowing non-technical domain experts (e.g. consultants, subject matter experts and managers) to predict the impact of software changes without the support of developers. Such a prediction can assist software maintainers by improving the change management process.
  3. The proposed approach is based on the software domain-level model; hence, we envisage that this approach can be used to evaluate the complexity of software implementation with respect to the software domain-level relationships.

We evaluate our approach with a case study of a large-scale enterprise system, called ADempiere, where we demonstrate how domain information can be used to identify dependences in the source code and database layers. ADempiere is an enterprise resource planning (ERP) software package that integrates internal and external management information across an entire organisation. We have chosen this system as a case study, as it is a large, complex, multi-language system developed over many years, with a large user base. Our results show that we can approximate architectural dependences with more than 70% accuracy. In this study, we report how effectively domain-based coupling can assist software maintainers in the following scenarios:

  • Searching for source code dependences: Suppose a software maintainer has no access to source code analysis tools. Using software domain information, how accurately can she predict the existence of source code dependences between various parts of a software system?
  • Searching for database relationships: Some business constraints and relationships are defined and managed at the data layer. These relationships may or may not be visible at the source code level [2, 3], or they can be difficult to analyse, as in legacy databases. How accurately can a domain expert predict such relationships without analysing the database?
  • Searching for architectural dependences: When a domain expert needs to estimate the impact of a change to a user interface component (UIC) such as a data entry screen, she needs to predict which other components might be affected because of architectural dependences. How accurate can such a prediction be using solely domain information?

In summary, the contributions of this paper are as follows:

  • We refine our previously defined domain-based coupling [9], and we extend our previous method of selecting the highly coupled components with the help of an automated clustering technique.
  • We formally define architectural dependences and propose a model to trace dependences among source code, database and UICs.
  • We present an empirical study of one of the biggest open source enterprise systems, demonstrating how domain-based coupling can be used to predict source code and database dependences.

A shorter version of this paper has been published in the proceedings of the 18th Working Conference on Reverse Engineering [10]. This paper provides the following additional information: advanced details on the implementation of domain-based coupling (Section 2.4), details on the implementation of the source code-based dependence model (Section 3.4), an expanded evaluation section (Section 4), an analysis of the impact of granularity on the results (Section 4.10), and general considerations on the applicability of the domain-based coupling (Section 5). In addition, this paper extends the discussion of dependences in ADempiere (Section 3.5).

The rest of this paper is organised as follows: Section 2 describes the domain-based coupling analysis. Section 3 presents the dependence analysis. Section 4 demonstrates the evaluation results. Section 5 describes the applicability of the domain-based approach to various software types. Section 6 discusses the threats to the validity of our findings. Section 7 presents the related work, and finally, Section 8 concludes this paper with a discussion about future areas of investigation.

DOMAIN-BASED COUPLING ANALYSIS

Domain information can reveal relationships among UICs [11]. In this section, we describe how the domain-based coupling [9] derived from software domain information can be used to predict dependences between the UICs.

We use the following terminology when we talk about the domain model of a system:

  • A domain variable is a variable unit of data which has a clear identity at the domain level.
  • A domain function provides proactive or reactive domain-level behaviour of the system that includes at least one domain variable as an input or output.
  • A UIC is a system component that directly interacts with users and contains one or more domain functions.

For example, in a business software system, a data entry form is considered a UIC, the entry and editing of business information are domain functions, and the data fields visible on the form are domain variables.

Notations and definitions

Most of this section quotes our earlier works [9, 11] with the exception of new definitions of the number of common variables (Definition 2) and revised definitions of the domain-based coupling graph (Definition 3).

We adopt the following conventions in this work. For R, Q ⊆ A × A, we denote by R. Q their composition, that is, x. R. Q. y if ∃ z : x. R. z ∧ z. Q. y. We also denote by R⁻¹ the inverse of R and by ID the identity relation.

Moreover, we abbreviate x. R = {y | x. R. y}. We visualise relations as graphs, denoting by G = (V, E, l) the graph G with vertices V, edges E ⊆ V × V and labels l : E → L for some label set L.

If L is a finite set of relation labels, X is a set of relations over A, and l_R ∈ L is the name of R for any R ∈ X, then we define REL(A, X) to be the labelled directed graph REL(A, X) = (V, E, l) with V = A and E = ⋃_{R ∈ X} R, such that (v, v′) ∈ E and l(v, v′) = l_R if v. R. v′ for some R ∈ X.
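
The relational conventions above can be sketched in a few lines of Python, with a relation represented as a set of pairs. The function names (`compose`, `inverse`, `image`, `rel_graph`) and the toy relations are illustrative assumptions, not part of the original tooling.

```python
def compose(r, q):
    """R. Q: all (x, y) such that x. R. z and z. Q. y for some z."""
    return {(x, y) for (x, z) in r for (w, y) in q if z == w}


def inverse(r):
    """The inverse relation: every pair reversed."""
    return {(y, x) for (x, y) in r}


def image(x, r):
    """x. R = {y | x. R. y}."""
    return {b for (a, b) in r if a == x}


def rel_graph(vertices, relations):
    """REL(A, X) as (V, labelled edges); `relations` maps a label to a relation."""
    edges = {(v, w, label) for label, r in relations.items() for (v, w) in r}
    return set(vertices), edges


# Hypothetical toy relations over the carrier set {a, b, c, d}.
R = {("a", "b"), ("a", "c")}
Q = {("b", "d")}

print(compose(R, Q))   # {('a', 'd')}
print(image("a", R))   # the set containing 'b' and 'c'
```

Representing each relation as a plain set keeps composition and inversion close to their definitions; for large systems, an index from left-hand elements to their images would avoid the quadratic scan in `compose`.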

The three key element types are modelled as follows:

  • Domain variables are modelled by a finite set V, called variable symbols.
  • Domain functions are modelled by a finite set F, called function symbols, and the binary relation USE ⊆ F × V represents the relation between functions and variables as the input–output of the functions.
  • UICs are modelled by a finite set C called the component symbols, and HAS ⊆ C × F represents the relation between components and functions.

For the rest of the paper, and without loss of generality, we assume that the system under analysis (SUA) is fixed, that is, V, F and C are fixed and so are their REF, USE and HAS relations.

Definition 1. The conceptual connection relation CNC ⊆ C × C is defined by

  $$\mathit{CNC} = \mathit{HAS}.\,\mathit{USE}.\,\mathit{USE}^{-1}.\,\mathit{HAS}^{-1}$$

The domain-based coupling between two components is derived from shared domain variables, on the basis of the following measurements:

Definition 2. The number of common variables among two UICs is modelled by the function ϑ : C × C → ℝ where

  $$\vartheta(c, c') = |c.\,\mathit{HAS}.\,\mathit{USE} \cap c'.\,\mathit{HAS}.\,\mathit{USE}|$$

Note that the definition of common domain variables is symmetric, that is, ϑ(c,c′) = ϑ(c′,c).

Definition 3. The domain-based coupling graph of a SUA is the symmetric weighted graph G = (C, CNC \ ID, ω) where the coupling weight function ω : C × C → [0..1] is

  $$\omega(c, c') = \frac{\vartheta(c, c')}{|c.\,\mathit{HAS}.\,\mathit{USE} \cup c'.\,\mathit{HAS}.\,\mathit{USE}|}$$

It turns out that it is practically useful to weight domain relationships by their level of sharing domain variables. A threshold t can be used to select relevant coupling by their weight ω ≥ t. In the following examples, we demonstrate how to derive domain-based coupling from UICs of ADempiere and then how to approximate dependences from that coupling.
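
As a minimal sketch of how these definitions could be computed, the following Python fragment derives the coupling edges from the HAS and USE relations. It assumes that the coupling weight is the number of common variables divided by the number of variables used by either component, which is consistent with the arithmetic in Example 1. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def variables_of(component, has, use):
    """c. HAS. USE: every domain variable used by a function of the component."""
    return {v for f in has[component] for v in use[f]}


def coupling_graph(has, use, threshold=0.0):
    """Return {(c, c'): weight} for pairs of distinct, conceptually connected UICs."""
    variables = {c: variables_of(c, has, use) for c in has}
    edges = {}
    for c1, c2 in combinations(sorted(variables), 2):
        common = variables[c1] & variables[c2]
        if not common:
            continue  # no shared variable, so the pair is not in CNC
        weight = len(common) / len(variables[c1] | variables[c2])
        if weight >= threshold:
            edges[(c1, c2)] = weight
    return edges


# Hypothetical toy system: component -> functions, function -> variables.
HAS = {"c1": ["f1", "f2"], "c2": ["f3"], "c3": ["f4"]}
USE = {
    "f1": {"name", "tax_id"},
    "f2": {"address"},
    "f3": {"name", "address", "price"},
    "f4": {"colour"},
}

graph = coupling_graph(HAS, USE)
print(graph)  # {('c1', 'c2'): 0.5}
```

Because the graph is symmetric, each pair is stored once with its components in sorted order; passing a non-zero `threshold` keeps only the edges with ω ≥ t.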

Example 1

In ADempiere, Vendor Details (Figure 1) and Import Product are the UICs that we use in this example. Vendor Details (c1) has two domain functions, and in total, 25 domain variables, as follows:

  $$|c_1.\,\mathit{HAS}| = 2, \qquad |c_1.\,\mathit{HAS}.\,\mathit{USE}| = 25$$

Figure 1. ADempiere: the Vendor Details user interface component.

Import Product (c2) contains one domain function and 42 domain variables as follows:

  $$|c_2.\,\mathit{HAS}| = 1, \qquad |c_2.\,\mathit{HAS}.\,\mathit{USE}| = 42$$

There are 18 common domain variables between these UICs as follows:

  $$\vartheta(c_1, c_2) = |c_1.\,\mathit{HAS}.\,\mathit{USE} \cap c_2.\,\mathit{HAS}.\,\mathit{USE}| = 18$$

and in total, 49 (42 + 25 − 18) variables used by either of these UICs; thus,

  $$\omega(c_1, c_2) = \frac{18}{49} \approx 0.37$$

The next example demonstrates how to create a weighted graph from the CNC relations of Vendor Details.

Example 2

Now that we have explained the domain definitions, let us demonstrate how to use them for predicting dependences. Imagine a domain expert who considers asking for an enhancement to Vendor Details (c1). Then, given the domain information of ADempiere, she can derive common domain variables (ϑ) among c1 and other UICs similar to what was described in the previous example.

Figure 2 shows that there are 33 UICs for which the coupling weight with c1 satisfies a given threshold, ω ≥ 0.5. The threshold is applied to filter out weak couplings that are unlikely to correspond to any architectural dependences. This also reduces the density of the resulting domain-based coupling graph and makes it more readable. The results are illustrated (Figure 2) as a weighted graph where the edge width is proportional to ω and the edge length is proportional to 1/ω; that is, the stronger the coupling weight, the thicker the edge and the closer the node is to the centre (c1).

Figure 2. Vendor Details: domain-based coupling graph.

The three UICs closest to Vendor Details are Import Product (c2), Spare parts (c3) and Product Planning (c4), with coupling weights of 0.37, 0.32 and 0.25, respectively. Investigating the source code shows that all three UICs are connected to Vendor Details by source code dependences.

Implementation

The ADempiere user interface is composed of three major elements: data fields, tabs and windows. Each window is composed of one or more tabs, and each tab has multiple data fields and provides one or more domain functions.

Both windows and tabs provide one or more domain functions, and they interact with the end user; therefore, they are qualified as UICs. In the rest of this paper, we discuss the relationships between ADempiere UICs at the macro level, and we refer to a window as a UIC. It is only in Section 4.10 that we examine the micro level granularity of the UICs and discuss the impact of the granularity of UICs on the evaluation results.

In prior works [11, 9], the system functional specification document and user manuals have been used as the source of information about the software domain-level elements. Domain experts use these sources to derive the relationships between UICs and create the domain-based coupling graph based on the following manual process:

  • Step 1 – Identifying UICs: Any software component that interacts with the end user and has one or more domain functions is a UIC. There are multiple sources for deriving the list of UICs, including the system user manual, help documents and the software menu. The major UICs of an enterprise application are commonly accessible through the software menu. Although in most systems the visibility of the items in the software menu is limited by users' privileges, the complete list is often available to the system administrator. Therefore, the list of UICs can be derived from the working software menu using administrator privileges. This step is platform-independent, as web-based enterprise systems (e.g. online accounting systems, banking systems and facility management systems) often have menus similar to those of desktop applications.
  • Step 2 – Identifying related domain variables: For most enterprise systems, the domain variables are the data fields that are visible on the UICs. Domain experts review the functionality of UICs by interacting with the running software or reading the user manual and answering the following question for individual units of data: Is the data understandable purely with domain knowledge? The answer to this question will indicate whether a domain user who has no familiarity with the architecture and the source code of the given application can still understand the meaning and purpose of the given data within the domain. If a data field is related to a particular system behaviour such as Screen ID or Last Modified Record, then this is a system variable, and it will be excluded from the list of domain variables. The list of domain variables associated with each UIC can be recorded with a generic tool such as a spreadsheet, and the CNC relationships can then be derived automatically from this information using a script or a spreadsheet macro.
  • Step 3 – Creating the domain-based coupling graph: The aim of this step is to create a weighted graph that represents the strength of the CNC relationships between UICs and identify the clusters of highly coupled components. The nodes of the graph will be UICs, the edges will be CNC relationships between UICs and the weight of each edge is ω, the coupling weight function. There are a number of graph analysis tools that can be used to automatically analyse and visualise a weighted graph such as the open source network analysis tool Gephi [12].
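
As an illustration of Step 3, the sketch below writes a set of weighted coupling edges to a CSV edge list that a graph tool can load. The column names `Source`, `Target`, `Type` and `Weight` follow the layout commonly used for Gephi's spreadsheet import; the function name `write_edge_list` is hypothetical, and the single edge reuses the counts reported in Example 1.

```python
import csv
import io


def write_edge_list(edges, handle):
    """Write {(uic, other_uic): weight} as an undirected, weighted CSV edge list."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, "Undirected", f"{weight:.3f}"])


# One edge, using the 18 common and 49 total variables from Example 1.
edges = {("Import Product", "Vendor Details"): 18 / 49}

buffer = io.StringIO()
write_edge_list(edges, buffer)
print(buffer.getvalue())
```

Writing to a file instead only requires passing a handle opened with `open(path, "w", newline="")` in place of the in-memory buffer.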

Although the described process works for most enterprise systems, the labour required for domain experts to collect the domain information has been a drawback of this approach. Another source of domain information is the system database. In the case of ADempiere, the relationships between data fields, tabs and windows are stored in a part of the ADempiere database called the application dictionary. We took advantage of this part of the database to automatically derive the list of UICs and their associated domain variables using the following steps:

  1. Use a Structured Query Language (SQL) script to extract the list of windows from the application dictionary.
  2. Extract the list of data fields in ADempiere.
  3. Review these data fields and exclude the fields that do not contain domain information. The remainder are considered to be domain variables. For example, Tax Group, Bank Account, Asset Number are domain variables, whereas Help and Search Key are not domain variables, and we exclude them from the list. The result of our domain analysis yields 348 UICs and 2359 domain variables, leading to 18 451 pairwise CNC relationships.
  4. Use an SQL script to extract the relationships between UICs and domain variables from the application dictionary, and then transform these relationships into the domain-based coupling graph (Definition 3).

Please note that although we have used the application dictionary for this study, similar results can be derived from the three-step manual process without using the application dictionary.
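
The extraction steps above can be sketched with an in-memory SQLite database. The schema used here (`window_field` and `domain_field`) is a deliberately simplified, hypothetical stand-in and does not reproduce the actual layout of the ADempiere application dictionary. The field names are taken from the examples in step 3, but their assignment to the two windows is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: which data fields appear on which window.
    CREATE TABLE window_field (window_name TEXT, field_name TEXT);
    -- Fields kept after the manual review, i.e. the domain variables.
    CREATE TABLE domain_field (field_name TEXT PRIMARY KEY);

    INSERT INTO window_field VALUES
        ('Window A', 'Tax Group'),
        ('Window A', 'Bank Account'),
        ('Window A', 'Search Key'),
        ('Window B', 'Tax Group'),
        ('Window B', 'Asset Number'),
        ('Window B', 'Search Key'),
        ('Window B', 'Help');
    INSERT INTO domain_field VALUES
        ('Tax Group'), ('Bank Account'), ('Asset Number');
    """
)

# Count the domain variables shared by each pair of distinct windows.
rows = conn.execute(
    """
    SELECT a.window_name, b.window_name, COUNT(*) AS common_variables
    FROM window_field AS a
    JOIN window_field AS b
      ON a.field_name = b.field_name AND a.window_name < b.window_name
    JOIN domain_field AS d
      ON d.field_name = a.field_name
    GROUP BY a.window_name, b.window_name
    ORDER BY a.window_name, b.window_name
    """
).fetchall()

print(rows)  # [('Window A', 'Window B', 1)]
```

In this toy data, Search Key appears on both windows but is not counted, because it was excluded from the domain variables during the review.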

Expectation–maximisation clustering

In Section 2.3, we discussed using a threshold value for domain-based coupling to identify highly coupled components. Previously, the threshold value has been selected manually based on the system characteristics such as distribution of the coupling values, or by graph visualisation [11]. However, the manual approach is subject to human errors and does not scale for large data sets. To address this limitation, in this work, we use a clustering technique to automatically identify highly coupled components.

The aim of clustering is to group a given set of objects so that similar objects are grouped together and dissimilar objects are kept apart. There are many different multi-dimensional clustering techniques [13]. In this paper, we have used a statistical clustering technique called expectation–maximisation (EM) because it can automatically find the optimum number of clusters [14].

The main idea behind EM is to fit the parameters of a distribution model by using training data. The EM algorithm assigns a probability distribution to each instance of the number of common variables (ϑ), which indicates the probability of the instance belonging to each of the generated clusters. In this study, the training set is the same as the dataset, and there is no test dataset because this is an unsupervised technique. In Section 4.8, we demonstrate how EM clustering improves the precision of identifying dependences.
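
To make the idea concrete, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture over counts of common variables. It is a simplification and not the implementation used in this study: the number of clusters is fixed at two rather than selected automatically, the initialisation at the extreme values is an arbitrary choice, and the input counts are invented.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def em_two_clusters(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture.

    Returns (membership, means), where membership[i] is the probability that
    values[i] belongs to the cluster with the larger initial mean.
    """
    low, high = min(values), max(values)
    means = [float(low), float(high)]            # start at the extremes
    spread = max((high - low) ** 2 / 4, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    responsibilities = []
    for _ in range(iterations):
        # E-step: probability of each cluster given each value.
        responsibilities = []
        for x in values:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint) or 1e-300         # guard against underflow
            responsibilities.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the soft assignments.
        for k in range(2):
            mass = sum(r[k] for r in responsibilities)
            if mass < 1e-12:
                continue                         # leave an empty cluster unchanged
            weights[k] = mass / len(values)
            means[k] = sum(r[k] * x for r, x in zip(responsibilities, values)) / mass
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(responsibilities, values)) / mass,
                1e-6,                            # floor to avoid a degenerate cluster
            )
    return [r[1] for r in responsibilities], means


# Hypothetical numbers of common variables between one UIC and ten others.
counts = [1, 2, 1, 3, 2, 1, 18, 20, 17, 2]
membership, means = em_two_clusters(counts)
highly_coupled = [c for c, p in zip(counts, membership) if p > 0.5]
print(highly_coupled)
```

The soft memberships play the role of the per-instance probability distribution described above; selecting the instances with a membership above 0.5 replaces the manually chosen threshold.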

DEPENDENCE ANALYSIS

ADempiere has been designed in such a way that a developer can extend the system by touching as little code as possible. Whenever a new table is added to the database, the required Java code is automatically generated.

Most domain-level relations are managed at the data layer. As a consequence, traditional coupling metrics fail to capture the domain-level relationships between these classes. Moreover, the database contains important information about the architectural dependences in the system. We therefore need to develop a model that is capable of expressing dependences both at the source code and at the database layers.

In this section, we present two general models for representing a system and its architectural relationships based on the analysis of the source code and the database. We also explain how we populated our model in the particular case of ADempiere.

Source code dependences

The main wellspring of architectural relations is the source code. At the source code level, our analysis models three key entities and their associated relations. These entities are independent of the programming language, as long as it is object-oriented:

  • Classes are represented by a finite set CLS.
  • Attributes are represented by a finite set ATT. The binary relation F ⊆ CLS × ATT maps attributes to the containing classes.
  • Methods are represented by the finite set MET. The binary relation M ⊆ CLS × MET maps methods to the classes that contain them.

In addition, the relation R ⊆ MET × CLS expresses the return types of methods (NB: we allow Void ∈ CLS to model methods that return void), I ⊆ MET × MET represents method invocations and A ⊆ MET × ATT represents the accesses of methods to attributes. These relationships are illustrated in Figure 3.

Figure 3. Source code elements and relations.

Two classes cls, cls′ ∈ CLS can have the following relationships:

  $$\mathit{cls}.\,R^{-1}.\,M^{-1}.\,\mathit{cls}' \tag{1}$$
  $$\mathit{cls}.\,M.\,I.\,M^{-1}.\,\mathit{cls}' \tag{2}$$
  $$\mathit{cls}.\,M.\,A.\,F^{-1}.\,\mathit{cls}' \tag{3}$$

where Equation (1) shows that cls is the return type of a method of cls′, Equation (2) shows a method of cls that invokes a method of cls′, and Equation (3) shows a method of cls that accesses an attribute of cls′.

Definition 4. A direct relation between two classes is defined as D = {R⁻¹. M⁻¹, M. I. M⁻¹, M. A. F⁻¹}. For two classes cls, cls′ ∈ CLS, we denote that cls is directly dependent on cls′ by cls. D. cls′.

Definition 5. For two classes cls, cls′ ∈ CLS, we denote that cls is indirectly dependent on cls′ by cls. D. D⁻¹. cls′.
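
The class-level relations can be sketched in Python by composing the relations M, F, R, I and A as sets of pairs. In this sketch, the return-type relation is composed as R⁻¹ followed by M⁻¹ so that every composition maps classes to classes. The class and member names are hypothetical, and dropping self-pairs from the indirect relation is an added assumption.

```python
def compose(r, q):
    """R. Q: all (x, y) such that x. R. z and z. Q. y for some z."""
    return {(x, y) for (x, z) in r for (w, y) in q if z == w}


def inverse(r):
    return {(y, x) for (x, y) in r}


def class_dependences(M, F, R, I, A):
    """Return the (direct, indirect) relations between classes."""
    returns = compose(inverse(R), inverse(M))     # cls is returned by a method of cls'
    invokes = compose(compose(M, I), inverse(M))  # a method of cls invokes a method of cls'
    accesses = compose(compose(M, A), inverse(F)) # a method of cls accesses an attribute of cls'
    direct = returns | invokes | accesses
    # Indirect: both classes are directly related to a common third class.
    indirect = {(a, b) for (a, b) in compose(direct, inverse(direct)) if a != b}
    return direct, indirect


# Hypothetical toy model.
M = {("Order", "Order.total"), ("Invoice", "Invoice.print")}  # class -> method
F = {("Customer", "Customer.id")}                              # class -> attribute
R = {("Order.total", "Money")}                                 # method -> return type
I = {("Invoice.print", "Order.total")}                         # method invokes method
A = {("Order.total", "Customer.id")}                           # method accesses attribute

direct, indirect = class_dependences(M, F, R, I, A)
print(sorted(direct))
print(sorted(indirect))
```

Here Money and Invoice are indirectly related because both are directly related to Order.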

Database relationships

A significant part of a system's business logic is incorporated in the database relationships, and these relationships complement those that are visible at the source code level.

The main type of entity that we model at the database level is the table, and we denote the set of all the tables by TBL. The binary relation FK ⊆ TBL × TBL maps tables to tables based on the foreign keys. Figure 4 illustrates this relationship.

Figure 4. Database table with the foreign key relation.

As in the case of source code, we define both direct and indirect relationships in the database.

Definition 6. Given two tables t, t′ ∈ TBL, we say that t has a direct relation to t′ if and only if t. FK. t′.

Definition 7. Given two tables t, t′ ∈ TBL, we say that t has an indirect relation to t′ if and only if t. FK. FK⁻¹. t′.

Although foreign key relations among tables are there to model a specific aspect of the domain, indirect relations between tables should suggest how different concepts are bound together.

Architectural dependences

Two components are considered to be architecturally dependent either by direct or indirect dependences between the classes behind them, or by direct or indirect relationships between the tables accessed by these classes.

Figure 5 shows the relations between the components (C), classes (CLS) and tables (TBL) of ADempiere. These elements are related by DEP ⊆ C × CLS that represents classes that a UIC depends on, and REF ⊆ CLS × TBL that represents tables that a class reads or writes to.

Definition 8. For two components c, c′ ∈ C, we say that c has an architectural dependence to c′ if and only if they are in one or more of the following relationships:

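  $$c.\,\mathit{DEP}.\,\mathit{DEP}^{-1}.\,c' \tag{4}$$
  $$c.\,\mathit{DEP}.\,D.\,\mathit{DEP}^{-1}.\,c' \tag{5}$$
  $$c.\,\mathit{DEP}.\,D.\,D^{-1}.\,\mathit{DEP}^{-1}.\,c' \tag{6}$$
  $$c.\,\mathit{DEP}.\,\mathit{REF}.\,\mathit{REF}^{-1}.\,\mathit{DEP}^{-1}.\,c' \tag{7}$$
  $$c.\,\mathit{DEP}.\,\mathit{REF}.\,\mathit{FK}.\,\mathit{REF}^{-1}.\,\mathit{DEP}^{-1}.\,c' \tag{8}$$
  $$c.\,\mathit{DEP}.\,\mathit{REF}.\,\mathit{FK}.\,\mathit{FK}^{-1}.\,\mathit{REF}^{-1}.\,\mathit{DEP}^{-1}.\,c' \tag{9}$$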

Figure 5. Relationships between software elements.

This definition describes all direct and indirect dependences through classes or tables. Equation (4) defines a connection between two components based on shared classes. Equations (5) and (6) consider, respectively, direct and indirect dependences between classes to connect the components depending on them.

Equation (7) defines a connection between two components based on their shared database tables. Equations (8) and (9) consider, respectively, direct and indirect dependences between database tables that connect two components.
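
The six relationships of Definition 8 can be sketched as relation compositions. This is one plausible reading, assuming that each relationship chains DEP and REF (and their inverses) around the shared, direct or indirect class-level and table-level relations of Definitions 4 to 7. The toy data and the removal of self-pairs are illustrative assumptions.

```python
def compose(*relations):
    """Left-to-right composition of one or more relations (sets of pairs)."""
    result = relations[0]
    for q in relations[1:]:
        result = {(x, y) for (x, z) in result for (w, y) in q if z == w}
    return result


def inverse(r):
    return {(y, x) for (x, y) in r}


def architectural_dependences(dep, d, ref, fk):
    """Return the six component-level relations, keyed by a descriptive name."""
    dep_inv, ref_inv = inverse(dep), inverse(ref)
    relations = {
        "shared classes": compose(dep, dep_inv),
        "direct classes": compose(dep, d, dep_inv),
        "indirect classes": compose(dep, d, inverse(d), dep_inv),
        "shared tables": compose(dep, ref, ref_inv, dep_inv),
        "direct tables": compose(dep, ref, fk, ref_inv, dep_inv),
        "indirect tables": compose(dep, ref, fk, inverse(fk), ref_inv, dep_inv),
    }
    # Keep only pairs of distinct components.
    return {name: {(a, b) for (a, b) in rel if a != b} for name, rel in relations.items()}


# Hypothetical toy system.
DEP = {("c1", "A"), ("c2", "A"), ("c3", "B"), ("c4", "C")}  # component -> class
D = {("A", "B"), ("C", "B")}                                  # direct class dependences
REF = {("A", "t1"), ("B", "t2"), ("C", "t3")}                 # class -> table
FK = {("t1", "t2"), ("t3", "t2")}                             # foreign keys

deps = architectural_dependences(DEP, D, REF, FK)
for name, pairs in deps.items():
    print(name, sorted(pairs))
```

Two components are then architecturally dependent if the pair appears in at least one of the six sets.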

Implementation

The analysis of ADempiere has been performed using the Moose [15] platform for software and data analysis. One of the Moose core components is the FAMIX [16] meta-model, which describes the static structure of object-oriented software systems. This abstract representation contains all the elements composing an object-oriented system (i.e. classes, methods, attributes and namespaces) together with all the associations among them (i.e. inheritances, invocations and accesses). The source code dependences described in Section 3.1 are computed directly from the FAMIX core. Because ADempiere is not implemented purely in Java but also relies on a database, we first needed to extend FAMIX with a meta-model for relational databases in order to analyse this system. This extension is similar to the one proposed by Marinescu [17] but with more detailed relations between the software entities and the relational elements.

Figure 6 shows the subset of the extended meta-model where the new elements modelling relational databases are indicated in bold.

Figure 6. Extended version of the FAMIX meta-model including a meta-model for relational databases.

The entities modelling relational databases are self-explanatory; more interesting are the relations between the meta-model extensions and the meta-model for object-oriented systems:

  • The relation map connects a class to the table that it represents at the source code level (e.g. Enterprise Entity beans); the same relation connects a class attribute to a table column.
  • The relation access represents class methods accessing database tables. The access can be made directly (e.g. using the java.sql package) or through a framework (e.g. Hibernate).
  • The relation reference represents connections among table columns established using a foreign key constraint.

These modifications to the FAMIX meta-model have been implemented in MooseJEE [18], an extension of the Moose [15] software analysis platform.

The entities and relations of the meta-model just described are generic and independent from any kind of platform or software to analyse. What is platform-related is the place where the information we need is stored. Consequently, the fact extractor may need to be changed from one system to another.

Dependence analysis in ADempiere

The first step we took to analyse ADempiere was to import the code and the database into the unified meta-model described in Section 3.4. Once the two models were populated, we needed to extract the mapping between UIC and classes, and between classes and tables.

In the case of ADempiere, the mappings between the UIC and the classes can be found in the Application Dictionary as discussed in Section 2.4. The Application Dictionary is used by a code generator to build the application structure and the relations among the various application components. So, the relations between the source code elements and the UI elements are explicitly encoded in the database. We extracted this information by querying the database.

The mapping between the database tables and the source code is based on a naming convention. For each domain-related table specified in the Application Dictionary, a Java class named ‘X_tableName’ and a Java interface named ‘I_tableName’ are generated by ADempiere. For example, for the table called A_Asset_Acct, a class called X_A_Asset_Acct and an interface called I_A_Asset_Acct will be generated.

By creating this mapping between classes and tables, we determined that 76 tables in the database are not related to any class. Of these, 65 are used for language localisation, nine are materialised views and two are used for logging purposes.
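
The naming convention lends itself to a short sketch. The function names below are hypothetical, the match is assumed to be exact and case-sensitive, and `Some_Log_Table` is an invented table name used only to show an unmapped table.

```python
from typing import Dict, Iterable, Set, Tuple


def generated_names(table_name: str) -> Tuple[str, str]:
    """Return the (class, interface) names expected for a table."""
    return "X_" + table_name, "I_" + table_name


def map_tables_to_classes(
    tables: Iterable[str], classes: Iterable[str]
) -> Tuple[Dict[str, str], Set[str]]:
    """Map each table to its generated class; collect tables without a class."""
    known_classes = set(classes)
    mapping, unmapped = {}, set()
    for table in tables:
        class_name, _interface_name = generated_names(table)
        if class_name in known_classes:
            mapping[table] = class_name
        else:
            unmapped.add(table)
    return mapping, unmapped


tables = ["A_Asset_Acct", "Some_Log_Table"]       # second name is invented
classes = ["X_A_Asset_Acct", "I_A_Asset_Acct"]

mapping, unmapped = map_tables_to_classes(tables, classes)
print(mapping)   # {'A_Asset_Acct': 'X_A_Asset_Acct'}
print(unmapped)  # {'Some_Log_Table'}
```

Tables that end up in `unmapped` correspond to those with no related class, such as the localisation tables, materialised views and logging tables mentioned above.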

The architecture of ADempiere only contains one-to-one mappings; hence, the relation map between classes and tables in our meta-model can represent all the relations we need. Other kinds of mappings should be modelled differently. Once we had these mappings, we were able to compute the dependences between the components based on architectural relationships.

EVALUATION

In this section, we provide empirical evidence of the usefulness of domain-based coupling in approximating architectural dependences. We answer the following questions in this evaluation:

  • RQ1: How accurately can we predict dependences between UICs at the source code level?
  • RQ2: How accurately can we predict relationships between UICs at the database layer?
  • RQ3: How accurately can we predict architectural dependences between UICs?
  • RQ4: What is the impact of granularity of UICs on the prediction results?

Note that all the predictions in this evaluation are performed based on only domain information and domain-based coupling between UICs.

Evaluation setup

For a given UIC, c ∈ C, we test the query q = (c, E, AN), where the expected outcome E ⊆ C is the set of UICs that have architectural dependences to c, and the returned answer AN ⊆ C is the set of UICs that are coupled with c at the domain level. We describe the outcome of such a query as follows:

  • TP  =  |E ∩ AN| shows the number of correctly identified dependent components.
  • TN  =  |C \ {AN ∪ E}| shows the number of correctly identified independent components.
  • FP  =  |AN \ E| shows the number of incorrectly predicted dependent components.
  • FN  =  |E \ AN| shows the number of incorrectly predicted independent components.
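The four counts can be computed directly with set operations. The sketch below follows the definitions as stated; the function name and the example sets are hypothetical.

```python
def confusion_counts(components, expected, answer):
    """Return (TP, TN, FP, FN) for one query, given C, E and AN as sets."""
    components, expected, answer = set(components), set(expected), set(answer)
    tp = len(expected & answer)                    # |E ∩ AN|
    tn = len(components - (answer | expected))     # |C \ (AN ∪ E)|
    fp = len(answer - expected)                    # |AN \ E|
    fn = len(expected - answer)                    # |E \ AN|
    return tp, tn, fp, fn


# Hypothetical query: five components, two expected, two returned.
tp, tn, fp, fn = confusion_counts(
    components={"a", "b", "c", "d", "e"},
    expected={"b", "c"},
    answer={"c", "d"},
)
```

As in the formula for TN, the queried component itself is counted as a true negative unless it appears in E or AN.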

For the SUA, we measure the percentage of the queries with at least one correct answer using the feedback (FB) metric:

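  • $FB = \frac{|\{c \in C : TP_c > 0\}|}{|C|}$, where $TP_c$ denotes the number of true positives of the query for component $c$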

We use the well-known definitions of precision (Pq) and recall (Rq) to evaluate the outcomes of a given query:

  • $P_q = \frac{TP}{TP + FP}, \qquad R_q = \frac{TP}{TP + FN}$

In addition, we report on the F-measure (F1), which is the harmonic mean of the precision and recall:

  • $F1 = \frac{2 \cdot P_q \cdot R_q}{P_q + R_q}$

Precision and recall only evaluate TP; to describe both TP and TN, we measure accuracy (Aq), which is the degree of closeness of results to the preferable values where all dependent and independent components are correctly identified. The higher the accuracy, the closer the prediction outcomes are to the perfect results where both FP and FN are equal to zero. Accuracy [19] is defined as follows:

  • $A_q = \frac{TP + TN}{TP + TN + FP + FN}$
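The per-query metrics can be sketched as small functions. The handling of empty sets is partly an assumption: the Discussion states that an empty expected set yields a recall of 1, whereas scoring an empty answer set as precision 1 is assumed here for symmetry. The example uses the counts reported for the Financial Report query in the Discussion.

```python
def precision(tp, fp):
    # Assumption: an empty answer set (TP + FP = 0) is scored as 1.0.
    return tp / (tp + fp) if tp + fp else 1.0


def recall(tp, fn):
    # An empty expected set (TP + FN = 0) is scored as 1.0, as in the Discussion.
    return tp / (tp + fn) if tp + fn else 1.0


def f1(p, r):
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    return 2 * p * r / (p + r) if p + r else 0.0


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Financial Report query: 7 TP, 53 FP, 1 FN, 286 TN.
p = precision(7, 53)
r = recall(7, 1)
f = f1(p, r)
a = accuracy(7, 286, 53, 1)
```

Rounded to two decimals, `p` and `a` reproduce the reported precision of 0.12 and accuracy of 0.84.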

Case study: ADempiere

We scouted the open-source software landscape for a suitable open-source system to use as a case study for our analysis. After considering several candidates, we eventually settled on ADempiere, an ERP software package. The qualities that persuaded us to choose ADempiere for our case study are:

  • Well-defined business domain: An ERP system integrates internal and external management information across an entire organisation, embracing accounting, manufacturing, sales and service, and so on. Such a system manifests a strict separation between the expertise of the stakeholders and developers. This is the type of software that benefits most from domain-based coupling analysis.
  • Tiered architecture: The system manifests a clear separation between the different architectural tiers. The system has a rich set of UICs and four distinct front-ends from which the user can choose, including a Java GUI and three web interfaces. The system heavily uses relational database management systems (e.g. PostgreSQL and Oracle) for data storage as well as for storing business logic.
  • Evolving and active system: The ADempiere project traces its evolution back more than a decade. Created in September 2006 as a fork of the Compiere open-source ERP, itself founded in 1999, ADempiere soon reached the top five of the SourceForge.net enterprise software rankings. At the time of this publication, it ranks first among that top five. This is a measure of both the size of its developer community and its impact on the ERP software market.
  • Large-scale and complex design: The system represents cutting-edge open-source software technology. It is a multi-language system that aggregates more than two million lines of code. The core part is written in Java and contains more than 3531 classes with more than half a million lines of code. Figure 7 presents a high-level architectural view of the Java core of ADempiere as obtained with the architecture recovery tool Softwarenaut [20]. The view is obtained by aggregating the direct call and inheritance relationships in the system up along the package hierarchy. The area of every visible module is proportional to its number of lines of code. Every visible dependence is directed and has its width proportional to the number of abstracted low-level dependences. Every module is represented similarly to a treemap, with the sizes of the contained classes and modules proportional to their size in lines of code.
  • Active developers and users community: The system has a very active associated community: often, the mailing list has more than 800 messages per month, and the SourceForge.net page shows that ADempiere is downloaded more than 15 000 times per month. The system is used by a large number of companies around the world.

Figure 7. ADempiere architecture: high level view.

For all these reasons, we deem ADempiere to be relevant and representative of enterprise systems and of the state of the art in open-source software at the time of writing, and appropriate for our analysis.

Macro evaluation

To evaluate the results for all UICs in ADempiere, we take the mean value of measurements of all queries as

  • $f_M = \frac{1}{|C|} \sum_{c \in C} f(q_c)$

where q_c is the query for component c and f is one of these measurement functions: TP, TN, FP, FN, R, P, F1 or A.

Likelihood

One application of domain-based coupling might be notifying software maintainers of possible dependent components when they browse a list of UICs. To assess the usefulness of such notifications, we measure the likelihood (L) that at least one of the top 3, 5 or 10 returned results has an architectural dependence. More formally, if ANc,n denotes the top n results for a component c and Ec the expected outcome for c, then

  • $L_n = \frac{|\{c \in C : AN_{c,n} \cap E_c \neq \emptyset\}|}{|C|}$

The likelihood function enables us to distinguish between the topmost results and the entire returned result set.
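The two aggregate measures, feedback and likelihood, can be sketched as follows. The sketch assumes that each answer is a list ranked by decreasing coupling strength, so that the top n results are simply the first n entries; the function names and data are hypothetical.

```python
def likelihood(expected, ranked_answers, n=None):
    """Fraction of queries with at least one expected UIC among the top n results.

    With n=None the whole answer is considered, which corresponds to the
    feedback (FB) measure.
    """
    hits = sum(
        1
        for c, ranked in ranked_answers.items()
        if set(ranked if n is None else ranked[:n]) & expected.get(c, set())
    )
    return hits / len(ranked_answers)


def feedback(expected, ranked_answers):
    """Fraction of queries that return at least one correct answer."""
    return likelihood(expected, ranked_answers, n=None)


# Hypothetical expected outcomes and ranked answers for three UICs.
expected = {"a": {"b"}, "b": {"a", "c"}, "c": set()}
ranked = {"a": ["c", "b"], "b": ["a"], "c": ["a"]}
l1 = likelihood(expected, ranked, 1)
l2 = likelihood(expected, ranked, 2)
fb = feedback(expected, ranked)
```

In this toy example, only the query for `b` has a correct answer in first position, while the query for `c` can never succeed because its expected set is empty.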

Results: searching for source code dependences

In this section, we investigate the first research question: How accurately can we predict dependences between UICs at the source code level? ADempiere contains 348 UICs. The source code analysis revealed 16 450 indirect dependences and no direct dependences among classes behind these UICs. For any given UIC, we queried the connected UICs by source code dependences and compared the results with the domain-based coupling graph.

Figure 8 shows the histogram of the queries' outcomes. As presented in Figure 8(a), the large majority of the query results contain few false negatives; for example, 109 queries have fewer than three false negatives. A comparison between the histograms of recall and precision shows that the domain-based coupling makes conservative predictions, that is, for most of the queries, there is a trade-off between recall and precision in favour of recall. Moreover, Figure 8(c) shows the high accuracy of the results for most queries. Such predictions are particularly useful for change impact analysis, where high recall provides confidence about the prediction of the scope of change propagation. The feedback is 0.93, which shows that for 93% of queries, domain-based coupling returned at least one correct answer.

Figure 8. Source code dependencies: histogram of the queries' outcome.

For example, Financial Report is a UIC in ADempiere with source code dependences to eight other UICs. The query for this UIC returned 60 UICs, including seven true positive and 53 false positive results, leading to a precision of 0.12. Given that there are 348 UICs in ADempiere, the accuracy of the results is 0.84. This high accuracy indicates that the query results allow a software maintainer to focus on a limited number of UICs (60 UICs rather than 348 UICs).

On average, for a given UIC, 31 connected UICs by source code dependences are identified correctly, whereas 18 UICs with source code dependences are incorrectly described as independent components, and 77 independent UICs are falsely identified to have source code dependences. These results lead to average recall equal to 0.63, average precision equal to 0.30 and average F-measure of 0.29.

The accuracy of the prediction is equal to 0.73, implying that for more than 7 out of 10 UIC pairs, our prediction method correctly identified whether the two UICs are dependent or independent at the source code level. The likelihood of discovering source code dependences in the top three coupled UICs is 72%, and it increases to 83% for the top 10 UICs.

In any given system, it is expected to find some independent UICs, that is, E = ∅. The queries for these UICs might distort the average results. In ADempiere, we identified 16 UICs that have no source code dependences to any other UICs. We filtered out these queries and measured the average results for the rest of the queries. The comparison between the average results for all queries and the filtered queries (Table 1) shows a minor decrease in recall and almost no change in average precision, F-measure and accuracy. Moreover, the feedback only increases from 0.93 to 0.96, and the likelihood of finding source code dependences among the top 3, 5 and 10 results increases slightly, by two or three percentage points. Therefore, we conclude that the queries with E = ∅ have little impact on the overall prediction results.

Summary. On average, 63% of UICs connected by source code dependences have been identified correctly, whereas for 83% of queries, the top 10 results contain one or more source code dependences.

Table 1. Source code dependencies: evaluation results.

Metric   All            Filtered       Metric   All            Filtered
TP_M     31 ± 36.84     31 ± 36.84     A_M      0.73 ± 0.13    0.73 ± 0.13
FN_M     18 ± 23.05     18 ± 23.05     F1_M     0.29 ± 0.22    0.29 ± 0.21
FP_M     77 ± 48.75     77 ± 48.75     FB       0.93           0.96
TN_M     222 ± 69.42    222 ± 69.42    L_3      0.72           0.74
P_M      0.30 ± 0.28    0.30 ± 0.28    L_5      0.77           0.80
R_M      0.63 ± 0.30    0.63 ± 0.30    L_10     0.83           0.86

All: complete queries for all ADempiere's user interface components. Filtered: only queries with |AN| > 0.

Results: searching for database relationships

In this section, we address the second research question: How accurately can we predict relationships between UICs at the database layer? The database analysis of ADempiere showed that there are 7986 direct and 11 894 indirect relationships among data tables behind UICs. We queried these relationships using the domain-based coupling, and the results are presented in Table 2. The feedback of these queries is 0.91 and 0.93 for direct and indirect relationships, respectively.

Table 2. Database relationships: evaluation results.

Direct relationships
Metric   All            Filtered       Metric   All            Filtered
TP_M     19 ± 25.96     19 ± 26.19     A_M      0.74 ± 0.15    0.74 ± 0.15
FN_M     4 ± 9.61       4 ± 9.76       F1_M     0.23 ± 0.21    0.23 ± 0.20
FP_M     87 ± 53.18     88 ± 52.45     FB       0.91           0.96
TN_M     238 ± 66.79    236 ± 66.30    L_3      0.58           0.59
P_M      0.20 ± 0.25    0.2 ± 0.24     L_5      0.66           0.68
R_M      0.77 ± 0.30    0.8 ± 0.30     L_10     0.75           0.77

Indirect relationships
Metric   All            Filtered       Metric   All            Filtered
TP_M     21 ± 29.95     28 ± 31.60     A_M      0.72 ± 0.15    0.72 ± 0.14
FN_M     13 ± 18.81     17 ± 19.97     F1_M     0.22 ± 0.23    0.28 ± 0.20
FP_M     85 ± 53.51     81 ± 51.12     FB       0.93           0.95
TN_M     229 ± 68.46    222 ± 69.82    L_3      0.51           0.66
P_M      0.22 ± 0.27    0.3 ± 0.26     L_5      0.56           0.73
R_M      0.71 ± 0.32    0.6 ± 0.31     L_10     0.62           0.81

All: complete queries for all ADempiere's user interface components. Filtered: only queries with |AN| > 0.

On average, for a given UIC, 19 directly related UICs and 21 indirectly related UICs are identified correctly. The results show an average of only four false negatives for direct relationships, less than a third of the 13 false negatives identified for indirect relationships. However, the number of false positives is similar: 87 and 85 for direct and indirect relationships, respectively.

Comparing the results between direct and indirect relationships shows that for direct relationships, the recall is slightly higher (0.77 versus 0.71), whereas the precision is slightly lower (0.20 versus 0.22). The accuracy values for both relationship types are higher than 0.7, which means that the relationship state of 7 out of 10 UIC pairs is identified correctly. In addition, validating the topmost results shows that the likelihood of database relationships in the top three results is 58% for direct and 51% for indirect relationships. For the top 10 results, the likelihood increases to 75% for direct and 62% for indirect relationships (Table 2).

For example, Financial Report (a UIC in ADempiere) has nine direct database relationships to other UICs. Our query returns 60 UICs, which include eight UICs with database dependences to Financial Report. The recall computed from these results equals 0.89, and the precision equals 0.13. Although the precision is low, the query results include 287 true negatives, which leads to an accuracy of 0.85. Such a high accuracy can help software maintainers to exclude independent components from change impact analysis.
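The figures in this example can be checked by hand. In the worked calculation below, FP and FN are not stated in the text; they are derived here from the size of the answer (60), the number of expected relationships (9) and the eight true positives.

```latex
\begin{align*}
FP &= |AN| - TP = 60 - 8 = 52, & FN &= |E| - TP = 9 - 8 = 1,\\
R &= \frac{8}{8 + 1} \approx 0.89, & P &= \frac{8}{8 + 52} \approx 0.13,\\
A &= \frac{8 + 287}{8 + 287 + 52 + 1} = \frac{295}{348} \approx 0.85.
\end{align*}
```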

Figure 9 shows the histogram of precision, recall and accuracy of queries for direct and indirect database dependences. As illustrated in Figure 9(a) and (b), there is a noticeable number of queries with recall equal to one and precision close to zero, that is, their results contain no false negatives and some false positives. Qualitative analysis of the results shows that the recall of 84 queries for direct database relationships equals one, of which 13 queries have no expected answer, that is, E = ∅. Figure 9(d) and (e) show a similar but accentuated pattern. The recall of 116 queries for indirect database relationships equals one, of which 86 queries have no expected answers. Such queries might distort the results. We measured their impact on the average results by filtering them out from the result set, as presented in Table 2. The comparison shows minor increases in the average true positives and false negatives, and minor decreases in false positives and true negatives. The impact of these changes on precision and recall for direct dependences is not noticeable; however, the precision and the F-measure of the indirect dependences slightly increase, whereas their recall slightly decreases.

Summary. On average, more than 71% of database relationships can be derived from domain information, and the likelihood of finding database dependences among the top 10 results is up to 75%.

Figure 9. Database dependencies: histogram of the queries' outcome.

Results: searching for architectural dependences

In this section, we address the third research question: How accurately can we predict architectural dependences between UICs? The analysis of the source code and the database of ADempiere shows 17 279 architectural dependences (Definition 8). We evaluated how accurately a domain expert can predict whether there is at least one architectural dependence between any given pair of UICs. Figure 10 shows the precision, recall, F-measure and accuracy of the query results based on the number of UICs in the answer returned by the query, that is, the x-axis shows |AN| = TP + FP.

Figure 10. Architectural dependencies: scatter chart of evaluation metrics based on the size of the result set.

As illustrated in Figure 10(a), the recall increases markedly with the size of the result set; however, the precision does not follow the same pattern (Figure 10(b)). Therefore, the average F-measure (Figure 10(c)) increases only slightly for queries with a larger number of results. In contrast, Figure 10(d) shows a clear negative relationship between the size of the result set returned by a query and the accuracy of its results. We can conclude that in ADempiere, UICs with less domain-based coupling to other UICs are more likely to be independent components at the architectural level.

The feedback of the queries is 0.93, that is, for 93% of queries, the domain-based coupling returned at least one correct answer. As presented in Table 3, on average for a given UIC, 31 dependent UICs and 223 independent UICs are identified correctly using domain information. However, 19 dependent and 75 independent UICs are incorrectly placed in the opposite dependence state. These results lead to an average recall of 0.63 and precision of 0.31. The average accuracy of the predictions is 0.73, which shows that the dependence state of 7 in 10 UIC pairs is identified correctly. In addition, the likelihood of discovering an architecturally dependent UIC pair in the top three results is 72%. This likelihood increases to 84% for the top 10 results.

Summary. On average, 63% of architecturally dependent UICs are discovered using domain information, and the likelihood of discovering a correct architectural dependence in the top 10 predictions is 84%.

Table 3. Architectural dependencies: evaluation results.

Metric   Value          Metric   Value          Metric   Value
TP_M     31 ± 36.91     P_M      0.31 ± 0.29    FB       0.93
FN_M     19 ± 23.91     R_M      0.63 ± 0.30    L_3      0.72
FP_M     75 ± 49.53     F1_M     0.30 ± 0.22    L_5      0.78
TN_M     223 ± 70.19    A_M      0.73 ± 0.13    L_10     0.84

Improving precision

The prediction results for architectural dependences (Table 3) show that the average precision is 0.31. To improve the precision, we utilised the EM technique (Section 2.5) to filter out weakly coupled pairs, with the assumption that UICs with strong domain-based coupling are more likely to have architectural dependences.
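The idea of separating weak from strong coupling with EM can be sketched as below. The authors' actual procedure is the one defined in Section 2.5 and is not reproduced here; this sketch assumes a two-component, one-dimensional Gaussian mixture fitted to coupling weights, and the function names, pair labels and weights are all hypothetical.

```python
import math


def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1D Gaussian mixture with EM.

    Returns, for each point, the responsibility of the component with the
    larger mean (interpreted here as the 'strong coupling' cluster).
    """
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]                                   # start the means at the extremes
    var = [((hi - lo) / 4) ** 2 or 1e-6] * 2
    mix = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                mix[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate mixing weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mix[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6
            )
    strong = 0 if mu[0] > mu[1] else 1
    return [r[strong] for r in resp]


def strong_pairs(weights, threshold=0.5):
    """Keep the UIC pairs assigned to the strongly coupled cluster."""
    pairs = list(weights)
    resp = em_two_gaussians([weights[p] for p in pairs])
    return {p for p, r in zip(pairs, resp) if r > threshold}


# Hypothetical coupling weights between UIC pairs.
weights = {
    ("A", "B"): 0.05, ("A", "C"): 0.10, ("B", "C"): 0.08, ("B", "D"): 0.12,
    ("A", "D"): 0.90, ("C", "D"): 0.85, ("C", "E"): 0.95,
}
kept = strong_pairs(weights)
```

Discarding the weakly coupled cluster shrinks each answer set, which is the mechanism behind the trade-off between precision and recall reported for this technique.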

Table 4 shows the improved results. The mean precision for architectural dependences is increased from 0.31 to 0.7, and the mean accuracy is increased from 0.73 to 0.87. However, these improvements are achieved at the expense of a reduction in recall. Although the value of precision is more than doubled, the value of recall decreased by almost a factor of three (from 0.63 to 0.23). This implies that there are a number of architectural dependences between UICs that have no strong coupling at the domain level.

Summary. By using the EM technique, precision can be improved up to 0.7. However, this improvement is a trade-off between precision and recall.

Table 4. Prediction results using expectation–maximisation clustering.

                                   R_M     P_M     A_M
Source code dependences            0.29    0.68    0.88
Direct database relationships      0.40    0.57    0.89
Indirect database relationships    0.27    0.61    0.93
Architectural dependences          0.23    0.70    0.87

Results: visual comparison

In this section, we visually compare the results obtained with the domain-based coupling against the actual architectural dependences between software elements. The visualisation in Figure 11 provides the reader with a graphical answer to the third research question: How accurately can we predict architectural dependences between UICs?

Figure 11. Domain-based coupling versus architectural dependencies.

The domain-based coupling graph (Figure 11(a)) is visualised using Fruchterman and Reingold's [21] force-based graph layout in three steps: first, the graph is created based on Definition 3; second, the EM technique (Section 2.5) is applied; third, the derived graph is visualised by the force-based layout algorithm.

To compare the domain-based coupling graph with the architectural dependences, the edges from Figure 11(a) are replaced with the architectural dependences without changing the location of nodes. The resulting graph (Figure 11(b)) illustrates the distribution of the architectural dependences in comparison with the domain-based coupling.

The comparison between Figure 11(a) and (b) shows that the most populated cluster (tagged by A) in the domain-based coupling graph has the largest number of architectural dependences. However, the number of architectural dependences decreases in the clusters with poor domain-based coupling (B, C and D). In addition, there are a number of architectural dependences where there is no domain-based coupling, illustrating that not all dependences can be derived from the domain-based coupling graph.

Results: impact of granularity

In this section, we address the fourth research question: What is the impact of granularity of UICs on the prediction results?

In Section 2.4, we described two granularity levels for UICs in ADempiere. The coarse-grained UICs are windows, and each window is composed of fine-grained UICs called tabs. We evaluate the impact of granularity by repeating the queries from Sections 4.5, 4.6 and 4.7 for the fine-grained UICs. ADempiere is composed of 889 tabs, which is more than twice the number of windows (348). Accordingly, the number of architectural dependences between tabs is 54 030, which is more than three times the 17 279 architectural dependences between windows.
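The two ratios quoted above follow directly from the reported counts:

```latex
\[
\frac{889}{348} \approx 2.55, \qquad \frac{54\,030}{17\,279} \approx 3.13
\]
```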

This increase in the number of dependences has a notable impact on the prediction results. Figure 12 shows that the true negatives for the fine-grained UICs are improved by more than 200% in comparison with coarse-grained UICs. Consequently, the overall accuracy of the queries is increased by 14% to 20%. However, the average number of false negatives for code dependences and database relationships for the fine-grained UICs is more than twice that for coarse-grained UICs. Moreover, the average number of false positives is increased by 20% to 28%. As a result of the increase in the false positives and false negatives, the average recall and precision are reduced by 11% to 50%. Furthermore, the likelihood of finding dependences among the top 10 coupled UICs is reduced by 18% to 46%.

Figure 12. Impact of granularity on prediction results.

These results suggest that the proposed approach provides a better outcome for coarse-grained UICs. The overall accuracy increases for the fine-grained UICs, but the noticeable decrease in the precision and recall might discourage software maintainers from using this method at that level.

Summary. Domain-based coupling provides a more precise prediction of dependences between coarse-grained UICs than fine-grained UICs.

Discussion

The reverse engineering of ADempiere revealed 16 450 source code dependences, and 7986 direct and 11 894 indirect database relationships among ADempiere UICs. We used the FAMIX meta-model (Section 3.4) and Definition 8 to identify how UICs are architecturally connected via the database and source code dependences. The results show 17 279 architectural dependences between ADempiere UICs.

In this evaluation, we queried these dependences using domain-based coupling. The results show that for more than 90% of queries, the domain-based coupling returns at least one correct answer. The average recall for these queries is more than 0.6 for both source code dependences and database relationships, whereas the precision and F-measure are lower than 0.4. Although the precision is not strong, the average accuracy of the queries is higher than 0.7. The accuracy reflects both true positives and true negatives, and where there are many components in a system, the number of true negatives is important for maintenance activities and particularly for change propagation analysis.

For example, suppose a domain expert needs to estimate the impact of a change to Financial Report, a UIC in ADempiere. The domain-based coupling graph of ADempiere shows 60 UICs coupled to Financial Report, and evaluating the source code and database shows that this result includes one false negative, 53 false positives, seven true positives and 286 true negatives. From these results, we compute the accuracy of the query as 0.84. This high accuracy shows that the outcome of this query enables the domain expert to focus on a significantly reduced search space for change propagation (60 rather than 348 UICs).

A comparison between the size of the queries' results and the precision, recall, F-measure and accuracy shows that queries with larger result sets have higher recall and lower accuracy. However, the relationship between the size of the queries' results and the precision and F-measure is less significant. In addition, we have evaluated the impact of granularity of UICs, and the results show that domain-based coupling provides a more precise prediction of dependences between coarse-grained UICs than fine-grained UICs.

One of the factors that affect the average results is the number of independent UICs. The expected answers for these queries are empty sets, which lead to a recall of 1, and any false positive for these queries leads to precision of zero.

We performed a qualitative analysis and filtered out these queries from our result set. The comparison between the results showed a small change in average precision, recall and accuracy. However, filtering out these queries improved the feedback (FB) and increased the likelihood of finding dependences in the top 3, 5 and 10 results.

The probability of finding architectural dependences using domain-based coupling is a trade-off between precision and recall in favour of recall. However, in some cases, it is preferable to achieve a more precise result set. For example, if someone aims to develop a tool based on this method, too many false positives can discourage the user. In Section 4.8, we demonstrated how an unsupervised clustering method can be used to automatically increase the precision to 0.7 at the cost of a reduction in recall.

Finally, we demonstrated how domain-based coupling could be used to inform software maintainers while they browse software UICs. The results show that the likelihood of discovering architectural dependences among the top 10 coupled UICs is 84%. Given that these results are obtained without looking at the source code or the database, they are quite promising. On the other hand, in its current form, domain-based coupling analysis cannot completely replace source code analysis.

APPLICABILITY

In this paper, we introduced a novel approach to dependence analysis based on the analysis of domain-based coupling between UICs. Now, we answer the following questions about the applicability of this approach:

  • What are the requirements for implementing this approach? This approach requires access to the system domain knowledge including access to the information about the UICs, their domain functions and their associated domain variables. For most enterprise systems, domain experts are the best source of information about the software domain functionalities; in addition, the information about UICs and domain variables can be derived from the system user manuals or software help documents. Therefore, this approach requires access to the domain experts or alternative sources of information about UICs and their domain functions.
  • What kinds of software can benefit the most from the proposed approach? Lehman [22] classified software into three types: S-Type software can be validated relative to a formal specification and includes systems such as compilers. E-Type software, such as enterprise systems, is designed to mechanise a human or societal activity. P-Type software is intermediate between S-Type and E-Type systems, such as a chess game, where users are concerned with the execution results rather than validating the implementation. S-Type and P-Type systems often have limited user interfaces, and in most cases, they operate based on a model that is not visible to the end user. The proposed approach is suitable for systems with many UICs that enable end users to manage information based on domain-driven business rules and workflows. Therefore, this work does not consider S-Type and P-Type software systems, and it focuses only on E-Type software. Most E-Type systems interact with human users through a number of UICs, and they are mostly data-driven systems that collect, manage and report domain information. These systems benefit the most from domain-based coupling analysis.

In a more detailed classification, Pressman [23] classified computer software into seven categories: system, application, engineering/scientific, embedded, product-line, artificial intelligence and web applications. The domain-based approach is applicable to subsets of application software, product-line software and web applications, which are data-driven and provide their functionality through a number of UICs. Domain-based coupling analysis is not applicable to software systems whose functionality is not visible to domain users, such as system software or embedded software. Also, it may not be suitable where systems are not data-driven or have few UICs, such as engineering/scientific or artificial intelligence systems.

  • What kinds of software changes and maintenance activities can benefit the most from the proposed approach? We envisage that the main application of the proposed approach will be estimating the change propagation prior to maintenance activities such as bug fixes and software enhancements. Lientz and Swanson [24] classified software changes as perfective, adaptive, corrective and preventative. Preventative changes are typically initiated by programmers/developers or software engineers who are concerned with the non-functional properties of the system, such as the maintainability of the source code. Such changes might be difficult to map to domain functions and UICs; therefore, the proposed approach would not be suitable for this kind of change. However, perfective, adaptive and corrective changes are typically performed in response to a request from the system users or in response to changes in the software environment. Such software changes are often easy to map to UICs; therefore, domain experts can use the proposed approach to analyse the coupled UICs and estimate the change propagation.

In summary, domain-based coupling analysis is applicable to most enterprise systems and data-driven software packages that provide most of their functions through UICs. For such systems, domain experts might use domain-based coupling to predict the dependences between UICs and estimate the impact of perfective, adaptive and corrective software changes.

THREATS TO VALIDITY

In this section, we discuss the threats to validity of our findings and how we addressed them.

Threats to external validity are concerned with the generalisation of our findings. Although we performed our evaluation on a large-scale enterprise system representative of state-of-the-art enterprise systems developed in Java, we are aware that more studies are required to be able to generalise our findings.

Threats to construct validity are concerned with the quality of the data we analysed and the degree of manual analysis that was involved. The domain information typically is provided by the domain experts using a manual data collection process. To minimise the risk of human error, we extracted the relationship between domain variables and UICs from user manuals and help documents. In ADempiere, this information is stored in the database. We only used manual inputs from domain experts to confirm this information and kept the manual additions and alterations to a minimum.

One other factor that could affect the validity of our results is the type of domain information in the case study. We limit the domain analysis to the domain variables visible on the UICs; however, there are other sources of domain information including user manuals and help files. It can be argued that different results might be achieved using alternative information; hence, further studies are required to evaluate various data sources and to identify the most suitable source for domain analysis.

RELATED WORK

The key applications of dependence analysis are change impact analysis, programme comprehension, concept location and reverse engineering [25-28]. Over the last two decades, researchers have proposed different techniques to perform such analysis.

The earliest techniques rely on formal models of change propagation. Luqi [29] presented a graph model for software evolution based on indirect relationships between components. Rajlich [4] introduced a model for change propagation based on graph rewriting that requires an understanding of the dependences between software elements. Arnold and Bohner [30] modelled change impact analysis as a cycle of revisions derived from relationships between software elements. Mirarab et al. [31] introduced a hybrid impact analysis method based on dependence information and co-change history.

More recent techniques work at the source code level. Source code analysis [6] is an established approach for tracing software dependences [32, 33] or evaluating the evolution of code and design [34, 35]. One of the code analysis methods is programme slicing, which has been exhaustively explored by many researchers and extended to many programming paradigms [36-39]. Source code analysis is further enhanced using dynamic analysis [40, 41] to capture dependences that might not be traceable from static relationships between software elements.

One direction in which impact analysis is extended is towards the analysis of entire software ecosystems [42]. Robbes et al. [43] reported on an empirical study of the impact of API deprecations in a large open-source software ecosystem and concluded that tool support for impact estimation is also needed at the ecosystem level.

Further studies provided techniques based on software metrics. In an effort to quantify the dependence between objects, researchers and practitioners have defined metrics such as Coupling Between Objects (CBO) or CBO [44], which consider the inheritance between classes to measure the coupling among software elements. Other metrics such as Response For Class (RFC) [45] and RFC [44] consider indirect relations among classes based on a level of indirection in the invocation chain of the class methods. A good overview of the structural coupling metrics is provided by Briand et al. [46]. Often, these metrics attempt to provide feedback on the quality of a system's design. In our case, although we provide a coupling metric between GUI components, we do not aim to provide feedback on the quality of the system; rather, we use that metric to support the evolution of software.

The approaches presented above rely on the source code being available. They also assume that the source code captures the relationships between the software elements. However, different parts of the system change together even when the source code encodes no explicit dependences between them (e.g. source code and configuration files). To detect the impact of a change when there are no source code relationships between the components involved in the change, one can use logical coupling [47, 48] or dynamic coupling [49, 50]. Several techniques [27, 51-54] use an evolutionary approach that analyses multiple versions of a system by mining its software repositories. These techniques work under the assumption that the parts of the system that frequently change together will keep changing together in the future. Like our approach, these techniques are less expensive and require less technical expertise than those based on data flow and source code analysis. Unlike ours, however, they are not applicable where the maintenance history is not accessible.
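The co-change assumption behind such evolutionary techniques can be illustrated with a minimal sketch. It is not taken from any of the cited tools: the function names, the `min_support` threshold and the sample history are hypothetical, and real approaches typically add further measures such as confidence.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each unordered pair of files appears in the same commit."""
    counts = Counter()
    for files in commits:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


def logically_coupled(commits, min_support=2):
    """Return the pairs that changed together at least `min_support` times."""
    counts = co_change_counts(commits)
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical change history: each entry lists the files touched by one commit.
history = [
    ["Order.java", "order.xml"],
    ["Order.java", "order.xml", "Invoice.java"],
    ["Invoice.java"],
]
coupled = logically_coupled(history)
```

Here a source file and a configuration file are reported as coupled although neither references the other in code, which is the kind of dependence that static analysis misses. The sketch also makes the limitation visible: without `history`, nothing can be computed.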

An alternative to studying the evolution of a system is to define a conceptual coupling metric based on the vocabulary of the different components of a software system. Poshyvanyk et al. [55-57] and Gethers and Poshyvanyk [58] identified and measured the relations between software entities in object-oriented software using topics included in the source code and latent semantic indexing. The difference between the domain-based coupling approach presented in this paper and the conceptual coupling approach is that the former is source code-independent. In a recent study on ADempiere, Gethers et al. [59] demonstrated that domain-based coupling and conceptual coupling are orthogonal and that combining them predicts database and source code dependences with higher precision and recall than either metric alone.

Gall et al. showed that semantic metrics computed on design documents correlate well with semantic metrics computed on the source code and thus could be used as proxies for them [60]. Like domain-based coupling, this approach does not require source code analysis to compute coupling metrics. On the other hand, the approach of Gall et al. works on design specifications, which can be outdated and are often not available in the first place. Domain-based coupling is computed from UICs, which necessarily reflect the latest features offered by a system.

CONCLUSION AND FUTURE WORK

In this paper, we demonstrated how domain information could be used to predict architectural dependences and assist software maintainers in searching for connected components at the source code or the database layers. The proposed approach is independent of the software implementation and is simple enough to be used by non-technical domain experts. Hence, it can help managers and consultants take decisions about software changes without the support of the developers.

The proposed dependence analysis method is based on relationships between software domain information and UICs, modelled as a weighted graph. We demonstrated how such a model can assist in predicting dependences with a case study on a large-scale enterprise system, called ADempiere. We derived architectural dependences as a set of source code and database dependences and compared them with domain-based coupling between UICs. The results show that on average, 65% of the source code and up to 77% of the database dependences could be derived from the domain-based coupling. The accuracy of such predictions is, on average, more than 70%, implying that for 7 out of 10 component pairs, their dependence state is identified correctly. The results suggest that domain information can be used to predict the existence of architectural dependences and that the accuracy of these predictions could support maintenance activities such as change impact analysis. However, at the current stage, this approach cannot replace source code analysis or database analysis.
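As an illustration of the comparison summarised above, the following sketch builds a weighted graph from the domain variables shared between UICs and scores its edges against a set of known dependences. It is a simplified reading and not the implementation used in the case study: the Jaccard ratio used as the edge weight is an assumption, and the UIC names, domain variables and dependences are invented for the example.

```python
from itertools import combinations


def domain_coupling(uic_variables):
    """Build weighted edges between UICs that share domain variables.

    The weight is the Jaccard ratio (shared variables / all variables of the
    pair). This weight function is an assumption made for illustration only.
    """
    edges = {}
    for a, b in combinations(sorted(uic_variables), 2):
        shared = uic_variables[a] & uic_variables[b]
        if shared:
            union = uic_variables[a] | uic_variables[b]
            edges[(a, b)] = len(shared) / len(union)
    return edges


def evaluate(predicted, actual, all_pairs):
    """Compare predicted pairs with actual dependences over all component pairs."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(all_pairs) - tp - fp - fn
    return {
        # Share of the actual dependences that were predicted.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of the predictions that are actual dependences.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of all pairs whose dependence state is identified correctly.
        "accuracy": (tp + tn) / len(all_pairs) if all_pairs else 0.0,
    }


# Hypothetical UICs and the domain variables visible on each of them.
uics = {
    "Invoice": {"customer", "amount", "date"},
    "Payment": {"customer", "amount"},
    "Product": {"name", "price"},
    "Order": {"customer", "date", "price"},
}
edges = domain_coupling(uics)

# Hypothetical dependences, e.g. as recovered from source code or the database.
actual = {("Invoice", "Payment"), ("Invoice", "Order"), ("Payment", "Product")}
all_pairs = set(combinations(sorted(uics), 2))
scores = evaluate(set(edges), actual, all_pairs)
```

In this reading, the share of dependences derived from the coupling corresponds to `recall`, and the share of component pairs whose dependence state is identified correctly corresponds to `accuracy`. Nothing in the sketch touches source code or a database schema; only the variables visible on the UICs are needed to make a prediction.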

A future area of investigation is assessing the impact of multiple domains on the results. ADempiere contains various modules that provide functions of multiple domains such as enterprise resource planning (ERP), customer relationship management (CRM) and asset management. Distinguishing these domains and their domain-based coupling graphs might lead to a better understanding of the relationships between domain-based coupling and architectural dependences. Moreover, we envisage that domain-based coupling can be used in combination with other coupling metrics to predict change propagation. Domain-based coupling is described at the abstract level of the software domain. Such an abstract coupling can complement code-based conceptual coupling or history-based evolutionary coupling because each of these metrics describes a distinct aspect of software systems.

Overall, the positive results of the described case study suggest that domain-based coupling can be considered to be complementary to source code analysis and can assist software maintainers where existing code analysis tools are not applicable.

ACKNOWLEDGEMENTS

We gratefully acknowledge the financial support of the Swiss National Science Foundation for the project ‘Synchronizing Models and Code’ (SNF Project No. 200020-131827, October 2010 – September 2012). Part of Amir Aryani's contribution to this work has been funded by an Australian Postgraduate Award (APA). We would also like to thank Dr Margaret Hamilton, Nicholas May and Jorge Ressia for their comments on this paper and their support on this project.

  1. 1
  2. 2

    EM can be a supervised technique when it is used to build a classifier. However, in this work, we use EM clustering, which is an unsupervised technique.

  3. 3

    This was measured based on the Java code in the SVN repository at https://adempiere.svn.sourceforge.net/svnroot/adempiere/tags/trunklast/

REFERENCES

  • 1
    Hassan AE, Holt RC. Replaying development history to assess the effectiveness of change propagation tools. Empirical Software Engineering 2006; 11(3):335–367, doi:10.1007/s10664-006-9006-4.
  • 2
    Yu Z, Rajlich V. Hidden dependencies in program comprehension and change propagation. Proceedings of the 9th International Workshop on Program Comprehension, IEEE Computer Society: Washington, DC, USA, 2001; 293–299, doi:10.1109/WPC.2001.921739.
  • 3
    Vanciu R, Rajlich V. Hidden dependencies in software systems. Software Maintenance (ICSM), 2010 IEEE International Conference on, 2010; 1–10, doi:10.1109/ICSM.2010.5609657.
  • 4
    Rajlich V. A model for change propagation based on graph rewriting. Proceedings of the International Conference on Software Maintenance, ICSM '97, IEEE Computer Society: Washington, DC, USA, 1997; 84–91.
  • 5
    Hassan A, Holt R. Predicting change propagation in software systems. Proceedings 20th IEEE International Conference on Software Maintenance (ICSM'04), IEEE Computer Society Press: Los Alamitos CA, 2004; 284–293, doi:10.1109/ICSM.2004.1357812.
  • 6
    Binkley D. Source code analysis: A road map. 2007 Future of Software Engineering, FOSE '07, IEEE Computer Society: Washington, DC, USA, 2007; 104–119, doi:10.1109/FOSE.2007.27.
  • 7
    Lehman M. Programs, life cycles, and laws of software evolution. Proceedings of the IEEE 1980; 68(9):1060–1076.
  • 8
    Cook S, Harrison R, Lehman MM, Wernick P. Evolution in software systems: foundations of the SPE classification scheme: Research articles. Journal of Software Maintenance and Evolution 2006; 18(1):1–35, doi:10.1002/smr.v18:1.
  • 9
    Aryani A, Peake I, Hamilton M. Domain-based change propagation analysis: An enterprise system case study. Software Maintenance (ICSM), 2010 IEEE International Conference on, 2010; 1–9, doi:10.1109/ICSM.2010.5609743.
  • 10
    Aryani A, Perin F, Lungu M, Mahmood AN, Nierstrasz O. Can we predict dependencies using domain information? Proceedings of the 18th Working Conference on Reverse Engineering (WCRE 2011), 2011, doi:10.1109/WCRE.2011.17.
  • 11
    Aryani A, Peake ID, Hamilton M, Schmidt H, Winikoff M. Change propagation analysis using domain information. Proceedings of the 2009 Australian Software Engineering Conference, ASWEC '09, IEEE Computer Society: Washington, DC, USA, 2009; 34–43, doi:10.1109/ASWEC.2009.31.
  • 12
    Bastian M, Heymann S, Jacomy M. Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media, 2009.
  • 13
    Mahmood A, Leckie C, Udaya P. An efficient clustering scheme to exploit hierarchical data in network traffic analysis. Knowledge and Data Engineering, IEEE Transactions on 2008; 20(6):752–767, doi:10.1109/TKDE.2007.190725.
  • 14
    Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 1977; 39(1):1–38, doi:10.2307/2984875.
  • 15
    Nierstrasz O, Ducasse S, Gîrba T. The story of Moose: an agile reengineering environment. Proceedings of the European Software Engineering Conference (ESEC/FSE'05), ACM Press: New York, NY, USA, 2005; 1–10, doi:10.1145/1095430.1081707. Invited paper.
  • 16
    Tichelaar S, Ducasse S, Demeyer S, Nierstrasz O. A meta-model for language-independent refactoring. Proceedings of International Symposium on Principles of Software Evolution (ISPSE '00), IEEE Computer Society Press, 2000; 157–167, doi:10.1109/ISPSE.2000.913233.
  • 17
    Marinescu C, Jurca I. A meta-model for enterprise applications. SYNASC '06: Proceedings of the Eighth International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, IEEE Computer Society: Washington, DC, USA, 2006; 187–194, doi:10.1109/SYNASC.2006.3.
  • 18
    Perin F. MooseJEE: A Moose extension to enable the assessment of JEAs. Proceedings of the 26th International Conference on Software Maintenance (ICSM 2010) (Tool Demonstration), 2010, doi:10.1109/ICSM.2010.5609569.
  • 19
    Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. University Press: Cambridge, 2008.
  • 20
    Lungu M, Lanza M, Nierstrasz O. Evolutionary and collaborative software architecture recovery with Softwarenaut. Science of Computer Programming (SCP) 2012; available online, to appear in print, doi:10.1016/j.scico.2012.04.007. URL http://scg.unibe.ch/archive/papers/Lung12b.pdf
  • 21
    Fruchterman TMJ, Reingold EM. Graph drawing by force-directed placement. Software: Practice and Experience 1991; 21(11):1129–1164, doi:10.1002/spe.4380211102.
  • 22
    Lehman MM, Ramil JF. Rules and tools for software evolution planning and management. Annals of Software Engineering 2001; 11(1):15–44, doi:10.1023/A:1012535017876.
  • 23
    Pressman RS. Software Engineering: A Practitioner's Approach. 7th edn., McGraw-Hill, 2010.
  • 24
    Lientz BP, Swanson EB. Software Maintenance Management: a Study of the Maintenance of Computer Application Software in 487 Data Processing Organizations. Addison-Wesley Publishing Company: Reading, MA, USA, 1980.
  • 25
    Cleary B, Exton C. Assisting concept location in software comprehension. Proceedings of 19th Annual Psychology of Programming Workshop (PPIG 07), Joensuu, Finland, 2007.
  • 26
    Tzerpos V, Holt R. ACDC: an algorithm for comprehension-driven clustering. Reverse Engineering, 2000. Proceedings. Seventh Working Conference on, 2000; 258–267, doi:10.1109/WCRE.2000.891477.
  • 27
    Walker RJ, Holmes R, Hedgeland I, Kapur P, Smith A. A lightweight approach to technical risk estimation via probabilistic impact analysis. Proceedings of the 2006 international workshop on Mining software repositories, MSR '06, ACM: New York, NY, USA, 2006; 98–104, doi:10.1145/1137983.1138008.
  • 28
    Marinescu C. Discovering the objectual meaning of foreign key constraints in enterprise applications. Reverse Engineering, Working Conference on 2007; 0:100–109, doi:10.1109/WCRE.2007.20.
  • 29
    Luqi. A graph model for software evolution. IEEE Transactions on Software Engineering 1990; 16(8):917–927, doi:10.1109/32.57627.
  • 30
    Bohner SA, Arnold R. Software Change Impact Analysis. IEEE Computer Society Press: Los Alamitos, CA, USA, 1996.
  • 31
    Mirarab S, Hassouna A, Tahvildari L. Using Bayesian belief networks to predict change propagation in software systems. Program Comprehension, 2007. ICPC '07. 15th IEEE International Conference on, 2007; 177–188, doi:10.1109/ICPC.2007.41.
  • 32
    Harman M, Binkley D, Gallagher K, Gold N, Krinke J. Dependence clusters in source code. ACM Transactions on Programming Languages and Systems 2009; 32(1):1:1–1:33, doi:10.1145/1596527.1596528.
  • 33
    Cleve A, Henrard J, Hainaut JL. Data reverse engineering using system dependency graphs. WCRE 06: Proceedings of the 13th Working Conference on Reverse Engineering, 2006; 157–166, doi:10.1109/WCRE.2006.22.
  • 34
    Hammad M, Collard M, Maletic J. Automatically identifying changes that impact code-to-design traceability. Program Comprehension, 2009. ICPC '09. IEEE 17th International Conference on, 2009; 20–29, doi:10.1109/ICPC.2009.5090024.
  • 35
    Lungu M, Lanza M. Exploring inter-module relationships in evolving software systems. Proceedings of CSMR 2007 (11th European Conference on Software Maintenance and Reengineering), IEEE Computer Society Press: Los Alamitos CA, 2007; 91–100.
  • 36
    Binkley D, Harman M. A survey of empirical results on program slicing. Advances in Computers 2004; 62:105–178.
  • 37
    Willmor D, Embury S, Shao J. Program slicing in the presence of database state. Software Maintenance, 2004. Proceedings. 20th IEEE International Conference on, 2004; 448–452, doi:10.1109/ICSM.2004.1357833.
  • 38
    Xu B, Qian J, Zhang X, Wu Z, Chen L. A brief survey of program slicing. SIGSOFT Software Engineering Notes 2005; 30(2):1–36, doi:10.1145/1050849.1050865.
  • 39
    Silva J. A vocabulary of program-slicing based techniques. ACM Computing Surveys 2011.
  • 40
    Xiao C, Tzerpos V. Software clustering based on dynamic dependencies. Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on, 2005; 124–133, doi:10.1109/CSMR.2005.49.
  • 41
    Cornelissen B, Zaidman A, van Deursen A, Moonen L, Koschke R. A systematic survey of program comprehension through dynamic analysis. IEEE Transactions on Software Engineering 2009; 35(5):684–702, doi:10.1109/TSE.2009.28.
  • 42
    Lungu M. Reverse engineering software ecosystems. PhD Thesis, University of Lugano Nov 2009. URL http://scg.unibe.ch/archive/papers/Lung09b.pdf
  • 43
    Robbes R, Lungu M, Roetlisberger D. How do developers react to API deprecation? The case of a Smalltalk ecosystem. Proceedings of the 20th International Symposium on the Foundations of Software Engineering (FSE'12), 2012; 56:1–56:11, doi:10.1145/2393596.2393662. URL http://scg.unibe.ch/archive/papers/Rob12aAPIDeprecations.pdf
  • 44
    Chidamber SR, Kemerer CF. A metrics suite for object oriented design. IEEE Transactions on Software Engineering 1994; 20(6):476–493, doi:10.1109/32.295895.
  • 45
    Chidamber SR, Kemerer CF. Towards a metrics suite for object oriented design. Conference proceedings on Object-oriented programming systems, languages, and applications, OOPSLA '91, ACM: New York, NY, USA, 1991; 197–211, doi:10.1145/117954.117970.
  • 46
    Briand LC, Daly JW, Wüst JK. A unified framework for coupling measurement in object-oriented systems. IEEE Transactions on Software Engineering 1999; 25(1):91–121, doi:10.1109/32.748920.
  • 47
    Zimmermann T, Weißgerber P, Diehl S, Zeller A. Mining version histories to guide software changes. 26th International Conference on Software Engineering (ICSE 2004), IEEE Computer Society Press: Los Alamitos CA, 2004; 563–572.
  • 48
    Gall H, Jazayeri M, Krajewski J. CVS release history data for detecting logical couplings. Proceedings of the 6th International Workshop on Principles of Software Evolution, IWPSE '03, IEEE Computer Society: Washington, DC, USA, 2003; 13–.
  • 49
    Arisholm E, Briand L, Foyen A. Dynamic coupling measurement for object-oriented software. Software Engineering, IEEE Transactions on 2004; 30(8):491–506, doi:10.1109/TSE.2004.41.
  • 50
    Hassoun Y, Johnson R, Counsell S. A dynamic runtime coupling metric for meta-level architectures. Software Maintenance and Reengineering, European Conference on 2004; 0:339, doi:10.1109/CSMR.2004.1281436.
  • 51
    Ying A, Murphy G, Ng R, Chu-Carroll M. Predicting source code changes by mining change history. Software Engineering, IEEE Transactions on 2004; 30(9):574–586, doi:10.1109/TSE.2004.52.
  • 52
    German DM, Hindle A, Jordan N. Visualizing the evolution of software using softChange. Proceedings of the 16th International Conference on Software Engineering & Knowledge Engineering (SEKE 2004), ACM Press: New York NY, 2004; 336–341.
  • 53
    D'Ambros M, Lanza M, Robbes R. On the relationship between change coupling and software defects. Proceedings of the 2009 16th Working Conference on Reverse Engineering, WCRE '09, IEEE Computer Society: Washington, DC, USA, 2009; 135–144, doi:10.1109/WCRE.2009.19.
  • 54
    Kagdi H, Maletic J, Sharif B. Mining software repositories for traceability links. Program Comprehension, 2007. ICPC '07. 15th IEEE International Conference on, 2007; 145–154, doi:10.1109/ICPC.2007.28.
  • 55
    Poshyvanyk D, Marcus A, Rajlich V, Gueheneuc YG, Antoniol G. Combining probabilistic ranking and latent semantic indexing for feature identification. Proceedings of the 14th IEEE International Conference on Program Comprehension, ICPC '06, IEEE Computer Society: Washington, DC, USA, 2006; 137–148, doi:10.1109/ICPC.2006.17.
  • 56
    Poshyvanyk D, Marcus A. The conceptual coupling metrics for object-oriented systems. Proceedings of the 22nd IEEE International Conference on Software Maintenance, IEEE Computer Society: Washington, DC, USA, 2006; 469–478, doi:10.1109/ICSM.2006.67.
  • 57
    Poshyvanyk D, Marcus A, Ferenc R, Gyimóthy T. Using information retrieval based coupling measures for impact analysis. Empirical Software Engineering 2009; 14(1):5–32, doi:10.1007/s10664-008-9088-2.
  • 58
    Gethers M, Poshyvanyk D. Using relational topic models to capture coupling among classes in object-oriented software systems. Proceedings of the 2010 IEEE International Conference on Software Maintenance, ICSM '10, IEEE Computer Society: Washington, DC, USA, 2010; 1–10, doi:10.1109/ICSM.2010.5609687.
  • 59
    Gethers M, Aryani A, Poshyvanyk D. Combining conceptual and domain-based couplings to detect database and code dependencies. 12th International Working Conference on Source Code Analysis and Manipulation, IEEE Computer Society Press, 2012; 144–153, doi:10.1109/SCAM.2012.27.
  • 60
    Gall C, Lukins S, Etzkorn L, Gholston S, Farrington P, Utley D, Fortune J, Virani S. Semantic software metrics computed from natural language design specifications. Software, IET 2008; 2(1):17–26, doi:10.1049/iet-sen:20070109.

Biographies

    Amir Aryani is a Project Manager at the Australian National Data Service (ANDS), Australian National University (ANU). He has more than 15 years of Information Technology experience across the full software development life cycle. In 2013, he completed his PhD in the field of software maintenance and evolution at the School of Computer Science and IT, RMIT University, Australia. He is currently involved in research projects in the fields of software evolution, data mining and eScience.

    Fabrizio Perin received his BS and MS degrees in Computer Science at the University of Milano Bicocca. After a short period in industry, he received his PhD in Computer Science at the University of Bern in 2012. His research interests focus on reverse engineering, software analyses of multi-language software systems and software visualizations. He has published peer-reviewed articles on these topics in international conferences and workshops, and he has been involved as a consultant in research projects involving industrial partners.

    Mircea Lungu is a PhD Researcher at the Institute of Computer Science (IAM) of the University of Bern. His interests range from software evolution analysis to programming languages and mobile computing. Visit him at http://scg.unibe.ch/staff/mircea

    Abdun Mahmood received the BSc degree in Applied Physics and Electronics and the MSc degree in Computer Science from the University of Dhaka, Bangladesh, in 1997 and 1999, respectively, and the PhD degree from the University of Melbourne, Australia, in 2008. He worked as a Lecturer in 2000 and as an Assistant Professor in 2003 at the University of Dhaka. He worked as a Postdoctoral Research Fellow at the Royal Melbourne Institute of Technology until 2011. He is currently a Lecturer at the School of Engineering and IT, University of New South Wales. His research interests include data mining techniques for network traffic analysis, anomaly detection and industrial SCADA security. He has published his work in IEEE Transactions and A-tier international journals and conferences.

    Oscar Nierstrasz is Professor of Computer Science at the Institute of Computer Science (IAM) of the University of Bern, where he founded the Software Composition Group in 1994. He is co-author of over 200 publications and co-author of the open-source books Object-Oriented Reengineering Patterns and Pharo by Example.