Impediments for software test automation: A systematic literature review

Automated software testing is a critical enabler for modern software development, where rapid feedback on the product quality is expected. To make the testing work well, it is of high importance that impediments related to test automation are prevented and removed quickly. An enabling factor for all types of improvement is to understand the nature of what is to be improved. We have performed a systematic literature review of reported impediments related to software test automation to contribute to this understanding. In this paper, we present the results from the systematic literature review: The list of identified publications, a categorization of identified impediments, and a qualitative discussion of the impediments proposing a socio‐technical system model of the use and implementation of test automation.

industrial test automation. Impediments are defined as "anything that prevents a team member from performing work as efficiently as possible" [14].
It should also be noted that the review focuses on industrial studies of test automation. Thereby, it does not consider the large body of knowledge on methods for automated test generation, an area where recent and mature reviews already exist (see, eg, Anand et al [15]).
From industrial experience, available publications in the field, and our own research, we propose that many impediments experienced by software developers are related to the use of automated testing. Many test automation impediments appear to be of a general nature, as they are present in reports from widely different parts of the software industry. Considering the importance of testing in software development, test automation in agile development [IP6], and the large costs involved, we propose that there is a general need for systematic knowledge about what impediments could be encountered and reasons that the impediments occur. Knowledge about the impediments is important to be able to reduce the impact of the impediments, or eliminate the impediments completely. This paper contributes the following: (1) A systematic literature review (SLR) of research publications identifying real-life experiences of impediments related to software test automation, (2) a categorization of the identified impediments, (3) a model presenting how different phenomena encountered in test automation use and development interact to amplify the impediments, (4) a qualitative discussion on the impediment and phenomena, and (5) a proposal for future work.
The paper is organized as follows: An overview of the study design is presented in Section 2, followed by the results of the SLR together with the qualitative synthesis in Section 3. Section 5 discusses implications for practitioners and researchers, together with proposals for future work.
Finally, the paper is concluded in Section 6.

STUDY DESIGN
The design of the study draws primarily from two sources. The searching and screening approach largely follows the systematic literature review guidelines by Kitchenham et al. [16] and the synthesis is influenced by Noblit and Hare's meta-ethnographic method for qualitative synthesis [17].

Objectives
There are two objectives for this research: 1. To identify scientifically reported software development impediments related to automated software testing.
2. To explain the mechanisms leading to the impediments through a qualitative synthesis.

Process
To analyze the included publications and perform the synthesis the following process was used:

Getting started
The meta-ethnographic method [17] defines "getting started" as determining what to be investigated using the method. This is formulated as two objectives, as presented in Section 2.1.

Deciding what is relevant to the initial interest
In this phase, we define the search terms, the databases to be searched, the inclusion and exclusion criteria, and how to do the screening. A pilot screening activity was performed to calibrate the screening process by comparing differences between screening decisions taken by the different researches. Then, we performed the searches and the screening, finally resulting in the primary studies listed in Table 7 and in the "Included Publications" section at the end of this paper. The search strategy is described in Section 2.3, and the screening process is described in Section 2.5.

Reading the studies
The purpose of this step is to get as familiar as possible with the content of the included studies [18] and extract the initial themes and concepts.
The step can be viewed as preprocessing for the synthesis step.

Synthesising the results
The synthesis phase contains the main difference between our study design and the original meta-ethnographic approach. Noblit and Hare propose that studies shall be translated into each other to identify common concepts, to form either reciprocal, refutational, or line of argument relations [17]. The relations are interpreted by the researcher, who expresses the synthesis in writing. This analysis becomes very complex as the number of studies grow. Atkins et al. [18] analysed approximately the same number of studies (n = 44) as in our study and concluded that it most likely would be impossible to translate the studies into each other. To manage this situation, they elected to perform a thematic analysis of the data instead of formally mapping the studies onto each other. Since we had a similar volume of studies, we elected to use the thematic method as well, as described by Cruzes and Dybå [19]. The generated themes and concepts were then analyzed in line with Noblit and Hare's original intent.

Study retrieval
Failing to retrieve all relevant reports, the gold standard, is a significant threat to validity in literature reviews [20,21]. The "gold standard" for a literature search is the set of all reports that are relevant for the defined research objectives [22]. If the search strategy is perfectly adapted to the objectives, all relevant reports would be retrieved and all irrelevant reports would be ignored. For reasons described below, it is very difficult or impossible to retrieve the gold standard in a software engineering literature search [22].
Recall is the percentage of all relevant document in the searched databases that have been retrieved, and the precision is the percentage of retrieved documents that are relevant [20]. This is also known as the sensitivity of the search [22].
The terminology used in software engineering in general and testing in particular is not clearly defined [4,[23][24][25][26][27][28]. The lack of a clearly defined terminology have at least two implications for a systematic literature review: The search terms have to be more inclusive to minimize the risk of omitting an important publication and will return more publications due to matching out of scope publications using the same terminology for a different concept. Publication keywords can be used to narrow the search further than a full-text or abstract search. However, the keywords suffer from the same issues as the terminology in general and are not always relevant for systematic literature reviews [28]. The consequence of the ambiguous terminology is a search with high recall and low precision that has to be refined by a series of screening actions as described in Section 2.5 and shown in Figure 1. The screening removes the irrelevant publications and produces the final set of studies for analysis.

Search query
The primary search query is shown in Table 1 using the PICO format [29]. In addition to this, we limited the searches by keywords and areas to keep the volume manageable, as specified for each database in Table 2.

Validation of the searches
To validate the search query and database selection, we used a number of preidentified relevant publications, listed in Table 3. This revealed that Karlström et al. [IP9] was missing from the database search result. The reason is likely that it is an improvement-oriented paper and presents a process   improvement framework, based on empirical data, together with the empirical evaluation of the framework. The empirical study is presented in the paper, but it is not the main focus of the report.
As a consequence, we decided to perform backward and forward chaining [20] from the primary studies identified in the database searches [30], and an additional search of four journals [20] relevant for the research objectives.
The "IBM Journal of Research and Development" and the "IBM Systems Journal" were searched using the IBM search engine for papers matching "test," "validation," "verification," or "automation/automated." The remaining search criteria were applied manually after retrieval of a list of inclusion candidates.
The Software testing, Verification & Reliability journal and the Advances in Software Engineering journal were hand searched for papers matching the inclusion criteria. Papers found in these searches, as well as those identified by backward and forward chaining, were added to the database and were subsequently subjected the same screening process as the papers found by the regular search string.

Study selection process
The study is limited to journal, conference, and magazine publications written in English. The inclusion and exclusion criteria for each phase in the review are described further below.
To increase the precision [20] of the search, a series of screenings were performed, as illustrated by Figure 1. Following the recommendations in the Cochrane Handbook for Systematic Reviews of Interventions [29, p. 179], we have not calculated Cohen's Kappa [31] to assess the interrater reliability. Instead, we performed a pilot screening and focused on a qualitative discussion of the synthesis and included items.
The screening process followed the steps described below: • Merge search results and remove duplicate records of the same report As shown in Figure 1, there was a relatively large overlap between databases, which made automation in the shape of a Unix script necessary.
The automated script removed duplicates in a two-step process. The first step identified duplicates by comparing the DOI of the publications, and the second step identified duplicates in publications missing DOI information by comparing the abstracts. If 2 abstracts were nonempty and byte-identical, the publications were considered to be duplicates.

• Examine titles to remove obviously irrelevant reports
We embraced the Cochrane handbook recommendation to be over-inclusive [29] in this stage.  • Examine abstracts to remove obviously irrelevant reports Inclusion criteria:Software development in the real world, papers that may be within scope and cannot be excluded with good confidence.
Exclusion criteria: Out-of-domain papers.

• Manual deduplication
During abstract screening, we realized that there were duplicates remaining that had escaped the automatic deduplication.
Exclusion criteria:Identical papers. We sorted the titles and inspected papers with the same title, then duplicates were removed.
• Content screening: introduction, conclusions, and full text • Inclusion criteria: Real-world experiences of technical debt and impediments related to software test automation and software test execution tools, as well as literature reviews and other secondary studies of the same. We started by analyzing the introduction and conclusions of the candidate reports. If this did not provide sufficient information, we continued immediately by analysing the rest of the report for inclusion or exclusion. The screening continued during detailed reading and initial theme extraction, when the question "Is this relevant to the research objectives and synthesis topic" was asked [32]. If the report was found to be irrelevant during thematic analysis, by not containing any relevant themes, it was excluded from the review.
• Exclusion criteria: Fixed-design surveys that only provide tool information from closed-ended response [33] questions. The reason for this exclusion criteria is that a closed-ended question make a priori assumptions on what the impediments are, which results in a researcher bias [34]. The information provided by this type of surveys is relevant as supporting evidence to determine the relative importance of impediments but do not provide any evidence on the actual impediments encountered during development.

Quality appraisal of the included studies
Kitchenham et al. [16] recommended that the candidate studies are assessed according to a predetermined checklist, to ensure that studies are impartially assessed for quality. The quality criteria used in this review are shown in Table 4 and are influenced by the criteria used by Atkins et al. [18] and Dybå and Dingsøyr [35].
The production and validity of the checklist itself [18] is a potential issue when assessing quality, as there is no consensus on what makes a study good and suitable for inclusion in a qualitative review [30]. We quickly realized that many experience reports have a "sense of rightness and feeling of comfort" [36] and "sound, well-grounded, justifiable, strong, and convincing" [37] arguments, which makes them suitable for inclusion in a qualitative study. The goal of the qualitative analysis and synthesis is to understand the phenomena, not to describe them [30], which allows us to consider weaker but still valuable evidence. Hence, we have performed the quality assessment but not excluded any studies based on the quality assessment.

Data extraction and analysis
From each included primary study, we extracted all data relevant to the research objective of the review. Specifically, we collected all paragraphs of text containing claims on negative effects of, or impediments for, test automation. The data extraction was performed using the qualitative data analysis software NVivo [38], into which the full text reports were imported. The text was then analyzed using "thematic coding" as described by Robson [39].
In general, thematic analysis is an iterative method to identify common themes in the data, by classifying behaviors, events, activities or other concepts that are considered to influence the studied phenomenon. The themes are used to group segments of data, and these groups can then be examined to find repetitions, similarities, differences, anomalies, or connections to the theory that are further analyzed to provide the results from the study.
The analysis starts by identifying a number of potential codes for coding, or labeling, snippets of evidence. This can be done by defining a set of start codes based on prior understanding, incrementally during the work, or as a combination of both [19]. This is then applied to the body of evidence by dividing the material into strata belonging to the various codes.
After applying the code-set to the body of evidence, a series of iterative refinement takes place, where the strata are combined into themes. This is applied iteratively, increasing the level of abstraction for each step. Themes thereby form associations between the different sources included, which paints a picture of the reported body of knowledge. In our case, the themes resulting from the analysis are listed in Tables 5 and 6 and discussed in detail in Section 3.3. Further, an overarching view of the themes and the way in which they interact is given in Figure 4.

RESULTS
Considering the objectives for this systematic literature review: • Objective 1: To identify scientifically reported software development impediments related to automated software testing.
To fulfill objective 1, we extracted relevant evidence from 39 publications that were identified through a systematic literature search following the recommendations of Kitchenham et al. [16] and the Cochrane Handbook of Systematic Reviews [29]. The data extraction and analysis was performed by a meta-etnographic thematic analysis to produce the results presented in Section 3.3. Through the analysis, we identified and classified a large number of impediments that are presented in Tables 5 and 6. The impediments are grouped hierarchically, and the studies providing evidence are listed per impediment type. To fulfill objective 2, we performed a qualitative synthesis that is presented in Section 3.3. The qualitative synthesis resulted in both the qualitative narrative, and a proposal of a socio-technical system model of interacting phenomena encountered during the use and implementation of test automation that is shown in Figure 4.

Publication statistics
The unscreened combined initial result of the database searches, the journal runs, and the forward and backward searches consists of 13 060 publications, published over a time of 50 years: from 1964 to 2015-03-31.
As seen in the figure, most of the retrieved publications were published from the late 1980s and onwards. The publication rate is growing steadily, until the late 1990s, where there is an apparent slowdown during the "dot com" years, followed by a sharp increase from 2004 and onwards.
The increase after 2004 can likely be explained by an increased interest in empirical software engineering and testing. Kitchenham, Dybå, and Jørgensen published a paper promoting evidence-based software engineering [41]

Quality assessment
An overview of all included publications and the answer to the assessment questions is available in Table 7.
The absolute majority of the publications had at least a minimal description of the context that could be identified easily, but the contents and the quality of the content description varies greatly. We have not set a limit for the context quality, but noted that very few descriptions fulfill the requirements set by Petersen and Wohlin [43] or Dybå et al. [44]. This makes it harder to compare the study results, as some results may be valid only in specific contexts [28].

Qualitative summary of the findings
This section presents our qualitative interpretation of the evidence collected through the systematic literature review.
Implementing test automation is difficult, and even if well prepared in advance, there are significant risks of failure [IP4]. We propose that this is because test automation is a "socio-technical challenge" [IP31] with a complex interaction between the various phenomena that emerge during use and implementation of the automation.
During the thematic analysis, it quickly became clear that there is an interdependence between the different themes that may form feedback loops that threaten to increase the magnitude of the impediment until the test automation initiative fails.
As an example, one of these feedback loops is shown isolated from its other dependencies in Figure 3. A change in any of the factors in the feedback loop will influence the other factors. Consider the case where automation is not viewed as useful. This will likely reduce the willingness to invest in the automation by allocating people and time to the work. Less time and people available to do the work will likely lead to a low quality automation.
Low-quality automation is not as useful as high-quality automation, which reduces the perceived value and the feedback loop is closed.
On the basis of the systematic review and the experience of the researchers, we propose a system view of test automation as shown in Figure 4.
During the thematic analysis of the primary studies, we identified interdependent relations between the themes that we propose may cause feedback in an automation project. To explain this, the figure was synthesised by identifying potential cause-effect relations in the primary studies, which   Note that the figure (to be discussed in further detail in the following sections) does not contain all groups of impediments or phenomena, only those we consider to be most important for the socio-technical system. The square boxes represent the conceptual areas, and the arrows indicate influence from other areas. The rounded boxes represent the nature of the influence on the area they are located with.

Behavioural effects
Our first finding is that people working in a test automation project have a significant influence on the outcome of the project [45] due to their behaviors [46], organizational culture [46], motivation [47], and expectations.

Process adherence
Test automation depends on stability and structure in the testing and processes to be successful [48]. It will be very difficult to successfully implement test automation if the testing in general is unstructured, considered optional [IP14], or there is a general negativity to processes and quality [IP13].
Tight deadlines and a high schedule pressure increases the risk for process deviations [IP17], [IP11] [49] that may cause problems later [IP18], [IP22], [IP30]. If the automated test is perceived as an impediment by its users, there is a risk that the testing is skipped [IP12] or that the automation is abandoned in favor of manual work [IP7].

Organizational change
It is crucial to understand that the introduction of test automation at least partially is about implementing an organizational change. The introduction of automation has to be managed properly [IP16] to minimize negativity [IP16], [IP17], [IP3], [IP5] that may cause the automation project to fail. Lack of management support [IP15], [IP9], [IP2] will increase the risk for change-related problems.
When deploying automation, one must consider resistance to the change caused by automation [IP5], and the historically always present [50] fear of becoming unemployed due to automation [IP12]. Keeping employees motivated is an important factor in any change project and are not less important in an automation project [IP17] [IP6].

Too high expectations
One of the major risks in a test automation project is that the expectations of the people involved are unrealistically high [IP14], [IP16], [IP10], which may result in abandonment of the automation if the expected results are not achieved within the expected time-frame [IP10]. Managing these expectations is very important for successful automation [IP16].
Automation is expensive to develop [IP22], [IP23], [IP3], [IP24], [IP7], [IP14], [IP5], and the likelihood of an immediate return on investment must be considered to be low [IP17], [51]. Focusing on rapid return on investment during deployment is likely to be detrimental to the maintainability of the Unfulfilled expectations will likely have negative impact on the perceived value of the automation. If the test automation disappoint the stakeholders and the people involved in the work, it does not matter how much value it actually provide to the organization, and it will be considered as a wasteful activity.

Process Deviations
To invest time, money, and effort in something, both the business and its employees need to see the value of their tasks.
It is fairly well known that testing is among the first things in a development project that are ignored when the time is short and the deadlines  [IP18]. If testing is neglected and not considered to be a valuable activity for the business, then by extension, the organization will not see any value in test automation either.

Business and planning
Many organizations exceed their software development budgets in general [53], and in light of this, it is not surprising that cost is considered to be a limiting factor for automation [IP15] [3] [IP14], [IP24], [IP7], [IP5], [IP22].

Cost of ownership and operation
Realizing that automation need to follow a reasonable business life cycle as opposed to getting the benefits immediately is an important precondition for successful automation [48,51]. Test automation projects are no different from other projects and need the same type of preparations [IP16].
Before automating anything, a business case [IP22] and an automation strategy need to be developed to assist decisions on what parts of the test scope shall be automated to what degree [IP27].

Automation too expensive for small projects
The properties of the product and features being developed are relevant to the possibility to get an acceptable return on the investment. There are examples of organizations that consider their project or products to be too small to automate efficiently [IP28], [IP3]. In small, short-lived, projects or products, it can and should be discussed if it is possible to get an acceptable ROI before the development ends [IP28], [IP3], [IP5], [IP7], [IP21], [IP4].
Setting up automation can be too costly compared to the rest of the development effort [IP28] [6]. In line with this, Kasurinen et al. [IP7] propose that the ratio between the resources needed for automation and the total availability of resources shall be used to guide the decisions on automation.
Combined with work on divergent products [IP16] or on emerging platforms [IP28] [IP36], the organization may not see sufficient opportunities for long term reuse to motivate an investment in test automation [IP5] . For those situations, there could be other things than execution that can be automated with better long term results [IP10].

Lack of time, people, and funding
Damm et al. [IP17] note that projects in time-pressure tend to prioritize near-time goals and neglect activities that are important for the long-term success. This is reported for testing in general, where design organizations down-prioritize testing or remove test activities to meet delivery dead- , [IP14], [IP15]. There are also reports of more or less permanent understaffing of the development teams with respect to testing [IP2], [IP19], [IP25]. This reflects the perceived importance of testing, as indicated by Liebel et al. [IP14] who reported that many developers in a studied organization considered test to be optional, something done if there was spare time at the end of the development cycle.
Considering automation, there are reports on choosing manual work instead of automation, if time is short [IP6] or if there are insufficient resources available for automation [IP7], [IP5]. This allows the testers to test immediately instead of establishing the infrastructure and test scripts.
The result is likely accumulation of technical debt that will cause problems later [IP22]. Berner et al. summarized the approach in one of the projects they reported on as "we do not have the time to be efficient" [IP10].

Skills
Test automation is often a complex activity that require specialist skills. It is likely that the replacement of manual work by automation will cause a major change in the testers daily work. Changing the work practices may require "major training" [IP3] if the introduction of automation is to be successful. Training requires both a budget and time and priority in the work schedule.  [IP10]. If these skills are not present in the project staff, the effects will be the same as on a badly staffed ordinary software development project: an increased risk for failure, slower work, lower quality, and budget overdrafts.

Diversity
As an example, missing programming skills in the project may cause the organization to seek alternatives to programming. One such alternative is "capture and replay" [54] tools that offer a shorter implementation time for automated test scripts by recording manual tester actions for later execution. The resulting automated test scripts are usually more fragile [IP4] than test scripts that have been implemented by programming, which introduces a risk for more rework when the system under test changes [IP1], [IP26], [IP20]. Unless this has been anticipated and planned for, the rework may cause significant disappointment [IP21] in the automated testing, resulting in setbacks to the automation project [IP16].

Steep learning curve
Learning to automate tests is hard. Test automation tools may be too complex to use without significant training [IP3], and the learning curve may be very steep [IP21]. The likelihood for this is higher if the tools are intended for more advanced users than those actually using them [IP11]. Appropriate tool selection will influence the level of competence needed to work with the automation, but to do the tool selection, it is necessary to acquire relevant skills beforehand [3]. If the organization is unskilled, it might cause the introduction of technical debt that have to be refactored away later [IP22]. A high technical debt may make the return on investment slower than anticipated due to necessary rework to be able to perform necessary changes later when the system under test behavior changes.
If manual testing is viewed as easier to perform than developing automated test scripts [IP5], the consequence may be that testers choose to work manually instead of automating tests, leading to a lower or completely missing return on investment from the automation.

Inadequate development practices
A critical insight is that test automation is software development [IP16] and need to be treated as such [IP10]. This means that there is a clear need for staff skilled in programming [IP16], [IP26], [IP29], [IP23], [IP15], [IP21] and an understanding of how to do software development [IP2], [IP8], to be able to do effective and efficient development of the test automation.

Test system
During use, test tools are combined into a test bed, "an environment containing the hardware, instrumentation, simulators, software tools, and other support elements needed to conduct a test" [55], that is used by the testers. Together with the automated test scripts, this forms a test system, which is used to test the application.
If the tools and the test systems cause problems, the automation provides little or no value to the testers and is perceived as nothing but a very significant impediment [IP2]. If the impediment is severe enough, it may lead to manual testing as a work-around [IP5] instead of solving the problems.
In the test tool domain, impediments spanning the entire tool life cycle, from sourcing tools to using tools, have been reported.
The primary subthemes in this category are development and maintenance of tools, tool selection, and configuration of the test beds and test system.

Testware architecture
A test automation system is typically at least as complex as the system under test [10,11] [IP7] [12], which makes it likely that the test automation software need to be designed, developed, and maintained in a mature [40] way, similarly to the software developed for delivery to a customer.
Careless or ad hoc development of test automation software is likely to result in a lower quality test system [40]. Low quality in the test system may in turn, exactly as with any other software, result in problems for the users of the test system and a risk to the quality of the tested product. Defects in the test software may cause escaped defects [54] through false positives or false negatives [IP34], [IP9] in the testing.
Published literature suggests that test automation often is implemented without any consideration of architecture, standards, documentation or maintainability [IP10]. The shortcuts taken include not testing the test software [IP9], ignoring known problems [IP31], ignoring known technical debt [IP30], staffing with inexperienced temporary personnel without planning for competence transfer [IP5], and prioritizing rapid deployment over maintenance [IP7]. An automation project or activity need to be handled as "ordinary" software development [IP5], [IP11], [IP3], [IP10], and have adequate development standards, quality criteria, and a competent staff.
A contributing factor may be the lack of priority that commonly is exhibited for testing. There is evidence indicating that test tool development is considered to be a low priority activity, performed in whatever time there is to spare and without the careful actions involved in producing production software. Lack of time for test-related work due to other, higher prioritized activities, is relatively common and is discussed further in Section 3.3.2.
Even if the lack of time is not the cause as such, the lower priority of test work indicate that it is seen as having less value than other work in the organization.

Untested test environment
If the test system quality is low, it will result in more expensive maintenance [IP7] and a lower quality of testing [IP31]. The consequences are likely a reduction in the overall perception of the value of the test automation as well as less time available for general improvement and feature additions to the automation. Considering situations such as the one reported by Karlström et al. [IP9], "in some situations it was hard to decide whether it was the test environment that caused a failure or the software under test," one can hardly blame those experiencing that kind of issues if they are wary of automation.

Limitations in externally sourced tools
Selecting the correct tools is very important. If the tool selection is done without concern for the process and the system under test, the result may be that the selected tools are unsuitable for the work [IP23] [IP11].
It is likely that a high level of customization in the application will make it harder to use automation efficiently [IP5]. If the application is customized [IP3], changing a lot [IP21] or running on an emerging platform [IP28], there will probably be issues with using generic tools, and more customization and in-house tool development has to be done. If the application is very customized, or operating on an emerging platform [IP28], there is an increased risk that tools and emulators used for testing are missing important features [IP18].
Reuse may be a way to mitigate the costs of automation, but one must consider that reuse have the potential to cause other problems in itself.
Reuse is likely harder if the system under test is specialized [IP3], changing a lot [IP21] or running on an emerging platform [IP28]. These limitations for reuse are similar to the limitations for using prefabricated tools, which can be considered to be a special case of reuse. Prefabricated tools are discussed in Section 3.3.4. In addition to the purely functional considerations, many organizations develop tools ad hoc, as further discussed in Section 3.3.4, and this is a significant barrier for reuse [56].
An externally sourced tool is a tool not developed by its users. It may be a commercial tool, an open source tool, or simply a tool that is produced by another team in the company, sufficiently far away from the users of the tool to be perceived as a third party. The potential benefits of using externally sourced tools are easily imagined. These benefits include increased work focus for the product developers, who do not have to develop both test tools and the sold product, and cost sharing by developing the tool once instead of everyone developing their own environment.
Using an externally sourced tool is fundamentally different from using a tool developed specifically for the system it is intended to test and introduces challenges that may cause significant problems for the users. The generic nature of a tool shared by many organizations may make it inadequate for the intended purpose [6] unless integrated with other tools [IP35], [IP11] or modified [IP7] to cover the gap. Commercial and open source tools have limited functionality [57] and may require modification to be useful. In the case of commercial tools, the modifications are likely impossible to do within the own organization and will require additional work by the supplier, which will cause more costs and increased lead times.
It is likely that there will be communication issues [IP2] unless there is a close cooperation between the supplier and the users. Any problems with the communication between the supplier and the users will decrease both the actual value of the tools and the perceived value of the tools. This may have serious impact on the organizational efficiency. In our experience, people will start producing their own solutions if "being a customer" is seen as a larger impediment than developing a tool on your own.

Environment configuration
Configuration issues can be divided into 2 groups, problems relating to difficulties with performing the configuration and problems related to managing the configuration to be able to use known and relevant configurations for testing that are repeatable if necessary.
Ad hoc configuration of test tools and test beds introduces a test risk by potentially causing the tests to be executed on unknown configurations.
This may lead to instable results with corresponding false positives/negatives [IP34], [IP9], and problems with repeatability of tests [IP15]. Since repeatability and confidence in the tests are among the key benefits of automation [6][48], the importance of reliable configuration should not be understated.
Configuration issues are likely to occur if there is a suboptimal organization [IP9] of the environment or if it is difficult to use [4] [IP21], [IP11], [IP14], [IP8]. Expectations on configurability are important: If the test system users expect the system to work without their involvement [IP17], [IP9], [IP11], it is likely that configuration and setup of the test system will be more difficult and take longer time than if it was considered a normal part of their work. Unexpected work is less likely to be considered during training and staffing and will form a double impediment, as the necessary skills and people have to be secured before the work can be done.
Resolving configuration and setup issues may consume a large part of the time and resources dedicated to testing, to a point where the automation can be considered to be useless [IP29], [IP18], [IP32].

System under test
The properties of the system under test are very important for successful automated testing and will influence the design and operation of the test system.

SUT speed
One of the primary expectations on automation compared to manual work is increased speed. However, automated test execution speed cannot be faster than the reaction time of the system under test. This means that if the system under test is slow, the automated test execution will also be slow [IP22], and the perceived value of the automation risk becoming lower due to high expectations on speed.

SUT testability
Low testability in the system under test, with respect to automated testing, may cause problems [IP17] and potentially expensive workarounds to be able to implement the automated testing. This is particularly a problem for interfaces not designed for machine-machine interactions [IP28], such as graphical user interfaces [IP16], [IP34], [IP5]. In the case of testing a graphical user interface, it may be more cost-effective to improve the testability of the system under test and introduce a testing interface for functionality, than to implement and maintain functional testing through the graphical user interface [IP10].

Platform limitations
Targeting the same functionality towards different operating systems [IP5], [IP3], languages [IP33] or platforms [IP28] is likely to introduce impediments for the automated testing. Variations in look and feel, interfaces, and testability may cause problems when interfacing the system under test from an automated test execution system [IP28]. A prime example is development of mobile applications for smartphones, tablets, and similar.
Mobile applications is a rapidly expanding field [58], and the resulting products are deployed in a highly fragmented universe of different devices from different manufacturers [IP28]. Similar phenomena can be encountered in development of desktop software as well, when there are differences in the exposed interfaces, or if one expects to use the same test automation system for fundamentally different products without adaptations [IP5], [IP3].

Construct validity
Recall is the percentage of all relevant document in the searched databases that have been retrieved, and the precision is the percentage of retrieved documents that are relevant [20]. Having low recall corresponds to a failure to retrieve all or most relevant papers.
Searches with low precision require removal of many candidates, which increases the risk for misinterpretation or researcher fatigue. Given that we have removed approximately 99.7% of the initial search yield through manual screening, as shown by Figure 2, there is a risk that we have failed to identify primary studies with poor title or abstract during our screening process. Inadequate keywords and terminology is a well-known issue when performing systematic literature reviews [28], and we know that relevant publications were missing completely from the initial corpus. We had previous knowledge of one relevant study by Karlström et al. [IP9] that was not found by the search strings, as reported in Section 2.4. This validity issue was mitigated by hand-searching a selection of relevant venues as described in Section 2.3.

Internal validity
In identifying impediments, there is always a risk that the primary studies have not reported the actual impediments, but instead the impediments that were perceived at the time of the study. In that case, this has carried over to our synthesis. A further threat to internal validity is the substantial industrial experience of the authors. While this industrial experience has served to facilitate analysis and conceptualization, it might also unconsciously have impacted data extraction and analysis, e.g., judgment of relevance of certain potentially domain-specific impediments.

External validity
One readily identifiable challenge with performing a systematic literature review in this field of research is the lack of evidence. In total, 39 relevant publications were included after screening, of which 18 are experience reports. This is on par with the 25 publications analyzed by Rafi et al. [6] in their review of benefits and limitations of automated software testing.
Several publications are partially overlapping, which makes the actual body of evidence smaller than it appears. Publications [IP3], [IP24], [IP7] and [IP5] draw from the same source material. The publications are written from different points of view and with different analyses of the material, but the contexts are the same and the presented evidence is very similar. Also, publications [IP1] and [IP26] at least partially share the same context.
Publication bias occurs when the published research is unrepresentative for the research actually performed in the area [59]. This introduces a risk of drawing the wrong conclusions when performing secondary research. A small body of evidence, such as the one used for this review, is more sensitive to bias problems. Kitchenham et al. [16] proposes that there is an increased risk for publication bias in industrial research that is viewed as potentially sensitive. Test research is arguably sensitive, as the testing is related to the product quality, and as such, is directly related to the bottom line of any software business.
However, much research is anonymous, and there is a reasonable spread of the severity of the issues reported in the included publications. The major risk of bias is likely in evidence drawn from experience reports, as many experience reports paint a very positive view of the situation. All considered, we consider it likely that there is a sufficient mix of both trivial and significant impediments in the published literature.

Reliability
The limited volume of evidence means that the basis for the synthesis can be questioned. Many of the publications would not have been included if one were to follow strict inclusion criteria and only accepted research reports. In our opinion, we have captured a better picture of the practice by also including the experience reports in the analysis.
The search approach and screening influences the reliability as well, due to the large part of manual selection involved. Noblit and Hare [17] point out that "[a] synthesis reveals as much about the perspective of the synthesizer as it does about the substance of the synthesis." What this means is that the synthesized result is highly dependent on the researchers involved, and is only one of the possible interpretations of the data [30]. This is a threat to the validity of the study, as it is likely that researchers with different background or perspective will reach different conclusions.

DISCUSSION AND FUTURE WORK
In this section, we discuss the implications of the review results for research and practitioners and propose future work based on the results and meta-observations during the review.

Implications for researchers
A related literature review on automated software testing was undertaken by Rafi et al. [6]. While some of the impediments of automated software testing identified pointed out in our work were also identified as limitations in that review, our work extends the work of Rafi et al. in the following ways: First, instead of only listing limitations or impediments that may arise in automated software testing, we elaborate on how these impediments manifest and why they do so. Second, we describe an explanatory socio-technical system model of how the impediments affect each other in Figure 4.
Third, while the review by Rafi et al. was presented in a very summarized form (the search process, analysis and results were presented in less than two pages), we describe the our review and analysis process in depth.
For researchers, the categorization, the publication overview, and the synthesized results are useful as a basis for future research and to understand the state-of-the-art and the state-of-the-practice as of the time of the review.
Terminology is a known issue in software engineering research [4,23-28] and a threat to the reliability of empirical research. Our categorization, presented in Tables 5 and 6, provide a framework for classification of test automation impediments that can be used during future research in the area.
The publication overview reveals that there is a shortage of empirical evidence in the research area. There are few research publications, 18 of 39 of the analyzed publications are experience reports. This indicates that there is both a need for, and a clear opportunity to perform, industry-relevant research in the area: There is a high practitioner interest and a lack of research publications.
Our interpretation of the collected evidence, based on our experiences from industry and research, is presented in the synthesis and the model in Figure 4. The synthesized narrative together with the system can be used both to quickly gain an understanding for the area and to direct future research.

Implications for practitioners
It is clear from the review that there are similarities in experienced impediments between different businesses, product types, and development models. Hence, we recommend anyone that use, or plan to use, test automation, to consider the findings in this paper. Our results are useful to gain a deeper insight into the complexity and challenges of test automation, to be able to implement and use test automation in a more effective and efficient way.
Understanding a problem is the first part of a solution for the problem. In particular, we propose that the socio-technical system model of test automation presented in Figure 4 will be useful for development organizations that implement test automation, where the model can be used to understand and manage the automation by monitoring and controlling the interacting phenomena described by the model.

Meta-observations on the review process
A meta-conclusion from the review is that conducting systematic literature reviews in software engineering are difficult. Many challenges were encountered during the review process: unclear terminology [28], influential databases did not provide machine readable search results, and a large number of irrelevant documents were returned by the searches which required extensive screening. These impediments need to be removed to enable more high-quality reviews in a shorter time in the future.
Barroso et al. [20] state that the main threat to the validity of a systematic literature review is the failure to conduct a sufficiently exhaustive search. This is mitigated by relaxing the search criteria to include more potentially relevant publications in the initial corpus and perform a screening to identify what actually is relevant to the study [29]. A consequence of this is that there is a need for extensive, manual, postprocessing of the searches, which also may introduce bias to the study. An alternate approach to searching and screening is "snowballing," searching through the graph of references starting with a set of key publications. This approach was investigated by Jalali and Wohlin [25] who concluded that it could be a more efficient approach, in particular when the search query contain general terms that risk matching many irrelevant publications.
Given the amount of work needed to perform this review and the resulting yield, less than 0.3% of the originally retrieved publications, we see an urgent need to improve how publications are classified and to improve the search precision in the publication databases.

Future work
In general, we see a need for empirical research in industry to increase the body of evidence related to actual use of test automation. This observation is based on the relatively small number of relevant research publications identified during this study. In our experience, this will require close cooperation between industry and academia, as the type of empirical research needed to provide deeper insights into the actual problems experienced by software developers need to be conducted close to the people performing the work [61].
A large part of the impediments are of an organizational, and not technical, nature, as shown in Table 5. Many of these impediments can be suspected to be caused by an ad hoc approach to the management and handling of the test automation, even in organizations that are mature in their handling of their commercial products. This class of impediments are reported from a wide diversity of contexts and appears to be sufficiently general in nature to warrant further investigation. What are the reasons for the apparent lack of focus on testing and test automation, and how can the situation be improved?
There is a lack of test process improvement and assessment models suitable for test automation [IP11]. To address this, we propose the development of an improvement model in the form of a measurement system similar to a balanced scorecard [60] that is based on the model proposed in Figure 4. The implementation of the measurement system is proposed to be a combination of objective measurement, as proposed by Eldh et al. [12], and self-assessment. The validation of the measurement system will require further industrial empirical research, which in addition to validating the model, also will contribute to the general body of evidence in the test automation area.

CONCLUSIONS
In this paper, we report on a systematic literature review, searching for publications providing information on impediments related to automated software testing. The retrieved publications were analyzed to propose a categorization of impediments and to perform a qualitative synthesis of the collected evidence.
The qualitative synthesis is based on the evidence collected through the review and presents our interpretation of the state of the practice as described by the publications included in the review. The synthesized results as presented by the narrative in Section 3.3, and the model in Figure 4, builds on the collected evidence and our experience from industry and research.
This study proposes an explanatory model for the impediments and their interaction in Figure 4. The explanatory model relates the impediments to each other, and to the business of software development and testing in the large. The explanatory model highlights areas of interest for future research and enables a deeper understanding of how to manage impediments in an automation project.
It is important to make software engineering research useful and available to practitioners [2]. One way to make research more available to practitioners is to create assessment and improvement models that can be used to guide the daily work in a software development business. We propose that there is a need for an assessment and improvement model specifically aimed at software test automation. Hence, we propose further research to turn the explanatory model in Figure 4 into an assessment and improvement model for software development organizations working with test automation.