Metrics in automotive software development: A systematic literature review

Software is an integral part of new features within the automotive sector. To help determine software quality, car manufacturers in the Hersteller Initiative Software (HIS) consortium defined a set of metrics. Yet, problems with assigning metrics to quality attributes often occur in practice. The specified boundary values lead to discussions between contractors and clients, as different standards and metric sets are used. This paper studies metrics used in the automotive sector and the quality attributes they address. The HIS, ISO/IEC 25010:2011, and ISO/IEC 26262:2018 are utilized to draw a big picture illustrating (i) which metrics and boundary values are reported in literature, (ii) how the metrics match the standards, (iii) which quality attributes are addressed, and (iv) how the metrics are supported by tools. Our findings from analyzing 38 papers include a catalog of 112 metrics, of which 17 define boundary values and 48 are supported by tools. Most of the metrics are concerned with source code, are generic, and are not specifically designed for automotive software development. We conclude that many metrics exist, but a clear definition of the metrics' context, notably regarding the construction of flexible and efficient measurement suites, is missing.

Having an appropriate software design, it is also possible to deploy new functionality after launch, for example, via over-the-air or online remote updates, even for features that have already been deployed. Because many vehicle functions are realized through software, the quality of automotive software is key to ensuring the reliability of a car.
Automotive software development is also challenged by the way software is developed today. Many software components are no longer developed in-house but outsourced to third parties. 1,3 Hence, car manufacturers have partly established procedures to efficiently and effectively assess such software components in the development process. Standards and norms like the ISO/IEC 26262:2018 4 and the ISO/IEC 25010:2011 5 provide support by defining development and quality assurance procedures. However, such standards are generic, thus requiring adaptation and embodiment into the internal software development process. That is, generic quality requirements must be adapted to the actual software system and its measurement procedures, metrics must be mapped to quality attributes, and boundary values have to be defined to fulfill these quality attributes. The characterization of software properties and the measurement of software quality can be done with the help of (software) metrics. 6,7,10,11 Many metrics are generic to allow for broad applicability and, to a certain extent, to support comparability of software components and systems.
Metrics are available for different programming languages, models, and development activities, and a number of tools provide support for data collection and analysis.

| Problem statement and objective
Different definitions and a differing understanding of quality attributes and of the metrics to assess relevant quality attributes lead to situations in which project parties (manufacturers, suppliers, and contractors) risk misunderstandings and costly renegotiations of qualities. An agreed, modern, and harmonized metric suite for automotive software development is not available; moreover, a recommendation system that helps practitioners define the qualities of interest and select metrics aligned with the available development tool chain is missing.
Our overall objective is to capture the current state of the art and practice of metrics and their use in automotive software development to support the improvement of quality management systems. We aim to provide views on metrics and quality standards that help consolidate metrics applied to automotive software projects and, thus, reduce the necessity of deviations from defined qualities.

| Contribution
In this article, we present a systematic literature review on the use of software metrics in the automotive software sector. On the basis of 38 selected primary studies, we extracted 112 metrics for which we provide a detailed description and a categorization using the HIS* metric catalog, 31 selected quality attributes of the ISO/IEC 25010:2011 5 quality model, and the ISO/IEC 26262:2018 Part 6 4 as references. In the systematic review, we found only 17 metrics defining boundary values; yet, no general boundary values or recommendations were found. The metrics obtained from the systematic review were also analyzed in the context of 20 selected tools used to collect data and compute metrics. Of the 112 metrics, 48 are supported by the selected tools. Our findings include that there is neither an agreement in literature on the practical relevance of specific metrics nor an agreed mapping between metrics and quality categories. Furthermore, our findings show that the metrics found are not specific to automotive software development. Metrics are of a general nature and are used to assess software (as part of a safety-critical system) in general. Specific characterizations of metrics for automotive software development, notably a precise definition of the context of use, are not available. We contribute views proposing such mappings and characterizations that include the 112 metrics from the systematic review and the three selected standards as well as a mapping to 20 selected tools supporting these metrics. The mappings aim to help practitioners select appropriate metrics that cover desired quality attributes and that reduce the number of exceptions from the agreed quality requirements. Our results lay the foundation for constructing a recommendation system that helps practitioners develop adaptable measurement systems.

| Outline
The remainder of this article is organized as follows: Section 2 discusses background and related work. Section 3 presents the research design including the research goals, research questions, and the different steps performed to conduct our study. Section 4 presents the study results, which are discussed in Section 5, before we conclude the paper in Section 6. The appendix of this article includes detailed information about the study, detailed data tables of the study results, visual mappings, and the metric catalog.

*HIS stands for the "Herstellerinitiative Software" (OEM Software Initiative), a consortium composed of the German car manufacturers targeting automotive software development. This initiative, however, was closed in 2008. Information is partially available here: https://www.autosar.org and here: https://emenda.com/wp-content/uploads/2017/07/HIS-sc-metriken.1.3.1_e.pdf.
| BACKGROUND AND RELATED WORK
We introduce the background regarding metrics and measurement systems in automotive software development before discussing related work.

| Metrics in automotive software development
In automotive software development, a number of quality standards, notably the ISO/IEC 26262:2018 4 (including the Automotive Safety Integrity Levels, ASIL), have to be applied. Quality norms and standards also refer to metrics, which are supposed to help assess software quality (e.g., Table C3). However, a clear understanding of which metrics in software development address a specific quality attribute (e.g., standard attributes as shown in Figure 1) is often missing. It is often unclear what a reasonable set of metrics addressing a specific quality attribute looks like. For instance, depending on different ASIL levels, metrics can be interpreted differently with regard to value ranges and boundary values.
As a result, automotive software projects often struggle with inappropriate metrics, which need to be adapted, and deviations from initially defined target values have to be carefully documented.† This situation challenges company- and/or product-wide quality management systems in establishing standards and providing a notion of what is considered high-quality software.
In response to the situation outlined above, a standardized set of metrics for automotive software development was developed by the HIS consortium. The HIS metrics target the C programming language and provide no further categorization, for example, for project size, project type, and safety level. A definition of boundary values to be applied to the different project types is not provided. HIS does not include mappings of metrics to quality attributes of interest. Furthermore, as languages other than C are also used in automotive software development today, for example, C++ and Java, the applicability of the HIS metrics is limited. In practice, this situation leads to constant negotiations of boundary values and quality numbers that car manufacturers, integrators, and (external) software vendors have to agree upon. Because there is no standard set of metrics available for automotive software development, software quality is hard to compare across different projects and even within projects that include different software components.
Another perspective on software quality is given by the ISO/IEC 25010:2011. 5 This standard, among other things, provides a holistic perspective on general product quality. ISO/IEC 25010:2011 evolves the quality attributes from the ISO/IEC 9126 to allow for a better characterization of the different qualities to be considered in software and system development projects. Figure 1 provides an overview of the quality criteria, notably of those criteria that have been selected for analysis in the article at hand (Section 3.1). For all quality attributes, the complementing ISO/IEC 25023:2016 9 recommends metrics (note that the standard also uses the term "measure"), including simple formulae and basic boundary values as baselines. The quality criteria illustrated in Figure 1 also show that the ISO/IEC 25010:2011 is meant to be applicable to any software-intensive system. Hence, in this article, these general quality criteria serve as a baseline for the literature analysis.
As the different standards for automotive software (ISO/IEC 26262:2018) and general software product quality (ISO/IEC 25010:2011) evolve, automotive software companies are nowadays challenged by different, partly competing and inconsistent, metric sets: the one provided by the ISO/IEC 26262:2018 and the one provided by the ISO/IEC 25023:2016. Hence, for each automotive software project, these different metric sets have to be analyzed and implemented carefully to avoid inconsistencies and the risk of misunderstanding due to deviating quality baselines. The article at hand specifically addresses this issue by providing mappings using the ISO/IEC 25010:2011 quality attributes as a reference to integrate and map the different standards and the metrics found in the systematic review.

† A car as such is a dependable system, i.e., reliability is a key requirement of the system. Consequently, the software of a car must also fulfill the quality attributes related to reliability, e.g., availability, safety, and security, for which dependability is the umbrella (see also Avizienis et al. 27).
FIGURE 1: Overview of the software product quality attributes as defined by the ISO/IEC 25010:2011 standard, which have been selected for the analysis (see Figure 4)

| Related work
Metrics are an important topic in software engineering research, 6,8 and, therefore, metrics are constantly reported in literature. For instance, El-Sharkawy et al. 10 study the current state of practice regarding deployment metrics in software product lines (SPL). They point out that there are very few metrics that can assess quality in the context of SPLs. They further point out that many metric definitions are rather inaccurate and that reuse of metrics across (sub-)products has barely taken place. Furthermore, the authors found that the information from the SPLs combined with the code artifacts received little attention. They conclude that it would be valuable to combine the information from the variability model and the code artifacts. In a similar direction, Wagner et al. 12 propose the Quamoco approach, a meta-quality model aiming at closing the gap between abstract quality attributes and concrete quality assessments. A number of quality models exist, 13 and several of them are (partially) implemented in tools that support collecting metric data and analyzing quality attributes of interest. For instance, Tsuda et al. 14 provide a measurement and quality evaluation framework based on the SQuaRE model, that is, the ISO/IEC 25000 standard series. However, most quality models are focused on specific aspects, leaving out the big picture of the "product." Quamoco's base model is based on the ISO/IEC 25010 quality attributes (Figure 1) and includes more than 300 factors and 500 measures for software products developed in Java and C#. 15 Besides secondary studies, quality models, and meta-quality models providing a big picture, several publications deal with the actual application of metrics in practice. For instance, Selvarani et al. 16 developed a model, based on the Shannon entropy, to examine and evaluate the Chidamber and Kemerer (CK) metrics for their predictive capability for errors and degeneration. The results show that the NASA/Rosenberg threshold 16 risk categorization allows for a high level of forecasting. Also in the space domain, the European Cooperation for Space Standardization (ECSS) standards 17 provide guidelines and examples of software metrics. These metrics can be used in space system development with respect to the requirements 18 and to provide a coherent view of the software metrication program definition and implementation. In this article, we also use the ECSS standard to fill gaps in the details of metrics obtained from the systematic review (see also Section 3.3.2).
An analysis of complexity measures in practice was provided by Antinyan et al. 19 The authors found that the current measures for code quality and complexity are barely known in practice and rarely used by software engineers, which is also supported by Sloss et al. 11 Antinyan et al. 19 showed that this lack of knowledge has a negative impact on the internal quality of a software product. Furthermore, they analyzed how software metrics integrate with the goals of an organization and conclude that it is important to capture well-designed metrics with documented goals in a catalog. This helps to measure progress and achieve the defined goals. A practical and systematic start-to-finish method for selecting, designing, and implementing metrics is considered a valuable aid in further improving software products, processes, and services. Such catalogs, however, require metrics to be "universal." In this regard, Hoffman et al. 20 investigated how metrics can be used universally. For this, the authors analyzed the properties of metrics in detail and proposed a schema to assess metrics for their universal applicability. At the other end of the spectrum, Alves et al. 21 present a method that can be used to create boundary values for software metrics using benchmarks. The authors propose a method that weights software metrics, and they evaluated their proposal using 100 software projects. They conclude that their method can better reflect the boundary values of metrics, because essential metric properties have been taken into account. Similarly, Schroeder et al. 22 studied how software metrics combined with expert knowledge can be used to evaluate models for specific quality criteria. They analyzed 65 000 software revisions and showed that the predictive quality of models can be improved using expert knowledge and software metrics together. Finally, Schroeder et al. 23 studied how machine learning methods and software metrics can predict the development of models (Matlab/Simulink) in the automotive industry. The authors analyzed a project with 4547 revisions in total. They could show that metrics provide an important input to support the machine learning methods.
The article at hand contributes an analysis of metrics as reported to be used in automotive software development. We aim at developing a catalog of metrics for this domain to lay the foundation for developing recommendation systems that improve the usability of metric-based measurement systems in industrial contexts. Furthermore, we aim at consolidating the variety of metrics available to help practitioners select proper metrics while also taking into account the available tool infrastructure. That is, practitioners shall be enabled to select metrics and tune the selected metrics, such that they can implement these metrics with those tools that are available in the company's tool chain.

| RESEARCH DESIGN
Figure 2 shows the overall research methodology applied, which we describe in detail in Section 3.1. Section 3.2 presents the research objectives and research questions, before we describe the data collection procedures (Section 3.3) and the data analysis procedures (Section 3.4). Finally, we describe the procedures implemented to increase the validity (Section 3.5) of our study.

| Overall methodology
This study was conducted following a multistaged research approach in which we used a systematic mapping study 24 to scope the research and a systematic literature review 25 to perform the detailed analysis. To organize the literature studies, we followed the pragmatic guidelines defined in Kuhrmann et al. 26 and adopted the "three-researcher voting model." In addition to the plain literature studies, different mappings and complementing analyses of 20 selected tools have been performed to drive the collection and structuring of the metrics obtained from the systematic review.
That is, the study at hand consists of three substudies that are integrated with each other to provide a big picture. The overall research method applied, including the different analysis steps (Section 3.4), is illustrated in Figure 2 and will be explained in detail in subsequent sections.

| Research objectives and research questions
Our overall objective is to develop a catalog of metrics used in automotive software development. In this context, the quality attributes addressed by metrics, the practical use of metrics, and respective boundary values are of particular interest. To address our research objective, we defined the following research questions:

RQ1: Which metrics are reported as being used in automotive software development to evaluate quality?
Numerous metrics exist to measure software systems and help determine software quality. With this research question, we aim to collect metrics reported in literature that are (specifically) used in the field of automotive software development.
RQ2: Which quality attributes are addressed by the metrics?
As metrics are used to support determining software quality, we are interested in the relation of the different metrics to the quality attributes of a software system. That is, we aim at answering the question of whether there are metrics specifically addressing certain quality attributes. For this, we utilize the three norms and standards HIS, ISO/IEC 25010:2011, and ISO/IEC 26262:2018, and we provide a mapping between metrics and these standards.

RQ3: Which boundary values exist for the different metrics?
Basically, a metric is a function mapping a software characteristic to a number, which is used to evaluate the characteristic of interest. 6,7 However, quite often, it remains unclear when a measurement outcome of a specific metric can be considered good or bad. In this research question, we analyze the metrics obtained from the systematic review for boundary values and study the contexts in which such boundary values are defined.

RQ4: How are metrics implemented and supported in software development practice?
Finally, we are interested in the applicability of metrics, notably the tools and the language families for which the metrics provide support. The goal of studying this research question is to investigate how far metrics described in scientific literature are practically implemented in tools at the market and how the availability of tool-supported metrics impacts the selection of proper metric sets in projects.

| Data collection procedures
To collect and select the papers of interest, we adopted the "three-researcher voting model." 26 Figure 3 refines the first phase of our study (Phase 1; two-staged literature search, Figure 2) and illustrates the overall data collection approach, which consists of a manual keyword-based web search for reference papers, which are listed in Table B1, and an automated search in the Web of Science (WoS; core collection), which yielded another 26 papers listed in Table B2. The subsequent sections provide details on the search procedure.
FIGURE 2: Overview of the multistaged research method (five main phases) and the three substudies implemented (a mapping study to scope the work and to develop search strings, a systematic review to collect the main data, and a collection and analysis of tools to evaluate the practical availability of the metrics)

| Query string construction and automated search
As the purpose of the study was to collect and systematize metrics for automotive software development, we opted for an opportunistic query construction approach. Initially, we defined the search string ("safety" OR "automotive") AND ("software metrics" OR "software metric"), which was the basis for the query construction. The goal of this search string was to get a set of publications dealing with metrics in automotive software development or, at least, general metrics applied to safety-critical system development. To test and refine the search, 26 we used the initial search string and several variants with Google Scholar (executed on June 29, 2018) to generate "test" result sets, which we analyzed for suitability, that is, whether they contain previously defined reference papers (Table B1). To analyze the suitability of a search string, we used the keywords of the returned papers to create word clouds, 26 which we visually inspected for a sufficient coverage of the topics of interest and an acceptable overhead (papers outside the area of interest). The word clouds provided the frequency of the keywords, and we collected those keywords that had a minimum frequency of three and were in scope of the study. The resulting list of keywords was consolidated, and the consolidated keyword list was used to create keyword groups, which were used to derive the final search strings shown in Table 1.
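For illustration, the keyword-frequency check behind the word clouds can be approximated with a few lines of Python. This is a minimal sketch of the procedure described above: the minimum frequency of three is taken from the text, while the helper name and the sample keyword lists are our own illustrative assumptions.

```python
from collections import Counter

def frequent_keywords(papers, min_freq=3):
    """Count author keywords across a result set and keep those
    occurring at least min_freq times (the threshold used above).

    papers: list of dicts with a 'keywords' entry (illustrative schema).
    """
    counts = Counter(
        kw.strip().lower()
        for paper in papers
        for kw in paper["keywords"]
    )
    return {kw: n for kw, n in counts.items() if n >= min_freq}

# A tiny "test" result set, as produced by a trial search string
papers = [
    {"keywords": ["software metrics", "automotive", "safety"]},
    {"keywords": ["software metrics", "quality model"]},
    {"keywords": ["software metrics", "automotive", "ISO 26262"]},
    {"keywords": ["automotive", "testing"]},
]
print(frequent_keywords(papers))  # {'software metrics': 3, 'automotive': 3}
```

Keywords passing the threshold would then be grouped and combined into candidate search strings, mirroring the consolidation step above.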
Eventually, we settled on the search strings listed in Table 1, which were constructed to generate overlapping result sets in order to minimize losses, as described by Kuhrmann et al. 26 The expected multiple occurrences of papers have been resolved in the dataset cleaning and the final paper selection procedures (Section 3.3.2). To avoid the need to implement extra activities to resolve "cross-database-cross-search" issues, that is, multiple occurrences of papers in the dataset due to the overlapping search strings applied to multiple digital libraries, we opted for a meta-search engine to execute the search. The search was designed as a topic-based search, which includes the fields title, abstract, author keywords, and Keywords Plus, using the WoS (core collection). The search using the WoS was executed on July 27, 2018, and yielded 143 papers in total. As Table 1 shows, 14 duplicates (<10% overhead) have been removed from the result set such that 129 papers remained for evaluation. For each hit in the WoS database, the full reference (including author list, title, abstract, and so forth; see Table A1) was exported into a spreadsheet file.
TABLE 1: Final search strings for the automated search in the Web of Science and the number of papers returned by the respective strings, including duplicates among all the different search strings in the whole result set

FIGURE 3: Data collection method implemented in the literature review

| Inclusion and exclusion criteria, search execution, data collection and extraction, and evaluation
To select the papers for the study, we defined the inclusion and exclusion criteria shown in Table 2. The evaluation of the result set using the inclusion and exclusion criteria was independently performed by two researchers. A third researcher evaluated the two votes and created a new dataset from the results of the individual votes. 26 Using the inclusion and exclusion criteria, all papers were evaluated for inclusion or exclusion in the study, and if a paper was accepted, the required data for the analyses were extracted. Finally, an integrated spreadsheet file was created that included all individual reviewer votes and extracted (meta) data. The integrated spreadsheet consisting of 36 candidate papers was re-evaluated by the whole team for finally selecting the papers to be included in the study. Eventually, 26 papers from the automated search have been selected for inclusion (Table B2). Together with the 12 handpicked reference papers (Table B1), the final dataset for the study consisted of 38 papers.
Besides the metadata (Table A1), for each paper, the data extraction (Figure 2; Phase 2) was performed using the structure shown in Table A4. In the course of executing the different data collection, extraction, and analysis steps, the data structure evolved, for example, by adding extra attributes and scores. Specifically, in the first iteration, we extracted all metrics mentioned in the respective papers (Table D1). In further iterations, we completed and extended the data, before, in the final iteration, we cleaned up and checked the extracted data to ensure that a metric is present in the dataset only once and all information regarding a specific metric was consolidated. Furthermore, during the data extraction, we learned that detailed information, for example, about the formulae used to compute the metrics, was available for only a few metrics. Specifically, the papers obtained in the search yielded 39 formulae, 51 metric descriptions, 12 boundary values, and 19 mappings between metrics and quality attributes. Therefore, we used the European Space Agency's (ESA) ECSS standard 17 and further gray literature 28 (a focused search for content in other literature that was not included in the 38 papers) obtained in a snowballing procedure as described by Badampudi et al. 29 to complement our data (Table A4) with detailed information. This extra literature increased the quality of the information such that, after integrating it, in total 112 metrics and their descriptions were available for analysis, that is, for the extraction of formulae, the identification of boundary values, and for performing the various mappings.

| Analysis procedures
To analyze the papers, we defined a basic data structure that includes the attributes of interest (Table A4). On the basis of the extracted data, we performed the analyses using the analysis model shown in Figure 4. The analyses included a number of mappings and comparisons of the data extracted from the systematic review with different standards relevant to the field of interest (cf. Figure 2; Phases 3 and 4). Specifically, we investigated the coverage and application of metrics found in the systematic review in the context of the standards introduced in Section 2.1.
All analysis steps were initially conducted by the academic researchers in the team, who presented the (tentative) findings in workshops and weekly phone calls to the practitioners in the team. In these workshops and calls, results have been discussed and next steps for the analysis were defined. Finally, as illustrated in Figure 2 (Phase 5), all results were checked again. These final checks also included an update of the standards to their latest versions‡ and a re-evaluation of the study's findings in the context of the new standards before executing the synthesis.
TABLE 2: Overview of the inclusion criteria (IC) and the exclusion criteria (EC)

EC2: The paper is a workshop summary, a guest editor introduction, etc., i.e., the paper is not an original research article.
EC3: The topic of interest is only mentioned in the introduction or related work but is not a key contribution of the paper.
EC4: The paper is not available for download.a

a Note that the exclusion of nonautomotive software papers was part of the cleaning procedure. However, if papers were identified that describe metrics outside the field of interest, usually, such papers have been considered for complementing data extraction and completion of data points.

‡ During the data analysis steps, the standard ISO/IEC 26262 was updated, and the industry partners made the new version available to the team. After a discussion on how to treat the new version, the actual analysis was paused, the new standard was analyzed, and, eventually, the analysis steps executed so far were repeated with the new standard version ISO/IEC 26262:2018 as a new baseline.
The starting point in our analysis model from Figure 4 are the 112 metrics identified in the dataset (in the following, we speak of SLR-metrics; see Table D1). In the next step, the 15 metrics defined by the HIS metric catalog were mapped to the 112 SLR-metrics (the metrics either matched directly or the descriptions of the metrics matched to a large extent), which created a unidirectional link between the HIS-metrics and the 112 SLR-metrics. After linking the HIS-metrics and the SLR-metrics, every SLR-metric and every HIS-metric was mapped to one or more of the 31 subcategories of the ISO/IEC 25010:2011 (Figure 1). In the next step, the ISO/IEC 26262:2018 4 (Part 6) was analyzed for metrics, which were mapped to the SLR-metrics as well. In case a mapping was not straightforward, the problematic metric was discussed in the team workshops.
To round out the picture, in the last step, the ISO/IEC 25023:2016 9 was mapped to the SLR-metrics. Because the ISO/IEC 25023:2016 provides a mapping to ISO/IEC 25010:2011, the finally developed mappings allow for putting metrics into context, such that a metric from the HIS metric catalog can be positioned in the ISO/IEC 25010:2011 standard and so forth.
Finally, to study the practical relevance of the metrics, in the team workshops, it was decided to also include tools used for collecting and analyzing metric data (Section 3.4.2). The procedure to link metrics to tools was implemented the same way the metrics' mapping was performed.
That is, the HIS-metrics and the SLR-metrics were mapped to tools to study which metrics are supported by the tools.
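As a minimal sketch of how these unidirectional links can be represented, consider the following Python dictionaries. All IDs, metric names, attribute names, and tool names below are illustrative placeholders, not the actual entries of Table D1 or the catalog mappings.

```python
# Illustrative linkage following the analysis model (Figure 4):
# HIS-metric -> SLR-metric IDs -> ISO/IEC 25010:2011 attributes -> tools.
# All names/IDs are placeholders, not the actual catalog entries.

his_to_slr = {
    "Cyclomatic complexity": [2],   # HIS-metric mapped to SLR-metric ID(s)
    "Comment density": [5],
}

slr_to_iso25010 = {
    2: ["Analyzability", "Testability"],  # SLR-metric ID -> quality attributes
    5: ["Analyzability"],
}

slr_to_tools = {
    2: ["Tool A", "Tool B"],              # SLR-metric ID -> supporting tools
    5: ["Tool A"],
}

# Following the links positions a HIS-metric within ISO/IEC 25010:2011
# and shows which tools support it.
for his_metric, slr_ids in his_to_slr.items():
    for mid in slr_ids:
        print(his_metric, "->", slr_to_iso25010[mid], "->", slr_to_tools[mid])
```

Chaining the dictionaries in this way mirrors how the mappings described above allow a metric to be positioned in the standards and in the tool landscape.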
In the following, we describe the two core analysis steps: (i) the mapping of the metrics with the quality attributes as defined in the ISO/IEC 25010:2011 and (ii) the mapping of the metrics with selected tools.

| Analyzing and mapping metrics with quality attributes
Because the HIS metric catalog is still frequently used in the article's industrial context, we started with analyzing how many of the HIS-metrics (Section 2.1) were found in the systematic review and which parts of the HIS metric catalog are covered. To link the HIS-metrics and the SLR-metrics, we developed a mapping from the HIS-metrics to the ISO/IEC 25010:2011 quality attributes, before we performed a mapping from the SLR-metrics obtained from the systematic review to the ISO/IEC 25010:2011 quality attributes and the associated metrics defined in the ISO/IEC 25023:2016.§ As for some metrics an appropriate mapping was provided in the respective papers (Appendix C1), we collected this information and completed the mapping ourselves where necessary. Finally, we analyzed all metrics for their level of support through tools (see Section 3.4.2).

| Analyzing and mapping metrics with tools
To analyze the tool support for metrics, we selected 20 tools that we analyzed for the metrics they support and for the coverage of the SLR-metrics. Specifically, we analyzed whether a specific metric is fully supported by tools, that is, data collection and evaluation are fully automated; whether there is partial support, that is, at least some steps in the data collection and evaluation are automated; or whether the metric has no tool support, that is, all steps in the data collection and evaluation have to be done manually. To find relevant tools, we conducted a web-based manual search using the search criteria summarized in Table 3. The purpose of this search conducted by the researchers in the team was to create an initial set of tools to be discussed in the whole team. After applying the exclusion criteria shown in Table 4, 12 tools were initially selected (Table 3). Note that the resulting tools are counted including their language variants; that is, SourceMeter is available for Java, C++, and so forth, but generates only one hit in Table 3. However, during the analysis, the language variants have been analyzed as well. The tools were used as additional input for assessing the practical relevance. It has been assumed that metrics found in the SLR and supported by many tools are of higher practical relevance.

FIGURE 4: Analysis model applied to the dataset

§ Please note that the ISO/IEC 25023:2016 9 covers more metrics than defined in ISO/IEC 25010:2011. Furthermore, as we were interested in analyzing the coverage of metrics in detail, our analyses were performed using the 31 detail-level criteria from ISO/IEC 25010:2011 (see Figure 1). The ISO/IEC 25023:2016 also provides basic formulae and thresholds. These were, however, not used in the article at hand, as we were explicitly interested in the practically defined/used thresholds reported in literature.
To confirm the practical relevance of the selected tools, the initial set of 12 tools was presented to the practitioners in the team who were not involved in the search for tools. On the basis of the discussion, the tool set was extended to include the tools Development Assistant for C (DAC), Klocwork, and Polyspace Code Prover (QA-C), which are frequently used in the company's tool chain but were not present in the initial list of tools in Table 3. After extending the list of tools, the whole team discussed the extended list again and agreed on its appropriateness for the analysis. The finally analyzed 20 tools (including variants) are listed in Table 6, which also shows the different metrics found in the systematic review and how these metrics are supported by the selected tools. However, it has to be noted that this search is scoped to general tools that are available for analysis and those tools specific to the industrial context of the practitioners in our team. This introduces a threat to validity, which is discussed in Section 5.3.

| Validity procedures
To constructively improve the validity of our findings, we implemented several procedures. First, we rely on available standard procedures to define the study and collect and analyze the data. We framed the study by analyzing the input material (HIS, Section 2.1) and discussing the findings in workshops (academic and industrial partners). Then, we conducted a mapping study according to Petersen et al. 24 using a limited set of 12 handpicked papers that serve as reference papers (cf. Figures 2 and 3). After analyzing the reference papers, a full systematic review according to Kitchenham et al. 25 was conducted following the work mode described in Kuhrmann et al. 26 To ensure the rigor of the individual steps and to improve the validity of the results, we established a team-based working style. That is, the team of researchers was split such that at least two researchers performed an actual activity, for example, collecting the data, analyzing the data, and so forth. From the remaining researchers, at least one not involved in the respective activity conducted a quality assurance. In case of stalemates during evaluation processes, a researcher not involved in the decision-making was called in to evaluate the critical object and to decide.
In general, evaluations were independently performed by at least two researchers, and a third researcher integrated the results. As illustrated in Figure 2, all activities were continuously quality assured. For this, we defined a work mode in which the academic researchers involved in the study were mainly concerned with analysis and synthesis tasks, whereas the practitioners participating in the study performed continuous quality assurance. In workshops and in weekly phone calls, tentative results, problems, and issues were discussed, and further study activities were defined.
TABLE 3: Search strings applied to the web search for tools that support software measurement and overview of the resulting tools after applying the exclusion criteria from Table 4

ID 1: "simulink" and "metric". Initially selected tools: Simulink Check.
ID 2: ("tools" OR "tools * code") AND ("analyzer" OR "analysis"). Initially selected tools: Gamma, Sonargraph, NDepend, Frama-C, PMD, SourceMeter, Blu Age Analyzer, Parasoft C/C++test, Resource Standard Metrics (RSM), Designite.
ID 3: "Eclipse * code" AND ("analyzer" OR "analysis"). Initially selected tools: Eclipse Metrics.

Total: 12
TABLE 4: Overview of the exclusion criteria for tools (ECT) that were used to reject tools returned by the web search

ECT1: The tool is in the status "support canceled".
ECT2: The tool's documentation does not contain any information about the implemented/supported metrics.
ECT3: There is no publicly accessible documentation for the tool available.
ECT4: There is no test/evaluation version of the tool available.

| STUDY RESULTS
This section presents the findings of our study. We start with a result set overview in Section 4.1 before we present the findings structured according to our research questions as presented in Section 3.2.

| Overview of the result set
As described in Section 3.3.2, our study finally included 38 papers in total. Figure 5 shows the publication frequency of the papers analyzed.
Detailed information about the papers included in the study can be taken from Appendix B1.
To better characterize the analyzed papers, we classified the papers according to the research type facet (RTF; Wieringa et al. 30) and the contribution type facet (CTF; Shaw 31), which is illustrated in Figure 6. The categorization in Figure 6 shows that our result set provides a considerable share of solution proposals and lessons learned. Yet, we also find models and theories proposed in the result set. In summary, approximately two thirds of the papers propose solutions (Figure 6, dimension RTF), and approximately one third of the papers reports on lessons learned (Figure 6, dimension CTF). Also, the result set contains only one paper describing a tool-based solution.
An evaluation of the papers according to the rigor-relevance model by Ivarsson and Gorschek 32 is illustrated in Figure 7. The evaluation shows that 13 out of the 38 selected papers (upper right quadrant of Figure 7) are evaluated to be of high to very high relevance (score ≥ 3) and of high rigor (score ≥ 2.5). Only seven papers received an evaluation of rigor ≤ 1.0, and 12 papers in total received an evaluation of relevance ≤ 1, of which six papers are in the lower left quadrant. Hence, we consider the overall result set of sufficient practical relevance, and we consider the included papers as having undergone a research procedure of sufficient rigor.

| RQ1: Which metrics are reported as being used in automotive software development to evaluate quality?
The first research question is concerned with identifying the metrics used in automotive software development as reported in literature. In total, the 38 analyzed articles provided 112 metrics that address various artifacts and that can be applied to various programming languages. The complete catalog of metrics identified in the systematic review can be taken from Table D1. To categorize the metrics, in a team workshop, we reviewed all metrics and agreed on the following three categories (adapted from Kan 8):
1. Process metrics. This category includes metrics to quantify properties of the development process and its activities.
2. Product metrics. This category includes metrics to quantify properties of the product as such, for example, represented by models and architecture documentation.
3. Code metrics. This category includes metrics to quantify properties of source code.
Figure 8 shows the assignments of the 112 SLR-metrics to the three metric categories. In total, we made 124 assignments for the 112 metrics.
The figure shows that the majority of the found metrics is concerned with source code, models, and architecture. Metrics applied to the measurement of binaries are not present in the result set. Also, most metrics are categorized as code metrics and product metrics, which is in line with previous studies. 13,15 Because we were interested in the information provided alongside the metrics' naming, we analyzed the papers for the availability of the formulae used to compute the metrics. In total, for 54 out of the 112 SLR-metrics, we could find formulae or algorithms, for instance, simple formulae, for example, Lines of Code (LoC; Metric ID 2 in Table D1), complex formulae, such as the Component Input Complexity 33 (Metric ID 27 in Table D1), or algorithms, for example, the Fault Coverage 34 (Metric ID 60 in Table D1). For another eight metrics, we utilized "gray literature" to add formulae to the metrics (e.g., the Classified Attributes Inheritance (CAIW), named in Mumtaz et al. 35 with a formula found in Alshammari et al. 36; Metric ID 19 in Table D1). Finally, for 50 out of 112 metrics, we could not find proper information regarding the metrics' structure or their computation.
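To illustrate the kind of simple formula reported for such metrics, the following sketch computes a naive Lines of Code count. Note that real LoC definitions vary (physical vs. logical lines, with or without comments); this variant, which skips blank lines and full-line comments, is only one possible reading and not the definition used by any particular primary study.

```python
def lines_of_code(source: str, comment_prefix: str = "//") -> int:
    """Naive LoC: count non-blank lines that are not full-line comments.
    LoC definitions differ in practice, so this is one illustrative
    variant only, not a normative definition."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    )

snippet = """// adds two numbers
int add(int a, int b) {
    return a + b;
}
"""
print(lines_of_code(snippet))  # 3
```

The divergence between such seemingly trivial variants is exactly why missing formulae and metric descriptions hamper comparability across studies.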
Finding 1: Most of the metrics found in the systematic review are concerned with source code (e.g., McCabe, Lines of Code, Henry and Kafura, Halstead). Product metrics, as the second-ranked category, are mainly concerned with models and architecture descriptions (e.g., Ease of function learning, Function points, Traced Components per Requirement). Metrics applied to binaries are not contained in the result set. Another finding is that only a few metrics have been found explicitly addressing automotive software development. The analysis also showed that there are many cross-sectional relationships to the dependable systems domain. In total, only 54 formulae could be extracted from the papers included in the systematic review, and another eight formulae have been added through studying "gray literature." That is, 50 metrics were mentioned in the text only, without further details (e.g., Number of Statements).

| RQ2: Which quality attributes are addressed by the metrics?
A key question in our research is concerned with the quality attributes addressed by specific metrics. That is, we analyzed which metric or set of metrics shall/can be applied to assess a specific quality attribute of a software system.
According to our analysis model from Figure 4, we initially mapped the metrics to the different standards as described in Section 3.4. Figure 9 provides an integrated perspective on the mappings, for which the detailed mapping tables can be taken from Appendix C1. We mapped the metrics to the ISO/IEC 25010:2011 to check the general coverage of the quality attributes addressed by the SLR-metrics found in the systematic review.
Specifically, we studied the mappings provided by the two ISO standards 25010:2011 and 25023:2016, which is illustrated in Appendix C1 (Figure C1). The ISO/IEC 25023:2016 provides 86 metrics for which a mapping to the 31 subcategories of the ISO/IEC 25010:2011 product quality model (Figure 1) is provided. We used this standard mapping as a baseline for mapping the SLR-metrics to the different standards. We mapped the SLR-metrics to the HIS metric catalog and to the ISO/IEC 25010:2011 quality attributes to assess the coverage. The mapping to the HIS-metrics is provided in Table C1, and the mapping to the ISO/IEC 25010:2011 quality criteria is provided in Table C2.
FIGURE 7: Categorization of the papers according to the rigor-relevance model

The mapping to the HIS metric catalog shows that all but one metric defined in the HIS metric catalog could be found in our result set. Only for the HIS-metric Number of Recursions could no proper mapping be found. The mapping to the ISO/IEC 25010:2011 in Table C2 shows that we found a good coverage but that there is a specific focus on the quality attributes of the top-level category Maintainability (Figure 1), which contains the quality attributes Modularity, Reusability, Analyzability, Modifiability, and Testability. The third standard of interest is the ISO/IEC 26262:2018, which is specifically designed to support the development of safe road vehicles. To provide a big picture in which we combine the mappings of the HIS-metrics and the ISO/IEC 25010:2011 quality attributes with the ISO/IEC 26262:2018, Figure 9 provides an integrated perspective. This mapping shows that the majority of the metrics addresses the category Maintainability. Also note that Figure 9 is a reduced presentation based on available mappings of the metrics found in the systematic review and their assignment to quality attributes, to allow for providing an integrated perspective. The full mapping of the ISO/IEC 26262:2018 to the ISO/IEC 25010:2011 quality attributes can be taken from Appendix C1 (Figure C2). Further detailed assignments from the considered standards and the metrics obtained from the systematic review can be found in Tables C3 and C4.
Finding 2: Most of the SLR-metrics address the ISO 25010 top-level category Maintainability, which contains Modularity, Reusability, Analyzability, Modifiability, and Testability.For some quality attributes, our study did not provide metrics to cover all subcategories of the ISO 25010.The different mappings in Appendix C also reveal that certain system characteristics are emphasized while others have no proper metrics defined.

| RQ3: Which boundary values exist for the different metrics?
A software metric is a function whose inputs are software data and whose output is a numerical value that can be interpreted as the degree to which the software possesses a given attribute that affects its quality. 37 However, quite often, it remains unclear when a measurement outcome of a specific metric can be considered good or bad. For this, boundary values are introduced that support the interpretation of a metric.
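As a minimal sketch of how a boundary value supports interpretation, the following example classifies a cyclomatic complexity measurement against a threshold. The threshold of 10 is McCabe's often-cited recommendation and stands in here for whatever boundary value a project actually agrees on; it is not a value prescribed by HIS or ISO/IEC 26262:2018.

```python
def interpret(value: float, boundary: float) -> str:
    """Interpret a metric value against an agreed boundary value.
    Boundaries are context dependent (e.g., per project type or ASIL
    level); the default of 10 used below is McCabe's classic
    recommendation, chosen purely for illustration."""
    return "acceptable" if value <= boundary else "needs justification"

cyclomatic_complexity = 14  # measured for some function under review
print(interpret(cyclomatic_complexity, boundary=10))  # needs justification
```

The "needs justification" outcome mirrors the practice, described in Section 2.1, of documenting deviations from initially defined target values.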
Our third research question is concerned with such boundary values and whether and how they are defined for the metrics found in the systematic review and in the standards included in the study. Table 5 shows that 17 out of the 112 SLR-metrics define boundary values. Of these 17 metrics, 11 boundary values were found in the result set of the systematic review, and six have been added by studying the external ECSS-Q-HB-80-04A standard. 17 The external ECSS standard was included in the study because it also addresses dependable systems and provides a set of metrics including boundary values and application scenarios comparable to the ASIL levels defined in the ISO/IEC 26262:2018. 4 Please also note that the extraction of boundary values did not include the boundary values as defined in the ISO/IEC 25023:2016 9 as these focus on product quality in general. As the primary goal was to analyze the state of practice as reported by the papers on automotive software development or safety-critical systems in general, the ISO/IEC 25023:2016 thresholds have not been included (see also Section 3.4.1).
FIGURE 8: Categorization of the metrics according to the artifacts addressed. Note that a metric can address more than one artifact, i.e., all numbers refer to the 112 SLR-metrics

| RQ4: How are metrics implemented and supported in software development practice?
Finally, we are interested in the applicability of the metrics in the software life cycle. In this regard and in the context of the study at hand, we define applicability as the availability of a software tool that helps collect data and compute a metric. As most of the metrics found in the systematic review are categorized as code metrics (Figure 8), we analyzed the result set for these code metrics. Specifically, we analyzed 20 free and commercial tools used in the automotive sector, investigating which of the metrics obtained from the systematic review are supported by the tools (cf. Section 3.4.2). The analysis was conducted in two steps. The first step was concerned with extracting the programming languages for which the metrics provide support, and the second step was concerned with studying the tool support for the metrics.
Figure 10 summarizes the findings of the first analysis step. Assignments were made (i) by categorizing a metric according to whether it is a code-related metric or not and (ii) by categorizing a metric according to (programming) language families. Figure 10 shows that 128 assignments¶ in total were made in the category for code-related metrics, of which 21 assignments are not bound to a specific programming or modeling language. As the figure shows, the most assignments were made for object-oriented programming languages (52 assignments), followed by the procedural programming languages (28 assignments), which are still very popular in automotive software development. Our result set also includes metrics that can be applied to modeling languages such as VHDL and Simulink. As the number of assignments shows, several metrics can be applied to multiple language families.

¶ Note that a metric could have multiple assignments. The 128 assignments are based on the initial classification from Figure 8. Yet, in the detailed analysis, further assignments have been made on the basis of available information about documented use in supported programming languages. That is, if the documentation of a metric states that, e.g., this metric is applicable to Java and C++, this generates two assignments in Figure 10.

The second step of the analysis was focused on the selected tools as such. Table 6 lists the 20 tools we analyzed and presents the number of metrics these tools already include (built-in metrics). In the context of our study, we studied which of the metrics from the systematic review are supported by the tools listed in Table 6. For this, Table 6 provides two assignments: the first assignment is between the tools and the SLR-metrics. The second assignment is between the HIS-metrics and the tools studied.
Figure 11 provides a visual representation of the tools' coverage of the SLR-metrics. The figure shows that the tools provide a variety of metrics and that over 40% of the SLR-metrics are supported by these tools. In total, we found 48 SLR-metrics supported by the selected tools, yet not a single tool supports all of these metrics. To study which metrics are supported the most, Figure 12 provides an overview of the 48 out of 112 SLR-metrics that are supported by at least one of the selected tools. The detailed information on which tool provides support for which metrics can be taken from Table 6.

FIGURE 9: Integrated perspective on the metrics defined in the HIS (top) and the ISO/IEC 26262:2018 phases (bottom; only those phases that include the metrics of interest) and their coverage of the quality attributes as defined by the ISO/IEC 25010:2011. The mapping is shown through the blue boxes, which illustrate that a metric is assigned to a quality attribute

FIGURE 10: Language families (modeling, programming, etc.) supported by the metrics obtained from the systematic review
TABLE 6: Tools analyzed and assignment of tool-supported SLR-metrics and HIS-metrics (note that the metric IDs refer to the unique metric ID for the SLR-metrics (Table D1) and, for the HIS-metrics, to the HIS-index from Table C1)

| DISCUSSION
In this section, we discuss the findings and provide a synthesis as well as a discussion of the results. Furthermore, we discuss the threats to validity of our study.

| Answering the research questions
In Section 3.2, we posed four research questions. To start the discussion of the findings of the study at hand, we wrap up the key findings and answer the research questions as follows:

RQ1: The first research question is about the metrics reported in literature for use in automotive software development. Our systematic review of 38 papers yielded 112 different metrics, of which 61 (54.5%; e.g., McCabe, LoC, Henry and Kafura, and Halstead) were categorized as code-related metrics and 43 (38.4%) were categorized as (general) product metrics. Product metrics, notably, are mainly concerned with models and architecture descriptions. Also, the analyzed papers provided formulae for 54 metrics only; another eight formulae were extracted from "gray literature," but for 50 metrics, no detailed information was made available, for example, regarding the structure of the metric or a formula used to compute the metric. Among all the found metrics, however, only a few are explicitly mentioned in the context of automotive software development (the HIS metrics). That is, the majority of the metrics reported in practice is of a generic nature, and these metrics are only interpreted for and applied to automotive software development.

FIGURE 12: Overview of the 48 tool-supported SLR-metrics and their quantified support by the analyzed tools (how many tools support a metric)

| Discussion
Our findings provide building blocks to discuss the use of metrics in automotive software development. Specifically, we are interested in developing proper views to help practitioners select subsets of metrics that provide a maximum coverage of quality attributes and that, at the same time, help in establishing a compliant measurement tool support. For this, we did not only analyze literature for the purpose of collecting metrics but also put the found metrics into context. In total, we identified 112 metrics (Table D1) in our result set. These metrics were mapped to established standards (see Appendix C1 for further details) and a number of tools used in automotive software development to collect and evaluate metrics (Section 4.5).
Our findings as summarized in Section 5.1 indicate that handling metrics with a focus on value ranges, quality attributes, and context is challenging in practice. 15,66,67 However, when performing the mapping procedure (Section 4.3), our findings show a focus on maintainability-related quality attributes. Metrics addressing quality attributes like those related to functional suitability or security (Figure 1) were not found in the studied literature. Moreover, only a few studies provided information about an exact mapping between a particular metric and the quality attributes addressed. This indicates an often observed challenge in practice: misunderstandings among clients and contractors about required qualities, acceptable value ranges (including boundary values), 66,67 what is considered good or bad, and how this perception is reflected by a metric or a metric set. 13 As many metrics in software development are focused on code (Figure 8), using tools to evaluate software using metrics is the straightforward approach. Table 6 shows that a number of tools are available, and each tool supports numerous metrics. A finding from our study is that only 48 out of 112 SLR-metrics are supported by the selected tools. Besides the standard metrics LoC and McCabe, only a few tools support the metrics found in our study, whereas many tools provide support for metrics that are not included in our metric catalog. This raises the question of whether the metrics reported in the literature we analyzed are relevant at all. However, this question has to be answered in the light of the standards applied to automotive software development (Section 2.1). The mappings performed (Section 4.3 and Appendix C1) show that the found metrics properly address the relevant standards. This potential disparity could be caused by the way metrics and measurements are implemented in practice using a tool or a tool set as facilitator. That is, we argue that when a tool is deployed to a company, those metrics from the tool's built-in catalog that can be implemented with the least possible effort will be chosen to implement the measurement processes. Other metrics, or a measurement program that puts quality attributes (from which the metrics of interest have to be derived) in the spotlight, challenge the companies, which might be a reason for the difficulties companies have in adopting more comprehensive quality models. 15,67 On the basis of our results, we further argue that metrics are used because tools provide them rather than because of implementing a sound metric catalog or quality model designed to properly address the quality attributes of interest. We argue that this is also a reason for the obvious absence of metrics specific to automotive software development, because most metrics used in automotive software development projects are selected from such standard, often unadjusted, metric catalogs. The missing explicit links between metrics and quality attributes that have become obvious in the study at hand also support this statement. The various mappings and the links between metrics and quality attributes (including different standards) constitute a first step toward such an integrated perspective yet require further research. As a major outcome of our study, we created a large structured dataset that links together
• A set of 112 metrics collected from a systematic literature review (SLR),
• Two standards/metric catalogs used in the automotive industries (ISO/IEC 26262:2018 and HIS),
• A general standard on product quality (ISO/IEC 25010:2011, including ISO/IEC 25023:2016), and
• A set of 20 selected tools used to assess and monitor software quality.
Our findings lay the foundation for creating a knowledge base that helps practitioners select appropriate metric sets to measure the quality attributes of interest. Instead of defining a new comprehensive quality model, which would arguably be the most straightforward solution, we propose using the different available information blocks and combining them in a pragmatic way. That is, we propose to use all relevant standards for automotive software development and to provide mappings and clearly defined characteristics (as, for instance, described in the Quamoco approach15). Together with detailed information about the actual tool chain in a company or a project, a recommender system can compute proper metric sets that (i) provide the required coverage of the standards and (ii) align with the tools available or show gaps that need to be filled.
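To make the proposed recommendation step concrete, the following minimal sketch computes the required metrics, their tool coverage, and the remaining gaps. All names and mapping data are hypothetical placeholders; the real mappings come from the study's dataset:

```python
# Minimal sketch of the proposed recommendation step, assuming the study's
# mappings are available as simple dictionaries (all names/data hypothetical).

# Quality attribute -> metrics that address it (from the SLR mapping).
ATTRIBUTE_TO_METRICS = {
    "Maintainability.Modularity": {"CBO", "Directed Dependency"},
    "Maintainability.Testability": {"McCabe", "RFC"},
}

# Tool -> metrics it supports (from the tool analysis).
TOOL_TO_METRICS = {
    "ToolA": {"McCabe", "LoC", "CBO"},
    "ToolB": {"RFC", "McCabe"},
}

def recommend(attributes, available_tools):
    """Return the required metrics, the subset covered by the available
    tools, and the gaps that need to be filled."""
    required = set().union(*(ATTRIBUTE_TO_METRICS[a] for a in attributes))
    supported = set().union(*(TOOL_TO_METRICS[t] for t in available_tools))
    return required, required & supported, required - supported

required, covered, gaps = recommend(["Maintainability.Testability"], ["ToolA"])
print(required, covered, gaps)
# required={'McCabe', 'RFC'}, covered={'McCabe'}, gaps={'RFC'}
```

The gap set directly corresponds to requirement (ii) above: it either points to metrics that need another tool or shows where the current tool chain falls short.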
Figure 13 illustrates the resulting model from a bird's-eye perspective. The figure illustrates how a respectively designed future knowledge system can be asked, for example, for quality attributes of interest and returns proper metric sets (including alternative metrics) and can also recommend tools that can handle the proposed metric sets. The bidirectional arrows mean that there is a relationship between the records about properties and IDs in the datasets. The unidirectional arrows describe a possible mapping into the dataset. All records can be linked together: for instance, by mapping HIS-metrics to the SLR-metrics, the ISO/IEC 25010:2011 and the supporting tools can be used to select those tools that support HIS-metrics while fulfilling the quality criteria of the ISO/IEC 25023:2016. The connection can be established through the findings of our systematic literature review. The output generates a table with the corresponding information from the ISO/IEC 25023:2016, ISO/IEC 25010:2011, and ISO/IEC 26262:2018. Metrics that fulfill all three ISO standards and that are supported by tools thus provide valuable information. However, as this study provides the data basis only, the implementation and evaluation of such a knowledge system remain subject to future work, but recent research, for example, by Tsuda et al.,14 shows the relevance of flexible and more precise measurement and evaluation frameworks.
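The table generation described above can be sketched as a join over the mapping records; the toy records below are hypothetical, standing in for the linked records of the study's dataset:

```python
# Sketch of the described join: select metrics that are linked to the HIS
# catalog, to an ISO/IEC 25023:2016 measure, and to at least one tool
# (toy records; the real records come from the study's dataset).

slr_metrics = {1: "McCabe", 2: "LoC", 3: "RFC"}
his_mapping = {1, 2}        # SLR metric IDs mapped to HIS-metrics
iso25023_mapping = {1, 3}   # SLR metric IDs mapped to ISO/IEC 25023 measures
tool_support = {1: {"ToolA"}, 2: {"ToolA", "ToolB"}}

rows = [(mid, slr_metrics[mid], sorted(tool_support[mid]))
        for mid in slr_metrics
        if mid in his_mapping and mid in iso25023_mapping and mid in tool_support]
print(rows)  # [(1, 'McCabe', ['ToolA'])]
```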

| Threats to validity
We discuss the threats to validity of our study following the categorization by Wohlin et al.68 A general threat to be discussed is the publication bias, that is, the phenomenon that positive results of a study are more likely published than negative ones.26,69 As a literature study, the study at hand is affected by this particular threat, for example, by suffering from potential incompleteness of the search results and the general publication bias. Furthermore, we decided to ground our study in the WoS meta-search engine only. The analyzed articles from the literature might not include publications reporting failed experiments and the lessons learned from such experiences, and, besides the handpicked papers, our result set does not contain papers published in venues not indexed by the WoS. Whereas the first issue cannot be resolved in the context of the study at hand, the second issue (meta-search engine) can be mitigated by other researchers reimplementing the research design described in Section 3 while considering the subsequently discussed specific threats to validity.

| Internal validity
The internal validity could be threatened by personal ratings of the researchers involved in the paper selection (selection bias). To address this risk, we followed a proven procedure24-26 that, among other things, includes researcher triangulation to support dataset cleaning, study selection, study classification, and content analysis. For this, at least three researchers of the author team were not involved in the actual analysis tasks but focused on quality assurance only. The internal validity could also be affected by the limited data (only 38 papers resulting from the study selection). We mitigated this threat by applying a combined search strategy that includes manual selection, automated search, and a snowballing approach. Furthermore, information that was considered relevant to the study but was not available from the study's result set was collected by including external and "gray" literature, for example, the ECSS standard,17 which was used to fill gaps in the dataset. Finally, the tools selected for the analysis of the applicability of the metrics were chosen opportunistically, influenced by the industrial context of the practitioners in the team. To mitigate this threat, the academic researchers initially provided a tool selection, which was then evaluated and completed by the practitioners.

| External validity
The external validity is threatened by the missing knowledge about the generalizability of the results. Notably, the scope of the research could limit the generalizability, as the field of application does not provide further information about the practical use of metrics. To mitigate this threat, we also analyzed papers at the borders of the field of interest, such as avionics, which is also a safety-critical industry sector. Furthermore, we used the ECSS standard17 as an external source of evidence complementing the general ISO norms and thresholds. However, the findings of the study at hand still need further independently conducted studies for cross-checking, as our findings rely purely on literature, which suffers from the selection bias.26,69 Also, we only included peer-reviewed papers. That is, gray literature was excluded from the result set, as were PhD theses, white papers, and any other literature not fulfilling the quality assessment criteria outlined in Section 3.3.2. Gray literature was only included if gaps in the result set had to be closed, for example, by adding definitions or formulae.

| Conclusion validity
The conclusion validity might be impacted by the data included in the study. Specifically, the conclusions drawn from the systematic review could be too positive (publication bias) and grounded in an incomplete dataset (selection bias). To improve the conclusion validity, we included external standards, even from outside the actual field of interest, in our study. Furthermore, conclusions drawn from the systematic review have been double-checked by researchers and practitioners not involved in the initial analysis.

| CONCLUSION
In this paper, we report findings from a systematic literature review in which we analyzed 38 papers for metrics used in automotive software development. Our study yielded 112 metrics of which 48 are supported by tools commonly used in automotive software development. A mapping of these metrics to the standards ISO/IEC 26262:2018 and ISO/IEC 25010:2011 and to the German HIS metric catalog showed good coverage.
The metrics defined in the HIS catalog and the SLR-metrics put emphasis on the Maintainability category of the ISO/IEC 25010:2011. Our analysis was focused on the quality attributes defined by the ISO/IEC 25010:2011; notably, we focused on the 31 subcategories to provide a proper categorization of the metrics. We developed a catalog of 112 metrics including a harmonized description/definition for these metrics, an indication of tool support, and, where possible, formulae to illustrate how a metric is computed.
The findings from our study provide an extensive data basis to be used for the development of a knowledge-based support tool that will help practitioners select proper metrics in response to the quality attributes defined for a software product under consideration. Users of such a tool will be able to characterize a software product through the desired qualities. In response, users will be provided with a set of metrics properly addressing the quality attributes and a number of tools that support the measurement. Furthermore, users will be provided with "alternative" metrics; that is, if a specific metric is only supported by one tool, but another tool providing a better coverage of other metrics offers one or more substitutes, users can fine-tune their metric sets based on a tool configuration. However, to implement such a recommendation system, in a first step, further studies are required to fill those gaps that we identified in the study at hand. That is, our study revealed that several quality attributes are not (yet) well covered by metrics found in our result set. Further studies with the purpose of identifying appropriate metrics are thus required. Moreover, the metric catalog developed in the study at hand needs to be checked with practitioners not involved in this study, notably regarding the practical relevance of a metric. The second step will include the development and the evaluation of a recommendation system as outlined above.

APPENDIX A: DATA STRUCTURES FOR DATA COLLECTION AND EXTRACTION
This appendix shows the data structures used for extracting the information in the systematic review as described in Section 3.3. The data structures were used for the first two phases shown in Figure 2, that is, the systematic mapping study for scoping the research as well as the actual systematic literature review for collecting, structuring, and analyzing the data. Table A1 shows all metadata that was collected for every paper included in the analysis. Besides collecting the metadata of the respective papers, we applied three standard classification schemas to characterize the dataset. Although we fully applied the rigor-relevance model as defined by Ivarsson and Gorschek,32 the two classification schemas RTF and CTF were tailored for application in the study. Table A2 lists the categories used for the application of the RTF schema, and Table A3 shows the categories for the application of the CTF schema.
TABLE A1 Metadata collected for each paper (excerpt)

Field: Description
Database: The database that generated the item in the paper list (source for manually selected papers or WoS)
General: Paper exclusion; voting model for the evaluation of the exclusion criteria (Table 2)
General: Paper inclusion; voting model for the evaluation of the inclusion criteria (Table 2)
RTF: Evaluation of a paper according to the research type facets according to Wieringa et al.30
CTF: Evaluation of a paper according to the contribution type facets according to Shaw31
Rigor/relevance: Evaluation of a paper according to the rigor/relevance model according to Ivarsson and Gorschek32

TABLE A2 Applied research type facets as proposed by Wieringa et al.30

Criteria: Description
Evaluation research: Implemented in practice, evaluation of the implementation conducted; requires more than just one demonstrating case study
Solution proposal: A solution for a problem is proposed; benefits/application are demonstrated by example, experiments, or student labs; also includes proposals complemented by one demonstrating case study for which no long-term evaluation/dissemination plan is obvious
Philosophical paper: New way of thinking, structuring a field in the form of a taxonomy or a framework; secondary studies like SLRs25 or SMSs24
Opinion paper: Personal opinion, not grounded in related work and research methodology
Experience paper: Personal experience; how things are done in practice

TABLE A3 Applied contribution type facets, adopted from Shaw31

For the data extraction, the data structure shown in Table A4 was used. As outlined in Section 3.3.2, the data structure was initially developed up front, but we evolved it in the course of conducting the study. For instance, the field Tool Support was added when we agreed in the team that tools should be included in the study to improve the understanding of practical relevance (data field Relevance Practice).
Table A4 also includes explanations for value ranges, classification categories, and grades.
Of special importance was the evaluation of the relevance of specific metrics in the context of the different standards and the relevance to practice. The different mappings and relevance ratings, as agreed in the study team, are in detail:

ISO/IEC 25010. This mapping describes which of the 31 subcategories of the ISO/IEC 25010:2011 (Figure 1) are covered by a specific metric. Each metric was analyzed regarding the elements needed for its calculation. Additionally, in workshops with the industrial partner, we discussed which quality criteria are met by a specific metric and which further quality criteria a metric might also fulfill. The assignment to the subcategories was made on the basis of the classification from the literature, practical experience from past projects, and current best practices. We also discussed whether an evaluation of the fulfillment of the top-level categories was necessary. However, we decided to remain on the level of subcategories as, if necessary, the fulfillment of the top-level categories can be computed from the respective degree of fulfillment of the subcategories.
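As an illustration of that last point, a top-level score can be derived mechanically from subcategory scores. Averaging is our illustrative choice here, not something prescribed by the standard or the study:

```python
# Illustrative aggregation of subcategory fulfillment to a top-level
# category; averaging is one possible choice, not prescribed by the standard.

subcategory_scores = {            # degree of fulfillment per subcategory
    "Maintainability.Modularity": 0.8,
    "Maintainability.Reusability": 0.5,
    "Maintainability.Testability": 0.9,
}

def top_level_score(category, scores):
    """Average the fulfillment of all subcategories of a top-level category."""
    values = [v for k, v in scores.items() if k.startswith(category + ".")]
    return sum(values) / len(values) if values else None

print(top_level_score("Maintainability", subcategory_scores))  # ~0.73
```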
Relevance 25010. This rating describes to what extent a metric helps fulfill a subcategory. For this, the descriptions of the individual subcategories were analyzed in terms of their satisfiability. This satisfiability was discussed and graded in workshops with our industry partner and using literature. Notably, for completed and ongoing projects, we discussed to which extent a specific metric can help fulfill a subcategory. On the basis of all these aspects, a metric was "graded" on the scale shown in Table A4.
Relevance Practice. This rating describes how widespread and therefore practically relevant a metric is. For this, in the workshops with the industry partners, we interviewed the industry partners and discussed their experience. Notably, we discussed which metrics were already known and even in use in the practitioners' day-to-day business. To get more input, it was agreed that tools be added to the study. Hence, a search for tools that support data collection and data analysis to provide metrics for software development projects was conducted (Section 4.5). The results of this extra search were used in addition to the main analyses, that is, the number of tool-supported SLR-metrics, to assess the (perceived) practical relevance.

APPENDIX B: RESULT SETS FROM THE SYSTEMATIC REVIEW
This appendix contains the list of papers obtained in the different stages of the paper selection process as outlined in Section 3.3, that is, the papers obtained in the first two phases shown in Figure 2. Table B1 lists the 12 reference papers selected during the manual search process, and Table B2 lists the 26 finally selected papers obtained from the automated search.

APPENDIX C: THE HIS METRIC CATALOG, ISO/IEC 25010:2011, 25023:2016, AND 26262:2018 METRICS AND MAPPINGS

In this appendix, we present the detailed mappings performed following the procedures described in Section 3.4 and in Appendix A1. At first, Table C1 and the following tables provide the detailed mappings performed in the course of studying RQ2 (Section 4.3).
The next public standard of relevance in the studied domain is the ISO/IEC 26262:2018. For the different development phases defined for road vehicles, the ISO/IEC 26262:2018 provides 85 metrics. Table C3 provides a summary of these metrics and links them to the ones found in the systematic review (including references to the papers that mention these metrics). Table C3 provides the input for the mapping of metrics to quality attributes as presented in Section 4.
As a baseline to analyze the different quality attributes, we opted for the ISO/IEC 25010:2011 (see also Sections 2.1 and 3.4). For this standard, the ISO/IEC 25023:2016 already provides a mapping of selected metrics to the quality attributes and also to the subcharacteristics. Figure C1 provides an overview of this mapping with the quality attributes and subcharacteristics in the columns and the assigned metrics in the rows. Table C4 provides the mapping of the metrics defined in the ISO/IEC 25023:2016 to the metrics found in the systematic literature review. Finally, Figure C2 provides the mapping between the ISO/IEC 25010:2011 and the ISO/IEC 26262:2018.
TABLE C1 Mapping of the metrics extracted from the systematic review to the metrics provided by the HIS metric catalog (the column "MetricID" refers to the unique ID of a metric used in Table D1)

TABLE C3 Identified metrics from the ISO/IEC 26262:2018 including a mapping to the metrics found in the systematic review (cf. Table D1) and the papers that refer to these metrics (the degree of recommendation to use corresponding methods depends on the ASIL: ++, method is highly recommended for the identified ASIL; +, method is recommended for the identified ASIL; o, method has no recommendation for or against usage for the identified ASIL; the ASIL integrity requirements are categorized into the categories A = lowest to D = highest)

APPENDIX D: METRICS FROM THE SYSTEMATIC REVIEW
This appendix provides a compact overview of the metrics found in the systematic review. Table D1 provides this summary of the full dataset, which includes the metric's name, a short description, references to papers mentioning the metric, value ranges (thresholds, where available; see Table 5), equations, and the mapping to the ISO/IEC 25010:2011 quality attributes. The table also introduces the Metric ID, which is a unique identifier throughout the whole study; that is, whenever a metric is referred to through its ID, this identifier refers to the unique metric ID presented in Table D1 (and also in the study's data).

FALSIFYALGO Algorithm
Description: This metric measures weighted STL semantics that associate a weight with each predicate to normalize the numerical difference and improve the expressiveness. Formula: the metric is computed using a complex algorithm that can be obtained from Reference 84.

60 Fault Coverage
Description: This metric measures the number of mutations of an FSM implementation that are incompatible with the FSM specification. Formula: the metric is computed using a complex algorithm that can be obtained from Reference 34.

FINDPARAM Algorithm
Description: This metric is an indicator for the falsification problem: instead of minimizing the satisfaction function in an attempt to make it negative, we can try to maximize it in an attempt to make it positive. Formula: the metric is computed using a complex algorithm that can be obtained from Reference 84.

62 FixThrMetric
Description: This metric is an indicator for additional performance with respect to the design requirements in data-intensive applications. Formula: FixThrMetric = power/throughput ≈ energy/operation.80

63 Frequency Execution (Freq)
Description: This metric measures the frequency of execution of the nodes.

Interface Appearance Customizability
Description: This metric is an indicator for user behavior. Formula: X = A/B, where A denotes the number of interfaces and B the number of interface elements. For 0 ≤ X ≤ 1: the closer X is to 1.0, the better the result.

Project Health Index (PHI)
Description: This metric is an indicator for the relative importance of the various project management factors. Formula: the metric is computed using an experience-based equation that can be obtained from Reference 90.

96 Requirement Mining Algorithm (STL, PSTL)
Description: The algorithm determines requirements from closed-loop models with the help of a requirements template expressed in parametric signal temporal logic: a logical formula in which concrete signal or time values are replaced with parameters. Formula: the metric is computed using a complex algorithm that can be obtained from Reference 84.

Response for a Class (RFC)
Description: This metric counts the number of all possible methods to be executed. It evaluates all possible direct and indirect method calls that can be reached via associations. Formula: the RFC of a class is defined as the sum of the number of methods in the class and the number of external methods directly called by those methods. Threshold: Table 5.35,42

99 Self-explanatory Error Messages
Description: This metric is an indicator based on observed user behavior. Formula: X = A/B, where A denotes the number of error conditions and B the number of error conditions tested. For 0 ≤ X ≤ 1: the closer X is to 1.0, the better the result.

Understandable Input and Output
Description: This metric is an indicator for the number of input and output items understood by a user. Formula: X = A/B, where A denotes the number of input and output data items and B the number of input and output items available from the interface. For 0 ≤ X ≤ 1: the closer X is to 1.0, the better the result.

Usability Compliance
Description: This metric is an indicator for the specified required compliance items based on standards, conventions, style guides, or regulations relating to usability. Formula: X = 1 − A/B, where A denotes the number of usability compliance items that have not been implemented during testing and B the total number of usability compliance items specified. For 0 ≤ X ≤ 1: the closer X is to 1.0, the better the result.

Weighted Methods per Class (WMC)
Description: This metric corresponds to the sum of the complexities of all methods of a class. Formula: WMC = Σ_{i=1}^{n} c_i. Threshold: Table 5.35

112 Worst Case Execution Time Analysis (WCET)
Description: This metric is an estimation that determines the actual worst case based upon the facts derived in the earlier phases.
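As an illustration of how the two class-level metrics WMC and RFC above are typically computed, here is a minimal sketch that assumes per-method cyclomatic complexities and call targets have already been extracted (all names and values are hypothetical):

```python
# Minimal sketch: computing WMC and RFC for a class, assuming per-method
# complexities and call targets were already extracted (names hypothetical).

def wmc(method_complexities):
    """Weighted Methods per Class: WMC = sum of complexities c_i of all methods."""
    return sum(method_complexities)

def rfc(own_methods, external_calls_per_method):
    """Response for a Class: number of own methods plus the set of external
    methods directly called by those methods."""
    called = set()
    for method in own_methods:
        called.update(external_calls_per_method.get(method, ()))
    return len(own_methods) + len(called)

# Example: a class with three methods and their cyclomatic complexities
print(wmc([3, 1, 7]))                           # -> 11
print(rfc(["init", "run", "stop"],
          {"run": {"Bus.send", "Log.write"},
           "stop": {"Log.write"}}))             # -> 3 own + 2 external = 5
```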

FIGURE 5 Publication frequency of the papers in the result set

FIGURE 6 Categorization of the papers according to the research type facet and the contribution type facet

1. Process metrics. This category includes metrics to quantify properties of the software development process.

Finding 3: Only 17 out of 112 analyzed SLR-metrics define boundary values. Of these 17 boundary values, only 11 could be found in the result set of the systematic review, and another six were added by the external ECSS standard. Where available, boundary values are also defined for different application scenarios; that is, boundary values are specific to particular contexts and project setups or even to different ASIL levels. This fact makes it difficult to define uniform and comparable boundary values for a domain.

4.5 | RQ4: How are metrics implemented and supported in software development practice?

TABLE 5 Thresholds of metrics obtained from the systematic review and extended by the ECSS standard; note that the minimum and maximum values found only indicate boundary values, which are, however, specific to the actual context and, thus, do not allow for absolute statements. Note: X = maximum number of paths (natural number); T = maximum nesting depth; D = mean time taken to learn to use a function correctly (cannot represent the maximum); MetricID 26: >0, copied from the text (explicit specification).

Some metrics can be assigned to multiple categories. For instance, the metrics Directed Dependency between Software Components (52) or Number of Statements (91) can be applied to numerous programming languages, which could also be a reason for the lacking boundary values for these metrics, as stated in Section 4.4.
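A measurement suite that respects such context-specific boundary values would key its thresholds by metric and context rather than by metric alone; a minimal sketch with hypothetical values:

```python
# Sketch: thresholds keyed by (metric, context) instead of by metric alone,
# reflecting that boundary values are context specific (values hypothetical).

THRESHOLDS = {
    ("McCabe", "ASIL-A"): (0, 20),
    ("McCabe", "ASIL-D"): (0, 10),
}

def check(metric, context, value):
    """Return True if the measured value lies within the context's bounds."""
    lo, hi = THRESHOLDS[(metric, context)]
    return lo <= value <= hi

print(check("McCabe", "ASIL-A", 15))  # True
print(check("McCabe", "ASIL-D", 15))  # False: stricter bound for ASIL-D
```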

TABLE A4 Overview of the data extracted for each metric found in the dataset

Field: Description
PaperID: Paper ID (field No. from Table A1) to identify the paper that contains a specific metric
MetricID: Unique ID of a metric (used to harmonize the metric sets)
Name (Synonym): Name of a metric. In case of multiple occurrences or different naming, synonyms are also collected
Description (Paper): The actual description of a metric as provided by the respective paper
Description (TUC): After the harmonization, this field contains a harmonized description of a specific metric, which also includes the particularities of the different sources. This harmonization is done by the researchers from the Clausthal University of Technology (TUC) and aims at providing one agreed description of a metric
Formula Available: An indicator showing whether a formula was provided by the source that introduced a metric or whether the formula was obtained from other sources in case the paper from the result set does not provide a formula
Formula: The formula used to compute the metric (if available)
Metric Type: The type of the metric, e.g., code metric or process metric
Artifact: The artifact that is addressed by the metric, e.g., code or models
Language: In case the metric is a code metric, this field lists the languages for which the metric is available
Assignment: Indicates whether the assignment of a specific metric to one or more of the 31 subcategories was made by the authors of a paper (value: 1) or in the course of this study (value: 0)
ISO/IEC 25010: This group of Boolean fields assigns a metric to the eight top-level and, if possible, to the 31 subcategories of the ISO/IEC 25010:2011 product quality model
Relevance 25010: Scores how the metric fulfills the selected criteria from the ISO/IEC 25010:2011 on a scale from 1 to 5: (1) the metric fulfills the criterion completely; (2) the metric fulfills the criterion partially (mostly if a criterion is assigned to multiple top-level and subcategories); (3) the metric fulfills the criterion conditionally (e.g., if a measure provides results that require expert knowledge for interpretation); (4) the metric does not really fulfill the criterion (e.g., if a measure provides results that require further results provided by other metrics to come up with an interpretation); (5) the metric does not fulfill the criterion (in this case, the assignment has to be re-checked)
Relevance Practice: Evaluates the (practical) use reported for a metric on a scale from 1 to 5: (1) the metric is used in real-world systems, i.e., it is established and considered a standard metric; (2) the metric is reported to be used in industry projects (without further context information); ...

TABLE B1 Overview of the 12 reference papers selected in the manual search process (reference, paper title)

TABLE B2 Overview of the 26 selected papers from the automated search process (reference, paper title)

9 Audits on the relation between the ASIL of hazards and the derived safety goals
Description: This metric is an indicator of whether the associated ASIL of the functional safety requirements complies with the safety goals addressed.88

10 Belady's bandwidth (nesting levels)
Description: This metric indicates the average level of nesting, or the width of the control flow graph representation of the program. Formula: BW = (1/n) Σ_i i · L_i, where L_i denotes the number of statements at nesting level i and n the total number of statements. ISO/IEC 25010 Category: Reusability. Threshold: Table 5.

Event-Chain Duration (ECDuration)
Description: The metric quantifies the time span between a stimulus and a response event of an event chain. Thus, the reaction time of critical processing paths in the system, e.g., across multiple REs of different tasks, can be evaluated. ISO/IEC 25010 Category: Time Behavior.86

57 Execution Time (TimeEx)
Description: This metric measures the execution time of a given macro node M having N child nodes. Formula: TimeEx(M) = Σ_{i=1}^{N} Freq(Child_i) × TimeEx(Child_i).

Process-Family-Points (PFP)
Description: This metric is an indicator for software system families to support a structured reuse of components and a high degree of automation based on a common infrastructure. ISO/IEC 25010 Category: Reusability.85
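The TimeEx formula above is a weighted recursion over the node hierarchy; a minimal sketch, assuming leaf nodes carry measured execution times (structure and values hypothetical):

```python
# Sketch of TimeEx(M) = sum(Freq(child_i) * TimeEx(child_i)) over a macro
# node's children; leaves carry measured times (structure hypothetical).

def time_ex(node):
    """node is either a measured leaf time (a number) or a list of
    (frequency, child) pairs describing a macro node."""
    if isinstance(node, (int, float)):
        return float(node)
    return sum(freq * time_ex(child) for freq, child in node)

# 3 executions of a 2.0s leaf plus 1 execution of a sub-macro (2 x 1.5s)
macro = [(3, 2.0), (1, [(2, 1.5)])]
print(time_ex(macro))  # 3*2.0 + 1*(2*1.5) = 9.0
```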

97 Requirements met by Scenarios
Description: This metric is an indicator of how well the work complies with the standard requirements on the process.88

Testability (average input interface size)
Description: This metric is an indicator of how the testability of an individual software component can be determined at the architecture level based on the number of interfaces. Formula: average input interface size of the component = n_i/N.80

Throughput
Description: This metric is an indicator for design purposes; it analyzes the system at the macro-node level by considering its throughput, depending on the number of operations to be completed and the delay, under the general assumption of limited resources. Formula: throughput = operations/second.

Total Area
Description: This metric is an indicator for the space of a target architecture on a silicon area. Formula: Total Area = A_CPU + A_mem,OS + A_mem(data,sw) + Σ_{i=1}^{n_I/O} (A_datapath + A_mem + A_control).

Traced Components per Requirement (CR)
Description: This metric measures how many components are traced to a requirement. Formula: C_R = ∀R_i: Σ_{j=1}^{#comp} IsLinked(C_j, R_i).

Traced Requirements per Component (RC)
Description: This metric measures how many requirements are traced to a component. Formula: R_C = ∀C_i: Σ_{j=1}^{#req} IsLinked(C_i, R_j). ISO/IEC 25010 Category: Maturity.71
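The two traceability counts CR and RC above reduce to counting trace links per requirement and per component; a minimal sketch with hypothetical trace links:

```python
# Minimal sketch of the two traceability counts above, assuming trace links
# are given as (component, requirement) pairs (names are hypothetical).

from collections import Counter

links = [("ComA", "Req1"), ("ComA", "Req2"), ("ComB", "Req1")]

# C_R: for each requirement R_i, count the components linked to it.
components_per_requirement = Counter(req for _, req in links)

# R_C: for each component C_i, count the requirements linked to it.
requirements_per_component = Counter(comp for comp, _ in links)

print(components_per_requirement)  # Counter({'Req1': 2, 'Req2': 1})
print(requirements_per_component)  # Counter({'ComA': 2, 'ComB': 1})
```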

TABLE 6 Overview of the analyzed tools. Columns: Tool, Ref, Metrics (ID, Built-in, From SLR (Table D1), HIS-metrics supported (Table C1)). Note 1: number of the supported metrics from the systematic review in relation to the total number of metrics provided by the analyzed tools.

Finding 4: Our study shows that there are metrics for which language-specific interpretations exist, i.e., that are available for different programming languages. However, only 48 out of 112 SLR-metrics are supported by the tools analyzed in this study. Even tools that provide about 100 or more built-in metrics support a maximum of 14 of the SLR-metrics. The most supported metrics are Lines of Code (20 tools) and McCabe (15 tools). Other metrics are supported by few tools only, often by only one (specialized) tool.
Table D1 provides the full catalog of metrics resulting from our systematic review.

RQ2: An assignment of the metrics obtained from the systematic review shows that most of the metrics address the category Maintainability of the ISO/IEC 25010:2011 standard (Figure 1). Our findings show that all subcategories as defined by the ISO/IEC 25010:2011 are covered by the SLR-metrics. Furthermore, the SLR-metrics provide an almost full coverage (only one metric was not present in the result set) of the HIS metric catalog, which is used in the German automotive software development business. Finally, our findings reveal that certain focal points are set by the SLR-metrics. For instance, whereas the ISO/IEC 25010:2011 category Maintainability is well covered, other categories are not well represented.
RQ3: Although boundary values have been found, they are mostly defined as specific to a particular project setup; that is, a set of general boundary values was not identified. Detailed information is provided in Table 5.

RQ4: Our study shows a number of metrics that exist for different modeling and programming languages. However, for only 48 out of the 112 SLR-metrics, tools that support these metrics have been found. Even those tools that support a variety of metrics support only 14 of the SLR-metrics at maximum. Whereas there are "standard" metrics like LoC that are supported by (almost) all tools, other metrics are, if at all, supported by only one tool. Details are presented in Section 4.5.


TABLE C2 Mapping of the metrics found in the systematic review to ISO/IEC 25010:2011 quality attributes; the table also shows whether a metric was assigned to a category by the authors of the respective paper ("In paper") or manually in the course of this study ("Manual"). Columns: ISO/IEC 25010:2011, In paper, Manual, MetricID (see Table D1). Example row: Functional appropriateness; In paper: -; Manual: 40,70,74,80,88,90; MetricIDs: 6, 7, 8, 9, 16, 31, 64, 65, 66, 67, 72, 73, 83, 84, 86, 87, 88, 89, 90, 94, 97, 106.

TABLE C4 Metrics from the ISO/IEC 25023:2016 including a mapping to the metrics found in the systematic review (cf. Table D1) and the papers that refer to these metrics

Function Points (FP)
Description: Function points are used in software development as a basis for cost estimation, benchmarking, and generally for the derivation of productivity and quality metrics. A function-point rating is independent of the underlying technology of the application.

Description: This metric is an indicator of whether all results from the impact analysis are incorporated in the safety plan and its activities.

Description: This metric is an indicator for the current situation in a company to predict required reactive measurement systems to support changing metric programs.

Description: The metric quantifies the amount of data in bits per time unit that is exchanged between the cores. It is an indicator for the expected cross-core communication overhead. ISO/IEC 25010 Category: Time Behavior.86

Threshold: Table 5. ISO/IEC 25010 Category: User interface aesthetics.44

Description: This metric is an extension of McCabe and Halstead by Cobbs; it measures graph-oriented length- and width-type measures, from which one could formulate a meaning of "area" as a function of the length- and width-type measures. ISO/IEC 25010 Category: Reusability.39

Threshold: Table 5.84 ISO/IEC 25010 Category: User error protection.

Description: This metric is an indicator for the temporal behavior of reactive systems, originally input-output systems with Boolean and discrete-time signals. Formula: the metric is computed using a complex algorithm that can be obtained from Reference 84.