“Through the looking‐glass …” An empirical study on blob infrastructure blueprints in the Topology and Orchestration Specification for Cloud Applications

Infrastructure‐as‐code (IaC) helps keep up with the demand for fast, reliable, high‐quality services by provisioning and managing infrastructures through configuration files. Those files ensure efficient and repeatable routines for system provisioning, but they might be affected by code smells that degrade quality and hamper code maintenance. Research has broadly studied code smells for traditional source code development; however, no study has explored them in the "Topology and Orchestration Specification for Cloud Applications" (TOSCA), the technology‐agnostic OASIS standard for IaC. In this paper, we investigate a prominent traditional implementation code smell potentially applicable to TOSCA: Large Class, or "Blob Blueprint" in IaC terms. We compare metrics‐based and unsupervised learning‐based detectors on a large dataset of manually validated observations related to Blob Blueprints. We provide insights on code metrics that corroborate previous findings and empirically show that metrics‐based detectors perform well in detecting Blob Blueprints. We believe our results put forward a new research path toward dealing with this problem, for example, in the scope of fully automated service pipelines.


| INTRODUCTION
Since the rise of DevOps, 1 the shift from on-premise infrastructure to cloud-based infrastructure has introduced new challenges and opportunities, driven by the continuously increasing demand for fast and high-quality software services and their orchestration. DevOps covers a broad spectrum of organizational and technical practices to achieve this integration, including continuous integration (CI) and continuous deployment (CD). In particular, service orchestration is strongly influenced by infrastructure-as-code (IaC), which rapidly became a crucial practice to automate infrastructure and ensure consistent and repeatable routines for service provisioning and configuration changes. 2 On the one hand, IaC substantially benefits the maintainability and quality of the overall service properties and service-level agreements, for example, through faster closing-the-loop iterations and, therefore, faster service evolution cycles. At the same time, it reduces the time, effort, and specialized skills required to provision and scale infrastructure services while improving consistency by reducing ad hoc configuration changes and updates (a phenomenon known as configuration drift*). Nevertheless, on the other hand, IaC is still code and, therefore, subject to all potential problems of code (e.g., those due to bad coding practices or human error). Similarly to software written in traditional application languages, IaC artifacts, often called blueprints, bear the same coding horror† fatalities, which can be even more impactful for IaC, where bad coding practices contribute to introducing IaC defects 3 ; their impact can be massive because it manifests at runtime, often in the form of expensive infrastructure costs and connected shortfalls. In 2017, for example, a services infrastructure failure within Amazon Web Services took down websites such as Expedia.com, Slack.com, Medium.com, and the US Securities and Exchange Commission for several hours.‡ In this study, we focus on bad coding practices, which reveal another harm in software development: symptoms indicating wrong style usage or a lousy design, known as code smells. Smells do not directly cause system failures but violate best practices and design principles, negatively affecting readability and code maintainability. 4

| Motivation
Code smells are broadly researched for traditional source code development, 5 as their detection can enhance software quality. In the scope of IaC, Guerriero et al. 6 investigated the state of practice in adopting IaC using data from 44 semi-structured interviews with senior developers.
They observed that large IaC scripts, referred to as Blob Blueprint, occur among the most common bad practices in the industry when developing infrastructure code.
Unfortunately, only a few works exist on IaC code smells, 7 and they focus on specific technologies and languages, such as Puppet or Ansible. However, Guerriero et al. 6 identify as one of the best practices "recombining diverse formats by abstraction using the OASIS TOSCA standard for IaC and including multiple formats inside node-type definitions". Indeed, technology-agnostic infrastructure code, such as the OASIS Topology and Orchestration Specification for Cloud Applications § (TOSCA), can build upon existing configuration and orchestration languages to improve the readability and portability of configuration files across platforms. 8,9 It enables automated deployment of technology-independent and multicloud-compliant applications, managing applications, resources, and services regardless of the underlying cloud platform or infrastructure. From a business viewpoint, TOSCA "expands customer choice, reduces cost, and increases business agility across the application life cycle. The synergy between these benefits accelerates overall time-to-value". 9 What is more, the fourth and sixth authors of the paper are active members of the OASIS TOSCA Standard Technical Committee and used this presence to enact discussions about this work in some of the committee's meetings. As a result, practitioners shared concerns about "long, blob-like blueprints." In particular, they mentioned that the most critical hazards connected to infrastructure abuse come from lousy coding practices concerning the security of IaC blueprints. The targets for such practices are mainly found in long, blob-like blueprints. Such issues reportedly can yield irreparable infrastructure damage, loss or even theft of data, and leaking of industrial secrets, to the point at which manual inspection of long blueprints is required.¶

| Research questions
In this study, we conjecture that identifying code smells in technology-agnostic infrastructure code (i.e., TOSCA) and related metrics opens up opportunities for building a general, automated service continuity quality model designed explicitly for configuration orchestration languages. Furthermore, we deem that analyzing such code smells paves the way to understanding how lousy coding practices affect service infrastructure continuity. 10 In particular, the goal is to identify structural code measures that characterize complex blueprints and analyze the effectiveness of automated techniques in detecting them. We conducted a case study involving 749 blueprints and a prominent traditional implementation code smell, known as Large Class or Blob, that highly impacts fault- and change-proneness [11][12][13] and that could potentially be observed and easily implemented in TOSCA. The smell is one of the most frequently investigated for traditional application code. 14 In addition, Guerriero et al. 6 observed it among the most common bad practices in the industry when developing infrastructure code; they refer to it as Blob Blueprint, namely, a too-large IaC script. From here on, we use the same nomenclature. We selected this smell for its implementation ease, frequency, and potential impact on infrastructure code quality.
We build upon the studies by Sharma et al. 15 and Schwarz et al. 16 Specifically, we analyze traditional structural code metrics for IaC smell detection to corroborate their findings on a technology-agnostic language (i.e., TOSCA). Additionally, we investigate how metrics- and unsupervised learning-based techniques perform in detecting Blob Blueprints.
The motivation behind focusing on metrics- and unsupervised learning-based detectors is twofold. On the one hand, metrics-based smell detectors are the most frequently used and the easiest to implement. 17 They calculate a set of metrics, such as lines of code, coupling, and cohesion, on the original source code and detect smells when they exceed a given threshold. 17 However, determining a suitable threshold, as demanded by metrics-based smell detectors, is a nontrivial challenge. On the other hand, interest in ML-based methods, prevalently supervised ones, is growing to overcome shortcomings such as determining threshold values. However, they come with other drawbacks, such as the need for accurate (labeled) training data, which might be hard to acquire. 18 Despite this, unsupervised learning might significantly reduce the effort of collecting and identifying smelly blueprints, similarly to previous work on software defect prediction. 19,20

| Contribution
This study contributes to research with an empirical study that compares metrics- and unsupervised learning-based techniques to detect Blob Blueprints. In particular, we compare several popular clustering techniques with a detector applying the interquartile rule on a dataset of manually validated observations related to Blob Blueprints. We provide insights on code metrics that corroborate previous findings and empirically show that metrics-based detectors perform well in detecting Blob Blueprints. Finally, we provide a replication package and a comprehensive dataset of publicly available TOSCA blueprints, including source code measurements calculated on these blueprints and manually validated observations related to the Blob Blueprint smell.#

| Paper structure
Section 2 presents background on IaC, focusing on TOSCA, and reviews the existing literature on IaC smell detection. Section 3 outlines the measures for blueprint complexity that characterize Blob Blueprints. Section 4 describes the empirical study to evaluate metrics- and unsupervised learning-based techniques in detecting Blob Blueprints, and Section 5 reports the experiment results. Section 6 discusses limitations and threats to validity; Section 7 discusses the implications for researchers and practitioners and lessons learned. Section 8 concludes the paper and outlines future work.

| BACKGROUND AND RELATED WORK
This section provides a brief grounding on infrastructure-as-code (IaC) and TOSCA, as well as on previous literature on code smells in IaC.

| IaC and TOSCA: An overview
The Topology and Orchestration Specification for Cloud Applications (TOSCA) is the official OASIS industry standard for IaC. It is a YAML-based domain-specific language that allows for the automated deployment of technology-independent and multicloud-compliant applications. In other words, it can manage applications, resources, and services regardless of the underlying cloud platform, software environment, or infrastructure. 9 Furthermore, unlike other configuration management tools, such as Puppet, Chef, Docker, and Ansible, TOSCA covers the complete application life cycle rather than just deployment and configuration management. 9 Thus, it provides a higher abstraction level while incorporating the technologies above and additional ones to serve a specific need.
The creator of a cloud service captures its structure in a service topology: a graph of nodes representing the service's components and of relationships that connect and structure the nodes into the topology. Both nodes and relationships are typed and hold a set of type-specific properties.
Types define reusable entities that capture the semantics of a node or relationship (e.g., properties, attributes, requirements, capabilities, and interfaces). Templates form the cloud service's topology using these types. In particular, they define how to instantiate the respective type for use in the application. They allow defining the start values of the properties by specifying their defaults. However, they can overwrite and extend the types to adjust them for their respective application. These types are conceptually comparable with abstract classes in Java, whereas the templates are comparable to concrete classes. 8 Listing 1 (left) shows a blueprint to start Elasticsearch on a virtual machine.‖ The template topology_template (line 24) defines the service structure and incorporates a Node Template (line 28) of type tosca.nodes.indigo.Elasticsearch, defined at the top under node_types (line 6).
Other essential components besides nodes and relationships are capabilities, requirements, and interfaces. Nodes can offer specific capabilities to other nodes at runtime. For example, in Listing 1 (left), an additional node can be added to offer the capability to host the elasticsearch node, meaning that the node can function as a host for other nodes. Conversely, nodes may require specific functionality from other nodes.
For example, the elasticsearch node in Listing 1 (left) might require the additional node as its host. However, as most relevant to this work, the example focuses on interfaces for the sake of clarity and space. An interface is a set of operations accompanied by a specific implementation of how to execute each operation. For example, the elasticsearch node in Listing 1 (left) is configured by the configure interface (line 16). In addition, it specifies the script that performs this configuration (line 17) and its input parameters (lines 18-20). Although it uses an Ansible playbook** to implement this configuration, as shown in Listing 1 (right), it can use other languages or technologies, such as Python, Chef, or Puppet.

| Code Smells in IaC
Code smells are sub-optimal code structures that may cause undesired or harmful effects, 4 such as complex classes or duplicated code. Although many code smells have been proposed and studied for traditional programs, 4,[21][22][23] IaC-specific smells, and particularly their detection techniques, remain unripe. Sharma et al. 15 and Schwarz et al. 16 were the first to apply the concept of code smell in the context of IaC. In particular, they proposed a catalog of 13 implementation and 11 design configuration smells applicable to Puppet, and later 17 code smells peculiar to Chef. The catalog was then benchmarked against 4,621 Puppet repositories. Interestingly, design smells showed a higher average co-occurrence than implementation smells: one wrong or non-optimal design decision introduces many quality issues in the future, thereby showing that the notion of code smells also applies to IaC.
From there, the research area of code smells in IaC evolved. Schwarz et al. 16 built upon Sharma's work by selecting the ten most frequently occurring smells and converting them into detection rules for a static code analysis tool designed for the Chef language. Furthermore, they categorized the code smells by their dependence on a specific technology, which facilitates mapping these smells to other languages in future research.
Later, Rahman et al. 24 identified seven security smells for IaC found in Puppet scripts and showed that these smells can persist for an extended period. These were extracted from a qualitative analysis of Puppet scripts in open source repositories and comprise (i) granting admin privileges by default, (ii) empty passwords, (iii) hard-coded secrets, (iv) invalid IP address binding, (v) suspicious comments (such as "TODO" or "FIXME"), (vi) use of HTTP without TLS, and (vii) use of weak cryptography algorithms.
Our work can be seen as complementary to those mentioned above, as it studies the impact of unsupervised learning-based code smell detection for configuration management artifacts. Indeed, previous work extensively used metrics-based smell detection methods only. These methods take source code as the input, calculate a set of source code metrics that capture the characteristics of a set of smells, such as lines of code, coupling, and cohesion, and detect smells by applying a suitable threshold. 5,25 Nevertheless, these approaches suffer from subjective interpretations and threshold dependency. Indeed, determining a suitable threshold is a non-trivial challenge, while machine-learning techniques are promising to overcome shortcomings such as determining threshold values but need more investigation. 26,27 A typical machine-learning method starts with a mathematical model representing the smell detection problem. Existing examples and source code models are used to instantiate a concrete populated model. The method results in a set of detected smells by applying a chosen machine learning algorithm to the populated model. 5 For instance, one could train a Random Forest classifier for supervised code smell detection using different metrics for each smell and then use the resulting model on other programs to detect that smell using the corresponding metrics.
Because IaC research has not yet explored this, it is unknown to what extent these techniques can distinguish sound and smelly blueprints. In addition, the number of empirical validation studies conducted is marginal. Although recent work has expanded to other IaC languages, such as Ansible, most empirical studies on IaC quality are limited to Chef and Puppet, which hampers generalizability. Finally, to our knowledge, no publications related to configuration smells in TOSCA exist, despite the language being the IaC industry standard.

| THEORETICAL MODEL: THE BLOB BLUEPRINT
With this work, we do not intend to introduce new smells for IaC; rather, we conduct a case study revolving around one specific code smell called Blob Blueprint. In traditional application code, researchers refer to this smell as Large Class; it represents a class that typically contains too many fields and methods and relies on several external data classes, making it lowly cohesive. 4,5 In IaC, only two works relate to the Blob Blueprint, although they do not target it directly. 15,16 For example, Sharma et al. 15 and Schwarz et al. 16 presented Insufficient Modularization. This smell represents an abstraction (e.g., a resource, class, "define," or module) that is large or complex and can thus be modularized further. They instantiated it for Puppet and Chef, respectively, and provided three conditions for its detection:
1. configuration files that contain more than one class (in Chef) or resource (in Puppet); or
2. class declarations that are too large (more than 40 lines of code); or
3. class declarations that are too complex (max nesting depth more than three).
In TOSCA, node and relationship types and templates are analogous to abstract and concrete classes. 8 Therefore, their size is a leading indicator for Blob Blueprints, and we consider the cumulative number of types and templates for condition (1). As for condition (2), it is accepted that the larger a module, the more difficult it is to comprehend. Indeed, on average, the number of conditions is likely to increase proportionally to the module size.† † Hence, we consider the number of code lines as a simple measure of size. Finally, condition (3) is computed in the respective papers using the maximum nesting depth (e.g., in an if statement) for an abstraction. Unfortunately, this is impossible in TOSCA because of its declarative nature, and we had to define a different complexity measure. However, because of TOSCA's novelty and difference compared with traditional programming languages, it is unclear what can be considered a complexity measure in it.
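Conditions (1) and (2) can be operationalized with a few lines of Python. The following is a minimal sketch only, assuming PyYAML and the standard TOSCA Simple Profile section names; the section handling is deliberately simplified.

```python
# A minimal sketch (assuming PyYAML and the section names of the
# TOSCA Simple Profile) of how conditions (1) and (2) could be
# computed for a blueprint; real blueprints may need more sections.
import yaml

def size_measures(blueprint_text):
    doc = yaml.safe_load(blueprint_text) or {}
    # Condition (1): cumulative number of type and template definitions.
    type_sections = ("node_types", "relationship_types")
    n_types = sum(len(doc.get(s) or {}) for s in type_sections)
    topology = doc.get("topology_template") or {}
    n_templates = len(topology.get("node_templates") or {})
    n_templates += len(topology.get("relationship_templates") or {})
    # Condition (2): lines of code, here counted ignoring blank lines.
    loc = sum(1 for line in blueprint_text.splitlines() if line.strip())
    return n_types + n_templates, loc
```

The returned pair can then be compared against per-metric thresholds, as discussed in Section 4.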
In general, a complexity measure tries to capture the difficulty of understanding a module (i.e., a blueprint in this case). Following the definition of Large Class above, we computed the number of interfaces and properties as analogous to the number of methods and attributes. In addition, in line with previous studies on the matter, 15 we further computed the well-known lack of cohesion of methods (LCOM) 28 since a higher value of LCOM indicates decreased encapsulation and increased complexity. 29,30 In particular, LCOM measures the number of connected components in a class, where a connected component is a set of related methods and class-level variables. First, related methods, that is, those accessing the same class-level variables, are grouped. Then, LCOM equals the number of connected groups of methods. Ideally, there should be only one component in each class. Unfortunately, due to the different structure and characteristics of infrastructure code, we cannot use the same metric as-is. Thus, our model defines a connected component as a set of related types or templates (rather than methods in traditional languages) and blueprint-level properties (rather than attributes in traditional languages). Then, similarly to Sharma et al., 15 we use the following algorithm to compute LCOM in a blueprint:
1. Consider each declared element, such as node and relationship templates, as a node in a graph. Initially, the graph contains a number of disconnected components (DC) equal to the number of elements.
2. Identify the parameters of the topology template and the used variables.We refer to these elements as data members.
3. For each data member, identify the components that use it and merge the identified components into a single component.

4. Compute the lack of cohesion as LCOM = |DC|.
Listing 2 shows an example of this measure for TOSCA. The two topology nodes (Docker and marathon) each access two disjoint groups of inputs (i.e., mem_size and num_cpu for Docker, and rclone_user and rclone_password for marathon). Therefore, there are two connected components, each consisting of one node; hence, LCOM = 2.
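The four steps above can be sketched as a small union-find procedure; the function and its input format (template name mapped to the set of data members it uses) are ours, for illustration only.

```python
# A sketch of the four-step LCOM computation above, using a small
# union-find; the input format (template name -> set of data members
# it uses) is ours, for illustration only.
def lcom(templates_to_members):
    parent = {t: t for t in templates_to_members}
    def find(t):  # root of the component containing template t
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t
    # Steps 1-2: every template starts as its own disconnected
    # component; collect, per data member, the templates using it.
    member_users = {}
    for tmpl, members in templates_to_members.items():
        for m in members:
            member_users.setdefault(m, []).append(tmpl)
    # Step 3: merge all components that share a data member.
    for users in member_users.values():
        for other in users[1:]:
            parent[find(other)] = find(users[0])
    # Step 4: LCOM = |DC|, the number of remaining components.
    return len({find(t) for t in templates_to_members})
```

On the Listing 2 example, `lcom({"Docker": {"mem_size", "num_cpu"}, "marathon": {"rclone_user", "rclone_password"}})` returns 2, matching the discussion above.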
In addition to cohesion, we counted the NUMBER OF IMPORTS as a measure of efferent coupling (a.k.a. fan-out), which defines the number of components on which a particular component depends. Components with a high efferent coupling value are sensitive to changes introduced to their dependencies. Moreover, deficiencies in their dependencies naturally manifest themselves in these components.
LISTING 2 An example of how the measure LCOM can be interpreted and computed in TOSCA. The two nodes in the topology (Docker and marathon) each access two disjoint groups of inputs: mem_size and num_cpu for Docker, and rclone_user and rclone_password for marathon. Therefore, LCOM = 2.
Please note that we first relied on our knowledge of "complexity" in application code to elicit a set of relevant metrics in TOSCA. Then, we performed preliminary non-structured interviews with the OASIS TOSCA Technical Committee to understand their view on the matter and evaluate the extent to which our definition of blobs and infrastructure complexity match. We conducted several rounds of interviews, in which the experts were free to map IaC characteristics they perceived as factors of complexity to blob code smells. We repeated the process until two key complexity aspects emerged. Such factors still require dedicated attention even beyond the scope of this study:
1. Blob Blueprints reflect a lower bound for complexity. Blueprints that implement multiple modules, interface hooks, and dependencies tend to develop maintainability problems and vulnerability to infrastructure penetration or chaos. 31
2. Automated testing of Blob IaC is nigh impossible. A considerable number of negative characteristics are sparsely related to automated testing (e.g., low code understandability and low code reuse); this warrants the necessity of defining operationally multiple and fine-grained complexity measures to be used in a combined complexity function for further automation.
While this study focuses on the first of the above points, we are releasing all materials and automation borne of this study to encourage replication of our computational results and further research on the matter. Finally, we refined and implemented the considered metrics based on the committee's input.

| STUDY METHODOLOGY
In this section, we present the methodology followed throughout the study, consisting of four phases: (i) data collection, (ii) data preparation (exploratory analysis and data pre-processing), (iii) detector building, and (iv) performance evaluation and comparison. The goal is to investigate how metrics- and unsupervised learning-based detectors identify Blob Blueprints, to provide improved tooling to identify them in practice. The perspective is of researchers and practitioners. The former are interested in assessing, through in vitro experimentation, the effectiveness of metrics- and unsupervised learning-based code smell detection applied to TOSCA. The latter are interested in evaluating how unsupervised learning-based smell detection works in practice.

| Data collection
TOSCA is a novel standard. To obtain a comprehensive set of blueprints, we mined GitHub for all repositories related to the search query tosca. The search returned 636 repositories that we analyzed to collect TOSCA blueprints. First, we discarded repositories with no releases because we are interested in blueprints considered functioning. Then, we collected all the files with the extension .tosca from the last release of each project. However, TOSCA blueprints can also have a .yml extension. Therefore, we searched YAML files for the presence of the keyword tosca_definitions_version.‡ ‡ That keyword identifies the versioned set of normative TOSCA type definitions used to validate the types defined in the TOSCA Simple Profile and is mandatory.§ § Please note that we discarded blueprints used for testing or examples, that is, those containing test or example in the file path, as they are not representative of the production blueprints targeted by this study. This way, we collected 1036 blueprints from 42 repositories.
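The selection rules above can be sketched as a simple filter. The function name is ours, and the .yaml extension is our addition alongside the .yml one mentioned in the text.

```python
# A sketch of the selection rules above; the function name is ours,
# and the .yaml extension is our addition alongside the .yml one
# mentioned in the text.
def is_candidate_blueprint(path, content):
    lowered = path.lower()
    # Discard blueprints used for testing or examples.
    if "test" in lowered or "example" in lowered:
        return False
    # Files with the .tosca extension are kept directly.
    if lowered.endswith(".tosca"):
        return True
    # YAML files are kept only if they carry the mandatory keyword.
    if lowered.endswith((".yml", ".yaml")):
        return "tosca_definitions_version" in content
    return False
```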

| Data preparation
The blueprints collected in the previous section were scanned to extract the metrics defined in Section 3 to create the dataset for experiments.
To this end, we implemented an open-source tool for TOSCA available on GitHub.¶ ¶ Please note that 287 blueprints were discarded at this point because of invalid YAML files (123) or duplicates (164) based on the extracted metrics, leading to a comprehensive data set of 749 distinct blueprints.
From these, we manually annotated a statistically relevant random sample of 290 blueprints, sized to achieve a margin of error of 5% at a 95% confidence level, to create the ground truth for the exploratory analysis described below and to compare the techniques. The annotation was performed before the subsequent analyses to avoid additional bias; furthermore, the inspectors actively discussed their operations multiple times to converge on decisions and reduce subjectivity.
The first and fourth authors scanned each resource and labeled it as smelly or sound based on their experience and understanding of the blueprint's semantics. The authors have at least four years of experience in code quality and IaC research. In addition, the fourth author is an active member of the OASIS TOSCA Technical Committee ## ; as such, he participates in monthly meetings where the Committee discusses the status and challenges of the language with TOSCA practitioners.
‡ ‡ This step was performed on June 9, 2021.
§ § https://docs.oasis-open.org/tosca/TOSCA-Simple-Profile-YAML/v1.3/TOSCA-Simple-Profile-YAML-v1.3.html.
¶ ¶ https://github.com/radon-h2020/radon-tosca-metrics.
Each blueprint was subjectively analyzed considering the overall length, the number of nodes and relationships, and their scope based on type, description, properties, and interfaces; we defined these criteria based on the theoretical model in Section 3 before annotating. We also considered complexity in terms of the difficulty of understanding the operations performed by the blueprint. A web application was developed and shared among the assessors to facilitate the annotation.‖ ‖ Then, Cohen's Kappa 32 was measured to compute the degree of agreement between the assessors. Cohen's Kappa ranges between −1 and 1, with values at or below 0 indicating no agreement beyond chance and 1 indicating perfect agreement. In case of disagreement, the assessors met and discussed the blueprint to converge on a decision. Following this procedure, we obtained a ground truth consisting of 248 sound and 42 smelly instances after reaching complete agreement in the resolution phase, starting from an initial Cohen's Kappa of 0.56 (i.e., moderate agreement).
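Cohen's Kappa can be computed, for instance, with scikit-learn; the two label vectors below are purely illustrative and do not reproduce the study's annotations.

```python
# Illustrative agreement computation between two raters; the labels
# are made up and do not reproduce the study's annotations.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["smelly", "sound", "sound", "smelly", "sound", "sound"]
rater_2 = ["smelly", "sound", "smelly", "smelly", "sound", "sound"]
kappa = cohen_kappa_score(rater_1, rater_2)  # 2/3 for these vectors
```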
Below, we describe the exploratory analysis performed on the ground truth.

| Exploratory analysis
We tested each metric separately using statistical analysis before employing them to predict Blob Blueprints. For each metric, we measured whether its distribution within Blob Blueprints is statistically different from its distribution within all other blueprints. To this end, we applied the non-parametric Mann-Whitney U test 33 with a significance level α = 0.01.
To better control for the randomness of our observations, we used Bonferroni's correction 34 to adjust the significance level according to the number of comparisons (i.e., five). Thus, the results are significant at the level α = 0.002. p values below this threshold show that the two groups differ for the considered metric. While we acknowledge that a differently distributed metric does not necessarily distinguish Blob and sound blueprints, these results hint at why a machine learning approach combining these features can be successful.
Beyond the p-value interpretation, we calculated the effect size using Cliff's delta, 35 which measures the magnitude of the difference between two populations; its absolute value ranges from zero to one. According to Kampenes et al., 36 a value below 0.147 is considered trivial; between 0.147 and 0.33, small; between 0.33 and 0.474, medium; and above 0.474, large. In the following, we discuss how we preprocessed these metrics for experimentation.
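The per-metric analysis above can be sketched as follows: a Mann-Whitney U test (via scipy) at the Bonferroni-corrected level, plus Cliff's delta as effect size. The function name and its defaults are ours.

```python
# A sketch of the per-metric analysis above: Mann-Whitney U (via
# scipy) at the Bonferroni-corrected level, plus Cliff's delta as
# effect size; function name and defaults are ours.
from scipy.stats import mannwhitneyu

def compare_metric(blob_values, sound_values, n_comparisons=5, alpha=0.01):
    _, p = mannwhitneyu(blob_values, sound_values, alternative="two-sided")
    significant = p < alpha / n_comparisons  # e.g., 0.01 / 5 = 0.002
    # Cliff's delta: P(blob > sound) - P(blob < sound), computed by
    # comparing every pair across the two groups.
    pairs = len(blob_values) * len(sound_values)
    gt = sum(b > s for b in blob_values for s in sound_values)
    lt = sum(b < s for b in blob_values for s in sound_values)
    return significant, (gt - lt) / pairs
```

Note that with tiny samples even a complete separation of the two groups is not significant, which is why the effect size complements the p value.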

| Preprocessing
First, we normalized the data, a common requirement for many machine learning estimators. Typically, this is done by removing the mean and scaling to unit variance. However, outliers can often negatively influence the sample mean or variance. Therefore, we resorted to the RobustScaler available in scikit-learn for the task.*** It scales data similarly to min-max normalization but uses the interquartile range rather than the max-min range to be robust to outliers. Therefore, it follows the formula x' = (x − Q2(x)) / (Q3(x) − Q1(x)), where Q2 is the median and Q1 and Q3 are the first and third quartiles. As for feature selection, we considered the metrics defined for code smell detection. However, some metrics may correlate with others. The latter might be a problem in unsupervised learning, as the concept they represent gets more weight than other concepts. Thus, the final model might skew toward that particular concept, which might be undesirable. For that reason, we controlled for multicollinearity through the variance inflation factor (VIF), 37 discarding the features having a value larger than 10, a widely used rule of thumb. 38 In addition, we used stepwise forward selection to determine the optimal set of features to build the detectors described below. More specifically, at each step, all the metrics except those already selected are tested against the Matthews correlation coefficient (MCC). The metric that improves MCC the most is added to the set.
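The scaling and VIF-based filtering can be sketched as below; the iterative drop strategy is a common convention and ours here, and the MCC-driven forward selection is omitted since it requires labeled data.

```python
# A sketch of the preprocessing above: robust scaling followed by
# iteratively dropping the feature with the highest variance
# inflation factor (VIF) above 10; the drop strategy is ours, and
# the MCC-driven forward selection is omitted (it needs labels).
import numpy as np
from sklearn.preprocessing import RobustScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

def preprocess(X, vif_threshold=10.0):
    X_scaled = RobustScaler().fit_transform(X)  # (x - median) / IQR
    kept = list(range(X_scaled.shape[1]))
    while len(kept) > 1:
        vifs = [variance_inflation_factor(X_scaled[:, kept], i)
                for i in range(len(kept))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= vif_threshold:
            break
        kept.pop(worst)  # drop the most collinear feature
    return X_scaled[:, kept], kept
```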

| Detectors building
Afterward, we used the preprocessed data and the selected features to build the metrics- and unsupervised learning-based smell detectors described below.

| Metrics-based detectors
A metrics-based detector takes source code as the input, calculates a set of source code metrics that capture the characteristics of a given smell, and detects that smell by applying a suitable threshold. 5 In most cases, setting the threshold values is a highly empirical process, guided by similar past experiences and hints from the metrics' authors. 39 For example, as mentioned in Section 3, Sharma et al. 15 and Schwarz et al. 16 detect a similar smell, called Insufficient Modularization, if a configuration script contains more than 40 lines of code or an abstraction contains more than one class or define. Nevertheless, those thresholds do not transfer to TOSCA because of the differences between these languages. Indeed, blueprints rarely contain a single type because of their nature, and types are usually small. In addition, no previous works on Blob Blueprint detection for TOSCA exist. Therefore, no hints are available from past experience.
Statistical techniques can define a suitable threshold for each metric when no hints are available. In this case, we used the interquartile rule, as in previous works, 25 as a baseline: T(x) = Q3(x) + 1.5 × IQR(x). The formula defines a threshold spotting blueprints that represent upper outliers for a specific metric (i.e., smelly instances). It makes use of the third quartile (Q3) and the interquartile range (IQR(x) = Q3(x) − Q1(x)) extracted from the blueprints selected for this tuning. Typically, a threshold is calculated for each metric, and a rule is defined that combines them through logic operators. In this case, we used the logic OR: an instance is detected as smelly if any metric exceeds the respective threshold. Conversely, the logic AND might be impractical, as the probability of detecting smelly instances drops when the number of metrics increases. For this reason, we also relied on multivariate outliers to consider multiple metrics at once. The standard method for multivariate outlier detection uses the Mahalanobis distance, 40,41 a measure of the distance between a point p and a distribution D. More specifically, it measures the number of standard deviations the point p is from the mean of D. The distances are typically interpreted by comparing the corresponding χ2 value (with degrees of freedom equal to the number of variables) to a cut-off p value. Cases with p value < 0.001 are likely to be outliers.
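Both metrics-based detectors can be sketched in a few lines; these are illustrative implementations under our own naming, with X a (blueprints × metrics) array.

```python
# Sketches of the two metrics-based detectors above: the interquartile
# rule combined with a logic OR across metrics, and multivariate
# outlier detection via the Mahalanobis distance with a chi-squared
# cut-off at p < 0.001; X is a (blueprints x metrics) array.
import numpy as np
from scipy.stats import chi2

def iqr_detector(X):
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    thresholds = q3 + 1.5 * (q3 - q1)  # upper-outlier threshold per metric
    return (X > thresholds).any(axis=1)  # logic OR across metrics

def mahalanobis_detector(X, p_cutoff=0.001):
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distances
    return d2 > chi2.ppf(1 - p_cutoff, df=X.shape[1])
```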

| Unsupervised learning-based detectors
The unsupervised learning detectors proposed in this study revolve around four popular clustering techniques available in the Python framework scikit-learn,42 namely, KMeans, AgglomerativeClustering, MeanShift, and BIRCH. We relied on these techniques for their popularity††† and because they are common in code smell detection for traditional source code.43,44 Furthermore, we used the implementations provided by scikit-learn to ensure easy operationalization for practitioners and replication for researchers. A detailed description of these techniques is available in scikit-learn's official documentation.‡‡‡

Number of clusters. Most of the algorithms mentioned above require specifying the number of clusters in advance. Since that information is unavailable, we resorted to the Silhouette coefficient45 to validate the goodness of a clustering and select an appropriate number of clusters. It ranges between −1 and +1: a coefficient close to +1 indicates that the objects are well matched to their cluster and poorly matched to neighboring clusters. A value close to −1 indicates too many or too few clusters, whereas a coefficient close to 0 indicates overlapping clusters.
Therefore, for every clustering technique, we performed a randomized search over different hyperparameter configurations and retained the configuration that maximized the Silhouette coefficient across ten runs.
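The selection step above can be sketched as follows, using scikit-learn's KMeans as an example; the candidate range and data are illustrative, not those of the study, which searched a broader hyperparameter space.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_n_clusters(X, candidates=range(2, 10), seed=0):
    """Return the number of clusters maximizing the Silhouette coefficient."""
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)  # in [-1, +1]; higher is better
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

On data with well-separated groups, the coefficient peaks at the natural number of clusters and degrades when clusters are split or merged.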
Cluster labeling. From a technical standpoint, the clusters resulting from the previous step have unique identifiers; these identifiers are not labels that indicate a cluster's smelliness. However, Zhang et al.46 proposed a heuristic to label clusters in the context of defect prediction, which Xu et al.47 adapted to support scenarios with more than two clusters. Figure 1 depicts the heuristic that we instantiated for smell detection by changing the labels as follows: 1. Sum up the feature values of blueprints in each cluster (SFB).
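Only the first step of the heuristic is stated above; the sketch below completes it under the assumption, borrowed from the cited defect prediction work, that clusters whose blueprints have larger summed feature values are the smelly ones. The per-cluster averaging and the global-mean cutoff are our own illustrative choices, not necessarily those of the paper.

```python
import numpy as np

def label_clusters(X, labels):
    """Label each cluster smelly (True) or sound (False) by comparing the
    average per-blueprint feature sum (SFB) in the cluster against the
    overall average (assumed cutoff)."""
    row_sums = X.sum(axis=1)      # step 1: feature sum per blueprint
    cutoff = row_sums.mean()      # assumed cutoff: global average
    return {c: bool(row_sums[labels == c].mean() > cutoff)
            for c in np.unique(labels)}
```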
Table 2 shows the results of our experiments for each detector in terms of MCC, Precision, Recall, F1, and ARI (for cluster-based detectors). Similarly, Figure 3 depicts violin plots to compare the detectors' performance in terms of MCC. The average MCC ranges from 0.5 for the worst-performing detector, AgglomerativeClustering, to approximately 0.8 for the best-performing one, that is, the metric-based detector based on the Interquartile Rule. The latter is also the detector with the lowest standard deviation, alongside the other metric-based detector using the Mahalanobis distance. Thus, it yields the most stable results regardless of the metrics used: the minimum is below 0.73, the maximum is 0.85, and the standard deviation is 0.03. In addition, an unsupervised learning-based detector, namely, the one using the AgglomerativeClustering algorithm, reaches the best precision; however, its recall drops sharply. As can be observed by analyzing the F1, which summarizes precision and recall, the two metric-based detectors perform the best, with average F1 scores of 0.6 and 0.8. The best unsupervised learning-based detector, MeanShift, achieves broadly similar results. However, its performance is significantly worse than the best metric-based detector (22% worse in MCC), with a large effect size.
Table 3 shows that the differences among the detectors are, in most cases, very high and of practical significance. For example, all the detectors have 22% to 35% lower MCC than the best metric-based detector, with a large effect size. However, most detectors generally reach moderate to high MCC and F1. These results are encouraging, since false positives and false negatives are minimal. Please note that false positives and negatives might significantly impact users in the context of code smell detection. For example, false positives can have negative consequences, as developers put less trust in the tool when falsely alarmed multiple times about smells that do not exist. If this happens too often, developers could stop using the smell detector. False negatives, on the other hand, are comparably harmful. As the smell detector helps the developer during quality assurance, they might become less vigilant during code reviews as they fully trust and rely upon the detector. The overall code quality can even decrease when the detector suffers from a high rate of false negatives.
As for the unsupervised learning-based detectors, the same metric increases MCC only for MeanShift, while it decreases it for the remaining ones. Precision tends to increase or stay constant; the opposite applies to recall.
In general, looking at the figure, it is clear that all the considered metrics improve performance. Therefore, one should consider these metrics together while building detectors. Please note that we could not analyze all combinations due to time constraints. Nevertheless, we observed several metrics that recur across the optimal subsets of features, shown in Table 4. Among them, the number of interfaces, analogous to the number of methods in traditional programming, is the one that occurs the most.

Summary of RQ 3 :
The number of interfaces appears to be the leading metric to maximize the overall performance, followed by the number of types and templates, lines of code, and LCOM.

| THREATS TO VALIDITY
This section describes the threats that can affect the validity of our study.

| Construct validity
Threats to construct validity concern the relation between the theory behind the executed methodology and the resulting observations, that is, whether the observed outcome corresponds to the effect we think we are measuring.
In this work, we used source code measurements, which may not appropriately represent the characteristics intended by the researchers. We mitigated this threat in two ways. First, we looked for measures that have been empirically validated multiple times for traditional code smell detection and that could be ported to TOSCA; then, we consulted the official TOSCA documentation to identify possible measures related to blueprint complexity and size. Second, we implemented these measurements following a test-driven development approach52: the developer first creates unit tests for the intended functionality; afterward, they implement the measurement and improve it until all the initial unit tests pass.
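As an illustration of that workflow, the test below would be written before the measurement it exercises; the function name, the measurement itself (counting node templates), and the blueprint structure are all hypothetical, not taken from the paper's implementation.

```python
def count_node_templates(blueprint: dict) -> int:
    """Hypothetical measurement: number of node templates in a parsed
    TOSCA blueprint (given as a dictionary, e.g., loaded from YAML)."""
    topology = blueprint.get("topology_template", {})
    return len(topology.get("node_templates", {}))

def test_count_node_templates():
    # Written first (test-driven development); the measurement is then
    # implemented and refined until this test passes.
    blueprint = {"topology_template": {"node_templates": {"db": {}, "web": {}}}}
    assert count_node_templates(blueprint) == 2
    assert count_node_templates({}) == 0
```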
Another threat relates to the construction of the ground truth, which was done manually. As manual work such as labeling can be prone to human error, we acknowledge this as a possible threat to construct validity. We tried to mitigate it by summarizing the most prevalent definitions found in the literature for the analyzed smell. Furthermore, the authors of this study performed the validation themselves, which poses a threat to construct validity due to bias regarding the perception of which metrics or quality attributes characterize the smell; involving external experts would mitigate this bias. However, it is worth noting that the authors involved in the validation have multiple years of experience in code quality and IaC research; although the annotators were not external, they can still be considered sufficiently expert for this task. Finally, the annotators performed the validation before the subsequent analyses to mitigate additional bias; they also actively discussed their operations multiple times to reduce subjectivity.
Lastly, we collected GitHub repositories automatically based on a search string; therefore, we may have missed relevant repositories due to a conservative search string. As a consequence, the obtained dataset is considerably small. Although it includes a large subset of the publicly available TOSCA blueprints, the size of the dataset can still negatively affect the modeling performance of the clustering algorithms used.
Another threat might be that our dataset does not correctly represent the population of TOSCA blueprints, because we could select only those publicly available. For example, blueprints used in industrial contexts often cannot be shared due to company regulations, while demo blueprints can. In that situation, our dataset would not represent the actual population of TOSCA blueprints.
Lastly, we only used four clustering algorithms. However, such algorithms are among the most popular and intuitive, easing operationalization and interpretation for practitioners, although less traditional algorithms might be evaluated as well. For example, spectral clustering could be analyzed, given its ability to handle more complex situations, such as clusters with arbitrary nonlinear shapes, as it makes no assumptions about cluster shape.

| Conclusion validity
Threats to conclusion validity concern the appropriate usage of statistical tests and reliable measurement procedures, for example, to ensure the high quality of the conclusions. A possible threat might be related to our implementation of the detectors and the applied evaluation strategy. We followed the instructions given by previous authors15,16 for building metrics-based detectors. As for the unsupervised learning detectors, we used the implementations provided by the Python framework scikit-learn.42 The measures used to evaluate our detection approaches are widely used: the Silhouette coefficient for assessing clustering quality, and Precision, Recall, and MCC for evaluating the performance of binary classification tasks.
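All of these evaluation measures are available in scikit-learn; a small self-contained example on made-up predictions (1 = smelly, 0 = sound), not data from the study:

```python
from sklearn.metrics import (matthews_corrcoef, precision_score,
                             recall_score, f1_score, adjusted_rand_score)

# Made-up ground truth and detector output (1 = smelly, 0 = sound)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

mcc = matthews_corrcoef(y_true, y_pred)    # balanced measure in [-1, +1]
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision/recall
ari = adjusted_rand_score(y_true, y_pred)  # agreement measure for clusterings
```

Unlike accuracy, MCC stays informative on imbalanced data, which is why it is often preferred for smell detection, where smelly instances are the minority class.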
Furthermore, the test used in the statistical analysis to estimate the differences among detectors is a threat to conclusion validity. There are many statistical tests, whose choice depends on the data structure, data distribution, and variable type, and the results can differ accordingly. To mitigate this threat, we applied a nonparametric test, commonly used in the literature, which makes no assumptions about the data distribution. Besides, because we conducted multiple hypothesis tests at once, there is a chance that at least one of the tests produced a false positive. Therefore, we used Bonferroni's correction to adjust the significance level, controlling the probability of committing a type I error and mitigating this threat.
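A sketch of that procedure follows. The paper does not name the specific nonparametric test, so a Mann-Whitney U test is assumed here purely for illustration; `samples` maps each detector name to its list of per-experiment scores.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_comparisons(samples, alpha=0.05):
    """Pairwise nonparametric tests with a Bonferroni-adjusted significance
    level: alpha is divided by the number of comparisons performed."""
    pairs = list(combinations(samples, 2))
    adjusted_alpha = alpha / len(pairs)  # Bonferroni correction
    results = {}
    for a, b in pairs:
        _, p = mannwhitneyu(samples[a], samples[b], alternative="two-sided")
        results[(a, b)] = (p, p < adjusted_alpha)  # (p value, significant?)
    return results
```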

| DISCUSSIONS, IMPLICATIONS, AND LESSONS LEARNED
In RQ1 and RQ3, we found that traditional source code metrics, such as the number of methods (interfaces in TOSCA), classes (types and templates in TOSCA), lines of code, and lack of cohesion, are good indicators of complex blueprints when mapped to their respective concepts in TOSCA. This result corroborates, on a technology-agnostic language, the findings of Sharma et al.15 and Schwarz et al.16 Besides, RQ2 shows that practitioners should prefer metrics-based detectors over unsupervised learning-based detectors, even though the latter could help overcome shortcomings of the former, such as determining threshold values, and possibly reduce the effort of collecting and identifying smelly blueprints. We believe that, despite the performance observed in this study, unsupervised learning-based detectors can still play a role in detecting Blob Blueprints and other smells in TOSCA. However, a broader range of TOSCA blueprints and metrics may be needed to enhance them.

| Implications for researchers and practitioners
The results above pose several implications for researchers and practitioners, described below.
• Implications for researchers: There is still room for research in this area, and we argue for more empirical work on configuration smells to broaden our knowledge of complex blueprints and enhance the catalog of code smells for IaC. Our findings provide a baseline for investigating which metrics should be used to detect Blob Blueprints. However, further research is needed to understand the relationship between the smelliness of TOSCA code and the collected metrics. These results can lead to a better understanding of which features to use to improve code smell detection in TOSCA and enable the comparison of competing approaches.
• Implications for practitioners: Practitioners can build upon our findings and shared material to implement novel methods and tools based on a small set of features, such as those elicited in this paper. These tools will warn developers of complex blueprints and ultimately help reduce technical debt. For example, we already used those metrics as a proxy to predict failure-prone TOSCA blueprints within the scope of the European project RADON, which aims at pursuing a broader adoption of serverless computing technologies within the European software industry.§§§ One of RADON's key pillars is quality assurance for TOSCA. The results from analyzing blueprints in one of our partners' use cases within the project show that these metrics help distinguish blueprints that may induce technical debt.¶¶¶ According to the partners, when maintaining IaC quality, these metrics could help avoid overly complex code representations.

| Lessons learned
In addition to the implications above, we report some insights we observed while validating complex blueprints, which we hope will help future researchers identify more fine-grained complexity measures for the Blob Blueprint smell.
• Refactor nodes based on target technology. During the validation, we observed several Blob Blueprints defining too many nodes targeting different technologies. For example, a blueprint from Alien4Cloud### contains 16 nodes, a subset of which targets three different technologies: Consul, Samba, and Elasticsearch. Figure 5 shows those nodes and their dependencies. It might be advisable to refactor those types into separate blueprints to reduce the complexity, each grouping types targeting the same or similar technologies, and then import those blueprints into the one at hand.
• Refactor nodes based on types. TOSCA provides types to describe the possible building blocks for constructing a service template: for example, node types to describe kinds of nodes, relationship types to describe possible relations among those nodes, and policy types to logically group TOSCA nodes that have an implied relationship and need to be orchestrated or managed together to achieve some result. While it is possible, and might be advisable, that a blueprint define one or more components of each type, we noticed that having too many different types makes the blueprint more challenging to understand. In this case, we suggest refactoring them into separate files, one for each type or group.
§§§ https://radon-h2020.eu/. ¶¶¶ https://radon-h2020.eu/wp-content/uploads/2021/09/D6.5-Final-Assessment-Report.pdf (Section 4.3.3.4).
### https://raw.githubusercontent.com/alien4cloud/csar-public-library/d08f5ac3f3f5279ad65fdf8c025459fafac37e75/org/alien4cloud/alien4cloud/topologies/a4c_ha/type.yml
FIGURE 5 An excerpt of a Blob Blueprint with nodes targeting different technologies and a possible refactoring suggestion.
• Move workflows into separate files. Some of the analyzed blueprints had large workflows that contributed the most to increasing their size.
Practitioners use workflows to automatically deploy, manage at runtime, or undeploy TOSCA topologies. We suggest moving those workflows into different files and importing them into the current blueprint. Please note that, although we noticed that large workflows can decrease the readability of a blueprint, we did not implement a measure for their size. The reason is that, among the collected blueprints, we observed workflows only in the Alien4Cloud project. In a preliminary investigation, we observed that this metric is too noisy, weighting the prediction of smelly blueprints toward those blueprints only.
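The refactorings above all rely on TOSCA's import mechanism; the following is a hypothetical sketch, with made-up file names, of how a split blueprint could pull its technology-specific definitions back in:

```yaml
tosca_definitions_version: tosca_simple_yaml_1_3

# Technology-specific node types and large workflows moved into their own
# files (names are illustrative) and re-imported here.
imports:
  - consul_types.yaml
  - samba_types.yaml
  - elasticsearch_types.yaml
  - deployment_workflows.yaml

topology_template:
  node_templates:
    consul_server:
      type: my.nodes.ConsulServer   # hypothetical type from consul_types.yaml
```

Each imported file then stays small and focused on one technology, which directly addresses the size and type-count metrics discussed above.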

| CONCLUSION
In this work, we enhanced the current knowledge of practices in IaC and the detection of configuration smells. As indicated by Rahman et al.,7 current scientific works insufficiently address the characteristics of best practices within IaC, and only a handful of previous works investigated configuration smells. We conducted a study on TOSCA, the official OASIS standard for IaC, for which we constructed a comprehensive dataset of publicly available blueprints, deduced the characteristics of current practices, and investigated the performance of metric- and unsupervised learning-based techniques for smell detection. The implementation is made available on GitHub, accompanied by an explanation for usage and research reproduction.∥∥∥ The main findings of this work are manifold. First, we observed significant differences in code structure between smelly and sound blueprints with respect to current practices in TOSCA blueprint development. Our findings concerning configuration smells are also noteworthy. The range of configuration smells researched in previous work is relatively small because IaC is a young research area; we therefore argue for more empirical work on configuration smells to broaden the smell catalog for IaC. Finally, other researchers can build on this work and the constructed dataset by applying more sophisticated techniques and analyses to investigate Blob Blueprints further, opening opportunities for extensive studies on code smells in TOSCA.

candidate metric- and unsupervised learning-based techniques in detecting those blueprints. Motivated by this goal, this study aims at addressing the following research questions in the context of TOSCA:
RQ1 To what extent can structural code metrics distinguish between Blob and sound blueprints?
RQ2 To what extent can metric- and unsupervised learning-based techniques detect Blob Blueprints?
RQ3 What metrics are the most effective to maximize the performance of those detectors?

FIGURE 3 Matthews correlation coefficient across metric- and unsupervised learning-based detectors. The detector built using the Interquartile Rule performs statistically better than those relying on unsupervised learning. Legend: (rb) = rule-based; (ml) = machine learning-based.
TABLE 2 Average performance statistics across 100 experiments.

TABLE 4 Features that maximize MCC for each detector. (RB) = rule-based; (ML) = machine learning-based.

| Internal validity
Threats to internal validity concern the possibility that other factors, not measured during the research, could cause the outcome. A possible threat to internal validity is the selection of source code measurements: other measurements, not included here, could significantly influence the result. We mitigated this risk by selecting measurements that correlate with and can identify defective IaC scripts.53,54

| External validity
Threats to external validity relate to the generalizability of the obtained results outside the scope of the research. We observed various threats to external validity in this work.

Note: The best value among algorithms for each measure is reported in bold. (RB) = rule-based; (ML) = machine learning-based.