Kulla‐RIV: A composing model with integrity verification for efficient and reliable data processing services

This article presents the design and implementation of a reliable computing virtual container-based model with integrity verification for data processing strategies, named the reliability and integrity verification (RIV) scheme. It has been integrated into a system construction model as well as existing workflow engines (e.g., Kulla and Makeflow) for composing in-memory systems. In the RIV scheme, the reliability (R) component is in charge of providing an implicit fault tolerance mechanism for the processes of data acquisition and storage that take place in a data processing system. The integrity verification (IV) component is in charge of ensuring that data transmitted/received between two processing stages are correct and are not modified during the transmission process. To show the feasibility of using the RIV scheme, real-world applications were created by using different distributed and parallel systems to solve use cases of satellite and medical imagery processing. This evaluation revealed encouraging results, as some solutions that assume the cost (overhead) of using the RIV scheme, for example, Kulla (the Kulla-RIV solution), achieve better response times than others without the RIV scheme (e.g., Makeflow) that remain exposed to the risks caused by the lack of RIV strategies.

The result of one stage is the input data of the next one. 4,5 Each processing stage (applications or services) can be implemented, deployed, and executed by one single organization, a set of partners, or even end-users by using service-oriented or microservices architectures. 6,7 In this type of scenario, organizations not only develop their own required applications to process their data but may also need to integrate, into their systems, external applications or services produced by any of their partners, external developers or end-users, or cloud service providers. 8,9 In practice, developers are exploring methods/techniques for the composition and integration of applications/systems based on loosely coupled architectures as a solution to easily maintain cloud-based systems by using design patterns, 10,11,12 that is, identifying themes and idioms that can be codified and reused to solve specific problems in software engineering.
When organizations outsource data management and system deployment to a public cloud provider, the composition of systems should include mechanisms to establish controls over the validation of the interconnection of processing applications, the preparation of data for withstanding failures, and the verification of data integrity, to avoid processing unexpected incoming data and producing inconsistent outcomes.
The software composition based on patterns establishes schemes to distribute the workload and/or responsibilities of the applications of a system. 13,14 Popular patterns to compose systems by interconnecting applications are pipeline, 15,16 client/server, 17 master/slave, 16,18 and manager/worker. 13,14 Some data processing patterns are based on large-scale processing models, for instance, the map/reduce model. 19 Figure 1 shows examples of processing patterns.
However, these solutions 24,25 still face limitations in implementing management strategies that enable applications to exchange data with other systems. 26,27 Moreover, troubleshooting IT processes, system portability, flexibility, efficiency, reliability, and integrity are challenging tasks in real-world scenarios that these frameworks solve only partially, especially when organizations need to deploy systems on different infrastructures. For instance, when applications are migrated or installed over different types of cloud infrastructures (e.g., private, public, or hybrid), portability is not commonly granted by developers, producing in some cases the vendor lock-in problem (i.e., a scenario where organizations or users produce, store, and process large volumes of data in a single cloud provider 28 ). In this situation, issues related to failed installations, missing dependencies, or misplaced environment settings must be solved by IT staff using debugging procedures, causing either downtime or disturbance of business continuity. 29,30 An alternative to solve/prevent this kind of problem is using virtualization technologies in the composition of systems. 23,31,32 This composition approach integrates applications with their dependencies and management software (operating systems, libraries, etc.) into virtual machines (VMs) or containers (VCs) to deploy applications on heterogeneous IT infrastructures without suffering errors caused by software dependencies that existed in the environment in which those applications were tested. Avoiding/reducing troubleshooting processes after the integration of applications is not the only issue arising in the composition of applications for processing large volumes of data. Performance is another important issue to be solved, especially when the outcomes of data processing systems are needed for critical decision-making processes.
To improve performance in composite systems, composition engines mainly focus on implementing parallel programming models considering the resources available at development time. 27,33 In this context, there is still an opportunity window for improving resource profitability at execution time without affecting the feasibility and flexibility of the composite solutions/systems. When a composite system is in operation, additional challenges arise, such as ensuring the continuous processing of data even in situations of data unavailability and ensuring that data is not wrongly altered when exchanged between applications. Some issues can arise when these challenges are ignored, such as delays in the delivery of outcomes in the composite system or the generation of altered/unreliable information for decision-makers.
Presently, there is a demand for methods that exhibit sufficient flexibility in composing distributed and/or parallel systems tailored for processing substantial volumes of data. These methods should take into account the secure interconnection of processing applications, preparation schemes, and software, with the aim of regaining control over data in outsourcing scenarios. This capability empowers organizations to proactively avoid or mitigate the adverse consequences associated with vendor lock-in.
This article presents the design and implementation of a fault-tolerant and integrity scheme (the RIV scheme) that can be integrated into virtual container-centric construction solutions (e.g., Kulla) or workflow engines (e.g., Makeflow) for composing systems to process large volumes of data, which includes a transparent and secure integration of data preparation tools and scaling strategies. The RIV scheme facilitates the creation of fault-tolerant composite systems (i.e., sets of interconnected applications, services, or microservices), providing availability of the involved applications and integrity of the data exchanged between them.
The main contributions of this article are as follows: • A novel reliability and integrity verification (RIV) scheme that can be integrated into system construction models (e.g., Kulla) as well as existing workflow engines (e.g., Makeflow) for creating parallel and distributed composite systems.
• Prototype implementation of a virtual container-centric construction method to compose parallel and distributed systems with fault tolerance and integrity verification (Kulla-RIV) for processing large volumes of data in real-world applications.
The rest of the article is organized as follows. Section 2 summarizes the fundamentals of the Kulla model. 8 The design of the reliability and integrity verification (RIV) scheme is presented in Section 3. Prototype implementations of the RIV scheme in real-world scenarios using the Kulla model (Kulla-RIV) and Makeflow, experiments, and results are presented in Section 4. Section 5 presents related work and Section 6 provides conclusions and future work.

THE KULLA MODEL
Kulla is a virtual container-centric construction model that mixes loosely coupled structures with a parallel programming model for building infrastructure-agnostic distributed and parallel applications. 8 The Kulla parallel programming model enables developers to couple interoperable structures, facilitating the creation of continuous dataflows, wherein parallel patterns can be created without modifying the code of applications. The following parallel programming patterns are allowed in Kulla: Divide&Conquer (D/C, data parallelism), Pipe&Blocks (streaming), and Manager/Block (M/B, task parallelism). Recursive combinations of Kulla instances can be grouped in deployment structures called Kulla-Boxes, which are encapsulated into virtual containers (VCs) to create infrastructure-agnostic parallel and/or distributed applications. The design principles of Kulla rely on the following construction structures:
Kulla-Blocks (KBs): Abstract constructive structures to create single application images that take the data flow from input sources, passing through processing stages, to a sink or output stage. A Kulla-Block includes input/output interfaces that are configurable based on user requirements. These interfaces can be set up as a file system (e.g., where the data source and data sink reside in a local file system or Docker volume) or as a network interface (e.g., establishing a connection between the data source and data sink through a socket port or REST API provided by a storage service located at a specific URL). For a more in-depth understanding of the Kulla-Block input/output interfaces, please refer to the extended explanation in the Kulla model. 8
Kulla-Bricks (KBrs): Abstract constructive structures used to create processing patterns by interconnecting KBs. The composition of sequential or parallel systems is possible by the use of KBrs.
Kulla-Boxes (KBoxes): Constructive and deployment structures used to encapsulate a KB or KBrs into a virtualized entity (e.g., VCs); these KBoxes are used to compose infrastructure-agnostic solutions. One KBox could contain a sequential or parallel system within it.
Brick of Kulla-Boxes: Management structures formed of multiple KBoxes interconnected following processing patterns (described as Kulla-Bricks, KBrs) to compose infrastructure-agnostic distributed solutions.
Kulla-Silo: Container image storage service created to concentrate all the Kulla-Box images in a single repository to simplify the management of Kulla solutions for developers and end-users.
The construction structures proposed by the Kulla model (Kulla-Blocks and Kulla-Bricks) can be encapsulated within a deployment structure (Kulla-Box), allowing them to be transported and executed in different IT infrastructures. The structures and parallel programming patterns allowed in Kulla are depicted in Figures 2 and 3.
Through the utilization of the Kulla model, developers have the flexibility to integrate non-functional requirements (NFR) as processing stages within a Kulla-Box or Brick of Kulla-Boxes solutions. However, when developers are tasked with integrating particular NFR, such as RIV, into the Kulla model, they need to adhere to a detailed process. This process involves implementing application code for each requirement, encapsulating it into a Kulla-Block, connecting this block with a solution through a Kulla-Brick, deploying this brick as a Kulla-Box, and incorporating it as a processing stage in a Brick of Kulla-Boxes solution. Executing these tasks necessitates prior knowledge of implementing NFR and encapsulating them using the Kulla model.
In the Kulla-RIV model, we leverage the foundational structures of the Kulla model, namely Kulla-Boxes and Kulla-Bricks, and introduce new components and elements into these structures. These enhancements empower the new elements to monitor the progress of a Kulla-Box and Bricks during deployment and execution. This ensures the transparent fulfillment of NFR such as RIV for designers, developers, and users.
The foundation of Kulla-RIV relies primarily on VCs, driven by the motivation to offer developers a seamless tool for interconnecting existing applications within a workflow. An alternative approach involves the use of a serverless workflow, harnessing a serverless computing model where developers create functions executed in response to events. Serverless platforms autonomously manage the underlying infrastructure, abstracting away servers and containers. Unlike the reuse of existing applications in our container-centric approach, developers in a serverless model need to design applications to align with its specific requirements. Our container-centric approach is designed to ensure the secure interconnection of processing applications, preparation schemes, and software. This strategic emphasis aims to reestablish control over data in outsourcing scenarios.

RELIABILITY AND INTEGRITY VERIFICATION (RIV) SCHEME
The use of composition models (e.g., Kulla 8 ), workflow engines (e.g., Makeflow, 24 Parsl, 33 or DagOn* 34 ), and CI/CD pipeline managers (e.g., Jenkins 35 ) for the composition of distributed solutions has allowed developers to produce large-scale applications across different IT infrastructures. These tools allow developers to chain and distribute applications in cloud environments and, in some cases, allow the use of parallelism techniques (task-based or data-based parallelism). Figure 4 depicts an example of a composite system that interconnects different applications (E x ) in different stages using processing patterns. These applications could be running in different cloud environments (e.g., E 2 C1 and E 3 C1 in a private and a public cloud, respectively). The construction of composite systems, by means of composition models or workflow engines, is mainly focused on how to establish a sequence of the involved processes and, in some cases, how to improve the data processing efficiency (e.g., by cloning containerized application processes). However, these approaches do not focus on NFR, such as data/process reliability (R) and integrity verification (IV). These requirements become necessary as the stages involved in the workflow (i.e., applications or services) exchange data (which could be sensitive) with each other. Since the stages of the system are chained, they also depend on the correct functioning of the others (i.e., their adjacent stages, upstream and downstream applications).
Currently available tools, such as Jenkins, Makeflow, Parsl, and Kulla, assume that there are no failures during data acquisition and the execution of the processing stages, and that the data exchange among stages is correct. Assuming this behavior can be critical in a real scenario. Some tools allow encapsulating execution errors and failures detected during the processing stages (e.g., Makeflow, Jenkins). However, in case of failure, the involved task or process is not executed again unless the developer or manager does it manually. In distributed solutions similar to Figure 4, where a containerized processing stage E i delivers data to an adjacent processing stage E i+1 , developers are responsible for establishing some IV method to ensure that the data received by E i+1 has not been modified; otherwise, E i+1 assumes that the data from E i is correct.
In real scenarios where sensitive data is handled, such as space agencies and hospitals, developers must perform an IV process at least at the beginning and end of the workflow, and monitor the execution of the workflow processes to ensure that the data has passed through all the involved processing stages. Examples of this approach are the overlay model 36 and Gearbox. 9 However, there is no data verification process between every stage to validate that the data sent and received between adjacent stages are reliable. This situation motivated the design of a RIV scheme that can easily be integrated into construction models (e.g., Kulla) as well as existing engines for creating processing workflows (e.g., Makeflow).

FIGURE 4 Example of a pipeline in a data processing system with sequential (E 1 ) and parallel (E 2 and E 3 ) stages.

Design of the RIV scheme
Figure 5 shows the proposed design of the RIV scheme, which is composed of four main components: the reliability (R) module, the IV module, the middleware, and the structure manager (SM). At the bottom of the design is a virtual container platform (Docker Swarm) operating on a heterogeneous infrastructure, providing the execution environment for the proposed scheme. The RIV architecture has been meticulously crafted with modularity as a guiding principle, facilitating adjustments to various layers. The decision to currently employ Docker Swarm is rooted in historical considerations, stemming from the simplicity it offered as our foundational container platform during the proof-of-concept phase of the RIV scheme. However, it is important to note that the RIV design is inherently flexible and is positioned to seamlessly integrate with different container orchestration platforms, with Kubernetes (K8s) being a notable example. The SM and the middleware have been specifically designed to incorporate scripts and functions for deploying and executing containers across diverse platforms. Currently, our ongoing efforts are focused on integrating Kubernetes as an additional option for the container orchestration platform within the RIV scheme.

R module:
It is in charge of providing a fault tolerance service to the processes of data acquisition and storage in a workflow. Furthermore, it provides an execution verification process for the applications involved in a composite system (i.e., a processing structure in a workflow created by Makeflow or a Brick of Kulla-Boxes created by the Kulla model).
IV module: The IV of data occurs in this module, which ensures the correctness of data transmitted/received between two processing stages of a deployed solution (i.e., a composite system).
Middleware: This module is responsible for the interaction between the R and IV modules, in a combined (RIV) or independent way, with the SM. The middleware has different responsibilities depending on the time (execution or deployment) in which it is used (for each stage in the life cycle of a data processing solution). During execution time, the middleware is in charge of monitoring the execution of a solution and provides a reactive manager that will influence the processing flow in the event of a failure. During deployment time, the middleware is responsible for the encapsulation of the SM, middleware, and RIV modules, invoking the container engine, and communicating with the orchestrator (a special adapter in the Kulla model). The middleware's roles can be summarized as follows: (a) Interaction with the directed acyclic graph (DAG) executor monitor to provide data and process resiliency. (b) Interaction with Kulla adapters (orchestrator, launcher, and engine) to create the Kulla image and monitor the process execution and status of the Kulla-Box. (c) Serving as a library for the IV scheme, enabling other engines to integrate it as an operation between each stage of the workflow. This integration can take place either as part of their components, as we have done for the Kulla model, or as an external application, as we have done for Makeflow.
Structure manager (SM): It comprises scripts and functions executed both externally and internally to solutions.
External scripts handle the encapsulation, transport, and execution phases, while internal functions become part of solutions during encapsulation and play a role in the deployment and execution phases. Figure 6 illustrates the SM phases, showcasing the behavior of functions and scripts both inside and outside a solution, using various construction models or engines. These scripts and functions dictate the middleware's behavior, facilitating the integration of the proposed RIV scheme into different construction models or workflow tools, for example, Kulla or Makeflow. To ensure seamless integration, the SM must be described and included as an integral part of the composite system for every construction model or workflow tool used. Although the proposed RIV scheme is generic, modifications to it may be necessary to ensure correct functionality for each considered engine or model. During the encapsulation phase (refer to Figure 6), the SM should be part of the dependencies of applications, offering the necessary configuration, encapsulation, transport, deployment, and execution for a composite system. The primary phases included in the SM are:
Encapsulation: Virtual container images are generated, serving as the building blocks of the composite system. These images are crafted from Dockerfiles, which encompass all the essential libraries and dependencies necessary for assembling applications or services within the composite system. The Dockerfiles also incorporate the requisite commands for importing RIV components and functions in the case of Kulla-RIV solutions, or IV components for third-party engines like Makeflow.
Transport: A Kulla client process is tasked with deploying a container image onto the designated IT infrastructure. This image is retrieved from a Kulla repository or silo. To initiate the process, the Kulla client must be executed on the target computer, server, or cloud, facilitating the download of the image for subsequent execution.
Deployment: This operation is facilitated by the middleware layer, guided by a configuration file specifying the virtual container instances essential for the composite system. The configuration details, including I/O interfaces, network ports, and data volumes, are outlined in this file.
Execution: This operation is also supported by the middleware, which triggers the corresponding engine to initiate system execution. The middleware also supervises the execution of the RIV scheme for Kulla-RIV solutions or of the required RIV components for third-party solutions, for example, the IV components for Makeflow solutions.

FIGURE 7 Abstract representation of a composite system construction, applications (a) and (b), by using the Kulla framework and the Makeflow engine that includes the proposed RIV scheme.
As a proof of concept, two SMs are considered: one for the Kulla model and the other for the Makeflow engine. For Kulla, the SM was implemented as a library that facilitates the interaction between the original Kulla processing structures and the proposed RIV scheme. These modules were designed to be seamlessly integrated into the Kulla model. For Makeflow, two applications were developed (load/retrieve) for it to be able to provide an integrity verification (IV) scheme. These two applications must be included in a Makeflow workflow as an intermediary process between each pair of processing stages included in the workflow. Due to Makeflow limitations in the monitoring of tasks during the workflow execution of the processing stages, it was not possible to also add the reliability (R) module to the Makeflow engine (see Figure 7), which is why only the IV scheme was included in this workflow tool. It is worth noting that the SM module is designed to be open to the utilization of various engines in the construction of composite systems.

Reliability (R) module
This module is focused on providing data and process reliability (R) for parallel and distributed systems.

Data reliability
Data reliability is supported by creating multiple file or chunk replicas, called clones/dispersals, which are distributed in different locations. These dispersals contain information about the cloning process or the dispersal technique. The original content can be obtained even when one of the clones/dispersals is not available.
The basic replication approach works as follows: content C of size L that should be transferred through a workflow is duplicated k times, and each copy (C 1 , … , C k ) is stored in a different location. An important issue of this approach is that the amount of data stored will grow by k * L, increasing the storage cost. An alternative is the information dispersal algorithm (IDA). 38,39 This algorithm adds reliability features to the contents for the storage system to withstand service failures (data missing, data bit errors, unavailability of servers, etc.). The IDA technique splits each content C of length L into n pieces called dispersal files (dfs), each of length L dfs = L/m, where m represents the number of dfs sufficient for reconstructing C. As a result, this algorithm can withstand the unavailability of (n − m) dfs. The capacity used by this algorithm is Cap = n * (L/m), producing an overhead of Ov = Cap − L.
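As a worked example of these formulas (using n = 5 and m = 3, the same parameters applied in the case study later in this article), dispersing a content of L = 512 MB yields:

$$L_{dfs} = \frac{L}{m} \approx 170.7 \text{ MB}, \qquad Cap = n \cdot \frac{L}{m} \approx 853.3 \text{ MB}, \qquad Ov = Cap - L \approx 341.3 \text{ MB}.$$

That is, the loss of any n − m = 2 dispersals is tolerated at roughly 0.67 L of extra storage, whereas replication would require two extra full copies (an overhead of 2 L) to tolerate two losses.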
In practice, the implementation of this algorithm represents a suitable cost-effective solution for the preservation of sensitive contents for organizations because of the trade-off between fault tolerance and storage consumption achieved by this technique.

FIGURE 8 Example of a composite system deployed using the Kulla model.

Process reliability
The Kulla model proposes the use of a software tool that deploys one instance, called a Kulla-Box, by using one configuration file. This file (known as the DAG configuration file) describes all the processes interacting inside a Kulla-Box, forming a DAG structure. Inside the Kulla-Box, the Engine adapter supervises the execution of all involved applications following the extraction, transform, and load (ETL) model. 8 The incoming data is received by the input interface of the Kulla structure (extraction), then all the processes described in the DAG configuration file are executed (transform), and the results are finally sent to the next Kulla structure by the output interface (load). Figure 8 shows an example of an agnostic distributed composite system built using the traditional Kulla model. It includes a set of three Kulla-Boxes that interact by implementing a Brick of Kulla-Boxes pattern, where the leftmost Kulla-Box executes a single Kulla-Block, the next one includes two Kulla-Blocks (i.e., a Kulla-Brick), and finally a single Kulla-Block is executed in the third Kulla-Box. The traditional Kulla model (i.e., not the Kulla-RIV version) is not resilient, as it produces composite systems whose processing can be interrupted when an application within a Kulla-Box crashes and, as a consequence, the reception of data is stopped. In this scenario, system administrators can deploy multiple clones of a given Kulla-Box to reestablish system execution. However, this deployment process must be done manually by system administrators, who are also in charge of monitoring the log files of the Kulla-Boxes to detect failures and redirect the traffic to one of their available clones. To provide resilience to the composite system, the reliability (R) component of the Kulla-RIV model considers the automatic deployment of a set of r Kulla-Box clones. This process is described in Algorithm 1.

Algorithm 1. Deployment of the Kulla-Boxes replicas
Algorithm 1 shows how the Kulla deployment tool is executed r times, producing r replicas. The DAGConf file contains the configuration of the Kulla-Box to be replicated. It is important to note that when the number of clones increases, the available infrastructure resources decrease. However, the use of lightweight VCs minimizes the impact of resource consumption and does not affect the active Kulla-Box (i.e., the running Kulla-Box that will receive the incoming data).
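Since the body of Algorithm 1 is not reproduced here, the following C sketch illustrates the loop it describes; the kulla-deploy command name is a hypothetical stand-in for the actual Kulla deployment tool.

```c
/* Sketch of the replica deployment loop described by Algorithm 1: the
 * Kulla deployment tool is executed r times over the same DAG
 * configuration file, producing r Kulla-Box clones. */
#include <stdio.h>
#include <stdlib.h>

void deploy_replicas(const char *dag_conf, int r) {
    char cmd[512];
    for (int i = 0; i < r; i++) {
        /* Each invocation launches one lightweight VC clone; only one
         * clone stays active, the rest remain idle until a failure. */
        snprintf(cmd, sizeof cmd, "kulla-deploy --dag %s --replica %d",
                 dag_conf, i);
        if (system(cmd) != 0)
            fprintf(stderr, "replica %d failed to deploy\n", i);
    }
}
```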

FIGURE 9 HMAC generation and verification processes.
To determine the status of a Kulla-Box and test the execution of its Kulla-Blocks (i.e., applications or processes) defined in a DAG configuration file, a new Kulla component called the DAG execution monitor was included. This component is in charge of producing a logfile with the status (pending, in process, completed, or failed) of all applications included in the Kulla-Box. The default status of an application is pending, which means that the application is waiting for execution. When an application is run, its status changes to in process. If the running application finishes without errors, its status changes to completed; otherwise, it changes to failed.
The execution of an application inside a Kulla-Box is described in Algorithm 2. The DAG execution monitor deploys all applications (included in a Kulla-Box, identified by KBPosition) defined in the configuration file (DAGConf). The running Kulla-Box reads/receives the incoming data through the access layer; then the Kulla pipeline executor process runs the applications in sequential order. If one application fails, the executor tries i times to get a completed status. Only when all the applications finish with a completed status is the Kulla-Box tagged as available; otherwise, it changes to failed and starts a failure detection protocol.
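The retry logic of Algorithm 2 can be summarized by the following C sketch; run_application() is a hypothetical stand-in for the actual launch of a Kulla-Block, and the enum mirrors the four statuses reported by the DAG execution monitor.

```c
/* Sketch of the retry logic in Algorithm 2: an application moves from
 * pending to in process, and is retried up to i times until it reaches
 * completed; otherwise it ends as failed. */
#include <stdbool.h>

typedef enum { PENDING, IN_PROCESS, COMPLETED, FAILED } app_status;

bool run_application(int app_id); /* assumed: returns true on success */

app_status execute_with_retries(int app_id, int i) {
    for (int attempt = 0; attempt < i; attempt++) {
        /* status: in process while the application runs */
        if (run_application(app_id))
            return COMPLETED;   /* logged as completed */
    }
    return FAILED; /* triggers the failure detection protocol below */
}
```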
The failure detection protocol indicates to the access layer that an application is in failed status and is not responding and that, as a consequence, the Kulla-Box is in failed status too. The access layer of the failed Kulla-Box sends a message to the previous Kulla-Box (or to the data source, if the Kulla-Box is in the first processing stage) reporting that it is in failed status and recommending to redirect its traffic to one of its clones.

Integrity verification (IV) module
To verify the origin of the data as well as its integrity, hash-based message authentication codes (HMAC) and the elliptic curve Diffie-Hellman (ECDH) protocol were used. HMAC is a specific type of message authentication code (MAC) involving a cryptographic hash function and a secret cryptographic key. 40 ECDH is a key agreement protocol based on elliptic curve cryptography (ECC). The ECDH protocol allows two parties, each having an elliptic-curve public-private key pair, to establish a shared secret value over an insecure channel. This secret value can be used as a key to encrypt subsequent communications using a symmetric key cipher. 41 Our proposed IV scheme uses this secret value as the key required in HMAC.
Figure 9 shows the HMAC generation and verification processes carried out by two processing stages E 1 and E 2 respectively.
The parameters required for generating the HMAC are described as follows:
• Secret key (K): It is the symmetric private key that is shared between two processing stages. The shared key is generated from the ECDH protocol.
• ipad or inner padding: It is the block-sized inner padding, consisting of repeated bytes valued 0x36.
• opad or outer padding: It is the block-sized outer padding, consisting of repeated bytes valued 0x5c.
• H or hashing function: It is the hash function used in the HMAC process. In the implementation of the prototype, we designed a Kulla-Block called Hash KB that is used to compute the hash codes. The HMAC is constructed by hashing (H) the concatenation (||) of:
• the XOR (⊕) of the secret key K with the outer padding opad (K ⊕ opad), and
• the hash of the XOR (⊕) of the secret key K with the inner padding ipad, concatenated with the message m (H(K ⊕ ipad||m)).
The following equation shows the HMAC calculation process:

HMAC(K, m) = H((K ⊕ opad) || H((K ⊕ ipad) || m))
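As an illustration, the following minimal C sketch shows how such an authentication code can be computed with the one-shot HMAC() helper of the OpenSSL library, on which the prototype's HMAC functions are based; the key bytes and message here are hypothetical placeholders, since in Kulla-RIV the key is the ECDH-derived shared secret.

```c
/* Minimal HMAC sketch (requires OpenSSL >= 1.1.1 for SHA-3 support). */
#include <openssl/hmac.h>
#include <openssl/evp.h>
#include <stdio.h>

int main(void) {
    /* In Kulla-RIV the key comes from ECDH; this value is a placeholder. */
    unsigned char key[32] = {0x0b};
    const unsigned char msg[] = "content produced by stage E1";

    unsigned char mac[EVP_MAX_MD_SIZE];
    unsigned int mac_len = 0;

    /* SHA3-256 corresponds to the low security level of the risk matrix. */
    if (!HMAC(EVP_sha3_256(), key, sizeof key,
              msg, sizeof msg - 1, mac, &mac_len))
        return 1;

    for (unsigned int j = 0; j < mac_len; j++)
        printf("%02x", mac[j]);
    printf("\n");
    return 0;
}
```

The receiving stage (E 2 in Figure 9) repeats the same computation over the received content and compares the result against the received code before accepting the data.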
For the Kulla model, the RIV schemes were included in the access layer; for Makeflow, the IV scheme is divided into two applications that were incorporated as intermediaries between each pair of processing stages of the workflow. The development of an IV process requires several configuration parameters as input. For example, the module for establishing the shared secret by using the ECDH protocol requires the selection of the type and size of the required elliptic curve. It is also necessary to select a hash algorithm for using the hashed message authentication codes (HMAC). When selecting these security parameters, it is important to consider that some elliptic curve sizes or hash functions could be discontinued by NIST. 42 For our proposed IV module, the B-283, B-409, and B-571 elliptic curves were used, jointly with the SHA3-256, SHA3-384, and SHA3-512 hash functions. These functions are included in the OpenSSL library. 43
Risk matrix: To reduce the number of possible configurations that can be used in our data IV process, groups of parameters with similar security levels were made. A risk matrix was defined grouping these configuration parameters into three security levels (low, medium, and high), as shown in Table 1. At a high security level, the number of scalar operations increases, which increases the time required by the integrity model. Hash algorithms require more processing time as the size of the input data increases. In general, by selecting a higher security level, the response time of a complex composite system will increase due to the operations performed by the RIV scheme.
Kulla-RIV includes a configuration file that allows developers to change the security level at execution time.This configuration file contains the security level, the curve type (B for pseudo-random curve), and the hash function (SHA3-256, SHA3-384, and SHA3-512).The middleware layer is responsible for reading the configuration file at execution time.
It is important to note that the inclusion of the RIV scheme in a composite system using Kulla-RIV is optional, which means that it could be ignored by developers, especially in cases when composite systems are created in a secure and reliable infrastructure (e.g., an isolated cluster or a private cloud).
Threat model: The Kulla-RIV model considers two possible attacks:
1. Attacks initiated by malicious insider Kulla components (Brick or Block) that are part of a composite system and intentionally or unintentionally alter or modify the content of the input data, that is, fabricate the data to produce a desired output.
2. Attacks initiated by outsider attackers, which are unauthorized applications that could gain access to the continuous processing flow of a composite system and intentionally alter or modify the content of the input data of a Kulla component (Brick or Block).
The IV module in the Kulla-RIV model prevents these attacks on the data exchanged by the involved applications in a composite system by ensuring the data integrity security service over the input data to a Kulla-Brick or Kulla-Block component, as well as ensuring the authenticity of the Kulla component that originated the data. In the Kulla-RIV model, each Kulla component has a single and unique symmetric cryptographic key that is used to create authentication codes over the data that is sent to the destination Kulla component, which must perform integrity checks. The authentication codes are generated from a secure implementation of HMAC, which is a well-known and widely used secure algorithm for authentication and integrity. The security relies on the strength of the symmetric key used by both the originator and the destination Kulla component. The symmetric key is established between originator and destination using a secure implementation of the ECDH key exchange algorithm. From only public parameters shared by all applications in the composite system, a unique key is established by the Kulla components directly connected in the Kulla-RIV model. Thus, the integrity and authentication code generated by the i component in the continuous processing flow will be verified by the i + 1 component. It is assumed that an insider/outsider application module, not being considered part of the composite system at deployment time, will be unable to execute ECDH and thus to securely communicate with any genuine Kulla component already deployed. Consider the insider attack scenario, where it is assumed that all deployed Kulla components are trusted. In this context, each component is expected to adhere to the protocol accurately, refraining from discarding or fabricating input data. This adherence is verifiable and can be ensured during implementation and deployment. In the case of an outsider attacker, despite gaining the ability to inject input data, they will be unable to establish a valid key for HMAC. Without a valid authentication code, any data received within a Kulla-Brick or Kulla-Block will be detected and subsequently discarded.
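The destination-side check implied by this threat model can be sketched as follows; recompute_hmac() is a hypothetical wrapper around the HMAC computation shown earlier, and CRYPTO_memcmp is OpenSSL's constant-time comparison.

```c
/* Sketch: input data is accepted only if its authentication code
 * verifies; otherwise it is detected as fabricated/altered and discarded. */
#include <openssl/crypto.h>
#include <stdbool.h>
#include <stddef.h>

bool recompute_hmac(const unsigned char *data, size_t len,
                    unsigned char *mac_out, unsigned int *mac_len); /* assumed */

bool accept_input(const unsigned char *data, size_t len,
                  const unsigned char *recv_mac, unsigned int recv_len) {
    unsigned char mac[64];
    unsigned int mac_len = 0;
    if (!recompute_hmac(data, len, mac, &mac_len))
        return false;
    /* Constant-time comparison avoids leaking mismatch positions. */
    return mac_len == recv_len && CRYPTO_memcmp(mac, recv_mac, mac_len) == 0;
}
```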
Kulla-RIV technical details: Virtual container images and instances are created and launched by using the Docker platform. The Kulla-RIV control adapters, such as divide, containerize, conquer, manager, launcher, orchestrator, worker, and so forth, were written in the C programming language. Versions of some of them (e.g., manager and workers) are available in Java, Python, and C++. The scale and deployment approaches were developed by using C scripting and structured data (i.e., JSON, Dockerfiles, and Docker compose files). The SHA3-224, SHA3-256, and SHA3-384 hash functions are based on the public GitHub project called C implementation of SHA-3 and the Keccak algorithm with Init/Update/Finalize API. 44 The ECDH protocol implementation is based on the public GitHub project called Tiny ECDH / ECC in C. 45 The HMAC SHA-224, SHA-256, and SHA-384 functions are based on the OpenSSL library. 43

EXPERIMENTAL EVALUATION AND RESULTS
As a proof of concept of the proposed RIV scheme, two composite systems were deployed, one using the Kulla-RIV model and another using the Makeflow engine (with the IV scheme). For the Kulla-RIV model, a composite system was built by using the updated version of the Kulla-Boxes (which include the RIV components). For the Makeflow tool, a composite system (workflow) was created integrating two applications (producer and receiver) that provide the IV scheme to the Makeflow solution. Makeflow was considered a good representative workflow tool to be used for comparison purposes, as it allows the execution of large complex workflows on clusters, clouds, and grids.
In the first experiment, the functions required by the ECDH protocol in the RIV scheme were evaluated. Several security configurations in a two-stage workflow generated using the Kulla Pipes&Blocks pattern were tested. In a second experiment, a case study was conducted to evaluate the feasibility of using the Kulla-RIV model to deploy agnostic and parallel solutions for data processing considering NFR, such as reliability and integrity verification (the RIV aspects). A similar solution was implemented with Makeflow considering only the IV scheme. The points evaluated in the case study were:
• IV between adjacent stages in a composite system.
• Performance of the RIV schemes.
-When using the Pipes&Blocks pattern.
-When deploying a scale in/out scenario.
A client process was developed to execute a workload producer bot for sending requests to the composite systems. The client also includes an adapter for capturing a set of metrics (see Section 4.1) to analyze the composite system performance.

Metrics
The following metrics were extracted from the experimental evaluation:
• Service time (ST): represents the elapsed time in which content is processed by each application encapsulated into an instance.
• Response time (RT): represents the time spent by a composite system to successfully dispatch the requests sent by the client bot. This is the sum of the ST produced by each application instance considered in a solution plus the time spent to retrieve/deliver data from/to a data source/sink (see the formula after this list).
• Percentage of performance gain: represents the performance increase in percentage when comparing response times produced by composite systems with and without the RIV scheme.
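Expressed as a formula (a restatement of the definition above, assuming a pipeline of N sequential application instances):

$$RT = T_{DSr} + \sum_{i=1}^{N} ST_i + T_{DSk},$$

where T_DSr and T_DSk denote the time spent retrieving data from the data source and delivering results to the data sink, respectively.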

Infrastructure
Table 2 describes the infrastructure for each deployment scenario of the evaluation. Two data processing scenarios were considered: parallel data processing inside an instance (i.e., parallel or scale-in scenario) and distributed data processing (i.e., scale-out scenario). Hybrid processing strategies that combine scale-in and scale-out scenarios were also considered. To validate the R and IV schemes, we ran different tests. In the first test, the verification process was included in a composite system deployed on a single server to measure the service time of the implementation of the ECDH protocol in each stage. In a second test, a composite distributed system built by using parallel patterns was used to deploy solutions on a cluster of VCs on multiple servers, adding the cost of transportation to the complete verification process.

Performance analysis of the ECDH and HMAC implementations
The core components of the proposed IV scheme are based on the ECDH protocol and HMAC. This section presents a performance analysis of different implementations of these core components.

Test scenario
A simple composite system was built using the new version of the Kulla-Box structure that includes the proposed RIV module. This composite system was created and deployed on a single server (Server16, cf. Table 2) by using the scale-in processing strategy. Each experiment was performed 31 times, and the median of every metric was captured. The ECDH functions analyzed in the implementation of the new Kulla-Box were:
• Key generation: Generates the public and private keys for the entity.
• Get public value: Obtains the value of the public key in the format required for transportation.
• Transform public value: Transforms the received public key (produced by the Get public value process) into a value compatible with the compute secret function.
• Compute secret: Computes the common secret elliptic curve point (x k , y k ) of the two entities (for this example, the entities were called E 1 and E 2 ).
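The following C sketch maps these four functions onto the OpenSSL low-level EC API; note that the actual prototype relies on a different, lightweight C implementation (Tiny ECDH / ECC in C, cf. the technical details in Section 3), so the calls below are illustrative rather than the prototype's code.

```c
/* Sketch of key generation, public-value transport, and secret
 * computation on curve B-283 (NID_sect283r1) using OpenSSL. */
#include <openssl/ec.h>
#include <openssl/ecdh.h>
#include <openssl/obj_mac.h>

int derive_shared_secret(unsigned char *secret, size_t secret_len,
                         const unsigned char *peer_pub, size_t peer_pub_len) {
    /* Key generation: create the entity's public/private key pair. */
    EC_KEY *key = EC_KEY_new_by_curve_name(NID_sect283r1);
    if (!key || !EC_KEY_generate_key(key))
        return -1;

    /* Get public value: the local public key would be exported with
     * EC_POINT_point2oct() and sent to the peer (omitted here). */

    /* Transform public value: decode the peer's octets into a point. */
    const EC_GROUP *grp = EC_KEY_get0_group(key);
    EC_POINT *peer = EC_POINT_new(grp);
    if (!peer || !EC_POINT_oct2point(grp, peer, peer_pub, peer_pub_len, NULL))
        return -1;

    /* Compute secret: shared point (x_k, y_k); the derived bytes are
     * later used as the HMAC key. */
    int n = ECDH_compute_key(secret, secret_len, peer, key, NULL);

    EC_POINT_free(peer);
    EC_KEY_free(key);
    return n; /* number of secret bytes, or <= 0 on error */
}
```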

Results
This section presents the results obtained after executing the experimental scenario described in Section 4.3.1. Figure 10 shows the breakdown of the median of the response times (in milliseconds) produced by all the functions of ECDH included in the IV process, evaluated for eight elliptic curves of different sizes.
As expected, when the size of the selected elliptic curve increases, the response time to compute the main functions of ECDH increases too. Furthermore, it can be seen that the most time-consuming processes are key generation and compute secret because both of them include scalar multiplications. The time consumed by the other ECDH processes is imperceptible. To visualize the behavior of the response time of a solution when changing the elliptic curve size or the hash algorithm, we evaluated different configurations for ECDH and HMAC separately. An outline of the system performance is shown in Figure 11, revealing an increasing service time associated with the security level applied in the IV elliptic curve parameters, that is, the three elliptic curves, B-283, B-409, and B-571, considered in the risk matrix (cf. Table 1). Different workloads were used in this test, using file sizes ranging from 1 to 1000 MiB, in a system running on Server12, described in Table 2.
Figure 11 shows the median of the service time (in seconds) obtained by executing 31 times the ECDH process to compute a shared secret (without measuring the HMAC process), varying the elliptic curve and file size. This time is an overhead that must be added to the service time required for the processing application. We can see that the impact of these processes (in the ECDH protocol) on service time, using different elliptic curves and varying the file sizes, is almost null. In this experiment, the maximum overhead generated by them is 1.5 s.
The median of the service time required for the hash algorithms used to carry out the HMAC (without including the ECDH process) is shown in Figure 12. In this evaluation, the same workloads with file sizes ranging from 1 to 1000 MiB were used. This experiment was also carried out on Server12.
Figure 12 shows how a more secure algorithm produces a higher service time and how the size of the content has a notable impact on performance. This impact can be observed using any of the hashing algorithms. For instance, using the SHA3-256 algorithm with files of 1 and 1000 MiB produces response times of 0.035861 and 27.41454 s, respectively. This time is considered as overhead and must be added to the service time of the processing application.
The overhead generated by the ECDH functions and hash operations must be added to the service time between each pair of processing stages of a composite system, making the security level selection process a critical decision when designing systems with an IV scheme. The time overhead can be reduced by decreasing the size of the elliptic curve at the price of reducing the level of security in the solution. For instance, considering the values shown in Figures 11 and 12, the time overhead produced by the B-283 elliptic curve plus the one produced by the SHA3-256 hash function will be much lower than that of the combination of B-571 plus SHA3-512, which means a better response time in the composite system.

FIGURE 13 Example of a composite distributed system deployed in Kulla-Boxes that implement the RIV scheme.

Evaluation of a pipeline processing workflow
This section presents an evaluation of a pipeline processing workflow that includes IV, implementing the ECDH process and HMAC verification, which are part of the Kulla-RIV composition model.

Test scenario
A distributed composite system is deployed by means of two Kulla-Boxes (E1 and E2), as shown in Figure 13. The first Kulla-Box (E1) encapsulates a Kulla-Brick that includes the LZ4 and AES Kulla-Blocks, with LZ4 compressing and AES encrypting applications, respectively. These Kulla-Blocks were built by using the Pipes&Blocks method taken from the Kulla model. The first Kulla-Box E1 reads content from the data source (DSr), processes the content through the LZ4 and AES Kulla-Blocks, and sends the resulting content to the next Kulla-Box (E2). The second Kulla-Box (E2) receives the processed content and executes IV; if the verification succeeds, a registration is made by a Log application, and the final content is stored in the data sink (DSk).
The applications included in the two Kulla-Blocks embedded in the Kulla-Brick shown in Figure 13 are described next (a sketch of the compression step follows this list):
• LZ4: Implements LZ4, which is a lossless data compression algorithm focused on compression and decompression speed. This implementation is based on the public LZ4 libraries available on GitHub. 46 In this stage, the Kulla-Block reads a given file as input and produces a compressed/decompressed version of the original file as output. The output is sent forward to the next stage in the pipeline.
• AES: Implements the traditional AES algorithm 47 to encrypt/decrypt incoming content |C|. In this evaluation, the AES version developed by OpenSSL was used, with a 128-bit key.
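As context for the LZ4 stage, the following minimal sketch shows the core call of the public LZ4 library used above; buffer management is simplified and the encryption/forwarding steps are omitted.

```c
/* Sketch of the LZ4 Kulla-Block's core operation: compress an input
 * buffer so the result can be sent to the next stage of the pipeline. */
#include <lz4.h>
#include <stdlib.h>

char *compress_stage(const char *src, int src_len, int *out_len) {
    int max = LZ4_compressBound(src_len);   /* worst-case output size */
    char *dst = malloc(max);
    if (!dst)
        return NULL;
    *out_len = LZ4_compress_default(src, dst, src_len, max);
    if (*out_len <= 0) { free(dst); return NULL; }
    return dst;  /* caller forwards this buffer to the AES stage */
}
```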
The second Kulla-Box only includes a Kulla-Block called Log that is in charge of registering all the received contents in a log file. This Kulla-Box receives the incoming data through the network interface, computes the verification process (only when the IV scheme is activated), adds a registry to the log file, and stores the results using the file system interface.
This experiment applies the same workload used in previous ones, with file sizes ranging from 1 to 1000 MiB, and the following configurations:
1. No verification: The processed content produced by the first Kulla-Box (E1 in Figure 13, which is a stage deployed in one node of the cluster described in Table 2) is sent and received without any verification process (i.e., IV scheme) in the next Kulla-Box (E2, deployed in a different node of the same cluster).
2. Verification: In this case, the IV scheme is activated between E1 and E2. The verification process follows the three security levels presented in Table 1.
• Low: It implements the ECDH protocol by using the B-283 curve size, and applies the HMAC SHA3-256 function for the message IV.
• Medium: The ECDH protocol is implemented by using the B-409 curve size, applying the HMAC SHA3-384 function for the message IV.
• High: It uses the B-571 curve size and applies the HMAC SHA3-512 function for the message IV.
For all processing stages, the HMAC verification confirms whether the HMAC received from the first Kulla-Box (E1) is the same as the HMAC computed by the second Kulla-Box (E2). The HMACs will not be equal if the ECDH security levels of the processing stages are different, for example, if the first Kulla-Box implements the low security level and the next Kulla-Box requires the medium or high security level.

Results
This section presents a performance analysis of the experiments carried out considering the test scenario described in Section 4.4.1. The test was executed 31 times, and the points are the median values. Figure 14 compares the response times produced by the composite system with the IV scheme activated (evaluating the three defined security levels while varying the workloads) and without verification. It can be seen that the security levels have different performance costs. When a verification process is added (with any security level), the processing time increases compared with the no-verification configuration. A higher security level corresponds to a higher response time of the composite system because of the overhead produced by the IV scheme. In the high and medium security levels, for the smallest evaluated file (1 MiB), the response time increases about 12 times and 8 times, respectively. This behavior persists for the other file sizes but not on the same scale. The impact of the IV scheme is smaller when the processed content size increases. For example, using the 1000 MiB file size, the response time increases 1.07 and 0.75 times for the high and medium security levels, respectively. Using 1 MiB files, the processing overhead is notorious, as the difference between the service time produced by the processing applications (LZ4 and AES) and the verification process (i.e., hashing, ECDH, transportation of keys, and HMAC) is higher. It is worth noting that the executed applications (LZ4 and AES) are faster when the file size is small.

FIGURE 15 Scale-out/in solution represented as a directed graph.

Case study: Processing medical images with RIV schemes using the scale in/out scenarios
This section shows the results of using the RIV schemes in composite systems deployed using both the Kulla model (Kulla-RIV) and the Makeflow tool. For this test scenario, a case study on processing medical images was carried out.

Test scenario
A more complex composite system was deployed using new processing patterns. The new composite system is an extension of the one presented in Figure 13. A directed graph representation of this new solution is shown in Figure 15, wherein new processing patterns were added to the original composite system. The new composite system is deployed implementing a Manager/Block processing pattern (M/B, cf. Section 2). The composite system represented in Figure 15 has one data source (DSr), three Kulla-Boxes (M, LZ4 1 , and LZ4 w ), and two data sinks (Cloud 1 and Cloud w ). In this example, every Kulla-Box includes only one Kulla-Brick. The Kulla-Box M represents the master, and the Kulla-Boxes LZ4 1 and LZ4 w represent the blocks or workers in the M/B processing pattern. The number of workers and data sinks could be more than two (LZ4 1 … LZ4 w and Cloud 1 … Cloud w , respectively), depending on the size of the problem or task to solve. In this composite system, the original LZ4 Kulla-Block shown in Figure 13 was converted into a new Kulla-Brick that includes a Divide&Conquer (D&C) processing pattern. This new Kulla-Brick is encapsulated (i.e., containerized) into a new Kulla-Box. In the D&C pattern, a software instance called Divide (D) segments the input data into s segments that are sent to s workers (IDA 1 … IDA s in Figure 15) that were cloned and launched at setup time. Conquer (C) is another software instance that receives the results produced by the workers and consolidates them into a single one, sending this new result to the next stage in the schema or to a data sink (Cloud x in this example).
In this new composite system, the scale-in scenario occurs inside the Kulla-Bricks when multiple blocks or workers are launched in parallel, running an IDA (IDA 1 … IDA s ). The scale-out scenario occurs when the Kulla-Boxes (LZ4 1 … LZ4 w ) are instantiated in a distributed (scale-out) environment.
The master (M) adapter of the Kulla-Brick reads files from the data source (DSr) and sends them one by one to each worker or block (LZ4 1 … LZ4 w ). Each worker compresses its input file and sends the compressed file to the next stage of the process. The compressed file is received by the IDA processing instance, which will be executed following a D&C processing pattern. In the divide (D) phase, the compressed file (F) is split into s segments {F 1 , F 2 , … , F s }, which are sent to s workers (IDA 1 … IDA s ) that add redundancy to their respective segments F x . Every file segment with redundancy is called a dispersal (Df 1 , Df 2 , … , Df s ). In this IDA configuration, the number of segments is 5 (s = 5) and the number of dispersals sufficient for reconstructing the original file (F) is 3 (see the IDA description in Section 3.2.1).
In this composite system, the first Kulla-Box (M) starts the M/B pattern, where the manager launches workers that execute a Pipe&Block Kulla-Brick. The Kulla-Bricks included in the Kulla-Boxes (i.e., Docker containers for this implementation) at the second level of processing in the composite system contain the application that implements the LZ4 compression algorithm and a launcher of an image (Docker image) of a Kulla-Box that contains the implementation of a D&C Kulla-Brick and the RIV schemes. The Docker image of this Kulla-Box was used to deploy two or more Kulla-Boxes on different servers (scale-out), as shown in Figure 15. Each Kulla-Brick that implements the D&C pattern executes a master process (D/C x ) and as many workers (IDA x ) as cores are available in the servers.
The exchange of data between the processes that implement the M/B and D&C patterns in the Kulla-Boxes was performed through the network I/O interface (sockets). The implicit execution of the ECDH and HMAC processes is considered part of the data reliability and integrity verifications (RIV schemes). For simplification, these verification processes do not appear in Figure 15 as they were shown in Figure 13. It is worth noting that the configuration that determines the deployment of a composite system using the Kulla model can vary according to the needs of the users; for instance, some parameters to vary are the number of servers wherein the composite system would be deployed and the number of cores that would be used in a server.
This case study considers a fault-tolerant mechanism in which each Kulla-Box is cloned three times. One of these clones is the active running process, and the others remain in an idle status, ready for activation if the active process fails. Idle clones will try to connect to the active process three times; if the active process does not attend a petition within 5 s, the traffic is redirected to another clone.
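The failover rule just described can be sketched as follows; connect_with_timeout() and redirect_traffic_to() are hypothetical helpers standing in for the actual socket plumbing of the access layer.

```c
/* Sketch of the clone failover rule: an idle clone probes the active
 * process up to 3 times with a 5-second timeout before the traffic is
 * redirected to another clone. */
#include <stdbool.h>

#define MAX_ATTEMPTS    3
#define TIMEOUT_SECONDS 5

bool connect_with_timeout(const char *host, int port, int timeout_s); /* assumed */
void redirect_traffic_to(const char *clone_host, int port);           /* assumed */

void monitor_active_process(const char *active, const char *clone, int port) {
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        if (connect_with_timeout(active, port, TIMEOUT_SECONDS))
            return; /* the active Kulla-Box answered; nothing to do */
    }
    /* Three unanswered petitions: activate a clone and redirect traffic. */
    redirect_traffic_to(clone, port);
}
```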
The performance of this composite system deployed with the Kulla-RIV model was compared with the performance of a similar deployment using the Makeflow tool. Makeflow was configured to execute a pipeline for encoding the medical images stored on a server by using all its cores and then sending the encoded segments to the rest of the servers (three servers).
The experiment was carried out by using a computed tomography (CT) imaging repository, where the solutions processed 55 images with a quality of 512 × 512 and a size of 512 MB each. This repository was processed by both deployments, one using the Kulla-RIV proposal and another using the Makeflow engine.

Results
Figure 16 shows the response times obtained by the composite system using the Kulla-RIV and Makeflow deployments.
In both deployments, 12 workers were executed for data processing. In the Kulla-RIV deployment, 6 workers were part of the execution of the M/B pattern and the other 6 were part of the D&C processing pattern, that is, a K-M/B(6w)K-D/C(6w) configuration. Due to Makeflow distribution limitations, only one server was used for processing and the rest were used for storage. In the Makeflow deployment, the main server executed the data processing stages of the workflow, using all server cores. The resulting processed data were shared with the other three servers using the IV scheme. These three servers had the role of storage servers or data sinks (DSk) in the composite system. The Kulla-RIV solution processed the image repository (27.5 GB) in 7.18 min, while Makeflow took 11.5 min, approximately at a rate of 2.5 GB/min, during the execution of their pipelines. Table 3 presents various statistical metrics (average, median, and 90th percentile values) to depict these results, showcasing their consistency. Each bar in Figure 16 illustrates the response time divided into two segments. The larger segment represents the time generated by the execution of the applications comprising the workflow, while the shorter segment reflects the overhead produced by the processes enabling the RIV or IV schemes in Kulla and Makeflow, respectively. This presentation allows users to visually assess the feasibility of the cost incurred to mitigate the risks associated with the absence of RIV strategies in a workflow. The fault tolerance approach of the Kulla-RIV solution was tested using a bot application that was in charge of simulating crashes in some components of the system by randomly killing some Kulla-Boxes. In this case, the composite system was able to stay online all the time, as requests sent to downed Kulla-Boxes were redirected to their clones, keeping data processing without interruption. Our proposed reliability (R) scheme was not incorporated into the Makeflow solution because when a process fails, the Makeflow engine kills that process, restarts it, and continues processing another content. When Makeflow finishes the complete task, the developer must check the log file to verify that all the files were processed, and if a failure was detected, the developer must manually restart the failing process and run the unprocessed files. In Kulla-RIV, if all replicas are down, the dataflow stops and is reestablished when one of the clones is detected online again.
The results obtained with the Kulla-RIV model are encouraging, as it allows different processing patterns to be combined within a composite system, improving overall performance while also providing RIV.

RELATED WORK
Data processing patterns play an important role when carrying out big data processing tasks. Some data processing patterns used in the state-of-the-art literature are pipelines, 15 microservices, 48 building blocks, 49 and pipelines for the cloud. 50 In addition to the processing patterns, some tools allow the generation of workflows across different infrastructures such as clusters, grids, and clouds. Some of the relevant tools for managing workflows are Askalon, 51 Hyperflow, 52 Moteur, 53 Pegasus, 54 Swift, 25 Taverna, 55 Makeflow, 24 Triana, 22 Comps, 27 PyComps, 56 Parsl, 33 and DagOn*. 34 These tools allow the creation of workflows based on different types of architectures and data distribution. Table 4 presents various data processing architectures, detailing their distribution type and whether they incorporate NFR. A solution is considered centralized when all its elements are executed on a single computer, constraining its performance to the capabilities of that single infrastructure. On the other hand, a solution is considered distributed over the network when its stages can be deployed across different nodes within the same network; furthermore, it can be distributed across multiple nodes, whether physical or virtual, within the same network or across different networks.
In the case of microservices, they allow the deployment of processing stages on different computers that communicate over the network through an API (e.g., REST), making them easy to distribute. However, additional tools are required to distribute the workload and deploy the microservices. Microservices are portable but do not consider NFR as part of their architecture.
Workflows can be centralized or distributed; however, distribution requires that the infrastructure provide third-party applications to achieve the distribution of tasks. The deployment of applications usually requires installing third-party dependencies on the target infrastructure, for instance in the public cloud, potentially affecting the portability of the applications. While some workflow engines enable the containerization of their applications for enhanced portability, these workflows must include NFR as explicit processing stages, since NFR are not inherently part of the workflow behavior. For instance, in the evaluation of this work, we incorporated the IV scheme into Makeflow to facilitate a performance comparison with the proposed Kulla-RIV composition model.

Some existing solutions address issues identified in their respective contexts by implementing reliability (R) and integrity (I) schemes. Cao et al. 58 present a protocol for reliability processes, encapsulating operations into virtual machines (VMs) that detect failures and deploy new VM instances of the process upon detecting a failure in one or multiple stages. Similar to the proposed Kulla-RIV composition method, applications in this protocol are encapsulated into virtual instances deployable across different infrastructures to prevent all instances from failing on the same server. However, the Kulla-RIV solution distinguishes itself by encapsulating applications into VCs, offering improved response times during both deployment and execution. IntegrityCatalog 59 offers a solution for the management and IV of metadata content, capable of detecting inconsistencies and retrieving the original content of metadata. That proposal primarily focuses on IV for stored data; in contrast, the Kulla-RIV solution concentrates on real-time IV checks for data shared between instances located on different infrastructures. When an inconsistency is detected, Kulla-RIV instances reject the data transfer until the data passes the IV check or the allowed number of attempts is exceeded.
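This verify-or-reject behavior can be pictured with a short sketch, assuming an HMAC-SHA256 tag travels alongside each segment (the fetch_segment callback and the MAX_ATTEMPTS value are illustrative assumptions, not the Kulla-RIV API):

```python
import hashlib
import hmac

MAX_ATTEMPTS = 3  # allowed IV attempts before the transfer is rejected (assumed value)

def receive_with_iv(fetch_segment, key: bytes) -> bytes:
    """Accept a data segment only if its HMAC tag matches; otherwise request it again."""
    for _ in range(MAX_ATTEMPTS):
        data, tag = fetch_segment()  # (payload, sender's HMAC tag)
        expected = hmac.new(key, data, hashlib.sha256).digest()
        if hmac.compare_digest(expected, tag):
            return data  # integrity check passed
        # inconsistency detected: reject this copy and ask for a retransmission
    raise ValueError("segment failed integrity verification; transfer rejected")
```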
End-to-end solutions execute their processes at both ends of a solution (e.g., a producer node and a consumer node), limiting performance to the computing equipment used by these two entities and ignoring the risk and performance of components or nodes in between. These solutions do not consider NFR. The original Kulla model requires developers to implement NFR as stages in their solution, which demands time and experience in using both NFR and Kulla. In our proposed Kulla-RIV composition model, NFR are integrated as part of the architecture; consequently, developers can choose whether the R scheme, the IV scheme, or both are deployed with their solution to meet NFR requirements.
Next, we introduce works that implement the data processing architectures mentioned in Table 4.
Pipeline processing: Brinkmann et al. 15 presented results comparing the performance of Unix pipes with their own pipeline implementation (written in C++). The objective was to show that it is possible to create a new pipeline implementation that improves on traditional Unix pipes. Different pipelines were evaluated by executing the processes Hash + Compress + Encrypt, selecting the set of algorithms (Blake2sp, LZ4, and AES128, respectively) that showed the best performance in files processed per second, with file sizes in the order of megabytes. These algorithms were tested while varying the number of pipelines, and the results showed that as the number of pipelines increases, so does the performance in MB/s of data processed. The tests were executed on a single server, which limits scalability. Yanez et al. 60 implemented filters as processing units in end-to-end applications, proposing an architecture that allows the generation of pipelines for encryption, access control, generation of digital signatures, and object packaging. The security services are implemented in processing units as part of a pipeline, and they communicate through I/O interfaces such as the network, memory, and the file system. Unlike the work by Brinkmann et al., 15 this proposal allows the distribution of the processing filters across different infrastructures. However, the processing units are deployed only on the computers that run the two ends (producer and consumer), limiting the possibility of exploiting processing patterns in the rest of the infrastructure.
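For orientation, a Hash + Compress + Encrypt stage of this kind can be sketched in a few lines, under stated substitutions: stdlib Blake2s stands in for Blake2sp, zlib stands in for LZ4, and AES-128-GCM from the cryptography package provides the encryption step. This is not Brinkmann et al.'s implementation, only the shape of one pipeline stage:

```python
import hashlib
import os
import zlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def hash_compress_encrypt(block: bytes, key: bytes) -> tuple[bytes, bytes]:
    """One Hash + Compress + Encrypt pass over a data block."""
    digest = hashlib.blake2s(block).digest()  # Blake2s stands in for Blake2sp
    compressed = zlib.compress(block)         # zlib stands in for LZ4
    nonce = os.urandom(12)
    sealed = AESGCM(key).encrypt(nonce, compressed, None)  # AES-128 with a 16-byte key
    return digest, nonce + sealed

# Usage: digest, sealed = hash_compress_encrypt(data, os.urandom(16))
```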
SkyCDS 61 is a content delivery service based on components and subcomponents for the secure publication and subscription of data in a diversified way in the cloud. SkyCDS elements ensure different behaviors and workflows through the configuration of its components, each of which contains the behavior essential for the correct flow of data through the service. This solution does not exploit processing patterns and therefore does not take full advantage of the available infrastructure, and its components are encapsulated in heavy-weight virtual machines.

Microservices: The University of California Curation Center presented an open-source microservices infrastructure 57 that is used to manage the diverse digital collections of different university campuses. The basis for this microservices 62,63 approach is Unix pipelines, 64 which gave rise to the software architecture of computing pipelines. This proposal is based on a centralized system where each microservice includes the accessible web addresses of the other services it requires for its operation, and communication between them is carried out through RESTful requests. The solution lacks generality, since it applies software patterns to solve a single, specific problem.

Building blocks: Sacbe 49 is a modular software architecture that encapsulates software modules in abstract containers called building blocks (BBs). BBs use I/O communication interfaces such as networks, file systems, and memory for data storage, sharing, and security. This proposal was made for end-to-end cloud storage applications: the developer builds solutions by forming a workflow in which the data is processed through a pipeline whose final result is transmitted to a cloud storage service. Unlike Unix processing pipelines, Sacbe uses a cloud storage service instead of the local file system as a transparent storage service.

Pipeline processing on the cloud: A recent proposal 50 constructs data processing pipelines on the cloud based on a modular software architecture called the building block (BB). In this proposal, a set of BBs is used to encapsulate processing filters or pipelines. A parallel data processing scheme called Divide and Encapsulate segments the data content to be processed (|C|) into n segments (s1, … , sn), clones a selected process (filter, stage) n times, and assigns a segment to each process within a VC, improving the performance of the processing pipeline, as sketched below.
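A minimal sketch of the Divide and Encapsulate idea under stated simplifications: OS processes stand in for the VCs, and the filter is a placeholder rather than a real encoder:

```python
from concurrent.futures import ProcessPoolExecutor

def divide(content: bytes, n: int) -> list[bytes]:
    """Split content |C| into at most n contiguous segments s1, ..., sn."""
    step = -(-len(content) // n)  # ceiling division
    return [content[i:i + step] for i in range(0, len(content), step)]

def filter_stage(segment: bytes) -> bytes:
    # Stand-in for the encapsulated filter (e.g., an encoder) running in its own VC.
    return segment[::-1]

def divide_and_encapsulate(content: bytes, n: int) -> list[bytes]:
    """Clone the filter n times and assign one segment to each clone."""
    with ProcessPoolExecutor(max_workers=n) as pool:  # processes stand in for VCs
        return list(pool.map(filter_stage, divide(content, n)))
```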
Workflows: Workflows have been a traditional solution to create composite systems. 65 Some relevant tools to generate processing workflows across different infrastructures are: (a) Hyperflow 52 : a computing model, programming approach, and engine for generating scientific workflows; (b) Taverna 55,66,67 : an open-source workflow management system comprising a set of tools used to design and execute scientific workflows that aid in silico experimentation and bioinformatics work (Taverna is currently an incubating project at Apache*); (c) Swift 25,68 : a scripting language designed to generate parallel application programs that can run on multi-core processors, clusters, grids, clouds, and supercomputers; (d) Triana 22 : an environment that facilitates the generation of workflows through a graphical user interface (GUI) and an underlying system that allows integration with multiple services and communication interfaces; (e) Askalon 51 : a development and execution environment for grid computing systems that allows the dynamic composition of large numbers of distributed resources to perform computationally costly tasks.
The Kulla-RIV model arises from the analysis of the limitations of these architectures and proposals, enriching the vision of how to design a new virtual container-centric method for processing large volumes of data while taking advantage of processing patterns.

CONCLUSIONS AND FUTURE WORK
Kulla-RIV is a virtual container-centric construction method to compose parallel and distributed systems with fault tolerance and IV for processing large volumes of data. It represents an evolution of the Kulla model. 8 The Kulla-RIV construction model relies on the integration of the following software structures and components: (i) the Kulla-Block, which encapsulates a given application with its dependencies and environment requirements, ready to be interconnected with other Kulla structures and with data sources and sinks; (ii) the Kulla-Brick, which produces implicit parallelism by coupling a set of Kulla-Blocks in the form of parallel patterns, built transparently and without modifying the application code through methods such as Divide&Containerize, Pipe&Blocks, and Manager/Blocks; (iii) the Kulla-Box, which converts Kulla-Blocks, Kulla-Bricks, or a combination of these structures into an infrastructure-agnostic application by embedding them in lightweight and immutable VCs; (iv) the Brick-of-Kulla-Box, which interconnects Kulla-Boxes to build infrastructure-agnostic distributed and/or parallel applications; (v) a fault tolerance component that clones the deployed Kulla-Boxes and detects failing processes to redirect incoming traffic to an idle clone; and (vi) an IV component based on HMAC that detects alterations in the transportation of data between Kulla-Boxes.
These software structures and components enable developers and end users to implement continuous delivery and to build parallel patterns without altering or modifying the applications' code, improving their efficiency. All these structures operate under an ETL (extract-transform-load) model, where extraction and load can be mapped to different I/O interfaces (memory, file system, or network). This composition model gives Kulla-RIV structures a self-similarity property, meaning that complex systems can be built by coupling multiple such structures in a recursive manner.
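As a rough illustration of these two properties, the following hypothetical classes (not the published Kulla-RIV API) show extract/transform/load callables mapped to arbitrary I/O interfaces, and a composite that exposes the same interface as a single stage:

```python
from typing import Callable, Iterable

class Stage:
    """One Kulla-like processing unit operating under an extract-transform-load model."""
    def __init__(self, extract: Callable[[], bytes],
                 transform: Callable[[bytes], bytes],
                 load: Callable[[bytes], None]):
        # extract and load may wrap any I/O interface: memory, file system, or network.
        self.extract, self.transform, self.load = extract, transform, load

    def run(self) -> None:
        self.load(self.transform(self.extract()))

class Composite(Stage):
    """Self-similarity: a chain of stages exposes the same run() interface as one
    stage, so composites can be coupled recursively like any other structure."""
    def __init__(self, stages: Iterable[Stage]):
        self.stages = list(stages)

    def run(self) -> None:
        for stage in self.stages:
            stage.run()
```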
The feasibility of Kulla-RIV was tested by deploying real-world composite systems to solve use cases focused on processing medical and satellite imagery. These solutions were also implemented with a state-of-the-art workflow tool (Makeflow) for comparison. The evaluation revealed the effectiveness of Kulla-RIV in providing portability, flexibility, and efficiency to parallel and distributed systems. Kulla-RIV allows different security levels to be incorporated, providing RIV properties to complex systems in a transparent manner and letting developers choose the configuration required for their deployment environments. Experimental evaluations revealed that the transparent RIV schemes provided in Kulla-RIV produce less overhead than a Makeflow deployment using an external IV module. The RIV schemes of Kulla-RIV mask and detect intermittent/temporary failures in the applications (Kulla-Boxes) as well as alterations in the data exchanged between them. It is important to note that the security level of Kulla-RIV solutions depends on the developers' knowledge of the actual deployment environment to determine the appropriate level (low, medium, or high) for each system.
As future work, the following features are expected to be integrated into Kulla-RIV:
• The current Kulla-RIV implementation allows developers to define scale-in/out strategies at deployment time. The next version of Kulla-RIV aims to make these strategies dynamic, that is, to adjust them at execution time as well, by using automatic/semi-automatic monitoring of resources.
• A new Kulla-RIV version will allow developers to compose complex systems defined by code, by means of programmable scripts, as an alternative to the current GUI and the Kulla-Silo service.
• Kulla-RIV will incorporate multiple data replication methods into its reliability (R) scheme to improve storage utilization and performance.

FIGURE 5: Design of the proposed RIV schemes.
FIGURE 6: Main phases of the SM.
FIGURE 10: Breakdown of the response time of the ECDH main functions.
FIGURE 11: Service time of the generation of private and public values in the ECDH protocol when processing different file sizes.
FIGURE 12: Service time of hash algorithms used to produce the message authentication.
FIGURE 14: Response time comparison of security levels.
FIGURE 15: IV and RIV schemes overhead.
FIGURE 16: Breakdown of the composite system performance obtained using Kulla-RIV and Makeflow.
FIGURE: Taxonomy of the elements proposed by the Kulla construction model.

Algorithm 2. Execution of the applications defined in a DAG configuration file (DAGConfig) inside a Kulla-Box.
Require: DAGConf — configuration file containing the applications (Kulla-Blocks) to be deployed within a Kulla-Box.
Require: DAGMap[] — structure referencing the deployed applications as a DAG.
// Deploy Kulla-Blocks following the DAG structure.
DAGMap = produceMAP(DAGConf)
// Initialize the index pointing to a specific Kulla-Block (KB) in the DAG.
…
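The listing above is truncated in the extracted text. A hypothetical completion of the execution loop, running each Kulla-Block once its predecessors in the DAG have finished (produceMAP, the map layout, and the runner callback are assumptions, not the authors' code), might look like:

```python
from graphlib import TopologicalSorter

def execute_dag(dag_map: dict[str, set[str]], run_kullablock) -> None:
    """Run each Kulla-Block in an order compatible with the DAG's dependencies."""
    # dag_map maps each Kulla-Block to the set of Kulla-Blocks it depends on.
    for kb in TopologicalSorter(dag_map).static_order():
        run_kullablock(kb)  # deploy/execute the Kulla-Block inside the Kulla-Box
```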
TABLE: Risk matrix.

• Validation: verifies whether the secrets computed by the two entities (E_1 and E_2) are equal, where d_A and d_B are the private keys and Q_A and Q_B represent the public values of E_1 and E_2, respectively. While d_A and d_B are secret values, Q_A and Q_B are public. The shared secret d_A ⋅ Q_B (likewise d_B ⋅ Q_A) is obtained through a scalar multiplication, the most time-consuming operation on an elliptic curve, computed as a series of point additions (the group operation). The elliptic curves used for this test are described in Table 1.
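This equality check can be reproduced with the cryptography package (a minimal sketch; the curve shown here is illustrative, whereas Table 1 lists the curves actually evaluated):

```python
from cryptography.hazmat.primitives.asymmetric import ec

# Each entity generates a private key d and the public value Q = d * G.
d_A = ec.generate_private_key(ec.SECP256R1())
d_B = ec.generate_private_key(ec.SECP256R1())
Q_A, Q_B = d_A.public_key(), d_B.public_key()

# E_1 computes d_A * Q_B and E_2 computes d_B * Q_A; both scalar multiplications
# must yield the same shared secret d_A * d_B * G.
secret_1 = d_A.exchange(ec.ECDH(), Q_B)
secret_2 = d_B.exchange(ec.ECDH(), Q_A)
assert secret_1 == secret_2  # the validation step: the two secrets are equal
```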

TABLE 3: Average, median, and 90th percentile of the response time (RT) obtained using Kulla-RIV and Makeflow.
TABLE 4: Comparison of common architectures used for data processing.