A verified and optimized Stream X‐Machine testing method, with application to cloud service certification

The Stream X‐Machine (SXM) testing method provides strong and repeatable guarantees of functional correctness, up to a specification. These qualities make the method attractive for software certification, especially in the domain of brokered cloud services, where arbitrage seeks to substitute functionally equivalent services from alternative providers. However, practical obstacles include the difficulty in providing a correct specification, the translation of abstract paths into feasible concrete tests and the large size of generated test suites. We describe a novel SXM verification and testing method, which automatically checks specifications for completeness and determinism, prior to generating complete test suites with full grounding information. Three optimization steps achieve up to a 10‐fold reduction in the size of the test suite, removing infeasible and redundant tests. The method is backed by a set of tools to validate and verify the SXM specification, generate technology‐agnostic test suites and ground these in SOAP, REST or rich‐client service implementations. The method was initially validated using seven specifications, three cloud platforms and five grounding strategies.


INTRODUCTION
Software certification is the process of guaranteeing that a piece of software performs exactly according to its specification. Software certification is increasingly relevant in cloud computing, especially in multi-partner cloud service ecosystems [1], in which services are offered by many providers to many consumers, brokered by intermediaries who resell customized and repackaged service bundles. In a market where consumers can select from competing service offerings on the basis of functionality, performance and cost, cloud brokers [2] play a significant business role, offering intermediation (added-value services, unified access and identity management), aggregation (construction of composite services out of simple services, with secure data movement) and arbitrage (dynamic selection and substitution of services, to optimize cost or performance).

of 38 SIMONS AND LEFTICARU
Enhancing the role of the cloud broker as guarantor of quality assurance [3] was the premise behind the EU FP7 BrokerCloud project [4], which investigated methods and mechanisms for continuous quality assurance and optimization of brokered software services in the cloud. The project demonstrated a brokerage platform which could validate and test software services prior to uploading (certification at onboarding) [5], manage the service lifecycle from creation to decommissioning (lifecycle governance) [6], regulate the performance and availability of services (monitoring and adaptation) [7] and recommend alternative service bundles, according to customer preferences (preference-based arbitrage) [8]. The current article describes the novel verification and testing approach that was developed to assure functional quality and substitutability, as part of a service certification strategy.

Functional service certification
Certification of services includes testing their functional behaviour (to assure their correctness) and non-functional aspects (to assure their performance). While the monitoring and enforcement of service-level agreements (SLAs) has been investigated [9-11] as a kind of performance testing, the functional certification of service behaviour has received much less attention [12].
Our vision was to provide a web services aligned XML specification format that could be used by tools hosted at distributed locations in the cloud, at different stages in the service lifecycle. A specification should be amenable to checking for consistency and completeness; it should later be used as the basis for model-based test generation, yielding test suites that check an implementation for full conformance to the specification. The ability to test services for conformance not only guarantees that the service is implemented correctly but also ensures that any substituted service behaves in exactly the same expected way. By offering a common XML specification format, we aim to apply a gentle standardizing pressure, to promote the creation of compatible services that may be substituted by a broker during arbitrage.
Testing software services in the cloud is challenging for several reasons. Firstly, service-oriented architectures (SOA) are implemented using a diverse range of technologies, which include standard web service protocols (WSDL and SOAP), popular internet conventions (REST and JSON) or bespoke client-server streaming (rich-client desktops and AJAX). Finding a suitable common specification model and test generation approach that might suit all of these is extremely difficult. Secondly, the stateless nature of HTTP protocols makes tracking the states and transitions of web services hard, unless this information is exposed by the services through design-for-test conventions. Thirdly, cloud-based web services are highly complex in their handling of concurrent requests, sessions and multiple tenancies; however, because tenancy and sessions are handled by a different authorization layer of the software, we believe this may be treated independently of service functional behaviour.

A specification is developed by modelling the control states of the software as a finite state machine (FSM), whose transitions are functions acting upon memory. The memory is an arbitrary tuple of variables, of arbitrary types. The simple state-transition graph is augmented by guarded transitions, where the guards are sensitive to inputs and memory states. The functions may also update values stored in memory. The classic SXM testing method requires a deterministic, complete and minimal specification [14,15,18]. The fundamental testing method [19] generates not only positive paths that should execute in the software but also negative paths that should be prevented by the software, thereby providing strong guarantees of correctness [16,17].
However, there remain gaps between the theory and practice of using SXMs, which can constitute serious obstacles to their adoption. We enumerate a number of these as follows:
- The mathematical formalism is potentially opaque to software engineers, such that it may be hard for the engineer to provide a deterministic (complete and non-blocking) specification.
- The testing method elides over how abstract sequences are to be converted into executable test sequences, by assuming the existence of a hypothetical test function [15] that maps sequences onto test inputs.
- The test suite is generated by exploring the associated automaton, disregarding whether sequences are blocked by the guards, which leads to formal workarounds (controllability, input uniformity [20,21]) to deal with infeasible paths.
- Though generated test suites are highly discriminatory, they are still too large and could be optimized by making reasonable design-for-test assumptions; in particular, the use of extended characterization sequences in the fundamental testing method [19] to identify reached states multiplies the size of the test suite by a factor equal to the size of the characterization set.

Theoretical and practical innovations
Our solutions to these problems are a mix of theoretical innovation and practical engineering. They depend on making a complete model of the Stream X-Machine specification available to a set of verification and testing tools that can reason about every aspect of the specification. The tools complement each other in the way they detect faults at the appropriate stage in design. Altogether, we claim the following innovations:
- a wholly available specification model that exposes not only the abstract state-transition behaviour of the automaton but also the concrete input, output, precondition and effect (IOPE) behaviour of operations acting upon memory, to the tools that reason about the specification;
- a novel verification algorithm that, for each operation, computes all symbolic partitions of inputs and memory, eliminating inconsistent partitions by constraint satisfaction, prior to determining whether the operation is consistent (deterministic) and complete (non-blocking);
- a novel test generation algorithm that also determines path-feasibility during test generation, addressing concerns about machine controllability or input uniformity and state reachability via the r-state cover [20,21];
- a novel test optimization algorithm that eliminates infeasible paths and redundant sequences with proven trivial prefix cycles from the generated test suite, replaces extended characterization sequences with reliable state oracles and supports test compression via merged multi-objective tests; and
- a novel test grounding method that uses test input constraints from the specification to generate concrete test inputs and invocations, generating code via a combination of design patterns to create executable tests in a variety of implementation technologies.

Design-for-test conditions
To ensure that tested software services are aligned with the assumptions required by the testing method, service providers must implement a number of design-for-test criteria. These are not onerous and are frequently already part of internal logging mechanisms. A service must provide a reliable clean reset operation to put the service into its initial state and rebind its memory to the initial values; this is cheaper than re-initializing a cloud service from scratch and may already be provided as part of an abort mechanism. It must provide a reliable transaction log that records which responses were triggered in reply to which requests; this is to satisfy the output distinguishability [15,16] criterion directly, in cases where operations do not naturally produce outputs. Finally, it must provide a reliable state oracle that reports the abstract state in which the service finds itself; this replaces extended characterization sequences [19] to identify reached states.
The clean reset is needed, because tests generated by the W-method [19] always assume restarting in the initial state. We also considered the transition tour method [22] which avoids reset; however, there is no guarantee that such a tour exists in the specification and this method cannot assure correct state transfer. We considered the unique input-output (UIO) method [23] which identifies each state via one sequence, but this method cannot detect all faulty implementation states. The state oracle method to verify states was inspired by reliable state oracles [24] that avoid extended characterization sequences; we developed proofs to integrate this within existing SXM theory, and other optimizations depend on this.
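As an illustration, the three design-for-test obligations can be sketched as a thin wrapper around a service under test. This is a minimal Python sketch; all names (`reset`, `state_oracle`, the account-style memory) are illustrative, not the actual API required by the tool suite:

```python
class TestableService:
    """Sketch of the three design-for-test obligations: a clean reset,
    a transaction log pairing each response with its request, and a
    state oracle reporting the abstract control state."""
    INITIAL_MEMORY = {'balance': 0}   # hypothetical initial memory binding

    def __init__(self):
        self.reset()

    def reset(self):
        """Clean reset: restore the initial state and rebind initial memory."""
        self.state = 'closed'
        self.memory = dict(self.INITIAL_MEMORY)
        self.log = []

    def state_oracle(self):
        """Reliable state oracle: report the current abstract state."""
        return self.state

    def handle(self, request, response):
        """Transaction log: record which response answered which request."""
        self.log.append((request, response))

svc = TestableService()
svc.state = 'open'
svc.handle('withdraw', 'ok')
svc.reset()                       # a test driver calls this before each sequence
assert svc.state_oracle() == 'closed' and svc.log == []
```

In a real grounding, the reset and oracle would be extra service endpoints exposed only in a test deployment, while the transaction log is often already present as part of internal logging.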

BrokerCloud Verification and Testing Tool Suite
The novel algorithms described earlier are demonstrated in the Verification and Testing Tool Suite (VTTS), one of the outputs of the EU FP7 BrokerCloud project [4]. The tool suite is freely available to download under an Apache 2.0 license, and individual tools may be explored online, using sample service specifications [25]. Users may also upload their own specifications, developed according to the user guide. This demonstration via the Internet may also constitute an early example of Testing as a Service.
The tools all use an Internet-transmissible XML format for encoding Stream X-Machine specifications and generated test suites. The specification language combines finite state machines (FSMs) with a popular web services functional protocol known as input, output, precondition and effect (IOPE), which in our previous work was shown to be compatible with Stream X-Machines [26]. The XML specification language maps directly to a Java model, defined by a metamodel, which supports the model-based reasoning performed by the tools.
Checking of model specifications is performed by two tools: a validation tool, which checks the explicit and implicit state-transition behaviour against the designer's expectations, by exploring the state machine, and a verification tool, which checks the specification's operations for consistency and completeness, by a combination of symbolic model checking and constraint satisfaction. These tools annotate the specification with warnings if faults are discovered.
The test generation tool generates complete functional test suites, in a technology-agnostic XML format, but which contain full grounding information about invocation sequences, test inputs and corresponding outputs, triggered transitions and reached states. The test suites are optimized to remove infeasible and redundant tests and may be further compressed as multi-objective tests. For a typical service with 2-7 states and 10-50 transitions, these optimizations reduce the size of a generated test suite to as little as 10% of its original size, without loss in fault detection.
The test grounding tool translates these abstract test suites into three sample concrete web service execution formats, by model-based code generation. As a proof of concept, we provide groundings to Java web services, creating JUnit test drivers for JAX-WS (SOAP services), JAX-RS (REST services) and for plain Java. However, these are not the only possibilities. BrokerCloud industry partner SAP SE created a bespoke grounding for use with the Selenium test engine, executing a SAP OpenUI5 rich-client application on the HANA platform [27]. We later developed a further bespoke grounding for the SOAP UI test engine [28].

Overview of the article
The rest of this article is structured as follows. Section 2 justifies the XML specification format, based on its logical adequacy and its relevance to service-oriented standards. Section 3 presents the novel theoretical optimizations we make to the Stream X-Machine (SXM) test generation approach, and Section 4 formalizes the novel verification method used to check specifications. Section 5 describes the BrokerCloud Verification and Testing Tool Suite, and Section 6 develops a complete example of specifying, verifying and testing a cloud-based data warehouse and then summarizes similar results for seven different case studies on three cloud platforms. Section 7 contrasts our approach with related work in testing service-oriented architectures, and Section 8 concludes with an evaluation of its benefits and future research opportunities.

SPECIFICATION LANGUAGE AND METHOD
There are several considerations when choosing a format for testable specifications that aims to become a useful standard for the cloud. Firstly, the format must be adequate to capture the semantics of the system under test, so that all deviations from required system behaviour may be detected during design and testing. Secondly, the format should be open and portable, communicable via the Internet and amenable to automatic machine processing at distributed locations. Thirdly, the format should be reasonably close to the culture of the community that is expected to use it, to encourage adoption.

Adequacy and acceptability criteria
In the web services community, the de facto standards for web protocols are the XML-based WSDL interfaces with SOAP data wrappers and the simpler HTTP-based REST interfaces with JSON data packets. Alternatively, the semantic web community offers MSM (the Minimal Service Model), a minimal extension to other RDF ontologies (GR, SKOS and FOAF) based on linked data principles [29]. These formats specify required interfaces and input/output data types but fail to capture the underlying state-related semantics of services, as has been noted many times in the literature [26,30-33]. While the data types could be used to synthesize test inputs, there is no way of linking these to corresponding outputs, because there is no internal model of service behaviour.
Proposals for modelling service semantics have included UML state machines [34], OCL contracts [35], graph transformation rules [12] or dependency information [36]. The adoption of SAWSDL (semantic annotations for WSDL and XML [37]) catered to this trend, by supporting linkage from WSDL to arbitrary semantic documents. Two precursors that influenced our work [26,32] used this approach to link WSDL interfaces respectively to the SWRL and RIF-PRD XML rule dialects to express the semantics of operations abstractly, in terms of their inputs, outputs, preconditions and effects (a style known as IOPE). Both approaches noted the affinity between the IOPE format and EFSMs; our earlier work [26] showed how a Stream X-Machine could in principle be extrapolated from the domain partitioning effect of the preconditions. Stream X-Machines are Turing-complete, so are adequate to model any software system [13].
Our chosen specification format is therefore one that unites the EFSM and IOPE views. We model a Stream X-Machine as a combination of the EFSM automaton and the IOPE protocol, linked through common labelling of transitions and guarded branches. The whole specification is an XML document, for reasons of Internet transmission. This is particularly relevant in cloud brokerage scenarios, where the cloud broker will host a collection of specifications and offer these as templates to potential service providers or as guarantees to service consumers. All three cloud roles (broker, provider and consumer) [2] will want to certify services in distributed locations at different points in the service lifecycle [6].

Overview of a service specification
In the following, we specify a simple bank account service, as a motivating example. The starting point is to create the state machine of the service, as shown in Figure 1(a). This machine has two control states (closed and open), which represent a high-level abstraction over the service's memory, and various transitions, for example, withdraw/ok, withdraw/blocked and withdraw/error, indicating when particular operations are available. The absence of a transition indicates non-availability (interpreted as a null operation, rather than an error). The designer need only specify explicit transitions, for economy, and the tools later complete all missing transitions. The transition labelling indicates request/response pairs, where the same request may trigger a different response, in different memory state or input contexts.
The state machine is linked through this labelling to a more detailed protocol specification, which describes the memory and operations of the service. The memory declares a list of constants and variables, including the balance of the account. Operations have names such as deposit and withdraw, which correspond to requests submitted to the service. Each operation consists of a set of scenarios (cf. the UML sense of a single execution path), with labels such as withdraw/ok, withdraw/blocked and withdraw/error, which correspond to distinct request/response pairs. The scenario labels correspond exactly to the transition labels in the state machine. A scenario may specify an output, or an update to memory, or both. The IOPE protocol for the withdraw operation is shown in Figure 1(b). In terms of the IOPE protocol [26,32], all inputs and outputs for the operation are named explicitly (viz. amount and result), and each scenario is guarded by a precondition, which, if satisfied, triggers an effect (if . . . then . . . ). A precondition may examine any input or memory variable (respectively amount and balance), while an effect may bind any output or memory variable (respectively result and balance). An operation with only one scenario may have a trivial precondition true; otherwise, all the scenarios of an operation must have mutually exclusive and exhaustive preconditions. Where no effect is specified, or fewer than the available variables are rebound, this is interpreted as a no-change axiom (avoiding the logical frame problem).
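To make the scenario structure concrete, the withdraw operation might be sketched as follows. This is an illustrative Python rendering with assumed guards (the actual preconditions in Figure 1(b) may differ); what matters is that the three scenarios are mutually exclusive and exhaustive over every (amount, balance) combination:

```python
def withdraw(memory, amount):
    """Hypothetical withdraw operation with three mutually exclusive,
    exhaustive scenarios. Memory is a dict holding the balance; each
    scenario returns (label, result, new_memory), rebinding only what
    it changes (the no-change axiom)."""
    balance = memory['balance']
    if amount <= 0:
        return ('withdraw/error', 'error', memory)        # invalid request
    if amount > balance:
        return ('withdraw/blocked', 'blocked', memory)    # insufficient funds
    return ('withdraw/ok', 'ok', {**memory, 'balance': balance - amount})

label, result, mem = withdraw({'balance': 100}, 40)
assert (label, result, mem['balance']) == ('withdraw/ok', 'ok', 60)
assert withdraw({'balance': 100}, 200)[0] == 'withdraw/blocked'
assert withdraw({'balance': 100}, -5)[0] == 'withdraw/error'
```

The returned label corresponds to the request/response pair that also labels a transition in the state machine, which is exactly how the Machine and Protocol halves of the specification are connected.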
Another aspect that is relevant to automated testing is the inclusion of test input constraints to trigger each scenario (indicated by the test clause in Figure 1(b)). In principle, test inputs could be synthesized by analysing the precondition; however, there are cases where the constraint on inputs and memory is unsatisfiable (e.g. when the balance is zero and one scenario in Figure 1(b) is infeasible), so we prefer to allow the designer to suggest an input constraint that, under suitable memory conditions, will eventually trigger the scenario, as a way of limiting the search for test inputs.
The state machine in Figure 1(a) and the protocol fragment in Figure 1(b) are to be understood as visualizations of the XML specification. In principle, different tool vendors may provide their own editors for developing service specifications that render these in different visual styles.

Service specification model
A service specification is an XML document conforming to the XML schema ServiceSchema.xsd [25], visualized in Figure 2(a), where elements are indicated along with their required multiplicities. Each node in the figure corresponds to an XML element in the schema. Each element is also mapped to a corresponding class in a Java metamodel, which models the behaviour of that element. An XML specification may be unmarshalled directly to a Java model-instance, which is then capable of being analysed or manipulated by the various tools in the BrokerCloud Verification and Testing Tool Suite (Section 5). Further details of the mathematical and logical expression language are elaborated in Figure 2(b) and are described in Section 2.4.

Figure 2. Compositional structure of a service specification and expression language metamodel.

As shown in Figure 2(a), a Service consists of a Machine, describing the control logic, and a Protocol, describing the functional logic. The Machine consists of one or more States, each of which specifies zero to many Transitions exiting that state. Exactly one State is marked as the initial state. The transitions correspond to events handled explicitly by the machine. Where a machine is not fully specified, missing transitions are treated implicitly as trivial cycles returning to the same state. Each Transition refers to its source and target State and is labelled with the name of the handled event, styled as a request/response pair. The same binary names are used to label Scenarios (described in the following paragraph) and so connect the Machine and Protocol.
The Protocol consists of a Memory and one or more Operations. The Memory is a tuple of Constants and Variables, with an initial Binding of values to variables. The signature of each Operation is described in terms of its Inputs and its Outputs (or Failures), and its executable body is described by one or more Scenarios. Each Scenario represents a distinct branching path, guarded by a mutually exclusive and exhaustive Condition, eventually triggered by an input Binding constraint. The resulting Effect is a posterior binding of Outputs (or Failures) and memory-Variables to values. The schema allows some elements to be optional in context; model-consistency is automatically checked and co-references are resolved when the XML specification is unmarshalled to a Java model-instance.

This specification model is an extended finite state machine (EFSM). Because its transitions are atomic functions acting upon memory, it is also an X-Machine (XM) [38]. Because all atomic functions are triggered by inputs and memory, yielding outputs and updated memory, it is also a Stream X-Machine (SXM) [14,15], a class of X-Machine that is fully testable under known design-for-test conditions. The correspondence with an SXM is obtained deliberately through the equivalence between a Scenario (a mutually exclusive guarded branch of an operation) and an atomic function in SXM theory [14], in order to leverage the power of the associated complete functional testing method [15][16][17].
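The containment hierarchy of Figure 2(a) can be approximated as a set of plain record types. The following Python sketch (rather than the paper's Java metamodel) uses element names from the figure but omits Inputs, Outputs, Failures and Bindings for brevity; it is a simplification, not the actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Transition:
    name: str            # request/response label, shared with a Scenario
    source: str
    target: str

@dataclass
class State:
    name: str
    initial: bool = False
    transitions: List[Transition] = field(default_factory=list)

@dataclass
class Scenario:
    name: str            # same request/response label as a Transition
    condition: str       # precondition (guard) expression
    effect: str          # posterior binding of outputs and memory

@dataclass
class Operation:
    name: str
    scenarios: List[Scenario] = field(default_factory=list)

@dataclass
class Service:
    name: str
    states: Dict[str, State] = field(default_factory=dict)        # the Machine
    memory: Dict[str, object] = field(default_factory=dict)       # initial Binding
    operations: Dict[str, Operation] = field(default_factory=dict)  # the Protocol

# A fragment of the bank account example from Section 2.2:
bank = Service('account')
bank.states['closed'] = State('closed', initial=True,
    transitions=[Transition('open/ok', 'closed', 'open')])
assert bank.states['closed'].transitions[0].target == 'open'
```

An unmarshaller would populate such a model-instance from the XML document and resolve the co-references between Transition and Scenario labels.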

Formal expression language
The specification language includes a mathematical language for describing Boolean, arithmetical and set-theoretic operations, inspired by the widely known Z notation [39]. The language supports built-in types (e.g. Void, Boolean, Integer, Double and String) with all the usual primitive operations and arbitrary uninterpreted set-theoretic types (e.g. Document and Person) with equality. It supports Z's powerset, sequence, product and function types, whose type parameters are implicitly universally quantified, but offers no explicit universal and existential quantifiers, in order to limit the complexity of verification. The specification style is similar to writing a Z specification. So a phone book recording phone numbers against names might be modelled as a Map[String, Integer] with a suitable initial binding to a constant representing the empty map. The expression language is defined by a metamodel, shown in Figure 2(b); its built-in functions are listed in Table I. These functions have standard names and are polymorphic: for example, the function searchAt returns an element mapped by a key in a Map or indexed by an Integer in a List. All functions are side-effect free, such that insertAt and removeAt return a fresh copy of the data structure in which the structural changes are manifest. Assignment must be used explicitly to rebind a variable to a new value; assignment can be exact (equals) or bind the variable to a boundary value just inside an exclusive limit (lessThan and moreThan).
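The side-effect-free, polymorphic behaviour described above can be mimicked as follows. This Python sketch assumes only the behaviour stated in the text (fresh copies, key/index polymorphism), not the tool's actual semantics:

```python
def search_at(coll, key):
    """Polymorphic searchAt: the element mapped by a key (Map/dict)
    or indexed by an Integer (List/list)."""
    return coll[key]

def insert_at(coll, key, value):
    """Side-effect-free insertAt: returns a fresh copy with the change."""
    if isinstance(coll, dict):
        return {**coll, key: value}
    return coll[:key] + [value] + coll[key:]

def remove_at(coll, key):
    """Side-effect-free removeAt: the original structure is untouched."""
    if isinstance(coll, dict):
        return {k: v for k, v in coll.items() if k != key}
    return coll[:key] + coll[key + 1:]

# The phone book example: Map[String, Integer], initially bound elsewhere.
phone_book = {'alice': 12345}
updated = insert_at(phone_book, 'bob', 67890)
assert search_at(updated, 'bob') == 67890
assert phone_book == {'alice': 12345}   # original binding is unchanged
assert remove_at(updated, 'alice') == {'bob': 67890}
```

Keeping the functions pure in this way is what allows the tools to simulate expressions both forwards (execution) and backwards (constraint solving) without hidden state.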
An XML specification is unmarshalled as an instance of the earlier metamodel. The whole model specification is available to different reasoning tools. Not only can the state-transition graph be explored (cf. previous studies [33,40]), but the expression language can also be simulated, both forwards and backwards. The different metaclasses support specific reasoning strategies through their meta-methods, for example, Arithmetic expressions can be simulated forwards for execution or backwards for constraint solving, Comparison expressions can be treated as symbolic values amenable to symbolic subsumption, and Proposition compounds can be split and complemented, to obtain atomic expressions. It is this capability that enables the novel verification algorithm (Section 4) on which the novel test generation algorithm depends (Section 3). In particular, all operations may be proven deterministic and complete before testing; and the whole Stream X-Machine may be simulated, updating memory and observing the blocking effects of guards, during test generation, such that all generated paths are known to be feasible.

TEST GENERATION APPROACH
In this section, we present the optimized Stream X-Machine testing method. X-machines were first explored as an interesting class of EFSMs, whose transitions are processing functions acting upon memory [38]. They were later found to be well suited for specifying complex software systems, through an ability to model the control and data of a system separately [13]. A fully controllable and observable testing method was later developed for the variant known as the Stream X-Machine (SXM), which includes input and output streams as part of memory [14,15,21].
The earliest application of the SXM testing method guaranteed correct integration of processing functions that were assumed to be individually correct [14,15]. Later work used hierarchical SXMs to prove the correctness of complex systems recursively, using a divide-and-conquer approach [16]. Recent work has shown that it is possible to perform integration and component testing at the same time [41].

Stream X-Machine foundations
We introduce the following notation. For a finite alphabet Σ, Σ* represents the set of all finite sequences with members in Σ. For sequences a, b ∈ Σ*, ab denotes the concatenation of the two sequences a and b, and ε denotes the empty sequence. For sets of sequences U, V ⊆ Σ*, UV = {ab | a ∈ U, b ∈ V} denotes the concatenated product. The language Uⁿ consisting of sequences of finite length n is defined by U⁰ = {ε} and Uⁿ = Uⁿ⁻¹U, for n ≥ 1. The bounded Kleene star language consisting of all sequences up to length n is defined by U[n] = U⁰ ∪ U¹ ∪ … ∪ Uⁿ. We assume that the reader is otherwise familiar with finite automata and related concepts, such as reachable states, distinguishable states, the minimal automaton and the accepted language (see Ipate [18] for a brief introduction).
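For small alphabets, the concatenated product and bounded Kleene star can be computed directly. A minimal illustrative sketch (sequences modelled as tuples; not part of the paper's tooling):

```python
def concat_product(U, V):
    """Concatenated product UV = {ab | a in U, b in V} over sequences."""
    return {a + b for a in U for b in V}

def power(U, n):
    """U^n: U^0 = {()} (the empty sequence), U^n = U^(n-1) U."""
    result = {()}
    for _ in range(n):
        result = concat_product(result, U)
    return result

def bounded_star(U, n):
    """Bounded Kleene star U[n] = U^0 ∪ U^1 ∪ ... ∪ U^n."""
    out = set()
    for k in range(n + 1):
        out |= power(U, k)
    return out

sigma = {('a',), ('b',)}   # alphabet symbols as length-1 sequences
assert power(sigma, 2) == {('a','a'), ('a','b'), ('b','a'), ('b','b')}
assert len(bounded_star(sigma, 2)) == 1 + 2 + 4   # ε, two singles, four pairs
```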
A Stream X-Machine (SXM) differs from a simple FSM, in that it has internal data storage or memory (a tuple of variables), and its transitions are labelled by atomic processing functions, whose execution may be guarded, instead of simple input/output symbols.

Definition 1 (SXM)
A Stream X-Machine (SXM) is a tuple Z = (Σ, Γ, Q, M, Φ, F, q₀, m₀) in which: Σ is the finite input alphabet; Γ is the finite output alphabet; Q is the finite set of states; M is a (possibly infinite) set called memory; Φ is a finite set of distinct processing functions, where every φ ∈ Φ is a non-empty (partial) function of the type φ: M × Σ → Γ × M, and Φ is also known as the type of the SXM; F is the (partial) next-state function of the type F: Q × Φ → Q; q₀ ∈ Q is the initial state and m₀ ∈ M is the initial memory value.
In the rest of this paper, we mostly consider DSXMs (deterministic SXMs). A sequence p ∈ Φ* of processing functions induces a function ||p|| that shows the correspondence between a (memory, input sequence) pair and the (output sequence, memory) pair produced by the application, in turn, of the processing functions in the sequence p. A computation of the SXM Z represents the traversal of all transition sequences in the associated automaton A_Z and the application of all the corresponding processing functions. These are applied successively, consuming inputs, possibly updating memory and producing outputs. The correspondence between the input sequence and the output produced gives rise to the relation (or function) computed by Z.
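Such a computation can be sketched as a forward simulation. The following illustrative Python assumes a deterministic machine; partial processing functions are modelled as ordinary functions that return None when they refuse a (memory, input) pair, and the account-style functions at the bottom are hypothetical examples:

```python
def run_sxm(F, phi, q0, m0, inputs):
    """Simulate a deterministic SXM: at each step, find the unique processing
    function enabled from the current state on (memory, input); return the
    output sequence and final memory, or None if the machine blocks."""
    q, m, outputs = q0, m0, []
    for x in inputs:
        fired = None
        for f in F.get(q, {}):        # functions labelling transitions out of q
            res = phi[f](m, x)        # partial: (output, memory') or None
            if res is not None:
                fired, res_val = f, res
                break                 # determinism: at most one can fire
        if fired is None:
            return None               # the machine blocks on this input
        out, m = res_val
        outputs.append(out)
        q = F[q][fired]
    return outputs, m

# Tiny account-like SXM: memory is the balance, inputs are (op, amount) pairs.
phi = {
    'deposit/ok':  lambda m, x: ('ok', m + x[1]) if x[0] == 'deposit' else None,
    'withdraw/ok': lambda m, x: ('ok', m - x[1]) if x[0] == 'withdraw' and x[1] <= m else None,
}
F = {'open': {'deposit/ok': 'open', 'withdraw/ok': 'open'}}
outs, mem = run_sxm(F, phi, 'open', 0, [('deposit', 5), ('withdraw', 3)])
assert outs == ['ok', 'ok'] and mem == 2
```

Note that this toy machine is not completely defined: withdrawing more than the balance makes every function refuse, so the run blocks, which motivates the completion construction discussed below.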

Definition 4
The relation computed by SXM Z, f_Z: Σ* ↔ Γ*, is defined by (s, g) ∈ f_Z if there exist p ∈ Φ* and m ∈ M such that (q₀, p) ∈ dom F* and ||p||(m₀, s) = (g, m). We say that Z computes f_Z. Note that for a DSXM Z, the relation f_Z is a function, that is, f_Z: Σ* → Γ*.

Definition 5
An SXM Z is said to be completely defined if dom f_Z = Σ*.

In other words, an SXM is completely defined if every sequence of inputs can be processed by at least one sequence of functions accepted by the associated automaton. An SXM that refuses some inputs can always be transformed into a completely defined one by adding a distinct error output (not in the output alphabet) and completing the automaton with self-looping transitions (ignored events), or alternatively, transitions to an extra error-state (fatal errors).
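The self-loop variant of this completion can be sketched as follows. This is an illustrative Python construction under the same conventions as before (partial functions return None when they refuse); the names are hypothetical:

```python
ERROR = 'error'   # distinct output symbol, not in the original output alphabet

def complete_with_self_loops(F, phi, states):
    """Complete an SXM: for every state, add a catch-all function that fires
    exactly when no explicit function does, emits the error output and leaves
    memory unchanged (a self-looping 'ignored event' transition)."""
    def make_default(q):
        explicit = list(F.get(q, {}))
        def default(m, x, _fs=explicit):
            # fires only if every explicit function refuses (m, x)
            if all(phi[f](m, x) is None for f in _fs):
                return (ERROR, m)
            return None
        return default

    F2 = {q: dict(F.get(q, {})) for q in states}
    phi2 = dict(phi)
    for q in states:
        name = f'{q}/ignored'
        phi2[name] = make_default(q)
        F2[q][name] = q   # self-loop back to the same state
    return F2, phi2

phi = {'withdraw/ok': lambda m, x: ('ok', m - x) if x <= m else None}
F = {'open': {'withdraw/ok': 'open'}}
F2, phi2 = complete_with_self_loops(F, phi, ['open'])
assert phi2['open/ignored'](5, 10) == (ERROR, 5)   # refused request now yields error
assert phi2['open/ignored'](5, 3) is None          # explicit function handles it instead
```

The alternative mentioned in the text, routing refused inputs to an extra error-state, differs only in making the added transitions target that state instead of self-looping.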

The W-method for testing finite automata
The fundamental DSXM testing method [42] is an adaptation of Chow's W-method for testing finite automata (FA) [19]. This assumes naturally that the specification and implementation have the same input alphabet Σ, and testing seeks to ensure that every path in the specification exists in the implementation (both accepted and refused paths, for robust positive and negative testing). Apart from this, whereas the specification is minimal with n > 0 states, the implementation may contain n′ ≥ n estimated states. Other important notions include the following: a state cover, a set V consisting of sequences that reach every state of the machine, where V is either chosen or determined by exploring the automaton; a transition cover, a set T consisting of sequences that reach every state of the automaton and then exercise every transition in the alphabet from that state, where the transition cover can be computed by T = V ∪ VΣ; and a characterization set, usually labelled W, that distinguishes between every pair of states, according to whether sequences from W are accepted or refused from that state.
Given the earlier definitions, test suites may be defined according to the W-method formula TS = V Σ[n′ − n + 1] W. The idea behind this is that the product V Σ[n′ − n + 1] will exercise at least the transition cover (where n′ = n, this is equal to V Σ[1] = V ∪ VΣ = T), and the final product with W ensures that the implementation reaches the same state as expected by the specification. In case the implementation contains n′ − n > 0 extra states, longer test sequences up to length n′ − n ensure that these states are reached and behave like duplicates of the expected states.
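The formula above can be computed mechanically for a small automaton. An illustrative sketch (the two-state open/close machine and its covers are made up for the example):

```python
def cat(U, V):
    """Concatenated product of two sets of sequences (tuples)."""
    return {a + b for a in U for b in V}

def bounded_star(sigma, n):
    """Sigma[n]: all sequences over sigma of length 0..n."""
    out, layer = {()}, {()}
    for _ in range(n):
        layer = cat(layer, sigma)
        out |= layer
    return out

def w_method_suite(V, sigma, W, n_spec, n_impl):
    """Chow's formula TS = V Sigma[n' - n + 1] W, assuming n' >= n."""
    return cat(cat(V, bounded_star(sigma, n_impl - n_spec + 1)), W)

# Two-state example: V reaches both states, W separates them.
sigma = {('open',), ('close',)}
V = {(), ('open',)}       # state cover: ε reaches closed, 'open' reaches open
W = {('close',)}          # 'close' is accepted only in the open state
suite = w_method_suite(V, sigma, W, n_spec=2, n_impl=2)
# with n' = n the middle factor is Sigma[1] = {ε} ∪ Sigma
assert ('open', 'close') in suite and ('open', 'open', 'close') in suite
```

With n′ = n this yields the transition cover followed by W; raising n_impl lengthens the middle factor to hunt for duplicated implementation states.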
The W-method requires a reliable reset in the implementation that places the system in its initial state, before each test sequence is executed, but requires no direct state inspection, relying instead on W to identify states. It is robust in identifying correct and incorrect paths and states. The transition tour method [22] avoids reset but cannot guarantee correct state transfer (and the tour may not exist). Similarly, the unique input-output (UIO) method [23] cannot reliably test for unwanted states but is less expensive than W, using single state-identification sequences.

Adaptation for DSXMs using design-for-test conditions
Because the transitions of a DSXM represent (partial) functions φ ∈ Φ, rather than inputs σ ∈ Σ, the adapted W-method constructs sequences of atomic processing functions rather than sequences of inputs. Each (partial) processing function φ ∈ Φ may have a restriction on input/memory that prevents it from firing unconditionally. Therefore, test sequences which cover all states and transitions of the associated automaton might not reach all states or transitions in the DSXM, due to the blocking effects of guards. Certain extreme settings of memory may be hard to reach, leading to theoretical treatments of the controllability or input completeness of the SXM [15,43]. Similarly, the state cover of the automaton must be replaced by a realizable r-state cover in the DSXM [20,21]. Likewise, states which are distinguishable using W in the automaton may require function sequences that can never be applied, due to blocking; W must be replaced by realizable separating sets [21]. To ensure that tested systems are controllable and observable, it is therefore necessary to adopt a number of design-for-test conditions.

The input completeness (or controllability) of a DSXM assures that any sequence of processing functions in the associated automaton can be triggered by suitable input sequences. This property is rather strict; most real-world systems are not by default input complete. Testable systems must admit special inputs, used only during testing, that circumvent the blocking effect of guards and drive the system directly into extreme memory states. Recent work has relaxed controllability only slightly, replacing this by input uniformity [20], a property that requires concrete inputs to be found, one at a time, for each processing function in a sequence. The output distinguishability property assures that it is possible to determine which atomic processing function was applied, from the output produced in response to any given input. This is an important test oracle, serving to determine whether the correct or incorrect function was triggered in the implementation. However, many real-world systems do not produce distinguishing outputs for every action. Testable systems must in practice instrument their operations to produce extra output symbols, where needed.

Fundamental DSXM testing theorem
Given a finite automaton (FA) specification A and a class of implementations C, a test set is a set of input sequences that, when applied to any implementation A′ in the class C, will detect any response in A′ that does not conform to the response specified by A. We show in the following discussion how this definition generalizes to DSXMs. In the following, L(A) is the language accepted by the automaton A, A_Z is the associated automaton of the DSXM Z and L(A_Z) is the language accepted by this automaton.

Definition 8
Let A be a deterministic FA and C a set of deterministic FAs having the same input alphabet as A. A test set of A with respect to C is a finite set of input sequences that produces identical results when applied to A and to any A′ ∈ C only if A and A′ accept the same language.
Similarly, for DSXM, a test set is a finite set of input sequences constructed from the DSXM specification that produces identical results when applied to the specification and the implementation only if the specification and the implementation compute identical functions.

Definition 9
Let Z be a DSXM and C a set of DSXMs having the same input alphabet Σ and output alphabet Γ as Z.

Definition 10
Two DSXMs Z and Z′ are called weak testing compatible if they have identical input alphabets, output alphabets, memory sets and initial memory values. Two weak testing compatible DSXMs are called testing compatible if they have identical types, namely, their corresponding sets of processing functions Φ and Φ′ are identical.
The W-method generates sequences of inputs (for the FA) or processing functions (for the DSXM) from the specification. However, in DSXM testing, these abstract sequences must first be converted into concrete test inputs (requests with their actual parameters), in order to test implementations. For this, we assume the existence of a test function that translates sequences of processing functions into sequences of inputs.

Definition 11
A test function of an SXM Z is a function t : Φ* → Σ* that satisfies the following conditions: (1) t(ε) = ε; (2) let τ = φ₁ … φₖ ∈ Φ*, k ≥ 1; suppose φ₁ … φₖ₋₁ ∈ L(A_Z) and there exist σ₁, …, σₖ ∈ Σ, γ₁, …, γₖ ∈ Γ and m₁, …, mₖ ∈ M such that, starting from the initial memory m₀, each φᵢ processes (σᵢ, mᵢ₋₁) producing (γᵢ, mᵢ); then t(τ) = σ₁ … σₖ for some σ₁ … σₖ that satisfy this condition; (3) otherwise, t(τ) = t(φ₁ … φₖ₋₁). The test function associates with a sequence of functions a sequence of inputs that exercises the longest prefix φ₁ … φₖ of φ₁ … φₙ that is a path in the SXM and, if k < n, also attempts φₖ₊₁, the function that follows after this prefix. If the type Φ is input complete, the input sequence σ₁, …, σₙ will always exist. Otherwise, the sequences produced may not all be feasible in the DSXM and consequently they cannot all be mapped into actual input values. In this case, as Definition 11 case (3) specifies, only the longest feasible prefix will be mapped into actual input values. Furthermore, the test function of a DSXM is not uniquely determined; many suitable input sequences may exist. Notwithstanding how the test function is to be constructed (an issue that is elided in the SXM-testing literature, and which we address in Sections 2 and 5), the earlier considerations lead to the expression of the fundamental DSXM testing theorem. This is the basis for a number of important results, such as the guarantee of correct integration in the divide-and-conquer approach [15,16].
Theorem 1 [42] Let A be a deterministic FA having input alphabet Σ, n the number of states of A, n′ ≥ n, and C_n′ the set of deterministic FAs having input alphabet Σ whose number of states does not exceed n′. If T is a transition cover and W a characterization set of A, then Y_{n′−n} = T·Σ[n′−n]·W is a test set of A with respect to C_n′.
Theorem 2 [15,16] Let Z be a DSXM having type Φ input complete and output distinguishable, and C a set of DSXMs testing compatible with Z. If t is a test function of Z and Y ⊆ Φ* a test set of A_Z with respect to the associated automata of the machines in C, then t(Y) is a test set of Z with respect to C.

DSXM testing improvements and optimizations
Our optimized test generation method makes slightly different assumptions. We do not require strong input completeness to control the DSXM but instead require the test generator to produce only feasible sequences that eventually cover all transitions and states, using a test function that is built into the specification. This supports testing real-world systems that are not input complete. Similarly, we finesse the output distinguishability criterion and the state separation criterion through different design-for-test assumptions about the system under test (SUT):

1. The SUT has a reliable reset function r that is guaranteed to place the SUT in its initial state and memory bindings. We do not include r in Φ but assume it is executed before every test sequence. This is the same requirement as for the W-method.
2. The SUT has a reliable log function g, which reports which processing function in Φ was triggered in response to the most recent request, without modifying the SUT. We do not include g in Φ but assume it may be executed after any sequence. This implementation detail ensures output distinguishability via a side-channel, so does not interfere with the SUT's natural outputs.
3. The SUT has a reliable state oracle function s, which reports the current state of the SUT, without modifying the SUT. We do not include s in Φ but assume it may be executed after any sequence. This implementation detail ensures state separation without need for the W state characterization set, so does not interfere with the SUT's state after executing a given sequence of processing functions.
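The three observer hooks can be sketched as a thin wrapper around a toy service (an illustration only; the class and method names are ours, not the paper's tool API):

```python
# Illustrative SUT exposing the design-for-test hooks: reset r, log g, oracle s.
class InstrumentedCounter:
    """Toy service: a counter with abstract states 'empty' and 'active'."""
    def __init__(self):
        self.reset()

    def reset(self):              # r: reliable reset to initial state and memory
        self.count, self.last_func = 0, None

    def request(self, name):      # the SUT's ordinary request interface
        if name == 'inc':
            self.count += 1
            self.last_func = 'inc'
        elif name == 'dec' and self.count > 0:
            self.count -= 1
            self.last_func = 'dec'
        else:
            self.last_func = 'ignored'   # completed-automaton nullop

    def log(self):                # g: which processing function fired last
        return self.last_func

    def state(self):              # s: state oracle, side-effect free
        return 'empty' if self.count == 0 else 'active'
```

The `log` and `state` observers read instrumentation without mutating the counter, matching the requirement that g and s may be interleaved with test steps without disturbing the SUT.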

These features can be added easily to service-oriented systems, which often already supply the additional observers as part of their internal diagnostic systems. The observer functions g, s can be invoked immediately after any transition has been fired, to observe respectively which φ ∈ Φ was triggered and which state q ∈ Q was reached. Where unique test sequences are built systematically, these observers need only be checked at the end of each sequence, because all prefix sequences will already have been checked. However, it is possible to interleave functions and observations when merging multi-objective test sequences (Section 5). If the test function t(τ) is deterministic, the memory of the SUT is uniquely determined by the sequence of processing functions φ ∈ Φ that were exercised, so the observers g, s are sufficient to confirm the memory bindings.
Our improved testing method not only tests for correct integration of component functions [15,16] but also performs equivalence-partition testing. The linked verification method (Section 4) ensures, by symbolic reasoning, that every possible input partition is handled by exactly one operation branch (viz. DSXM processing function), which must be tested at least once. Test input constraints, assumed from domain knowledge, cause each branch eventually to be executed, discharging the assumption through test coverage reports.

Test generation replacing W by a state oracle.
As a precursor to later optimizations (such as test compression by merging), we replace the characterization set W by an abstract state oracle function s. The important point is that this preserves the existing test properties of the W-method, as justified by the following lemma.

Lemma 1
If the SUT satisfies the design-for-test conditions, having a reliable state oracle s, and V is a feasible r-state cover for the DSXM Z that has n states, and the SUT has at most n′ states, then the set S = Wtest(Φ, V, {s}, n′, n) is a test set, where S ⊆ Φ*·{s}.

Proof
Intuitively, the state oracle function s replaces the characterization set W in the set Wtest(Φ, V, W, n′, n) = V·Φ[n′−n+1]·W, which is a test set for the associated automaton A_Z, according to Theorem 1. The set becomes Wtest(Φ, V, {s}, n′, n) = V·Φ[n′−n+1]·{s}, and it is still a test set for the automaton.
Note that the product with the singleton set {s} does not multiply the number of generated sequences. The size of the generated test set is therefore reduced by a factor of card(W), with respect to the set generated by the original W-method. The earlier treatment must also be mapped via a test function to yield suitable test inputs to drive the DSXM. For this, we require an extended version of the test function that also verifies states. In the following, φᵢ ∈ Φ are processing functions of the DSXM Z, and s ∉ Φ represents the state oracle:
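The card(W) saving can be illustrated numerically (a toy sketch; the CRUD function alphabet, state cover V and characterization set W below are invented for the comparison):

```python
# Compare |V . Phi[n'-n+1] . W| with |V . Phi[n'-n+1] . {s}|, counting
# sequences with multiplicity.
from itertools import product

def phi_upto(phi, k):
    """Phi[k]: all function sequences of length 0..k."""
    seqs = [()]
    for n in range(1, k + 1):
        seqs += list(product(phi, repeat=n))
    return seqs

PHI = ('create', 'read', 'update', 'delete')   # hypothetical function alphabet
V = [(), ('create',)]                          # hypothetical state cover
W = [('read',), ('delete',)]                   # hypothetical 2-element char. set

mid = phi_upto(PHI, 1)                         # n' = n, so the factor is Phi[1]
with_W = [v + m + w for v in V for m in mid for w in W]
with_s = [v + m + ('s',) for v in V for m in mid]   # product with singleton {s}
```

The oracle-based suite is smaller by exactly the factor card(W) = 2, since the singleton product adds one observation step to each sequence instead of fanning out.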

Definition 12
An extended test function for a DSXM Z and a state oracle s is a function t′ : Φ*·{s} → Σ*·Q that satisfies the following condition: t′(φ₁ … φₙ·s) = t(φ₁ … φₙ)·q, where t is a test function for Z and q ∈ Q is the state reached in A_Z after processing t(φ₁ … φₙ).

Lemma 2
If Wtest(Φ, V, {s}, n′, n) is a test set of the associated automaton A_Z, then t′(Wtest(Φ, V, {s}, n′, n)) will be a test set of the DSXM Z.

Proof
By the equivalence presented in Definition 12, the conditions of Theorem 2 hold and the corresponding set t′(Wtest(Φ, V, {s}, n′, n)) will be a test set of Z, where each sequence φ₁ … φₙ·s from this set will be used for testing.

Elimination of redundant and infeasible sequences.
An important property of the sets of exploratory paths generated by our algorithm is that they are prefix closed, that is, all the prefixes of a given path exploring from a given state are also in the same test set. They are also path complete, that is, all possible alternative paths of a given length are explored from each state, before optimization. These properties apply to the paths explored, not necessarily to the state cover prefix.

Definition 14
We say that a language L ⊆ Φ* is path complete if every non-empty τ ∈ L has an immediate prefix τ′ in L and, for every φ ∈ Φ, the alternative sequence τ′·φ is also included in L; that is, ∀τ ∈ L with τ = φ₁ … φₙ₋₁φₙ, n > 0, ∀φ ∈ Φ: (τ′·φ = φ₁ … φₙ₋₁·φ) ∈ L.

Remark 1
The language Φ[n] is prefix closed and path complete by construction. This follows from the definitions of Φ[n] and Φⁿ, which build the result by breadth-first exploration. The language Φ[n] also has the property that if τ = τ₁·φ·τ₂ ∈ Φ[n], then τ′ = τ₁·τ₂ ∈ Φ[n].
The first optimization is to remove redundant test sequences, which test properties that have already been confirmed by other test sequences in the same test set. Consider that an automaton A_Z which blocks for some events in some states can always be completed by adding explicit trivial transitions representing the ignored events (service-oriented systems are designed this way, to avoid blocking). Such transitions denote trivial functions (nullops), having no effect upon the SUT, and are circular, returning to the same state.

Proposition 1
Suppose that the automaton A_Z has been completed, that τ₁ ∈ Φ* and τ₂ ≠ ε are sequences and that the test set S ⊆ V·Φ[n] contains the path τ = τ₁·φ·τ₂, where φ is a trivial transition. Then a shorter path τ′ = τ₁·τ₂ ∈ S will test the same properties as the original path τ. We say that τ is a redundant sequence, which may be deleted from the test set without loss of coverage. That is, if S is a test set of A_Z, then S \ {τ} is also a test set.

Proof
A trivial φ cannot occur in the minimal state cover V but only in exploratory sequences generated by Φ[n]. The prefix sequences τ₁ and τ₁·φ must leave the SUT in the identical state and memory configuration, because φ is a trivial transition, which by definition has no effect on the SUT. Because all exploratory paths are prefix closed, the triviality of φ will have been determined by testing τ₁·φ, and the condition τ₂ ≠ ε ensures that this sequence is not deleted. Because the exploratory paths are path complete and the test set includes τ, it will also already include τ′, by the property derived in Remark 1.
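The pruning rule of Proposition 1 can be sketched as a filter over exploratory paths (the trivial-function label and the example paths are invented for illustration):

```python
# Sketch of Proposition 1: drop paths that pass through a trivial (ignored)
# self-loop with a non-empty tail; coverage survives via prefixes and the
# path-complete siblings that skip the nullop.
TRIVIAL = {'nullop'}

def remove_redundant(test_set):
    """Delete any path containing a trivial function before its last step."""
    kept = set()
    for path in test_set:
        redundant = any(f in TRIVIAL and i < len(path) - 1
                        for i, f in enumerate(path))
        if not redundant:
            kept.add(path)
    return kept

pruned = remove_redundant({('inc', 'nullop'),
                           ('inc', 'nullop', 'dec'),
                           ('inc', 'dec')})
```

Paths ending in the nullop are retained, since they are the tests that confirm the transition is indeed trivial; only paths that continue past it are deleted.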
The second optimization is to remove infeasible test sequences, which cannot be executed in the SUT. Previously, the classic DSXM testing method either forced all sequences to be feasible (by input completeness) or truncated test sequences after their maximally feasible prefix. Our test generator deletes infeasible sequences, on the basis that prefix closed test sets include the maximally feasible prefix, and path complete test sets supply witnesses for blocked paths by other means.

Proposition 2
Suppose that the memory and input type M × Σ is exhaustively partitioned into n equivalence classes (mᵢ, σᵢ), 1 ≤ i ≤ n, and that an operation Op = {φ₁ … φₖ}, k ≤ n, consists of a set of related processing functions, handling similar requests, such that each equivalence class is accepted by exactly one φⱼ ∈ Op.
Then, if some path τ = τ₁·φᵢ·τ₂ ∈ S is found to be infeasible when φᵢ blocks at some (mᵢ, σᵢ), there will always be exactly one feasible φⱼ that can execute instead. Therefore, if S ⊆ V·Φ[n] is a test set, then S \ {τ} is a test set.

Proof
Blocking cannot occur in the feasible state cover V but only in the exploratory sequences generated by Φ[n]. The different φᵢ ∈ Op are mutually exclusive and exhaustive, by definition. If the path τ = τ₁·φᵢ·τ₂ ∈ S blocks at φᵢ, then by virtue of prefix closure, a maximally feasible prefix τ₁ exists and will be tested. By virtue of path completeness and the property derived in Remark 1, some other path τ₁·φⱼ·τ₂ will also exist, where the accepted prefix τ₁·φⱼ is a witness to the blocking prefix τ₁·φᵢ, by mutual exclusion.

Merging test sequences with shared test objectives.
The third optimization compresses the test suite. Shorter sequences are merged with longer sequences of which they are a prefix. The test objectives of a path are normally checked by assertions, added after all execution steps. Merged sequences, which are multi-objective tests, may also have assertion checks interleaved with the path's execution steps. We define an assertion check α as a side-effect-free function that inspects the result of the triple (q, φ, γ), where q ∈ Q is returned by the state oracle s, φ ∈ Φ is the last triggered function returned by the log g and γ ∈ Γ is the output of the last φ. We call any path ending with an assertion check a checked path, and any test set consisting of checked paths a checked test set.

Proposition 3
Suppose that assertions α are side-effect free and may be added onto the end of any sequence in S. Suppose that S′ is a checked test set, which contains the checked path τ₁ = φ₁ … φₖ … φₙ·αₙ and the checked prefix τ₂ = φ₁ … φₖ·αₖ. If a merged sequence τ₃ = φ₁ … φₖ·αₖ … φₙ·αₙ is included in S′, this meets the test objectives of τ₁, τ₂, which may be removed. That is, if S′ ⊇ {τ₁, τ₂} is a checked test set, then S′ ∪ {τ₃} \ {τ₁, τ₂} is also a checked test set.

Proof
Because assertions α are side-effect free, the medial insertion of αₖ into τ₃ with respect to τ₁ has no effect on the state or memory of the SUT, so the check αₙ will yield the same result in τ₃ as it did in τ₁. The sequence τ₂ is a prefix of τ₃. Therefore, τ₃ checks the objectives of both τ₁ and τ₂.
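The merging step can be sketched as follows (a hedged outline; the path tuples and the `'assert'` marker standing in for an interleaved check α are our own encoding):

```python
# Sketch of the third optimization: absorb each path that is a prefix of a
# longer path, interleaving its assertion check at the point where it ended.
def merge_prefixes(paths):
    """Keep only maximal paths; absorbed prefixes become interleaved checks."""
    maximal = []
    for p in sorted(paths, key=len, reverse=True):
        if not any(m[:len(p)] == p for m in maximal):
            maximal.append(p)
    merged = []
    for m in maximal:
        check_points = {len(p) for p in paths if m[:len(p)] == p}
        steps = []
        for i, f in enumerate(m, start=1):
            steps.append(f)
            if i in check_points:
                steps.append('assert')     # alpha_i: check the (q, phi, gamma) triple
        merged.append(tuple(steps))
    return merged

merged = merge_prefixes([('a', 'b'), ('a', 'b', 'c', 'd')])
```

The two-step prefix is dropped as a separate test and survives as a mid-sequence check inside the four-step path, which is safe precisely because the checks are side-effect free.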
The various optimizations described earlier achieve the significant test set size reductions reported in Section 5. The test compression optimization could not work without replacing W, because products with W tend to produce sequences that are not prefixes of other sequences, whereas prefix closed exploratory sequences contain many such prefixes. After eliminating redundant sequences with trivial prefix cycles, exploratory sequences are still prefix closed, but after removing infeasible sequences, they are no longer path complete.

VERIFICATION APPROACH
In this section, we present a novel verification approach for Stream X-Machines, based on the notion of making the IOPE protocol of the SXM explicit. Verification ensures that the DSXM protocol is correct, in that its operations are complete (non-blocking), always having a response for any input/memory combination, and deterministic, having a single response for any such input/memory combination.
Apart from the fact that it is desirable to have a correct specification, verification ensures that the DSXM specification meets the necessary conditions for applying the DSXM testing method (Section 3), which expects a deterministic and complete SXM. Furthermore, specific test optimizations also require this property, to ensure that removing infeasible paths still leaves a witness to blocked paths, based on guaranteed mutual exclusion of certain functions.

Protocol specification
The associated protocol P_Z for a DSXM Z is an abstraction over its operations and memory states. The protocol is related to, but distinct from, the associated automaton A_Z. Whereas A_Z dictates through its explicit transitions when an operation is valid (invalid requests are ignored), P_Z dictates how valid requests should respond, given the current memory state. The protocol is a tuple P_Z = (F, M, m₀), in which F is a set of operation specifications, M is the specification of memory and m₀ is the initial state of memory.

The memory tuple m : M ⊆ T* is a finite product of constants and variables cᵢ, vⱼ ∈ T, where the type T is restricted to finite computational types taken from the type domain T ::= Obj | Bool | Num | Char | List[T] | Set[T] | Map[T_k, T_v], where Obj is the union of finite uninterpreted sets used to model object identifiers, Bool is the Boolean type, Num is the union of finite computational numeric types including the usual short and long precision versions of the IEEE integral and floating point numbers, Char is the type of bounded-length character strings, List[T] is the type of bounded-length lists of elements of type T, Set[T] is the type of finite sets of elements of type T and Map[T_k, T_v] is the type of finite maps from T_k to (distinct) T_v. All types are finite to ensure decidability.

An operation f ∈ F is one of the declared operations of the service. Each operation is a tuple f = (req_f, in_f, out_f, Φ_f), in which req_f is a unique label or name for the operation f denoting a request, in_f is a finite input tuple for the operation f consisting of values taken from the domain type T_fin, and out_f is a finite output tuple for the operation f consisting either of values taken from the codomain type T_fout, or of a single value from Err, the set of exceptions including ⊥, the undefined result. The input and output tuples in_f, out_f may be zero-length, to indicate respectively a no-input or no-output operation.
Finally, Φ_f ≠ ∅ is a non-empty finite set of atomic functions φ_fj ∈ Φ_f, known as the scenarios of operation f, denoting its distinct execution paths as specified by the designer. Each scenario φ_fj ∈ Φ_f may also be described as a tuple φ_fj = (rsp_fj, g_fj, beh_fj), in which rsp_fj is a unique label within operation f, denoting the distinct response described by the scenario φ_fj; g_fj is a Boolean guard testing the operation's input domain and memory which, when true, indicates that the scenario φ_fj should be triggered; and beh_fj is the behaviour performed by the scenario φ_fj, which is not analysed further here. The behaviour specifies posterior variable bindings, such as outputs or memory updates, and can also be empty.
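The protocol structures just described might be rendered as plain data types (a hedged sketch; the field names, the guard/behaviour representation and the `withdraw` example are ours, not the paper's specification language):

```python
# Sketch of P_Z = (F, M, m0): operations containing guarded scenarios.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Scenario:                    # phi_fj = (rsp, guard, behaviour)
    rsp: str                       # unique response label within f
    guard: Callable[[dict, tuple], bool]
    behaviour: Callable[[dict, tuple], dict]   # posterior memory bindings

@dataclass
class Operation:                   # f = (req, in, out, Phi_f); I/O elided here
    req: str                       # unique request label
    scenarios: List[Scenario] = field(default_factory=list)

@dataclass
class Protocol:                    # P_Z = (F, M, m0)
    operations: Dict[str, Operation]
    memory_init: dict

# One-operation example: 'withdraw' with mutually exclusive ok/blocked branches.
withdraw = Operation('withdraw', [
    Scenario('ok',      lambda m, i: i[0] <= m['balance'],
                        lambda m, i: {'balance': m['balance'] - i[0]}),
    Scenario('blocked', lambda m, i: i[0] > m['balance'],
                        lambda m, i: m),
])
protocol = Protocol({'withdraw': withdraw}, {'balance': 100})
```

The two guards here are exhaustive and mutually exclusive over the input domain, which is exactly the property the verification approach of the next section checks symbolically.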

Protocol-automaton congruence
The protocol P_Z must describe the same system as the automaton A_Z of the DSXM Z. The set of processing functions φ ∈ Φ labelling the transitions in the automaton A_Z is the union of the sets of scenarios for each operation, intuitively Φ = ⋃ Φ_f. It must be possible to map from any explicitly specified transition to a unique scenario describing its behaviour; symmetrically, any specified scenario must correspond to a unique function labelling at least one explicit transition in the automaton.

Lemma 3
A scenario may be identified uniquely by a label pair (req_f, rsp_fj), where req_f is the name of an operation f ∈ F and rsp_fj is the name of a scenario φ_fj ∈ Φ_f, one of the branches of operation f.

Proof
By definition, the label req_f uniquely identifies a distinct operation request f ∈ F and, within the set of scenarios Φ_f of f, the label rsp_fj uniquely identifies a distinct scenario response, also by definition.

Definition 15
We say that the protocol P_Z and the automaton A_Z of a DSXM Z are congruent if every explicitly labelled transition φ ∈ Φ (before the completion of A_Z with trivial transitions) corresponds by name to exactly one scenario φ ∈ ⋃ Φ_f and, symmetrically, if every scenario corresponds by name to a function labelling one or more explicit transitions.

Protocol-automaton congruence is checked directly through model inspection, by assuring the equivalence of the two sets of labels: labels(transitions(states(A_Z))) ≡ labels(scenarios(operations(P_Z))). We established earlier that transitions were labelled in the same binary style (req, rsp). Note that the same label may sometimes be used on more than one transition (denoting the same function, but targeting different states). Note also that trivial transitions play no part in this consideration, because the protocol P_Z only specifies valid requests.
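The label-set comparison might look like the following in outline (a sketch; the data shapes for transitions and protocol are our own simplification):

```python
# Sketch of the congruence check: compare (req, rsp) label sets drawn from
# the automaton's explicit transitions and from the protocol's scenarios.
def transition_labels(transitions):
    """transitions: list of (source, (req, rsp), target) triples."""
    return {label for (_, label, _) in transitions}

def scenario_labels(protocol):
    """protocol: dict mapping req -> list of rsp labels."""
    return {(req, rsp) for req, rsps in protocol.items() for rsp in rsps}

def congruent(transitions, protocol):
    return transition_labels(transitions) == scenario_labels(protocol)

transitions = [('empty',  ('push', 'ok'), 'loaded'),
               ('loaded', ('push', 'ok'), 'loaded'),   # same label, reused
               ('loaded', ('pop', 'ok'),  'empty')]
protocol = {'push': ['ok'], 'pop': ['ok']}
```

Because label sets are compared, a label reused on several transitions (targeting different states) still matches a single scenario, while an unmatched scenario or transition on either side breaks congruence.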

Memory initialization and test input completeness
For the protocol to simulate correctly, the memory must be initialized and suitable test inputs must be found for every φ ∈ ⋃ Φ_f. In a departure from traditional DSXM approaches, which presume the existence of a test function to generate inputs, we require a test input binding constraint in the specification, a pragmatic choice to limit the search performed by the constraint solver. The binding constraint is expressed as a simple (in)equality, for example, v₁ = c₁ or v₁ < v₂, and is used by the constraint solver to bind v₁ to a suitable value.

Definition 16
If the memory type M ⊆ T* defines a product of constants and variables cᵢ, vⱼ ∈ T, then we say that memory is initialized if an initial binding exists for every declared variable in initial memory, that is, ∀vᵢ ∈ m₀ ∃cⱼ ∈ m₀ : (vᵢ := cⱼ) ∈ init, where init is the set of initial bindings. The types must be compatible, but there may be fewer constants than variables.

Definition 17
Similarly, every scenario φ_fk ∈ Φ_f must accept a distinct input tuple in_fk : T_fin specified in the test bindings. A test binding must exist for each input variable (state variables are already bound), that is, ∀vᵢ ∈ in_fk ∃cⱼ : (vᵢ := cⱼ) ∈ test, where test is the set of test bindings and cⱼ is found by constraint satisfaction. Furthermore, we say that a test binding is input complete if, under suitable memory conditions mᵢ, the guard g_fk for scenario φ_fk may be true, that is, ∃mᵢ ∈ M : g_fk(mᵢ, in_fk) = true. The existence of initial bindings of memory and test input bindings for each scenario is checked directly through model inspection. Test input completeness is initially assumed, after manually picking suitable test bindings that will satisfy the guards g_fk. This assumption is later discharged by protocol simulation, when the test generation tool checks that every scenario was executed at least once, so satisfying the test input completeness condition. For functions with no inputs, no test input binding is needed, because any guards may only test memory.

Operation completeness and determinism
To satisfy the testing assumptions for the DSXM Z, the associated protocol P_Z must specify deterministic and complete behaviour over all inputs and memory. To verify this, we must show that every operation f ∈ F individually specifies deterministic and complete behaviour over memory M and its own domain type T_fin.

Definition 18
Where an operation f ∈ F consists of a set of scenarios φ_fi ∈ Φ_f, we say that f is complete if, for all inputs and memory, at least one φ_fi ∈ Φ_f may be triggered; similarly, we say that f is deterministic if, for all inputs and memory, at most one φ_fi ∈ Φ_f may be triggered. This is equivalent to imposing a mutually exclusive and exhaustive condition on the guards g_fi of each scenario, namely, for every (m, in) ∈ M × T_fin, exactly one guard g_fi(m, in) is true. Verifying this property is the main task of the verification tool. We convert a possibly large state-space search problem into a finite symbolic checking problem. The key insight is that a guard is a predicate g_fi : M × T_fin → Bool whose domain type can be divided completely into equivalence partitions, based on the partitioning effects of predicates used in the complete set of guards g_fi ∈ G_f for the operation f. These equivalence partitions may be represented as conjunctions of atomic predicates, which may then be tested against the guards by symbolic subsumption.
In the sequel, we first deconstruct the guards g_fi ∈ G_f to reveal the atomic predicates acting on a subset of relevant variables; then we recombine the atomic predicates in all possible memory/input combinations; then we eliminate inconsistent partitions by constraint solving; and finally, we satisfy Definition 18 by symbolic subsumption.

Deconstruction of guards into atomic predicates.
Guards are constructed from a finite collection of predicates available in the specification's expression language (Table I). Predicates fall into three basic meta-types, Comparison, Membership and Proposition, plus the degenerate Atom, denoting a Boolean variable. Complex Predicates are defined inductively over constants c, variables v, collections s and (recursively) predicates p. This inductive definition supports deconstruction of a guard into a set of irreducible constraints on the relevant variables that affect its outcome. We define a structure-matching recursive function atomic() that deconstructs all compound logical formulae pro ∈ Proposition:

atomic(pro(p)) = atomic(p) (unary proposition)
atomic(pro(p, q)) = atomic(p) ∪ atomic(q) (binary proposition)
atomic(p) = {p} (everything else)

The deconstructed set contains only Atoms and Atomic predicates defined over scalar and set-theoretic types, where the meta-type Atomic ::= Comparison | Membership. Atomic predicates denote irreducible scalar inequality or set-theoretic constraints over variables; Atoms are Boolean variables that may bind to true or false.
Whereas the domain of each scenario φ_fi ∈ Φ_f is the whole type M × T_fin, the domain of the guards g_fi ∈ G_f may be a (shorter) projection of this type. Guards are not obliged to test all variables but may choose to test a subset of relevant variables. In the sequel, we group atomic predicates by the relevant variables that they constrain.
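The atomic() deconstruction can be sketched over predicates encoded as nested tuples (an encoding of our own choosing, standing in for the specification's expression language):

```python
# Sketch of atomic(): strip propositional connectives to reach the
# irreducible comparison/membership predicates beneath them.
def atomic(p):
    op = p[0]
    if op == 'not':                        # unary proposition
        return atomic(p[1])
    if op in ('and', 'or', 'implies'):     # binary propositions
        return atomic(p[1]) | atomic(p[2])
    return {p}                             # comparison, membership or atom

# Example guard: not (e in s) and (v < c or v = c)
guard = ('and',
         ('not', ('in', 'e', 's')),
         ('or', ('<', 'v', 'c'), ('=', 'v', 'c')))
```

Note that negation is discarded during deconstruction; it is reintroduced later, when each atomic predicate is expanded into its complete set of complementary partitions.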

Treating predicate partitions as value-spaces.
Consider that a given atomic predicate p(v, c) ∈ Atomic constrains the values of a variable v : T in relation to some constant c : T. This expression may be construed as denoting a partition of the type T; that is, the predicate p(v, c) stands for the set of values v : T that satisfy p. By symbolic manipulation of Atomic predicates, it is possible to construct further partitions of T, such that we obtain a set of predicate terms that completely partition T. Whereas predicates from Membership generate two partitions, for example, {(e ∈ s), (e ∉ s)}, predicates from Comparison generate three partitions on total orders, for example, {(v < c), (v = c), (v > c)}. The construction of partitions is achieved by negating and splitting atomic predicates. Negation has the usual logical sense. The definition of split() is sensitive to certain predicates cmp(x, y) ∈ Comparison with a compound sense:

split(x <= y) = {(x < y), (x = y)}
split(x >= y) = {(x > y), (x = y)}
split(x != y) = {(x < y), (x > y)}
split(p) = {p}

The partitions() function, constructed from negation and splitting, is sensitive to whether its input is an Atomic predicate or an Atom. It maps an Atomic predicate to a set of complementary predicates on the same relevant variables and maps an Atom to a pair of Boolean values. A single guard g may now be mapped to a set of sets, where each contained set denotes the partitions of one of its relevant variables (constraining it in relation to another variable or constant), defined by comprehension as:

partition_sets(g: Predicate) = {partitions(p) | p ∈ atomic(g)}

Generalizing this to all guards g_fi ∈ G_f collected from all scenarios of operation f, we note that the guards for different scenarios may choose to test more or fewer relevant variables (being more, or less, selective).
Therefore, the set of all partitions for an operation is the distributed union all_partitions(f: F) = ⋃ {partition_sets(g) | g ∈ G_f}. This set contains, for each relevant variable tested in any of the guards g_fi ∈ G_f, a set of symbolic partitions of that variable, used in the sequel to generate all the unique partitions of memory and input that could be presented to an operation. For example, the product of the partition sets {(e ∈ s), (e ∉ s)} and {(v < c), (v = c), (v > c)} yields {(e ∈ s) ∧ (v < c), (e ∈ s) ∧ (v = c), (e ∈ s) ∧ (v > c), (e ∉ s) ∧ (v < c), (e ∉ s) ∧ (v = c), (e ∉ s) ∧ (v > c)}; namely, ordered tuples created by the product are treated as conjunctions of atomic predicates (and atoms). We use the distributed version of the Cartesian product operator, because the set of partition sets may contain as many sets as there are relevant variables tested in guards: X₁ ⊗ X₂ ⊗ … ⊗ Xₙ = {x₁ ∧ x₂ ∧ … ∧ xₙ | x₁ ∈ X₁, x₂ ∈ X₂, …, xₙ ∈ Xₙ}. This product represents every way in which one symbolic partition of a relevant variable could be combined with the symbolic partitions of every other relevant variable, and so represents all combinations of distinct bindings of relevant input and state variables that could be presented to an operation.
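The partition expansion and the distributed product can be sketched together (the tuple encoding and the `'not-in'` label are our own; the paper's tool works over its full expression language):

```python
# Sketch of partitions() and the distributed Cartesian product: each atomic
# predicate expands to a complete partition of its variable; the product
# forms candidate conjunctions over all relevant variables.
from itertools import product

def partitions(p):
    op, *args = p
    if op == 'in':                      # membership: two partitions
        return [('in', *args), ('not-in', *args)]
    if op in ('<', '=', '>'):           # comparison on a total order: three
        return [('<', *args), ('=', *args), ('>', *args)]
    if op == 'atom':                    # Boolean atom: true/false pair
        return [('atom', args[0], True), ('atom', args[0], False)]
    raise ValueError(op)

def all_conjunctions(partition_sets):
    """Treat each tuple from the product as a conjunction of its elements."""
    return [tuple(term) for term in product(*partition_sets)]

sets = [partitions(('in', 'e', 's')), partitions(('<', 'v', 'c'))]
combos = all_conjunctions(sets)
```

The membership predicate contributes two cases and the comparison three, so the product enumerates all six combined partitions, matching the worked example above.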

Preserving consistent partitions of inputs and memory.
In fact, the set ∏(all_partitions(f: F)) is a conservative overestimate of the exhaustive partitions of relevant inputs and memory. This is because some conjunction terms, which apply constraints transitively to overlapping sets of variables, may be logically inconsistent. Consider how (x < y) ∧ (y = z) ∧ (z < x) is inconsistent, because no value may be found for y that is simultaneously greater and less than x.
The elimination of inconsistent conjunction terms is a constraint satisfaction problem. The approach taken is to populate the free variables of such terms with values that try to satisfy each predicate individually and then to evaluate the conjunction term as a whole. Conjunctions that do not hold under any possible value assignments are deemed inconsistent. Existing constraint solvers for Java, such as Choco or Cream, were explored but were found unsuitable for our needs, because they could only handle simple numeric and string types (whereas our specifications also have constructed Set, List and Map types). We therefore created our own constraint solver.
We assume the existence of a reliable binding operator bind() that binds the free variables in a term with values that marginally satisfy the term's constraint (a tactic known as "crawling around the edges" of a specification; this approach has been used with success in the solvers used to find counter-examples in the Alloy tool [44]). The bind() operator has no effect on bound variables and may choose to bind a left-hand or right-hand free variable in relation to a constant or bound variable mentioned in the same constraint. Some examples of this are

bind(v : Integer = 6) ⇒ v ↦ 6
bind(v : Integer > 6) ⇒ v ↦ 7
bind(6 > (v : Integer + 2)) ⇒ v ↦ 3.

The bind() operator is a meta-method defined for every operation in the expression language and for every type in T. For floating point inequalities, we assume that a small delta difference from a limit is adequate. For constructed List, Set and Map constraints, we assume that it is sufficient to satisfy the top-level constraint. For expressions in which the free variables are nested, bind() propagates a derived constraint back into the expression, by forcing it to have a given result.
Given this binding operator, it is possible to populate all the free variables in conjunction terms, in some order. However, the order of variable binding is significant. If bind() were called to populate the term (x > 4) ∧ (x < z) ∧ (z > 7) in left-to-right order, then the variables will receive the assignments {x ↦ 5, z ↦ 6} and the conjunction will evaluate to false. However, if the terms are populated in right-to-left order, bind() will assign {z ↦ 8, x ↦ 7} and the conjunction will evaluate to true. The objective is to find at least one consistent binding, which demonstrates that the conjunction term models a valid input and memory combination.
Treating a conjunction conj as a sequence (cf. the tuple generated by the Cartesian product), we compute all permutations of this sequence perm ∈ P(conj) and bind each permutation in indexed order of its terms, such that if at least one of these evaluates to true, the conjunction is valid. We conjecture that if a valid binding exists, then bind() will always find it by "crawling around the edges." A sketch of the proof is by considering a conjunction of terms for which there is a unique solution: (x > 5) ∧ (x < 7), for which bind() will assign {x ↦ 6}, whereas any alternative binding algorithm that assigns to x values much greater than 5 or much less than 7 will fail to find this.
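The marginal binding tactic and the permutation search can be sketched as follows, for basic integer comparisons only. This is our illustrative reconstruction in Python, assuming the (op, x, y) triple encoding used earlier; it is not the authors' constraint solver.

```python
from itertools import permutations

# "Crawling around the edges": bind each free variable to a value that
# marginally satisfies its constraint, then check the whole conjunction.
def value(operand, env):
    return env[operand] if isinstance(operand, str) else operand

def bind_term(term, env):
    """Bind any free variable in term to a marginally satisfying value."""
    op, x, y = term
    x_free = isinstance(x, str) and x not in env
    y_free = isinstance(y, str) and y not in env
    if x_free and y_free:
        env[y] = 0            # seed one of two free variables arbitrarily
        y_free = False
    if x_free:
        v = value(y, env)
        env[x] = {"=": v, "<": v - 1, ">": v + 1}[op]
    elif y_free:
        v = value(x, env)
        env[y] = {"=": v, "<": v + 1, ">": v - 1}[op]

def holds(term, env):
    op, x, y = term
    a, b = value(x, env), value(y, env)
    return {"=": a == b, "<": a < b, ">": a > b}[op]

def consistent(conj):
    """Try every binding order; one satisfying assignment suffices."""
    for perm in permutations(conj):
        env = {}
        for term in perm:
            bind_term(term, env)
        if all(holds(t, env) for t in conj):
            return True
    return False

# (x > 4) and (x < z) and (z > 7): left-to-right binding fails with
# {x: 5, z: 6}, but the right-to-left permutation succeeds with {z: 8, x: 7}.
print(consistent([(">", "x", 4), ("<", "x", "z"), (">", "z", 7)]))      # True
# (x < y) and (y = z) and (z < x) is unsatisfiable under every ordering.
print(consistent([("<", "x", "y"), ("=", "y", "z"), ("<", "z", "x")]))  # False
```

The two printed cases reproduce the order-sensitivity example and the inconsistent transitive chain discussed above.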
The consistent partitions of relevant input and memory are then determined as

consistent_partitions(f) = {conj ∈ ⊗(all_partitions(f : F)) | ∃ perm ∈ P(conj) · bind(perm) = true},

where P() computes all permutations of a conjunction term and bind() binds all free variables of the permutation in order.

Determinism and completeness of operations.
Having filtered the exhaustive partitions of relevant input and memory to preserve only those that are logically consistent, we may use these as symbolic exemplars of all relevant tuples that may be presented to the guards of an operation. For a given operation f ∈ F and for the set of guards G_f of this operation, we determine the following:

∀c ∈ consistent_partitions(f) · ∃! g ∈ G_f · g ⊒ c,

where g ⊒ c denotes logical subsumption, in the sense that g is more general than c, such that if c holds for some input and memory tuple, g will also hold. This formulation in terms of logical subsumption is equivalent to the original proof obligation, since the symbolic conjunction tuple c is a projection of the combined input and memory tuple that is relevant to the guards G_f of operation f, and is agnostic about all other input and memory values. The subsumption check is equivalent to proving that the guard accepts all inputs and memory states that satisfy the conjunction tuple.
Symbolic subsumption involves matching terms indexed by common operands and proving that the guard-term is more general than or equal to the input-term; namely, the guard-term (x ≤ y) ∧ (p ≠ q) subsumes the input-term (x < y) ∧ (p > q) ∧ (e ∈ s), because the first two sub-terms of the input are pairwise more specific and the third term is irrelevant. This check is performed directly on model predicates, using subsumption-checking meta-methods, without any need to populate terms with values.
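The pairwise subsumption check can be sketched as a table lookup over operator pairs. This Python model is our own illustration of the idea, using the hypothetical triple encoding from earlier, not the tool's meta-methods.

```python
# Symbolic subsumption over (op, x, y) predicate triples: each guard term
# must be matched by an input term on the same operands that is pairwise
# equal or stricter; input terms with no matching guard term are irrelevant.
SUBSUMED_BY = {       # maps an input operator to the guard operators that subsume it
    "<":  {"<", "<=", "!="},
    ">":  {">", ">=", "!="},
    "=":  {"=", "<=", ">="},
    "<=": {"<="}, ">=": {">="}, "!=": {"!="},
    "in": {"in"}, "notin": {"notin"},
}

def subsumes(guard, inp):
    for (g_op, x, y) in guard:
        matches = [i_op for (i_op, ix, iy) in inp if (ix, iy) == (x, y)]
        if not matches or any(g_op not in SUBSUMED_BY[i_op] for i_op in matches):
            return False
    return True

# The example from the text: (x <= y) and (p != q) subsumes
# (x < y) and (p > q) and (e in s); the membership term is irrelevant.
guard = [("<=", "x", "y"), ("!=", "p", "q")]
inp = [("<", "x", "y"), (">", "p", "q"), ("in", "e", "s")]
print(subsumes(guard, inp))   # True
```

Note that subsumption is directional: a strict guard (x < y) does not subsume the weaker input (x ≤ y), since the guard would reject the boundary case x = y.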

Limitations of the approach
The decidability of verification could be challenged by unbounded values. We mitigate this by ensuring that all types are finite and structures are of bounded length (all types are computational, so must fit within finite computer memory). While the symbolic subsumption algorithm is complete by design with respect to all predicates in the expression language, the constraint solver, whose purpose is to eliminate inconsistency, may possibly fail to find consistent bindings in deeply nested expressions that break the assumptions of the reliable binding operator. The effect of this would be to exclude a valid input term from the subsumption checker, which would proceed without it. Degrading gracefully, the verification tool will still report a result on the basis of all other permutations. The tool works well in practice, as demonstrated by the examples included in Section 6.

VERIFICATION AND TESTING TOOLS
The BrokerCloud Verification and Testing Tool Suite (VTTS) [25] includes two tools to assist the designer in writing a correct specification, suitable for model-based test generation. The designer develops a DSXM specification Z in the manner outlined in Section 2. The validation tool may then be used to check the desired state-transition behaviour of the associated automaton A Z and the verification tool may be used to check the completeness and determinism of the associated protocol P Z , using the approach described in Section 4. VTTS also includes two tools to assist the designer in generating tests. A correct specification may be submitted to the test generation tool to create complete functional test suites, using the optimizing approach described in Section 3. These technology-neutral test suites contain full information for code generation in any desired format. As a proof of concept, the test grounding tool can map these to executable JUnit tests for a variety of Java-wrapped web services.
The validation and verification tools accept as input an XML specification and produce as output an annotated version of the relevant part of the specification, flagging various issues by attaching extra XML elements of the kinds Notice, Analysis or Warning, describing the results of the analysis and highlighting any faults that need correction. The test generation tool also accepts an XML specification but outputs an abstract XML test suite, similarly annotated to describe test parameters, test optimizations and test coverage properties. The test grounding tool accepts an abstract XML test suite as input and generates Java code for JUnit, in which all metadata is translated into Java comments.

Validation tool
The validation tool helps the designer to structure the state space of the service, by reflecting back the consequences of the state machine design. The tool analyses the completeness of the states, under all known events supported by the machine (the FSM's alphabet, the union of events on all transitions) by static analysis, and determines the reachability of states by dynamic simulation. Altogether, it checks that an initial state has been specified (well-formedness); it checks by exploration that every state is reachable from the initial state (reachability); for each state specified in the machine, it checks whether a transition exists for every event in the alphabet (completeness); and for each scenario specified in the protocol, it cross-checks whether this has a corresponding transition in the machine (consistency).

Feedback is presented to the designer by annotating the model of the Machine. The output XML file may be processed by client tools, to present this information to the designer in any desired format. Warnings are issued if any of the earlier properties are violated. The designer may then repair a faulty machine or decide whether any missing transition should be added. After validation, the machine is considered complete, and missing transitions are treated implicitly as trivial transitions of the form: request/ignore, where ignore indicates a mandatory null response.

Validation algorithm.
The validation algorithm is straightforward, based on a static analysis of the specification model. The states of the machine and explicit transitions leaving each state are found by inspection in the specification model. In addition, exactly one state must be marked as the initial state; the alphabet of the machine is constructed from the set of all transition events used anywhere in the machine; missing transitions for each state are discovered by comparing that state's explicit transitions against the alphabet; bounded breadth-first search (with a 5 s time-out) explores the machine to determine whether all states are reachable; and consistency of names used for scenario labels and transition events is determined by comparing the two alphabet sets.
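The completeness and reachability checks can be sketched on a toy machine. This Python illustration assumes a simple (source, event, target) encoding of transitions; the structure and names are ours, not the tool's.

```python
from collections import deque

# Validation sketch: compute the alphabet from all transitions, find the
# events missing from each state (completeness), and explore the machine
# breadth-first from the initial state (reachability).
def validate(states, initial, transitions):
    alphabet = {e for (_, e, _) in transitions}
    missing = {s: alphabet - {e for (src, e, _) in transitions if src == s}
               for s in states}
    seen, queue = {initial}, deque([initial])
    while queue:
        s = queue.popleft()
        for (src, _, dst) in transitions:
            if src == s and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return missing, set(states) - seen

# A two-state machine in the style of the DocumentStore example.
missing, unreachable = validate(
    ["LoggedOut", "LoggedIn"], "LoggedOut",
    [("LoggedOut", "login", "LoggedIn"),
     ("LoggedIn", "logout", "LoggedOut"),
     ("LoggedIn", "putDocument", "LoggedIn")])
print(sorted(missing["LoggedOut"]))   # ['logout', 'putDocument']
print(unreachable)                    # set()
```

The events reported missing from a state are exactly those that, after validation, become implicit trivial transitions of the form request/ignore.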

Verification tool
The verification tool helps the designer to complete the branching logic expressed in the protocol, by checking this for completeness and consistency. The tool ensures, by static analysis, that all specified parameters are bound to suitable values and then checks that every specified operation is deterministic and non-blocking, by symbolic execution and constraint satisfaction. Altogether, it checks that every specified variable in memory is assigned an initial value (initialization); for each scenario of each operation, it checks that every specified input has a test input binding (testability); for each operation, it determines the exhaustive equivalence partitions of inputs and memory and then checks that every equivalence partition triggers exactly one scenario of that operation (deterministic, non-blocking); and finally, for each transition specified in the machine, it cross-checks whether this has a corresponding scenario in the protocol (consistency).
The initialization and testability checks ensure that the protocol can be simulated, without raising null value exceptions. The checks for determinism and non-blocking ensure that operations are complete and behave consistently under all inputs and memory. These two conditions are important, not only for logical correctness but also for testability [15]. If these conditions were not met in the specification, testing could not distinguish between different nondeterministic behaviours and might not detect blocking in the implementation. Feedback is provided by annotating the model of the Protocol, which may be visualized and presented to the designer in any desired format.

Verification algorithm.
The verification algorithm was presented in Section 4, and the stages of the algorithm are summarized hereafter. Referenced model elements, such as memory, operations and variables, are all available by inspection in the model. In addition, for each memory variable, a corresponding assignment is sought in the initial memory bindings; and for each input variable of each scenario, a corresponding test input binding is sought in the model; for each operation, all possible equivalence partitions of symbolic input and memory are calculated, by analysing the set of predicates collected from all of the scenarios of that operation; a constraint solver is used to eliminate any inconsistent partitions denoting impossible input/memory configurations; and  every consistent partition of input and memory is presented in turn to each of the guarded scenarios of that operation, to determine by symbolic subsumption whether that partition is accepted by exactly one guarded scenario.

Test generation tool
The test generation tool accepts a verified (complete, deterministic and non-blocking) specification and from this generates complete functional tests for the service under test. The specification serves both as a means to determine the minimum level of test coverage and as the oracle for the generated tests, by supplying all transition labels, reached states and expected input-output correspondences for each test sequence. The test suite is output in a technology-agnostic, platform-independent XML format (Figure 3), which is later used as the input to different translators that ground the tests for particular service platforms, implementation styles and programming languages. Altogether, the test generation tool checks by exploration that every state is reachable from the initial state, despite the blocking effects of guards (state coverage); it checks during simulation that every transition in the specification has been triggered at least once, despite the blocking effects of guards (transition coverage); it proposes higher coverage paths, such as all-pairs and all-triples of transitions, to cover a non-minimal implementation (super-minimal coverage); it filters proposed test paths through the memory and guarded operations, to eliminate paths that are infeasible in the specification (path feasibility optimization); it prunes all test paths that have proven empty cycles in their prefix, achieving a significant reduction in the test suite size (test redundancy optimization); it optionally creates multi-objective tests, merging shorter and longer test paths whose test goals can be jointly verified (test compression optimization); and generated test sequences contain complete information about required test inputs, expected test outputs or failures, triggered transition labels and reached states (test observation completeness).
The output of the test generation tool is an XML file, whose schema is visualized as a tree diagram in Figure 3. The tool gives feedback to the tester, by annotating this model with metadata, recording the chosen test parameters and the reductions in test suite size achieved by successive optimizations. The tool also warns if states or transitions were not covered when simulating the Stream X-Machine, which might occur under the blocking effects of guards. The designer can then take different kinds of remedial action in response, such as increasing the maximum path length, specifying different test values, or creating additional scenarios to ensure certain critical memory states are reached.

Test generation algorithm.
The optimized DSXM test generation algorithm was presented in Section 3, and the stages of the algorithm are summarized hereafter. We simulate the complete Stream X-Machine, that is, the Machine and Protocol simultaneously, keeping track of both the current state and current memory values, in order to generate feasible test sequences by construction: this is a new achievement in Stream X-Machine testing. Altogether, bounded breadth-first search (with a 5 s time-out) explores the Stream X-Machine to determine the minimal feasible state cover, under the possible blocking effects of guards; a test suite is constructed from the concatenated product of the minimal feasible state cover and the bounded Kleene star language (which is prefix closed and path complete), consisting of all possible event sequences of lengths 0..n, where n is the maximum path length; the size of this test suite is the baseline against which optimization is measured; and candidate test sequences from the baseline test suite are presented in turn to the Stream X-Machine for simulation, after resetting the Machine and Protocol to their initial states, such that for each sequence:
- any sequence that executes completely in both the Machine and the Protocol is retained in the test suite;
- any sequence that executes in the Machine but is blocked in the Protocol is discarded as infeasible;
- any sequence that is blocked in the Machine is only retained if it ends with the empty cycle, otherwise it is discarded as redundant; and
- assertion flags are added to the final step of each sequence to check the transition label, reached state and expected outputs.
Finally, if the tester requests multi-objective testing, the optimized test suite is compressed further, by merging shorter sequences with longer sequences of which they are a prefix.
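The baseline suite construction and the infeasibility filter can be illustrated on a toy example. The guarded simulation below is a hypothetical stand-in for the full Machine-plus-Protocol simulation: a 'get' event blocks until a 'put' has occurred, loosely echoing the DocumentStore example.

```python
from itertools import product

# Toy guard simulation: return False when a path is blocked by the guard.
def feasible(path):
    docs = 0
    for event in path:
        if event == "put":
            docs += 1
        elif event == "get" and docs == 0:
            return False              # blocked by the protocol's guard
    return True

# Baseline suite: all event sequences of lengths 0..max_len (prefix closed).
def baseline(events, max_len):
    suite = [()]
    for n in range(1, max_len + 1):
        suite += list(product(events, repeat=n))
    return suite

suite = baseline(("put", "get"), 2)
kept = [p for p in suite if feasible(p)]
print(len(suite), len(kept))   # 7 4: ('get',), ('get','get') and ('get','put') are discarded
```

Because the suite is prefix closed, discarding an infeasible sequence such as ('get', 'put') is safe: its maximally feasible prefix (here, the empty sequence) is already present as a shorter test.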

Test optimization benefits.
The test generation tool uses verified assumptions about trivial transitions and mutually excluding scenarios to make significant optimizations to the size of the test suite generated by the baseline DSXM test generation algorithm. The three main optimization steps were presented in Section 3.5 but are summarized hereafter, with an indication of each step's effectiveness (see also Section 6). The precursor step is to use a reliable state oracle instead of extended characterization sequences from W. This directly reduces the size of the test set by a factor of card(W), because in the classic W-method, every exploratory sequence would be extended by each of the sequences in W, to identify the reached state. Test sequences are also shorter without the suffix from W. Whereas executing sequences from W would force the tested system through additional states and transitions, the state oracle function is side-effect free, which we exploit hereafter in the third optimization.
The first optimization filters out all paths that contain trivial prefix cycles. If a path ends with a trivial cycle, this is confirmed in the implementation by testing for an explicit nullop; thereafter, all future paths that extend it are redundant, because any such path is equivalent to a shorter path without the trivial cycle. The benefit gained by removing redundant paths grows in proportion to the elaboration of states with missing transitions in the specification but typically reduces the size of the test suite by 60-70% in a system with three to five states. This huge gain is achieved relative to the exponential growth in the size of the test suite as the maximum path length increases.
The second optimization filters out infeasible paths that cannot be executed. Previous SXM testing approaches [15][16][17] recognize that simulating (only) the SXM's associated automaton may generate infeasible paths, blocked by guards. The hypothetical test function converts abstract paths into sequences of concrete test inputs, such that if no input can be found to trigger a given transition, the input sequence is truncated. In our approach, the maximally feasible prefix is already in the prefix closed test suite, so we may safely discard the infeasible sequence. Negative testing (to prove blocking) is accomplished by positive testing for a mutually exclusive path. The benefit gained by removing infeasible paths grows in proportion to the number of guards used in the specification but typically reduces the size of the test suite by 20-30%. Furthermore, we guarantee detection of missing coverage due to blocking, because we simulate the whole DSXM. This cannot be detected by simulating the automaton alone (unless the system is artificially forced to be input complete).
The third optimization compresses the test suite by merging sequences that share test objectives. This can only be achieved if all test-final assertion checks are side-effect free. We may then embed shorter sequences inside longer multi-objective sequences that contain them as a prefix, executing assertion checks after one or more prefix steps, as well as after the final step. We check the triggered transition (cf. output distinguishability), the reached state (cf. sequences from W) and the result of each operation. The benefit of compressing the test suite grows in proportion to the number of sequences that are also prefixes but typically reduces the size of the test set by 5-15%.
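The prefix-merging step can be sketched in a few lines. This is an illustrative Python model of the compression idea only; the real tool also records which objectives to assert at each intermediate step.

```python
# Compression sketch: any sequence that is a prefix of a retained longer
# sequence is merged into it, since its objectives can be asserted at the
# intermediate step (assertion checks are side-effect free).
def compress(suite):
    kept = []
    for path in sorted(suite, key=len, reverse=True):
        if any(longer[:len(path)] == path for longer in kept):
            continue                  # objectives checked mid-sequence
        kept.append(path)
    return kept

print(compress([("a",), ("a", "b"), ("c",)]))   # [('a', 'b'), ('c',)]
```

Here ('a',) is absorbed into ('a', 'b'), whose first step now carries the shorter test's assertions, while ('c',) survives because it is not a prefix of any longer sequence.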
Altogether, the combination of these different optimization strategies may reduce the size of the generated test suite to under 10% of the theoretical size of test suites generated by the W-method [19]. Examples of performance are reported for five services in Section 6 hereafter.

Test grounding tool
In general, the translation of test suites into executable tests is the responsibility of the service provider, who controls the platform and programming languages used to build the service. The test grounding tool is provided as a proof of concept that the generated tests can be translated into different service technologies. The tool performs a model-to-code transformation, accepting an XML model produced by the test generation tool, and from this generates JUnit tests suitable for a JAX-WS Java client for a WSDL/SOAP web service or a JAX-RS Java client for a REST/JSON web service. A grounding to plain Java is also provided. Like other code generators, it follows a number of standard design patterns [45], in particular the Visitor pattern to walk through the model test suite (Figure 3) and the AbstractFactory pattern to synthesize Java types and values from models expressed in the standard expression language (see also Section 2).
Standard groundings all assume that the service client has a Java API and that the service may be tested through this API. Test grounding generates the source code for a JUnit test-driver class, whose @Test-annotated methods are the individual test sequences. The @Before-annotated method initializes the service on first invocation and resets it subsequently. Each test consists of prefix steps designed to reach a given memory state, followed by the final verified step. Every aspect may be checked by JUnit assertions, to verify which scenario was executed, what result was returned and what state was reached. In multi-objective testing, intermediate steps may also be verified by assertion.
The different concrete groundings adopt slightly different strategies (through different realizations of the Grounding visitor). The JaxWsGrounding generates a JAX-WS Java client, which converts regular Java method-calls into SOAP requests and then converts the SOAP responses back into Java objects. The JaxRsGrounding generates a JAX-RS Java client, which dispatches RESTful-style HTTP requests that include the operation-name and any inputs as part of the URL and uses a JSON parser to convert the responses back into Java objects. Both of these groundings use JUnit to execute the concrete tests.
However, this is not the only possibility. A custom grounding was created that translates operation inputs and outputs directly into SOAP requests and responses [28], in the format accepted by the SOAP-UI test engine. This invokes a sequence of SOAP requests on a web service and compares the actual SOAP responses against expected responses. This approach can be used to test any service with a WSDL interface and can test full state and transition behaviour if the service exposes these in the manner described in the design-for-test conditions (Section 1).
Our partners at SAP have also created a custom grounding that outputs tabular HTML instructions for the Selenium test engine [27], which drives the tested service through a rich client, manipulating and comparing the HTML Document Object Model (DOM) on the client side. This is useful for styles of service with mixtures of client-side and server-side business logic (typically implemented using JavaScript), or for clients which communicate via AJAX with the server, rather than through a functional web service API. However, this kind of grounding can only be created by the service provider with full knowledge of the client-side DOM.

CASE STUDY AND EVALUATION OF RESULTS
During the course of the BrokerCloud project [4], seven case studies were developed, ranging from simple micro-services to complete service applications, including a shopping cart, a data warehouse and a VAT clearance application. These were deliberately chosen to vary the numbers of states, guarded or trivial transitions and dependency on memory. The specifications are available online, accessible via public URL [25], and so may be used as input to the validation, verification, test generation and test grounding tools. In the succeeding texts, we describe the evolution of one of these examples, the DocumentStore data warehouse, through the various stages of specification, validation, verification and testing. The aim is to give a substantial example of service specification, showing the kinds of issue encountered at different stages during its creation. We follow this with a summary of our experience in applying the proposed approach to the seven case studies.

Requirements for the DocumentStore
Imagine that a platform provider wishes to offer bulk data storage services at different levels of assured quality. Non-functional requirements include availability, reputation and cost, assured by other components of the BrokerCloud framework [6]. Functional requirements include offering three different storage capacities (up to 10, 100 or 1000 terabytes), three possible AES encryption key lengths (128, 192 or 256 bits), full document versioning and login-based authentication. Figure 4(a) shows the state machine for the DocumentStore, as the designer first imagined it (note that this is not the final version). The service has two states: LoggedOut and LoggedIn. In the LoggedOut state, the only action possible is to login. In the LoggedIn state, all actions are possible apart from login. Signed-up users will already have been given accounts corresponding to the levels of service agreed with the provider; this includes the encryption level, which is constant for each user and reported only at login. Users must first login with their userid and password. Subsequently, they may check in, or check out, versioned documents, up to an agreed storage allocation (the whole document is stored each time, rather than delta increments). The system must ensure that users cannot exceed their agreed allocation. Figure 4(b) sketches the protocol to support the operations shown in Figure 4(a). Requests are shown with their alternative responses, and parameters have the following meanings: terabyte is the allocated storage, version is the version number, encrypt is the encryption level and message is an error message. Figure 5 shows the memory abstraction for the DocumentStore. Constants represent valid and invalid users and passwords, an encryption and storage limit associated with the valid user, and documents of different sizes. Variables represent the total storage used, a document ID counter, a version-list of documents and a repository of document versions.
These are suitably initialized. The versioning system is modelled as a map, which stores the version-list of a document against its docid. Documents are represented as uniquely identifiable instances of an otherwise uninterpreted type, Document, whose sizes in terabytes are recorded against their identifiers in a map. One of these deliberately exceeds the storage limit.

Specification for the DocumentStore
The operations require at least one scenario to cover each expected branch in the protocol. Figure 6 shows an excerpt of this protocol, describing the putDocument operation. Initially, the designer expected this to have three responses: {ok, blocked, error} (Figure 4(a)). However, while developing the ok response, the designer realized that different effects should happen when adding a new document and when updating an existing document. Figure 6 shows how the old ok response has been refined into distinct new and update responses, which behave differently in the way that the docVersions variable is handled. We assume that the designer forgets to revisit the state machine to adjust the transition labels. Unknown to the designer at this stage, there are two further errors buried in Figure 6.

Validation and verification of the DocumentStore
Assuming the designer did not repair inconsistencies in the specification, validation follows next. The validation tool exercises due diligence, reporting that the LoggedIn state has no transitions for login/ok and login/error and that these are the only transitions available in the LoggedOut state, whereas all other transitions are missing. This is as intended. However, the tool also reports warnings that two transitions, putDocument/new and putDocument/update, were present in the protocol but absent from the machine. The designer fixes the state machine specification by replacing the old putDocument/ok transition (in Figure 4(a)) with these refinements (from Figure 6). Next, the specification is passed through the verification tool, which agrees that all four variables in memory are correctly initialized (Figure 5), but then discovers faults in the specification of the operation putDocument. This is found initially to be nondeterministic, because a pair of symbolic inputs generated by the tool, shown in Figure 7(a), each trigger multiple scenarios, namely, putDocument/new and putDocument/error. The fault is found in the guard for putDocument/new (Figure 6), which should have the additional conjunct docid > 0 to distinguish its acceptance of positive docids from the guard of putDocument/error, which accepts docids less than or equal to zero. Formerly, they overlapped on the value zero. After fixing this fault, the designer passes the specification through the tool again, which now reveals that putDocument is blocking, because another pair of generated symbolic inputs, shown in Figure 7(b), are not accepted by any scenario. This fault is traced to an off-by-one boundary error in the guard condition for the scenario putDocument/blocked: the upper bound for docid should be raised by one, to docid ≤ docCounter + 1. The designer fixes this fault.
The final version of this specification is available to view online [25].
As an indication of how useful automatic verification can be, the operation putDocument is found to have 27 equivalence partitions, of which one triggers the scenario putDocument/new, one triggers putDocument/update, four trigger putDocument/blocked and the rest trigger putDocument/error. However, the operation getVersion is found to have 81 equivalence partitions, of which four trigger getVersion/ok, fourteen trigger getVersion/absent and the rest trigger getVersion/error. This would be too many to check by hand; or rather, the designer might never think of all such cases.

Test generation for the DocumentStore
Tests are initially generated for the DocumentStore specification with the path length set to 1. While this is sufficient to cover all states and transitions of the associated state machine, it is not adequate to cover all the transitions of the Stream X-Machine, due to the restrictions imposed by guards. The tool reports generating 39 unique sequences of events, of which 23 paths are executable and 16 paths are infeasible. The infeasible paths are sequences that attempt to update or access a document before any document has been deposited. The tool issues warnings that eight transitions are never executed: getDocument/ok, getDocument/absent, putDocument/update, putDocument/blocked, getVersion/ok, getVersion/absent, deleteVersion/ok and deleteVersion/absent, none of which are enabled until at least one document is present.

Because coverage of the specification is incomplete, tests are generated again with the maximum path length set to 2. The resulting test suite includes the previous test suite, plus longer paths. This time, the only generated warning is for one uncovered transition: getDocument/absent (only testable after a document inserted at a given docID has been deleted); all other transitions are exercised at least once. The tool reports 742 unique theoretical sequences, of which 394 paths are deemed infeasible (blocked by guards), and a further 209 paths are deemed redundant (with trivial prefix cycles). Only 139 paths are retained as usefully discriminating, executable tests. This corresponds to 18.7% of the original test suite.
To ensure that every transition of the specification is covered at least once, tests are generated again with the maximum path length set to 3. This time, there are 14,099 unique theoretical sequences, reflecting the exponential growth of the test suite with path length. Of these, 8,102 paths are deemed infeasible, 4,862 paths are deemed redundant, leaving 1,171 paths that are usefully discriminating, executable tests. This corresponds to 8.3% of the original test suite. If the tester selects multi-objective test generation, in order to merge paths which share test objectives, the executable test suite is optimized further to 1,077 paths, which is 7.6% of the original test suite.
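The overall shape of this generate-and-prune process can be sketched on a toy guarded machine (an illustrative simplification only: the machine, guards and resulting counts here do not correspond to the DocumentStore figures above, and the redundancy criterion is reduced to one hard-coded cycle):

```python
from itertools import product

# A toy guarded machine over a single memory variable (document count).
def step(mem, event):
    """Return the new memory if the event's guard holds, else None."""
    if event == "put":
        return mem + 1                      # always enabled
    if event == "get":
        return mem if mem > 0 else None     # blocked while the store is empty
    if event == "delete":
        return mem - 1 if mem > 0 else None
    return None

def has_trivial_prefix_cycle(path):
    """Simplified redundancy criterion: the path opens with a put/delete
    pair that returns memory to its initial value."""
    return path[:2] == ("put", "delete")

def classify(paths):
    """Simulate each path, culling infeasible (guard-blocked) and
    redundant (trivial-prefix-cycle) sequences."""
    feasible, infeasible, redundant = [], [], []
    for path in paths:
        mem = 0
        for event in path:
            mem = step(mem, event)
            if mem is None:
                break
        if mem is None:
            infeasible.append(path)
        elif has_trivial_prefix_cycle(path):
            redundant.append(path)
        else:
            feasible.append(path)
    return feasible, infeasible, redundant

events = ("put", "get", "delete")
paths = [p for n in (1, 2, 3) for p in product(events, repeat=n)]
feasible, infeasible, redundant = classify(paths)
print(len(paths), len(infeasible), len(redundant), len(feasible))  # 39 28 2 9
```

Even on this toy machine, only 9 of the 39 theoretical sequences survive as useful tests, which mirrors the order-of-magnitude reductions reported for the real specifications.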
The resulting test suite has powerful conformance-testing properties. It tests every scenario from the specification at least once and, for all scenarios but getDocument/absent, covers more than the minimal FSM of the specification. Testing with the maximum path length set to n covers all 1-step reachable duplicates of implementation states that were covered in the specification with path length set to n-1.

Evaluation across multiple services
We present summary results for all seven of the service case studies developed during the course of the BrokerCloud project [4], which may be examined online [25]. The Login service is a single sign-in micro-service. The Account service is a payment micro-service. ContactList is an address book micro-service. HolidayBooking is a mobile app for booking periods of vacation leave, proposed by partner SAP SE, developed in SAP OpenUI5 and executing on the HANA platform [27]. We developed a ShoppingCart app for online shopping, accessed via a SOAP/WSDL API on an Apache Tomcat web service. DocumentStore is a data repository app, developed by partner SEERC and accessed via a REST/JSON API on a cloud Apache web service. VatClearance is a business app for self-employed clients to calculate their annual VAT returns, proposed by partner SingularLogic SA and accessed via a REST/JSON API on the SingularLogic Orbi cloud platform. Table II shows summary verification characteristics for each of these service specifications. For each service, the columns indicate the number of operations in the service's interface; the total number of scenarios describing distinct paths through these operations; the total number of equivalence partitions generated from guard expressions to validate the scenarios; the number of valid, logically consistent partitions used in verification; the number of invalid, logically inconsistent partitions that were excluded; the maximum number of partitions found for any operation; and the maximum number of valid partitions found for any operation.
From this, it is clear that the DocumentStore example presented the most challenging task for verification, with two operations each having 81 equivalence partitions, leading to 220 equivalence partitions overall. The cases of 81 partitions arose from guards for these operations consisting of four conjoined terms, each of which had three partitions individually. The ContactList example was also intriguing, because of its high number of detected invalid partitions. These arose from a higher occurrence of variable-sharing across the conjoined terms of a guard (e.g. conjuncts having three terms, pairwise sharing two variables). This service also had a higher ratio of scenarios per operation, to capture memory-contingent branching behaviour.
In other examples, memory-contingency was less important; for example, the ShoppingCart is strongly state-contingent and its scenarios mostly have single-term guards, resulting in fewer partitions (almost one per scenario). Table III shows summary test generation characteristics for each of these service specifications. For each service, the columns indicate the number of states in the associated automaton; the total number of transitions (including trivial circular transitions); the transition path-length necessary to cover the Stream X-Machine (starting from each state); the baseline number of tests required to meet this test obligation; the number of culled tests deemed infeasible; the number of culled tests deemed redundant; the remaining number of useful single-objective tests; the optimized number of merged, multi-objective tests; and an efficiency expressed as the ratio of pruned (culled and merged) tests to baseline tests.
From this, it is clear that the specifications offered vastly different testing challenges. They ranged from Login, whose SXM was already covered by the transition cover of the associated automaton, to VatClearance, which needed to exercise longer paths up to length 4 (from each state) to reach memory conditions that would allow certain guards to be triggered. While the size of the baseline test suite increases exponentially in the length of the transition path, there is also greater opportunity to reduce the test suite size by optimization, where pruning ranged from 22.2% in Login to 99.1% in VatClearance. For median coverage paths of length 3, on average around 90% of the theoretically possible baseline tests may be eliminated, leaving behind the 10% most effective tests (effective in the sense of being non-redundant, executable and discriminating). Different specifications benefited more or less from different optimizations. Login showed how a path length greater than one is needed before the culling of redundant tests is possible but still benefited from path merging. Services with greater memory-contingency, having a higher branching factor (viz. more scenarios per operation), such as ContactList, HolidayBooking or DocumentStore, offered greater opportunity to cull infeasible tests. The extreme example was ContactList, whose states were entirely characterized by memory conditions and for which infeasible paths were pruned before empty cycles were considered. Services with greater state-contingency, having a larger state space, such as ShoppingCart or VatClearance, offered greater opportunity to cull redundant tests, due to the increased presence of trivial prefix cycles.
In terms of time complexity, the test generation algorithm is exponential in the maximum length of paths explored, the verification algorithm is proportional to the product of the number of scenarios and their input/memory partitions, and the other algorithms are linear in the size of their inputs. However, in timing experiments conducted on the examples, we found that execution times were dominated by other factors, such as data transmission via HTTP, when invoking the tools as cloud services via URL.
The biggest time penalties were incurred by client web browsers (or GUI tools) which rendered the XML or Java code using syntax highlighting. We established this by invoking the test generation service via a remote internet connection and rendering results on a laptop with an Intel Core i5 1.8GHz processor with 6GB RAM. We compared the timing for the VatClearance and DocumentStore examples. Approximately speaking, the former generated 61K tests, optimized to 0.5K tests, while the latter generated only 14K tests, but optimized to 1K tests (twice as many). The former rendered the XML in 12 s, whereas the latter took 25 s, indicating that response times were determined by the size of the rendered XML file rather than by the complexity of test generation. In a second comparison, we remotely piped the results of test generation into test grounding to Java (an additional processing stage) but returned the results without any syntax highlighting. The response times were an order of magnitude faster at respectively 1.5 and 2 s, indicating that local rendering was the dominant cost. Table IV gives a better indication of the raw performance of the (most costly) test generation algorithm. We conducted Java micro-benchmarking experiments on a laptop with an Intel Core i7 1.8GHz processor with 16GB RAM. The test program took time-checkpoints after test generation (with all optimizations), after saving the XML file and after generating the Java grounding for each of the examples. The average timings shown in Table IV were obtained over ten runs for each set of measurements, to ensure that the Java code had been properly loaded and exercised, but not to the point of invoking the Just-in-Time peephole optimizer. The various split timings are as expected for the complexity of each task.

Threats to validity
Results are reported for the seven specifications and the described partner platforms that were made available during the BrokerCloud project [4]. It was not reasonable to create a larger base of test examples in the time available, due to the individual nature of partner services and the need to develop a specific specification for each. Otherwise, we identified the following threats to validity [46].
The threat to internal validity concerns the possibility that the BrokerCloud VTTS tools are faulty. To mitigate this, we used code inspections and modular testing during development and then exercised the tools on a wide variety of specifications having different extreme properties. Verification and testing results were compared with expected theoretical results. The source code has been made publicly available, along with a user manual and example specifications, allowing others to reproduce our results.
The threat to construct validity concerns the possibility we did not measure properties that are of interest. To counter this, we included execution times and metrics describing the size of the DSXM (numbers of states, transitions, operations and guarded scenarios) relating these to metrics describing the cost of testing (path length to achieve DSXM coverage, baseline number of tests, numbers of culled infeasible and redundant tests, retained single-objective and multi-objective tests), with cost savings expressed as a percentage reduction of the baseline test suite.
The threat to external validity concerns the possible failure to generalize from the results of the seven case studies. This is inevitable because we have no way of sampling from the population of real case studies in a uniform manner. However, we did seek to use real case studies provided by industrial partners, which covered a spread of types of web service, including traditional client/server, modern rich-client services and a mixture of SOAP/WSDL, REST/JSON and HTML/Selenium implementations. The specifications also reflected widely varying designs, some state-contingent and others memory-contingent, showing the related benefits of different optimizations. We are therefore confident that the approach is applicable in a wide domain of applications.

RELATED WORK
A number of surveys have reviewed testing tools and methods relevant to this work [30, 31, 47-49]. While some are more general reviews of state-based [47] or model-based [49] testing tools, contrasting their approaches to specification, test generation and overall support [49], others consider service-oriented architectures explicitly [30, 31], looking at the adequacy of web-service specifications for testing, or the merits of active testing (with explicit test suites) versus passive testing (non-interventionist observation of traces) [48].

Testing cloud and service-oriented systems
The systematic reviews of state-based testing tools [47,49] analyse several tools from different perspectives: what test criteria are supported, what scaffolding criteria are supported (for the generation of test drivers, test stubs and test oracles) and what support is given for related activities (such as model creation, verification, debugging and test execution). The authors concluded that, while a majority of the surveyed tools can generate abstract tests to satisfy simple criteria (e.g. state and transition coverage), very few support more complex criteria (like round-trip path coverage, or full predicate exploration), and data criteria are seldom supported. Many of the surveyed tools only offer partial support for test scaffolding construction, namely, they create test stubs that the tester must fill out by hand. One exception is Weissleder's tool ParTeG, which creates adapters and oracles for Java, as complete JUnit test cases. None of these surveyed tools were directed towards generating concrete tests for cloud-based services.
The survey by Cavalli et al. [48] acknowledges FSM-based testing as a sound basis for conformance testing in the cloud. The authors review the many variations of Chow's W-method [19] that infer system states from distinguishable input/output sequences but note that unique input/output sequences may not always exist. In passive testing, they survey approaches that use EFSMs as oracles working in parallel, during the normal execution of services, or offline, analysing collected traces post hoc. Passive approaches are less invasive but are unfortunately not complete, because actual service usage may only cover a subset of states and transitions. One such approach is Núñez and Hierons' cloud-centric adaptation of metamorphic testing [50], which infers faults from discrepancies between successive observations across multiple VM instances.
Canfora and Di Penta [30] provide an excellent survey on verification and testing approaches suitable for service-oriented architectures. They cover both functional and non-functional issues, highlighting how some traditional testing assumptions are violated by dynamic service discovery and late binding. This poses problems for performance testing (the context of service consumption cannot be guaranteed) and integration testing (the integration may not be known until late). For functional testing, they highlight the lack of observability as a problem, also the cost of repeated execution of test sequences. They survey a number of approaches for unit testing, using WSDL to inform the selections of test inputs and using BPEL (Business Process Execution Language) to extract models of component service behaviour but conclude that existing web service languages need augmentation with testing facets (to suggest test cases), functional descriptions (to express behaviour in more detail) and dependency analysis (to relate input/output pairs).
Bozkurt et al. [31] cover model-based testing (MBT) for service-oriented systems in more detail. They endorse the findings of Canfora and Di Penta, highlighting the inadequacy of WSDL when used as a model for test data generation. Early service-oriented testing approaches considered only WSDL interfaces [51] or REST protocols [52] as the basis for test generation, which lacked semantic content. They survey approaches that augment WSDL with semantic information, such as OWL-S (Web Service Ontology Language) in order to describe the semantic effects of executing operations but otherwise conclude that while most of the surveyed approaches successfully automate test execution and the generation of test data, few of them can automatically generate adequate test oracles (from the impoverished models). Workarounds that provide oracle information include Heckel and Lohmann's [53] augmentation with pre-generated design contracts, whereas Tsai et al. [54] require multiple implementations of the same service in order to detect delta differences (cf. metamorphic testing [50]). Many of the surveyed MBT approaches synthesized the models directly from the services, so were more likely to test the consistency of models generated by translation than test whether the code actually obeyed an independent specification. One exception was Sinha and Paradkar's [32] creation of an extended finite state machine (EFSM) specification from WSDL-S descriptions, which facilitated a number of test generation approaches, including symbolic execution for full predicate coverage and projection coverage using user-specified coverage goals.
Other recent work with the same goal of testing cloud services from abstract specifications includes an automated approach that generates useful tests from algebraic specifications [55] but does not especially address test-coverage issues and a high-level method for specifying different kinds of robustness tests for a cloud platform, based on input validation, or state-space exploration, or concurrent access stress-testing [56], which does not address full test automation.

Extended finite-state machine testing
Our own approach follows in the tradition of using EFSM specifications to define the test coverage criteria; but we also seek to generate full concrete test data and oracle assertions automatically. Other authors have modelled service-oriented applications as some kind of finite state machine (FSM) for the sake of test generation but have not always achieved all three of the aforementioned goals. We discuss some of these approaches hereafter.
The simple FSM approach by Endo and da Silva Simão [57] abstracts strongly over the service, in that it models the states and transitions of the FSM, but elides any more detailed functional description. We prefer the EFSM, in particular Laycock's Stream X-Machine [14], for its ability to model complex memory datatypes and realistic functional processes. While Wu and Huang recognize the superiority of EFSMs to handle memory [33], they only generate symbolic paths through the automaton, rather than concrete test cases with real inputs. Bertolino et al. [34] use the UML protocol state machine diagram and convert this into a Symbolic Transition System (an extension of Labelled Transition Systems with guards on symbolic memory), in order to generate symbolic test paths. Dranidis et al. specify systems directly as Stream X-Machines [40], whose transitions are modelled as functions operating on inputs, outputs and memory. Their tool JSXM generates fully executable JUnit tests for Java classes (POJOs); but to achieve this, the specification must include snippets of executable Java code for the functions, which are then pasted into the generated tests, whose coverage is motivated by the SXM specification.
The starting point of Ma et al. [58] is slightly different, in that their goal is to find suitable inputs to test services specified using BPEL. They describe a detailed manual transformation to convert BPEL descriptions into an SXM, whose desirable properties for test generation they acknowledge [14], but this is as far as the work goes. In a similar vein, Sun et al. [59] seek to attach EFSMs to WSDL descriptions, in order to use the power of EFSMs to motivate test generation with suitable coverage. They describe a procedure for partially converting WSDL descriptions but find unsurprisingly that the relevant semantic information has to be added by hand.
Sinha and Paradkar [32] first suggested augmenting WSDL-S interfaces with functional semantics expressed in the IOPE style (input, output, precondition and effect). They chose SWRL (Semantic Web Rule Language) to express the IOPE conditions and described a procedure to generate a testable EFSM. However, they were only able to derive an EFSM with one state and many guarded transitions. Ramollari et al. [26] were more successful in synthesizing a multi-state Stream X-Machine from semantically annotated WSDL (SAWSDL). They chose RIF-PRD (Rule Interchange Format - Production Rule Dialect) to describe the IOPE semantics of the system's functions. Their algorithm generated all the high-level states of an SXM automatically by observing the domain partitioning effect of operation guards on state variables. This approach came the closest to extrapolating the complete SXM automatically (they lacked a reasoning component to relate the prior and posterior states of memory).

of 38 SIMONS AND LEFTICARU
Endo et al. [60,61] adopt a related approach to test coverage using event sequence graphs (ESG). Their tool ESG4WSC (ESG for web service compositions) successfully generates positive and negative test cases [60], and they report results from industrial web service testing [61]. Endo et al. plan to investigate how to evolve ESG4WSC models in the direction of state machines, so that they can take states into account as well as events, and also handle more complex behaviour. Bertolino et al. [34] continue to use Symbolic Transition Systems in their PLASTIC framework for testing service-oriented applications [62], which share many similarities with the state machine model.
Others have used UML as the basis for state-based testing methods but do not address service-oriented issues. Holt et al. [63] use the model-based testing tool TRUST and evaluate six state-based coverage criteria in an industrial case study. Khalil and Labiche [64] generate test trees from UML state machine diagrams, extract test cases and compose test suites. Hasanan and Labiche [65] test real-time embedded systems, using random generation from RTEdge models combined with SPIN model-checking to discover test cases missed by random generation.

Improvements over similar approaches
Our approach is closest to Dranidis et al. [40], in that we also write our specifications directly as Stream X-Machines and generate concrete executable tests. Their JSXM tool (Java Stream X-Machine) automates DSXM testing for Java classes. A JSXM specification is created in a mixture of technology-neutral XML describing the state machine, and executable Java describing the functions, whereas we follow a fully abstract approach to specifying functions. Like Sinha and Paradkar [32] and Ramollari et al. [26], we consider IOPE semantics to be the right level of abstraction at which to express functional behaviour, because it is minimal and elegant. This contrasts, for example, with Tsai et al. [36], who extended WSDL with paired input-output dependencies, invocation sequences, hierarchical function descriptions and sequence specifications, which in our view detracts from the desirable abstraction of a specification.
Another difference between our approach and Dranidis et al. [40] is that we are able to verify that the specification is non-blocking and deterministic, before it is submitted for test generation; that is, we check that the specification satisfies the assumptions of the testing method. Our verification tool performs symbolic reasoning upon the IOPE specification, for which having explicit models of the functions is essential. (Dranidis et al. are working on a separate reasoning component for JSXM; personal communication.) Furthermore, our test generator is able to reason fully about the whole DSXM specification, which solves many of the issues that arise with infeasible sequences, when these are generated from the automaton alone. JSXM requires the designer to be more careful in choosing a realizable state cover and feasible separating sets to avoid blocking issues when confirming the identity of reached states [40]. Our test generator simply alerts the designer if the blocking effects of guards cause loss of model coverage. We finesse the issue of state verification through a reliable oracle. Generally, our approach provides more support and requires less expertise to use.
Our DSXM test generator probably has one of the most ambitious test-pruning algorithms. Other SXM-based testing approaches [15-17] avoid discussion of how to treat infeasible sequences, assuming that test generators must produce them and that they will simply block during execution. We are able to detect infeasible sequences early by simulating the whole DSXM (including memory states and guards). Our pruning of idempotent paths containing trivial prefix cycles is new, based on an assumption that is discharged through testing. Our compression of test sequences by merging into multi-objective sequences is only possible because all state, function and output assertions are side-effect free. We do not force the DSXM through further state-distinguishing sequences, which produce divergent test paths offering far fewer opportunities to merge sequences.
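The prefix-merging idea behind multi-objective compression can be sketched as follows (an illustrative simplification: the real tool merges by shared test objectives, not merely by literal prefixes, and the event names are hypothetical):

```python
def merge_objectives(paths):
    """Drop any test path that is a literal prefix of a retained longer
    path, on the paper's assumption that state, function and output
    assertions are side-effect free, so one longer run can discharge
    several test objectives along the way."""
    keep = []
    for path in sorted(paths, key=len, reverse=True):
        if not any(kept[:len(path)] == path for kept in keep):
            keep.append(path)
    return keep

tests = [("put",), ("put", "get"), ("put", "get", "delete"), ("put", "put")]
merged = merge_objectives(tests)
print(merged)  # [('put', 'get', 'delete'), ('put', 'put')]
```

Here four single-objective tests collapse into two runs, because the objectives of ("put",) and ("put", "get") are checked in passing by the longest path.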
Recent developments in Stream X-Machines include Ipate's combined method for testing component machines and their system-level integration in parallel [41]; the TXStates domain-specific language for multi-agent systems [66], which supports specifying agents as Stream X-Machines in NetLogo; and a timed extension to the SXM formalism [67], for which we are not yet aware of any testing method.

CONCLUSIONS AND FUTURE WORK
The first innovation of this work is the creation of a Deterministic Stream X-Machine (DSXM) specification model that exposes not only the abstract state-transition behaviour of the automaton but also the concrete input, output, precondition and effect (IOPE) behaviour of operations acting upon memory. Our tools reason explicitly about the blocking effects of guards on memory as well as blocking due to missing transitions in the automaton. The specification model also meets our goals for a web-transmissible format that may be used by different verification and testing tools at distributed locations in the cloud.
The second theoretical innovation is that simulation of the whole DSXM supports generation of complete test suites that are known to be feasible by construction. This overcomes the earlier problems found with test suites generated from the automaton alone. It means that the designer no longer has to worry about manually selecting a realizable state cover or separating sets to identify states and reduces the need for special inputs to ensure controllability. The tool will simply report if states or transitions are not eventually covered. This is better than approaches that determine test coverage from the automaton alone and later filter out infeasible tests, without any guarantee that coverage is still complete.
A third theoretical innovation is the verification algorithm for determinism and completeness, which converts a potentially large partition-finding problem in the state space of input and memory variables into a compact, finite symbolic checking problem, using conjunction terms to represent input and memory spaces and symbolic subsumption in lieu of concrete execution. Furthermore, we believe that this is the first alliance of such a verification tool with DSXM test generation, which verifies the testing method's assumption of a non-blocking and deterministic specification. This tool also meets our goals of aiding engineers to write correct specifications.
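The flavour of this symbolic check can be sketched in miniature (a deliberately tiny illustration with Boolean guard terms and hypothetical scenario names; the actual tool reasons over richer input and memory partitions via symbolic subsumption):

```python
from itertools import product

TERMS = ["present", "locked"]   # hypothetical guard terms for one operation

scenarios = {                   # hypothetical scenarios; omitted terms are
    "op/ok":      {"present": True,  "locked": False},   # "don't care"
    "op/blocked": {"present": True,  "locked": True},
    "op/absent":  {"present": False},
}

def subsumes(guard, partition):
    """A guard symbolically covers a partition if every term it
    constrains agrees with the partition's valuation."""
    return all(partition[t] == v for t, v in guard.items())

def verify(scenarios):
    """Check completeness (at least one scenario per partition) and
    determinism (at most one) by finite symbolic enumeration."""
    problems = []
    for values in product([True, False], repeat=len(TERMS)):
        partition = dict(zip(TERMS, values))
        hits = [name for name, g in scenarios.items() if subsumes(g, partition)]
        if not hits:
            problems.append(("incomplete", partition))
        elif len(hits) > 1:
            problems.append(("nondeterministic", partition, hits))
    return problems

print(verify(scenarios))  # [] -- every partition is hit exactly once
```

Deleting the op/absent scenario from this sketch would make the two partitions with present=False uncovered, which is exactly the kind of incomplete specification the verifier reports to the designer.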
The fourth set of theoretical and engineering innovations are the three optimizations in test generation, which mitigate the problem of exponential growth in the size of test suites as tested paths grow longer. In particular, the prefix complete and path complete properties of the test generator support pruning of infeasible sequences and sequences with trivial prefix cycles. The verification method ensures that a witness remains for paths ending with blocking guards. The assumed idempotence of sequences with and without trivial prefix cycles is discharged by testing. These optimizations attack different kinds of test growth in systems with more guards or more states. The merging of test sequences with shared test objectives is only made possible by eliminating the use of the mutating characterization set W to determine states.
A fifth innovation lies in the novel solution to the hypothetical test function that maps sequences onto test inputs. The test input constraint, supplied as part of the specification, guides the constraint solver to provide test inputs that, under suitable memory conditions, will cause a transition eventually to fire. Abstract test inputs are then converted by test grounding into any desired execution format.
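The role of the test input constraint can be illustrated with a trivial stand-in for the constraint solver (the guard, the memory layout and the docID values here are all hypothetical):

```python
def solve(constraint, domain):
    """Return the first concrete input satisfying the guard constraint,
    or None -- a trivial stand-in for the tool's constraint solver."""
    for candidate in domain:
        if constraint(candidate):
            return candidate
    return None

# Hypothetical guard for a putDocument/update transition: the supplied
# docID must already be present in memory for the transition to fire.
memory = {"docs": {"d1", "d2"}}
update_guard = lambda doc_id: doc_id in memory["docs"]
print(solve(update_guard, ["d0", "d1", "d2"]))  # d1
```

The abstract input chosen this way is what test grounding later converts into a concrete SOAP, REST or rich-client invocation.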
The different tools were evaluated in practice by members of the BrokerCloud consortium. During the development of the case study examples reported in Section 6, we collected informal feedback from industry partners SAP SE (Karlsruhe) and SingularLogic SA (Athens). Developers were surprised by their tendency to write incomplete specifications, either missing scenarios or making mistakes in the guards for the default "otherwise" scenario. In this respect, the verification tool proved useful in helping them to think more rigorously about equivalence partitions of input and memory. Sometimes, deciding how to represent the problem-state abstractly using Set, List and Map types was a challenge. Sometimes, choosing the best operation decomposition into scenarios was difficult: for example, the ContactList service needed four scenarios for removing an entry, to capture all possible selection states in the GUI after a deletion.
The same respondents agreed that test generation produced comprehensive test suites. Automatically generated tests detected subtle faults that their in-house QA procedures had not found. The most common kinds of extra fault detected were incorrect state transfer, leaving the application in the wrong state [27], and wrongly exposing internal variable information after a trivial cycle transition, which was a security vulnerability. In general, the exhaustive testing capability of the tool was far superior to in-house testing approaches that were based on walking through a finite number of end-user scenarios. For the sake of test generation, developers found it intuitive to specify test input constraints as part of the specification. For services with greater memory-contingency, it was important to create enough data-entry scenarios that would load memory in all the distinct ways that would allow guards to be triggered. When creating a bespoke grounding to Selenium [27], the abstract test suite provided all necessary information to combine with an external DOM to generate the Selenium test script.
Future work arising from this project includes the further development of user-oriented tools for creating and editing specifications; the wider public offering of Testing as a Service to help increase the quality of brokered software services in the cloud; and also research into new areas enabled by having a completely modelled cloud service specification. One attractive future research area is to investigate the testability of service compositions. The current flat state model could be extended to permit decomposed states containing sub-state machines and composed models could be mapped by model transformation to equivalent flat models using UML or Statechart semantics. In this way, a single sign-in process for a given cloud platform could be wrapped around an arbitrary provided service and the additional test obligations determined automatically from the model composition. This could eventually support an incremental test generation approach for composed software services in the cloud.