Polymorphic bytecode instrumentation

Bytecode instrumentation is a widely used technique to implement aspect weaving and dynamic analyses in virtual machines such as the Java virtual machine. Aspect weavers and other instrumentations are usually developed independently, and combining them often requires significant engineering effort, if it is possible at all. In this article, we present polymorphic bytecode instrumentation (PBI), a simple but effective technique that allows dynamic dispatch amongst several, possibly independent instrumentations. PBI enables complete bytecode coverage, that is, any method with a bytecode representation can be instrumented. We illustrate further benefits of PBI with three case studies. First, we describe how PBI can be used to implement a comprehensive profiler of inter-procedural and intra-procedural control flow. Second, we provide an implementation of execution levels for AspectJ, which avoids infinite regression and unwanted interference between aspects. Third, we present a framework for adaptive dynamic analysis, where the analysis to be performed can be changed at runtime by the user. We assess the overhead introduced by PBI and provide thorough performance evaluations of PBI in all three case studies. We show that pure Java profilers like JP2 can, thanks to PBI, produce accurate execution profiles by covering all code, including the core Java libraries. We then demonstrate that PBI-based execution levels are much faster than control flow pointcuts to avoid interference between aspects and that their efficient integration in a practical aspect language is possible. Finally, we report that PBI enables adaptive dynamic analysis tools that are more reactive to user inputs than existing tools that rely on dynamic aspect-oriented programming with runtime weaving. These experiments position PBI as a widely applicable and practical approach for combining bytecode instrumentations. © 2015 The Authors. Software: Practice and Experience Published by John Wiley & Sons Ltd.


INTRODUCTION
Virtual machines for safe languages, such as the Java virtual machine (JVM) or .NET, execute platform-independent code: bytecode in the case of the JVM and CLI code in the case of .NET. Many recent programming languages are compiled to virtual machines. For example, Java, Scala, or JRuby programs are compiled to JVM bytecode, and C# programs are compiled to CLI code. Furthermore, there are compilers for recent languages for the partitioned global address space programming model, such as X10 [1] or Fortress [2], which target the JVM.
Instrumentation and manipulation of platform-independent code (subsequently called bytecode instrumentation) are key techniques for the implementation of various tools and frameworks. For example, many dynamic program analysis tools, such as profilers and data race detectors, rely on bytecode instrumentation. Many aspect-oriented programming (AOP) languages, like AspectJ [3], are implemented using bytecode instrumentation [4]. Because bytecode instrumentation has become so central for tool and framework development, modern virtual machines offer dedicated support. For instance, the JVM supports bytecode instrumentation with the JVM tool interface [5] and with the API in the package java.lang.instrument. In addition, there are numerous instrumentation libraries for Java bytecode, such as BCEL [6], ASM [7], or Javassist [8], as well as for other languages [9].
Typically, tools relying on bytecode instrumentation are separately developed. Composing several bytecode instrumentations is usually not foreseen and difficult to achieve. However, flexible composition of multiple bytecode instrumentations can enable many interesting applications. For instance, a memory leak detector can analyze a profiler at work. Even if implemented by the same instrumentation tool, interesting compositions like self-application (e.g., a race detector analyzing itself) or adaptive dynamic analysis are often out of reach. An adaptive dynamic analysis tool allows the user to select between different analyses for different parts of a program at runtime, thereby avoiding excessive overhead resulting from applying all analyses at the same time for the overall program. JFluid [10] is a good example of an adaptive profiler.
In this article, we present polymorphic bytecode instrumentation (PBI), a novel technique that allows several, possibly independent bytecode instrumentations to coexist and to select dynamically which instrumentation takes effect. First, different bytecode instrumentations are applied in isolation to a program class. Afterward, a PBI framework merges the resulting instrumented classes into a single class that holds the code for all applied bytecode instrumentations. For each method, PBI introduces a dispatcher in order to select the desired version of the code at runtime. Because the dispatch logic is customizable, PBI is applicable in a wide range of scenarios.
In addition, PBI also enables bytecode instrumentation of shared libraries, that is, of libraries that are used by the base program as well as by code inserted through instrumentations. A good example of a shared library is the core class library of the considered language, such as the Java class library: in Java, almost any base program invokes methods in the core class library, and many bytecode instrumentations insert code that needs to call some methods in that library. If inserted code invokes already instrumented methods, infinite regression can easily happen. By preventing infinite regression, PBI enables instrumentations with complete bytecode coverage; that is, any method that has a bytecode representation is amenable to bytecode instrumentation, including methods in the core class library. As a special case, aspect weavers implemented with PBI are capable of weaving aspects with complete coverage; this is in contrast with mainstream weavers, such as the standard AspectJ weaver and abc [11].
Polymorphic bytecode instrumentation is a general technique that is applicable to any intermediate language. In this article, we focus on CodeMerger, our PBI framework for Java bytecode. The main contribution of this article is PBI, a general and widely applicable technique that eases the development of instrumentation-based tools, such as software engineering tools for various dynamic program analysis tasks (e.g., profiling, debugging, testing, program comprehension, and reverse engineering) and for implementing advanced AOP frameworks. More concretely, the original, scientific contributions of this article are as follows:
1. We present PBI, a simple and effective technique to dynamically dispatch amongst multiple bytecode instrumentations (Section 2).
2. As a first application, we show how PBI enables instrumentation with complete bytecode coverage without disrupting the virtual machine bootstrapping phase (Section 3).
3. We explain the implementation of CodeMerger (Section 4).
4. We describe three substantial case studies of PBI. First, we use PBI to implement JP2, a comprehensive profiler of inter-procedural and intra-procedural control flow (Section 5). This case study focuses on achieving instrumentation with complete bytecode coverage with the aid of PBI. Second, we use PBI to support execution levels [12] in AspectJ (Section 6), thereby enabling black-box composition of dynamic analysis aspects in multiple ways. Third, we leverage PBI to implement adaptive dynamic analysis tools, where the dynamic analysis to be performed can be changed at runtime for different parts of the base program (Section 7).
5. We thoroughly evaluate our PBI implementation for Java (Section 8). First, we explore the overhead introduced by PBI dispatch logic and code bloat. Afterwards, we evaluate PBI in the three case studies. We report on the performance of JP2 and its ability to cover execution of the core Java libraries. Our evaluation then shows that PBI-based execution levels are much more efficient than equivalent control flow pointcuts to avoid interference between aspects and are generally as efficient as the standard AspectJ weaver when applying analysis aspects on the DaCapo benchmark suite. Finally, we demonstrate that PBI enables adaptive dynamic analysis tools that react more quickly to user inputs than existing tools that rely on dynamic AOP with runtime weaving.
Section 9 discusses prior, ongoing, and related work. Section 10 concludes. This article extends and refines the work initially presented in [13]. The new content of this article covers (1) improved deployment options for PBI and an extended discussion of technical details (Sections 4.1 and 4.2); (2) two additional alternative implementations of PBI that solve some serious performance issues (Section 4.4); (3) an additional case study in the profiling domain (Section 5); (4) evaluation results for the new PBI implementations and for the new case study (Sections 8.2 and 8.3); and (5) a new evaluation of PBI-based execution levels, taking both start-up and steady-state performance into account (Section 8.4).

POLYMORPHIC BYTECODE INSTRUMENTATION
Many tools make use of bytecode instrumentation to achieve different goals. PBI is a general technique to allow these instrumentations to coexist and to select dynamically which instrumentation takes effect. The name polymorphic stems from the parallel with polymorphic method calls, where the actual code to be executed is chosen dynamically according to some dispatch mechanism. However, as opposed to typical polymorphic calls, the dispatch logic in PBI is not fixed but customizable.
Here, a bytecode instrumentation is considered purely augmentative, meaning it may insert fields and methods, ‡ as well as modify method bodies, but it may not remove any field or method. PBI is applicable to a wide range of bytecode instrumentations, § which may be implemented with any instrumentation framework, not necessarily the same. A PBI framework is in charge of integrating these instrumentations, as explained later.

Polymorphic bytecode instrumentation overview
PBI enables dynamic dispatch between differently instrumented versions of a method at the granularity of individual method executions. The version of a method to be executed is decided upon method entry. After this selection, it is not possible to switch between differently instrumented parts of a method (e.g., it is not possible to execute a differently instrumented loop body in each loop iteration).
Let us consider $N \ge 1$ bytecode instrumentations that are applied to a base program class $C_{orig}$. Each instrumentation produces an instrumented class, denoted $C^{i}_{instr}$ ($1 \le i \le N$). These instrumented classes, as well as $C_{orig}$, are called class versions. A PBI framework takes these class versions and merges them into a single class denoted as $C_{merged}$ (Figure 1). There are $N + 1$ class versions considered for merging, where (typically) class version 0 corresponds to $C_{orig}$ and class version $i$ corresponds to bytecode instrumentation $i$ ($1 \le i \le N$).
At any single point in time, for a given computation step, only one code version is active. Polymorphism comes from dynamic dispatch between these versions. Notably, the actual dispatch logic is not fixed by the PBI framework. Rather, it is provided as input (as a computeCV() function) in addition to the code versions (Figure 1). The PBI framework uses this dispatch function to insert code that selects the specific version to execute at runtime.
‡ In this article 'method' stands for 'method or constructor'.
§ As we will explain in Section 4, PBI imposes some restrictions on bytecode instrumentations. Furthermore, PBI offers a special mechanism for initializing inserted static fields, which is not transparent to bytecode instrumentations.
Figure 1. Each method in $C_{merged}$ has a switch to select the code version to execute; the dispatch logic is defined in the function computeCV().
The merged class $C_{merged}$ generated by the PBI framework has all fields and methods that exist in at least one class version. For methods that have the same signature in different class versions, the corresponding method bodies are merged. We refer to the body of a method defined in class version $i$ as code version $i$ of that method. The merged method body starts with the dispatch logic, whose purpose is to jump to the code version to be executed. A PBI framework is free to decide how this jump is realized and where the different code versions effectively reside.

Dynamic dispatch
Support for dynamic dispatch between code versions is at the heart of PBI. It offers the necessary flexibility to use PBI in a wide range of scenarios, such as complete bytecode coverage (Sections 3 and 5), execution levels for AOP (Section 6), and adaptive dynamic analysis tools (Section 7).
The case studies developed here highlight two different kinds of dispatch logic, based either on (global) state or on control flow. More precisely, both kinds of dispatch logic are typically composed. In the former case, dispatch depends on some value that is accessible to all threads, and so, changes between code versions are global. ¶ This is used, for instance, for adaptive dynamic analysis, where the user globally selects which variant of the analysis to apply. In the latter kind of dispatching, thread-local state is used, thereby allowing different threads to concurrently execute different code versions. This is needed for execution levels, among others.
Also, our case studies show that dispatch logic is typically common to all classes in a program, although some optimizations are applicable to reduce the complexity of dispatch for certain classes [15]. Another case study, described elsewhere [13], explores support for dynamic mixin layers using PBI, illustrating a good scenario for class-specific dispatch.

Polymorphic bytecode instrumentation for Java: CodeMerger
Our implementation of PBI for Java bytecode is called CodeMerger. The most recent version of CodeMerger uses ASM [7]. The developer of an instrumentation using PBI has to provide each input class version as a pair consisting of the Java class file (represented as a byte array) and the desired version number. The original class $C_{orig}$ is specially marked, allowing CodeMerger to verify whether certain constraints that will be explained later are met. Each input class must have a unique integer version number; the numbering need not be contiguous. While in many cases it is convenient to assign $C_{orig}$ version number zero, there are also some scenarios where it is convenient to assign it a different version number; an example will be given in Section 6. The function computeCV() holding the custom dispatch logic must be provided as a static method in a separate class file. The merged output class is a Java class file.
¶ In the case of Java, the typical approach is to use fields that are public, static, and volatile, such that their states can be altered asynchronously and the new states become visible to all threads (according to the happens-before relation for volatile writes and reads guaranteed by the Java memory model [14]).

Listing 1 illustrates the generated code pattern for a merged method body. Here, we assume that $C_{orig}$ is assigned version number $v_0$ and $C^{i}_{instr}$ is assigned version number $v_i$. CodeMerger extracts the body of computeCV() and inlines it in the beginning of each merged method. The resulting code version is obtained as an integer, and then, a switch statement dispatches to the appropriate code version. All code versions of a method are simply concatenated in the merged body. Jumping to an unknown code version raises an error at runtime.
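The pattern described above can be approximated at source level as follows. This is only an illustrative sketch, not CodeMerger output: CodeMerger operates on bytecode and inlines the body of computeCV(), whereas the sketch calls it as a method. The class name MergedExample, the method compute, the fields activeVersion and invocations, the version numbers 0 and 1, and the choice of IllegalStateException for an unknown version are all assumptions made for the example.

```java
// Hypothetical source-level rendering of a merged method body.
// All names and version numbers are assumptions for illustration.
final class MergedExample {
    // Assumed global state consulted by the dispatch logic.
    static volatile int activeVersion = 0;

    // Counter that only the (assumed) instrumented code version updates.
    static int invocations = 0;

    // Stand-in for the user-provided dispatch function.
    static int computeCV() {
        return activeVersion;
    }

    static int compute(int x) {
        // Dispatch logic: selects the code version upon method entry.
        switch (computeCV()) {
            case 0:
                // Code version 0: original method body.
                return x * 2;
            case 1:
                // Code version 1: instrumented method body.
                invocations++;
                return x * 2;
            default:
                // Unknown code version: fail at runtime.
                throw new IllegalStateException("unknown code version");
        }
    }
}
```

Because the version is chosen once at method entry, a change of activeVersion during an execution of compute only affects later invocations.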
If a method exists in more than one class version, PBI requires that its modifiers (i.e., abstract, final, native, static, synchronized, public, protected, private, and strictfp) are the same in each class version. When using PBI with independently developed bytecode instrumentations, it is important to ensure that there is no undesired merging of methods with the same signature. Typically, methods inserted by different bytecode instrumentations need to be renamed (by the developer who implements the PBI-based switching logic) before merging so as to avoid name clashes.
Inserted fields must have different names in each class version, so it may be necessary to rename them to avoid name clashes. Consequently, only the fields in the original class $C_{orig}$ exist in all class versions and are preserved (without any replication) in the merged class $C_{merged}$. More details about CodeMerger, such as field initialization, are described in Section 4.

COMPLETE BYTECODE COVERAGE
Many applications of bytecode instrumentation require complete bytecode coverage in order to function properly. For instance, a profiler needs to be able to track computation occurring in the core language libraries as well as in application code. Binder et al. proposed a solution to this issue, albeit in an ad hoc manner [16]. The general technique of PBI subsumes this previous approach.
Instrumentation with complete bytecode coverage implies that every method that has a bytecode representation (i.e., every non-abstract and non-native Java method) must be amenable to bytecode instrumentation, including methods in the Java class library and in dynamically downloaded or generated classes. Full coverage of the Java class library is delicate because of two issues:
1. The instrumentation must not break JVM bootstrapping, for instance, by triggering premature initialization of classes used by inserted code.
2. Code inserted by the instrumentation must not cause infinite regression when invoking methods in the (instrumented) Java class library.
By allowing us to keep both the instrumented method bodies (class version 1) and the original unmodified method bodies (class version 0) of the Java class library and dispatching amongst them dynamically, PBI solves both of these issues, as explained hereafter.

Bootstrap with an instrumented Java class library
Many current JVMs are very sensitive to the order in which some core classes in the Java class library (e.g., java.lang.Object, java.lang.String, or java.lang.Thread) are initialized. In such classes, code inserted by a bytecode instrumentation may change the class initialization order when bootstrapping, typically causing a JVM crash. Because the JVM specification mandates lazy class initialization (JVM Specification, Second Edition, Section 5.5 [17]), inserted code that is not executed during bootstrapping does not change the class initialization order. Hence, we can use PBI in order to execute only the original code version of invoked methods as long as the JVM is bootstrapping. Dispatch is therefore based on a global state that indicates whether the JVM has completed bootstrapping. Class BootstrapState (Listing 2) maintains that global state in a static volatile flag. The flag is toggled (by an invocation of signalEndOfBootstrap()) before the base program main class is initialized. This can be achieved by calling signalEndOfBootstrap() in the premain(...) method of a Java agent (package java.lang.instrument). Because the flag is volatile, all threads are guaranteed to see the new state of the flag, thanks to the semantics of volatile field access specified by the Java memory model [14]. This state-based dispatch is simple: computeCV() selects the instrumented code version (version 1) if BootstrapState.bootstrapCompleted() returns true and the original code version (version 0) otherwise. That is, the instrumented code version is only used after the JVM has completed bootstrapping. Note that our approach will cause initialization of class BootstrapState during bootstrapping. However, that class has no static initializer, and reading the Boolean flag during bootstrapping does not trigger any other class initialization. Our approach has been thoroughly tested on many versions of Oracle's HotSpot virtual machines (VMs) and IBM's J9 VM.
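As a minimal sketch of the state-based dispatch just described, the following code is reconstructed from the prose and is not the article's Listing 2. The class and method names BootstrapState, signalEndOfBootstrap(), and bootstrapCompleted() come from the text; the field name completed, the class StateBasedDispatch, and the numbering (0 for original, 1 for instrumented) are assumptions.

```java
// Sketch reconstructed from the description; not the original listing.
final class BootstrapState {
    // Volatile so that the write in signalEndOfBootstrap() becomes visible
    // to all threads. There is deliberately no explicit initializer: the
    // default value false is used, so no static initializer is generated.
    private static volatile boolean completed;

    private BootstrapState() {}

    static boolean bootstrapCompleted() {
        return completed;
    }

    // Intended to be called once, for example from the premain method
    // of a Java agent, before the base program main class is initialized.
    static void signalEndOfBootstrap() {
        completed = true;
    }
}

// Hypothetical holder of the dispatch function.
final class StateBasedDispatch {
    // Assumed numbering: 0 = original code version, 1 = instrumented.
    static int computeCV() {
        return BootstrapState.bootstrapCompleted() ? 1 : 0;
    }
}
```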
While in prior work [16] the access to the volatile flag upon each invocation of a method in the Java class library introduced significant extra overhead, some recent state-of-the-art JVMs, such as Oracle's HotSpot Server VM, enable us to completely eliminate that overhead. If the JVM supports class redefinition (i.e., dynamic class redefinition with the aid of code hotswapping), method signalEndOfBootstrap() can replace class BootstrapState with a version where bootstrapCompleted() returns the constant true. Thanks to just-in-time compiler optimizations, the overhead due to the check can be completely eliminated.

Preventing infinite regression
If code inserted by an instrumentation invokes some instrumented methods in the Java class library, infinite regression can happen, because the invoked methods would also execute some inserted code. In order to prevent infinite regression, we can keep track of whether a thread is executing code in the control flow (i.e., in the dynamic extent) of inserted code, and if so, dispatch to the non-instrumented version of the code. To this end, we need to maintain Boolean control flow information for each thread.
Class ControlFlow (Listing 3) provides access to a Boolean, thread-local flag indicating whether the execution is in the dynamic extent of inserted code. We directly insert that flag in class java.lang.Thread as the public, Boolean instance field pbi_cflow. The control flow-based dispatch selects the original code version if the flag is set and the instrumented code version otherwise. Whenever inserted code may invoke instrumented methods, such as methods in the Java class library, it must first set the thread-local control-flow flag to true, and upon completion of the inserted code, it must restore the previous value. That is, in general, the developer of an instrumentation has to apply this code pattern (save the current flag value, set the flag to true, execute the inserted code, and restore the saved value) within inserted code that may invoke instrumented methods. In order to ensure that a bytecode instrumentation properly implements the aforementioned code pattern, the instrumentation may either be manually adapted, or some automated tool may be applied to detect inserted code and to enclose it with the operations that update the control flow information. For example, our aspect weavers MAJOR [18] and HotWave [19] generate the code pattern on code previously woven with the standard AspectJ weaver in a fully automated way.
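The flow-based dispatch and the save/set/restore pattern can be sketched as follows. This is not the article's Listing 3: the article stores the flag in a field inserted into java.lang.Thread, which cannot be reproduced on an unmodified JVM, so a ThreadLocal is substituted here. The accessor names get() and set(), the classes FlowBasedDispatch and InsertedCodeExample, the version numbering, and the use of try/finally to restore the flag are assumptions.

```java
// Sketch only. A ThreadLocal replaces the field that the article inserts
// into java.lang.Thread, so that the example runs on a standard JVM.
final class ControlFlow {
    private static final ThreadLocal<Boolean> IN_INSERTED_CODE =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    static boolean get() {
        return IN_INSERTED_CODE.get();
    }

    static void set(boolean value) {
        IN_INSERTED_CODE.set(value);
    }
}

// Hypothetical holder of the dispatch function.
final class FlowBasedDispatch {
    // Assumed numbering: 0 = original code version, 1 = instrumented.
    static int computeCV() {
        return ControlFlow.get() ? 0 : 1;
    }
}

final class InsertedCodeExample {
    // Records which code version a call made from inserted code would select.
    static int versionSeenByInsertedCode = -1;

    static void insertedCode() {
        boolean previous = ControlFlow.get();  // save the current flag value
        ControlFlow.set(true);                 // enter the dynamic extent of inserted code
        try {
            // Methods invoked here dispatch to their original code version.
            versionSeenByInsertedCode = FlowBasedDispatch.computeCV();
        } finally {
            ControlFlow.set(previous);         // restore on normal and abnormal completion
        }
    }
}
```

Restoring the saved value, rather than resetting the flag to false, keeps the pattern correct when inserted code is itself reached from within other inserted code.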

Composite dispatch
Each of the two issues of complete bytecode coverage, namely, JVM bootstrapping and infinite regression, requires a specific dispatch (state based and flow based, respectively). In order to support complete bytecode coverage properly, both dispatch logics must be composed: the instrumented code version is selected only if the JVM has completed bootstrapping and the current thread is not executing in the dynamic extent of inserted code; otherwise, the original code version is selected. In Section 5, we will present the details of our profiler JP2 that employs PBI with the composite dispatch logic presented here, in order to achieve complete bytecode coverage.
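A compact sketch of such a composite dispatch is given below. To keep the example self-contained, both pieces of state live in a single hypothetical class CompositeDispatch rather than in the BootstrapState and ControlFlow classes of the article; the field names, the version numbering, and the order of the two checks are assumptions.

```java
// Sketch of composite dispatch; all names here are hypothetical.
final class CompositeDispatch {
    // Stand-in for the global bootstrap flag (default value false).
    static volatile boolean bootstrapCompleted;

    // Stand-in for the per-thread control-flow flag.
    static final ThreadLocal<Boolean> inInsertedCode =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Assumed numbering: 0 = original code version, 1 = instrumented.
    static int computeCV() {
        // State-based part: only original code while the JVM is bootstrapping.
        if (!bootstrapCompleted) {
            return 0;
        }
        // Flow-based part: original code within the dynamic extent of inserted code.
        return inInsertedCode.get() ? 0 : 1;
    }
}
```

Checking the bootstrap flag first means that the per-thread state is not touched at all during bootstrapping in this sketch.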

CODEMERGER IMPLEMENTATION DETAILS
In this section, we describe our implementation of PBI for Java, CodeMerger. First, we explain the overall process of applying PBI with complete bytecode coverage in Section 4.1. Next, we discuss how CodeMerger handles the initialization of fields inserted by instrumentations; Section 4.2 addresses static fields, whereas Section 4.3 deals with instance fields. Finally, Section 4.4 considers the issue of overlong methods resulting from merging multiple code versions.

Build-time and load-time instrumentation
CodeMerger can be used for build-time, load-time, and runtime instrumentation. Build-time instrumentation takes place before the instrumented application is started. Load-time instrumentation intercepts class loading events and performs the instrumentation before a class is linked in the JVM. Runtime instrumentation takes place when an application is already running by redefining some previously linked classes. However, class redefinition is severely restricted in some state-of-the-art production JVMs, such as in Oracle's HotSpot VMs. For example, class redefinition may only replace method bodies but must not introduce any new methods or fields. Load-time and runtime instrumentation are supported by the JVM tool interface [5] and by the java.lang.instrument API.
The profiling case study presented in Section 5 uses CodeMerger at load time and at runtime. After the JVM has completed bootstrapping, the classes loaded during the bootstrapping phase are redefined with instrumented versions. Afterwards, all other classes are instrumented at load time.
In the other case studies (Sections 6 and 7), we use CodeMerger at build time and at load time. First, the whole Java class library is instrumented at build time. All other classes are instrumented at load time.

Initialization of static fields
According to the code pattern illustrated in Listing 1, exactly one code version is executed upon each invocation of a merged method. If an instrumentation inserts fields and initializes them to a value different from the default value of the corresponding type, the code pattern in Listing 1 would result in skipping the initialization of some inserted fields depending on the executed code version. On the one hand, skipping initialization of inserted fields can break invariants. On the other hand, requiring instrumentations to leave all inserted fields initialized to their default values would be too restrictive, because many existing bytecode instrumentations initialize inserted fields, in particular static fields. For example, the standard AspectJ weaver inserts static fields and initializes them to hold instances of type JoinPoint.StaticPart, holding reflective information of join points [4].
CodeMerger supports the initialization of inserted static fields with the special private static void method pbi_initClass(). If a class version needs to initialize inserted static fields, it must do so in its pbi_initClass() method, which in turn must be invoked at the end of its static initializer; the pbi_initClass() method must not be invoked from any other call site. Upon merging, the pbi_initClass() methods and the static initializers in the class versions are treated specially by CodeMerger. First, the pbi_initClass() methods are renamed by appending the class version number to the method name. In this way, the bodies of the pbi_initClass() methods will not be merged. Second, in each class version, the static initializer is extended to invoke the pbi_initClass() methods of all class versions in the end (if there is no static initializer in a class version, it is created). Consequently, after merging of the static initializers, the pbi_initClass() methods of all class versions will be executed, independently of the executed code version of the merged static initializer. That is, all inserted static fields will be properly initialized.
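The effect of this scheme on a merged class can be sketched at source level as follows. This is an illustration under stated assumptions, not CodeMerger output: the class MergedAccount, its fields, the two hypothetical instrumentations (class versions 1 and 2), and the fixed computeCV() are invented for the example. Only the method name pbi_initClass and the renaming by appending the version number follow the text.

```java
// Source-level sketch of a merged class whose class versions 1 and 2
// each insert a static field. All names except the pbi_initClass prefix
// are assumptions for illustration.
final class MergedAccount {
    // Field of the original class (class version 0).
    static int originalField;

    // Inserted static fields; they must not be declared final.
    static String insertedByVersion1;
    static int[] insertedByVersion2;

    // Stand-in dispatch function, fixed to the original code version here.
    static int computeCV() {
        return 0;
    }

    // Merged static initializer: whichever code version runs, it ends by
    // invoking the renamed pbi_initClass methods of all class versions.
    static {
        switch (computeCV()) {
            case 0:   // original initializer body
            case 1:   // initializer body of class version 1 (same effect in this sketch)
            case 2:   // initializer body of class version 2 (same effect in this sketch)
                originalField = 42;
                pbi_initClass1();
                pbi_initClass2();
                break;
            default:
                throw new IllegalStateException("unknown code version");
        }
    }

    // Renamed pbi_initClass() of class version 1.
    private static void pbi_initClass1() {
        insertedByVersion1 = "join point info";
    }

    // Renamed pbi_initClass() of class version 2.
    private static void pbi_initClass2() {
        insertedByVersion2 = new int[4];
    }
}
```

Even though only code version 0 of the static initializer executes in this sketch, both inserted fields hold their non-default values afterwards.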
Java requires static final fields to be initialized in the static initializer; it is not allowed to initialize them in another method that is invoked by the static initializer. Hence, it is not possible to initialize static final fields in pbi_initClass() methods. Consequently, static fields inserted by an instrumentation must not be declared as final.
In Section 3, we pointed out that during JVM bootstrapping, inserted code (and therefore also the pbi_initClass() method) must not be executed. CodeMerger solves this issue by treating inserted static fields and pbi_initClass() methods in the Java class library specially. For each instrumented class $C^{i}_{instr}$, the inserted static fields are moved into an extra class in the same package (private visibility is replaced with package visibility), the pbi_initClass() method becomes the extra class's static initializer, the invocation of pbi_initClass() in the static initializer of $C^{i}_{instr}$ is removed, and access to the inserted static fields by inserted code in methods in $C^{i}_{instr}$ is redirected to the static fields in the extra class. Consequently, during JVM bootstrapping, inserted code is not executed and the static fields in the extra classes are therefore not accessed. Because the JVM initializes classes lazily [17], it is guaranteed that the extra classes will not be initialized during JVM bootstrapping. Note that the introduction of extra classes is trivial for build-time instrumentation of the Java class library (the extra classes are simply added to the archive of the instrumented Java class library), whereas in general, it may not be possible to introduce extra classes at load time or runtime, because custom class loaders may not be able to find or may refuse to load the extra classes. However, for load-time and runtime instrumentation after the JVM has completed bootstrapping, CodeMerger does not introduce any extra classes.
Note that PBI is not transparent for instrumentations that insert static fields. In order to use CodeMerger's pbi_initClass() feature, existing bytecode instrumentations need to be refactored so as to initialize inserted static fields in the (inserted) pbi_initClass() method. In addition, final modifiers on inserted static fields must be removed. As an alternative, post-instrumentation transformations can be performed: for instance, in the case of the AspectJ weaver (Section 6), we apply post-weaving bytecode transformations to move the initialization code for inserted static fields of type JoinPoint.StaticPart from the woven static initializer into the pbi_initClass() method.

Initialization of instance fields
CodeMerger does not support initialization of inserted instance fields in the Java class library, as it would be impossible to guarantee that such fields are initialized during JVM bootstrapping. An inserted instance field must be initialized to the default value of the corresponding type. When CodeMerger is applied to AspectJ, this restriction implies that AspectJ's static crosscutting features cannot be fully supported. An inserted instance field can be lazily initialized by inserted code accessing the field, although this incurs extra overhead because of the necessary checks of whether the field has been initialized.

Dealing with long method bodies
The JVM specification [17] imposes several restrictions on class files, which can impair any application of Java bytecode instrumentation. For instance, method bodies must not exceed $2^{16}$ bytes (because indices in exception tables, line number tables, and local variable tables are unsigned 16-bit values). While such limitations affect any bytecode instrumentation tool, the merging of code versions into a single method body aggravates the problem. This issue can be mitigated by placing code versions in separate private methods when the method size limit is exceeded.
CodeMerger can operate in three different modi. In the first modus, which is the default modus, CodeMerger places multiple code versions into a single method body and does not introduce any new method. In the second modus, CodeMerger puts each code version in a separate private method. In the third modus, which we will call adaptive modus in this article, CodeMerger places multiple code versions into a single method body only as long as a specified maximum method size is not exceeded. If the merged method body exceeds that limit, CodeMerger puts the code versions into separate private methods. Note that CodeMerger processes static initializers always in the default modus, as static final fields must be initialized within the body of the static initializer (and cannot be initialized in a method invoked by the static initializer).
The default modus has the advantage that it does not introduce any structural modifications of class files; that is, the effects of merging multiple code versions into a single method body are not visible through the reflection API and there are no extra stack frames upon method execution. However, the resulting method bodies can be long, which may have some negative performance impact, for example, if the just-in-time compiler bases decisions on method inlining on the method size (i.e., preventing inlining of long methods). The other two modi produce artifacts (i.e., extra methods) that are visible through the reflection API. However, they help avoid creating methods with very long bodies. In Section 8.2, we will carefully explore the overhead introduced by CodeMerger in each modus.
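The choice between the two code layouts described in this section can be summarized in a small decision function. The sketch below is hypothetical and does not reflect CodeMerger's actual API: the class MergeModeExample, the enums Mode and Layout, and the parameters of chooseLayout are assumptions that merely restate the rules given in the text.

```java
// Hypothetical restatement of the layout rules; not CodeMerger's API.
final class MergeModeExample {
    enum Mode { DEFAULT, SEPARATE, ADAPTIVE }

    enum Layout { SINGLE_BODY, SEPARATE_PRIVATE_METHODS }

    static Layout chooseLayout(Mode mode, boolean isStaticInitializer,
                               int mergedSizeInBytes, int maxSizeInBytes) {
        // Static initializers are always merged into a single body.
        if (isStaticInitializer) {
            return Layout.SINGLE_BODY;
        }
        switch (mode) {
            case DEFAULT:
                return Layout.SINGLE_BODY;
            case SEPARATE:
                return Layout.SEPARATE_PRIVATE_METHODS;
            case ADAPTIVE:
                // Fall back to separate methods only if the limit is exceeded.
                return mergedSizeInBytes > maxSizeInBytes
                        ? Layout.SEPARATE_PRIVATE_METHODS
                        : Layout.SINGLE_BODY;
            default:
                throw new AssertionError(mode);
        }
    }
}
```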

CASE STUDY 1: COMPREHENSIVE PROFILING OF INTER-PROCEDURAL AND INTRA-PROCEDURAL CONTROL FLOW
In this section, we present a profiler, JP2, that relies on PBI to achieve complete bytecode coverage, as explained in Section 3. JP2 profiles both the inter-procedural and the intra-procedural control flow of applications running in any standard JVM.
To capture the inter-procedural control flow of the profiled base program, JP2 instruments each method so as to reify the current calling context for each thread. JP2 maintains a calling context tree (CCT) [20] as a thread-safe data structure shared between all threads in the JVM. Within the thread-local variable currentCCTNode, each thread keeps track of its current position in the CCT. Upon method entry, the inserted instrumentation code accesses currentCCTNode (which at that moment refers to the CCT node of the caller) and stores the reference in the local variable callerCCTNode. Then it looks up the CCT node representing the callee (creating that node if it does not yet exist) and stores the reference in the local variable calleeCCTNode as well as in the thread-local variable currentCCTNode. Upon (normal and abnormal) method completion, the reference stored in callerCCTNode is stored back to currentCCTNode.
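The entry and exit logic described above can be sketched in plain Java. This is a minimal illustration and not the code that JP2 inserts: the class names CCTNode and CctDemo, the helper methods enter and exit, and the use of method-name strings as node keys are assumptions, and callsite awareness is left out.

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal calling context tree (CCT) node; names are illustrative.
final class CCTNode {
    final String methodId;
    final ConcurrentHashMap<String, CCTNode> callees = new ConcurrentHashMap<>();

    CCTNode(String methodId) { this.methodId = methodId; }

    // Look up the callee node, creating it if it does not exist yet.
    CCTNode getOrCreateCallee(String calleeId) {
        return callees.computeIfAbsent(calleeId, CCTNode::new);
    }
}

final class CctDemo {
    static final CCTNode ROOT = new CCTNode("<root>");

    // Stand-in for the thread-local variable currentCCTNode.
    static final ThreadLocal<CCTNode> currentCCTNode =
            ThreadLocal.withInitial(() -> ROOT);

    // Method entry: remember the caller's node and move to the callee's node.
    static CCTNode enter(String methodId) {
        CCTNode callerCCTNode = currentCCTNode.get();
        CCTNode calleeCCTNode = callerCCTNode.getOrCreateCallee(methodId);
        currentCCTNode.set(calleeCCTNode);
        return callerCCTNode;
    }

    // Method completion: restore the caller's node.
    static void exit(CCTNode callerCCTNode) {
        currentCCTNode.set(callerCCTNode);
    }

    static void callee() {
        CCTNode callerCCTNode = enter("CctDemo.callee");
        try {
            // the original method body would run here
        } finally {
            exit(callerCCTNode);   // covers normal and abnormal completion
        }
    }

    static void caller() {
        CCTNode callerCCTNode = enter("CctDemo.caller");
        try {
            callee();
            callee();              // same calling context, same CCT node
        } finally {
            exit(callerCCTNode);
        }
    }
}
```

The try/finally block is what restores currentCCTNode on both normal and abnormal completion.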
The profiles generated by JP2 preserve callsite information; that is, if method m is invoked from different callsites within the same calling context, the executions of m are represented by separate CCT nodes, one for each callsite. Callsite awareness is achieved by storing the bytecode position of a callsite in a dedicated thread-local variable before the call, such that the callee can access and use that information when looking up (respectively creating) its CCT node. Note that it is not sufficient to store the bytecode position only before method invocation bytecodes, but it must be stored also before any bytecode that might trigger class loading or class initialization, as these activities can result in implicit invocations of class-loader methods respectively of static initializers.
JP2 profiles the intra-procedural control flow by incrementing a counter in the beginning of each basic block of code. That is, each CCT node maintains an array of counters, one for each basic block in the body of the method represented by the CCT node. JP2 uses CodeMerger both at load-time and at runtime. JP2 employs an instrumentation agent written in pure Java that is initialized after JVM bootstrapping, before the first class of the base program to be profiled is loaded. The JP2 agent determines the set of loaded classes and redefines them, replacing them with instrumented versions (i.e., using PBI at runtime). Because instrumentation happens in the same JVM process that runs the instrumented base program, the instrumentation may trigger class loading; these classes need to be instrumented as well. Therefore, the JP2 agent repeatedly determines the set of freshly loaded classes and redefines them, until no further classes are loaded. Then, the agent installs itself to intercept subsequent class loading events. That is, all classes loaded by the execution of the base program will be instrumented at load time.
The instrumentation of JP2 has been carefully designed to avoid structural changes in the classes as much as possible, as such changes would violate current constraints on class redefinition in some production JVMs. JP2 makes only a single structural change to the class java.lang.Thread, where it inserts the instance field pbi_cflow for control flow-based dispatch, as illustrated in Listing 3 in Section 3.2. This trivial modification of class java.lang.Thread is carried out at build time (and hence will not interfere with JP2's use of class redefinition). Because JP2 does not insert any static fields, there is no need for using the pbi_initClass() feature of CodeMerger.
JP2 has been used for workload characterization at the bytecode level, for the comparison of Java and Scala workloads [21]. In that work, various dynamic metrics (e.g., the number of executed bytecodes, callsite polymorphism, and basic block hotness) are computed by cross-referencing the profiles produced by JP2 with static information from the class files of the profiled base program. The details of the initial design and implementation of JP2 are presented in [22,23]; later, JP2 was ported to use CodeMerger. JP2 is implemented in ASM [7] and available as open-source software at http://code.google.com/p/jp2/. In Section 8.3, we will explore the overhead of complete bytecode coverage in JP2, enabled by PBI.

CASE STUDY 2: EXECUTION LEVELS FOR ASPECTJ
As a second case study for PBI, we explore how the technique makes it possible to implement execution levels [12] for AOP with AspectJ [3].

Aspects and circularity
An aspect observes the execution of a program through its pointcuts and affects it with its advice. An advice is like a method, and therefore, its execution also produces join points. Similarly, pointcuts as well can produce join points. For instance, in AspectJ, one can use an if pointcut designator to specify an arbitrary Java expression that ought to be true for the pointcut to match. The evaluation of this expression is a computation that produces join points. In higher-order aspect languages like AspectScheme [24] and others, all pointcuts and advice are standard functions, whose application and evaluation produce join points as well.
The fact that aspectual computation produces join points raises the crucial issue of the visibility of these join points. In all languages, by default, aspectual computation is visible to all aspects, including themselves. This of course opens the door to infinite regression and unwanted interference between aspects. These issues are typically addressed with ad hoc checks (e.g., using cflow checks in AspectJ) or primitive mechanisms (like AspectScheme's app/prim). However, all these approaches eventually fall short because they fail to address the fundamental problem, which is that of conflating levels that ought to be kept separate [25].

Execution levels
In order to address this issue, a program computation is structured in levels. Computation happening at level 0 produces join points observable at level 1. Aspects are deployed at a particular level, and observe only join points at that level. This means that an aspect deployed at level 1 only observes join points produced by level-0 computation. In turn, the computation of an aspect (i.e., the evaluation of its pointcuts and advice) is reified as join points visible at the level immediately above; therefore, the activity of an aspect standing at level 1 produces join points at level 2.
An aspect that acts around a join point can eventually invoke the original computation. For instance, in AspectJ, this is performed by invoking proceed in the advice body. The original computation ought to run at the same level at which it originated! (This issue is precisely why using control flow checks in AspectJ in order to discriminate advice computation is actually flawed; see [12] for more details.) In order to address this issue, it is important to remember that when several aspects match the same join point, the corresponding advice are chained, such that calling proceed in advice k triggers advice k + 1. Therefore, the semantics of execution levels guarantees that the last call to proceed in a chain of advice triggers the original computation at the lower original level. This is shown in Figure 2. A call to a move method in the program produces a call join point (at level 1), against which a pointcut pc is evaluated. The evaluation of pc produces join points at level 2. If the pointcut matches, it passes context information ctx to the advice. Advice execution produces join points at level 2, except for proceed: control goes back to level 0 to perform the original computation, then goes back to level 1 for the after part of the advice.

Execution levels for AspectJ using PBI
Execution levels have been formulated and prototyped in aspect languages with dynamic weaving [12]. In recent work, we have designed an extension of AspectJ with execution levels, tailored to take into account the specificities of AspectJ, like static aspect weaving with partial evaluation of pointcuts [4,26]. The detailed motivation, design, and applications of this extension are presented elsewhere [15]. Our previous implementation of level switching is carried out in an ad hoc manner; here, we describe how PBI can be used for that purpose. Section 8 also provides a much more detailed performance evaluation of the implementation.
Semantically, the execution of a method produces join points. These join points may be seen by pointcuts that may match them; if so, the corresponding pieces of advice are triggered. In aspect languages that perform weaving statically, join point production is partially evaluated [26]: based on the static properties of code, it is determined whether or not a given expression can produce a join point that will be matched at runtime [4]. If so, such a join point shadow is transformed so as to invoke advice appropriately. If, however, it can be statically determined that the pointcut never matches join points corresponding to the shadow, then no transformation happens. The matching of the pointcut may also depend on runtime information not available at compile time; in that case, the shadow is woven together with a residue, that is, a conditional expression that guards the invocation of the advice.
With execution levels, the join points produced by the execution of a method vary. If base program code, running at level 0, invokes a method, it produces join points at level 1, which may be matched by aspects deployed at that level. If an aspect deployed at level n calls this same method, then it produces join points at level n + 1, visible only for aspects deployed at level n + 1. We use PBI to check the execution level upon method entry and dispatch appropriately to a particular code version. More precisely, there is one code version per execution level, and each code version corresponds to the code with the instrumented shadows of the aspects deployed at the level directly above it. For instance, execution at level 0 uses code version 0, which is the result of weaving aspects deployed at level 1. Execution at level N (the highest level in the configuration) uses code version N, which is set to be the original, non-instrumented code version. A code version is obtained by invoking the standard AspectJ weaver with the aspects deployed at a given level.
In order to track execution levels, we define a class ExecutionLevel that provides access to a thread-local variable that indicates at which level the current thread is running (Listing 4). For that, we insert an integer instance field pbi_level in class java.lang.Thread to keep track of the current level. Method currentLevel() returns the current thread's level, whereas methods up() and down() increment and decrement it, respectively.
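A class with this interface could look roughly as follows. This is a hedged sketch rather than the original listing: because plain Java code cannot add a field to java.lang.Thread, a ThreadLocal holding a one-element array is used here as a portable stand-in for the inserted field, and the initial level of 0 is an assumption based on the base program running at level 0.

```java
// Sketch of an ExecutionLevel class with the three methods named in the text.
// Assumption: a ThreadLocal replaces the field inserted into java.lang.Thread.
final class ExecutionLevel {
    // One mutable cell per thread; every thread starts at level 0.
    private static final ThreadLocal<int[]> LEVEL =
            ThreadLocal.withInitial(() -> new int[] {0});

    private ExecutionLevel() { }

    // Returns the level at which the current thread is running.
    static int currentLevel() { return LEVEL.get()[0]; }

    // Shifts the current thread one level up.
    static void up() { LEVEL.get()[0]++; }

    // Shifts the current thread one level down.
    static void down() { LEVEL.get()[0]--; }
}
```

A field in java.lang.Thread avoids the map lookup performed by ThreadLocal, which is presumably why the field-based design is used in practice.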
Level shifting is carried out upwards for the dynamic extent of both advice and pointcut residues, following the top pattern of Listing 5. A level shift downwards occurs for around advice, when the original computation is finally called with proceed, following a similar pattern (Listing 5, bottom).
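The two shifting patterns can be approximated with try/finally blocks. The sketch makes several assumptions: LevelShiftDemo, its nested Level helper (a stand-in for ExecutionLevel), and the methods shiftUp and shiftDown are hypothetical names, and the advice body and proceed call are modelled as Supplier instances instead of woven code.

```java
import java.util.function.Supplier;

final class LevelShiftDemo {
    // Minimal per-thread level counter (stand-in for ExecutionLevel).
    static final class Level {
        private static final ThreadLocal<int[]> CELL =
                ThreadLocal.withInitial(() -> new int[] {0});
        static int current() { return CELL.get()[0]; }
        static void up()     { CELL.get()[0]++; }
        static void down()   { CELL.get()[0]--; }
    }

    // Upward shift for the dynamic extent of an advice or pointcut residue.
    static <T> T shiftUp(Supplier<T> adviceBody) {
        Level.up();
        try {
            return adviceBody.get();
        } finally {
            Level.down();          // restore the level even if the body throws
        }
    }

    // Downward shift around the final proceed of an around advice.
    static <T> T shiftDown(Supplier<T> proceed) {
        Level.down();
        try {
            return proceed.get();
        } finally {
            Level.up();            // return to the advice's level afterwards
        }
    }
}
```

Nesting shiftDown inside shiftUp models an around advice whose proceed call runs the original computation at the level where it originated.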
Our PBI-based aspect weaver, MAJOR2, uses the unmodified standard AspectJ weaver and postprocesses its output to automatically insert the above pattern in each advice method and in each method corresponding to compiled if pointcuts.
The dispatch logic given to the PBI framework simply consults the current execution level and dispatches to the corresponding version. Finally, because execution levels generalize the solution we presented in the previous section to avoid infinite regression, we only need to combine the levels check with the JVM bootstrap check in order to obtain execution levels for aspects with complete bytecode coverage.
The evaluation in Section 8.4 uses two different deployment configurations with two aspects in order to assess the performance of the PBI-based implementation. Additional deployment configurations are discussed in [15].
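One way to picture the combination of the levels check with the bootstrap check is the following sketch. It is not the original listing: the class LevelDispatch, the flag bootstrapping, the constant MAX_LEVEL, and the clamping of levels above N are all assumptions made for illustration, and a ThreadLocal again stands in for the field inserted into java.lang.Thread.

```java
// Hypothetical dispatch logic combining a bootstrap check with the level check.
final class LevelDispatch {
    // Assumed flag: true until JVM bootstrapping has completed.
    static volatile boolean bootstrapping = true;

    // N: the highest level in the configuration; code version N is the
    // original, non-instrumented code.
    static final int MAX_LEVEL = 2;

    // Stand-in for the per-thread level field.
    static final ThreadLocal<Integer> level = ThreadLocal.withInitial(() -> 0);

    // Returns the index of the code version to execute.
    static int computeCV() {
        if (bootstrapping) {
            return MAX_LEVEL;                      // run original code during bootstrap
        }
        return Math.min(level.get(), MAX_LEVEL);   // one code version per level
    }
}
```

With this shape, code running at level 0 executes the version woven with the level-1 aspects, while code at the highest level executes the original method bodies.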

CASE STUDY 3: ADAPTIVE DYNAMIC ANALYSIS
Adaptive dynamic program analysis allows the user to choose or change the dynamic analysis to be performed at runtime. For example, in adaptive profiling, the profiler code is adapted at runtime based on user choices, in order to restrict profiling to only part of an executing application or to enable and disable the collection of certain dynamic metrics. Adaptive profiling helps reduce profiling overhead, as only data of current interest are gathered.
A good example of an adaptive profiler is JFluid [10], which has been integrated in the NetBeans Profiler [27]. JFluid measures execution time for selected methods and generates a CCT (like JP2, Section 5) to help analyze the contributions of direct and indirect callees to the execution time of selected methods. JFluid is an adaptive profiler: when the user selects different methods for profiling at runtime, JFluid adapts the profiling code accordingly, using the class redefinition mechanism of the JVM.
Runtime instrumentation and class redefinition can be very expensive, in particular if many classes are to be instrumented and if the instrumentation is specified in a high-level programming model, such as AOP, which requires more complex tool support (e.g., in the case of AOP, an aspect weaver is used). For example, with the dynamic AOP framework HotWave [19], which relies on runtime aspect weaving and on class redefinition, weaving an aspect into all modifiable classes at runtime may take up to 60 s on a recent machine (Section 8.5).
If the set of bytecode instrumentations that may be needed is known in advance, it is not necessary to resort to expensive class redefinition techniques. Instead, we can use PBI to apply all the instrumentations and decide at runtime which code version to execute. For example, Villazón et al. present an adaptive profiler built with HotWave that may switch between two different instrumentations (implemented as aspects) at runtime [19]. The default instrumentation generates a plain CCT, whereas a second instrumentation additionally stores various dynamic metrics in the CCT nodes. The second instrumentation introduces much higher overhead and therefore is applied at runtime only to the classes for which the user desires detailed dynamic metrics. Instead of applying the two different instrumentations (possibly repeatedly) at runtime by redefining the affected classes, PBI allows us to merge the code versions for both instrumentations and to switch between them at runtime. Figure 3 illustrates a case of adaptive dynamic analysis with PBI: the user defines, and changes dynamically, the scope of the profiling; profiling data are then passed to a profiling agent that renders it. Implementation-wise, all methods have two code versions and start with a dispatch that triggers the appropriate version, based on the current scope definition.
In the general case, computeCV() dispatches between N different instrumentations based on asynchronous user choices. These user choices may be at the level of classes or packages. Depending on the granularity at which the user can switch between instrumentations, we assume there is some state (i.e., a public static volatile field) for each class or package indicating the code version to be executed. The effect of a state change is similar to class redefinition in current JVMs: all subsequently invoked methods will read the new state and execute the corresponding code version, whereas methods that already executed the dispatch logic before the state change are not affected. The computeCV() function is a template in which the meta-variable selectedCodeVersion refers to the corresponding volatile field to be read.
This dispatch logic uses the bootstrapping state and the control flow information in the same way as explained in Section 3, in order to enable instrumentation of the Java class library. Code version 0 corresponds to the original method bodies in C_orig. Note that for methods in the base program, computeCV() can be optimized, assuming that the inserted code never invokes any method of the base program.
As mentioned in Section 3, reading a volatile variable upon each method entry may introduce significant overhead. If the user rarely changes the choice of the code version to be executed (by writing to the meta-variable selectedCodeVersion), redefining the class that holds the volatile variable (as explained in Section 3) helps reduce the overhead of reading the volatile variable in state-of-the-art JVMs. Because of the de-optimization and re-optimization caused by class redefinition, changing the volatile variable does introduce some temporary overhead. However, compared with a solution based on runtime instrumentation and on possibly redefining (potentially) all previously loaded classes, this approach only redefines a single, trivial class. Furthermore, this approach supports the atomic change of a set of instances of the meta-variable selectedCodeVersion (e.g., atomically changing the instrumentation for a set of classes or for a set of packages).
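The computeCV() template and its optimized variant for base program methods might take the following shape. This is a sketch under stated assumptions, not the original listings: AdaptiveDispatch, bootstrapping, inAnalysis, and computeCVForBaseProgram are hypothetical names, and the exact form of the bootstrap and control flow checks of Section 3 is assumed.

```java
// Hypothetical dispatch template for adaptive dynamic analysis.
final class AdaptiveDispatch {
    // Assumed bootstrap state: true until JVM bootstrapping has completed.
    static volatile boolean bootstrapping = true;

    // Assumed control flow state: true while the thread runs inserted code.
    static final ThreadLocal<Boolean> inAnalysis =
            ThreadLocal.withInitial(() -> false);

    // Per-class or per-package state written asynchronously by the user.
    public static volatile int selectedCodeVersion = 1;

    // General template: code version 0 is the original method body.
    static int computeCV() {
        if (bootstrapping || inAnalysis.get()) {
            return 0;
        }
        return selectedCodeVersion;
    }

    // Variant for base program methods, valid only under the assumption
    // that inserted code never invokes any method of the base program.
    static int computeCVForBaseProgram() {
        return selectedCodeVersion;
    }
}
```

A write to selectedCodeVersion takes effect for all methods that execute their dispatch logic afterwards, which mirrors the visibility semantics described above.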

EVALUATION
In this section, we thoroughly evaluate CodeMerger, our PBI implementation, in different scenarios. Section 8.1 summarizes our measurement environment and evaluation settings. In Section 8.2, we explore the performance impact of code duplication introduced by PBI, considering the three different modi supported by CodeMerger (Section 4.4), that is, placing all code versions into a single method body, generating a separate private method for each code version, or adapting to the method size. In Section 8.3, we investigate the performance overhead of complete bytecode coverage in the profiling case study (Section 5). In Section 8.4, we evaluate MAJOR2, our PBI-based implementation of execution levels for AspectJ (Section 6), considering both start-up and steady-state performance. Finally, Section 8.5 presents our evaluation of an adaptive dynamic program analysis tool (Section 7) and compares it with an alternative implementation that relies on dynamic class redefinition.

Measurement environment
Our measurement machine is a Dell Optiplex 760 (one quad-core Intel CPU, 3.0 GHz, 8 GB RAM) running Fedora 13 and the Oracle JDK 1.6.0_18 HotSpot Server VM (64-bit version with 4 GB maximum heap size). In Section 8.2, we additionally use the HotSpot VM in interpreted mode, to study the base overhead introduced by PBI in a JVM both with just-in-time compilation and with interpretation.
We use the DaCapo benchmarks (dacapo-2006-10-MR2) [28] with the default workload size. For some evaluations, we also show the geometric mean of the measurements for all DaCapo benchmarks.

Overhead of polymorphic bytecode instrumentation dynamic dispatch
In this subsection, we evaluate the base overhead introduced by PBI when two identical code versions (without any instrumentation) are merged. We consider all three modi supported by CodeMerger, that is, (1) merging code versions within method bodies, (2) introducing a private method for each code version, and (3) the adaptive modus that creates extra methods only if the size of the merged code versions (and the inserted dispatch logic) would exceed a given limit. We use the default value of the JVM parameter -XX:FreqInlineSize as limit; this parameter indicates the maximum size of methods that are inlined when executed frequently. The rationale of this choice is to avoid extra methods as long as it does not prevent inlining of frequently executed methods.
Regarding dynamic dispatch, we consider three different types of computeCV() function: the first reads a thread-local variable, the second reads a static volatile field, and the third reads a static final field. The thread-local variable (respectively, the volatile or final field) always contains the value 0; that is, the same code version is always executed for each invoked method.
Figure 4 shows that in interpreted mode and with a dispatch logic based on a thread-local variable, extending the size of method bodies is always faster than using extra methods. The reason is that the interpreter does not perform any inlining and the extra method calls always introduce some overhead. The adaptive strategy performs almost as well as always merging code versions into the same method body, because the majority of frequently executed method bodies are still smaller than the limit of the adaptive strategy. Therefore, extra methods are not created in most cases.
Using the same dispatch logic, Figure 5 illustrates PBI overhead when the JVM uses just-in-time compilation. For most benchmarks, the overhead is rather low in all three modi, between 5% and 20%. However, for three benchmarks (i.e., eclipse, luindex, and lusearch), always merging code versions into the same method body introduces surprisingly high overhead, between a factor of two and a factor of four. The reason for this excessive overhead is that some frequently executed methods cannot be inlined anymore because their size exceeds the limit imposed by the JVM. Always generating extra methods avoids this problem but introduces slightly more overhead for some other benchmarks (e.g., jython, hsqldb, pmd, and xalan); in comparison with interpreted mode (Figure 4), the overhead because of extra method calls is less pronounced, as the just-in-time compiler is able to inline many extra methods. The adaptive strategy always outperforms the use of extra methods for all code versions; it succeeds in combining the benefits of the other two strategies.
Figure 6 shows that in the interpreted mode of Oracle's HotSpot VM, accessing a thread-local variable is much more expensive than accessing a static volatile field or a static final field. Because there is no just-in-time compilation, there is no evident performance difference between accessing a volatile field and accessing a final field, as access to final fields is not optimized.
In contrast, Figure 7 illustrates that with just-in-time compilation (i.e., HotSpot server compiler), access to a volatile field is more expensive than access to a thread-local variable, which in turn is slightly more expensive than access to a final field.
In summary, we conclude that the base overhead introduced by CodeMerger is small on a modern JVM with an optimizing just-in-time compiler, whereas it may reach a factor of 2 when using an interpreter. The adaptive modus of CodeMerger achieves consistently good results for all benchmarks, both when using just-in-time compilation and interpretation. In contrast, the default modus may result in surprisingly high overhead in certain situations when just-in-time compilation is used.
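For orientation, the three kinds of computeCV() function compared in this subsection can be sketched as follows. The class DispatchVariants and its members are hypothetical names, and the sketch only shows the shape of the three reads; it is not a benchmark and does not reproduce the measured costs. In particular, a java.lang.ThreadLocal is used as a stand-in for the thread-local variable, which may behave differently from a field inserted into java.lang.Thread.

```java
// Three illustrative dispatch variants; each always selects code version 0.
final class DispatchVariants {
    // Variant 1: thread-local state (ThreadLocal used as a stand-in).
    static final ThreadLocal<int[]> threadLocalCV =
            ThreadLocal.withInitial(() -> new int[] {0});

    // Variant 2: static volatile field, re-read on every method entry.
    static volatile int volatileCV = 0;

    // Variant 3: static final field, which a just-in-time compiler can
    // treat as a constant.
    static final int FINAL_CV = 0;

    static int computeCVThreadLocal() { return threadLocalCV.get()[0]; }

    static int computeCVVolatile() { return volatileCV; }

    static int computeCVFinal() { return FINAL_CV; }
}
```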

JP2
We now evaluate the impact of complete bytecode coverage in our profiling case study (Section 5). Figure 8 reports the profiling overhead introduced by JP2 in two different settings: (1) instrumenting only the base program classes and (2) instrumenting all classes; only the second setting requires the use of PBI to achieve complete bytecode coverage. In both settings, the profiling overhead is high, ranging from a factor of 2 to a factor of 33. The high overhead is not surprising, as JP2 performs a heavyweight instrumentation, including callsite-aware calling-context profiling and intra-procedural profiling at the basic block level. Furthermore, the profile data structure is shared between all threads, incurring additional overhead for thread-safety, particularly for multi-threaded benchmarks such as lusearch and xalan. For a detailed exploration of the different sources of overhead, we refer to [22]. (Note that the measurements reported in this article are not directly comparable with the measurements published in [22] because of different measurement environments.) Here, we are only interested in the performance impact of complete bytecode coverage as enabled by PBI.
Figure 8 shows that on average (geometric mean for the DaCapo benchmarks), JP2 with complete bytecode coverage introduces almost twice as much overhead as JP2 instrumenting only classes of the base programs. However, the bigger part of the extra overhead of instrumenting the Java class library does not stem from code duplication by PBI (as we confirmed in Section 8.2) but from the fact that much more code is instrumented, as explained later. Table I summarizes the number of executed basic blocks and the number of method invocations, both for JP2 with complete bytecode coverage and for JP2 instrumenting only the classes of the base program.
Some benchmarks (i.e., bloat, chart, fop, jython, and xalan) trigger the biggest part of events in the Java class library (both in terms of executed basic blocks and method invocations). It is not surprising that for these benchmarks, complete bytecode coverage can introduce more than 2.5 times the overhead of covering only base program code. The correlation coefficient between the difference in overhead (complete bytecode coverage versus covering only base program classes) and the difference in the number of executed basic blocks (resp. in the number of method invocations) is 0.8523 (resp. 0.8531). These results confirm that the extra overhead for complete bytecode coverage indeed stems from the larger amount of data collected.
This study of coverage with the DaCapo benchmarks shows that computation within the Java class library is in fact a big part of the overall activity of Java programs (almost twice as many basic blocks are executed and almost twice as many methods are invoked). This result is especially important in that it demonstrates that proper profilers cannot ignore the core libraries: complete bytecode coverage is crucial, and PBI is an efficient technique to achieve it when implementing profilers in pure Java.

Execution levels
In the following evaluation, we consider the performance impact of execution levels in AOP. First, we describe the aspects used in our evaluation and the deployment scenarios in Section 8.4.1. Second, we explore the overhead of execution levels in comparison with the standard AspectJ load-time weaver in Section 8.4.2. Third, we investigate the overhead when weaving with complete bytecode coverage in two different deployment scenarios in Section 8.4.3.

Profiling aspects and scenarios.
Our evaluation is carried out with two bytecode instrumentations for dynamic program analysis implemented as aspects, the object allocation profiler ProfAllocs and the method call profiler ProfCalls (Listing 6). The allocation profiler collects the number of object allocations for each class, and the method call profiler collects the number of method calls for each method. Both profilers maintain a thread-safe mapping from identifiers to atomic longs (methods profileAllocation(...) and profileCall(...), which are not shown in the listing). For ProfAllocs, the identifiers are the classes represented by java.lang.Class instances, and for ProfCalls, the method identifiers are represented by JoinPoint.StaticPart instances (from AspectJ). We use non-blocking data structures from the java.util.concurrent package, concretely ConcurrentHashMap and AtomicLong. We discuss the scoping pointcuts defined in ScopeProf in Section 8.4.2.
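The thread-safe mapping from identifiers to atomic counters can be sketched with the two java.util.concurrent classes named above. The sketch is illustrative only: CounterTable and ProfilerSketch are hypothetical names, the aspect code that would call these methods is omitted, and plain strings stand in for the JoinPoint.StaticPart identifiers so that the example has no AspectJ dependency.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Non-blocking mapping from identifiers to counters.
final class CounterTable<K> {
    private final ConcurrentHashMap<K, AtomicLong> counters =
            new ConcurrentHashMap<>();

    // Creates the counter on first use, then increments it atomically.
    void increment(K id) {
        counters.computeIfAbsent(id, k -> new AtomicLong()).incrementAndGet();
    }

    // Returns the current count, or 0 if the identifier was never seen.
    long count(K id) {
        AtomicLong counter = counters.get(id);
        return counter == null ? 0L : counter.get();
    }
}

final class ProfilerSketch {
    // Allocation counts keyed by class, as in ProfAllocs.
    static final CounterTable<Class<?>> allocations = new CounterTable<>();

    // Call counts keyed by a method identifier (String is a stand-in here).
    static final CounterTable<String> calls = new CounterTable<>();

    static void profileAllocation(Class<?> allocatedClass) {
        allocations.increment(allocatedClass);
    }

    static void profileCall(String methodId) {
        calls.increment(methodId);
    }
}
```

Each update allocates no lock of its own, but it does execute code in the Java class library, which is why such aspects need a mechanism to avoid observing their own computation.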
We weave the ProfAllocs and ProfCalls aspects in the DaCapo benchmarks that serve as base programs. We use our new PBI-based re-implementation of MAJOR2 [15] that relies on CodeMerger, which provides support for execution levels and complete bytecode coverage. Aspects are woven with AspectJ 1.6.5 (MAJOR2 is also based on AspectJ).
We considered two scenarios for this evaluation:
1. ProfAllocs and ProfCalls are applied to the base program (i.e., both aspects are deployed at level 1).
2. ProfCalls is applied to the base program (i.e., deployed at level 1), and ProfAllocs is applied to ProfCalls (i.e., deployed at level 2), thus profiling object allocation in ProfCalls.

Comparison with AspectJ.
Our first evaluation compares the performance of code woven with AspectJ's load-time weaver (henceforth called ajc-ltw) versus MAJOR2. As explained in Section 6, MAJOR2 relies on PBI to implement control flow-based dispatch based on a thread-local field. In this comparison, we use scenario 1, as this is the only composition scenario that AspectJ can handle. We limit the coverage of MAJOR2 to application classes in order to have comparable settings. For each benchmark of the DaCapo suite, we report the first run for start-up performance and we take the median of 15 runs executed in the same JVM process to evaluate steady-state performance. We also compute the geometric mean for all benchmarks except bloat and eclipse.††
The aspects in Listing 3.2 use a scope() pointcut in order to avoid infinite regression caused by their own computation, as well as to avoid seeing join points produced by the other aspect. Using control flow checks for achieving this is the most robust pattern, as it ensures that all join points in the dynamic extent of aspect executions are ignored. This pattern is well known [29] and is used in many aspect implementations [30]. In our MAJOR2 implementation, these pointcuts are not needed at all because execution levels already address the issues of infinite regression and mutual visibility. Our benchmarks therefore enable us to compare the cost of these typical control flow checks and of our implementation of execution levels.
For further comparison, we also benchmark an optimized version of the aspects with ajc-ltw, where we skip the control flow checks and just leave the lexical aspects() condition. This happens to be safe in this particular case, because all potential sources of regression and interference are situated lexically in the aspect definitions, no shared libraries are woven, and there are no callbacks from the aspects to the base code.
Tables II and III show the measured execution times and overhead factors respectively for steady-state and start-up performance. Figures 9 and 10 visualize the overhead factors reported in the tables.
Considering steady-state performance, MAJOR2 introduces significantly less overhead than ajc-ltw (factor 6.28 for MAJOR2 versus factor 11.79 for ajc-ltw, on average). This confirms that PBI-based execution level dispatch can be much more efficient than the use of control flow pointcuts for avoiding infinite regression and aspect interference. As expected, the optimized ajc-ltw performs better than MAJOR2, as it does not incur the PBI overhead. However, the difference is relatively small (overhead factor 5.80 for the optimized ajc-ltw versus factor 6.28 for MAJOR2, on average). Recall that the optimization is fragile and not generally applicable.
† We exclude bloat because it fails with ajc-ltw (in contrast to MAJOR2). We exclude eclipse because ajc-ltw fails to weave a large number of classes because of dependencies (ajc-ltw depends on classes that are also used by the eclipse benchmark); such a problem does not exist with MAJOR2, which makes proper use of class-loader namespaces.
On average, the start-up overhead (Table III and Figure 10) is lower than the steady-state overhead, as the baseline for comparison consumes much more execution time because of class loading. In the ajc-ltw and ajc-ltw (optimized) settings, the start-up overhead exceeds the steady-state overhead only in the case of fop, a benchmark with short execution time. The most notable difference with steady-state performance is that MAJOR2 causes much more overhead than the optimized AspectJ version. The straightforward explanation is that the code transformation performed at load time is more expensive with MAJOR2: we call the AspectJ weaver once for each execution level, and then the code versions are put together in an additional step.
In summary, these results are particularly encouraging for execution levels, which provide much more stable semantics for aspect composition [12,15]. Our evaluation shows that their efficient integration in a practical aspect language is possible.
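To make the idea of level-based dispatch concrete, the following is a minimal sketch of how a merged method might select a code version from the current execution level. It is not the code generated by MAJOR2: the class `ExecutionLevelSketch`, the thread-local `LEVEL`, and the methods `describe` and `before` are hypothetical names introduced only for illustration, and we assume a single aspect deployed at level 0.

```java
/** Illustrative sketch only; all names here are hypothetical. */
final class ExecutionLevelSketch {
    /** Current execution level of the running thread (0 = base computation). */
    private static final ThreadLocal<Integer> LEVEL = ThreadLocal.withInitial(() -> 0);

    /** Counts advice executions (single-threaded demo, so a plain int suffices). */
    static int adviceRuns = 0;

    /** A merged method: the dispatch picks the version that matches the level. */
    static String describe(int x) {
        if (LEVEL.get() == 0) {
            // Version woven for level 0: the advice observes this join point.
            before(x);
            return "value " + x;
        } else {
            // Version for higher levels: not advised by the level-0 aspect.
            return "value " + x;
        }
    }

    /** Advice body; it runs one level above the join point it observes. */
    private static void before(int x) {
        int saved = LEVEL.get();
        LEVEL.set(saved + 1);
        try {
            adviceRuns++;
            // The advice itself calls the advised method. Because the level is
            // now 1, the unadvised version runs and no infinite regression occurs.
            describe(x);
        } finally {
            LEVEL.set(saved);
        }
    }
}
```

Calling `ExecutionLevelSketch.describe(7)` from base code triggers the advice exactly once, although the advice invokes `describe` again. The dispatch is a thread-local read and a branch, whereas a control flow pointcut has to track the dynamic extent of advice execution.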

Complete bytecode coverage.
Our second evaluation measures the overhead introduced by the two profiling aspects woven with MAJOR2 with complete bytecode coverage in both scenarios. That is, the complete Java class library is also woven, as we want to evaluate the overhead of MAJOR2 in concrete scenarios where its novel features are used. A comparison with AspectJ is not possible, because AspectJ is unable to weave the aspects in the Java class library, and is incapable of handling scenario 2. For each benchmark, we report the first run (start-up performance) and we take the median of 15 runs within the same JVM process (steady-state performance). Here, the geometric mean is computed for the whole benchmark suite, including bloat and eclipse, as MAJOR2 is able to handle both correctly.
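The aggregation described above can be sketched as follows. This is a hypothetical helper, not part of our measurement harness: the class `OverheadStats` and its method names are assumptions, and we assume that an overhead factor is the instrumented execution time divided by the baseline execution time.

```java
import java.util.Arrays;

/** Hypothetical helper for aggregating benchmark timings. */
final class OverheadStats {
    /** Median of the given run times (any consistent time unit). */
    static double median(double[] runs) {
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Overhead factor: instrumented time divided by baseline time. */
    static double overheadFactor(double instrumented, double baseline) {
        return instrumented / baseline;
    }

    /** Geometric mean of per-benchmark factors, computed in log space. */
    static double geometricMean(double[] factors) {
        double logSum = 0.0;
        for (double f : factors) {
            logSum += Math.log(f);
        }
        return Math.exp(logSum / factors.length);
    }
}
```

The geometric mean is the usual choice for averaging ratios such as overhead factors, because it does not depend on which configuration is taken as the reference.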
Tables IV and V show the measured execution times and overhead factors for steady-state and start-up performance, respectively. Figures 11 and 12 visualize the overhead factors reported in the tables. Table IV presents the results of our measurements for steady-state performance. In the first scenario, the average overhead factor is 11.08, while in the second scenario it is 9.61. The overhead with complete bytecode coverage is almost twice the overhead when weaving only application classes, because Java applications execute large portions of code in methods in the Java class library (as was already pointed out when discussing Table I in Section 8.3). While an overhead factor of 11 is high, it must be considered that the applied instrumentations are computationally expensive. ProfAllocs intercepts each object allocation, and ProfCalls intercepts each method call. At each intercepted join point, a thread-safe data structure is updated.
Figure 11. Steady-state overhead for scenarios 1 and 2, MAJOR2 with complete bytecode coverage.
Table V illustrates the start-up performance. In the first scenario, the average overhead factor is 8.53, while in the second scenario it is 8.11. As previously discussed in Section 8.4.2, the start-up overhead is lower than the steady-state overhead, because the baseline for comparison executes much longer because of class loading. This evaluation confirms that MAJOR2 allows us to create dynamic analysis tools with AOP that have practical value, thanks to complete bytecode coverage. Depending on the concrete analysis, the overhead introduced by complete bytecode coverage can be significant, as many Java applications spend a large part of their execution in methods of the Java class library.
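To illustrate why such instrumentations are costly, the sketch below shows one possible thread-safe structure that a call-counting analysis could update at every intercepted method call. It is not the data structure used by ProfCalls; the class `CallProfile` and its methods are hypothetical, and only standard `java.util.concurrent` classes are used.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical thread-safe call counter, updated once per intercepted call. */
final class CallProfile {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Invoked by the (assumed) advice at each method call join point. */
    void onCall(String methodId) {
        // LongAdder keeps contention low when many threads hit the same key.
        counts.computeIfAbsent(methodId, k -> new LongAdder()).increment();
    }

    /** Number of calls recorded so far for the given method. */
    long count(String methodId) {
        LongAdder adder = counts.get(methodId);
        return adder == null ? 0L : adder.sum();
    }
}
```

Even with a low-contention counter, every method call in the application and in the Java class library pays for a hash lookup and an atomic update, which is consistent with overhead factors of this magnitude.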

Adaptive analysis
Finally, we evaluate PBI for adaptive dynamic analysis and assess the cost of PBI-based dispatch compared with class redefinition. Concretely, we compare CodeMerger in its default mode with HotWave [19], a dynamic AOP framework that is based on runtime weaving and class redefinition.‡ We use the ProfCalls aspect introduced in Section 8.4.1 as the dynamic analysis. We exclude results for the eclipse benchmark, because HotWave excludes many benchmark classes from weaving, similar to ajc-ltw. That is, the execution time for eclipse with HotWave would be too short and therefore misleading.
We execute nine runs of the benchmarks within a single JVM process. The first three runs execute original code; then, we activate the dynamic analysis for all classes for three runs, and finally, we execute original code again for the last three runs. For CodeMerger, we present two settings: PBI-V keeps the global state in a volatile field (which is read by the computeCV() function), whereas PBI-R uses class redefinition to change the accessor of that field to return a constant, as discussed in Section 7. Figure 13 shows the execution times as bars with nine segments, one for each run. White segments correspond to runs executed without analysis, and green (or light gray) segments are runs with dynamic analysis. Dark gray areas represent the time spent in runtime weaving and class redefinition. With HotWave, runtime weaving and class redefinition may take a long time, because all modifiable classes are processed. For instance, with fop, runtime weaving and class redefinition take more than 50% of the overall execution time. The longest redefinition time, 61 s, occurs with jython.
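The PBI-V setting can be pictured with the following minimal sketch. It is written by hand for illustration and is not output of CodeMerger: the class `PbiDispatchSketch`, the field `activeVersion`, the constants, and the method `square` are hypothetical, and the signature of `computeCV()` shown here is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hand-written illustration of a merged method with a volatile-based dispatch. */
final class PbiDispatchSketch {
    static final int ORIGINAL = 0;
    static final int ANALYSIS = 1;

    /** Global state selecting the code version (PBI-V: a volatile field). */
    private static volatile int activeVersion = ORIGINAL;

    /** Events recorded by the analysis version. */
    static final AtomicLong analysisEvents = new AtomicLong();

    /** Switching versions is a single volatile write; no class is redefined. */
    static void setActiveVersion(int version) {
        activeVersion = version;
    }

    /** Stand-in for the dispatch function; the real signature may differ. */
    static int computeCV() {
        return activeVersion;
    }

    /** Merged method body: a dispatch switch followed by both code versions. */
    static int square(int x) {
        switch (computeCV()) {
            case ANALYSIS:
                analysisEvents.incrementAndGet(); // instrumented version
                return x * x;
            default:
                return x * x;                     // original version
        }
    }
}
```

In this sketch, activating the analysis takes effect on the next invocation of `square`, at the price of a volatile read on every invocation even when the analysis is off. The PBI-R setting trades that per-call read for a class redefinition at each switch.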
In contrast, with CodeMerger, the activation (and deactivation) of the analysis is almost instantaneous, the maximum latency being less than 100 ms in all cases. However, PBI introduces some extra overhead when running original code (without analysis), because of the dispatch switch and code bloat in each method, whereas HotWave introduces no overhead when executing original code. With CodeMerger, the first run is particularly slow because of load-time weaving, which is not needed for HotWave. For some benchmarks, particularly for luindex, the difference in execution time with CodeMerger versus HotWave when executing original code is surprisingly high. The reason is that in this case study we use CodeMerger in its default mode, which can result in high overhead because of increased method size, as explored in Section 8.2.
Note that both HotWave and CodeMerger in the PBI-R setting make use of class redefinition. With Oracle's HotSpot VM, this feature may trigger de-optimization of compiled native code (e.g., undoing method inlining). Consequently, the run that follows class redefinition is often longer than the subsequent runs. Because the HotSpot VM keeps information on hot methods upon class redefinition, the de-optimized code is quickly re-optimized after class redefinition.
Comparing overall execution times for the nine benchmark runs, CodeMerger outperforms HotWave in seven out of 10 benchmarks. For CodeMerger, the PBI-R setting outperforms the PBI-V setting for nine out of 10 benchmarks. Note that these results depend very much on the concrete evaluation settings. On the one hand, if the analysis is frequently activated and deactivated, one can expect that CodeMerger outperforms HotWave because of the dominant overhead of class redefinition. On the other hand, if the analysis is rarely (or never) activated, HotWave may outperform CodeMerger, as HotWave does not incur the overhead of PBI dispatch when the analysis is not woven.
In conclusion, our evaluation confirms that PBI is well suited for building adaptive dynamic analysis tools. As the latency incurred when switching between different code versions is small, adaptive tools built with CodeMerger can quickly react to user choices.
‡ Please note that PBI and HotWave are not functionally equivalent systems. HotWave enables the runtime deployment of instrumentations that may not be available when the base program is started. In contrast, with PBI, all instrumentations must be known in advance.

DISCUSSION
In this section, we first discuss prior and ongoing work by the authors of this article and second compare our approach with related work by others.

Prior and ongoing work
PBI generalizes some previously developed techniques. In this section, we give a short overview of our prior research that finally resulted in this proposal. The FERRARI framework [16] takes any user-defined bytecode instrumentation (which can be implemented with any bytecode manipulation library) and augments it with support for complete bytecode coverage. To this end, FERRARI relies on code duplication within method bodies, similar to the approach presented in Section 3. However, FERRARI lacks support for merging multiple independent bytecode instrumentations. While the case study presented in Section 5 (the profiler JP2) could also be implemented with FERRARI, the other two case studies presented in this article may require merging of multiple instrumentations and therefore cannot be handled by FERRARI.
Based on FERRARI, the aspect weaver MAJOR [18] supports most constructs of the AspectJ language and enables aspect weaving with complete bytecode coverage. Thanks to MAJOR, aspect-based dynamic analysis tools, such as profilers [31,32] or data race detectors [30,33], are able to analyze all bytecode executed in a JVM. The dynamic AOP framework HotWave [19] relies on the same implementation techniques as FERRARI in order to achieve complete bytecode coverage.
Tanter introduced the notion of execution levels as a means to structure aspect-oriented programs so as to prevent infinite regression and unwanted interference between aspects [12]. Attracted by the idea of having execution levels in AspectJ, we developed a first ad hoc implementation [15]. This implementation and the commonalities with the techniques used in FERRARI and MAJOR progressively led us to the formulation of the PBI technique and the implementation of CodeMerger. As discussed in Section 6, PBI enables a clean re-implementation of execution levels for AspectJ. In addition, the PBI-based implementation discussed in this article enables a thorough evaluation with the complete DaCapo benchmark suite, where various compositions of aspects are woven with complete bytecode coverage.
The profiler JP2 used as a case study in this article was first presented in [22,23]. It has been used for workload characterization at the bytecode level [21]. JP2 is available as an open-source release that includes CodeMerger.
The dynamic program analysis framework DiSL (domain-specific language for instrumentation) [34], available as open-source software (http://disl.ow2.org/), relies on PBI and CodeMerger to ensure analysis with comprehensive bytecode coverage.

Related work
To the best of our knowledge, there is not much work that is directly related to this proposal of PBI. Altering program semantics through bytecode transformations is a widely used technique and has been explored and put into practice in many different flavors in Java, from low-level tools like BIT [35], BCEL [6], and ASM [36], to higher-level frameworks like Javassist [8], Jinline [37], or Soot [38]. Similar toolkits have also been proposed for other languages based on virtual machines that run intermediate bytecodes, like Squeak Smalltalk [9] and .NET. PBI is a general-purpose technique that allows combining instrumentations possibly written with any of these tools. Thus, it stands at a higher level than specific instrumentation tools and cannot be directly compared with them. The most recent version of CodeMerger, our PBI implementation for Java, is implemented using ASM, although other frameworks could be used as well.
On the other hand, there is a huge body of language-level proposals for advanced dispatch, like mixin layers [39], dynamic layer activation [40,41], aspects [4], and predicate dispatch [42]. Each of these has been realized using particular implementation techniques, specific to the targeted semantics and the implementation trade-offs that their authors were willing to make. Here again, PBI does not stand at the same level as these proposals: PBI is not a language-level mechanism but rather

CONCLUSIONS
Polymorphic bytecode instrumentation is a simple yet very effective technique to combine different instrumentations and select among them dynamically. With PBI, third-party instrumentations of a given class are combined into a single class, where each method uses a user-specified dispatch logic in order to select, at runtime, the code version to execute. Therefore, a PBI framework simply merges code versions and generates the appropriate switch.
We have shown that PBI is an effective technique by illustrating its applicability in a wide range of scenarios: to achieve complete bytecode coverage without disrupting VM bootstrap and avoiding infinite regression, to implement a comprehensive profiler, to implement execution levels for AOP, and to support adaptive dynamic analyses. All case studies have been carried out with CodeMerger, our PBI framework for Java bytecode.
A thorough performance evaluation further shows that PBI can be efficiently implemented. In particular, the pure overhead of the dispatch added by PBI is rather low when just-in-time compilation is enabled. The most efficient mode of PBI is the adaptive one, in which versions are merged into a single body by default and are implemented as private methods if the merged body would become so large that it would prevent inlining. Our measurements with the DaCapo suite confirm that complete bytecode coverage is crucial for profilers, because a large part of the computation happens in the core libraries. Our profiler JP2 is able to produce accurate profiles thanks to PBI. We then demonstrate that execution levels for AOP can be efficiently implemented and are actually more efficient than the equivalent, brittle AspectJ idioms based on control flow checks. Finally, PBI makes it possible to implement adaptive analyses that are more reactive than other systems based on class reloading.
We expect PBI to prove useful in many other cases, such as for implementing advanced dispatch mechanisms and language constructs. A preliminary experience with implementing (a restricted form of) dynamic mixin layers is discussed in [13]; extending it and exploring the implementation of other constructs is one of the main avenues for future work.