Contributions - FindingparallelbugsinJavaprograms,usingconcolicexecution Hyperconcolic

During this project, I build upon Eslamimehr and Palsberg’s research on multiple fronts; I have implemented a better data race detector and improved the concolic engine to get a better coverage of the program.

The data race detector used in their project, where based on the oldersaid algo-rithm[21]. I have implemented a version of thehuang algorithm [8], which improves the coverage.

The concolic execution used in their projects, where unable to correctly produce schedules from parallel programs, this made the coverage incomplete. In this thesis, I have devised this new hyperconcolic engine, which works in a parallel environment.

To argue about the coverage of the concolic and the hyperconcolic engine, I have developed a notion called schedule classes. Schedule classes are abstract representa-tions of the set of schedules that an analysis covers. The hyperconcolic class ([h]^hc) is stricter defined than the concolic class, because Hyperconcolic chooses its scheduler.

Using this framework of schedule class it is possible to argue if the combination of dy-namic analysis and schedule generator is sound. Using the framework, I have argued that the hyperconcolic engine paired with thehuangdata race detector is both sound and complete in programs that for all inputs terminates, and I have also proposed a technique for testing those that do not.

A schedule intermediate format has been developed, because the hyperconcolic engine covers a larger pipeline of analyses.

1.3 Contributions 5

The main contributions of this thesis are these five points.

• A new schedule format, which language independent can represent symbolic schedules.

• A survey of dynamic analysis, especially the ones detecting parallel errors, and data races. An implementation of the newest techniques,huang andsaid has been made.

• Description of the first combination of partial order constraints and input con-straints in the context of concolic execution, to my knowledge. Thus enabling real concurrent support to the concolic engine. The resulting concolic engine is called thehyperconcolic engine.

• A combination of the hyperconcolic engine and the data races analysis, which is both sound and complete for all programs with a bounded length of schedules.

A mathematical framework for arguing about the soundness of the combination is also proposed.

• A symbolic virtual machine of Java programs, which produces schedules in the new language. The machine logs the output of multiple threads and is capable of forcing a special scheduling.

CHAPTER 2 Symbolic Schedule Language

This chapter introduces the symbolic schedule language, both in mathematical nota-tions and also in a data format. The symbolic schedule will be used in the chapters, both to argue about dynamic algorithms and to prove properties for them. The sched-ule classes are introduced, as the last part of the chapter. The schedsched-ule classes serves as the theoretical foundation for arguing about maximality of the dynamic analyses and coverage of the concolic engines.

This section will use a functional approach to build the execution model. All functions are type annotated, to make the ideas of the functions easier to understand.

The resulting schedule language is a symbolic reduction of the Java engine instruction set, but should be able to represent all programs that can be reduced to the execution model. An example of the concrete language can be found in Appendix A.

2.1 The Execution Model

We have to define the structure of the programs that we are investigating, to under-stand the structure of schedules. It is possible to abstract away a lot of the gritty details of the separate languages and focus on the core, using a general execution model.

The main focus of the project is bytecode languages like assembly, LLVM IR and Java bytecode. Even though they have different execution models, all of these languages are ways to represent a program (P) that is build up of statements (S). In the model, aPcan be completely described by an initial statement.

init:P→S (2.1)

Statements

A statement in this model has the ability to do two things. It can choose the next statement to be executed (Equation (2.2)), or none to indicate the end of execution, and it can update the state of the execution (Σ) (Equation (2.3)). Both of these actions has knowledge of the current state.

8 2 Symbolic Schedule Language

next:S→Σ→S∪ {ϵ} (2.2)

exec:S→Σ→Σ (2.3)

It is possible to build a simple sequential execution model, from these functions alone:

runseqis a recursive function that a program and a state returns a new state. The top level function runseq calls the helper function with the initial statement of the programrun^′_seq. The helper function takes a statement or no statement and a state and returns a state. If the execution is not done (ϵis presented as the statement) will run^′_seqexecuted the statement on the state and find the next statement. The function then calls itself with these updated values. Notice, that there is no way to guarantee that the program will terminate, but if it does it will do so with an updated state.

Scheduler

We can describe programs with branches and loops using this model, but they are limited to sequential execution. The model does not allow us to represent concurrent programs. Concurrent execution requires different execution contexts, threads (T), which each execute statements in a serial manor. The threads can present certain parts of the state local to the thread. These substates are referred to with the thread as an index (σt), and to some degree be overlapping.

The execution needs to maintain a mapping from threads to the next statement to be executed at all times in the execution. If the thread has ended or not started it will map to nothing (J:=T→(S∪ {ϵ})). This mapping is called the current jobs of the execution. To manage these jobs we need a scheduler (Ψ). The scheduler is the algorithm, which the computer uses to schedule the program. The schedulers responsibility is to choose the next thread to be executed from the jobs:

choose : Ψ→Σ→J→T∪ {ϵ} ×Ψ chooseψ σ js = choose^′ ψ js^′

wherejs^′[x7→js xifready(js x)ψ σx]

(2.5) The choose function can from a scheduler, a state and a job mapping find a tuple; the next thread to be executed, and an updated scheduler. The function is a combination of the two functionsready andchoose^′. The functionready indicates

2.1 The Execution Model 9

with a boolean if the statement is ready to run and choose^′ choose picks a thread from the active jobs to execute.

ready:S→Ψ→Σ→ {⊤,⊥} (2.6) choose^′ : Ψ→(T→S)→T∪ {ϵ} ×Ψ (2.7) Some languages embeds requirements to the scheduler in the statements, this can be done in the terms of locks or join commands. The semantics of these statements require that they are not executed before some state or execution history has been reached, in these cases wouldreadythen return false.

If and only if the statement is ready to be executed,readywill return true, which will include the statement in the partial ordering of active jobs.

The choose^′ function returns a pair of a job and a scheduler. The scheduler is updated because the scheduler may need to react on the statements which has already been run. This is the case in a round-robin scheduler where each thread is executed in turn, the scheduler therefore has to have knowledge about which thread was executed last and for how long. The function can return the empty thread, because thechoose^′ function might get an empty mapping in some cases when none of the statements are read.

The mathematic scheduler is a good way to describe the nondeterminism of choices made by the computer during the execution of multi-threaded programs. It is possible to introduce nondeterminism in a mathematically sound way, by using an unknown scheduler in the equations. The scheduler is constrained by the semantics of the program. This means that the statements has the ability to determine some of the choices of the scheduler.

We only need one more function, for the execution to be complete. The statements can start new threads, this is represented by thespawnfunction. The function returns a set of statements (indicated by the power set type), because statements might be able to start multiple threads at once. The statements indicates the initial statement of the new thread.

spawn:S→Σ→2^S (2.8)

Execution

We can update therunfunction to also handle concurrent programs, with the help of the scheduler and the active jobs mapping:

10 2 Symbolic Schedule Language

run : P→Ψ→Σ→Σ

runp ψ σ = run^′ (init^′ p)ψ σ run^′ : 2^J→Ψ→Σ→Σ run^′ ∅ψ σ = σ

run^′ js ψ σ = run^′ (manageψ σtt js)ψ^′ (exec(js t)σt) where(t, ψ^′) =chooseψ σ js

manage : Ψ→Σ→T→J→J

manageψ σ t js = js[t7→next (js t)σ]⊎ψspawn(js t)σ

(2.9)

The notable difference fromrunseqis, that it accepts a scheduler as an input. What actually happens is that the run function for each step of the execution maintains the current jobs of the execution, which the scheduler manages.

The first step of the execution model is the to find the initial jobs of the program (init^′). This version of init is annotated because it returns a mapping, possible [T17→initp].

The scheduler choose a thread to execute from the jobs, using thechoosefunction.

Then, statement is executed and the jobs of the next execution step is managed by the helper functionmanage. This execution model does not handle ifchoosereturns the empty thread. That situation infers a deadlock and the program never ends.

The helper function replaces the executed threads statement with the new state-ment and adds all the spawned statestate-ment to the mapping, using new threads (⊎ψ).

Limitations and assumptions This model has certain limitations:

1. This execution model requires that all events in some way can be serialized, i.e. executed after each other. This is not always the case. If the memory model of the computer is not sequential consistent, some executions might not be possible in our execution model [12].

2. The state cannot be changed throughout the execution of the program, this means that it is impossible to model true interactive programs with this model.

It can on the other hand be simulated by modeling the user as a part of the program.

In document FindingparallelbugsinJavaprograms,usingconcolicexecution Hyperconcolic (Sider 10-16)