5.2 Benchmark setup

5.2.6 Limitations of the benchmark setup

We have designed the benchmark to produce results that are as fair and consistent as we could make them. However, we have identified some limitations of the benchmark design and setup that we have not been able to fully remove or work around.

5.2.6.1 Language support

Since the project compiler does not support the full ECMAScript Language Specification, it will always have an unfair advantage over full implementations. Even if none of the unsupported features are actually used, the fully compliant implementations still have code to support them, and this could potentially slow these implementations down compared to an implementation that does not. Specifically, the project compiler implements a simplified object property model compared to the specification, which could be significant in a comparison with a fully compliant implementation. It also means that even if the benchmark programs have the desired property of using a wide selection of language features, there are language features that will not be exercised.
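To make the property model point concrete, the following sketch shows, in standard ECMAScript 5, some of the per-property machinery that a fully compliant implementation has to honour for every object. Which parts the project compiler simplifies is not detailed in this section, so this is only an illustration of what the specification requires. The names `describeProperty` and `point` are made up for the example.

```javascript
// Illustration only: per-property attributes and accessors required by ECMAScript 5.
// `describeProperty` and `point` are hypothetical names, not part of any benchmark.
function describeProperty(obj, name) {
  var d = Object.getOwnPropertyDescriptor(obj, name);
  return {
    writable: d.writable === true,        // accessor properties have no `writable`
    enumerable: d.enumerable,
    configurable: d.configurable,
    accessor: typeof d.get === 'function'
  };
}

var point = { x: 1 };

// A data property with non-default attributes.
Object.defineProperty(point, 'id', {
  value: 42, writable: false, enumerable: false, configurable: false
});

// An accessor property: reading it runs a function.
Object.defineProperty(point, 'double', {
  get: function () { return this.x * 2; },
  enumerable: true, configurable: true
});
```

A compliant engine has to be prepared for attributes and accessors like these on any property access, even in programs that never define them, which is the kind of cost a simplified model can skip.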

Partial mitigation We have removed all explicit use of the unsupported features from the Delta Blue benchmark to avoid giving the project compiler an unfair advantage and to make sure that it can compile and execute the program.

For instance, all "try", "throw" and "catch" statements are removed to avoid forcing the overhead of exception handling on the other implementations. This cannot be fully avoided, however, as there are still some operations that can throw exceptions, and the other implementations will handle these cases, whereas we do not. The simplifications in the object property model mentioned above will unavoidably give the project compiler an unfair advantage.
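As a rough illustration of this kind of rewrite, the sketch below shows a function that uses "throw", an exception-free variant that reports failure through its return value, and an operation that can still raise an exception implicitly. The functions are hypothetical and are not taken from the Delta Blue source.

```javascript
// Hypothetical example; not taken from the Delta Blue source.

// Version using an explicit exception (a feature the project compiler lacks).
function popOrThrow(stack) {
  if (stack.length === 0) {
    throw new Error('empty stack');
  }
  return stack.pop();
}

// Exception-free rewrite: failure is reported through the return value.
// Assumes `null` is never stored on the stack.
function popOrNull(stack) {
  if (stack.length === 0) {
    return null;
  }
  return stack.pop();
}

// Implicit exceptions remain: reading a property of `undefined` must raise
// a TypeError in a conforming implementation, so the check cannot be removed
// from the other engines by editing the benchmark source.
function readLength(value) {
  return value.length;
}
```

Calling `readLength(undefined)` in a compliant engine raises a TypeError, which is the residual exception handling cost referred to above.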

Other limitations, such as not supporting automatic semicolon insertion, should not give our compiler an advantage over the other implementations.

5.2.6.2 Compile time

The benchmark run times reported for the project compiler do not include the compile time. This follows from a design choice: we have implemented an ahead-of-time compiler, where all the compilation is finished before execution. However, since we compare the project compiler to interpreters and JIT-compiling systems that do compilation during run time, it needs to be addressed.

Partial mitigation The Dijkstra and Delta Blue benchmarks include a "dry-run" that does not count towards the measured run time. This gives the JIT compiler a chance to identify "hot" functions and optimize them. The benchmark also uses timings inside the program to report the run time rather than measuring the run time of the process. This allows us to avoid including the parse time of the interpreters and JIT compilers in the results.
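The dry-run and in-program timing described above can be sketched as follows. This is a minimal illustration, not the harness used in the benchmarks: the names `timeBenchmark`, `sumTo` and `report` are hypothetical, and the use of `Date.now()` as the clock is an assumption.

```javascript
// Minimal sketch of in-program timing with an untimed dry run.
// `timeBenchmark`, `sumTo` and `report` are hypothetical names.
function timeBenchmark(workload) {
  workload();                     // dry run: not timed, lets a JIT find hot code
  var start = Date.now();         // clock starts after parsing and warm-up
  var result = workload();        // the measured run
  var elapsedMs = Date.now() - start;
  return { result: result, elapsedMs: elapsedMs };
}

// A small stand-in workload.
function sumTo(n) {
  var total = 0;
  for (var i = 1; i <= n; i++) {
    total += i;
  }
  return total;
}

var report = timeBenchmark(function () { return sumTo(1000000); });
```

Because the clock is read inside the program, the time spent parsing the source and running the dry run is excluded from `elapsedMs`, whereas timing the whole process from the outside would include both.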

5.2.6.3 Startup time

When we measure the startup time for the project compiler, all of the time is spent on compilation. When we measure the startup time of NodeJs and RingoJs, some of the time may also come from starting up the server features of these implementations. In the startup time measurements, our compiler therefore has a clear advantage over NodeJs and RingoJs, because we have not implemented all the server features that these implementations have.

Partial mitigation This is a flaw in our test that we have not resolved. It means that even if our compiler uses less time to start up than the other systems, we cannot conclude that it is a faster program. If our compiler is slower, however, we will be able to conclude that NodeJs and RingoJs can start up and be ready to begin execution faster than our compiler.
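To show why this cannot be separated out from the outside, the sketch below measures the wall-clock time of starting a runtime on an empty program. It is not the measurement script used for the results: it assumes a Node.js environment with permission to spawn child processes, and `measureStartupMs` is a hypothetical helper.

```javascript
// Sketch: wall-clock time to start a runtime on an empty program.
// Assumes Node.js; `measureStartupMs` is a hypothetical helper.
var spawnSync = require('child_process').spawnSync;

function measureStartupMs(command, args) {
  var start = process.hrtime();
  var child = spawnSync(command, args);   // blocks until the child exits
  var diff = process.hrtime(start);       // [seconds, nanoseconds]
  if (child.error) {
    throw child.error;
  }
  return diff[0] * 1000 + diff[1] / 1e6;  // milliseconds
}

// Start the same Node.js binary that runs this script, with an empty program.
var startupMs = measureStartupMs(process.execPath, ['-e', '']);
```

The single number returned covers process creation, engine initialisation and any server features the runtime loads at startup, so a measurement of this kind cannot attribute the time to one of them.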

5.2.6.4 The benchmark suite used

The Delta Blue benchmark used was designed and developed by teams that design and implement JavaScript JIT compilers for use in a browser context. This raises the concern that the benchmark is targeted at browser use and therefore does not give a good measurement of performance for a JavaScript implementation that is primarily designed to be used server side.

Partial mitigation Although the Octane benchmark suite, which Delta Blue is part of, is designed to test the JavaScript implementations of web browsers, there are no tests of DOM interaction. We therefore believe that it is a fair test to use. If anything, the benchmark will give NodeJs an advantage over our compiler, since NodeJs is optimized towards this benchmark, which is part of the continuous evaluation of V8 performance.

The Dijkstra benchmark does not carry this concern because it does shortest path finding, a task that has seen use in server-side applications such as Google Maps. The loops benchmark is so simple that it is not focused on any particular context.

5.2.6.5 The benchmark selected from the suite

The Octane benchmark suite contains 13 different benchmarks, but we have included only one of them in our tests. This means that we do not get the coverage offered by the entire benchmark suite, and one could raise the concern that we might have cherry-picked a test on which the project compiler does really well.

Partial mitigation The test that we have chosen primarily tests an object-oriented programming style in JavaScript. The other tests in the suite benchmark features that we do not support (RegExp, for each, eval, etc.) or are simply too large for us to convert for use by the project compiler. To convert them we must, among other things, manually insert every missing semicolon in the file, and the Box2D test is 9000 lines of code.

5.2.6.6 Use of production code in the benchmarks

One of our goals when making the benchmarks was to use code that was close to production code that could be of use in the industry. One could raise the concern that the benchmark programs do not have this property.

Partial mitigation We believe that the Delta Blue and Dijkstra benchmarks have this property to a high degree because they do constraint solving and shortest path finding, respectively. These are disciplines that have many practical applications. The coding style of Delta Blue is object-oriented, which is a very popular style in the industry.

The loops benchmark, however, does not have the property. Even if the structure of the program does resemble some useful algorithms such as 2D folding, the program is extremely simple, and it does not compute anything that would be of any value in real life. Therefore we will only use the loops benchmark to make conclusions on how well the compilers can optimize very simple program constructs.
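For readers unfamiliar with this kind of micro-benchmark, the sketch below shows what a program of this shape can look like. It is a hypothetical stand-in and not the actual loops benchmark; the function name `nestedLoops` and the arithmetic in the loop body are invented for the illustration.

```javascript
// Hypothetical stand-in for a "loops"-style micro-benchmark.
// Not the actual loops benchmark; the arithmetic is invented for illustration.
function nestedLoops(width, height) {
  var acc = 0;
  for (var y = 0; y < height; y++) {      // outer loop over rows
    for (var x = 0; x < width; x++) {     // inner loop over columns
      acc = (acc + x * y) % 1000003;      // keep the accumulator bounded
    }
  }
  return acc;
}
```

A program like this exercises loop control, integer arithmetic and local variable access, and little else, which is why it says something about how simple constructs are optimized but not about production workloads.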

In document Compiling Dynamic Languages (pages 84-87)