Choosing the Right Compiler - Efficient Numerical Methods for Adaptive Quantile Regression

speed is only half of what was observed in Section 3.4.1.

Compilers often do a very good job of unrolling, since it is such an essential optimization and much work goes into it. What a programmer seeking a performance gain can do is make it easier for the compiler to predict how to unroll. If, for instance, the compiler knows at compile time how many iterations a particular for-loop will carry out, it can fully unroll the loop.

This is particularly useful in blocked algorithms, where the inner loop already has a fixed size. If the block size had been a variable rather than a preprocessor macro, the compiler could not know how many iterations the loop would perform, and the chance of fully unrolling it would be lost. Without full unrolling, every iteration ends with a branch that checks whether another iteration is required, and the processor may have to flush its pipeline when that branch is mispredicted.
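The following is a minimal sketch of the idea, not code from the experiments in this chapter. The names `BLOCK`, `dot_blocked` and `dot_unrolled4` are hypothetical, and a dot product is used only because it is short. Since `BLOCK` is a preprocessor macro, the inner loop of `dot_blocked` has a trip count known at compile time, and an optimizing compiler is free to turn it into something like the hand-written `dot_unrolled4`. Whether it actually does so depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size. Because it is a preprocessor macro, the trip
   count of the inner loop below is a compile-time constant. */
#define BLOCK 4

/* Dot product over n elements, where n must be a multiple of BLOCK.
   The inner loop always runs exactly BLOCK times, so the compiler
   may unroll it completely. */
double dot_blocked(const double *x, const double *y, size_t n)
{
    size_t i, k;
    double sum = 0.0;

    assert(n % BLOCK == 0);
    for (i = 0; i < n; i += BLOCK) {
        for (k = 0; k < BLOCK; k++) {
            sum += x[i + k] * y[i + k];
        }
    }
    return sum;
}

/* What full unrolling of the inner loop amounts to for a block size of 4,
   written out by hand. The additions happen in the same order as above. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    size_t i;
    double sum = 0.0;

    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        sum += x[i]     * y[i];
        sum += x[i + 1] * y[i + 1];
        sum += x[i + 2] * y[i + 2];
        sum += x[i + 3] * y[i + 3];
    }
    return sum;
}
```

Had `BLOCK` been an ordinary variable passed in at run time, the compiler would have to keep the inner loop test, or generate a generic partially unrolled version with clean-up code.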

Unrolled loops make the compiled program larger and most likely increase the amount of code that needs to be cached. Provided that supplying data is not a problem, instruction cache misses are the limiting factor for how far loops can be unrolled. If performance drops after very deep unrolling, the culprit is most likely the instruction cache not being able to keep up. This is avoided by not unrolling too much.

3.5 Choosing the Right Compiler

As mentioned in the previous section, the compiler can do things to help you in terms of performance. What a compiler does is transform the C code to machine code, so the program can be run. One might think that a program is uniquely defined by the C code, but despite C being a relatively low-level language, the actual transformation to machine code offers plenty of room for interpretation.

In order to get the best performance on a particular computer architecture, the selection of compiler is essential.

Here is a list of some of the things a compiler might do to improve performance:

• Use the mathematical processing unit in the CPU

• Loop interchange based on memory access

• Utilize the processor pipeline and superscalar capabilities

• Insert prefetch instructions

• Use internal registers for counters and temporary values

• "Change" code to perform better
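One item in the list above, loop interchange, can be illustrated by hand. The sketch below is an assumption-laden example rather than anything taken from the thesis: `N`, `sum_column_order` and `sum_row_order` are hypothetical names. C stores two-dimensional arrays row by row, so the second function touches memory with stride one, while the first jumps `N` doubles between consecutive accesses. A compiler that performs loop interchange may rewrite the first form into the second when it can prove that the result is unchanged.

```c
#include <stddef.h>

#define N 64  /* hypothetical matrix dimension */

/* Inner loop walks down a column: consecutive accesses are N doubles
   apart in memory, which makes poor use of each cache line. */
double sum_column_order(double a[N][N])
{
    size_t i, j;
    double sum = 0.0;

    for (j = 0; j < N; j++) {
        for (i = 0; i < N; i++) {
            sum += a[i][j];
        }
    }
    return sum;
}

/* Loops interchanged: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory. */
double sum_row_order(double a[N][N])
{
    size_t i, j;
    double sum = 0.0;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            sum += a[i][j];
        }
    }
    return sum;
}
```

Note that the two versions add the elements in a different order, so for general floating-point data the results can differ in the last bits. This is one reason why a compiler may only apply such a transformation at higher optimization levels or with explicit permission.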

Most compilers use some very conservative optimization settings as standard, so in order to get the full set of features, a good knowledge of the compiler optimization flags is required.

[Figure 3.15 plot; surviving legend entry: Sun C 5.8 Patch 121015-04 (no unroll)]

Figure 3.15: The well-known matrix multiplication code tested using different compilers on an UltraSPARC III. The GCC compiled version performs significantly worse than the one compiled using unrolling on the Sun compiler. It is interesting to see that the version which is not unrolled performs nearly identically to the GCC version. This indicates that the GCC version does not use the four-way superscalar capabilities of the processor.

Before selecting compiler flags, the actual compiler must be selected. For high performance computing, selecting the best compiler for a particular platform can have a tremendous impact on performance.

As an example, the best performing version of the matrix multiplication has been compiled using GCC 3.4.3 and run on the same UltraSPARC III as all the other tests. The result can be seen in Figure 3.15¹⁰. This clearly shows that selecting the best compiler can make a huge difference, and on Solaris SPARC, the obvious choice is to use Sun’s compiler. It makes good sense that Sun has the better performing compiler for their own platform. Intel likewise offers a compiler for their processors which they claim outperforms GCC. A lot of effort is required to make a compiler produce faster code, and since GCC is mostly used on x86 architectures, it is reasonable to assume that more time has been spent on getting optimal performance on that platform.

¹⁰ GCC 3.4.3 was used with the recommended optimization flags for SPARC, "gcc -g -O3 -mcpu=ultrasparc -funroll-loops", and the Sun compiler was the same as in all the other experiments in this chapter.

3.5.1 Compiler Flags

To get the most performance out of the final binary, it is a very good idea to understand and use the performance flags available in the compiler. Not all optimization capabilities are turned on by default. It might seem strange that a compiler does not produce the optimal binaries by default, but some performance features, such as unrolling, increase the binary size, which might be unwanted in some cases, and compilation takes longer when more options are turned on. The final binaries might also be intended for use on different machines (CPUs), and a binary would be incompatible with some of them if it used a performance instruction subset that is not available on all target machines.

The Sun C compiler offers many compiler flags, and it is a very good idea to understand and use them if the program is to perform well. The performance flags are very well documented in Sun’s C compiler, and a list can be obtained by writing cc -flags. There are five levels of optimization in Sun’s compiler, and these are set by the flag -xOn, where n is a number describing the level of optimization. A really nice feature of Sun C is its macro definitions of performance flags: the flag -fast enables all the recommended performance flags. This is a very good starting point for getting the full potential out of the compiler.

GCC has similar options for selecting the optimization level, but only three levels are defined (-O1 to -O3). Unfortunately, the documentation is not as complete as for the commercial Sun C, and there is no -fast flag in GCC.

When selecting the correct flags, it is important in both Sun C and GCC to select the processor architecture for which the binary is intended. Valuable performance features such as prefetching, pipelining and instruction-level parallelism are only used optimally if the compiler knows that the target processor has them.
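When timings from several compilers and flag settings are compared, it helps if each binary can report how it was built. The sketch below is one possible way to do that, using predefined macros. The function names are hypothetical. `__SUNPRO_C`, `__INTEL_COMPILER`, `__GNUC__` and `__OPTIMIZE__` are macros predefined by the respective compilers. `__OPTIMIZE__` is a GCC convention, defined at any optimization level above -O0, so a result of 0 from `optimize_macro_defined` on another compiler says nothing about whether, for example, -fast was used.

```c
#include <stdio.h>

/* Name of the compiler family that built this translation unit.
   Intel is tested before GCC because the Intel compiler may also
   define __GNUC__ for compatibility. */
const char *compiler_name(void)
{
#if defined(__SUNPRO_C)
    return "Sun C";
#elif defined(__INTEL_COMPILER)
    return "Intel C";
#elif defined(__GNUC__)
    return "GCC or GCC-compatible";
#else
    return "unknown";
#endif
}

/* 1 if the GCC-style macro __OPTIMIZE__ is defined, 0 otherwise.
   Only meaningful for GCC and compatible compilers. */
int optimize_macro_defined(void)
{
#if defined(__OPTIMIZE__)
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line build description, e.g. at the top of a benchmark log. */
void print_build_info(void)
{
    printf("compiler: %s, __OPTIMIZE__ defined: %d\n",
           compiler_name(), optimize_macro_defined());
}
```

Calling `print_build_info()` at the start of each benchmark run ties every measurement to the compiler that produced the binary. The exact flags still have to be recorded separately, for instance in the build script.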

In document Efficient Numerical Methods for Adaptive Quantile Regression (pages 53-56)