Performance/Analysis/Example2

#### Contents

## Loop Invariant Code Motion and LLVM

Now let’s take a look at the results from simple_types_loop_invariant.cpp<P>

We’ll use **LLVM-gcc** 2.1, specifically:

gcc version 4.0.1 (Apple Computer, Inc. build 5449)(LLVM build 2.1) Apple PowerMac G5 2.5 Ghz Quad Processor

llvm-g++ -I. -O3 simple_types_loop_invariant.cpp -o simple_types_loop_invariant_llvm ./simple_types_loop_invariant_llvm > simple_types_loop_invariant_llvm.txt

## Integer math

The int32_t case is fairly representative of all the integer results under **LLVM**.

test description absolute operations ratio with number time per second test0 0 "int32_t variable add" 2.57 sec 622.57 M 1.00 1 "int32_t multiple variable adds" 2.57 sec 622.57 M 1.00 2 "int32_t variable subtract" 2.57 sec 622.57 M 1.00 3 "int32_t multiple variable subtracts" 3.52 sec 454.55 M 1.37 4 "int32_t variable multiply" 3.84 sec 416.67 M 1.49 5 "int32_t multiple variable multiplies" 3.86 sec 414.51 M 1.50 6 "int32_t variable divide" 24.36 sec 65.68 M 9.48 7 "int32_t multiple variable divides" 53.19 sec 30.08 M 20.70 8 "int32_t multiple variable mixed" 3.86 sec 414.51 M 1.50 9 "int32_t variable and" 1.95 sec 820.51 M 0.76 10 "int32_t multiple variable and" 1.99 sec 804.02 M 0.77 11 "int32_t variable or" 2.56 sec 625.00 M 1.00 12 "int32_t multiple variable or" 2.58 sec 620.16 M 1.00 13 "int32_t variable xor" 1.99 sec 804.02 M 0.77 14 "int32_t multiple variable xor" 1.96 sec 816.33 M 0.76

### addition

This is good. Multiple loop invariant adds are the same speed as a single add. And that means the compiler moved the loop invariant calculations out of the loop.

### subtraction

Hmm, multiple loop invariant subtracts take longer than a single subtract.
That means that **LLVM** failed to optimize loop invariant subtraction.
And that is really surprising because it did optimize addition.

for (x=0;x<count;++x) result += input[x] - v1 - v2 - v3 - v4;

should have been optimized to

temp = v1 + v2 + v3 + v4; for (x=0;x<count;++x) result += input[x] - temp;

An assembly dump shows that **LLVM** issues each subtract inside the loop.

### multiplication

Also good. This runs a little slower than adds due to instruction latency, but is optimized correctly.

### division

Clearly not optimized. Hmm, what was the source for this?

for (x=0;x<count;++x) result += ((((input[x] / v1) / v2) / v3) / v4);

Whoops! That can’t be optimized for integer math without getting wrong results. I messed up and used the wrong test code, and failed to catch it for months. I’ll have to change the test to include something that can be moved out of the loop.

### mixed math

This should be the same speed as a single add, but isn’t. And yet it is not as slow as if all the operations were done in the inner loop.

for (x=0;x<count;++x) result += input[x] + v1 - v2 * v3 / v4;

should have been optimized to

temp = v1 - v2 * v3 / v4; for (x=0;x<count;++x) result += input[x] + temp;

Reading the assembly shows that **LLVM** did move the multiply and divide out of the loop, but kept 2 adds and a subtract.
This is probably related to whatever caused the subtract case to fail earlier.

### bitwise and

Good.

### bitwise or

Both versions run about the same speed, but how is this running slower than a **bitwise and**?
Reading the assembly shows that **LLVM** did optimize the multiple **bitwise or**s correctly,
generating almost identical code to the **bitwise and** loops.
But there’s an interesting problem - the 7 instruction **bitwise or** loops straddle a cacheline boundary.
On some CPUs (like the PowerPC 970 aka G5) that can add significant stalls.
If I change the code slightly (by removing some other tests), the loops fall on better alignment and run as fast as the **bitwise and**.

### bitwise xor

Good.

### Other sizes and unsigned values

The 8, 16, 64 bit and unsigned values are very similar to int32_t but are also hitting unaligned loop problems that make analysis of optimizations a little more difficult (without reading all the assembly).

### Conclusions for integer loop invariants

- With the -O3 flag for optimization,
**LLVM**is usually moving integer loop invariant calculations out of inner loops. **LLVM**does not optimize loop invariant integer subtracts.*[ LLVM 2.3 fixes this problem - ccox ]***LLVM**is not aligning loops and can have code position dependent slowdowns on some processors.*[ LLVM team is working on it - ccox ]*

## Floating point math

test description absolute operations ratio with number time per second test0 0 "float variable add" 3.84 sec 416.67 M 1.00 1 "float multiple variable adds" 6.20 sec 258.06 M 1.61 2 "float variable subtract" 3.85 sec 415.58 M 1.00 3 "float multiple variable subtracts" 6.20 sec 258.06 M 1.61 4 "float variable multiply" 3.84 sec 416.67 M 1.00 5 "float multiple variable multiplies" 4.72 sec 338.98 M 1.23 6 "float variable divide" 18.57 sec 86.16 M 4.84 7 "float multiple variable divides" 53.81 sec 29.73 M 14.01 8 "float multiple variable mixed" 3.85 sec 415.58 M 1.00 test description absolute operations ratio with number time per second test0 0 "double variable add" 3.84 sec 416.67 M 1.00 1 "double multiple variable adds" 6.23 sec 256.82 M 1.62 2 "double variable subtract" 3.84 sec 416.67 M 1.00 3 "double multiple variable subtracts" 6.22 sec 257.23 M 1.62 4 "double variable multiply" 3.84 sec 416.67 M 1.00 5 "double multiple variable multiplies" 4.17 sec 383.69 M 1.09 6 "double variable divide" 18.58 sec 86.11 M 4.84 7 "double multiple variable divides" 53.80 sec 29.74 M 14.01 8 "double multiple variable mixed" 3.85 sec 415.58 M 1.00

**LLVM** fails to completely optimize even a single case.
But the assembly dump shows that it does partly optimize the mixed math case similar to the way it did for integers.

### Conclusions for floating point loop invariants

- With the -O3 flag for optimization,
**LLVM**is rarely moving floating point loop invariant calculations out of inner loops. - NOTE: LLVM doesn't do the FP optimizations because the LLVM architect believes it would affect the result of the FP operations.