Performance/Analysis/Example2

Loop Invariant Code Motion and LLVM
Integer math
Floating point math
1. Conclusions for floating point loop invariants

Loop Invariant Code Motion and LLVM

Now let’s take a look at the results from simple_types_loop_invariant.cpp<P>

We’ll use LLVM-gcc 2.1, specifically:

gcc version 4.0.1 (Apple Computer, Inc. build 5449)(LLVM build 2.1)
Apple PowerMac G5 2.5 Ghz Quad Processor

llvm-g++ -I. -O3 simple_types_loop_invariant.cpp -o simple_types_loop_invariant_llvm
./simple_types_loop_invariant_llvm > simple_types_loop_invariant_llvm.txt

Integer math

The int32_t case is fairly representative of all the integer results under LLVM.

test                          description   absolute   operations   ratio with
number                                      time       per second   test0

 0                 "int32_t variable add"   2.57 sec   622.57 M     1.00
 1       "int32_t multiple variable adds"   2.57 sec   622.57 M     1.00
 2            "int32_t variable subtract"   2.57 sec   622.57 M     1.00
 3  "int32_t multiple variable subtracts"   3.52 sec   454.55 M     1.37
 4            "int32_t variable multiply"   3.84 sec   416.67 M     1.49
 5 "int32_t multiple variable multiplies"   3.86 sec   414.51 M     1.50
 6              "int32_t variable divide"  24.36 sec    65.68 M     9.48
 7    "int32_t multiple variable divides"  53.19 sec    30.08 M    20.70
 8      "int32_t multiple variable mixed"   3.86 sec   414.51 M     1.50
 9                 "int32_t variable and"   1.95 sec   820.51 M     0.76
10        "int32_t multiple variable and"   1.99 sec   804.02 M     0.77
11                  "int32_t variable or"   2.56 sec   625.00 M     1.00
12         "int32_t multiple variable or"   2.58 sec   620.16 M     1.00
13                 "int32_t variable xor"   1.99 sec   804.02 M     0.77
14        "int32_t multiple variable xor"   1.96 sec   816.33 M     0.76

addition

This is good. Multiple loop invariant adds are the same speed as a single add. And that means the compiler moved the loop invariant calculations out of the loop.

subtraction

Hmm, multiple loop invariant subtracts take longer than a single subtract. That means that LLVM failed to optimize loop invariant subtraction. And that is really surprising because it did optimize addition.

for (x=0;x<count;++x)
	result += input[x] - v1 - v2 - v3 - v4;

should have been optimized to

temp = v1 + v2 + v3 + v4;
for (x=0;x<count;++x)
	result += input[x] - temp;

An assembly dump shows that LLVM issues each subtract inside the loop.

multiplication

Also good. This runs a little slower than adds due to instruction latency, but is optimized correctly.

division

Clearly not optimized. Hmm, what was the source for this?

for (x=0;x<count;++x)
	result += ((((input[x] / v1) / v2) / v3) / v4);

Whoops! That can’t be optimized for integer math without getting wrong results. I messed up and used the wrong test code, and failed to catch it for months. I’ll have to change the test to include something that can be moved out of the loop.

mixed math

This should be the same speed as a single add, but isn’t. And yet it is not as slow as if all the operations were done in the inner loop.

for (x=0;x<count;++x)
	result += input[x] + v1 - v2 * v3 / v4;

should have been optimized to

temp = v1 - v2 * v3 / v4;
for (x=0;x<count;++x)
	result += input[x] + temp;

Reading the assembly shows that LLVM did move the multiply and divide out of the loop, but kept 2 adds and a subtract. This is probably related to whatever caused the subtract case to fail earlier.

bitwise and

Good.

bitwise or

Both versions run about the same speed, but how is this running slower than a bitwise and? Reading the assembly shows that LLVM did optimize the multiple bitwise ors correctly, generating almost identical code to the bitwise and loops. But there’s an interesting problem - the 7 instruction bitwise or loops straddle a cacheline boundary. On some CPUs (like the PowerPC 970 aka G5) that can add significant stalls. If I change the code slightly (by removing some other tests), the loops fall on better alignment and run as fast as the bitwise and.

bitwise xor

Good.

Other sizes and unsigned values

The 8, 16, 64 bit and unsigned values are very similar to int32_t but are also hitting unaligned loop problems that make analysis of optimizations a little more difficult (without reading all the assembly).

Conclusions for integer loop invariants

With the -O3 flag for optimization, LLVM is usually moving integer loop invariant calculations out of inner loops.
LLVM does not optimize loop invariant integer subtracts. [ LLVM 2.3 fixes this problem - ccox ]
LLVM is not aligning loops and can have code position dependent slowdowns on some processors. [ LLVM team is working on it - ccox ]

Floating point math

test                        description   absolute   operations   ratio with
number                                    time       per second   test0

 0                 "float variable add"   3.84 sec   416.67 M     1.00
 1       "float multiple variable adds"   6.20 sec   258.06 M     1.61
 2            "float variable subtract"   3.85 sec   415.58 M     1.00
 3  "float multiple variable subtracts"   6.20 sec   258.06 M     1.61
 4            "float variable multiply"   3.84 sec   416.67 M     1.00
 5 "float multiple variable multiplies"   4.72 sec   338.98 M     1.23
 6              "float variable divide"  18.57 sec    86.16 M     4.84
 7    "float multiple variable divides"  53.81 sec    29.73 M    14.01
 8      "float multiple variable mixed"   3.85 sec   415.58 M     1.00

test                         description   absolute   operations   ratio with
number                                     time       per second   test0

 0                 "double variable add"   3.84 sec   416.67 M     1.00
 1       "double multiple variable adds"   6.23 sec   256.82 M     1.62
 2            "double variable subtract"   3.84 sec   416.67 M     1.00
 3  "double multiple variable subtracts"   6.22 sec   257.23 M     1.62
 4            "double variable multiply"   3.84 sec   416.67 M     1.00
 5 "double multiple variable multiplies"   4.17 sec   383.69 M     1.09
 6              "double variable divide"  18.58 sec    86.11 M     4.84
 7    "double multiple variable divides"  53.80 sec    29.74 M    14.01
 8      "double multiple variable mixed"   3.85 sec   415.58 M     1.00

LLVM fails to completely optimize even a single case. But the assembly dump shows that it does partly optimize the mixed math case similar to the way it did for integers.

Conclusions for floating point loop invariants

With the -O3 flag for optimization, LLVM is rarely moving floating point loop invariant calculations out of inner loops.
NOTE: LLVM doesn't do the FP optimizations because the LLVM architect believes it would affect the result of the FP operations.

Contents