Micro CPU Benchmarks: Isolating the FPU
Although it surely wasn't the main subject of our first article, the FLOPS (Floating Point Operations Per Second) portion was one part where I clearly made a mistake. The --noaltivec flag and the comment in the gcc 3.3 compiler docs that Altivec was enabled by default made me believe that some Altivec SIMD optimization was being done when compiling flops, a synthetic micro FPU benchmark. That was not true: flops is double precision, which Altivec does not handle, and gcc 3.3 did not support vectorisation.
As I wrote in the article, we used -O2 and then tried a bucket load of other options like --fast-math --mtune=G5, but it didn't make any significant difference.
Again, note that benchmarking with flops is not real world, but it isolates the FPU power. Flops shows the maximum double-precision power of the core by making sure that the program fits in the L1 cache. Flops consists of 8 tests, and each test has a different but well-known instruction mix. The most frequently used instructions are FADD (addition), FSUB (subtraction) and FMUL (multiplication). We used the following on the Opteron-based PCs:
gcc -O2 -march=k8 flops.c -o flops
And, on the G5 machines, we used:
gcc -O2 -mcpu=G5 flops.c -o flops
The command "gcc --version" gave this output: "gcc (GCC) 4.0.0 Copyright (C) 2005 Free Software Foundation, Inc."
Let us check out the results:
MOD | FADD | FSUB | FMUL | FDIV | Powermac G5 2.7 GHz gcc 4.0 | Powermac G5 2.7 GHz gcc 3.3 | Powermac G5 2.5 GHz gcc 3.3 | Opteron 850 2.4 GHz gcc 3.3.3 | Opteron 850 2.4 GHz gcc 4.0 |
1 | 50% | 0% | 43% | 7% | 1158 | 1104 | 1026 | 1404 | 1319 |
2 | 43% | 29% | 14% | 14% | 607 | 665 | 618 | 844 | 695 |
3 | 35% | 12% | 53% | 0% | 3047 | 2890 | 2677 | 1955 | 1866 |
4 | 47% | 0% | 53% | 0% | 1583 | 522 | 486 | 1856 | 1850 |
5 | 45% | 0% | 52% | 3% | 1418 | 675 | 628 | 1831 | 1362 |
6 | 45% | 0% | 55% | 0% | 2163 | 915 | 851 | 1922 | 1698 |
7 | 25% | 25% | 25% | 25% | 546 | 284 | 265 | 562 | 502 |
8 | 43% | 0% | 57% | 0% | 2020 | 925 | 860 | 1989 | 1703 |
Average: | | | | | 1568 | 998 | 926 | 1545 | 1374 |
As Gabriel Svelto and other readers pointed out, the problem with gcc 3.3 on PowerPC is that it outputs very poorly scheduled code for these CPUs. As a result, gcc 3.3 does not make good use of the FP units of the G5 core, which are capable of FMADD (fused multiply-add) instructions. An FMADD multiplies the 64-bit, double-precision floating-point operand in floating-point register (FPR) "FRA" by the 64-bit, double-precision operand in FPR "FRC", and then adds the 64-bit, double-precision operand in FPR "FRB" to the product. Thus, if the code allows it, you can do a multiplication and an addition while executing only one instruction. As the results show, gcc 4.0 is a lot better at using these capabilities.
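To make the multiply-add pattern concrete, here is a minimal sketch in C. It is not taken from flops.c; the function names multiply_add and dot are made up for illustration, and the parameter names simply mirror the FRA, FRC and FRB registers described above. Whether a given compiler actually turns the expression into a single FMADD depends on its version and flags, so treat this as the shape of code that can benefit, not as a guarantee.

```c
#include <stddef.h>

/* FRA * FRC + FRB, written as ordinary C. On a target that has a fused
   multiply-add instruction (such as the PowerPC fmadd), the compiler may
   emit one instruction for this expression; whether it does depends on
   the compiler version and the flags used. */
double multiply_add(double fra, double frc, double frb) {
    return fra * frc + frb;
}

/* A dot product is a chain of multiply-adds: every iteration multiplies
   two operands and adds the product to a running sum. */
double dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum = multiply_add(x[i], y[i], sum);
    }
    return sum;
}
```

Code dominated by this pattern, like the FADD/FMUL-heavy modules in the table, is where good FMADD usage can roughly halve the number of floating-point instructions issued.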
It is a bit disappointing that gcc 4.0 lowers the performance of the Opteron compared to gcc 3.3.3, but this article is not about compiler technology; rather, it is about comparing the G5 and the Apple platform to the x86 platform. With our current benchmark data, we can conclude that the G5's FPU performance is as good as that of the best x86 FP chip, the AMD Athlon 64 / Opteron. Using IBM's compiler for the G5 and Intel's compiler on the Opteron would probably give higher results for both platforms, but we wanted a comparison with exactly the same compiler technology.
47 Comments
JohanAnandtech - Friday, September 2, 2005 - link
Sorry couldn't resist :-). (for the rest of the world, pannekoek is dutch for Pancake)
Desktop performance is ok, as desktop apps are similar to the workstation apps we tested in the first article. Those apps spend from 5-20% in the OS, while server apps spend up to 80% of their time in the OS!
However, I should point out that we tested Mac OS X SERVER, so it is a problem for the Xserves.
Pannenkoek - Friday, September 2, 2005 - link
I stand corrected then. However, my reasoning still applies, it's just that Apple relies even more on its brand than on technology to sell server systems apparently. Who runs Mac OS servers anyway, it's an oxymoron. ;-)
P.S. Do not mock my nick, it served well in beating godlike UT bots, and should be honoured as much as Loque.
Tanclearas - Thursday, September 1, 2005 - link
"Apple told us that the problem lies in the Apachebench (the client side), which stalls from time to time and thus, generates too low of a load on the (Apache) server."
How does this explanation make any sense? Linux obviously doesn't have a problem with these "stalls".
JohanAnandtech - Friday, September 2, 2005 - link
What follows is not what Apple said, but my interpretation...
They are probably pointing out that the version for Mac OS X has a Mac OS X specific bug. Of course, who is to blame? I am sceptical like you.
mariush - Thursday, September 1, 2005 - link
Page 4:
We used the following on the Opteron based PCs:
Gcc -O2 -mcpu=G5 flops.c -o flops
And, on the G5 machines, we used:
Gcc -O2 -march=k8 flops.c -o flops
I think it's the other way around.
Houdani - Thursday, September 1, 2005 - link
Aye, was gonna point that out also.
In addition, on page 3 should you list the Yellow Dog Linux along with OSX in the Software section for the Apple PowerMac G5?
Shinei - Thursday, September 1, 2005 - link
My question is, would the memory latencies be so high for the 970FX if high-end RAM was used for the Linux tests (like, say, some TCCD or BH-5 at 2-2-2-5), instead of the standard 3-3-3-8 SPD that ships with the G5 system? Or is there some limitation to the G5 motherboard that prevents posting with performance RAM as a way for Apple to ensure that only certain, accepted DIMMs are used with their computers?
Anyway, these results are very telling about what the OSX86 Macs are going to perform like--that is to say, ~25% slower than the equivalent Windows/Linux boxes running the same hardware...
IntelUser2000 - Sunday, September 4, 2005 - link
That doesn't matter since they are testing workstations, Irwindale and Opteron is also using CAS3 RAM. No workstations/servers use 2-2-2-5 RAM.
The poor scores of OS X compared to Linux makes sense. G5 was rumored to be fast in speccpu benchmarks but came out to be slower. Must be that rumor systems were benched with Linux and the production was benched with OSX.
I am impressed with OS X's features though.
Jedi2155 - Thursday, September 1, 2005 - link
The G5 motherboard has the limitations due to Apple's way to insure you only buy certified ram. The SPD settings must be perfect.
ceefka - Thursday, September 1, 2005 - link
I am humbled by the sheer expertise of Johan. Amazing work, Johan!
This makes me even more curious about Intel's contribution to the next generation of Macs. How will they compare to the best G5s?