Micro CPU Benchmarks: Isolating the FPU
Although it surely wasn't the main subject of our first article, the FLOPS (Floating Point Operations Per Second) portion was one part where I clearly made a mistake. Indeed, the --noaltivec flag and the comment that Altivec was enabled by default in the gcc 3.3 compiler docs made me believe that some Altivec SIMD optimization was being done when compiling flops, a synthetic micro FPU benchmark. That was not true: flops is double precision and gcc 3.3 did not support vectorisation.
As I wrote in the article, we used -O2 and then tried a bucket load of other options like --fast-math --mtune=G5, but it didn't make any significant difference.
Again, note that benchmarking with flops is not real world, but it isolates the FPU power. Flops shows the maximum double precision power that the core has by making sure that the program fits in the L1-cache. Flops consists of 8 tests, and each test has a different but well known instruction mix. The most frequently used instructions are FADD (addition), FSUB (subtraction) and FMUL (multiplication). We used the following on the Opteron based PCs:
gcc -O2 -march=k8 flops.c -o flops
And, on the G5 machines, we used:
gcc -O2 -mcpu=G5 flops.c -o flops
The command "gcc --version" gave this output: "gcc (GCC) 4.0.0 Copyright (C) 2005 Free Software Foundation, Inc."
Let us check out the results:
MOD | FADD | FSUB | FMUL | FDIV | Powermac G5 2.7 GHz gcc 4.0 | Powermac G5 2.7 GHz gcc 3.3 | Powermac G5 2.5 GHz gcc 3.3 | Opteron 850 2.4 GHz gcc 3.3.3 | Opteron 850 2.4 GHz gcc 4.0 |
1 | 50% | 0% | 43% | 7% | 1158 | 1104 | 1026 | 1404 | 1319 |
2 | 43% | 29% | 14% | 14% | 607 | 665 | 618 | 844 | 695 |
3 | 35% | 12% | 53% | 0% | 3047 | 2890 | 2677 | 1955 | 1866 |
4 | 47% | 0% | 53% | 0% | 1583 | 522 | 486 | 1856 | 1850 |
5 | 45% | 0% | 52% | 3% | 1418 | 675 | 628 | 1831 | 1362 |
6 | 45% | 0% | 55% | 0% | 2163 | 915 | 851 | 1922 | 1698 |
7 | 25% | 25% | 25% | 25% | 546 | 284 | 265 | 562 | 502 |
8 | 43% | 0% | 57% | 0% | 2020 | 925 | 860 | 1989 | 1703 |
Average | | | | | 1568 | 998 | 926 | 1545 | 1374 |
As Gabriel Svelto and other readers pointed out, the problem with gcc 3.3 on PowerPC is that it generates very poorly scheduled code for these CPUs. As a result, gcc 3.3 does not make good use of the FP units of the G5 core, which are capable of FMADD (floating multiply-add) instructions. An FMADD multiplies the 64-bit, double-precision floating-point operand in floating-point register (FPR) "FRA" by the 64-bit, double-precision floating-point operand in FPR "FRC", then adds the 64-bit, double-precision floating-point operand in FPR "FRB" to the product. Thus, if the code allows it, you get a multiplication and an addition while executing only one instruction. gcc 4.0 is a lot better at using this capability, as the table shows: the average of the 2.7 GHz G5 rises from 998 with gcc 3.3 to 1568 with gcc 4.0.
It is a bit disappointing that gcc 4.0 lowers the performance of the Opteron compared to gcc 3.3.3, but this article is not about compiler technology; rather, it is about comparing the G5 and the Apple platform to the x86 platform. With our current benchmark data, we can conclude that the G5's FPU performance is as good as that of the best x86 FP chip, the AMD Athlon 64 / Opteron. Using IBM's compiler for the G5 and Intel's compiler on the Opteron would probably give higher results for both platforms, but we wanted a comparison with exactly the same compiler technology.
47 Comments
stmok - Thursday, September 1, 2005
LOL... As every day passes, it seems more "interesting things" are revealed from Apple solutions.
ViRGE - Thursday, September 1, 2005
Granted, some of this was over my head (more than I'd like to admit to), but your results are nonetheless very interesting, Johan. Now that we have the Linux/G5 numbers, there's no arguing that there's a weakness in MacOSX somewhere, which is a bit depressing as a Mac user, but still a very useful insight as to how there's obviously something very broken in some design aspect of the OS (it simply shouldn't be getting crushed like it is). My only question now is how Apple and its devs will respond to this - it is pretty damning after all.
Thanks for finally getting some Linux/G5 numbers out to settle this.
sdf - Friday, September 2, 2005
By changing hardware platforms.
No, seriously.
A transition from PowerPC to Intel would be the perfect time to correct ABI flaws like this. It isn't that the G5 causes the slow down, it's that the slow down (maybe) can't really be fixed without breaking binary compatibility. A CPU transition is clearly going to do that anyway, so maybe they'll just wait...
toelovell - Thursday, September 1, 2005
I am kind of curious to see how Darwin would work on an x86 based system for these same tests. There are x86 binaries for Darwin 8, so it should be possible to run these tests and compare Darwin with Linux on an x86 platform. This would help to see if the OS really is the limitation. Just a thought.
JohanAnandtech - Thursday, September 1, 2005
If Linux is capable of pushing the G5 8 times higher than Mac OS X does, there is little doubt in my mind that the OS is the problem. Or did I understand you wrong?
Anyway, I have no experience whatsoever with Darwin. My first impression is that installing Darwin on x86 is probably a very masochistic experience, due to lack of proper drivers. We might get it working, but can it really run MySQL and other apps? There are probably libraries missing... Will the results be representative of anything, as it is probably tuned for just getting it running instead of performance? Anyone with Darwin x86 experience?
wjcott - Thursday, September 1, 2005
The only interest I have in a Mac OS is if they are going to sell it without a computer. I would love to have OS X, but I must build the machine.
Quanticles - Thursday, September 1, 2005
Every component must be fine tuned to the utmost degree... Every BIOS Setting... Every Hidden Register... *crazy eyes* =)