Thursday, July 28, 2011

Could Clang displace GCC generally? Part II: Performance of PostgreSQL binaries

This is the second in a two-part series on Clang. If you haven't already, you'll want to read my original post on the topic, Could Clang displace GCC among PostgreSQL developers? Part I: Intro and compile times.

So, what about the performance of PostgreSQL binaries themselves when built with each compiler? I had heard contradictory reports about the performance of binaries built with Clang. In Belgium, Chris Lattner said that Clang-built binaries could perform better, but a number of independent benchmarks suggested that Clang was generally behind, with some notable exceptions. I asked 2ndQuadrant colleague and PostgreSQL performance expert Greg Smith to suggest a benchmark that would serve as a good starting point for comparing Postgres performance when built with Clang to performance when built with GCC. He suggested that I apply Jeff Janes' recent patch for pgbench, which he'd reviewed. It stresses the executor, and therefore the CPU, quite effectively, rather than table locks or IPC mechanisms. The results of this benchmark were very interesting.

Greg provided me with shell access to a beefy server, the same server that he used in his review of Jeff’s patch, which added the -P option. I hacked together a shell script to run pgbench for this purpose. Binaries were built using GCC and Clang, each with exactly the same flags, since Clang accepts the same flags as GCC. To smooth the results out, and to get a conclusive outcome, I decided on 16 ten-minute -P runs with 4 connections, alternating between the two sets of binaries and lasting close to 3 hours in total. Here’s a summary of the results:

1) GCC test:
tps = 34.242839 (including connections establishing)
2) Clang test:
tps = 34.370732 (including connections establishing)
3) GCC test:
tps = 34.186687 (including connections establishing)
4) Clang test:
tps = 34.922954 (including connections establishing)
5) GCC test:
tps = 32.393383 (including connections establishing)
6) Clang test:
tps = 34.994233 (including connections establishing)
7) GCC test:
tps = 33.019546 (including connections establishing)
8) Clang test:
tps = 34.234937 (including connections establishing)
9) GCC test:
tps = 33.233653 (including connections establishing)
10) Clang test:
tps = 35.233373 (including connections establishing)
11) GCC test:
tps = 33.962637 (including connections establishing)
12) Clang test:
tps = 33.869868 (including connections establishing)
13) GCC test:
tps = 33.488347 (including connections establishing)
14) Clang test:
tps = 33.005470 (including connections establishing)
15) GCC test:
tps = 33.600023 (including connections establishing)
16) Clang test:
tps = 34.770840 (including connections establishing)
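
The alternating schedule behind these sixteen runs can be sketched roughly as follows. This is not the shell script that was actually used. It is a minimal Python illustration that only builds the list of pgbench invocations. The install prefixes and the database name are hypothetical, and `-P` is the mode added by Jeff's patch (it is assumed here to take no argument), while `-c` and `-T` are the standard pgbench options for client count and run duration in seconds.

```python
from itertools import cycle, islice

# Hypothetical install prefixes for the two builds; the real paths are not given.
BUILDS = {"GCC": "/opt/pg-gcc/bin", "Clang": "/opt/pg-clang/bin"}


def build_schedule(total_runs=16, minutes=10, clients=4, dbname="pgbench"):
    """Return (label, argv) pairs that alternate between the two builds."""
    schedule = []
    for label in islice(cycle(BUILDS), total_runs):
        argv = [
            f"{BUILDS[label]}/pgbench",
            "-P",                      # mode added by the patch under test (assumed flag form)
            "-c", str(clients),        # number of client connections
            "-T", str(minutes * 60),   # run duration in seconds
            dbname,                    # hypothetical database name
        ]
        schedule.append((label, argv))
    return schedule


if __name__ == "__main__":
    # A real driver would also restart the server from the matching build
    # before each run; that step is omitted from this sketch.
    for number, (label, argv) in enumerate(build_schedule(), start=1):
        print(f"{number}) {label} test: {' '.join(argv)}")
```

Interleaving the runs this way means that any slow drift on the machine over the course of the test affects both compilers about equally, rather than penalizing whichever one happened to run last.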

The average transactions per second with the Clang binaries was marginally ahead of the GCC binaries: about 34.43 tps against 33.52 tps across the eight runs each, a lead of roughly 2.7%. While further analysis is certainly needed, it is a remarkable achievement for Clang to have been able to hold its own against, or even slightly outperform, a compiler as mature and popular as GCC here.
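
For anyone who wants to check the comparison, the per-compiler averages can be recomputed from the figures listed above. The following is a small sketch using only Python's standard library. The helper names are my own, and the sample standard deviation is included only as a rough indication of run-to-run spread, not as a formal significance test.

```python
from statistics import mean, stdev

# tps figures copied from the sixteen runs above, in run order.
GCC_TPS = [34.242839, 34.186687, 32.393383, 33.019546,
           33.233653, 33.962637, 33.488347, 33.600023]
CLANG_TPS = [34.370732, 34.922954, 34.994233, 34.234937,
             35.233373, 33.869868, 33.005470, 34.770840]


def summarize(samples):
    """Return (mean, sample standard deviation) for one compiler's runs."""
    return mean(samples), stdev(samples)


def percent_lead(a, b):
    """Percentage by which the mean of a exceeds the mean of b."""
    return (mean(a) - mean(b)) / mean(b) * 100.0


def paired_wins(a, b):
    """Count adjacent run pairs in which a beat b."""
    return sum(x > y for x, y in zip(a, b))


if __name__ == "__main__":
    for name, samples in (("GCC", GCC_TPS), ("Clang", CLANG_TPS)):
        avg, sd = summarize(samples)
        print(f"{name:5s} mean = {avg:.2f} tps, stdev = {sd:.2f}")
    print(f"Clang lead: {percent_lead(CLANG_TPS, GCC_TPS):.1f}%")
    print(f"Clang ahead in {paired_wins(CLANG_TPS, GCC_TPS)} of {len(GCC_TPS)} pairs")
```

Looking at adjacent pairs, Clang was ahead in 6 of the 8 back-to-back comparisons, so the lead is not the product of one or two outlying runs, though eight samples per compiler is still a small data set.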

So, is it ready for prime time? Well, not quite. Even my bleeding-edge Fedora 15 system only comes with Clang 2.8, and only Clang 2.9 is listed as supported in PostgreSQL 9.1. These build time figures are only obtainable on very recent revisions of Clang. While I’m extremely encouraged by our pgbench benchmark results, I hesitate to recommend the use of Clang for building production PostgreSQL binaries just yet. The benchmark is quite synthetic, and may not be a great proxy for general performance, although it certainly wasn’t cherry-picked. I encourage others to independently reproduce my work here, and to suggest alternative benchmarks.

I can heartily recommend Clang for hacking on PostgreSQL today. I was also impressed with how accessible the Clang developers are if you have a problem. Sometimes you have to be a bit persistent, but in general the Clang community is quite responsive to end users' concerns.

Watch this space.


  1. This comment has been removed by the author.

  2. In order to allow people to care about your results you need to present it in a format they can understand. Here's your data fed into Gnumeric as a boxplot:

  3. Just to clarify the transaction rate info here: the new "-P" mode proposed for pgbench, and used by this benchmark, executes many loops of a server-side function for each "transaction" reported. Peter is showing that raw rate here; the results I posted to pgsql-hackers (list archive link above) decode that into a SELECT/s rate.

    Since this is a completely new mode for pgbench, you can't usefully compare any of these numbers (raw or expressed as a SELECT rate) to traditional SELECT/s rates from pgbench. It is an extremely CPU-bound workload as executed here though, which makes it quite sensitive to the compiler optimizations used. With the results here being so close, it would have been difficult to even measure a difference above the run-to-run variation using any of the older, less CPU-bound pgbench modes.

  4. So which compiler-options did you use exactly? The results would be somewhat uninteresting if you compared the compilers at something like -O0. In the tests I did so far, gcc tended to outperform clang when compiled with all the fancy optimisations on (loop unrolling/hoisting, SSE fpu, etc), but otherwise there was fairly little difference.
    It would also be interesting to see some build-time statistics.

  5. @whoppix

    The compile options are PostgreSQL defaults:

    gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -I../../src/port -DFRONTEND -I../../src/include -D_GNU_SOURCE -c -o strlcat.o strlcat.c

    To see build time statistics, go to the first article in the series.

  6. @Ryan: That's worse than the original presentation of the data. It implies the difference is much greater than it actually is.

  7. @Tyler not if you read the labels.

  8. @Ryan you know some people can actually understand numbers, if you can't, I suggest handing in your DBA badge

  9. Mind you, when Apple talks about Clang performance they mean their own branch, which is usually based on the bleeding edge. There's been a lot of change in trunk; even Clang 2.9 is well outdated.

  10. Database transactions are so disk I/O bound, unless you only hit the cache, that I doubt your benchmark has any relevance.

  11. @Philippe Strauss
    Did you actually read the blog post at all?

  12. Peter,

    If you want to compare the speed and efficiency of code generated by Clang vs. GCC, take the disk out of the loop, so don't benchmark a big SQL database. I've read your post, but "stresses the executor and the CPU quite effectively" means NOTHING; it's not worth a blog post.

    Use CPU-bound tasks; databases do actually store some data on disk, I've been told.

    This reminds me of the Dilbert joke about condescending Unix users, I must confess, but frankly...

    OK, you like DBs and wanted to play with them, but that kind of benchmark is worth nothing.

  13. Sorry, I may sound a tad harsh and like an old-timer, but... You don't even talk about the write cache setting of your PG setup. Secondly, for measuring the performance of code generated by two compilers, never use a piece of software which does disk I/O. Clang is a tremendous piece of software, but on its website the performance of generated code is not even discussed. Why? It's young; a lot of Apple developers work on it, OK, but it has no past, no years of tuning. Apple's talks on speed are about the time to compile a piece of C or C++ code...

    I've recently looked at the quality of SIMD code generated by both GCC and Clang: both are raw pieces of shit. Things are not as bad for x87 FP instructions and integers, for sure; this is just to give some idea of the abilities and limits of those compilers.

    Also, I'm astonished to see people writing a full compiler, in 2011, solely in C++; only companies with a huge pile of money can afford it. I'm thinking of the ML family of languages, which eases compiler writing, but yeah, every compiler writer wants to be self-hosting at some point.

    Download the shootout benchmarks and build them with both compilers, just the C/C++ ones; that would be a nice post.

  14. @Philippe Strauss
    This workload was specifically selected to be CPU bound.

    Why do you think I interlaced the runs of each test?

  15. Then explain in full detail how the benchmark was set up to not touch the disk but only the cache (which is a rather singular use of an SQL DB).

    I've seen Clang-generated code running in the range of 15-20% slower than GCC's; I can't find the URL of the person who did that benchmark. Your numbers are so tightly grouped around 34 TPS that it smells like a benchmark bound by I/O.

    Overall, I would not use SQL as a compiler benchmark; SQL databases are specifically tailored to store data on disk as fast as possible, for reliability.