Monday, June 4, 2012

Towards 14,000 write transactions per second on my laptop

Postgres 9.2 will have many improvements to both read and write scalability. Simon Riggs and I collaborated on a performance feature that greatly increased the throughput of small write transactions. Essentially, it accomplishes this by reducing the lock contention surrounding an internal lock called WALWriteLock. When an individual backend/connection holds this lock, it is empowered to write WAL from wal_buffers, an area of shared memory that temporarily holds WAL until it is written, and ultimately flushed to persistent storage.

Original update.sql "new group commit" benchmark, January 2012. This made it into Postgres 9.2. Here, we compare the performance of my original patch (red line) and Postgres master in January (green line). 9.1 performance on this benchmark would probably be very similar to that of the baseline seen here.

With this patch, we don’t have the backends queue up for the WALWriteLock to write their WAL as before. Rather, they either immediately obtain the WALWriteLock, or else queue up for it. However, when the lock becomes available, no waiting backend actually immediately acquires the lock. Rather, each backend once again checks if WAL has been flushed up to the LSN that the transaction being committed needs to be flushed up to. Oftentimes, they will find that this has happened, and will be able to simply fastpath out of the function that ensures that WAL is flushed (a call to that function is required to honour transactional semantics). In fact, it is expected that only a small minority of backends (one at a time, dubbed “the leader”) will actually ever go through with flushing WAL. In this manner, we batch commits, resulting in a really large increase in throughput, as you can tell from the diagram above.

In Postgres 9.2, this improvement automatically becomes available without any further configuration. This was one of the subjects of my recent talk, co-presented with Greg Smith at PgCon 2012, “A Batch of Commit Batching”

There is some confusion about the semantics of group commit. Previously, the project sometimes referred to two settings - commit_delay and commit_siblings - as offering a group commit implementation. Certainly, the intent of those settings was to allow the DBA to trade off latency for throughput, which characterises the group commit feature of some well known proprietary RDBMSs (though not, I believe, any of the various MySQL flavours). The patch that we worked on for 9.2 did not add arbitrary latency at any point in the hope of increasing throughput, and was technically totally orthogonal to commit_delay and commit_siblings.

commit_delay was always something that we didn't really have much practical use for. It merely adds a latency of commit_delay microseconds just before each transaction goes to commit, provided there are at least commit_siblings number of concurrent transactions in progress, in the hope that this latency will allow each backend to find their “flush up-to” point in the sequential WAL stream already flushed-up to when they go to flush WAL immediately after the delay, due to some other backend having flushed up to that point by then. When this happens, they can fastpath out of the function, often without ever having to acquire WALWriteLock

It is intuitively obvious that these are not hugely useful settings, generally just as likely to hurt performance as help it, and they were only really retained because they could improve throughput in certain narrow benchmarks. After all, whatever “certain other” backend was supposed to flush everyone else’s WAL was probably delayed too. With the new group commit implementation, the setting became more marginal than ever, and seemed to have next to no positive effect on commit speed. I argued for its removal in a dedicated thread on pgsql-hackers, and Greg and I were vocal in calling for its deprecation during our talk.

I subsequently realised that if we were to do that, there was still an uncomfortable tension. Even though commit_delay is almost completely ineffective, its purported purpose - to allow the DBA to trade-off transaction latency to increase the throughput of the server as a whole - is still a quite reasonable use-case, and a use-case that we were not serving well, with or without commit_delay. To give a good example of when throughput is more important than latency, Google’s F1 specialist relational database, built to replace their legacy Adwords MySQL cluster, very explicitly prioritises throughput over latency. What’s more, as I’ve already mentioned, this trade-off seems to be integral to the group commit implementation of a particular expensive, proprietary RDBMS. It occurred to me that commit_delay really doesn’t work with the new group commit. It is an entirely separate piece of code, whose usefulness was never revisited in light of new group commit, that sometimes simply adds a sleep. I then had what turned out to be a valuable insight: by delaying only within the leader backend, and not within all connection backends, we can much more effectively stagger transactions so that they coincide and can be flushed together. What previously required serendipitous timing - which, if botched, could have the opposite effect to that desired - could now be mostly ensured. The leader alone would wait the full commit_delay before proceeding with flushing, but followers would continue to get in the queue behind the leader for that much longer. This was found to further dramatically improve transaction throughput, particularly in cases that the original “new group commit” did not do so well on, such as workloads where the server didn’t have so many connections, or transactions were not single-statement write statements. It also made a respectable dent even at the highest client counts:

insert.sql benchmark. Importantly, the patch to adjust commit_delay (green line) shows by far the largest improvements over master without a commit_delay (red line) at lower client counts that are more representative of the real world. In this example, setting commit_delay to the same value, 4,000, without the still-not-in-postgres adjustment/patch to commit_delay behaviour (blue line) actually hurts performance a bit.

Update: The commit_delay patch has since been committed, and will be available in Postgres 9.3.

Interestingly, a commit_delay of around 4,000 (microseconds) seem about optimal for the above benchmark, performed on my laptop, which is roughly half of the raw fdatasync() (my wal_sync_method, the default for Linux) speed for the 7200 RPM disk that I have in my Thinkpad laptop, as reported by pg_test_fsync. The below benchmark is for pgbench’s tpc-b benchmark, and compares a 9.2 HEAD/new group commit baseline, which did not improve nearly so much as a result of our initial work on this:

Less dramatic, but still rather compelling improvements with the new commit_delay patch (green line) for a tpc-b.sql workload, as compared to setting commit_delay to 0 (red line), which is representative of master.  This workload was previously not helped much by "new group commit".

Experimentation shows that a commit_delay of around 4,000 remains roughly optimal here, even though the tpc-b transactions do rather more than the single-insert statement of the insert.sql benchmark’s transactions, indicating that commit_delay now shows some promise as a value that can be set in a well-principled way, without having to optimise for some synthetic workload to see these kinds of improvements, or having to do something else so baroque as to make the whole feature completely useless.

Thanks to the 4CaaSt project for funding this recent research into commit_delay and commit_siblings.

Unfortunately, I missed the 9.2 deadline for submitting this small but critical adjustment to commit_delay. For that reason, you’ll have to wait for 9.3 to benefit from it.


  1. I'm so looking forward to updating some of the databases here to this version.
    You made my day :)