Peter Geoghegan's blog: abbreviation

Saturday, April 4, 2015

Abbreviated keys for numeric to accelerate numeric sorts

Andrew Gierth's numeric abbreviated keys patch was committed recently. This commit added abbreviation/sortsupport for the numeric type (the PostgreSQL type which allows practically arbitrary precision, typically recommended for representing monetary values).

The encoding scheme that Andrew came up with is rather clever - it has an excellent tendency to concentrate entropy from the original values into the generated abbreviated keys in real world cases. As far as accelerating sorts goes, numeric abbreviation is at least as effective as the original text abbreviation scheme. I easily saw improvements of 6x-7x with representative queries that did not spill to disk (i.e. that used quicksort). In essence, the patch makes sorting numeric values almost as cheap as sorting simple integers, since that is often all that is actually required during sorting proper (the abbreviated keys compare as integers, except that the comparison is inverted to comport with how abbreviation builds abbreviated values from numerics as tuples are copied into local memory ahead of sorting - see the patch for exact details).

Friday, January 23, 2015

Abbreviated keys: exploiting locality to improve PostgreSQL's text sort performance

On Monday, Robert Haas committed a patch of mine that considerably speeds up the sorting of text in PostgreSQL. This was the last and the largest in a series of such patches, the patch that adds "abbreviated keys". PostgreSQL 9.5 will have big improvements in sort performance.

In realistic cases, CREATE INDEX operations on text are over 3 times faster than in PostgreSQL 9.4. Not every such utility operation, or data warehousing query involving a big sort is sped up by that much, but many will be.

This was a piece of work that I spent a considerable amount of time on over the past few months. It's easy to justify that effort, though: sorting text is a very fundamental capability of any database system. Sorting is likely the dominant cost when creating B-Tree indexes, performing CLUSTER operations, and, most obviously, for sort nodes that are required by many plans that are executed in the service of queries with ORDER BY or DISTINCT clauses, or aggregates using the GroupAggregate strategy. Most of the utility statements that need to perform sorts must perform them with a very disruptive lock on the target relation (CREATE INDEX CONCURRENTLY is a notable exception), so quite apart from the expense of the sort, the duration of sorts often strongly influences how long a production system is seriously disrupted.