Friday, January 23, 2015

Abbreviated keys: exploiting locality to improve PostgreSQL's text sort performance


On Monday, Robert Haas committed a patch of mine that considerably speeds up the sorting of text in PostgreSQL. This was the last and the largest in a series of such patches, the patch that adds "abbreviated keys". PostgreSQL 9.5 will have big improvements in sort performance.

In realistic cases, CREATE INDEX operations on text are over 3 times faster than in PostgreSQL 9.4. Not every such utility operation or data warehousing query involving a big sort is sped up by that much, but many will be.

This was a piece of work that I spent a considerable amount of time on over the past few months. It's easy to justify that effort, though: sorting text is a very fundamental capability of any database system. Sorting is likely the dominant cost when creating B-Tree indexes, performing CLUSTER operations, and, most obviously, for sort nodes that are required by many plans that are executed in the service of queries with ORDER BY or DISTINCT clauses, or aggregates using the GroupAggregate strategy. Most of the utility statements that need to perform sorts must perform them with a very disruptive lock on the target relation (CREATE INDEX CONCURRENTLY is a notable exception), so quite apart from the expense of the sort, the duration of sorts often strongly influences how long a production system is seriously disrupted.


My interest in sorting is not new: I first worked on it in 2011. Early research on it back then prompted Robert Haas and Tom Lane to write the SortSupport infrastructure, which I've now extended here. Originally, the SortSupport infrastructure was all about providing alternative versions of comparators for use in sort routines, versions that avoided certain overhead otherwise inherent to calling functions that are generally accessible from SQL. As a highly extensible system, PostgreSQL requires that sort behavior be defined in terms of a default B-Tree operator class, which is itself defined in terms of SQL operators with underlying SQL-callable functions. These functions are written in C for built-in types, but in principle they could be written in a PL, like PL/Python, for example. When the underlying comparator can be expected to compile to just a few CPU instructions, "fmgr elision" (avoiding SQL function call overhead) is important - the time spent in the "fmgr" when not eliding it shows up prominently on profiles with certain types (or rather, it did in the past).

Note that I generalized SortSupport to work for more cases, so B-Tree index builds will get a nice little boost in PostgreSQL 9.5, even for types like integer and float8. That's not what this blog post is really about, though. This blog post is about the interesting new direction that the SortSupport infrastructure has been taken in, beyond mere "fmgr elision" - abbreviation.

A well known problem with the text datatype in PostgreSQL is that it uses the operating system/C standard library strcoll() function to resolve comparisons as tuples are sorted, which is very expensive. It's at least a thousand times more expensive than comparing integers, for example. This general problem is something that Robert Haas has expressed concern about in the past.

The expense relates to a normalization process whereby string comparisons use complex tables to make sure that strings are compared according to the rules of some particular culture or nation (that is, some particular collation associated with a locale). Even in English speaking countries, this is important; for example, the en_US collation considers differences in case (upper case versus lower case) only after alphabetical ordering and diacritical differences, so case is considered last of all (the C locale, on the other hand, will sort upper case and lower case strings into two distinct batches, which is typically not desirable). In addition, while English usually doesn't have diacritics, sometimes it does. At work, I'm still sometimes annoyed by the sort order of the Linux Hipchat client's user list, which uses the C locale. Hi Ómar!

It was always suspected that we could more effectively amortize the cost of these locale-aware comparisons, by performing a transformation of strings into binary keys using strxfrm(), and sorting the keys instead (using a strcmp()-based comparator with the keys, which only considers raw byte ordering). This comparison will produce equivalent results to just using strcoll() directly. But the binary keys are much larger than the original strings - typically almost 4x larger. Moreover, we'd still need to do a tie-breaker strcmp() comparison (to check for strict binary equality) using the original string, when strcoll() reports equality, because the historic idea of equality that the text type offers is strict binary equality. There were some historic edge cases where a tie-breaker strcmp() was not performed following strcoll() returning '0', resulting in corrupt B-Tree indexes on a Hungarian database. strcoll() could return 0 despite not being passed a pair of bitwise-identical strings.

Having to keep around the original text datum seemed like an additional burden on the whole idea of using strxfrm() blobs as sort keys. It seemed like using binary keys to sort had a lot of promise, but we couldn't quite work out how to exploit that idea - until recently.
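To make the transform-then-sort idea concrete, here is a minimal sketch in Python (a stand-in for PostgreSQL's C implementation; `locale.strxfrm()` and `locale.strcoll()` wrap the same C library functions discussed above, and the C locale is assumed purely so the example is portable):

```python
import functools
import locale

# Sketch of transform-then-sort: sorting strxfrm() keys with plain
# comparisons orders strings exactly as strcoll() would.
# The C locale is assumed here only for portability of the example.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["banana", "Apple", "apple", "cherry"]

# Transform each string once (O(n) transformations)...
by_xfrm = sorted(words, key=locale.strxfrm)

# ...which orders exactly as a strcoll()-based comparator would
# (an average of O(n log n) comparisons).
by_coll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

assert by_xfrm == by_coll
```

The same equivalence holds under any locale; only the key bytes differ.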

Abbreviated keys were committed:

Use abbreviated keys for faster sorting of text datums.

This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficient
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.

There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.

Peter Geoghegan, reviewed by me.

It's surprisingly effective to just store the first 8 bytes of a strxfrm() blob, and tie-break relatively infrequently by using a full old-style comparison, rather than the more obvious approach of sorting with pointers to strxfrm()-generated blobs (the approach that the C standard recommends for general purpose text sorting).

Entropy


By storing just the first 8 bytes (on 64-bit platforms; 4 bytes on 32-bit platforms) of the strxfrm() blob in a field that would otherwise contain a pointer-to-text (since text is a pass-by-reference type) - the same type-punned field that directly stores the representation of pass-by-value types like integer - most comparisons can be resolved using just those 8 bytes directly, and without pointer-chasing. At the same time, the cost of locale transformations is still quite effectively amortized because, as always when using strxfrm(), the transformation is performed O(n) times rather than the O(n log n) times that comparisons are performed (the transformation work done by strxfrm() and by strcoll() may not have exactly comparable costs, but they're close enough).
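A rough sketch of the abbreviated comparator, again in Python for concision (the function names and the zero-padding are illustrative assumptions; the real implementation stores the prefix in the type-punned datum field and the C locale is assumed only for portability):

```python
import locale

locale.setlocale(locale.LC_COLLATE, "C")  # assumed, for a portable sketch

ABBREV_LEN = 8  # 8 bytes on 64-bit platforms, 4 on 32-bit platforms


def abbrev_key(s: str) -> bytes:
    """First ABBREV_LEN bytes of the strxfrm() blob, zero-padded."""
    blob = locale.strxfrm(s).encode("utf-8")
    return blob[:ABBREV_LEN].ljust(ABBREV_LEN, b"\0")


def abbrev_compare(a: str, b: str) -> int:
    ka, kb = abbrev_key(a), abbrev_key(b)
    if ka != kb:
        # Resolved by the abbreviated bytes alone; no pointer-chasing
        # to the full strings is needed in the real implementation.
        return -1 if ka < kb else 1
    if a == b:
        # Cheap memcmp()-style tie-breaker: bitwise-identical strings
        # never need an expensive strcoll() call.
        return 0
    # Authoritative fall-back comparison on the full strings.
    return locale.strcoll(a, b)
```

Strings that differ within the first 8 transformed bytes never touch the fall-back path at all.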


It turns out that the large binary key blobs produced by strxfrm(), while much larger than the original strings, have a significant concentration of entropy towards the start of the blob (assuming the use of the Unicode collation algorithm, or an algorithm with similar properties). This is because the representation consists of a series of "levels". The "primary weights", which appear first, represent primary alphabetical ordering when using Latin scripts. So whitespace differences and punctuation differences are not represented at that level (nor are differences in case). Diacritics on accented Latin characters, for example, are represented at a subsequent level, and so the abbreviated key representation typically won't vary if accents are added to or removed from a text datum. This is important because languages that use accents extensively, like French or Spanish, will get a concentration of entropy in their 8 byte abbreviated keys that's about the same as if no accents were used, even though accented code points usually take 2 bytes of storage in UTF-8, rather than the 1 byte taken by unaccented Latin alphabet code points.

Locality


A more general problem with sort performance is the problem of cache misses. My earlier work on sorting targeted pass-by-value PostgreSQL types like integer and float8. These pass-by-value types naturally have great locality of reference. Their comparisons are integral operations, which are fast, but operating on a representation that is tightly packed is what makes sorting integers with Quicksort perhaps as fast as is practically possible for a comparison-based sort. Cache miss penalties are likely to be the dominant cost on modern CPUs, which are more bottlenecked on memory bandwidth and latency with every successive generation (PDF); sorting infrastructure must weigh this heavily.

When I initially discussed the idea of abbreviated keys, there was a certain degree of skepticism from other Postgres hackers. What if most comparisons are not resolved by abbreviated comparisons, due to text datums with a lot of redundant or repeated information at the beginning? Could all the strxfrm() work go to waste when that happens? Well, for one thing, low cardinality sets (tuples with text columns that have a relatively low number of distinct values) are not a problem. That's because strcoll() is still a huge cost, and if we can have our authoritative tie-breaker comparator observe that the strings are identical, then no strcoll() comparison is ever needed - we can just exit early with a simple, cheap opportunistic binary comparison (memcmp()), which is almost as good. But what about when there are many different strings with differences towards the end of the string, past the 8th or so byte?

CPU cache characteristics have presented complicated engineering trade-offs for sorting infrastructure for a long time. Database luminary Jim Gray proposed an abbreviation-like technique as early as 1994, in his AlphaSort paper (PDF). He describes a "key-prefix sort" in the paper. Even back in 1994, Gray observed that memory latency was the dominant cost by a wide margin. The underlying trends in CPU performance characteristics have continued apace since then. Before his death in 2007, Gray officiated the Sort Benchmark. Among the rules for the "Daytona sort" category, which concerns the sort performance of general-purpose algorithms (which is what I'm interested in), it states that Daytona sort entrants "must not be overly dependent on the uniform and random distribution of key values in the sort input". It's almost as if Gray was saying: "of course I expect you to use abbreviated keys, but don't push your luck!". And so it is for PostgreSQL. Some cases benefit much more than others, and some cases might even be slightly regressed.


Merge Joins


An earlier piece of work for 9.5 had conventional comparisons (not involving abbreviated keys) always try an opportunistic memcmp() tie-breaker first. This is likely to be particularly beneficial for merge joins (quite apart from any sort node that may feed the merge join), since they must "synchronize" relations using comparisons that can often be expected to indicate equality. Multi-column sorts on text are also considerably accelerated, where many leading column comparisons can be expected to indicate equality. It's also important for abbreviated keys, because as already mentioned we can still win big with low cardinality sets, provided the full tie-breaker comparisons are resolved with a cheap memcmp().
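The opportunistic tie-breaker amounts to something like the following (a Python stand-in for the C code; the equality test here plays the role of the implementation's memcmp(), and the C locale is assumed only to keep the sketch portable):

```python
import locale

locale.setlocale(locale.LC_COLLATE, "C")  # assumed, for portability


def text_cmp(a: str, b: str) -> int:
    """Conventional comparator with an opportunistic binary-equality
    fast path: bitwise-identical strings resolve immediately, without
    ever calling the expensive strcoll()."""
    if a == b:  # stands in for a cheap memcmp()
        return 0
    return locale.strcoll(a, b)
```

In a merge join synchronizing two sorted inputs, a large fraction of comparisons indicate equality, so this fast path fires often.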

Insurance


Fundamentally, when you do a cost/benefit analysis, abbreviated keys are very compelling. The upsides are clearly very large, and the break-even point for switching to using abbreviation is surprisingly far out. We cannot ignore the performance benefits of these techniques because some much rarer cases will be slightly regressed. But, as it happens, we have cheap worst case insurance: HyperLogLog is used to cheaply and fairly accurately check the cardinality of both abbreviated keys and the original text values. If abbreviated cardinality is an effective proxy for full cardinality, then most comparisons will either use abbreviated comparisons, or use a cheap memcmp() tie-breaker, which is almost as good. Otherwise, we abort abbreviation before the sort proper is underway.
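The abort decision ultimately reduces to comparing two cardinality estimates. A sketch of that check (the function name and the 0.05 threshold are assumptions for illustration, not PostgreSQL's actual constants; in the real code, both estimates come from HyperLogLog sketches maintained as tuples are abbreviated):

```python
def should_abort_abbreviation(abbrev_distinct: float,
                              full_distinct: float,
                              min_ratio: float = 0.05) -> bool:
    """Abort when abbreviated keys yield comparatively few distinct
    values, i.e. abbreviated cardinality is a poor proxy for full
    cardinality and most comparisons would need full tie-breakers.
    The 0.05 ratio is an assumed threshold for illustration only."""
    if full_distinct <= 0:
        return False
    return (abbrev_distinct / full_distinct) < min_ratio
```

When the check trips before the sort proper begins, the sort simply falls back to conventional comparisons, so the worst case costs little more than the wasted strxfrm() calls.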

Future


Abbreviated keys are just infrastructure. While text is the most compelling case, there are at least a few other datatypes that would greatly benefit from support for abbreviation. These include:

  • numeric
  • character(n)
  • uuid
  • bytea
  • citext (case insensitive text, from contrib/citext)
  • inet
  • cidr
  • macaddr
Update: A patch for numeric SortSupport with abbreviation was committed!

I welcome others with an interest in making sorting faster to work on the relevant opclass support for each of these types, and possibly others. Other people may be able to come up with novel encoding schemes for these types, that maximize the entropy within the finished abbreviated keys. Order-preserving compression is likely to be an area where text's support could be improved, by making comparisons resolved at the abbreviated key level more frequent. Hopefully the benefits of the abbreviated key infrastructure will not be limited to accelerating sorts on text.

9 comments:

  1. Great post and great contribution. Simply amazing!

    ReplyDelete
  2. would/will random UUIDs benefit from this? been reading on random UUID insert performance problems recently, has something to do with them not being near the most recent insert which I presume has something to do with sorting. I'm not an algorithms person.

    ReplyDelete
  3. Caleb: No, it wouldn't help with that

    ReplyDelete
  4. " ... significant concentration of entropy towards the start of the blob (assuming the use of the Unicode collation algorithm, or an algorithm with similar properties). "

    Does this mean COLLATE "C" is less effective than UTF8? I ask because I too read Robert Haas' blog post and applied the recommendations therein. So I'd like to know if your improvements negate and/or regress his recommendations.

    ReplyDelete
    Replies
    1. It's definitely the case that the cost of abbreviating text with a locale other than "C" is almost entirely comprised of the cost of strxfrm() calls. It's also true that the effectiveness of the abbreviated key optimization is better explained by making comparisons not have to do pointer chasing -- in general, memory latency is almost certainly the dominant cost when sorting tuples. People have a tendency to think it's all about amortizing the cost of expensive strcoll() calls, but it's a lot more complicated than that.

      I think that this change is in many ways something that undermines the traditional advice of thinking about using the "C" locale, because it will often amortize the cost of the transformation effectively anyway. But yes, as I allude to above, it's possible that the concentration of entropy in each abbreviated key could be significantly more than with the "C" locale, and that this factor could more than make up for the up-front cost of strxfrm() calls. This seems particularly likely to be true in languages other than English, or in text where there are a lot of spaces or punctuation towards the start. I wouldn't like to speculate about the difference in general, because datasets are so varied. Besides, in practice I think most people cannot get away with using the "C" locale, even in English speaking countries, so mostly I'm just glad that I particularly improved that common case.

      Delete
  5. First I would like to thank you for great work and interesting blog post. I always wonder why case insensitive indexes and comparations are no first class citizens in PostgreSQL. Of course we can use citext, but it is a contrib module and now we have a good example that some great new features are not available to those of us who need case insensitive collations. Now we have year 2016. Any news about abbreviated keys and citext?

    ReplyDelete
  6. Couldn't you avoid many of the tiebreaking strcmp() comparisons for equal abbreviated keys by storing a very small hash of the full datum (say just one byte: https://en.wikipedia.org/wiki/Pearson_hashing) with the abbreviated key?

    ReplyDelete
    Replies
    1. Not without breaking the long-standing behavior of text sorting. What you're talking about is only useful in the case where abbreviated cardinality is a poor proxy for overall cardinality, which is a bad case for abbreviated keys, and where the sort order can be made entirely to suit making sorting faster, rather than user requirements (surely order-preserving hashing is impossible?).

      Where it could make sense is where you actually have a semi-contrived sort order that only considers speed:

      https://www.postgresql.org/message-id/5702.1480896563@sss.pgh.pa.us

      You can get the best of both worlds here precisely because the sort order can be totally contrived to suit the purposes of aggregation (because the user doesn't care about sorting; it's just an implementation detail of aggregation or something like that). Using a scheme like the one you described, you could take advantage of locality or clustering that the data has, on the one hand, since the first 7 bytes treated as a binary string are still probably close enough in practice to some classic user-visible definition of how text should be sorted, but on the other hand you get the advantage of fewer tie-breakers when not enough entropy is represented in abbreviated keys.

      Delete