Peter Geoghegan's blog

Friday, January 23, 2015

Abbreviated keys: exploiting locality to improve PostgreSQL's text sort performance

On Monday, Robert Haas committed a patch of mine that considerably speeds up the sorting of text in PostgreSQL. This was the last and the largest in a series of such patches, the patch that adds "abbreviated keys". PostgreSQL 9.5 will have big improvements in sort performance.

In realistic cases, CREATE INDEX operations on text are over 3 times faster than in PostgreSQL 9.4. Not every such utility operation, or data warehousing query involving a big sort is sped up by that much, but many will be.

This was a piece of work that I spent a considerable amount of time on over the past few months. It's easy to justify that effort, though: sorting text is a very fundamental capability of any database system. Sorting is likely the dominant cost when creating B-Tree indexes, performing CLUSTER operations, and, most obviously, for sort nodes that are required by many plans that are executed in the service of queries with ORDER BY or DISTINCT clauses, or aggregates using the GroupAggregate strategy. Most of the utility statements that need to perform sorts must perform them with a very disruptive lock on the target relation (CREATE INDEX CONCURRENTLY is a notable exception), so quite apart from the expense of the sort, the duration of sorts often strongly influences how long a production system is seriously disrupted.

What I think of jsonb

Unsurprisingly, there has been a lot of interest in the jsonb type, which made it into the upcoming 9.4 release of Postgres. I was initially a reviewer of jsonb, although since I spent weeks polishing the code, I was ultimately credited as a co-author.

Jsonb is a new datatype for Postgres. It is distinct from the older json datatype in that its internal representation is binary, and in that it is internally typed. It also makes sophisticated nested predicates within queries on jsonb indexable. I've occasionally described the internally-typed scalar values as having “shadow types” unknown to the core SQL parser. This has several implications. For example, if you sort two Jsonb values containing only scalar numbers, the implementation invokes the numeric comparator (which the jsonb default B-Tree opclass comparator is defined in terms of). The on-disk representation of jsonb includes the same representation as is used for, say, numerics (as the internal binary representation of JSON primitive numbers, for example). Plus, JSON objects are de-duplicated by key on input, and optimized for cheap binary searches within a single jsonb. Still, like the earlier json type, jsonb in every sense “speaks JSON”. There are some limitations on what can be represented as a jsonb number, but those are exactly the same limitations that apply to the core numeric type (plus some limitations imposed by the JSON RFC, such as not accepting NaN values). I hope it suffices to say that these limitations are virtually irrelevant, and that many implementations have similar or worse limitations. All of these minor implementation-defined restrictions are explicitly anticipated and allowed for by the recent JSON RFC-7159.

Moving on

Today was my last day at 2ndQuadrant.

The experience of working with 2ndQuadrant in the last couple of years has been very positive. I just decided it was time for a change. Being able to work on interesting problems during my time at 2ndQuadrant, both as a Postgres developer and as a consultant has been great fun, and very personally rewarding. I would like to acknowledge the invaluable support of both Simon Riggs and Greg Smith - thank you both. I wish the entire 2ndQuadrant staff all the best.

I expect to remain active as a Postgres developer, and already have plans to relocate to work for another company that is well known within the community.

Tuesday, December 4, 2012

Finding plans in pg_stat_plans easily with pg_find_plans

As I recently blogged about, pg_stat_plans is a PostgreSQL satellite project I've been working on that aims to support earlier versions of Postgres that cannot use the new pg_stat_statements, and to track execution costs at the plan rather than the query granularity. It allows the user to easily explain each stored query text to see the plan for the entry, and has features that facilitate monitoring planner regressions.

Since PostgreSQL 9.0, support for machine-readable EXPLAIN output has existed. I'm not aware that anyone else got around to actually doing something interesting with this capability, though. I knew that in order to get the most benefit from pg_stat_plans, it ought to be possible to leverage this capability to search for plans based on arbitrary criteria, directly from SQL.

I've written an experimental submodule of pg_stat_plans, called pg_find_plans, that is designed to do just that - to quickly find plans and their execution costs, for those plans that, say, perform a sequential scan on a known large table.