Saturday, April 4, 2015

Andrew Gierth's numeric abbreviated keys patch was committed recently. This commit added abbreviation/sortsupport for the numeric type (the PostgreSQL type which allows practically arbitrary precision, typically recommended for representing monetary values).

The encoding scheme that Andrew came up with is rather clever - it has an excellent tendency to concentrate entropy from the original values into the generated abbreviated keys in real-world cases. As far as accelerating sorts goes, numeric abbreviation is at least as effective as the original text abbreviation scheme. I easily saw improvements of 6x-7x with representative queries that did not spill to disk (i.e. that used quicksort). In essence, the patch makes sorting numeric values almost as cheap as sorting simple integers, since that is often all that is actually required during sorting proper: the abbreviated keys compare as integers, except that the comparison is inverted to comport with how abbreviation builds abbreviated values from numerics as tuples are copied into local memory ahead of sorting (see the patch for exact details).
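To illustrate the general shape of the technique, here is a minimal sketch in C - invented names, not code from the patch or from PostgreSQL: each value is condensed into a machine-word-sized integer up front, the sort's comparator usually only compares those integers, and only a tie forces the authoritative comparison of the full values.

#include <stdint.h>
#include <string.h>

typedef struct SortItem
{
    int64_t     abbrev;     /* abbreviated key, generated once per value */
    const char *full;       /* the full, authoritative value */
} SortItem;

/* authoritative comparison of the full values (plain strings here) */
static int
full_compare(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* comparator used by the sort: integers first, full values only on a tie */
static int
abbrev_compare(const void *va, const void *vb)
{
    const SortItem *a = va;
    const SortItem *b = vb;

    if (a->abbrev < b->abbrev)
        return -1;
    if (a->abbrev > b->abbrev)
        return 1;
    return full_compare(a->full, b->full);
}

Abbreviated-key ties are the only case where the original values are touched during the sort proper, which is why an encoding scheme that concentrates entropy into the abbreviated keys matters so much.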
Friday, January 23, 2015
Abbreviated keys: exploiting locality to improve PostgreSQL's text sort performance
On Monday, Robert Haas committed a patch of mine that considerably speeds up the sorting of text in PostgreSQL. This was the last and the largest in a series of such patches: the patch that adds "abbreviated keys". PostgreSQL 9.5 will have big improvements in sort performance.

In realistic cases, CREATE INDEX operations on text are over 3 times faster than in PostgreSQL 9.4. Not every such utility operation or data warehousing query involving a big sort is sped up by that much, but many will be.
This was a piece of work that I spent a considerable amount of time on over the past few months. It's easy to justify that effort, though: sorting text is a very fundamental capability of any database system. Sorting is likely the dominant cost when creating B-Tree indexes and performing CLUSTER operations, and, most obviously, for the sort nodes required by many plans executed in the service of queries with ORDER BY or DISTINCT clauses, or of aggregates using the GroupAggregate strategy. Most of the utility statements that need to perform sorts must do so while holding a very disruptive lock on the target relation (CREATE INDEX CONCURRENTLY is a notable exception), so quite apart from the expense of the sort itself, the duration of sorts often strongly influences how long a production system is seriously disrupted.
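For a flavour of where a text abbreviated key might come from, here is a hedged sketch of the simplest possible case, roughly the "C" collation (the real implementation also has to deal with collations and other details): pack the leading bytes of each string into an unsigned integer, most significant byte first, so that comparing two keys as plain integers agrees with memcmp() on those prefixes.

#include <stdint.h>
#include <string.h>

/*
 * Toy abbreviation for text (illustration only): pack up to the first 8
 * bytes of the string into a uint64_t, most significant byte first.
 * Comparing two such keys with < and > then agrees with memcmp() on those
 * leading bytes, so most comparisons during the sort never touch the full
 * strings at all.
 */
static uint64_t
abbreviate_text(const char *s)
{
    uint64_t    key = 0;
    size_t      len = strlen(s);
    size_t      n = len < sizeof(key) ? len : sizeof(key);

    for (size_t i = 0; i < n; i++)
        key |= (uint64_t) (unsigned char) s[i] << (8 * (sizeof(key) - 1 - i));

    return key;
}

Ties on the abbreviated key (strings sharing a long common prefix, for instance) still fall back to a full comparison, which is why the technique shines on realistic inputs where the leading bytes carry plenty of entropy.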
Sunday, March 23, 2014
What I think of jsonb
Unsurprisingly, there has been a lot of interest in the jsonb type, which made it into the upcoming 9.4 release of Postgres. I was initially a reviewer of jsonb, although since I spent weeks polishing the code, I was ultimately credited as a co-author.
Jsonb is a new datatype for Postgres. It is distinct from the older json datatype in that its internal representation is binary, and in that it is internally typed. It also makes sophisticated nested predicates within queries on jsonb indexable. I've occasionally described the internally-typed scalar values as having “shadow types” unknown to the core SQL parser. This has several implications. For example, if you sort two jsonb values containing only scalar numbers, the implementation invokes the numeric comparator (which the jsonb default B-Tree opclass comparator is defined in terms of). The on-disk representation of jsonb reuses the representations of existing core types; JSON primitive numbers, for example, are stored internally as numerics. Plus, JSON objects are de-duplicated by key on input, and optimized for cheap binary searches within a single jsonb. Still, like the earlier json type, jsonb in every sense “speaks JSON”. There are some limitations on what can be represented as a jsonb number, but those are exactly the same limitations that apply to the core numeric type (plus some limitations imposed by the JSON RFC, such as not accepting NaN values). I hope it suffices to say that these limitations are virtually irrelevant, and that many implementations have similar or worse limitations. All of these minor implementation-defined restrictions are explicitly anticipated and allowed for by the recent JSON RFC, RFC 7159.
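As a toy illustration of the point about de-duplicated, search-optimized objects - this is not jsonb's actual on-disk layout, and the struct and functions are invented - storing an object's keys sorted and de-duplicated at input time is what makes later key lookups a cheap binary search rather than a scan of the whole document:

#include <stdlib.h>
#include <string.h>

typedef struct KV
{
    const char *key;
    const char *value;
} KV;

static int
kv_cmp(const void *a, const void *b)
{
    return strcmp(((const KV *) a)->key, ((const KV *) b)->key);
}

/* 'pairs' must already be sorted by key, as a jsonb-like type would ensure */
static const char *
lookup(const KV *pairs, size_t npairs, const char *key)
{
    KV          probe = {key, NULL};
    const KV   *hit = bsearch(&probe, pairs, npairs, sizeof(KV), kv_cmp);

    return hit ? hit->value : NULL;
}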
Wednesday, August 1, 2012
Sorting improvements in PostgreSQL 9.2: the case for micro-optimisation
There has been much discussion of performance improvements in the upcoming 9.2 release of PostgreSQL. Recently, I noticed that Regina Obe and Leo Hsu's new book, "PostgreSQL: Up and Running", prominently lists "Sorting improvements that improve in-memory sorting operations by as much as 20%" as a performance feature of that release. While that description is about right, I'm not sure the improvement warrants such prominent placement, at least in sheer terms of its likely impact on the performance of production PostgreSQL systems - we packed a lot of great performance improvements into 9.2. The likely reason it was picked up on in the book, and the real reason for this blogpost, is the story behind the development of the optimisation, which I for one find interesting and worth sharing. It's more interesting from the perspective of someone with a general interest in systems programming or PostgreSQL's design philosophy than from that of a casual user, though. If you're a casual user, the short version is that simple queries that perform in-memory sorting of integers and floats will be about 23% faster.
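The story is the subject of the post, but the principle behind this kind of micro-optimisation is easy to demonstrate: in an in-memory sort, per-comparison overhead dominates, so shaving even a small constant cost off each comparison pays. The sketch below - generic C, not the actual 9.2 patch - contrasts qsort(), which pays for an indirect call through a function pointer on every comparison, with a specialised sort whose integer comparison is inlined straight into the loop.

#include <stddef.h>
#include <stdlib.h>

/* Generic path: every comparison is an indirect call through a pointer. */
static int
cmp_int(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);
}

static void
sort_generic(int *v, size_t n)
{
    qsort(v, n, sizeof(int), cmp_int);
}

/*
 * Specialised path: the element type is known, so the comparison is an
 * inlined integer compare inside the loop, with no per-comparison call
 * overhead.  (An insertion sort keeps the sketch short; a real
 * implementation would specialise a quicksort in the same way.)
 */
static void
sort_int_specialized(int *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        int         key = v[i];
        size_t      j = i;

        while (j > 0 && v[j - 1] > key)
        {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}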
Monday, June 4, 2012
Towards 14,000 write transactions per second on my laptop
Postgres 9.2 will have many improvements to both read and write scalability. Simon Riggs and I collaborated on a performance feature that greatly increased the throughput of small write transactions. Essentially, it accomplishes this by reducing contention on an internal lock called WALWriteLock. When an individual backend/connection holds this lock, it is empowered to write WAL from wal_buffers, an area of shared memory that temporarily holds WAL until it is written, and ultimately flushed, to persistent storage.
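The following is a toy sketch of the group-flush idea only - not PostgreSQL code; the names, the locking scheme, and the I/O stub are all invented for illustration. Whoever holds the lock flushes all of the WAL that has accumulated so far, not just its own commit record, so backends queued behind it usually find their record already durable by the time they acquire the lock, and return without doing any I/O of their own.

#include <pthread.h>
#include <stdint.h>

typedef struct WalState
{
    pthread_mutex_t wal_write_lock; /* stand-in for WALWriteLock */
    uint64_t    insert_lsn;         /* WAL copied into the buffers so far */
    uint64_t    flush_lsn;          /* WAL already durable on disk */
} WalState;

/* stand-in for the real work: write WAL from the buffers, then fsync */
static void
write_and_fsync_up_to(WalState *w, uint64_t target)
{
    (void) w;
    (void) target;
}

void
ensure_flushed(WalState *w, uint64_t my_commit_lsn)
{
    pthread_mutex_lock(&w->wal_write_lock);
    if (w->flush_lsn < my_commit_lsn)
    {
        /*
         * Flush everything available, on behalf of everyone waiting.  (In
         * real code, insert_lsn is maintained under its own locking;
         * simplified here.)
         */
        uint64_t    target = w->insert_lsn;

        write_and_fsync_up_to(w, target);
        w->flush_lsn = target;
    }
    pthread_mutex_unlock(&w->wal_write_lock);
}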
Thursday, March 29, 2012
Much improved statement statistics coming to Postgres 9.2
There is a tendency for people with an interest in improving database performance to imagine that it mostly boils down to factors outside of their application - the hardware, operating system configuration, and database settings. While these are obviously crucially important, experience suggests that in most cases, by far the largest gains are to be had by optimising the application's interaction with the database. Doing so invariably involves analysing which queries are being executed in production, what they cost, and what their significance is to the application or business process that the database supports.
PostgreSQL has had a module available in contrib since version 8.4 - pg_stat_statements, originally developed by Takahiro Itagaki. The module blames execution costs on queries, so that bottlenecks in production can be isolated to points in the application. It does so by providing a view that is continually updated, giving real-time statistical information. Here is an example from the Postgres 9.2 docs:
Saturday, August 6, 2011
Clang now builds Postgres without additional warnings
I'm happy to report that as of this evening, Clang builds PostgreSQL cleanly, apart from a single remaining warning that also occurs when building with GCC and is actually caused by a bug in GNU Flex that the Flex developers don't seem to want to fix. On GCC 4.6, the warning looks like this:
In file included from gram.y:12962:0:
scan.c: In function ‘yy_try_NUL_trans’:
scan.c:16246:23: warning: unused variable ‘yyg’ [-Wunused-variable]
With Clang, however, it looks like this:
scan.c:16246:23: warning: unused variable 'yyg' [-Wunused-variable]
struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be unused depending upon options. */
^
Note that the "^" is directly underneath the offending variable "yyg" on the terminal emulator that generated this warning.
Thursday, July 28, 2011
Could Clang displace GCC generally? Part II: Performance of PostgreSQL binaries
This is the second in a two-part series on Clang. If you haven't already, you'll want to read my original post on the topic, Could Clang displace GCC among PostgreSQL developers? Part I: Intro and compile times.
So, what about the performance of PostgreSQL binaries themselves when built with each compiler? I had heard contradictory reports about the performance of Clang-built binaries. In Belgium, Chris Lattner said that Clang-built binaries could perform better, but a number of independent benchmarks suggested that Clang was generally behind, with some notable exceptions. I asked my 2ndQuadrant colleague and PostgreSQL performance expert Greg Smith to suggest a useful benchmark to serve as a good starting point for comparing Postgres performance when built with Clang against performance when built with GCC. He suggested that I apply Jeff Janes' recent patch for pgbench, which he'd reviewed. It stresses the executor, and therefore the CPU, quite effectively, rather than table locks or IPC mechanisms. The results of this benchmark were very interesting.
Saturday, July 23, 2011
Could Clang displace GCC among PostgreSQL developers? Part I: Intro and compile times
Back in February, I attended FOSDEM, the free and open source software developers' European meeting. Most of my time was spent manning the PostgreSQL stand and networking with fellow members of the community; I often find the “hall track” of conferences is of most interest.
On this occasion though, there was one talk in particular that I really wanted to see: Chris Lattner's “LLVM and Clang: Advancing Compiler Technology”. I was certainly not alone in having the subject pique my interest, as it filled what was apparently Belgium's largest auditorium to capacity, and I was joined by a number of other PostgreSQL people, including some senior community members. Chris' slides are available from here:
http://www.scribd.com/doc/48921683/LLVM-Clang-Advancing-Compiler-Technology