tag:blogger.com,1999:blog-76096116253111263072024-03-12T23:05:59.648-07:00Peter Geoghegan's blogMusings on PostgreSQL, database technology in general, and software developmentPeter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.comBlogger26125tag:blogger.com,1999:blog-7609611625311126307.post-85225411690424796992019-03-22T17:07:00.001-07:002019-03-23T14:47:52.029-07:00Visualizing Postgres page images within GDBIt's straightforward to set up GDB to quickly invoke pg_hexedit on a page image, without going through the filesystem. The page image can even come from a local temp buffer.<br />
A user-defined GDB command can be created that shows an arbitrary page image in pg_hexedit from an interactive GDB session.<br />
<br />
This is a good way to understand what's really going on when debugging access method code. It also works well with core dumps. I found this valuable during a recent project to improve the Postgres B-Tree code.<br />
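For readers who just want the general shape of it, a user-defined command can look something like the following sketch. The command name, file paths, and the exact pg_hexedit invocation here are illustrative assumptions; the pg_hexedit README has the maintained recipe:

```gdb
# Hypothetical .gdbinit fragment -- names and paths are illustrative only
define pg_hexedit_page
  # $arg0: a Page pointer (e.g. from BufferGetPage(), or a local temp buffer);
  # dump one standard 8192 byte (BLCKSZ) block to a throwaway file
  dump binary memory /tmp/gdb_page.page $arg0 ((char *) $arg0 + 8192)
  # Hand the raw page image to the pg_hexedit frontend from the same session
  shell ~/pg_hexedit/pg_hexedit /tmp/gdb_page.page > /tmp/gdb_page.tags
end
```

The key trick is GDB's built-in dump binary memory command, which writes an arbitrary memory range to a file without involving the Postgres buffer manager or the filesystem copy of the relation at all.<br />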
<br />
An example of how to make this work is available from a newly added section of the pg_hexedit README file:<br />
<br />
<a href="https://github.com/petergeoghegan/pg_hexedit/#using-pg_hexedit-while-debugging-postgres-with-gdb">https://github.com/petergeoghegan/pg_hexedit/#using-pg_hexedit-while-debugging-postgres-with-gdb</a>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-15363979985746506262018-05-18T16:11:00.004-07:002018-05-18T16:17:52.676-07:00Visualizing a column's space overhead using pg_hexedit<a href="https://github.com/petergeoghegan/pg_hexedit" target="_blank">pg_hexedit</a> recently gained the ability to annotate the space taken up by each individual column/attribute within each individual tuple. This works with tables, and with B-Tree indexes.<br />
<br />
I had to come up with a way of passing the pg_hexedit frontend utility the relevant <a href="https://www.postgresql.org/docs/current/static/catalog-pg-attribute.html" target="_blank">pg_attribute</a> metadata to make this work. This metadata describes the "shape" of individual tuples in a relation (backend code uses a closely related structure called a "tuple descriptor"). My approach works seamlessly in simple cases, and the same metadata <a href="https://github.com/petergeoghegan/pg_hexedit#direct-invocation" target="_blank">can still be passed when running the pg_hexedit command line tool manually</a>.<br />
<br />
<a name='more'></a><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEXsrT-v1OqvDp7OLr0CjCE7GIhegE6lAsHZL7hTuqz4TzD9ID4dVOE1ZFmi7PBZRWEc9REro8WrYW6AXito-slXy5nZUuWbKn1m4jcnYVMV0zAj3Zrf-46nI5hWpjL9w3YhvQNMuvY8SP/s1600/pg_attribute_hexedit_blog.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1029" data-original-width="1299" height="253" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEXsrT-v1OqvDp7OLr0CjCE7GIhegE6lAsHZL7hTuqz4TzD9ID4dVOE1ZFmi7PBZRWEc9REro8WrYW6AXito-slXy5nZUuWbKn1m4jcnYVMV0zAj3Zrf-46nI5hWpjL9w3YhvQNMuvY8SP/s320/pg_attribute_hexedit_blog.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">pg_attribute system catalog table with column annotations/tags</td></tr>
</tbody></table>
<br />
This new capability could be applied to optimizing the data layout of a table that is expected to eventually have a massive number of rows. Carefully choosing the order and type of each column can reduce the total on-disk footprint of a table by an appreciable amount, especially when the final table ends up with several 1 byte columns that get packed together.<br />
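To get a feel for why column order matters, here's a simplified Python model of the alignment rules. It's a sketch only: the inputs mimic typlen/typalign from pg_type, and it ignores the tuple header, null bitmap, variable-length headers, and TOAST.

```python
# Simplified model of PostgreSQL attribute alignment (typlen/typalign);
# illustrative only -- ignores tuple headers, NULLs, varlena, and TOAST.

ALIGN = {"c": 1, "s": 2, "i": 4, "d": 8}  # typalign codes -> byte alignment

def data_size(columns):
    """columns: list of (typlen, typalign) pairs, in declared order."""
    offset = 0
    for typlen, typalign in columns:
        pad = -offset % ALIGN[typalign]   # padding needed to reach alignment
        offset += pad + typlen
    return offset

# boolean (1, 'c'), bigint (8, 'd'), integer (4, 'i'), in two orders:
bad  = [(1, "c"), (8, "d"), (1, "c"), (8, "d"), (4, "i")]  # interleaved
good = [(8, "d"), (8, "d"), (4, "i"), (1, "c"), (1, "c")]  # widest first
print(data_size(bad), data_size(good))    # prints: 36 22
```

Sorting columns from widest to narrowest alignment lets the 1-byte columns pack together at the end, which is exactly the effect that the new per-attribute annotations make visible on real pages.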
<br />
I am aware of several <a href="https://youtu.be/BgcJnurVFag?t=14m58s" target="_blank">PostgreSQL users who found it worthwhile to have a highly optimized tuple layout</a>, going so far as to use their own custom datatypes. <a href="https://en.wikipedia.org/wiki/Data_structure_alignment" target="_blank">Alignment</a>-aware micro-optimization of a Postgres client application's schema won't help much in most cases, but it can help noticeably with things like fact tables, or tables that contain machine-generated event data. Developing a sense of proportion around storage overhead should now be easier and more intuitive.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-37826712565586547032018-03-19T21:44:00.000-07:002018-03-20T01:16:32.610-07:00Decoding pg_filenode.map files with pg_filenodemapdataFrom time to time, you may need to figure out which file in a PostgreSQL data directory corresponds to a particular table or index in the database. For example, <a href="https://github.com/petergeoghegan/pg_hexedit/" target="_blank">pg_hexedit</a> users sometimes <a href="https://github.com/petergeoghegan/pg_hexedit#direct-invocation" target="_blank">need this information</a>, since pg_hexedit is a frontend utility that works by reading relation files from the filesystem. In practice, a pg_hexedit <a href="https://github.com/petergeoghegan/pg_hexedit#quickstart-guide---using-the-convenience-scripts" target="_blank">convenience script</a> can usually be used instead. Users need only give the name of the table or index that is to be examined. The convenience scripts call the <a href="https://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADMIN-DBLOCATION" target="_blank">built-in function pg_relation_filepath()</a> via an SQL query.<br />
<br />
This approach won't always work, though. pg_hexedit is a tool for investigating corruption, and sometimes corruption can affect system catalogs in a way that makes it impossible to even establish a connection to the database. You may find that you're greeted with an arcane error any time you attempt to connect to the database. The error may look something like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">ERROR: catalog is missing 3 attribute(s) for relid 827</span><br />
<br />
<a name='more'></a><br />
In this example, the issue that prevents us from connecting must have something to do with the system catalog <a href="https://www.postgresql.org/docs/current/static/catalog-pg-attribute.html" target="_blank">pg_attribute</a>, and/or an index on pg_attribute. The relation with relid/<a href="https://www.postgresql.org/docs/current/static/catalog-pg-class.html" target="_blank">pg_class</a> OID 827 (pg_default_acl_role_nsp_obj_index) appears to lack pg_attribute entries, which makes that built-in catalog index unusable (note that there is no reason to think that the underlying relfile for pg_default_acl_role_nsp_obj_index is itself corrupt). To confirm this theory, we'll need to directly examine the pg_attribute relation for the database. Of course, there is no way to query pg_attribute, because we cannot connect. Moreover, there is no easy way to know where the files associated with pg_attribute are, so that we can at least examine pg_attribute using pg_hexedit. The "relfilenode" number that corresponds to pg_attribute (or any other table) isn't hard-coded or stable. For example, the relfilenode of a table <a href="https://www.postgresql.org/docs/current/static/storage-file-layout.html" target="_blank">will change any time VACUUM FULL</a> is used on the table.<br />
<br />
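For context, pg_filenode.map is a small, fixed-size (512 byte) binary file kept in each database directory. Its layout (struct RelMapFile in src/backend/utils/cache/relmapper.c) is simple: a magic number, a mapping count, up to 62 (pg_class OID, relfilenode) pairs, then a CRC. Here is a minimal decoding sketch, assuming the PostgreSQL 10-era little-endian layout and skipping CRC verification:

```python
import struct

# Illustrative parser for pg_filenode.map (struct RelMapFile in
# src/backend/utils/cache/relmapper.c); assumes little-endian,
# skips CRC-32C verification for brevity.

RELMAPPER_FILEMAGIC = 0x592717   # magic constant from relmapper.c
MAX_MAPPINGS = 62                # fixed-size array of mappings

def parse_filenode_map(data):
    """Return {pg_class OID: relfilenode} from raw pg_filenode.map bytes."""
    magic, num = struct.unpack_from("<ii", data, 0)
    if magic != RELMAPPER_FILEMAGIC or not 0 <= num <= MAX_MAPPINGS:
        raise ValueError("not a valid pg_filenode.map file")
    mappings = {}
    for i in range(num):
        oid, filenode = struct.unpack_from("<II", data, 8 + 8 * i)
        mappings[oid] = filenode
    return mappings
```

Running something along these lines against a database directory's pg_filenode.map would reveal, among other things, the current relfilenode for pg_attribute (pg_class OID 1249).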
I've written a tool called pg_filenodemapdata that can help when something like this happens. This new pg_hexedit-bundled program <a href="https://github.com/petergeoghegan/pg_hexedit#determining-catalog-relation-file-mappings-without-a-database-connection" target="_blank">prints the contents of pg_filenode.map files</a>. It can be used to determine <a href="https://blog.2ndquadrant.com/postgresql-filename-to-table/" target="_blank">the relfilenode numbers that correspond</a> to system catalog entries from a pg_filenode.map file (pg_class OID to relfilenode map file). pg_attribute is an example of a special system catalog that is mapped by per-database pg_filenode.map files. As a rule of thumb, the most severe system catalog corruption is corruption that affects one of the catalogs that Postgres tracks within a pg_filenode.map file. pg_filenodemapdata should help with getting to the bottom of the problem; it's now possible to at least examine the corrupt pg_attribute file.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-14336724499136696592018-01-08T14:06:00.000-08:002018-01-08T14:06:56.273-08:00Exploring SP-GiST and BRIN indexes visually using pg_hexeditSupport for both <a href="https://www.postgresql.org/docs/current/static/brin-intro.html" target="_blank">BRIN</a> and <a href="https://www.postgresql.org/docs/current/static/spgist-intro.html" target="_blank">SP-GiST</a> access methods was recently added to <a href="https://github.com/petergeoghegan/pg_hexedit" target="_blank">pg_hexedit</a>, the <a href="https://pgeoghegan.blogspot.com/2017/11/pghexedit-rich-hex-editor-annotations.html" target="_blank">experimental hex editor framework for PostgreSQL relation files</a>. These were the final access methods among the <a href="https://www.postgresql.org/docs/current/static/indexam.html" target="_blank">standard Postgres index access methods</a> that required support.<br />
<br />
<a name='more'></a><h3>
SP-GiST (Space-Partitioned GiST)</h3>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><div style="text-align: left;">
<br /></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMAC5yguTH_ODqpiTKAUqLSa6Ana9jH-_mJcQx-3-WzSTMxzPfdx-eaxUZ4wpI7oXc7ynmaEC3o7smrTWoDgQhgGY_JwAvFa1p53I04qfCYN1uO3XKrWzski6grotJs5U3G0uYfFAt0Ore/s1600/spgist-hexedit-cropped.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1501" data-original-width="1600" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMAC5yguTH_ODqpiTKAUqLSa6Ana9jH-_mJcQx-3-WzSTMxzPfdx-eaxUZ4wpI7oXc7ynmaEC3o7smrTWoDgQhgGY_JwAvFa1p53I04qfCYN1uO3XKrWzski6grotJs5U3G0uYfFAt0Ore/s320/spgist-hexedit-cropped.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Beginning of an <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/spgist/README#L35" target="_blank">SP-GiST leaf</a> page</td></tr>
</tbody></table>
<br />
SP-GiST is unique among index access methods whose index structure is tree-like, in that it supports tree structures that are <i>unbalanced</i>. SP-GiST operator classes exist that support <a href="https://en.wikipedia.org/wiki/K-d_tree" target="_blank">k-d trees</a>, <a href="https://en.wikipedia.org/wiki/Quadtree" target="_blank">quadtrees</a>, and <a href="https://en.wikipedia.org/wiki/Suffix_tree" target="_blank">suffix trees</a>. These structures are traditionally only suited to a fully in-memory representation, with dynamically-allocated nodes that contain a small number of simple <a href="https://en.wikipedia.org/wiki/Pointer_(computer_programming)" target="_blank">pointers</a> (byte addresses) pointing to other nodes.<br />
<br />
SP-GiST presents a <i>generalized</i> interface through which all of these space-partitioned trees can be constructed for a given datatype, <a href="https://www.pgcon.org/2011/schedule/attachments/197_pgcon-2011.pdf" target="_blank">in a way that minimizes disk seeks</a> (PDF) and works well with block-orientated storage. Essentially, SP-GiST maps tree nodes onto disk blocks in an adaptive fashion, rather than simply having a block directly correspond to a tree node, as happens with other access methods.<br />
<br />
Particularly intricate data structures are needed to support all of this. Space utilization can be an issue with SP-GiST indexes, though that's probably very workload dependent; it's the kind of thing that pg_hexedit can be effective at representing visually.<br />
<br />
SP-GiST is a good example of the PostgreSQL community implementing a concept that comes directly from <a href="https://www.cs.purdue.edu/spgist/papers/W87R36P214137510.pdf" target="_blank">state of the art database research</a> (PDF). I suspect that we have yet to fully realize the benefit of SP-GiST for specific application domains, due to a simple lack of awareness among users and potential users that work in those domains. Perhaps this enhancement can contribute in some small way towards a better understanding of what is possible.<br />
<br />
<h3>
BRIN (Block Range Index)</h3>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_D_zr1DcMfzgn_yZd7JMELxg5EpsAvqZ9qWbKCzT7UIR6UNrIqD2TIrYwyey_tskdp6iu063u3b32rnpvuBiKJjHedaAG5G2tzU4L27HnPaXPy1vix7IEvn7MxnstZJEzG6NCsnJ0top8/s1600/brin-revmap-page-cropped.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1510" data-original-width="1600" height="301" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_D_zr1DcMfzgn_yZd7JMELxg5EpsAvqZ9qWbKCzT7UIR6UNrIqD2TIrYwyey_tskdp6iu063u3b32rnpvuBiKJjHedaAG5G2tzU4L27HnPaXPy1vix7IEvn7MxnstZJEzG6NCsnJ0top8/s320/brin-revmap-page-cropped.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">BRIN <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/brin/README#L76" target="_blank">"revmap"</a> page</td></tr>
</tbody></table>
The structure of BRIN indexes is not at all tree-like. BRIN works by summarizing the locations of ranges of values within the underlying indexed table, and exploiting an underlying natural ordering (e.g., a date column on a large append-only table with historic sales records). BRIN indexes are often very small, even when they index a very large underlying table.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizFeIHTiwAGM7rcmp2s3bH8F_vyc78TNCalwGkEntoaTOSNZuST704vOsK4Eu5CMNE0R-0GsP_8aCoI_TOHB0Gop_5uXxBzhzcPi-2w3XU3oXZVjgzQr90mOJXglkVhT85Cd8Wrigbq7CS/s1600/brin-summary-page-cropped.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1503" data-original-width="1600" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizFeIHTiwAGM7rcmp2s3bH8F_vyc78TNCalwGkEntoaTOSNZuST704vOsK4Eu5CMNE0R-0GsP_8aCoI_TOHB0Gop_5uXxBzhzcPi-2w3XU3oXZVjgzQr90mOJXglkVhT85Cd8Wrigbq7CS/s320/brin-summary-page-cropped.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">BRIN summary page</td></tr>
</tbody></table>
The on-disk representation is fairly simple. One novel aspect of BRIN's on-disk representation is that it's the only index access method that updates index tuples in-place (this is possible in part because it doesn't have one index tuple for every heap tuple/<a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/README.HOT#L45" target="_blank">HOT chain</a> from the indexed table, another unique property). These in-place updates can happen when a range in a summary page changes. Visibility into how often this happens in the real world may prove useful.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-45098760913114301752017-12-15T12:29:00.000-08:002017-12-15T12:29:04.103-08:00pg_hexedit now supports GiST, GIN, and hash indexes I've added several enhancements to <a href="https://github.com/petergeoghegan/pg_hexedit" target="_blank">pg_hexedit</a>, the experimental hex editor toolkit that allows you to open up raw PostgreSQL relation files with useful tags and annotations about the state and purpose of each field. The tool now supports annotations for <a href="https://www.postgresql.org/docs/current/static/gist-intro.html" target="_blank">GiST</a>, <a href="https://www.postgresql.org/docs/current/static/gin-intro.html" target="_blank">GIN</a>, and <a href="https://rhaas.blogspot.com/2017/09/postgresqls-hash-indexes-are-now-cool.html" target="_blank">hash indexes</a>, as well as <a href="https://www.postgresql.org/docs/current/static/functions-sequence.html" target="_blank">sequences</a>.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM8NlGSsrF41aSQX4mrzKMmEmJ2W9FrG26QUYYDuGRokZP1uzX7i4BlONr1S_AnnsScAF-2LNiAxCps-3SH_uEXb8OohLeflaD24Od8kH6YQPYHDQj9Yw3J3IO7omOMMpjfKqMAPMRggTA/s1600/gin-posting-tree-page.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="883" data-original-width="1600" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM8NlGSsrF41aSQX4mrzKMmEmJ2W9FrG26QUYYDuGRokZP1uzX7i4BlONr1S_AnnsScAF-2LNiAxCps-3SH_uEXb8OohLeflaD24Od8kH6YQPYHDQj9Yw3J3IO7omOMMpjfKqMAPMRggTA/s320/gin-posting-tree-page.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">GIN "posting tree" leaf page. <a href="https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html" target="_blank">Compressed</a> TIDs are in orange.</td></tr>
</tbody></table>
<a name='more'></a><br />It wasn't very time consuming to add these enhancements, because most index access methods share the same basic approach to page layout. I plan to add support for the two remaining index access methods (BRIN and SP-GiST) early in the new year. My hope is that this will spur interest in <a href="https://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf" target="_blank">the internals of PostgreSQL index access methods</a> (PDF link), and how they deal with <a href="https://pgconf.ru/media/2016/05/13/tuple-internals.pdf" target="_blank">index tuples and space management</a> (PDF link).<br />
<br />
Hat tip to Pat Shaughnessy, who just today <a href="http://patshaughnessy.net/2017/12/15/looking-inside-postgres-at-a-gist-index" target="_blank">wrote a great blog post</a> on the internals of GiST. The fact that he has done such a thorough job of explaining how GiST works to a wider audience is encouraging.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-70902941893278919082017-11-26T17:48:00.000-08:002017-11-26T17:48:13.105-08:00pg_hexedit: Rich hex editor annotations for Postgres relfilesI've written an experimental tool for presenting PostgreSQL relation files in a hex editor with annotations/tags and tooltips that show the structure of the data and its content, including <a href="https://en.wikipedia.org/wiki/Bit_field" target="_blank">bit field values</a>. This tool is called pg_hexedit, and is available from:<br />
<br />
<a href="https://github.com/petergeoghegan/pg_hexedit">https://github.com/petergeoghegan/pg_hexedit</a><br />
<br />
pg_hexedit is built on top of the open source, cross-platform GUI hex editor <a href="https://github.com/EUA/wxHexEditor" target="_blank">wxHexEditor</a>. Since it's an experimental tool that is primarily made available for educational purposes, you are well advised to not use it on any data directory that isn't <b>entirely disposable</b>. It may cause<b> data corruption</b>. Opening a Postgres relation file in a hex editor while the server is running is a <b>fundamentally unsafe</b> thing to do if you care about your data. Use of the tool should be limited to throwaway installations on users' personal machines.<br />
<a name='more'></a><br />
wxHexEditor and pg_hexedit together show information about each individual field in an interactive, easy-to-use way:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI9UloFY_X8mknJubbqlXcBzrzUJI1DBVVMfr3aKaOBAAudUZ9SRlpVJmuBNu8qgXbaxE0BQrIqfurZBXNAPyvwJNT1mrJNRCwbnBoaZHe-_7Vb99dOAbY_HYNgw37XFs3b3uSNiAUElFF/s1600/pg_hexedit_pg_type.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="881" data-original-width="1600" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI9UloFY_X8mknJubbqlXcBzrzUJI1DBVVMfr3aKaOBAAudUZ9SRlpVJmuBNu8qgXbaxE0BQrIqfurZBXNAPyvwJNT1mrJNRCwbnBoaZHe-_7Vb99dOAbY_HYNgw37XFs3b3uSNiAUElFF/s320/pg_hexedit_pg_type.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://www.postgresql.org/docs/current/static/catalog-pg-type.html" target="_blank">pg_type</a> catalog table opened in wxHexEditor, with annotations</td></tr>
</tbody></table>
<br />I originally wrote the tool in order to meet my own needs in this area. I was <a href="https://github.com/petergeoghegan/amcheck" target="_blank">working on corruption detection</a>, and it became clear that a tool like this would help with corruption simulation/white-box testing, something that I've spent rather a lot of time on. Simulating and testing novel corruption scenarios became significantly easier with pg_hexedit.<br />
<br />
Tools like <a href="https://www.postgresql.org/docs/current/static/pageinspect.html" target="_blank">contrib/pageinspect</a> are great, but they are still somewhat interpretive, which can actually be a hindrance for this kind of work. In short, pageinspect functions show "what tuples are on the page" logically, as well as the physical contents of <i>individual</i> tuples, but the <a href="https://www.postgresql.org/docs/current/static/storage-page-layout.html" target="_blank">exact physical state of the entire page</a> is obscured, in order to support an <a href="http://rachbelaid.com/introduction-to-postgres-physical-storage/" target="_blank">item-pointer</a>-wise SQL interface. The subtle details of how free space is managed <i>within a single page</i> can matter. At least to me.<br />
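As a small illustration of the fixed page layout that pg_hexedit annotates, here is a Python sketch that decodes the 24-byte PageHeaderData and the ItemId (line pointer) array that follows it. The field layout follows src/include/storage/bufpage.h; the endianness and the lp_* bit packing shown assume a typical little-endian build, and the synthetic input in the usage note is made up:

```python
import struct

# Illustrative decoder for the fixed PageHeaderData layout and the line
# pointer (ItemIdData) array that follows it (src/include/storage/bufpage.h).
# Assumes a little-endian build; checksums are not verified.

def parse_page_header(page):
    (xlogid, xrecoff, checksum, flags, lower, upper,
     special, pagesize_version, prune_xid) = struct.unpack_from(
        "<IIHHHHHHI", page, 0)          # 24 bytes in total
    nitems = (lower - 24) // 4          # ItemIds fill header end .. pd_lower
    items = []
    for i in range(nitems):
        (word,) = struct.unpack_from("<I", page, 24 + 4 * i)
        lp_off = word & 0x7FFF          # lp_off: low 15 bits
        lp_flags = (word >> 15) & 0x3   # lp_flags: 2 bits (1 = LP_NORMAL)
        lp_len = (word >> 17) & 0x7FFF  # lp_len: high 15 bits
        items.append((lp_off, lp_flags, lp_len))
    return {"pd_lower": lower, "pd_upper": upper, "pd_special": special,
            "items": items}
```

Free space is exactly the hole between pd_lower (end of the ItemId array) and pd_upper (start of tuple data), which is why those two fields, and the line pointers themselves, are worth seeing in their raw physical form.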
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggCM4mtBcdTx_ZKAB9-jOzK5zPjEsRjKFTTCVjgrbHMLbofo3quDUdj6TlfTeuFXm4Jk4j2vlzJUSbrcPIto6JjWcclKU5CzFK66wBMXYq5pw_B48SyJ_RuSdV7SUVdvpwH-NAydRd6mPV/s1600/pg_hexedit_cities_index.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="881" data-original-width="1600" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggCM4mtBcdTx_ZKAB9-jOzK5zPjEsRjKFTTCVjgrbHMLbofo3quDUdj6TlfTeuFXm4Jk4j2vlzJUSbrcPIto6JjWcclKU5CzFK66wBMXYq5pw_B48SyJ_RuSdV7SUVdvpwH-NAydRd6mPV/s320/pg_hexedit_cities_index.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">"cities" nbtree page, starting with ItemId array (shown as blue tags)</td></tr>
</tbody></table>
I eventually realized that pg_hexedit is also broadly useful as an educational tool, and decided to make it available as an open source project. Anything that helps to demystify the internals of Postgres seems like a good thing to me. I hope that pg_hexedit will be useful to users or <a href="https://wiki.postgresql.org/wiki/Developer_FAQ" target="_blank">aspiring hackers</a> that want to understand how PostgreSQL works from the ground up. I welcome pull requests from users that want to expand pg_hexedit. For example, support for additional index access methods would be nice. Only <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/heapam.c" target="_blank">heapam</a> and the <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/nbtree/README" target="_blank">nbtree index AM</a> are currently supported.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-21949926947210206992017-10-26T08:27:00.000-07:002017-12-15T12:53:07.370-08:00amcheck “table-matches-index” enhancement now available, detects "freeze-the-dead" corruptionI’m pleased to announce that v1.2 of amcheck, a tool for detecting that PostgreSQL relations are logically consistent (that they do not <i>appear</i> to be corrupt) is now generally available. This version adds a big enhancement - the optional ability to check if every tuple that <i>should</i> have an entry in the index <i>does</i> in fact have such an entry. Specifically, we check for a table entry with matching data, as well as a matching <a href="https://www.postgresql.org/docs/current/static/ddl-system-columns.html" target="_blank">heap TID</a>. This happens at the end of the existing tests, as an optional extra step.<br />
<br />
This enhancement is significant because it seems much more likely to catch corruption in the wild. In general, inconsistencies between a table and its indexes are more likely to occur than inconsistencies between blocks within an index for many reasons. There is simply a much larger window for an inconsistency to arise when something is amiss with database storage that breaks <a href="https://www.postgresql.org/docs/current/static/wal-reliability.html" target="_blank">the assumptions</a> PostgreSQL makes during crash recovery, for example.
<br />
<br />
The enhancement is also significant because it played a role in <a href="https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com">identifying</a> a <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a5736bf754c82d8b86674e199e232096c679201d">PostgreSQL data corruption bug</a> that will be fixed in the next point release, <a href="https://www.postgresql.org/developer/roadmap/">scheduled</a> for November 9th, 2017. This bug affects <i>all</i> supported PostgreSQL versions. It was informally dubbed the “freeze-the-dead” bug.<br />
<br />
<b>November 6 2017 update: </b>The fix was <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6764eb3aea63f3f95582bd660785e2b0d4439f9" target="_blank">reverted</a> due to additional concerns that came to light. The community is working on a new, more comprehensive fix for the next point release.<br />
<br />
<b>December 15 2017 update: </b>A <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9c2f0a6c3cc8bb85b78191579760dbe9fb7814ec" target="_blank">new fix has been committed</a>, and will appear in 9.3.21, 9.4.16, 9.5.11, 9.6.7, and 10.2 point releases, <a href="https://www.postgresql.org/developer/roadmap/">scheduled</a> for February 8th, 2018.<br />
<br />
Packages for v1.2 are available from the community Debian/Ubuntu apt repository, as well as packages from the community Redhat/CentOS/SLES yum repository. Full details on installing these packages are available from the README:<br />
<br />
<a href="https://github.com/petergeoghegan/amcheck/">https://github.com/petergeoghegan/amcheck/</a><br />
<br />
<a name='more'></a><h3>
"Freeze-the-dead" corruption detection</h3>
<div>
<br /></div>
I should emphasize that the bug is something that I believe to be <b>very unlikely</b> to hit in the real world, because there is only a very small window. Moreover, it is probably virtually impossible to hit without a manual VACUUM FREEZE. It can only happen with the use of foreign keys (strictly speaking, directly <a href="http://www.databasesoup.com/2015/12/a-christmas-present-postgresql-95-rc1.html">allocating MultiXacts</a> could cause the issue in environments where foreign keys are not used).<br />
<br />
amcheck is effective in detecting corruption caused by the “freeze-the-dead” bug because the corruption results in a logical inconsistency in a heap page. There could be <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=20b655224249e6d2daf7ef0595995228baddb381" target="_blank">multiple versions of the same row visible at once</a> in cases where a prune never took place. In other scenarios, there could be a <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/README.HOT#L34" target="_blank">HOT chain</a> that is <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/README.HOT#L101" target="_blank">pruned prematurely</a>, leading to wrong answers from query plans that use an index scan, while sequential scan plans (plans that should get the same answer without using an index) still give correct answers. <b>Even though amcheck was not written with these specific inconsistencies in mind, it still seems to reliably detect them</b>.<br />
<br />
<h3>
Managing risk</h3>
<div>
<br /></div>
The PostgreSQL development community is well known for putting data integrity first. With that in mind, I think that it’s important to avoid overstating the significance of the fact that amcheck detects corruption caused by bugs in PostgreSQL itself. While it is true that finding data corruption bugs is one goal of amcheck, and while it is also true that it has actually done so <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=008c4135ccf67e74239a17a85f912d1a51b6349e">more</a> than <a href="https://trac.osgeo.org/postgis/ticket/3841">once</a> already, that in itself shouldn’t be seen as a blemish on the project’s reputation for ensuring data integrity. If your take-away about amcheck appearing necessary is along the lines of “that certainly doesn’t inspire confidence”, I would argue that you’re thinking about the issues in the wrong way. Besides, amcheck is hardly the first tool like this to appear - similar tools are available for <i>all</i> other major RDBMSs.<br />
<br />
I think that it will prove useful to have an immediate way of mechanically detecting corruption caused by the “freeze-the-dead” bug. We may actually hear reports of corruption that it has caused in the wild (if you happen to have been affected, <a href="https://wiki.postgresql.org/wiki/Mailing_Lists#Using_the_discussion_lists" target="_blank">please let the pgsql-hackers list</a> know about it). I also believe that the newly enhanced amcheck will detect corruption caused by other historic bugs, including the <a href="https://blog.2ndquadrant.com/index-corruption-in-create-index-concurrently/">CREATE INDEX CONCURRENTLY bug</a> detected in February of 2017, as well as a <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3c84046490bed3c22e0873dc6ba492e02b8b9051" target="_blank">similar CREATE INDEX CONCURRENTLY bug</a> from back in 2012. Estimating the prevalence of corruption of this general nature is a very tricky business, though; it largely comes down to workload, and those are incredibly varied. And, even if an issue is on average very unlikely to strike, that doesn’t help those that <i>are</i> affected despite those odds. In short, there is no such thing as the average PostgreSQL database, and there are <a href="https://en.wikipedia.org/wiki/Ludic_fallacy" target="_blank">many practical problems with applying statistical models to complex domains</a>.<br />
<br />
I would venture to predict that a more detailed picture of how prevalent corruption like this is across all PostgreSQL installations will emerge over time, as amcheck is used more widely, and that that picture will be fairly boring. It still seems important to make every effort here, though. Going forward, we’ll have more to go on than educated guesses. And, those unlucky few that turn out to be affected by bugs that lead to corruption will have a relatively simple, non-disruptive tool to isolate the problem. DBAs can use amcheck to help with managing the risk of data corruption, including but not limited to corruption originating from bugs in PostgreSQL.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-16107001275037940552017-10-15T18:44:00.000-07:002017-10-15T18:47:46.212-07:00amcheck for Postgres 9.4+ now available from PGDG apt and yum repositoriesamcheck, a tool for index corruption detection, now has packages available from the community Debian/Ubuntu apt repository, as well as packages from the community Redhat/CentOS/SLES yum repository.<br />
<br />
This means that installations built on those community resources can easily install amcheck, even on PostgreSQL versions before PostgreSQL 10, the release in which <a href="https://www.postgresql.org/docs/current/static/amcheck.html" target="_blank">contrib/amcheck</a> first appears.<br />
<br />
Full details on installing these packages are available from the README: <a href="https://github.com/petergeoghegan/amcheck/" target="_blank">https://github.com/petergeoghegan/amcheck/</a><br />
<br />
It's also possible to install the packages on PostgreSQL 10, because the extension these packages install is actually named "amcheck_next" (not "amcheck"). Currently, it isn't really useful to install "amcheck_next" on PostgreSQL 10, because its functionality is identical to contrib/amcheck. That's expected to change soon, though. I will add a new enhancement to amcheck_next in the coming weeks, allowing verification functions to perform "heap matches index" verification on top of what is already possible.<br />
<br />
Many thanks to Christoph Berg and Devrim Gündüz for their help with the packaging.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-9820484166169630712017-07-18T22:43:00.000-07:002017-07-19T12:20:35.826-07:00PostgreSQL Index bloat under a microscopeI've posted a snippet query to the PostgreSQL Wiki that "summarizes the keyspace" of a target B-Tree index. This means that it displays which range of indexed values belong on each page, starting from the root. It requires <a href="https://www.postgresql.org/docs/current/static/pageinspect.html" target="_blank">pageinspect</a>. The query recursively performs a breadth-first search. Along the way, it also displays information about the space utilization of each page, and the number of distinct key values that actually exist on the page, allowing you to get a sense of how densely filled each page is relative to what might be expected.<br />
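The overall shape of that breadth-first traversal can be sketched outside the database (a toy Python model with made-up block numbers, not the wiki query itself):

```python
from collections import deque

# Toy model of the query's breadth-first search: visit the root level
# first, then every internal page, then the leaf level (level 0).
# The tree structure and block numbers here are made up for illustration.
tree = {290: [3, 289, 575], 3: [1, 2], 289: [4, 5], 575: [6]}

def summarize_keyspace(root_blkno, root_level):
    visited = []
    queue = deque([(root_level, root_blkno)])
    while queue:
        level, blkno = queue.popleft()
        visited.append((level, blkno))
        for child in tree.get(blkno, []):   # leaf pages have no children
            queue.append((level - 1, child))
    return visited

# All of the root level is listed before level 1, and level 1 before
# level 0, matching the ordering of the query's output:
print(summarize_keyspace(290, 2))
```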
<br />
The query is available from:<br />
<br />
<a href="https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index">https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index</a><br />
<br />
<a name='more'></a><br />
If I use the query against the largest index that results from <a href="https://www.postgresql.org/docs/current/static/pgbench.html" target="_blank">initializing a pgbench database</a> at scale factor 10 (pgbench_accounts_pkey), the query takes about 3 seconds to execute on my laptop, and returns the following:<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"> <code style="color: black; word-wrap: normal;">
level | l_item | blkno | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size | distinct_real_item_keys | highkey | distinct_block_pointers
-------+--------+-------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------+---------+-------------------------
2 | 1 | 290 | 2 | r | 10 | 0 | 15 | 8192 | 7956 | 10 | | 10
1 | 1 | 3 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 103945 | 284
1 | 2 | 289 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 207889 | 284
1 | 3 | 575 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 311833 | 284
1 | 4 | 860 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 415777 | 284
1 | 5 | 1145 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 519721 | 284
1 | 6 | 1430 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 623665 | 284
1 | 7 | 1715 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 727609 | 284
1 | 8 | 2000 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 831553 | 284
1 | 9 | 2285 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 935497 | 284
1 | 10 | 2570 | 0 | i | 177 | 0 | 15 | 8192 | 4616 | 177 | | 177
0 | 1 | 1 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 367 | 6
0 | 2 | 2 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 733 | 6
0 | 3 | 4 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 1099 | 6
...
0 | 2730 | 2741 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 999181 | 6
0 | 2731 | 2742 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 999547 | 6
0 | 2732 | 2743 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 999913 | 6
0 | 2733 | 2744 | 1 | l | 88 | 0 | 16 | 8192 | 6388 | 88 | | 2
(2744 rows)
</code>
</pre>
<br />
Note that I changed the query to use <a href="https://wiki.postgresql.org/wiki/Index_Maintenance#Interpreting_bt_page_items.28.29_.22data.22_field_as_a_little-endian_int4_attribute" target="_blank">int4_from_page_data()</a> here, so that the split points/high key values are displayed as ordinary int4 output. The items are in logical order (the int4 <a href="https://www.postgresql.org/docs/current/static/indexes-ordering.html" target="_blank">sort order the index uses</a>).<br />
<br />
A few interesting things are displayed here, or can be inferred from what is displayed:<br />
<ul>
<li>There are 3 levels - one root level, an additional level of internal pages, and the leaf level, which is always level 0. (Note that most leaf pages are omitted for brevity.)</li>
</ul>
<div>
B-Trees are almost always rather short, and in general tend to look a lot more like a bush than a tree when they are greater than a few pages in size. We see that here.</div>
<ul>
<li>Leaf pages point to 6 distinct table blocks in all cases, with the sole exception of the rightmost page (at block 2744). Since the index is 21MB and the table is 128MB, the ratio of table size to index size is just over 6:1. They match.</li>
</ul>
<div>
The size of the 11 internal index pages is negligible. The width of both (leaf) index tuples and heap (table) tuples happens to be completely uniform with any pgbench table. The ratio of the width of every index tuple to any heap tuple almost exactly matches the ratio of the overall size of the index to the size of the table it indexes, which almost exactly matches the number of heap tuples pointed to from within each leaf page. The <a href="https://www.postgresql.org/docs/current/static/view-pg-stats.html" target="_blank">logical/physical correlation</a> between index and table must be very close to 1.0. This is good for B-Tree index scans that read through many leaf pages.</div>
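The arithmetic behind those matching ratios can be checked directly (a back-of-the-envelope sketch; the 1,000,000 row count follows from pgbench's 100,000 accounts rows per scale-factor unit, and 8KB is the default page size):

```python
# Back-of-the-envelope check of the ratios described above, using the
# approximate sizes from this example: 128MB table, 21MB index.
PAGE_SIZE = 8192                      # default PostgreSQL block size
table_bytes = 128 * 1024 * 1024      # pgbench_accounts at scale factor 10
index_bytes = 21 * 1024 * 1024       # pgbench_accounts_pkey
rows = 10 * 100_000                  # 100,000 accounts rows per scale unit

heap_pages = table_bytes // PAGE_SIZE            # 16384 heap pages
rows_per_heap_page = rows / heap_pages           # ~61 heap tuples per page

# Each leaf page holds 367 items, so its pointers span about 6 heap
# blocks - the same ~6:1 ratio as table size to index size:
print(round(367 / rows_per_heap_page))           # 6
print(round(table_bytes / index_bytes))          # 6
```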
<ul>
<li>There are no duplicates within pages.</li>
</ul>
<div>
The lack of any duplicates is evidenced by the number of live items (not including the high key) exactly matching the number of distinct real items. They're almost always both 366 here (once again, the rightmost page is the only exception). Note that there is usually one more "live item" than real item (item with a real table pointer). The page high key is counted as a live item by pageinspect, but is not counted as a "real item value" by my query, since it's just metadata. This is a little confusing, but does make sense when considered in the broader context of how PostgreSQL B-Trees work.<br />
<br />
Even though this is a unique index, it might still have physical duplicates at some point in the future. Not right now, though.<br />
<br /></div>
<h3>
Production issues</h3>
I'm not aware that anyone has used a query like this to debug tricky production performance problems with index bloat before now. It could certainly help with that. Bloat can sometimes be localized to one part of an index, and feedback from this query could allow someone to tie it back to a problem in application code.<br />
<br />
The query might also help users provide information on production performance issues to a mailing list like pgsql-performance or pgsql-hackers, for the perusal of hackers like myself. It might be possible to use this kind of feedback to improve how VACUUM and related mechanisms handle index bloat. The query could take a while to execute for larger indexes. It probably wouldn't be unreasonable to run in production at an off-peak time with a moderately large index.<br />
<div>
<br /></div>
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-16858990923960830412016-05-10T11:50:00.001-07:002016-05-10T11:50:37.333-07:00amcheck: Verify the logical consistency of PostgreSQL B-Tree indexesI've created a project page on Github for amcheck, a tool for verifying the logical consistency of PostgreSQL B-Tree indexes:<br />
<br />
<a href="https://github.com/petergeoghegan/amcheck">https://github.com/petergeoghegan/amcheck</a><br />
<br />
The tool is primarily useful for detecting index corruption in production database systems. It can do this with low overhead; most verification requires only a non-disruptive lock on the index as it is verified. The <a href="http://www.postgresql.org/docs/current/static/explicit-locking.html#TABLE-LOCK-COMPATIBILITY" target="_blank">strength of the lock taken</a> on an index as it is verified matches that of simple SELECT statements (unless the highest level of verification is requested). The locking involved will generally not block concurrent reads or writes, and will not prevent VACUUM from running concurrently.<br />
<br />
<a name='more'></a><br />
<br />
amcheck is proposed as a contrib extension for PostgreSQL 9.7. This externally maintained version of the extension exists to support earlier versions of PostgreSQL (PostgreSQL 9.4+), and to make the tool available to those that need it sooner. While the level of verification is not totally comprehensive (in particular, there is no verification of indexes against underlying tables), the tool is still likely to detect many <i>subtle</i> problems in practice.<br />
<br />
amcheck verifies that certain <i>invariants</i> that must hold in the structure of B-Tree indexes actually do, in fact, hold. It's fairly exhaustive. One example of a problem that the tool can detect is inconsistency arising from the recent <a href="https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue" target="_blank">PostgreSQL 9.5 abbreviated keys glibc issue</a>, where the new-to-9.5 abbreviated keys performance optimization could lead to structurally inconsistent indexes due to a bug in some glibc versions. This issue created a need to get amcheck into the hands of users sooner rather than later.<br />
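To get a concrete feel for what an invariant check looks like, here is a toy sketch (a drastically simplified page model, not amcheck's actual code): items within each page must be in order, and no item may exceed the page's high key:

```python
# Toy sketch of two of the invariants a tool like amcheck verifies,
# over a simplified leaf page model: a list of keys plus an optional
# high key acting as an upper bound on everything in the page.

def verify_page(items, high_key=None):
    # Invariant: keys on a page appear in non-decreasing order.
    for prev, cur in zip(items, items[1:]):
        if prev > cur:
            raise AssertionError(f"items out of order: {prev} > {cur}")
    # Invariant: no key exceeds the page's high key (the rightmost page
    # on a level has no high key, represented here as None).
    if high_key is not None:
        for key in items:
            if key > high_key:
                raise AssertionError(f"{key} exceeds high key {high_key}")

verify_page([10, 20, 30], high_key=30)  # consistent page - passes
verify_page([40, 50])                   # rightmost page - passes
```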
<br />
It's not ideal that the tool is maintained externally, since there are <a href="https://github.com/petergeoghegan/amcheck/commit/94087531fbeb5565f07567742f87be31c52514de#diff-bb3c83dac7fda65c54cbe6b682c27df4R875" target="_blank">complex locking protocols</a> involved; the implementation must make sure that there cannot be false positives to be of much practical use, and so the tool ought to be considered whenever there is a question about these locking protocols. Unfortunately, we ran out of time to get amcheck into PostgreSQL 9.6. Technically there is no disadvantage to an externally maintained tool, but in my opinion amcheck should really be maintained alongside the B-Tree index code itself.<br />
<br />
<br />Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-14119907129790284592015-11-14T12:25:00.002-08:002015-11-14T13:03:39.570-08:00Suggesting a corrected column name/spelling in the event of a column misspelling<br />
One small PostgreSQL 9.5 feature I worked on is the new <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e529cd4ffa605c6f14f1391af5559b3a44da0336" target="_blank">hinting mechanism</a>, which sometimes hints, based on a score, at what you might have meant to type after misspelling a column name in an SQL query. The score heavily weighs <a href="https://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank">Levenshtein distance</a>. A <a href="http://www.postgresql.org/docs/current/static/error-style-guide.html" target="_blank">HINT message</a> is sent to the client, which psql and other client tools will display by default.<br />
<br />
It's common to not quite recall offhand if a column name is pluralized, or where underscores are used to break up words that make up the name of a column. This feature is targeted at that problem, providing guidance that allows the user to quickly adjust their query without mental context switching. For example:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">postgres=# select * from orders where order_id = 5;</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ERROR: 42703: column "order_id" does not exist</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">LINE 1: select * from orders where order_id = 5;</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ^</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">HINT: Perhaps you meant to reference the column "orders"."orderid".</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"></span><br />
<a name='more'></a><br />
<div>
<br /></div>
<div>
You may also see a hint in the case of two possible matches, provided both matches have the same score, and the score crosses a certain threshold of assumed usefulness:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderid where order_id = 5;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ERROR: 42703: column "order_id" does not exist</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">LINE 1: ...oin orderlines ol on o.orderid = ol.orderid where order_id =...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ^</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">HINT: Perhaps you meant to reference the column "o"."orderid" or the column "ol"."orderid".</span></div>
</div>
<div>
<br /></div>
<div>
If an alias was used here (which this query must have anyway), the hint becomes more specific:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderid where o.order_id = 5;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ERROR: 42703: column o.order_id does not exist</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">LINE 1: ...oin orderlines ol on o.orderid = ol.orderid where o.order_id...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ^</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">HINT: Perhaps you meant to reference the column "o"."orderid".</span></div>
</div>
<div>
<br /></div>
<div>
This feature should make writing queries interactively in psql a bit more pleasant. Mental context switching to figure these incidental details out has a tendency to slow things down.</div>
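The distance measure at the heart of that score is simple to reproduce (a textbook dynamic-programming implementation in Python; the server's actual code is in C, and layers its own weights and thresholds on top):

```python
# Textbook dynamic-programming Levenshtein distance, computed row by row.
# The real hinting mechanism uses core PostgreSQL's C implementation.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# "order_id" is a single deletion away from "orderid", so it scores far
# better than an unrelated column name would:
print(levenshtein("order_id", "orderid"))    # 1
print(levenshtein("order_id", "customerid")) # considerably larger
```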
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com2tag:blogger.com,1999:blog-7609611625311126307.post-60640666411528999522015-10-02T11:36:00.001-07:002015-10-02T13:02:54.653-07:00Avoid naming a constraint directly when using ON CONFLICT DO UPDATEPostgreSQL 9.5 will have support for a feature that is popularly known as "UPSERT" - the ability to either insert or update a row according to whether an existing row with the same key exists. If such a row already exists, the implementation should update it. If not, a new row should be inserted. This is supported by way of a <a href="http://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT" target="_blank">new high level syntax</a> (a clause that extends the INSERT statement) that more or less relieves the application developer from having to give any thought to race conditions. This common operation for client applications is set to become far simpler and far less error-prone than <a href="http://www.postgresql.org/docs/devel/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE" target="_blank">legacy ad-hoc approaches to UPSERT involving subtransactions</a>.<br />
<br />
When <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=168d5805e4c08bed7b95d351bf097cff7c07dd65" target="_blank">we worked on UPSERT</a>, many edge-cases were carefully considered. A technique called "unique index inference" allows DML statement authors to be very explicit about what condition they want to take the alternative (UPDATE or NOTHING) path on. That alternative path can only be taken in the event of a would-be duplicate violation in an "arbiter" unique index (for the DO NOTHING variant, a would-be <a href="http://www.postgresql.org/docs/9.5/static/sql-createtable.html#SQL-CREATETABLE-EXCLUDE" target="_blank">exclusion violation</a> is also a possible reason to take the alternative NOTHING path). The ability to write UPSERT statements explicitly and safely while also having lots of flexibility is an important differentiator for PostgreSQL's UPSERT in my view.<br />
<br />
<a name='more'></a><br />
<br />
As the <a href="http://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT" target="_blank">9.5 INSERT documentation explains</a>, the inference syntax contains one or more index_column_name columns and/or index_expression expressions, and perhaps an optional index_predicate (for <a href="http://www.postgresql.org/docs/9.5/static/indexes-partial.html#INDEXES-PARTIAL-EX1" target="_blank">partial unique indexes</a>, which are technically not constraints at all). This is used internally to figure out which of any available unique indexes ought to be considered as an arbiter of taking the alternative path. If none can be found, the optimizer raises an error.<br />
<br />
The inference syntax is very flexible, and very tolerant of variations in column ordering, whether or not a partial unique index predicate is satisfied, and several other things. It can infer multiple unique indexes at a time, which is usually not necessary, but can be in the event of a migration. <a href="http://www.postgresql.org/docs/9.5/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY" target="_blank">CREATE INDEX CONCURRENTLY</a> supports creating unique indexes, and it's easy to imagine someone reasonably having two logically equivalent unique indexes (or equivalent in all the ways that matter to certain UPSERT statements) for a while. Plus, unique indexes cannot be named directly, since they happen to not be cataloged as constraints. We considered that. <i>Use the inference syntax, and you're unlikely to have any problems like this at all</i>.<br />
<br />
DML statement authors must be explicit when writing a statement using DO UPDATE in any case -- omitting some particular condition to take the UPDATE path on is simply disallowed (DO NOTHING does not have this restriction). We also added an escape hatch to name a constraint directly, ON CONFLICT ON CONSTRAINT <constraint_name>. This could be useful for exclusion constraints, <i>but its use is generally discouraged since it does not handle these edge-cases</i>.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-54108984634840545052015-04-04T09:19:00.002-07:002015-04-08T11:18:39.832-07:00Abbreviated keys for numeric to accelerate numeric sortsAndrew Gierth's numeric <a href="http://pgeoghegan.blogspot.com/2015/01/abbreviated-keys-exploiting-locality-to.html" target="_blank">abbreviated keys</a> patch was <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=abd94bcac4582903765be7be959d1dbc121df0d0" target="_blank">committed recently</a>. This commit added abbreviation/sortsupport for the <a href="http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-NUMERIC-DECIMAL" target="_blank">numeric type</a> (the PostgreSQL type which allows practically arbitrary precision, typically recommended for representing monetary values).<br />
<br />
The encoding scheme that Andrew came up with is rather clever - it has an excellent tendency to concentrate entropy from the original values into the generated abbreviated keys in real world cases. As far as accelerating sorts goes, numeric abbreviation is at least as effective as the original text abbreviation scheme. I easily saw improvements of 6x-7x with representative queries that did not spill to disk (i.e. that used quicksort). In essence, the patch makes sorting numeric values almost as cheap as sorting simple integers, since that is often all that is actually required during sorting proper (the abbreviated keys compare as integers, except that the comparison is inverted to comport with how abbreviation builds abbreviated values from numerics as tuples are copied into local memory ahead of sorting - see the patch for exact details).<br />
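The general shape of the technique can be modelled in a few lines (a hypothetical toy encoding, not Andrew's actual scheme: pack the sign, decimal exponent, and leading digits into one machine integer, and consult the full value only on ties):

```python
from decimal import Decimal

# Toy abbreviated-key scheme for arbitrary-precision decimals: a single
# int that preserves ordering for most comparisons. It is lossy on
# purpose - equal abbreviated keys get resolved with the full value.

def abbreviate(d: Decimal) -> int:
    if d == 0:
        return 0
    sign = -1 if d < 0 else 1
    exp = d.adjusted()                    # exponent of the leading digit
    lead = int(abs(d).scaleb(5 - exp))    # first six significant digits
    return sign * ((exp + 1000) * 10**6 + lead)

values = [Decimal("3.1415926535"), Decimal("-0.001"), Decimal("271828.18"),
          Decimal("3.14159"), Decimal("0")]
# Tuple keys compare the cheap int first; the Decimal itself is only
# consulted when two abbreviated keys collide (as the two pi values do):
ordered = sorted(values, key=lambda d: (abbreviate(d), d))
print(ordered == sorted(values))   # True
```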
<br />
<a name='more'></a><br />
<br />
Separately, over lunch at pgConf.US in New York, <a href="http://www.pgconf.us/2014/event/14/" target="_blank">Corey Huinker</a> complained about a slow, routine data warehousing <span style="font-family: Courier New, Courier, monospace;">CREATE INDEX</span> operation that took far too long. The indexes in question were built on a single text column. I suggested that Corey check out how PostgreSQL 9.5 performs, where this operation is accelerated by text abbreviation, often very effectively.<br />
<br />
Corey chose an organic set of data that could be taken as a reasonable proxy for how PostgreSQL behaves when he performs these routine index builds. In all cases <span style="font-family: Courier New, Courier, monospace;">maintenance_work_mem</span> was set to 64MB, meaning that an external tapesort is always required - those details were consistent. This was a table with 18 million rows. Apparently, on PostgreSQL 9.4, without abbreviation, the <span style="font-family: Courier New, Courier, monospace;">CREATE INDEX</span> took 10 minutes and 19 seconds in total. On PostgreSQL 9.5, with identical settings, it took only 51.3 seconds - a 12x improvement! This was a low cardinality pre-sorted column, but if anything that is a less compelling case for abbreviation - I think that the improvements could sometimes be even greater when using external sorts on big servers with fast CPUs. Further organic benchmarks of abbreviated key sorts are very welcome. Of course, there is every reason to imagine that abbreviation would now improve things just as much if not more with large <i>numeric</i> sorts that spill to disk.<br />
<br />
<h3>
Future work</h3>
With numeric abbreviation committed, and support for the "datum" case likely to be committed soon, you might assume that abbreviation as a topic on the pgsql-hackers development mailing list had more or less played out (the "datum" sort case is used by things like "<span style="font-family: Courier New, Courier, monospace;">SELECT COUNT(DISTINCT FOO) ...</span>" - this is Andrew Gierth's work again). You might now reasonably surmise that it would be nice to have support for the <a href="http://www.postgresql.org/docs/current/static/xindex.html#XINDEX-OPCLASS-DEPENDENCIES" target="_blank">default B-Tree opclasses</a> of one or two other types, like <span style="font-family: Courier New, Courier, monospace;">character(n)</span>, but that's about it, since clearly abbreviation isn't much use for complex/composite types - we're almost out of interesting types to abbreviate. However, I think that work on abbreviated keys is far from over. Abbreviation as a project is only more or less complete as a technique to <i>accelerate sorting</i>, but that's likely to only be half the story (Sorry Robert!).<br />
<br />
I intend to undertake research on using abbreviated keys within internal B-Tree pages in the next release cycle. Apart from amortizing the cost of comparisons that are required to service index scans, I suspect that they can greatly reduce the number of cache misses by storing abbreviated keys <a href="http://www.postgresql.org/docs/current/static/storage-page-layout.html#PAGE-TABLE" target="_blank">inline in the ItemId array</a> of internal B-Tree pages. Watch this space!<br />
<br />
<br />Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com3tag:blogger.com,1999:blog-7609611625311126307.post-22000524451030658862015-01-23T16:43:00.000-08:002018-08-31T15:20:45.680-07:00Abbreviated keys: exploiting locality to improve PostgreSQL's text sort performance<br />
On Monday, Robert Haas committed a patch of mine that considerably speeds up the sorting of <span style="font-family: "courier new" , "courier" , monospace;">text</span> in PostgreSQL. This was the last and the largest in a series of such patches, the patch that <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4ea51cdfe85ceef8afabceb03c446574daa0ac23" target="_blank">adds "abbreviated keys"</a>. PostgreSQL 9.5 will have big improvements in sort performance.<br />
<br />
In realistic cases, <span style="font-family: "courier new" , "courier" , monospace;">CREATE INDEX</span> operations on <span style="font-family: "courier new" , "courier" , monospace;">text</span> <a href="http://www.postgresql.org/message-id/CA+TgmoaTiYy9aaMRe7m71Z=mrNZ_aPqepspQHtSNHq8Wiafjow@mail.gmail.com" target="_blank">are <i>over 3 times faster</i></a> than in PostgreSQL 9.4. Not every such utility operation, or data warehousing query involving a big sort is sped up by that much, but <i>many</i> will be.<br />
<br />
This was a piece of work that I spent a considerable amount of time on over the past few months. It's easy to justify that effort, though: sorting <span style="font-family: "courier new" , "courier" , monospace;">text</span> is a very fundamental capability of any database system. Sorting is likely the dominant cost when creating B-Tree indexes, performing <span style="font-family: "courier new" , "courier" , monospace;">CLUSTER</span> operations, and, most obviously, for sort nodes that are required by many plans that are executed in the service of queries with <span style="font-family: "courier new" , "courier" , monospace;">ORDER BY</span> or <span style="font-family: "courier new" , "courier" , monospace;">DISTINCT</span> clauses, or aggregates using the <span style="font-family: "courier new" , "courier" , monospace;">GroupAggregate</span> strategy. Most of the utility statements that need to perform sorts must perform them with a very disruptive lock on the target relation (<span style="font-family: "courier new" , "courier" , monospace;">CREATE INDEX CONCURRENTLY</span> is a notable exception), so quite apart from the expense of the sort, the duration of sorts often strongly influences how long a production system is seriously disrupted.<br />
<br />
<a name='more'></a><br />
<a href="http://pgeoghegan.blogspot.com/2012/08/sorting-improvements-in-postgresql-92.html" target="_blank">My interest in sorting is not new</a>: I first worked on it in 2011. Early research on it back then prompted Robert Haas and Tom Lane to write the <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6e3ac11b60ac4a8942ab964252d51c1c0bd8845" target="_blank">SortSupport infrastructure</a>, which I've now extended here. Originally, the SortSupport infrastructure was all about providing alternative versions of comparators for use in sort routines, versions that avoided certain overhead otherwise inherent to calling functions that are generally accessible from SQL. As a highly extensible system, PostgreSQL requires that sort behavior be <a href="http://www.postgresql.org/docs/current/static/xindex.html#XINDEX-OPCLASS-DEPENDENCIES" target="_blank">defined in terms of a default B-Tree operator class</a>, which is itself defined in terms of SQL operators with underlying SQL-callable functions. These functions are written in C for built-in types, but in principle they could be written in a PL, like <a href="http://www.postgresql.org/docs/devel/static/plpython.html" target="_blank">PL/Python</a>, for example. When the underlying comparator can be expected to compile to just a few CPU instructions, "fmgr elision" (avoiding SQL function call overhead) is important - the time spent in the "fmgr" when not eliding it shows up prominently on profiles with certain types (or rather, it did in the past).<br />
<br />
Note that I <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5ea86e6e65dd2da3e9a3464484985d48328e7fe3" target="_blank">generalized SortSupport to work for more cases</a>, so B-Tree index builds will get a nice little boost in PostgreSQL 9.5, even for types like <span style="font-family: "courier new" , "courier" , monospace;">integer</span> and <span style="font-family: "courier new" , "courier" , monospace;">float8</span>. That's not what this blog post is really about, though. This blog post is about the interesting new direction that the SortSupport infrastructure has been taken in, beyond mere "fmgr elision" - abbreviation.<br />
<br />
A well known problem with the <span style="font-family: "courier new" , "courier" , monospace;">text</span> datatype in PostgreSQL is that it uses the operating system/C standard library <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strcoll.html" target="_blank">strcoll()</a> function to resolve comparisons as tuples are sorted, which is very expensive. It's at least a thousand times more expensive than comparing integers, for example. This general problem is something that <a href="http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html" target="_blank">Robert Haas has expressed concern about</a> in the past.<br />
<br />
The expense relates to a normalization process whereby string comparisons use complex tables to make sure that strings are compared according to the rules of some particular culture or nation (that is, some particular collation associated with a locale). Even in English speaking countries, this is important; for example, the <span style="font-family: "courier new" , "courier" , monospace;">en_US</span> collation considers difference in case (upper case versus lower case) after alphabetical ordering and diacritical differences, so case is considered last of all (the C locale, on the other hand, will sort upper case and lower case strings into two distinct batches, which is typically not desirable). In addition, while English usually doesn't have diacritics, <i>sometimes</i> it does. At work, I'm still <a href="http://help.hipchat.com/forums/138883-suggestions-ideas/suggestions/5163526-wrong-sort-order-c-locale-used-in-people-list" target="_blank">sometimes annoyed by the sort order of the Linux Hipchat client's user list</a>, which uses the C locale. Hi Ómar!<br />
<br />
It was always suspected that we could more effectively amortize the cost of these locale-aware comparisons, by performing a transformation of strings into binary keys using <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strxfrm.html" target="_blank">strxfrm()</a>, and sorting the keys instead (using a <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strcmp.html" target="_blank">strcmp()</a>-based comparator with the keys, which only considers raw byte ordering). This comparison will produce equivalent results to just using <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strcoll.html" target="_blank">strcoll()</a> directly. But the binary keys are much larger than the original strings - typically almost 4x larger. Moreover, we'd still need to do a tie-breaker strcmp() comparison (to check for strict binary equality) using the original string, when strcoll() reports equality, because the historic idea of equality that the <span style="font-family: "courier new" , "courier" , monospace;">text</span> type offers is strict binary equality. There were <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad" target="_blank">some historic edge cases where a tie-breaker strcmp() was not performed following strcoll() returning '0'</a>, resulting in corrupt B-Tree indexes on a <a href="http://www.siao2.com/2005/11/13/491646.aspx" target="_blank">Hungarian</a> database. strcoll() could return 0 <i>despite not being passed a pair of bitwise-identical strings</i>.<br />
<br />
Having to keep around the original <span style="font-family: "courier new" , "courier" , monospace;">text</span> datum seemed like an additional burden on the whole idea of using strxfrm() blobs as sort keys. It seemed like using binary keys to sort had a lot of promise, but we couldn't quite work out how to exploit that idea - until recently.<br />
<br />
Abbreviated keys were committed:<br />
<br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">Use abbreviated keys for faster sorting of text datums.</span><br />
<br style="background-color: white; font-family: monospace; font-size: small;" />
<span style="background-color: white; font-family: monospace; font-size: x-small;">This commit extends the SortSupport infrastructure to allow operator</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">classes the option to provide abbreviated representations of Datums;</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">in the case of text, we abbreviate by taking the first few characters</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">of the strxfrm() blob. If the abbreviated comparison is insufficent</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">to resolve the comparison, we fall back on the normal comparator.</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">This can be much faster than the old way of doing sorting if the</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">first few bytes of the string are usually sufficient to resolve the</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">comparison.</span><br />
<br style="background-color: white; font-family: monospace; font-size: small;" />
<span style="background-color: white; font-family: monospace; font-size: x-small;">There is the potential for a performance regression if all of the</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">strings to be sorted are identical for the first 8+ characters and</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">differ only in later positions; therefore, the SortSupport machinery</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">now provides an infrastructure to abort the use of abbreviation if</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">it appears that abbreviation is producing comparatively few distinct</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">keys. HyperLogLog, a streaming cardinality estimator, is included in</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">this commit and used to make that determination for text.</span><br />
<br style="background-color: white; font-family: monospace; font-size: small;" />
<span style="background-color: white; font-family: monospace; font-size: x-small;">Peter Geoghegan, reviewed by me.</span><br />
<div>
<br /></div>
It's surprisingly effective to just store the first 8 bytes of a strxfrm() blob, and tie-break relatively infrequently by using a full old-style comparison, rather than the more obvious approach of sorting with pointers to strxfrm()-generated blobs (the approach that the C standard recommends for general purpose text sorting).<br />
<br />
<h3>
Entropy</h3>
<br />
<span style="font-size: small; font-weight: normal;">By storing just the first 8 bytes (on 64-bit platforms; 4 bytes on 32-bit platforms) of the strxfrm() blob in a field that would otherwise contain a pointer-to-text (since <span style="font-family: "courier new" , "courier" , monospace;">text</span> is a pass-by-reference type) - the same <a href="https://en.wikipedia.org/wiki/Type_punning" target="_blank">type-punned</a> field that <i>directly</i> stores the representation of pass-by-value types like <span style="font-family: "courier new" , "courier" , monospace;">integer</span> - comparisons can often be resolved using just those 8 bytes <i>directly, and without</i> pointer-chasing. At the same time, the cost of locale transformations is still quite effectively amortized: as always when using strxfrm(), sorting with binary keys performs the transformation O(n) times, rather than an average of O(n log n) times (the work performed by strxfrm() and strcoll() may not be exactly comparable, but it's close enough).</span><br />
<span style="font-size: small; font-weight: normal;"><br /></span>
<br />
<div>
It turns out that the large binary key blobs produced by strxfrm(), while much larger than the original strings, have a significant concentration of entropy towards the start of the blob (assuming the <a href="http://unicode.org/reports/tr10/#Step_3" target="_blank">use of the Unicode collation algorithm</a>, or an algorithm with similar properties). This is because the representation consists of a series of "levels". The "primary weights", which appear first, represent primary alphabetical ordering when using Latin scripts. So whitespace differences and punctuation differences are not represented at that level (nor are differences in case). For accented Latin characters, for example, diacritics are represented at a subsequent level, and so the abbreviated key representation typically won't vary if accents are added to or removed from a <span style="font-family: "courier new" , "courier" , monospace;">text</span> datum. This is important because languages that use accents extensively, like French or Spanish, will get a concentration of entropy in their 8 byte abbreviated keys that's about the same as if no accents were used, even though accented code points usually take 2 bytes of storage in UTF-8, rather than the 1 byte taken by unaccented Latin alphabet code points.</div>
<div>
<br /></div>
<h3>
Locality</h3>
<div>
<br /></div>
A more general problem with sort performance is the problem of cache misses. My earlier work on sorting targeted pass-by-value PostgreSQL types like <span style="font-family: "courier new" , "courier" , monospace;">integer</span> and <span style="font-family: "courier new" , "courier" , monospace;">float8</span>. These pass-by-value types naturally have great locality of reference. Their comparisons are integral operations, which are fast, but operating on a tightly packed representation is what makes sorting integers with Quicksort perhaps as fast as is practically possible for a comparison-based sort. Cache miss penalties are likely to be the dominant cost on modern CPUs, which are <a href="http://users.ece.cmu.edu/~omutlu/pub/mutlu_memory-scaling_imw13_invited-talk.pdf" target="_blank">more bottlenecked on memory bandwidth and latency in every successive generation</a> (PDF); sorting infrastructure must weigh this heavily.<br />
<br />
When I initially discussed the idea of abbreviated keys, there was a certain degree of skepticism from other Postgres hackers. What if most comparisons are not resolved by abbreviated comparisons, due to <span style="font-family: "courier new" , "courier" , monospace;">text</span> datums with a lot of redundant or repeated information at the beginning? Could all the strxfrm() work go to waste when that happens? Well, for one thing, low cardinality sets (tuples with <span style="font-family: "courier new" , "courier" , monospace;">text</span> columns that have a relatively low number of distinct values) are not a problem. That's because strcoll() is still a huge cost, and if we can have our authoritative tie-breaker comparator observe that the strings are identical, then no strcoll() comparison is ever needed - we can just exit early with a simple, cheap opportunistic binary comparison (memcmp()), which is almost as good. But what about when there are many different strings with differences towards the end of the string, past the 8th or so byte?<br />
<br />
CPU cache characteristics have presented complicated engineering trade-offs for sorting infrastructure for a long time. Database luminary Jim Gray proposed an abbreviation-like technique as early as 1994, in his <a href="http://www.vldb.org/journal/VLDBJ4/P603.pdf" target="_blank">AlphaSort paper</a> (PDF). He describes a "key-prefix sort" in the paper. Even back in 1994, Gray observed that memory latency was the dominant cost by a wide margin. The underlying trends in CPU performance characteristics have continued apace since then. Before his death in 2007, Gray officiated the <a href="http://sortbenchmark.org/" target="_blank">Sort Benchmark</a>. Among the rules for the "Daytona sort" category, which concerns the sort performance of general-purpose algorithms (which is what I'm interested in), it states that <a href="http://sortbenchmark.org/FAQ-2014.html#indy" target="_blank">Daytona sort entrants "must not be overly dependent on the uniform and random distribution of key values in the sort input"</a>. It's almost as if Gray was saying: "of course I expect you to use abbreviated keys, but don't push your luck!". And so it is for PostgreSQL. Some cases benefit much more than others, and some cases might even be slightly regressed.<br />
<h3>
<br class="Apple-interchange-newline" />Merge Joins</h3>
<div>
<br /></div>
An earlier piece of work for 9.5 <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e246b3d6eac09d0770e6f68e69f2368d02db88af" target="_blank">had conventional comparisons (not involving abbreviated keys) always try an opportunistic memcmp() tie-breaker</a>. This is likely to be particularly beneficial for merge joins (quite apart from any sort node that may feed the merge join), since they must "synchronize" relations using comparisons that can often be expected to indicate equality. Multi-column sorts on text are also considerably accelerated, where many leading column comparisons can be expected to indicate equality. It's also important for abbreviated keys, because as already mentioned we can still win big with low cardinality sets provided the full tie-breaker comparisons are resolved with a cheap memcmp().<br />
<br />
<h3>
Insurance</h3>
<br />
Fundamentally, when you do a cost/benefit analysis, abbreviated keys are very compelling. The upsides are clearly very large, and the break-even point for switching to using abbreviation is surprisingly far out. We <i>cannot</i> ignore the performance benefits of these techniques because some much rarer cases will be slightly regressed. But, as it happens, we have cheap worst case insurance: <a href="https://en.wikipedia.org/wiki/HyperLogLog" target="_blank">HyperLogLog</a> is used to cheaply and fairly accurately check the cardinality of both abbreviated keys and the original text values. If abbreviated cardinality is an effective proxy for full cardinality, then most comparisons will either use abbreviated comparisons, or use a cheap memcmp() tie-breaker, which is almost as good. Otherwise, we abort abbreviation before the sort proper is underway.<br />
<br />
<h3>
Future</h3>
<div>
<br /></div>
<div>
Abbreviated keys are just infrastructure. While text is the most compelling case, there are at least a few other datatypes that would greatly benefit from support for abbreviation. These include:</div>
<div>
<br /></div>
<div>
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>numeric</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>character(n)</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>uuid</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>bytea</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">citext</span> (case insensitive text, from <span style="font-family: "courier new" , "courier" , monospace;">contrib/citext</span>)</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">inet</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">cidr</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>macaddr</strike></span></li>
</ul>
</div>
<b>Update:</b> A patch for numeric sortsupport with abbreviation <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=abd94bcac4582903765be7be959d1dbc121df0d0" target="_blank">was committed</a>!<br />
<br />
I welcome others with an interest in making sorting faster to work on the relevant opclass support for each of these types, and possibly others. Other people may be able to come up with novel encoding schemes for these types, that maximize the entropy within the finished abbreviated keys. Order-preserving compression is likely to be an area where text's support could be improved, by making comparisons resolved at the abbreviated key level more frequent. Hopefully the benefits of the abbreviated key infrastructure will not be limited to accelerating sorts on <span style="font-family: "courier new" , "courier" , monospace;">text</span>.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com10tag:blogger.com,1999:blog-7609611625311126307.post-32222464295633515212014-03-23T18:39:00.000-07:002014-06-20T11:46:46.268-07:00What I think of jsonbUnsurprisingly, there has been a lot of <a href="http://obartunov.livejournal.com/177247.html" target="_blank">interest in the jsonb type</a>, which <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d9134d0a355cfa447adc80db4505d5931084278a" target="_blank">made it into the upcoming 9.4 release of Postgres</a>. I was initially a reviewer of jsonb, although since I spent weeks polishing the code, I was ultimately credited as a co-author.<br />
<br />
Jsonb is a new datatype for Postgres. It is distinct from the <a href="http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json/" target="_blank">older json datatype</a> in that its internal representation is <i>binary</i>, and in that it is internally typed. It also makes sophisticated nested predicates within queries on jsonb <i>indexable</i>. I've occasionally described the internally-typed scalar values as having “shadow types” unknown to the core SQL parser. This has several implications. For example, if you sort two Jsonb values containing only scalar numbers, the implementation invokes the <a href="http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-NUMERIC-TABLE" target="_blank">numeric</a> comparator (which the jsonb default B-Tree opclass comparator is defined in terms of). The on-disk representation of jsonb includes the same representation as is used for, say, numerics (as the internal binary representation of JSON primitive numbers, for example). Plus, JSON objects are de-duplicated by key on input, and optimized for cheap binary searches within a single jsonb. Still, like the earlier json type, jsonb in every sense “speaks JSON”. There are some limitations on what can be represented as a jsonb number, but those are exactly the same limitations that apply to the core numeric type (plus some limitations imposed by the JSON RFC, such as not accepting NaN values). I hope it suffices to say that these limitations are virtually irrelevant, and that many implementations have similar or worse limitations. All of these minor implementation-defined restrictions are explicitly anticipated and allowed for by the recent <a href="http://rfc7159.net/rfc7159" target="_blank">JSON RFC-7159</a>.<br />
<br />
<a name='more'></a><br />
<br />
Jsonb is emphatically <i>not</i> like the BSON format used by MongoDB. That format accepts input in such a way as to be backwards compatible with JSON, but I believe that BSON isn't really a practical interchange format, because the software development community at large is presumably disinclined to buy into an interchange format that as yet is not described by any RFC, or any communiqué of a recognized standards body. In contrast, jsonb is a datatype that will only ever output valid textual JSON, and will only ever accept valid textual JSON (subject to the aforementioned obscure and practically irrelevant restrictions, and the caveat on automatically normalizing duplicate-keyed pairs within the same object). Jsonb also imposes an internal ordering on object pairs. Again, this is all anticipated and allowed for by the JSON RFC.<br />
<br />
It's possible that I'm mistaken, and that BSON or something else will emerge as an actual standard (either de facto or de jure), since I've heard that there is support in the works for database systems other than MongoDB. It's not impossible that pursuing something like BSON might be an interesting future direction for Postgres, since for one thing BSON supports more than the 4 standard JSON primitive types. In any case it's important to note that the protocol or on-disk binary representation of jsonb is an implementation detail; we're not in competition with BSON, and this isn't a new standard. It's just a new <i>Postgres</i> datatype, with new indexing capabilities. I think it's notable that BSON <i>doesn't</i> have a JSON-style universal number type. It has <a href="https://en.wikipedia.org/wiki/BSON" target="_blank">32-bit and 64-bit integer types, and double precision 64-bit IEEE 754 floating point numbers</a>. It strikes me that this omission <a href="http://www.exploringbinary.com/why-0-point-1-does-not-exist-in-floating-point/" target="_blank">tells me all I need to know about binary interchange formats</a>.<br />
<br />
To understand how the jsonb type works in more detail, <a href="http://www.postgresql.org/docs/devel/static/datatype-json.html" target="_blank">I suggest taking a look at the devel documentation</a>. It's worth taking a close look at <a href="http://www.postgresql.org/docs/devel/static/datatype-json.html#JSON-CONTAINMENT" target="_blank">containment semantics</a>, since that's the really compelling way of searching through jsonb documents.<br />
<br />
<h4>
Strategic significance</h4>
<div>
<br /></div>
There has been a little back and forth among senior community members about the significance of jsonb. <a href="http://www.databasesoup.com/2014/02/why-hstore2jsonb-is-most-important.html" target="_blank">Josh Berkus wrote</a> that he thought it was the most important 9.4 feature. <a href="http://rhaas.blogspot.com/2014/03/postgresql-now-has-logical-decoding.html" target="_blank">Robert Haas was skeptical</a>, preferring the logical decoding stuff. I've even seen one or two people in the comments section of various news articles grumble about Postgres jumping on the JSON bandwagon.<br />
<br />
I have to admit that relatively speaking, jsonb is not in and of itself all that technically complex. While it is a great feature, and while I think it puts Postgres in a very competitive position relative to certain other systems, it would be almost trivial to ship a jsonb extension that works with earlier versions of Postgres. However, without taking a position on what the best 9.4 feature is going to be (I like both jsonb and logical decoding, and contributed in various ways to both), I think that it's possible that Josh Berkus and Robert Haas are both more or less right at the same time, and their apparent disagreement reflects only their individual priorities for Postgres.<br />
<br />
It is very much to the credit of the principal jsonb authors, Oleg Bartunov and Teodor Sigaev, that with some help from Andrew Dunstan and myself they managed to define what I think is internally a solid nested, strongly-typed format for us to build on, with a textual output representation that just so happens to be the same one that has emerged as a standard for this kind of thing. But, to me, as a Postgres hacker, their previous work – and the previous work of Alexander Korotkov (who, due to an unfortunate oversight, was not credited in the jsonb commit message) – is the real story here. As the authors of GIN, Oleg and Teodor are perhaps most responsible for the foundation on which jsonb is built, a foundation built over many years. Alexander's excellent recent work on improving the GIN access method (with help in various areas from Heikki Linnakangas), which also made it into 9.4, is probably what will end up making jsonb really shine. This includes <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=36a35c550ac114caa423bcbe339d3515db0cd957" target="_blank">compression of GIN posting lists</a>, <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=dbc649fd773e7e16458bfbec2611bf15f4355bc4" target="_blank">speeding up "rare & frequent" type GIN queries</a>, <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e20c70cb0fa74d5bffa080e21a99b44bf0768667" target="_blank">multi-key GIN search skipping</a>, and <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=626a120656a75bf4fe64b1d0d83c23cb38d3771a" target="_blank">further optimizations to multi-key searching</a>. 
Alexander had earlier <a href="https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf" target="_blank">reported some very impressive improvements in PostgreSQL full-text search performance as a result of all of this</a>, with performance apparently competing with that of external systems like Sphinx and Solr. It is likely that many of the same big performance improvements seen there concomitantly benefit the jsonb GIN opclasses.<br />
<h4>
<br class="Apple-interchange-newline" />jsonb_path_ops</h4>
<div>
<br /></div>
Having said that, Alexander's <a href="http://www.postgresql.org/docs/devel/static/datatype-json.html#JSON-OPCLASS" target="_blank">jsonb_path_ops alternate GIN operator class</a>, which was his contribution to the big jsonb patch, deserves an honorable mention. By combining GIN with hashing of either key/value pairs, or array elements, the resulting indexes can give great performance for sophisticated “containment” type queries against JSON documents. Indexes are a fraction of the size of the data indexed, index scans are incredibly fast, and yet these GIN indexes make very complex nested “containment” queries indexable. The results are so impressive that at last November's pgConf.EU conference, an <a href="http://www.sai.msu.su/~megera/postgres/talks/hstore-dublin-2013.pdf" target="_blank">EXPLAIN ANALYZE comparative example in a presentation</a> given by Oleg and Teodor was <a href="http://momjian.us/main/blogs/pgblog/2013.html#November_1_2013" target="_blank">greeted with sustained applause</a>.<br />
<br />
I'm really pleased that we worked towards making all of this as beneficial as possible to the largest possible number of people, but even as it puts Postgres in a very competitive position with respect to some non-relational systems, jsonb does not really represent any kind of pivot towards Postgres as a document store – <a href="http://www.postgresql.org/about/history/" target="_blank">Postgres has always been an object-relational system</a>. Rather, I think it is one particular outcome of a much bigger process that has been underway for many years.<br />
<br />
I'll watch the <a href="http://www.pgcon.org/2014/schedule/events/696.en.html" target="_blank">future development of the "VODKA" index access method</a> with interest, because at this early stage it is my understanding that it's intended to make searching nested, heterogeneous structures more flexible and better performing still. It seems likely that there will be a number of other applications for that infrastructure too, since like <a href="http://www.postgresql.org/docs/current/static/gist.html" target="_blank">GiST</a>, <a href="http://www.postgresql.org/docs/current/static/gin.html" target="_blank">GIN</a>, and <a href="http://www.postgresql.org/docs/current/static/spgist.html" target="_blank">SP-GiST</a>, it is intended to be an extensible infrastructure that serves many analogous needs in a general way.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com6tag:blogger.com,1999:blog-7609611625311126307.post-75062119210874522372013-01-21T10:29:00.000-08:002013-01-21T10:29:22.749-08:00Moving onToday was my last day at 2ndQuadrant.<br />
<br />
The experience of working with 2ndQuadrant in the last couple of years has been very positive. I just decided it was time for a change. Being able to work on interesting problems during my time at 2ndQuadrant, both as a Postgres developer and as a consultant, has been great fun, and very personally rewarding. I would like to acknowledge the invaluable support of both Simon Riggs and Greg Smith - thank you both. I wish the entire 2ndQuadrant staff all the best.<br />
<br />
I expect to remain active as a Postgres developer, and already have plans to relocate to work for another company that is well known within the community.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-20386472654641784772012-12-04T02:40:00.001-08:002012-12-04T08:57:24.151-08:00Finding plans in pg_stat_plans easily with pg_find_plansAs I<a href="http://pgeoghegan.blogspot.com/2012/10/first-release-of-pgstatplans.html" target="_blank"> recently blogged about</a>, <a href="http://www.2ndquadrant.com/en/pg_stat_plans/" target="_blank">pg_stat_plans</a> is a PostgreSQL satellite project I've been working on that aims to support earlier versions of Postgres that cannot use the new pg_stat_statements, and to track execution costs at the plan rather than the query granularity. It allows the user to easily explain each stored query text to see the plan for the entry, and has features that facilitate monitoring planner regressions.<br />
<br />
Since PostgreSQL 9.0, support for machine-readable EXPLAIN output has existed. I'm not aware that anyone else got around to actually doing something interesting with this capability, though. I knew that in order to get the most benefit from pg_stat_plans, it ought to be possible to leverage this capability to search for plans based on arbitrary criteria, directly from SQL.<br />
<br />
I've written an experimental submodule of pg_stat_plans, called <a href="https://github.com/2ndQuadrant/pg_stat_plans/tree/master/pg_find_plans" target="_blank">pg_find_plans</a>, that is designed to do just that - to quickly find plans and their execution costs, for those plans that, say, <a href="https://github.com/2ndQuadrant/pg_stat_plans/blob/master/pg_find_plans/samples.sql" target="_blank">perform a sequential scan on a known large table</a>.<br />
<br />
<a name='more'></a><br />
Here's the description of pg_find_plans from its documentation:<br />
<br />
<blockquote class="tr_bq">
<div style="background-color: white; border: 0px; color: #333333; font-family: Helvetica, arial, freesans, clean, sans-serif; font-size: 14px; line-height: 22px; margin-bottom: 15px; margin-top: 15px; padding: 0px;">
pg_find_plans is written in PL/Python and PL/PgSQL. It is intended to provide users with a better way to ask questions like "what are the execution costs of all plans tracked since last statistics reset that involve a sequential scan against <tt style="background-color: #f8f8f8; border-bottom-left-radius: 3px; border-bottom-right-radius: 3px; border-top-left-radius: 3px; border-top-right-radius: 3px; border: 1px solid rgb(234, 234, 234); font-family: Consolas, 'Liberation Mono', Courier, monospace; font-size: 12px; margin: 0px 2px; padding: 0px 5px;">mytable</tt>, and have more than 2 joins?". That might be written as:</div>
<pre style="background-color: #f8f8f8; border-bottom-left-radius: 3px; border-bottom-right-radius: 3px; border-top-left-radius: 3px; border-top-right-radius: 3px; border: 1px solid rgb(204, 204, 204); color: #333333; font-family: Consolas, 'Liberation Mono', Courier, monospace; font-size: 13px; line-height: 19px; margin-bottom: 15px; margin-top: 15px; overflow: auto; padding: 6px 10px;">mydb=# select
join_count(json_plan),
p.*
from
pg_stat_plans p
join
stored_plans sp on (p.userid=sp.userid and p.dbid=sp.dbid and p.planid=sp.planid)
where
from_our_database
and
join_count(json_plan) > 2
and
contains_node(json_plan, 'Seq Scan', 'mytable')
order by
1 desc nulls last;
</pre>
<div style="background-color: white; border: 0px; color: #333333; font-family: Helvetica, arial, freesans, clean, sans-serif; font-size: 14px; line-height: 22px; margin-bottom: 15px; margin-top: 15px; padding: 0px;">
Users should have a high degree of confidence that their queries on a plan's structure are free of detectable errors, and pg_find_plans ensures this by carefully sanitising user input. For example, if the node of interest was specified as 'seq scan' above, the query would raise an error - to do any less might result in a false sense of security about the actual costs of plans that sequentially scan the table <tt style="background-color: #f8f8f8; border-bottom-left-radius: 3px; border-bottom-right-radius: 3px; border-top-left-radius: 3px; border-top-right-radius: 3px; border: 1px solid rgb(234, 234, 234); font-family: Consolas, 'Liberation Mono', Courier, monospace; font-size: 12px; margin: 0px 2px; padding: 0px 5px;">mytable</tt>, since the implementation might then naively ignore sequential scan nodes, as a case-sensitive comparison is used internally. In general, making the interface hard to use incorrectly is even more important than making it easy to use correctly.</div>
<div style="background-color: white; border: 0px; color: #333333; font-family: Helvetica, arial, freesans, clean, sans-serif; font-size: 14px; line-height: 22px; margin-bottom: 15px; margin-top: 15px; padding: 0px;">
Strictly speaking, pg_find_plans is nothing more than a simple set of functions for storing JSON explain texts of plans that appear as pg_stat_plans entries into a dedicated table, and subsequently parsing those plans to answer interesting questions using SQL. However, pg_find_plans is a module that is likely to make pg_stat_plans much more useful than it might otherwise be. pg_find_plans is by no means feature complete or especially polished. The author's ambitions for the tool are described under "limitations" below.</div>
</blockquote>
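The actual pg_find_plans implementation isn't shown here, but the kind of recursive traversal of EXPLAIN (FORMAT JSON) output that a helper like contains_node implies can be sketched in Python (the function and field names below follow Postgres's JSON EXPLAIN output; the sketch simply returns False for a wrongly-cased node type, whereas the real function raises an error):

```python
import json

def contains_node(plan_json, node_type, relation=None):
    """Return True if an EXPLAIN (FORMAT JSON) plan tree contains a node
    of the given type, optionally restricted to one scanned relation.

    The comparison is deliberately case-sensitive: 'Seq Scan' matches,
    while 'seq scan' silently matches nothing."""
    def walk(node):
        if node.get("Node Type") == node_type and (
                relation is None or node.get("Relation Name") == relation):
            return True
        return any(walk(child) for child in node.get("Plans", []))

    doc = json.loads(plan_json) if isinstance(plan_json, str) else plan_json
    # EXPLAIN (FORMAT JSON) emits a list of {"Plan": ...} objects
    return any(walk(entry["Plan"]) for entry in doc)

sample = """[{"Plan": {"Node Type": "Sort", "Plans":
    [{"Node Type": "Seq Scan", "Relation Name": "mytable"}]}}]"""
```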
<br />
pg_find_plans is <a href="https://github.com/2ndQuadrant/pg_stat_plans/tags" target="_blank">distributed with pg_stat_plans, version 1.0 beta 3</a>, as a submodule. I'm reasonably confident that there will be a stable release of pg_stat_plans soon.<br />
<br />
While there are some problems with using a query text as a proxy for a plan that was once produced by that query text, these cases are handled reasonably well, though the "limitations" section of the pg_find_plans documentation should be understood by users. Still, pg_find_plans exists mostly to "test the waters" for a better-principled implementation. It remains to be seen just how much demand there is for this kind of functionality.<br />
<br />
I must say that working on this gave me a new-found appreciation for JSON as a data-interchange format - it dawned on me just why some people consider the Postgres 9.2 JSON datatype so compelling a feature. The last time I needed to write some code that used a lowest common denominator interchange format, that format was the ludicrously verbose XML. Having a format that maps almost perfectly onto scripting language data structures cut down on the amount of boilerplate required considerably. JSON interacts well with Python's dynamic, strong typing, because a piece of JSON data almost looks like a declaration of a nested Python data structure, and can be fairly easily made to be manipulated as one too.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-76422055871333693212012-11-16T04:31:00.000-08:002012-11-16T05:44:48.389-08:00Notes on index-only scansOne of the most important performance features in Postgres 9.2 is index-only scans: the ability for certain types of queries to be performed without retrieving data from tables, potentially greatly reducing the amount of I/O needed. I recently completely overhauled the <a href="https://wiki.postgresql.org/wiki/Index-only_scans" target="_blank">Index-only scans PostgreSQL wiki page</a>, so that the page is now targeted at experienced PostgreSQL users that hope to get the most out of the feature.<br />
<div>
<br /></div>
<div>
My apologies to the authors of the feature, Robert Haas, Ibrar Ahmed, Heikki Linnakangas and Tom Lane, if my handling of the topic seems to focus on the negatives. Any reasonable article about <i>any</i> given index-only scan implementation would have to extensively discuss that implementation's limitations. Any discussion of Postgres index-only scans that focussed on the positives would be much shorter, and would essentially just say: "Index-only scans can make some of your queries go much faster!".</div>
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com2tag:blogger.com,1999:blog-7609611625311126307.post-70994287603066132552012-10-20T07:53:00.000-07:002012-10-20T08:05:42.779-07:00First release of pg_stat_plansAnyone who attended my recent talk at <a href="http://postgresopen.org/" target="_blank">Postgres Open</a>, which was co-presented with my 2ndQuadrant colleague Greg Smith, <a href="https://www.postgresql.eu/events/schedule/pgconfeu2012/session/358-beyond-query-logging/" target="_blank">"Beyond Query Logging"</a>, will be aware that <a href="http://www.postgresql.org/docs/current/static/pgstatstatements.html" target="_blank">pg_stat_statements</a>, the standard contrib module that assigns execution costs to queries and makes them available from a view in the database, has been improved considerably in the recent 9.2 Postgres release. It has been improved in a way that we believe will alter the preferred approach to workload analysis on PostgreSQL databases away from log analysis tools, which just don't offer the performance, flexibility or granularity of this new approach.<br />
<br />
We also announced a new open source tool that addresses a related but slightly different problem (the analysis of <i>plan</i> execution costs, and planner regressions), as well as making most of the benefits of pg_stat_statements on 9.2 available to users stuck on earlier versions of Postgres. This new tool is called pg_stat_plans, and is itself based on pg_stat_statements.<br />
<br />
<a name='more'></a><br />
The 9.2 pg_stat_statements feature of particular importance - the ability to "normalise" the non-prepared statements that the large majority of applications use exclusively - is now brought to earlier versions (9.0 and 9.1, though pg_stat_plans works fine on 9.2 too). Since pg_stat_plans fingerprints plans rather than query trees, the way this works is slightly different to pg_stat_statements, and perhaps doesn't quite match people's intuitive expectations about how normalisation ought to behave in some cases. These differences have been extensively documented.<br />
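The net effect of normalisation can be illustrated with a crude textual approximation - the real pg_stat_statements fingerprints the query tree and pg_stat_plans fingerprints the plan, so their notion of "the same query" is far more robust than this regex sketch:

```python
import re

def normalise(query):
    """Crude textual approximation of query normalisation: replace
    string literals (handling doubled '' escapes), then bare numeric
    literals, with the '?' placeholder."""
    query = re.sub(r"'(?:[^']|'')*'", "?", query)
    return re.sub(r"\b\d+(?:\.\d+)?\b", "?", query)

print(normalise("UPDATE pgbench_tellers SET tbalance = tbalance + 42 WHERE tid = 7"))
# -> UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?
```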
<br />
pg_stat_plans also has the ability to <a href="http://www.postgresql.org/docs/current/static/sql-explain.html" target="_blank">EXPLAIN</a> a stored, representative SQL text, in order to facilitate deeper analysis of plan execution costs. Plan total_cost and startup_cost are tracked over time for each plan, for example, so that the "crossover point" at which the planner begins to prefer an alternative plan can sometimes be observed, and the planner's "reasoning" can perhaps be better understood.<br />
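Under the planner's simple linear cost model, that crossover point can even be solved for directly. The numbers below are purely hypothetical, chosen only to illustrate the idea:

```python
def crossover_rows(startup_a, per_row_a, startup_b, per_row_b):
    """Row count beyond which plan B (higher startup cost, cheaper per
    row) overtakes plan A, under a linear cost model:
        cost(n) = startup + per_row * n"""
    if per_row_a <= per_row_b:
        return None  # plan B never catches up
    return (startup_b - startup_a) / (per_row_a - per_row_b)

# hypothetical costs: plan A is free to start but costs 4 units per
# row; plan B pays 100 units up front but only 2 units per row
print(crossover_rows(0, 4, 100, 2))  # -> 50.0
```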
<br />
pg_stat_plans is distributed under the PostgreSQL licence. I'd originally hoped to offer plan fingerprinting within pg_stat_statements itself, and said as much at <a href="https://wiki.postgresql.org/wiki/PgCon_2012_Developer_Meeting">the PostgreSQL developer's meeting in May</a>, but I ultimately felt that due to the demand for this on earlier Postgres versions, the Postgres community would be best served by having the module as a satellite project. I'm naturally willing to accept third-party contributions through GitHub's "pull request" mechanism.<br />
<br />
I am pleased to announce the first release of pg_stat_plans, 1.0 beta 1.<br />
<br />
Those of you who missed the talk in Chicago can catch it at next week's <a href="http://2012.pgconf.eu/" target="_blank">PostgreSQL Conference Europe</a> in Prague. See you there!<br />
<br />
Official pg_stat_plans page: <a href="http://www.2ndquadrant.com/en/pg_stat_plans">http://www.2ndQuadrant.com/en/pg_stat_plans</a>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com5tag:blogger.com,1999:blog-7609611625311126307.post-41121461213217683872012-08-01T19:04:00.002-07:002012-08-02T05:29:43.158-07:00Sorting improvements in PostgreSQL 9.2: the case for micro-optimisationThere has been much discussion of performance improvements in the upcoming 9.2 release of PostgreSQL. Recently, I noticed that Regina Obe and Leo Hsu's new book, <a href="http://shop.oreilly.com/product/0636920025061.do">"PostgreSQL: Up and running"</a> prominently listed "Sorting improvements that improve in-memory sorting operations by as much as 20%" as a performance feature of that release. While they do get things about right there, I'm not sure that this improvement warrants such prominent placement, at least in sheer terms of its likely impact on the performance of production PostgreSQL systems - we packed <i>a lot</i> of great performance improvements into 9.2. The likely reason that it was picked up on in the book, and the real reason for this blogpost, is the story behind the development of the optimisation, which I for one find kind of interesting, and worth sharing. It's more interesting from the perspective of someone with a general interest in systems programming or PostgreSQL's design philosophy than a casual user, though. If you're a casual user, the short version is that simple queries that perform in-memory sorting of integers and floats will be about 23% faster.<br />
<br />
<a name='more'></a><br />
I wrote a rough prototype of the patch that embodied a number of ideas, and proved the viability of the approach. Principal among those ideas was <i>specialisation</i> of the quicksort code: formatting the code such that the compiler had compile-time knowledge of the comparison functions, with inlining used as an enabling optimisation, and a few variations of the sort routine produced. So rather than using complex indirection involving function pointers, a macro infrastructure was used to generate multiple specialisations, allowing the compiler to optimise the code more effectively as a result of being able to integrate everything.<br />
<br />
A secondary problem was that comparators (i.e. the comparison functions that all sorting within Postgres currently needs) were accessed in a roundabout way.<br />
<br />
Roughly speaking, tuplesort (the part of the code that deals with sorting tuples, perhaps as part of a query's execution, or perhaps as the first step in creating a new index) is very generic code. Code very similar to <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/qsort.html">the C standard library's qsort()</a> is used directly in 9.1 - our particular implementation was lifted from <a href="http://www.netbsd.org/">NetBSD</a>, as it happens. This qsort function is passed a function pointer. That function pointer pointed to a "tupleclass encapsulating comparator" (i.e. the comparator differs for "heap" (table) tuples and index tuples). This comparator in turn calls the "comparator proper" for each and every sortkey (i.e. ORDER BY column) to be sorted. The indirection doesn't stop there though.<br />
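The layering described above can be sketched in Python. The names here are illustrative rather than Postgres's own, but the structure is the same: a generic per-tuple comparator loops over the sort keys, calling the datatype-specific "comparator proper" for each one:

```python
from functools import cmp_to_key

# "comparator proper": the datatype-specific comparison
def int_cmp(a, b):
    return (a > b) - (a < b)

def text_cmp(a, b):
    return (a > b) - (a < b)

# "tupleclass encapsulating comparator": loops over the sort keys,
# invoking the comparator proper for each ORDER BY column in turn
def make_tuple_comparator(sortkeys):
    def tuple_cmp(t1, t2):
        for col, cmp, descending in sortkeys:
            r = cmp(t1[col], t2[col])
            if r != 0:
                return -r if descending else r
        return 0
    return tuple_cmp

# ORDER BY col0 ASC, col1 DESC
rows = [(1, "b"), (2, "a"), (1, "a")]
cmp_fn = make_tuple_comparator([(0, int_cmp, False), (1, text_cmp, True)])
rows.sort(key=cmp_to_key(cmp_fn))
print(rows)  # -> [(1, 'b'), (1, 'a'), (2, 'a')]
```

In 9.1, every one of these hops was an indirect call through a function pointer (or worse, as the next paragraph explains), which the compiler could not see through.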
<br />
Postgres comparators ("comparator propers" - the actual, datatype-specific functions that Postgres uses for sorting) can be called from SQL. While they are written in C in the case of all built-in types, in 9.1, we still accessed the comparators through <a href="http://doxygen.postgresql.org/fmgr_8h_source.html">the SQL function call machinery</a>, rather than a direct function call (or function call through a function pointer). This isn't ordinarily a big deal - we do use this in some other performance critical codepaths, such <a href="http://www.postgresql.org/docs/devel/static/indexam.html">as the index access method code</a> - PostgreSQL doesn't know anything about indexes other than what this "abstract interface" tells it, even when interacting with btree indexes. In this way, a module author can define a whole new index type (I don't just mean implementing a new indexing scheme based on GiST/GIN - I <i>mean</i> a <i>whole new index type</i>). Though not a big deal there, in a tight, frequently executed loop that repeatedly calls a comparator consisting of just one or two CPU instructions, the overhead of this fmgr trampoline starts to matter a lot.<br />
<br />
Tom Lane and Robert Haas fixed this problem. My work inspired their development of the <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6e3ac11b60ac4a8942ab964252d51c1c0bd8845">SortSupport infrastructure</a> as a first phase/patch towards realising the full potential of these ideas; this is essentially just a way of letting datatype authors provide alternative versions of comparators that can be accessed more directly, through function pointers. Because Postgres has a long tradition of being highly extensible, it was necessary to generalise this entirely, so that even third-party module authors could use the infrastructure. It was likely to only be immediately useful for a few built-in datatypes like floats and integers, but it's hard to predict how people may choose to use this - there was some talk of novel applications of SortSupport.<br />
<br />
In the latter phase, we worked on producing the actual specialisations of the sort code. For heap tuples, we specialise on single sort keys and multiple sort keys, producing variant quicksort code for each. The "tupleclass encapsulating comparator" (the comparator for heap tuples in this case) is inlined. This is actually the more valuable of the two optimisations, counter-intuitive though that is.<br />
<br />
Why should simple inlining make such a large difference? Inlining certainly isn't just about eliminating the function call overhead (though there is that too); in fact, it's generally much more important as an <i>enabling transformation</i>. Inlining may enable <a href="http://en.wikipedia.org/wiki/Loop-invariant_code_motion">loop-invariant code motion</a>, <a href="http://en.wikipedia.org/wiki/Dead_code_elimination">dead code elimination</a>, or <a href="http://en.wikipedia.org/wiki/Induction_variable">induction variable elimination</a>. As a broad principle, the more the compiler knows, the greater leeway it has to apply its optimisations. These optimisations are not to be sniffed at - for quicksorting, a simple, isolated test case can show sorting 2.5 - 3 times faster when inlining of comparators occurs, rather than accessing comparators through function pointers (this indirection is necessary to support the interface of the C standard library's <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/qsort.html">qsort()</a>).<br />
<br />
Just to see where things stand after all of this, I performed this benchmark, where we measure the transactions per second after 45 seconds of executing the same simple ORDER BY query as many times as possible within a single session. I do this both with and without client overhead to see how important a factor that is (basically, no tuples are actually returned when we don't measure the overhead, and yet the underlying execution costs are the same - we do this by appending "OFFSET 10001" to the query).<br />
<pre class="SCREEN" style="-webkit-box-shadow: rgb(223, 223, 223) 3px 3px 5px; background-color: white; border-bottom-left-radius: 8px; border-bottom-right-radius: 8px; border-top-left-radius: 8px; border-top-right-radius: 8px; border: 1px solid rgb(207, 207, 207); box-shadow: rgb(223, 223, 223) 3px 3px 5px; font-size: 12px; margin-bottom: 2ex; margin-left: 2ex; margin-top: 2ex; padding: 2ex; text-align: left;"><span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># no client overhead, 9.1:</b>
[peter@peterlaptop tests]$ pgbench -f sort_no.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 12159
tps = 270.180571 (including connections establishing)
tps = 270.205386 (excluding connections establishing)
<span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># no client overhead, 9.2:</b>
[peter@peterlaptop tests]$ pgbench -f sort_no.sql -T 10
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 3687
tps = 368.683077 (including connections establishing)
tps = 368.890864 (excluding connections establishing)
<span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># client overhead, 9.1:</b>
[peter@peterlaptop tests]$ pgbench -f sort_o.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 7304
tps = 162.301377 (including connections establishing)
tps = 162.316128 (excluding connections establishing)
<span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># client overhead, 9.2:</b>
[peter@peterlaptop tests]$ pgbench -f sort_o.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 8977
tps = 199.470582 (including connections establishing)
tps = 199.494630 (excluding connections establishing)</pre>
<br />
Boiling that down further:<br />
<br />
<table border="1">
<tbody>
<tr>
<td></td>
<td><b>Postgres 9.1</b></td>
<td><b>Postgres 9.2</b></td>
<td><b>Difference</b></td>
</tr>
<tr>
<td><b>No Client overhead</b></td>
<td>270.205 TPS</td>
<td>368.890 TPS</td>
<td><b>+ 36.5%</b></td>
</tr>
<tr>
<td><b>Client overhead</b></td>
<td>162.316 TPS</td>
<td>199.494 TPS</td>
<td><b>+ 22.9%</b></td>
</tr>
</tbody>
</table>
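As a sanity check, the percentages in the table follow directly from the TPS ratios:

```python
# TPS figures from the table above (excluding connection establishment)
no_client = 368.890 / 270.205 - 1  # no client overhead
client = 199.494 / 162.316 - 1     # with client overhead
print(f"{no_client:.1%}, {client:.1%}")  # -> 36.5%, 22.9%
```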
<br />
<div>
So that's an improvement in throughput of about 23% for this sympathetic, though representative query.<br />
<br />
To quote Linus Torvalds, "To some degree people say that you should not micro-optimise, but if you love micro-optimisation, that's what you should do". I don't know that I love micro-optimisation, but it can certainly be well worthwhile to perform re-jiggering, or specialisation, or lowering of cache miss rates, at least in the case of an important infrastructure project's innermost loops.<br />
<br />
I was happy that I was ultimately able to overcome objections about the possible distributed costs of creating specialisations - this can result in "binary bloat". Each specialisation needs to independently justify the resulting increase in object file size. In general it's very difficult to argue that the trade-off represented by any particular specialisation is generally worth it. There was a protracted discussion of the merits of doing all of this. <i>That</i> was the factor that made it stick out in people's minds, I suspect.</div>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com6tag:blogger.com,1999:blog-7609611625311126307.post-47482713942153487222012-06-04T11:00:00.001-07:002013-05-27T14:37:22.961-07:00Towards 14,000 write transactions per second on my laptop<div>
Postgres 9.2 will have many improvements to both read and write scalability. <a href="http://database-explorer.blogspot.com/">Simon Riggs</a> and I collaborated on a performance feature that greatly increased the throughput of small write transactions. Essentially, it accomplishes this by reducing the lock contention surrounding an internal lock called <a href="http://doxygen.postgresql.org/lwlock_8h_source.html#l00047"><span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span></a>. When an individual backend/connection holds this lock, it is empowered to write WAL from <span style="font-family: 'Courier New', Courier, monospace;"><a href="http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-BUFFERS">wal_buffers</a></span>, an area of shared memory that temporarily holds WAL until it is written, and ultimately flushed to persistent storage.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLprHrrUJif45SZWzOjy963pusQSLKL5ox7cm6Ku-jwMYq40i05XGLsllH5DqVDKN0lzyZ74eT8tZy7R2z3POuDk3Dd6Js5iPNo7niSO9qed8HnWgmg5WInzw9u_Y1J6HstuUQDZqcX2XR/s1600/new_group_commit.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLprHrrUJif45SZWzOjy963pusQSLKL5ox7cm6Ku-jwMYq40i05XGLsllH5DqVDKN0lzyZ74eT8tZy7R2z3POuDk3Dd6Js5iPNo7niSO9qed8HnWgmg5WInzw9u_Y1J6HstuUQDZqcX2XR/s400/new_group_commit.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Original <b>update.sql</b> "new group commit" benchmark, January 2012. This made it into Postgres 9.2. Here, we compare the performance of my original patch (red line) and Postgres master in January (green line). 9.1 performance on this benchmark would probably be very similar to that of the baseline seen here.</td></tr>
</tbody></table>
<br />
<br />
<a name='more'></a></div>
<div>
<br /></div>
<div>
With this patch, we don’t have the backends queue up for the <span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span> to write their WAL as before. Rather, they either immediately obtain the <span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span>, or else queue up for it. However, when the lock becomes available, no waiting backend actually immediately acquires the lock. Rather, each backend once again checks if WAL has been flushed up to the LSN that the transaction being committed needs to be flushed up to. Oftentimes, they will find that this has happened, and will be able to simply fastpath out of <a href="http://doxygen.postgresql.org/xlog_8c.html#ae146649e2c12d20c14a4fa9a4988e90e">the function that ensures that WAL is flushed</a> (a call to that function is required to honour transactional semantics). In fact, it is expected that only a small minority of backends (one at a time, dubbed “the leader”) will actually ever go through with flushing WAL. In this manner, we batch commits, resulting in a really large increase in throughput, as you can tell from the diagram above.</div>
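In very rough outline, the logic can be modelled like this. The sketch below is a deliberately simplified, single-threaded toy (the real code involves WALWriteLock queueing, re-checks after the lock wait, and careful LSN handling), meant only to show why one leader's flush satisfies many commits:

```python
class WALFlusher:
    """Toy model of the 9.2 group-commit logic: a backend whose commit
    LSN is already flushed skips the lock entirely, and the backend
    that does the flush (the "leader") flushes everything pending, so
    followers find their work already done."""

    def __init__(self):
        self.flushed_up_to = 0     # LSN flushed to persistent storage
        self.pending_up_to = 0     # highest LSN sitting in wal_buffers
        self.physical_flushes = 0  # how many real flushes happened

    def note_commit(self, lsn):
        self.pending_up_to = max(self.pending_up_to, lsn)

    def flush_up_to(self, lsn):
        if self.flushed_up_to >= lsn:
            return  # fastpath: someone else already flushed our WAL
        # (in Postgres, the backend would queue for WALWriteLock here,
        # and re-check the flushed LSN once the lock became available)
        self.physical_flushes += 1
        self.flushed_up_to = max(self.flushed_up_to, self.pending_up_to)

w = WALFlusher()
for lsn in range(1, 11):   # ten transactions commit...
    w.note_commit(lsn)
for lsn in range(1, 11):   # ...and each asks for its WAL to be flushed
    w.flush_up_to(lsn)
print(w.physical_flushes)  # -> 1: one leader flushed for all ten
```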
<div>
<br /></div>
<div>
In Postgres 9.2, this improvement automatically becomes available without any further configuration. This was one of the subjects of my recent talk, co-presented with <a href="http://blog.2ndquadrant.com/author/greg-smith/">Greg Smith</a> at PgCon 2012, <a href="http://www.pgcon.org/2012/schedule/events/453.en.html">“A Batch of Commit Batching”</a>. </div>
<div>
<br /></div>
<div>
There is some confusion about the semantics of group commit. Previously, the project sometimes referred to two settings - <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> and <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span> - as offering a group commit implementation. Certainly, the <i>intent</i> of those settings was to allow the DBA to trade off latency for throughput, which characterises the group commit feature of some well known proprietary RDBMSs (though not, I believe, any of the various MySQL flavours). The patch that we worked on for 9.2 did not add arbitrary latency at any point in the hope of increasing throughput, and was technically totally orthogonal to <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> and <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span>.</div>
<div>
<br /></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> was always something that we didn't really have much practical use for. It merely adds a latency of <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> microseconds just before each transaction goes to commit, provided there are at least <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span> concurrent transactions in progress, in the hope that, by the time each backend goes to flush WAL immediately after the delay, some other backend will already have flushed the sequential WAL stream past that backend's "flush up-to" point. When this happens, the backend can fastpath out of the function, often without ever having to acquire <span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span>.</div>
<div>
<br /></div>
<div>
It is intuitively obvious that these are not hugely useful settings, generally just as likely to hurt performance as help it, and they were only really retained because they could improve throughput in certain narrow benchmarks. After all, whatever “certain other” backend was supposed to flush everyone else’s WAL was probably delayed too. With the new group commit implementation, the setting became more marginal than ever, and seemed to have next to no positive effect on commit speed. I argued for its removal in a dedicated thread on pgsql-hackers, and Greg and I were vocal in calling for its deprecation during our talk.</div>
<div>
<br /></div>
<div>
I subsequently realised that if we were to do that, there was still an uncomfortable tension. Even though <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> is almost completely ineffective, its purported purpose - to allow the DBA to trade off transaction latency to increase the throughput of the server as a whole - is still a quite reasonable use-case, and a use-case that we were not serving well, with or without <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span>. To give a good example of when throughput is more important than latency, Google’s F1 specialist relational database, built to replace their legacy AdWords MySQL cluster, very explicitly prioritises throughput over latency. What’s more, as I’ve already mentioned, this trade-off seems to be integral to the group commit implementation of a particularly expensive, proprietary RDBMS. It occurred to me that <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> really doesn’t work with the new group commit. It is an entirely separate piece of code, whose usefulness was never revisited in light of new group commit, that sometimes simply adds a sleep.<br />
<br />
I then had what turned out to be a valuable insight: by delaying only within the leader backend, and not within all connection backends, we can much more effectively stagger transactions so that they coincide and can be flushed together. What previously required serendipitous timing - which, if botched, could have the opposite effect to that desired - could now be mostly ensured. The leader alone would wait the full <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> before proceeding with flushing, but followers would continue to get in the queue behind the leader for that much longer. 
This was found to further dramatically improve transaction throughput, particularly in cases that the original “new group commit” did not do so well on, such as workloads where the server didn’t have so many connections, or transactions were not single-statement write statements. It also made a respectable dent even at the highest client counts:<br />
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzy35rf13gRpe03WeYzPHEkFIB8Sa653wGHJK0dBilMAYcxPPBJVnjGyqu7HKOeO2rGnETwWoTS40RVxqy9rkNJsdlYUy4CDauRzxnDfBhHenFCdpSY8EBIprEYFaQNEmvNHqDJHjzP9ti/s1600/insert_w_master_delay.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzy35rf13gRpe03WeYzPHEkFIB8Sa653wGHJK0dBilMAYcxPPBJVnjGyqu7HKOeO2rGnETwWoTS40RVxqy9rkNJsdlYUy4CDauRzxnDfBhHenFCdpSY8EBIprEYFaQNEmvNHqDJHjzP9ti/s400/insert_w_master_delay.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>insert.sql</b> benchmark. Importantly, the patch to adjust commit_delay (green line) shows by far the largest improvements over master without a commit_delay (red line) at lower client counts that are more representative of the real world. In this example, setting commit_delay to the same value, 4,000, <i>without</i> the still-not-in-postgres adjustment/patch to commit_delay behaviour (blue line) actually hurts performance a bit.</td></tr>
</tbody></table>
</div>
<div>
<div class="separator" style="clear: both; text-align: -webkit-auto;">
<br /></div>
<blockquote class="tr_bq">
<b>Update</b>: The <span style="font-family: Courier New, Courier, monospace;">commit_delay</span> patch has since been committed, and will be available in Postgres 9.3.</blockquote>
<br />
Interestingly, a <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> of around 4,000 (microseconds) seems about optimal for the above benchmark, performed on my laptop; that is roughly half of the raw <span style="font-family: 'Courier New', Courier, monospace;"><a href="http://pubs.opengroup.org/onlinepubs/7908799/xsh/fdatasync.html">fdatasync()</a></span> (my <span style="font-family: 'Courier New', Courier, monospace;"><a href="http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-SYNC-METHOD">wal_sync_method</a></span>, the default for Linux) speed for the <a href="http://archives.postgresql.org/pgsql-hackers/2012-01/txt3Ezlyo8ZTK.txt">7200 RPM disk</a> in my Thinkpad, as reported by <a href="http://www.postgresql.org/docs/current/static/pgtestfsync.html">pg_test_fsync</a>. The below benchmark is for pgbench’s tpc-b benchmark, and compares against a 9.2 HEAD/new group commit baseline, which did not improve nearly so much as a result of our initial work on this:<br />
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN4fGVIWAiEzmoesm4Son0na6fqBntQSIDn8LH91cYxJZKeS8UJ3HDETu3s67Pe7Hgdq4NaCsKuYyfYdEnR7Wou0tGgIWChjJTcxAzaF7_ofixhCrVIIrZO7AGAqMPIvHeMG7E5hTHz2Z9/s1600/tpc_b_delay.png" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><span style="font-size: x-small;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN4fGVIWAiEzmoesm4Son0na6fqBntQSIDn8LH91cYxJZKeS8UJ3HDETu3s67Pe7Hgdq4NaCsKuYyfYdEnR7Wou0tGgIWChjJTcxAzaF7_ofixhCrVIIrZO7AGAqMPIvHeMG7E5hTHz2Z9/s400/tpc_b_delay.png" width="400" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Less dramatic, but still rather compelling improvements with the new commit_delay patch (green line) for a <b>tpc-b.sql</b> workload, as compared to setting commit_delay to 0 (red line), which is representative of master. This workload was previously not helped much by "new group commit".</td></tr>
</tbody></table>
<div style="text-align: center;">
<br /></div>
<br />
Experimentation shows that a <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> of around 4,000 <i>remains</i> roughly optimal here, even though the tpc-b transactions do rather more than the single-insert statement of the insert.sql benchmark’s transactions, indicating that <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> now shows some promise as a value that can be set in a well-principled way, <i>without</i> having to optimise for some synthetic workload to see these kinds of improvements, or having to do something else so baroque as to make the whole feature completely useless.<br />
<br />
Thanks to <a href="http://4caast.morfeo-project.org/">the 4CaaSt project</a> for funding this recent research into <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> and <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span>.</div>
<div>
<br /></div>
<div>
Unfortunately, I missed the 9.2 deadline for submitting this small but critical adjustment to <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span>. For that reason, you’ll have to wait for 9.3 to benefit from it.</div>
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com2tag:blogger.com,1999:blog-7609611625311126307.post-38723856898480694872012-03-29T06:02:00.006-07:002012-08-05T11:11:56.649-07:00Much improved statement statistics coming to Postgres 9.2There is a tendency for people with an interest in improving databases performance to imagine that it mostly boils down to factors outside of their application - the hardware, operating system configuration, and database settings. While these are obviously crucially important, experience suggests that in most cases, by far the largest gains are to be had by optimising the application’s interaction with the database. Doing so invariably involves analysing what queries are being executed in production, their costs, and what the significance of the query is to the application or business process that the database supports.<br />
<br />
PostgreSQL has had a module available in contrib since version 8.4 - <a href="http://www.postgresql.org/docs/current/static/pgstatstatements.html">pg_stat_statements</a>, originally developed by Takahiro Itagaki. The module blames execution costs on queries, so that bottlenecks in production can be isolated to points in the application. It does so by providing a view that is continually updated, giving real-time statistical information. Here is an example from the Postgres 9.2 docs:<br />
<br />
<a name='more'></a><br />
<pre class="SCREEN" style="-webkit-box-shadow: rgb(223, 223, 223) 3px 3px 5px; background-color: white; border-bottom-color: rgb(207, 207, 207); border-bottom-left-radius: 8px; border-bottom-right-radius: 8px; border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(207, 207, 207); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(207, 207, 207); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(207, 207, 207); border-top-left-radius: 8px; border-top-right-radius: 8px; border-top-style: solid; border-top-width: 1px; box-shadow: rgb(223, 223, 223) 3px 3px 5px; font-size: 12px; margin-bottom: 2ex; margin-left: 2ex; margin-top: 2ex; padding-bottom: 2ex; padding-left: 2ex; padding-right: 2ex; padding-top: 2ex; text-align: left;">bench=# SELECT pg_stat_statements_reset();
$ pgbench -i bench
$ pgbench -c10 -t300 bench
bench=# \x
bench=# SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
-[ RECORD 1 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
calls | 3000
total_time | 9.60900100000002
rows | 2836
hit_percent | 99.9778970000200936
-[ RECORD 2 ]---------------------------------------------------------------------
query | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls | 3000
total_time | 8.015156
rows | 2990
hit_percent | 99.9731126579631345
-[ RECORD 3 ]---------------------------------------------------------------------
query | copy pgbench_accounts from stdin
calls | 1
total_time | 0.310624
rows | 100000
hit_percent | 0.30395136778115501520
-[ RECORD 4 ]---------------------------------------------------------------------
query | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
calls | 3000
total_time | 0.271741999999997
rows | 3000
hit_percent | 93.7968855088209426
-[ RECORD 5 ]---------------------------------------------------------------------
query | alter table pgbench_accounts add primary key (aid)
calls | 1
total_time | 0.08142
rows | 0
hit_percent | 34.4947735191637631</pre>
<br />
This is an extremely useful feature. However, its usefulness has historically been highly limited by the fact that it differentiated queries based solely on <i>their query string</i>. This effectively limited the use of the module to code that mostly or entirely used prepared statements. If non-prepared statements were used, what you ended up with was something that looked like this:<br />
<pre class="SCREEN" style="-webkit-box-shadow: rgb(223, 223, 223) 3px 3px 5px; background-color: white; border-bottom-color: rgb(207, 207, 207); border-bottom-left-radius: 8px; border-bottom-right-radius: 8px; border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(207, 207, 207); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(207, 207, 207); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(207, 207, 207); border-top-left-radius: 8px; border-top-right-radius: 8px; border-top-style: solid; border-top-width: 1px; box-shadow: rgb(223, 223, 223) 3px 3px 5px; font-size: 12px; margin-bottom: 2ex; margin-left: 2ex; margin-top: 2ex; overflow-x: auto; overflow-y: auto; padding-bottom: 2ex; padding-left: 2ex; padding-right: 2ex; padding-top: 2ex; text-align: left;">-[ RECORD 1 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>5</b> WHERE bid = <b>3</b>;
calls | 1
total_time | 0.002
rows | 1
hit_percent | 99.9778970000200936
-[ RECORD 2 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>4343</b> WHERE bid = <b>42</b>;
calls | 1
total_time | 0.0044
rows | 1
hit_percent | 99.9778970000200936
-[ RECORD 3 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>-2329</b> WHERE bid = <b>4543</b>;
calls | 1
total_time | 0.003
rows | 1
hit_percent | 99.9778970000200936
-[ RECORD 4 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>9005</b> WHERE bid = <b>7392</b>;
calls | 1
total_time | 0.005
rows | 1
hit_percent | 99.9778970000200936
</pre>
Pretty soon, the shared memory area that stores these statements is filled with a great many entries, each perhaps differing only in the constants used in a single execution of the query, and with no way to aggregate the information to usefully inform optimisation efforts.<br />
<br />
Why not just always use prepared statements? Well, some applications do. Prepared statements have historically been well supported by JDBC, for example, so Java apps tended to be okay. However, there are practical reasons to avoid them. Prepared statements can easily suffer performance regressions compared to equivalent unprepared versions, since the optimizer won't necessarily be able to use as much statistical data (most common values and so on) as in the regular case. There are also limitations in some popular ORMs; Rails' popular ActiveRecord ORM, for example, only recently added prepared statement support. Ad-hoc queries are generally never prepared either. The traditional solution was log-parsing utilities, which analyse log files after the fact and output reports. I have found some of these tools to be rather awkward, with high resource requirements that often necessitate running log analysis on a separate server to process what can be hundreds of megabytes a day of verbose log output. Postgres logs are naturally very verbose when every single query's execution is logged.<br />
<br />
Enter pg_stat_statements normalisation. Last night, Tom Lane <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7313cc016344a5705eb3e6916d8c4ea849c57975">committed</a> the following patch of mine, which was sponsored by <a href="https://postgres.heroku.com/">Heroku</a>:<br />
<br />
<blockquote class="tr_bq">
Improve contrib/pg_stat_statements to lump "similar" queries together.<br />
<br />
pg_stat_statements now hashes selected fields of the analyzed parse tree to assign a "fingerprint" to each query, and groups all queries with the same fingerprint into a single entry in the pg_stat_statements view. In practice it is expected that queries with the same fingerprint will be equivalent except for values of literal constants. To make the display more useful, such constants are replaced by "?" in the displayed query strings.<br />
<br />
This mechanism currently supports only optimizable queries (SELECT, INSERT, UPDATE, DELETE). Utility commands are still matched on the basis of their literal query strings.<br />
<br />
There remain some open questions about how to deal with utility statements that contain optimizable queries (such as EXPLAIN and SELECT INTO) and how to deal with expiring speculative hashtable entries that are made to save the normalized form of a query string. However, fixing these issues should require only localized changes, and since there are other open patches involving contrib/pg_stat_statements, it seems best to go ahead and commit what we've got.<br />
<br />
Peter Geoghegan, reviewed by Daniel Farina</blockquote>
<br />
I would like to acknowledge the invaluable assistance of both Tom and Daniel in bringing this project to maturity.<br />
<br />
I have benchmarked this feature, and found that <a href="http://pgbenchstatstatements.staticloud.com/">it implies an overhead of about 1% - 2.5% on pgbench's default TPC-B style workload</a>. This is rather good, and only marginally worse than the hit taken when using prepared statements with classic pg_stat_statements.<br />
<br />
As if that wasn't good enough, pg_stat_statements has been made even more useful by the efforts of other people for the upcoming 9.2 release. A <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b4f346611431361339253203d486789e4babb02">patch by Ants Aasma</a> was recently committed to expose I/O timings at the query granularity through pg_stat_statements. <a href="http://rhaas.blogspot.com/">Robert Haas</a> wrote <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2254367435fcc4a31cc3b6d8324e33c5c30f265a">a patch</a> to expose blocks dirtied and written by statements through pg_stat_statements too.<br />
<br />
I expect the uptake of pg_stat_statements to really climb after the release of Postgres 9.2.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com7tag:blogger.com,1999:blog-7609611625311126307.post-18541083918990891942012-01-28T16:00:00.000-08:002012-01-30T10:08:29.084-08:00Power consumption in Postgres 9.2<div><b id="internal-source-marker_0.7957186095882207"><span style="font-family: inherit;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">One of the issues of major concern to CPU vendors is optimising the power consumption of their devices. In a world where increasingly, computing resources are purchased in terms of fairly abstract units of work, and where, when selecting the location of a major data-centre, the local price of a kilowatt hour is likely to be weighed just as heavily as the wholesale price of bandwidth, this is quite understandable. </span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Globally, data centres consumed between 1.1 and 1.5 percent of all electricity in 2010 (Source: <a href="http://www.analyticspress.com/datacenters.html">Koomey</a>). The economic and ecological importance of minimizing that number is fairly obvious.</span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></span></b><br />
<a name='more'></a><b id="internal-source-marker_0.7957186095882207"><span style="font-family: inherit;"><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The broad trend towards more and more computing being performed within large data centres, on consolidated infrastructure, and sold as a service rather than a product, is undeniable. Of course, the term “cloud computing” is often applied to this phenomenon. That’s a term that I try to avoid, as it’s fairly ambiguous. </span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><span style="font-family: inherit;"><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">There has been considerable effort to reduce wake-ups when idle in software in general, including everything from web browsers to word processors, which is related to the increasing importance of mobile and embedded platforms. However, this effort is most pronounced among developers of software that is expected to be deployed in virtualised environments on many servers, as wake-ups prevent CPUs from entering various idle states that allow them to save electricity, and when these wake-ups are multiplied by thousands of VM instances, they add up very quickly.</span></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">As part of 4CaaSt, a research project funded by the European Commission's Seventh Framework programme that brings together members of industry and academia with the collective goal of producing an innovative </span><span style="font-style: italic; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">platform-as-a-service</span><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> offering, I spent time reducing the idle wake-ups per second in PostgreSQL. Postgres services firm 2ndQuadrant, where I work as a database architect, has had the development of several PostgreSQL features sponsored by 4CaaSt in furtherance of that goal, of which this is only one.</span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Historically, PostgreSQL has been weak in this particular area. With a standard Postgres server, with no special configuration, I have measured the wake-ups when idle at 11.5 per second, using <a href="http://www.lesswatts.org/projects/powertop/">Intel’s powertop utility</a>, as of the current 9.1 release. This was thought to be unacceptably high for 4CaaSt, for other solutions that leverage virtualisation extensively, and for embedded systems too.</span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">CPUs have a number of methods of reducing power consumption. These are specified by the ACPI standard (which covers discoverability, configuration and power-management), which, in case you hadn't heard, is an open specification that makes minimal assumptions about the architecture or platform in use, and was written to help authors of operating system kernels.</span></span></b><br />
<span style="font-family: inherit; white-space: pre-wrap;"> </span></div><div><span style="font-family: inherit;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Briefly, ACPI describes the following states (I’ve avoided mentioning other states that have more to do with thin</span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">gs like managing laptop hibernation):</span></span><br />
<ul><b><span style="font-family: inherit;"> </span></b></ul><b><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b><br />
<ul><li style="list-style-type: disc; text-decoration: none; vertical-align: baseline;"><b><span style="font-family: inherit; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-weight: normal;">Performance states </span>P0<span style="font-weight: normal;"> through to </span>PN<span style="font-weight: normal;"> (i.e. the exact number of states is implementation-defined). Dynamic CPU frequency scaling states. This might be better known under marketing names for specific implementations, like “Intel SpeedStep technology”. Ever notice how the frequency reported for your CPU under /proc/cpuinfo varies from one moment to the next on Linux? This is why! This state tends to be a bit sticky, in that it might take a few seconds to observe changes in frequency, as it is increased to meet demand. </span></span></b></li>
<b><span style="font-family: inherit;"> </span></b></ul><b><span style="font-family: inherit;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></span></b><br />
<ul><li style="list-style-type: disc; text-decoration: none; vertical-align: baseline;"><span style="font-family: inherit;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Processor states </span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C0</span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> through to </span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C3</span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. Processors will change this state very quickly, and we basically want to keep this as high as possible, as higher values are associated with using less power.</span></span></li>
<b><span style="font-family: inherit;"> <ul><li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C0 is the operating state.</span></li>
<li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C1, or the halt state, is a state where the processor is not executing instructions, but can return to an executing state essentially instantaneously.</span></li>
<li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C2, or the Stop-Clock state, is a state where the processor maintains all software-visible state, but may take longer to wake up. </span></li>
<li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C3, or the sleep state, is a state where the processor does not need to maintain cache coherency, but does maintain some other state. There can even be graduations of how deep a sleep this state represents, depending on the implementation - the Intel Core i5 chip in my laptop has a C4 state, for example.</span></li>
</ul></span></b></ul><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Postgres has a multi-process architecture, which includes at a minimum a number of “auxiliary processes”: processes that each perform a single, well-defined task across the installation. There is also a process associated with each connection, and an autovacuum daemon. Out-of-the-box, you’ll see just the following processes, once the PostgreSQL server becomes idle:</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Postmaster</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. A “clearing-house” process that manages all other processes, and is minimally exposed to installation-wide failures, so that it has a good chance of recovering the server in the event of an unanticipated failure. To simulate such a failure, you can kill another auxiliary process, and watch as the postmaster starts it again.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Background writer</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. A process that is charged with writing out “dirty”, or unwritten buffers, in the hope of preventing individual connection backends from ever having to.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">WAL Writer</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. A process that writes out WAL, log files that describe changes made to data in PostgreSQL databases. This is part of a whole subsystem through which the server efficiently maintains its crash-safety/durability guarantees. </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Autovacuum launcher</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. This process notices if there is a need to vacuum dead rows, which are an artifact of the Postgres MVCC implementation. It launches autovacuum worker processes as needed, to perform this garbage collection.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Statistics collector</span><span style="vertical-align: baseline;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;">. This process collects statistics on tables and queries, both to guide how autovacuum </span></span><span style="white-space: pre-wrap;">apportions</span> work to vacuum dead rows and to build more detailed statistics for the planner, as well as for general instrumentation.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Checkpointer (new to 9.2)</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. This process is responsible for managing checkpoints - smoothed writing of all data to disk, so that WAL files that describe those changes in sequence before a certain point can finally be truncated. This used to be an additional responsibility of the background writer.</span></div><div><span style="font-family: inherit;"><span style="white-space: pre-wrap;"> </span> <span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The reason that all these wake-ups had to occur within each auxiliary process was that they needed to check very regularly if the Postmaster was still alive, or if they had work assigned to them. If they took too long to notice that the Postmaster was dead (a major failure that necessitates all processes immediately exiting), they would take too long to detach from shared memory, which would prevent the DBA from starting a new instance, as Postgres will refuse to start when it notices this to avoid data corruption.</span></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The solution was to amend the latch - a low-level facility for sleeping until an event occurs, already used for synchronous replication - so that it also monitors Postmaster death. This infrastructure was committed first. I then proceeded to write patches for each auxiliary process, most recently the background writer, which was particularly tricky, though it accounted for most of the wake-ups when idle among auxiliary processes - usually 5 per second.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="vertical-align: baseline;"><span style="font-family: inherit; white-space: pre-wrap;">Some considerable progress has been made. Additional variability has been added to the number of wake-ups per second, but if you monitor the wake-ups per second using powertop at a sufficiently high granularity, it stabilises at:</span></span><br />
<span style="vertical-align: baseline;"><span style="font-family: 'Courier New', Courier, monospace; white-space: pre-wrap;"> </span></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 3.8% ( 35.0) SignalSender</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 3.0% ( 27.2) [kernel scheduler] Load balancing tick</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 2.8% ( 25.6) kworker/0:0</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 0.8% ( 7.6) postgres</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 0.6% ( 5.7) [TLB shootdowns] <kernel IPI></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 0.6% ( 5.6) [kernel core] hrtimer_start (tick_sched_ti</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">To give you some notion of how this relates to CPU states, this is an account of the time my laptop’s CPU spends in each of the states at one moment in time, according to powertop:</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Cn Avg residency P-states (frequencies)</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C0 (cpu running) ( 1.5%) Turbo Mode 3.0%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">polling 0.0ms ( 0.0%) 2.00 Ghz 0.1%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C1 mwait 1.0ms ( 1.3%) 1.80 Ghz 0.1%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C2 mwait 1.5ms ( 1.8%) 1200 Mhz 0.2%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C3 mwait 1.4ms ( 0.4%) 800 Mhz 96.7%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C4 mwait 7.9ms (95.0%)</span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<b><span style="font-family: inherit; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">There is still some more work to do though. Simon Riggs and I submitted a patch to add group commit to PostgreSQL, which is being reviewed in the ongoing commitfest. This feature is anticipated to be very valuable to workloads that are bound by their commit rate, and a number of benchmarks that have been performed are very promising. That patch included support for allowing the WAL Writer to sleep. However, the exact details of group commit’s implementation have yet to be agreed upon, and it is not yet completely clear how effectively we will be able to reduce the WAL writer's idle wake-ups. However, I am hopeful that we will be able to eliminate them entirely, bringing the total number down to 2.6 per second for an idle Postgres 9.2 installation with standard settings. The WAL writer, much like the background writer, accounts for a relatively large 5 wake-ups per second (assuming default settings), and is similarly a bit tricky to adjust in this way.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I’d previously measured the idle wake-ups per second for my distro’s mysqld at 2.2 (mysql-server version 5.1.56, Fedora 14), though when I check now, with mysql-server 5.5.19 on Fedora 16, that’s way up at consistently over 20 wake-ups per second. I’m not sure why that might be, but I welcome input as to what a fair, objective comparison would look like. I have made every effort to be fair here, and I'd speculate that this may have something to do with the storage engine in use in each case.</span></div>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com1tag:blogger.com,1999:blog-7609611625311126307.post-55522486508494767082011-08-06T13:14:00.000-07:002012-08-05T11:14:03.699-07:00Clang now builds Postgres without additional warningsI'm happy to report that as of this evening, Clang builds PostgreSQL without any warnings, apart from a single remaining warning that also occurs when building with GCC, which is actually a bug in GNU Flex that the Flex developers don't seem to want to fix. On GCC 4.6, the warning looks like this:<br />
<br />
In file included from gram.y:12962:0:<br />
scan.c: In function ‘yy_try_NUL_trans’:<br />
scan.c:16246:23: warning: unused variable ‘yyg’ [-Wunused-variable]<br />
<br />
With Clang, however, it looks like this:<br />
<br />
<b>scan.c:16246:23: <span class="Apple-style-span" style="color: magenta;">warning:</span> unused variable 'yyg' [-Wunused-variable]</b><br />
struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be unused depending upon options. */<br />
<span class="Apple-style-span" style="color: lime;">^</span><br />
<span class="Apple-style-span">Note that the "</span><span class="Apple-style-span" style="color: lime;">^</span><span class="Apple-style-span">" is directly underneath the offending variable "yyg" on the terminal emulator that generated this warning.</span><br />
<div>
<br />
<a name='more'></a><br /></div>
Note also that Clang usefully gives the context of the warning, and as a result the displayed comment itself suggests that the warning is spurious.<br />
<br />
The Clang developers finally <a href="http://llvm.org/viewvc/llvm-project?view=rev&revision=136724">committed a fix to remove spurious warnings</a> that occurred when building Postgres. Those warnings were emitted because it was statically detected that there were assignments past what appeared to be the end of a single-element array at the end of a struct. The warning is now suppressed, though only under <a href="http://stackoverflow.com/questions/4559558/one-element-array-in-struct">circumstances exactly consistent with the use of a popular idiom</a> that is seen quite a bit in the Postgres code.<br />
<br />
In working towards removing all Clang warnings, we detected a genuine bug: an enum constant from one enum was being assigned to a variable of a different enum type, a potentially dangerous misuse of an abstraction that the Postgres code uses to represent nodes. This all occurred within a nested macro. Without Clang, it probably would have taken a long time for the problem to be noticed.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com5tag:blogger.com,1999:blog-7609611625311126307.post-79376670865748599512011-07-28T09:12:00.000-07:002012-08-05T11:13:01.357-07:00Could Clang displace GCC generally? Part II: Performance of PostgreSQL binariesThis is the second in a two-part series on Clang. If you haven't already, you'll want to read my original post on the topic, <a href="http://pgeoghegan.blogspot.com/2011/07/could-clang-displace-gcc-among.html">Could Clang displace GCC among PostgreSQL developers? Part I: Intro and compile times</a>.<br />
<br />
So, what about the performance of the PostgreSQL binaries themselves when built with each compiler? I had heard contradictory reports about the performance of Clang-built binaries. In Belgium, Chris Lattner said that Clang-built binaries could perform better, but a number of independent benchmarks suggested that Clang was generally behind, with some notable exceptions. I asked <a href="http://2ndquadrant.com/">2ndQuadrant colleague</a> and <a href="http://www.2ndquadrant.com/books/postgresql-9-0-high-performance/">PostgreSQL performance expert Greg Smith</a> to suggest a useful benchmark as a starting point for comparing the performance of Postgres built with Clang against Postgres built with GCC. He suggested that I apply Jeff Janes’ recent patch for pgbench, which he’d reviewed; it stresses the executor, and therefore the CPU, quite effectively, rather than table locks or IPC mechanisms. The results of this benchmark were very interesting.<br />
<br />
<a name='more'></a><br /><br />
Greg provided me with shell access to a beefy server, the same server that he used in his review of Jeff’s patch, which added the -P option: <a href="http://archives.postgresql.org/message-id/4DFE788F.5020704@2ndQuadrant.com">http://archives.postgresql.org/message-id/4DFE788F.5020704@2ndQuadrant.com</a>. I hacked together a shell script to run pgbench for this purpose. Binaries were built using GCC and Clang, each with exactly the same flags (Clang accepts the same flags as GCC). To smooth out the results and get a conclusive outcome, I settled on sixteen ten-minute -P runs with 4 connections, alternating between the two sets of binaries, for a total of 3 hours. Here’s a summary of the results:<br />
<br />
<b>1) GCC test:</b><br />
tps = 34.242839 (including connections establishing)<br />
<b>2) Clang test:</b><br />
tps = 34.370732 (including connections establishing)<br />
<b>3) GCC test:</b><br />
tps = 34.186687 (including connections establishing)<br />
<b>4) Clang test:</b><br />
tps = 34.922954 (including connections establishing)<br />
<b>5) GCC test:</b><br />
tps = 32.393383 (including connections establishing)<br />
<b>6) Clang test:</b><br />
tps = 34.994233 (including connections establishing)<br />
<b>7) GCC test:</b><br />
tps = 33.019546 (including connections establishing)<br />
<b>8) Clang test:</b><br />
tps = 34.234937 (including connections establishing)<br />
<b>9) GCC test:</b><br />
tps = 33.233653 (including connections establishing)<br />
<b>10) Clang test:</b><br />
tps = 35.233373 (including connections establishing)<br />
<b>11) GCC test:</b><br />
tps = 33.962637 (including connections establishing)<br />
<b>12) Clang test:</b><br />
tps = 33.869868 (including connections establishing)<br />
<b>13) GCC test:</b><br />
tps = 33.488347 (including connections establishing)<br />
<b>14) Clang test:</b><br />
tps = 33.005470 (including connections establishing)<br />
<b>15) GCC test:</b><br />
tps = 33.600023 (including connections establishing)<br />
<b>16) Clang test:</b><br />
tps = 34.770840 (including connections establishing)<br />
<br />
The total transactions per second with the Clang binaries was marginally ahead of that of the GCC binaries. While further analysis is certainly needed, it is a remarkable achievement for Clang to hold its own against, or even slightly outperform, a compiler as mature and popular as GCC here.<br />
<br />
So, is Clang ready for prime time? Well, not quite. Even my bleeding-edge Fedora 15 system only comes with Clang 2.8, and only Clang 2.9 is listed as supported for building PostgreSQL 9.1; the build-time figures reported in part one are only obtainable on very recent revisions of Clang. While I’m extremely encouraged by our pgbench benchmark results, I hesitate to recommend the use of Clang for building production PostgreSQL binaries just yet. The benchmark is quite synthetic, and may not be a great proxy for general performance, although it certainly wasn’t cherry-picked. I encourage others to independently reproduce my work here, and to suggest alternative benchmarks.<br />
<br />
I can heartily recommend Clang for hacking on PostgreSQL today. I was also impressed with how accessible the Clang developers are if you have a problem. Sometimes you have to be a bit persistent, but in general the Clang community is quite responsive to end users’ concerns.<br />
<br />
Watch this space.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com15