tag:blogger.com,1999:blog-76096116253111263072024-03-12T23:05:59.648-07:00Peter Geoghegan's blogMusings on PostgreSQL, database technology in general, and software developmentPeter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.comBlogger26125tag:blogger.com,1999:blog-7609611625311126307.post-85225411690424796992019-03-22T17:07:00.001-07:002019-03-23T14:47:52.029-07:00Visualizing Postgres page images within GDBIt's straightforward to set up GDB to quickly invoke pg_hexedit on a page image, without going through the filesystem. The page image can even come from a local temp buffer.<br />
A user-defined GDB command can be created that shows an arbitrary page image in pg_hexedit from an interactive GDB session.<br />
<br />
This is a good way to understand what's really going on when debugging access method code. It also works well with core dumps. I found this valuable during a recent project to improve the Postgres B-Tree code.<br />
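For readers who just want the general shape of it, a user-defined command can look something like the following sketch. The command name, file paths, and the exact pg_hexedit invocation here are illustrative assumptions; the pg_hexedit README has the maintained recipe:

```gdb
# Hypothetical .gdbinit fragment -- names and paths are illustrative only
define pg_hexedit_page
  # $arg0: a Page pointer (e.g. from BufferGetPage(), or a local temp buffer);
  # dump one standard 8192 byte (BLCKSZ) block to a throwaway file
  dump binary memory /tmp/gdb_page.page $arg0 ((char *) $arg0 + 8192)
  # Hand the raw page image to the pg_hexedit frontend from the same session
  shell ~/pg_hexedit/pg_hexedit /tmp/gdb_page.page > /tmp/gdb_page.tags
end
```

The key trick is GDB's built-in dump binary memory command, which writes an arbitrary memory range to a file without involving the Postgres buffer manager or the filesystem copy of the relation at all.<br />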
<br />
An example of how to make this work is available from a newly added section of the pg_hexedit README file:<br />
<br />
<a href="https://github.com/petergeoghegan/pg_hexedit/#using-pg_hexedit-while-debugging-postgres-with-gdb">https://github.com/petergeoghegan/pg_hexedit/#using-pg_hexedit-while-debugging-postgres-with-gdb</a>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-15363979985746506262018-05-18T16:11:00.004-07:002018-05-18T16:17:52.676-07:00Visualizing a column's space overhead using pg_hexedit<a href="https://github.com/petergeoghegan/pg_hexedit" target="_blank">pg_hexedit</a> recently gained the ability to annotate the space taken up by each individual column/attribute within each individual tuple. This works with tables, and with B-Tree indexes.<br />
<br />
I had to come up with a way of passing the pg_hexedit frontend utility the relevant <a href="https://www.postgresql.org/docs/current/static/catalog-pg-attribute.html" target="_blank">pg_attribute</a> metadata to make this work. This metadata describes the "shape" of individual tuples in a relation (backend code uses a closely related structure called a "tuple descriptor"). My approach works seamlessly in simple cases, and the same metadata <a href="https://github.com/petergeoghegan/pg_hexedit#direct-invocation" target="_blank">can still be passed when running the pg_hexedit command line tool manually</a>.<br />
<br />
<a name='more'></a><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEXsrT-v1OqvDp7OLr0CjCE7GIhegE6lAsHZL7hTuqz4TzD9ID4dVOE1ZFmi7PBZRWEc9REro8WrYW6AXito-slXy5nZUuWbKn1m4jcnYVMV0zAj3Zrf-46nI5hWpjL9w3YhvQNMuvY8SP/s1600/pg_attribute_hexedit_blog.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1029" data-original-width="1299" height="253" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEXsrT-v1OqvDp7OLr0CjCE7GIhegE6lAsHZL7hTuqz4TzD9ID4dVOE1ZFmi7PBZRWEc9REro8WrYW6AXito-slXy5nZUuWbKn1m4jcnYVMV0zAj3Zrf-46nI5hWpjL9w3YhvQNMuvY8SP/s320/pg_attribute_hexedit_blog.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">pg_attribute system catalog table with column annotations/tags</td></tr>
</tbody></table>
<br />
This new capability could be applied to optimizing the data layout of a table that is expected to eventually have a massive number of rows. Carefully choosing the order and type of each column can reduce the total on-disk footprint of a table by an appreciable amount, especially when the final table ends up with several 1 byte columns that get packed together.<br />
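To get a feel for why column order matters, here's a simplified Python model of the alignment rules. It's a sketch only: the inputs mimic typlen/typalign from pg_type, and it ignores the tuple header, null bitmap, variable-length headers, and TOAST.

```python
# Simplified model of PostgreSQL attribute alignment (typlen/typalign);
# illustrative only -- ignores tuple headers, NULLs, varlena, and TOAST.

ALIGN = {"c": 1, "s": 2, "i": 4, "d": 8}  # typalign codes -> byte alignment

def data_size(columns):
    """columns: list of (typlen, typalign) pairs, in declared order."""
    offset = 0
    for typlen, typalign in columns:
        pad = -offset % ALIGN[typalign]   # padding needed to reach alignment
        offset += pad + typlen
    return offset

# boolean (1, 'c'), bigint (8, 'd'), integer (4, 'i'), in two orders:
bad  = [(1, "c"), (8, "d"), (1, "c"), (8, "d"), (4, "i")]  # interleaved
good = [(8, "d"), (8, "d"), (4, "i"), (1, "c"), (1, "c")]  # widest first
print(data_size(bad), data_size(good))    # prints: 36 22
```

Sorting columns from widest to narrowest alignment lets the 1-byte columns pack together at the end, which is exactly the effect that the new per-attribute annotations make visible on real pages.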
<br />
I am aware of several <a href="https://youtu.be/BgcJnurVFag?t=14m58s" target="_blank">PostgreSQL users who found it worthwhile to have a highly optimized tuple layout</a>, going so far as to use their own custom datatypes. <a href="https://en.wikipedia.org/wiki/Data_structure_alignment" target="_blank">Alignment</a>-aware micro-optimization of a Postgres client application's schema won't help much in most cases, but it can help noticeably with things like fact tables, or tables that contain machine-generated event data. Developing a sense of proportion around storage overhead should now be easier and more intuitive.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-37826712565586547032018-03-19T21:44:00.000-07:002018-03-20T01:16:32.610-07:00Decoding pg_filenode.map files with pg_filenodemapdataFrom time to time, you may need to figure out which file in a PostgreSQL data directory corresponds to a particular table or index in the database. For example, <a href="https://github.com/petergeoghegan/pg_hexedit/" target="_blank">pg_hexedit</a> users sometimes <a href="https://github.com/petergeoghegan/pg_hexedit#direct-invocation" target="_blank">need this information</a>, since pg_hexedit is a frontend utility that works by reading relation files from the filesystem. In practice, a pg_hexedit <a href="https://github.com/petergeoghegan/pg_hexedit#quickstart-guide---using-the-convenience-scripts" target="_blank">convenience script</a> can usually be used instead. Users need only give the name of the table or index that is to be examined. The convenience scripts call the <a href="https://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADMIN-DBLOCATION" target="_blank">built-in function pg_relation_filepath()</a> via an SQL query.<br />
<br />
This approach won't always work, though. pg_hexedit is a tool for investigating corruption, and sometimes corruption can affect system catalogs in a way that makes it impossible to even establish a connection to the database. You may find that you're greeted with an arcane error any time you attempt to connect to the database. The error may look something like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">ERROR: catalog is missing 3 attribute(s) for relid 827</span><br />
<br />
<a name='more'></a><br />
In this example, the issue that prevents us from connecting must have something to do with the system catalog <a href="https://www.postgresql.org/docs/current/static/catalog-pg-attribute.html" target="_blank">pg_attribute</a>, and/or an index on pg_attribute. The relation with relid/<a href="https://www.postgresql.org/docs/current/static/catalog-pg-class.html" target="_blank">pg_class</a> OID 827 (pg_default_acl_role_nsp_obj_index) appears to lack pg_attribute entries, which makes that built-in catalog index unusable (note that there is no reason to think that the underlying relfile for pg_default_acl_role_nsp_obj_index is itself corrupt). To confirm this theory, we'll need to directly examine the pg_attribute relation for the database. Of course, there is no way to query pg_attribute, because we cannot connect. Moreover, there is no easy way to know where the files associated with pg_attribute are, so that we can at least examine pg_attribute using pg_hexedit. The "relfilenode" number that corresponds to pg_attribute (or any other table) isn't hard-coded or stable. For example, the relfilenode of a table <a href="https://www.postgresql.org/docs/current/static/storage-file-layout.html" target="_blank">will change any time VACUUM FULL</a> is used on the table.<br />
<br />
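For context, pg_filenode.map is a small, fixed-size (512 byte) binary file kept in each database directory. Its layout (struct RelMapFile in src/backend/utils/cache/relmapper.c) is simple: a magic number, a mapping count, up to 62 (pg_class OID, relfilenode) pairs, then a CRC. Here is a minimal decoding sketch, assuming the PostgreSQL 10-era little-endian layout and skipping CRC verification:

```python
import struct

# Illustrative parser for pg_filenode.map (struct RelMapFile in
# src/backend/utils/cache/relmapper.c); assumes little-endian,
# skips CRC-32C verification for brevity.

RELMAPPER_FILEMAGIC = 0x592717   # magic constant from relmapper.c
MAX_MAPPINGS = 62                # fixed-size array of mappings

def parse_filenode_map(data):
    """Return {pg_class OID: relfilenode} from raw pg_filenode.map bytes."""
    magic, num = struct.unpack_from("<ii", data, 0)
    if magic != RELMAPPER_FILEMAGIC or not 0 <= num <= MAX_MAPPINGS:
        raise ValueError("not a valid pg_filenode.map file")
    mappings = {}
    for i in range(num):
        oid, filenode = struct.unpack_from("<II", data, 8 + 8 * i)
        mappings[oid] = filenode
    return mappings
```

Running something along these lines against a database directory's pg_filenode.map would reveal, among other things, the current relfilenode for pg_attribute (pg_class OID 1249).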
I've written a tool called pg_filenodemapdata that can help when something like this happens. This new pg_hexedit-bundled program <a href="https://github.com/petergeoghegan/pg_hexedit#determining-catalog-relation-file-mappings-without-a-database-connection" target="_blank">prints the contents of pg_filenode.map files</a>. It can be used to determine <a href="https://blog.2ndquadrant.com/postgresql-filename-to-table/" target="_blank">the relfilenode numbers that correspond</a> to system catalog entries from a pg_filenode.map file (pg_class OID to relfilenode map file). pg_attribute is an example of a special system catalog that is mapped by per-database pg_filenode.map files. As a rule of thumb, the most severe system catalog corruption is corruption that affects one of the catalogs that Postgres tracks within a pg_filenode.map file. pg_filenodemapdata should help with getting to the bottom of the problem; it's now possible to at least examine the corrupt pg_attribute file.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-14336724499136696592018-01-08T14:06:00.000-08:002018-01-08T14:06:56.273-08:00Exploring SP-GiST and BRIN indexes visually using pg_hexeditSupport for both <a href="https://www.postgresql.org/docs/current/static/brin-intro.html" target="_blank">BRIN</a> and <a href="https://www.postgresql.org/docs/current/static/spgist-intro.html" target="_blank">SP-GiST</a> access methods was recently added to <a href="https://github.com/petergeoghegan/pg_hexedit" target="_blank">pg_hexedit</a>, the <a href="https://pgeoghegan.blogspot.com/2017/11/pghexedit-rich-hex-editor-annotations.html" target="_blank">experimental hex editor framework for PostgreSQL relation files</a>. These were the final access methods among the <a href="https://www.postgresql.org/docs/current/static/indexam.html" target="_blank">standard Postgres index access methods</a> that required support.<br />
<br />
<a name='more'></a><h3>
SP-GiST (Space-Partitioned GiST)</h3>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><div style="text-align: left;">
<br /></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMAC5yguTH_ODqpiTKAUqLSa6Ana9jH-_mJcQx-3-WzSTMxzPfdx-eaxUZ4wpI7oXc7ynmaEC3o7smrTWoDgQhgGY_JwAvFa1p53I04qfCYN1uO3XKrWzski6grotJs5U3G0uYfFAt0Ore/s1600/spgist-hexedit-cropped.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1501" data-original-width="1600" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMAC5yguTH_ODqpiTKAUqLSa6Ana9jH-_mJcQx-3-WzSTMxzPfdx-eaxUZ4wpI7oXc7ynmaEC3o7smrTWoDgQhgGY_JwAvFa1p53I04qfCYN1uO3XKrWzski6grotJs5U3G0uYfFAt0Ore/s320/spgist-hexedit-cropped.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Beginning of an <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/spgist/README#L35" target="_blank">SP-GiST leaf</a> page</td></tr>
</tbody></table>
<br />
SP-GiST is unique among index access methods whose index structure is tree-like, in that it supports tree structures that are <i>unbalanced</i>. SP-GiST operator classes exist that support <a href="https://en.wikipedia.org/wiki/K-d_tree" target="_blank">k-d trees</a>, <a href="https://en.wikipedia.org/wiki/Quadtree" target="_blank">quadtrees</a>, and <a href="https://en.wikipedia.org/wiki/Suffix_tree" target="_blank">suffix trees</a>. These structures are traditionally only suited to a fully in-memory representation, with dynamically-allocated nodes that contain a small number of simple <a href="https://en.wikipedia.org/wiki/Pointer_(computer_programming)" target="_blank">pointers</a> (byte addresses) pointing to other nodes.<br />
<br />
SP-GiST presents a <i>generalized</i> interface through which all of these space-partitioned trees can be constructed for a given datatype, <a href="https://www.pgcon.org/2011/schedule/attachments/197_pgcon-2011.pdf" target="_blank">in a way that minimizes disk seeks</a> (PDF) and works well with block-orientated storage. Essentially, SP-GiST maps tree nodes onto disk blocks in an adaptive fashion, rather than simply having a block directly correspond to a tree node, as happens with other access methods.<br />
<br />
Particularly intricate data structures are needed to support all of this. Space utilization can be an issue with SP-GiST indexes, though that's probably very workload dependent; it's the kind of thing that pg_hexedit can be effective at representing visually.<br />
<br />
SP-GiST is a good example of the PostgreSQL community implementing a concept that comes directly from <a href="https://www.cs.purdue.edu/spgist/papers/W87R36P214137510.pdf" target="_blank">state of the art database research</a> (PDF). I suspect that we have yet to fully realize the benefit of SP-GiST for specific application domains, due to a simple lack of awareness among users and potential users that work in those domains. Perhaps this enhancement can contribute in some small way towards a better understanding of what is possible.<br />
<br />
<h3>
BRIN (Block Range Index)</h3>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_D_zr1DcMfzgn_yZd7JMELxg5EpsAvqZ9qWbKCzT7UIR6UNrIqD2TIrYwyey_tskdp6iu063u3b32rnpvuBiKJjHedaAG5G2tzU4L27HnPaXPy1vix7IEvn7MxnstZJEzG6NCsnJ0top8/s1600/brin-revmap-page-cropped.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1510" data-original-width="1600" height="301" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_D_zr1DcMfzgn_yZd7JMELxg5EpsAvqZ9qWbKCzT7UIR6UNrIqD2TIrYwyey_tskdp6iu063u3b32rnpvuBiKJjHedaAG5G2tzU4L27HnPaXPy1vix7IEvn7MxnstZJEzG6NCsnJ0top8/s320/brin-revmap-page-cropped.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">BRIN <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/brin/README#L76" target="_blank">"revmap"</a> page</td></tr>
</tbody></table>
The structure of BRIN indexes is not at all tree-like. BRIN works by summarizing the locations of ranges of values within the underlying indexed table, and exploiting an underlying natural ordering (e.g., a date column on a large append-only table with historic sales records). BRIN indexes are often very small, even when they index a very large underlying table.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizFeIHTiwAGM7rcmp2s3bH8F_vyc78TNCalwGkEntoaTOSNZuST704vOsK4Eu5CMNE0R-0GsP_8aCoI_TOHB0Gop_5uXxBzhzcPi-2w3XU3oXZVjgzQr90mOJXglkVhT85Cd8Wrigbq7CS/s1600/brin-summary-page-cropped.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1503" data-original-width="1600" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizFeIHTiwAGM7rcmp2s3bH8F_vyc78TNCalwGkEntoaTOSNZuST704vOsK4Eu5CMNE0R-0GsP_8aCoI_TOHB0Gop_5uXxBzhzcPi-2w3XU3oXZVjgzQr90mOJXglkVhT85Cd8Wrigbq7CS/s320/brin-summary-page-cropped.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">BRIN summary page</td></tr>
</tbody></table>
The on-disk representation is fairly simple. One novel aspect of BRIN's on-disk representation is that it's the only index access method that updates index tuples in-place (this is possible in part because it doesn't have one index tuple for every heap tuple/<a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/README.HOT#L45" target="_blank">HOT chain</a> from the indexed table, another unique property). These in-place updates can happen when a range in a summary page changes. Visibility into how often this happens in the real world may prove useful.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-45098760913114301752017-12-15T12:29:00.000-08:002017-12-15T12:29:04.103-08:00pg_hexedit now supports GiST, GIN, and hash indexes I've added several enhancements to <a href="https://github.com/petergeoghegan/pg_hexedit" target="_blank">pg_hexedit</a>, the experimental hex editor toolkit that allows you to open up raw PostgreSQL relation files with useful tags and annotations about the state and purpose of each field. The tool now supports annotations for <a href="https://www.postgresql.org/docs/current/static/gist-intro.html" target="_blank">GiST</a>, <a href="https://www.postgresql.org/docs/current/static/gin-intro.html" target="_blank">GIN</a>, and <a href="https://rhaas.blogspot.com/2017/09/postgresqls-hash-indexes-are-now-cool.html" target="_blank">hash indexes</a>, as well as <a href="https://www.postgresql.org/docs/current/static/functions-sequence.html" target="_blank">sequences</a>.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM8NlGSsrF41aSQX4mrzKMmEmJ2W9FrG26QUYYDuGRokZP1uzX7i4BlONr1S_AnnsScAF-2LNiAxCps-3SH_uEXb8OohLeflaD24Od8kH6YQPYHDQj9Yw3J3IO7omOMMpjfKqMAPMRggTA/s1600/gin-posting-tree-page.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="883" data-original-width="1600" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM8NlGSsrF41aSQX4mrzKMmEmJ2W9FrG26QUYYDuGRokZP1uzX7i4BlONr1S_AnnsScAF-2LNiAxCps-3SH_uEXb8OohLeflaD24Od8kH6YQPYHDQj9Yw3J3IO7omOMMpjfKqMAPMRggTA/s320/gin-posting-tree-page.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">GIN "posting tree" leaf page. <a href="https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html" target="_blank">Compressed</a> TIDs are in orange.</td></tr>
</tbody></table>
<a name='more'></a><br />It wasn't very time consuming to add these enhancements, because most index access methods share the same basic approach to page layout. I plan to add support for the two remaining index access methods (BRIN and SP-GiST) early in the new year. My hope is that this will spur interest in <a href="https://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf" target="_blank">the internals of PostgreSQL index access methods</a> (PDF link), and how they deal with <a href="https://pgconf.ru/media/2016/05/13/tuple-internals.pdf" target="_blank">index tuples and space management</a> (PDF link).<br />
<br />
Hat tip to Pat Shaughnessy, who just today <a href="http://patshaughnessy.net/2017/12/15/looking-inside-postgres-at-a-gist-index" target="_blank">wrote a great blog post</a> on the internals of GiST. The fact that he has done such a thorough job of explaining how GiST works to a wider audience is encouraging.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-70902941893278919082017-11-26T17:48:00.000-08:002017-11-26T17:48:13.105-08:00pg_hexedit: Rich hex editor annotations for Postgres relfilesI've written an experimental tool for presenting PostgreSQL relation files in a hex editor with annotations/tags and tooltips that show the structure of the data and its content, including <a href="https://en.wikipedia.org/wiki/Bit_field" target="_blank">bit field values</a>. This tool is called pg_hexedit, and is available from:<br />
<br />
<a href="https://github.com/petergeoghegan/pg_hexedit">https://github.com/petergeoghegan/pg_hexedit</a><br />
<br />
pg_hexedit is built on top of the open source, cross-platform GUI hex editor <a href="https://github.com/EUA/wxHexEditor" target="_blank">wxHexEditor</a>. Since it's an experimental tool that is primarily made available for educational purposes, you are well advised to not use it on any data directory that isn't <b>entirely disposable</b>. It may cause<b> data corruption</b>. Opening a Postgres relation file in a hex editor while the server is running is a <b>fundamentally unsafe</b> thing to do if you care about your data. Use of the tool should be limited to throwaway installations on users' personal machines.<br />
<a name='more'></a><br />
wxHexEditor and pg_hexedit together show information about each individual field in an interactive, easy-to-use way:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI9UloFY_X8mknJubbqlXcBzrzUJI1DBVVMfr3aKaOBAAudUZ9SRlpVJmuBNu8qgXbaxE0BQrIqfurZBXNAPyvwJNT1mrJNRCwbnBoaZHe-_7Vb99dOAbY_HYNgw37XFs3b3uSNiAUElFF/s1600/pg_hexedit_pg_type.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="881" data-original-width="1600" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI9UloFY_X8mknJubbqlXcBzrzUJI1DBVVMfr3aKaOBAAudUZ9SRlpVJmuBNu8qgXbaxE0BQrIqfurZBXNAPyvwJNT1mrJNRCwbnBoaZHe-_7Vb99dOAbY_HYNgw37XFs3b3uSNiAUElFF/s320/pg_hexedit_pg_type.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="https://www.postgresql.org/docs/current/static/catalog-pg-type.html" target="_blank">pg_type</a> catalog table opened in wxHexEditor, with annotations</td></tr>
</tbody></table>
<br />I originally wrote the tool in order to meet my own needs in this area. I was <a href="https://github.com/petergeoghegan/amcheck" target="_blank">working on corruption detection</a>, and it became clear that a tool like this would help with corruption simulation/white-box testing, something that I've spent rather a lot of time on. Simulating and testing novel corruption scenarios became significantly easier with pg_hexedit.<br />
<br />
Tools like <a href="https://www.postgresql.org/docs/current/static/pageinspect.html" target="_blank">contrib/pageinspect</a> are great, but they are still somewhat interpretive, which can actually be a hindrance for this kind of work. In short, pageinspect functions show "what tuples are on the page" logically, as well as the physical contents of <i>individual</i> tuples, but the <a href="https://www.postgresql.org/docs/current/static/storage-page-layout.html" target="_blank">exact physical state of the entire page</a> is obscured, in order to support an <a href="http://rachbelaid.com/introduction-to-postgres-physical-storage/" target="_blank">item-pointer</a>-wise SQL interface. The subtle details of how free space is managed <i>within a single page</i> can matter. At least to me.<br />
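As a small illustration of the fixed page layout that pg_hexedit annotates, here is a Python sketch that decodes the 24-byte PageHeaderData and the ItemId (line pointer) array that follows it. The field layout follows src/include/storage/bufpage.h; the endianness and the lp_* bit packing shown assume a typical little-endian build, and the synthetic input in the usage note is made up:

```python
import struct

# Illustrative decoder for the fixed PageHeaderData layout and the line
# pointer (ItemIdData) array that follows it (src/include/storage/bufpage.h).
# Assumes a little-endian build; checksums are not verified.

def parse_page_header(page):
    (xlogid, xrecoff, checksum, flags, lower, upper,
     special, pagesize_version, prune_xid) = struct.unpack_from(
        "<IIHHHHHHI", page, 0)          # 24 bytes in total
    nitems = (lower - 24) // 4          # ItemIds fill header end .. pd_lower
    items = []
    for i in range(nitems):
        (word,) = struct.unpack_from("<I", page, 24 + 4 * i)
        lp_off = word & 0x7FFF          # lp_off: low 15 bits
        lp_flags = (word >> 15) & 0x3   # lp_flags: 2 bits (1 = LP_NORMAL)
        lp_len = (word >> 17) & 0x7FFF  # lp_len: high 15 bits
        items.append((lp_off, lp_flags, lp_len))
    return {"pd_lower": lower, "pd_upper": upper, "pd_special": special,
            "items": items}
```

Free space is exactly the hole between pd_lower (end of the ItemId array) and pd_upper (start of tuple data), which is why those two fields, and the line pointers themselves, are worth seeing in their raw physical form.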
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggCM4mtBcdTx_ZKAB9-jOzK5zPjEsRjKFTTCVjgrbHMLbofo3quDUdj6TlfTeuFXm4Jk4j2vlzJUSbrcPIto6JjWcclKU5CzFK66wBMXYq5pw_B48SyJ_RuSdV7SUVdvpwH-NAydRd6mPV/s1600/pg_hexedit_cities_index.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="881" data-original-width="1600" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggCM4mtBcdTx_ZKAB9-jOzK5zPjEsRjKFTTCVjgrbHMLbofo3quDUdj6TlfTeuFXm4Jk4j2vlzJUSbrcPIto6JjWcclKU5CzFK66wBMXYq5pw_B48SyJ_RuSdV7SUVdvpwH-NAydRd6mPV/s320/pg_hexedit_cities_index.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">"cities" nbtree page, starting with ItemId array (shown as blue tags)</td></tr>
</tbody></table>
I eventually realized that pg_hexedit is also broadly useful as an educational tool, and decided to make it available as an open source project. Anything that helps to demystify the internals of Postgres seems like a good thing to me. I hope that pg_hexedit will be useful to users or <a href="https://wiki.postgresql.org/wiki/Developer_FAQ" target="_blank">aspiring hackers</a> that want to understand how PostgreSQL works from the ground up. I welcome pull requests from users that want to expand pg_hexedit. For example, support for additional index access methods would be nice. Only <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/heapam.c" target="_blank">heapam</a> and the <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/nbtree/README" target="_blank">nbtree index AM</a> are currently supported.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-21949926947210206992017-10-26T08:27:00.000-07:002017-12-15T12:53:07.370-08:00amcheck “table-matches-index” enhancement now available, detects "freeze-the-dead" corruptionI’m pleased to announce that v1.2 of amcheck, a tool for detecting that PostgreSQL relations are logically consistent (that they do not <i>appear</i> to be corrupt) is now generally available. This version adds a big enhancement - the optional ability to check if every tuple that <i>should</i> have an entry in the index <i>does</i> in fact have such an entry. Specifically, we check for a table entry with matching data, as well as a matching <a href="https://www.postgresql.org/docs/current/static/ddl-system-columns.html" target="_blank">heap TID</a>. This happens at the end of the existing tests, as an optional extra step.<br />
<br />
This enhancement is significant because it seems much more likely to catch corruption in the wild. In general, inconsistencies between a table and its indexes are more likely to occur than inconsistencies between blocks within an index for many reasons. There is simply a much larger window for an inconsistency to arise when something is amiss with database storage that breaks <a href="https://www.postgresql.org/docs/current/static/wal-reliability.html" target="_blank">the assumptions</a> PostgreSQL makes during crash recovery, for example.
<br />
<br />
The enhancement is also significant because it played a role in <a href="https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com">identifying</a> a <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a5736bf754c82d8b86674e199e232096c679201d">PostgreSQL data corruption bug</a> that will be fixed in the next point release, <a href="https://www.postgresql.org/developer/roadmap/">scheduled</a> for November 9th, 2017. This bug affects <i>all</i> supported PostgreSQL versions. It was informally dubbed the “freeze-the-dead” bug.<br />
<br />
<b>November 6 2017 update: </b>The fix was <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6764eb3aea63f3f95582bd660785e2b0d4439f9" target="_blank">reverted</a> due to additional concerns that came to light. The community is working on a new, more comprehensive fix for the next point release.<br />
<br />
<b>December 15 2017 update: </b>A <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9c2f0a6c3cc8bb85b78191579760dbe9fb7814ec" target="_blank">new fix has been committed</a>, and will appear in 9.3.21, 9.4.16, 9.5.11, 9.6.7, and 10.2 point releases, <a href="https://www.postgresql.org/developer/roadmap/">scheduled</a> for February 8th, 2018.<br />
<br />
Packages for v1.2 are available from the community Debian/Ubuntu apt repository, as well as packages from the community Redhat/CentOS/SLES yum repository. Full details on installing these packages are available from the README:<br />
<br />
<a href="https://github.com/petergeoghegan/amcheck/">https://github.com/petergeoghegan/amcheck/</a><br />
<br />
<a name='more'></a><h3>
"Freeze-the-dead" corruption detection</h3>
<div>
<br /></div>
I should emphasize that the bug is something that I believe to be <b>very unlikely</b> to hit in the real world, because there is only a very small window. Moreover, it is probably virtually impossible to hit without a manual VACUUM FREEZE. It can only happen with the use of foreign keys (strictly speaking, directly <a href="http://www.databasesoup.com/2015/12/a-christmas-present-postgresql-95-rc1.html">allocating MultiXacts</a> could cause the issue in environments where foreign keys are not used).<br />
<br />
amcheck is effective in detecting corruption caused by the “freeze-the-dead” bug because the corruption results in a logical inconsistency in a heap page. There could be <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=20b655224249e6d2daf7ef0595995228baddb381" target="_blank">multiple versions of the same row visible at once</a> in cases where a prune never took place. In other scenarios, there could be a <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/README.HOT#L34" target="_blank">HOT chain</a> that is <a href="https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/access/heap/README.HOT#L101" target="_blank">pruned prematurely</a>, leading to wrong answers from query plans that use an index scan, while sequential scan plans (plans that should get the same answer without using an index) still give correct answers. <b>Even though amcheck was not written with these specific inconsistencies in mind, it still seems to reliably detect them</b>.<br />
<br />
<h3>
Managing risk</h3>
<div>
<br /></div>
The PostgreSQL development community is well known for putting data integrity first. With that in mind, I think that it’s important to avoid overstating the significance of the fact that amcheck detects corruption caused by bugs in PostgreSQL itself. While it is true that finding data corruption bugs is one goal of amcheck, and while it is also true that it has actually done so <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=008c4135ccf67e74239a17a85f912d1a51b6349e">more</a> than <a href="https://trac.osgeo.org/postgis/ticket/3841">once</a> already, that in itself shouldn’t be seen as a blemish on the project’s reputation for ensuring data integrity. If your take-away about amcheck appearing necessary is along the lines of “that certainly doesn’t inspire confidence”, I would argue that you’re thinking about the issues in the wrong way. Besides, amcheck is hardly the first tool like this to appear - similar tools are available for <i>all</i> other major RDBMSs.<br />
<br />
I think that it will prove useful to have an immediate way of mechanically detecting corruption caused by the “freeze-the-dead” bug. We may actually hear reports of corruption that it has caused in the wild (if you happen to have been affected, <a href="https://wiki.postgresql.org/wiki/Mailing_Lists#Using_the_discussion_lists" target="_blank">please let the pgsql-hackers list</a> know about it). I also believe that the newly enhanced amcheck will detect corruption caused by other historic bugs, including the <a href="https://blog.2ndquadrant.com/index-corruption-in-create-index-concurrently/">CREATE INDEX CONCURRENTLY bug</a> detected in February of 2017, as well as a <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3c84046490bed3c22e0873dc6ba492e02b8b9051" target="_blank">similar CREATE INDEX CONCURRENTLY bug</a> from back in 2012. Estimating the prevalence of corruption of this general nature is a very tricky business, though; it largely comes down to workload, and those are incredibly varied. And, even if an issue is on average very unlikely to strike, that doesn’t help those that <i>are</i> affected despite those odds. In short, there is no such thing as the average PostgreSQL database, and there are <a href="https://en.wikipedia.org/wiki/Ludic_fallacy" target="_blank">many practical problems with applying statistical models to complex domains</a>.<br />
<br />
I would venture to predict that a more detailed picture of how prevalent corruption like this is across all PostgreSQL installations will emerge over time, as amcheck is used more widely, and that that picture will be fairly boring. It still seems important to make every effort here, though. Going forward, we’ll have more to go on than educated guesses. And, those unlucky few that turn out to be affected by bugs that lead to corruption will have a relatively simple, non-disruptive tool to isolate the problem. DBAs can use amcheck to help with managing the risk of data corruption, including but not limited to corruption originating from bugs in PostgreSQL.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-16107001275037940552017-10-15T18:44:00.000-07:002017-10-15T18:47:46.212-07:00amcheck for Postgres 9.4+ now available from PGDG apt and yum repositoriesamcheck, a tool for index corruption detection, now has packages available from the community Debian/Ubuntu apt repository, as well as packages from the community Redhat/CentOS/SLES yum repository.<br />
<br />
This means that installations built on those community resources can easily install amcheck, even on PostgreSQL versions before PostgreSQL 10, the release in which <a href="https://www.postgresql.org/docs/current/static/amcheck.html" target="_blank">contrib/amcheck</a> first appears.<br />
<br />
Full details on installing these packages are available from the README: <a href="https://github.com/petergeoghegan/amcheck/" target="_blank">https://github.com/petergeoghegan/amcheck/</a><br />
<br />
It's also possible to install the packages on PostgreSQL 10, because the extension these packages install is actually named "amcheck_next" (not "amcheck"). Currently, it isn't really useful to install "amcheck_next" on PostgreSQL 10, because its functionality is identical to contrib/amcheck. That's expected to change soon, though. I will add a new enhancement to amcheck_next in the coming weeks, allowing verification functions to perform "heap matches index" verification on top of what is already possible.<br />
<br />
Many thanks to Christoph Berg and Devrim Gündüz for their help with the packaging.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-9820484166169630712017-07-18T22:43:00.000-07:002017-07-19T12:20:35.826-07:00PostgreSQL Index bloat under a microscopeI've posted a snippet query to the PostgreSQL Wiki that "summarizes the keyspace" of a target B-Tree index. This means that it displays which range of indexed values belong on each page, starting from the root. It requires <a href="https://www.postgresql.org/docs/current/static/pageinspect.html" target="_blank">pageinspect</a>. The query recursively performs a breadth-first search. Along the way, it also displays information about the space utilization of each page, and the number of distinct key values that actually exist on the page, allowing you to get a sense of how densely filled each page is relative to what might be expected.<br />
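The overall shape of that breadth-first traversal can be sketched outside the database (a toy Python model with made-up block numbers, not the wiki query itself):

```python
from collections import deque

# Toy model of the query's breadth-first search: visit the root level
# first, then every internal page, then the leaf level (level 0).
# The tree structure and block numbers here are made up for illustration.
tree = {290: [3, 289, 575], 3: [1, 2], 289: [4, 5], 575: [6]}

def summarize_keyspace(root_blkno, root_level):
    visited = []
    queue = deque([(root_level, root_blkno)])
    while queue:
        level, blkno = queue.popleft()
        visited.append((level, blkno))
        for child in tree.get(blkno, []):   # leaf pages have no children
            queue.append((level - 1, child))
    return visited

# All of the root level is listed before level 1, and level 1 before
# level 0, matching the ordering of the query's output:
print(summarize_keyspace(290, 2))
```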
<br />
The query is available from:<br />
<br />
<a href="https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index">https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index</a><br />
<br />
<a name='more'></a><br />
If I use the query against the largest index that results from <a href="https://www.postgresql.org/docs/current/static/pgbench.html" target="_blank">initializing a pgbench database</a> at scale factor 10 (pgbench_accounts_pkey), the query takes about 3 seconds to execute on my laptop, and returns the following:<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"> <code style="color: black; word-wrap: normal;">
level | l_item | blkno | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size | distinct_real_item_keys | highkey | distinct_block_pointers
-------+--------+-------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------+---------+-------------------------
2 | 1 | 290 | 2 | r | 10 | 0 | 15 | 8192 | 7956 | 10 | | 10
1 | 1 | 3 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 103945 | 284
1 | 2 | 289 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 207889 | 284
1 | 3 | 575 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 311833 | 284
1 | 4 | 860 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 415777 | 284
1 | 5 | 1145 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 519721 | 284
1 | 6 | 1430 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 623665 | 284
1 | 7 | 1715 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 727609 | 284
1 | 8 | 2000 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 831553 | 284
1 | 9 | 2285 | 0 | i | 285 | 0 | 15 | 8192 | 2456 | 284 | 935497 | 284
1 | 10 | 2570 | 0 | i | 177 | 0 | 15 | 8192 | 4616 | 177 | | 177
0 | 1 | 1 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 367 | 6
0 | 2 | 2 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 733 | 6
0 | 3 | 4 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 1099 | 6
...
0 | 2730 | 2741 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 999181 | 6
0 | 2731 | 2742 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 999547 | 6
0 | 2732 | 2743 | 1 | l | 367 | 0 | 16 | 8192 | 808 | 366 | 999913 | 6
0 | 2733 | 2744 | 1 | l | 88 | 0 | 16 | 8192 | 6388 | 88 | | 2
(2744 rows)
</code>
</pre>
<br />
Note that I changed the query to use <a href="https://wiki.postgresql.org/wiki/Index_Maintenance#Interpreting_bt_page_items.28.29_.22data.22_field_as_a_little-endian_int4_attribute" target="_blank">int4_from_page_data()</a> here, so that the split points/high key values are displayed as ordinary int4 output. The items are in logical order (the int4 <a href="https://www.postgresql.org/docs/current/static/indexes-ordering.html" target="_blank">sort order the index uses</a>).<br />
<br />
A few interesting things are displayed here, or can be inferred from what is displayed:<br />
<ul>
<li>There are 3 levels - one root level, an additional level of internal pages, and the leaf level, which is always level 0. (Note that most leaf pages are omitted for brevity.)</li>
</ul>
<div>
B-Trees are almost always rather short, and in general tend to look a lot more like a bush than a tree when they are greater than a few pages in size. We see that here.</div>
<ul>
<li>Leaf pages point to 6 distinct table blocks in all cases, with the sole exception of the rightmost page (at block 2744). Since the index is 21MB and the table is 128MB, the ratio of table size to index size is just over 6:1. They match.</li>
</ul>
<div>
The size of the 11 internal index pages is negligible. The width of both (leaf) index tuples and heap (table) tuples happens to be completely uniform with any pgbench table. The ratio of the width of every index tuple to any heap tuple almost exactly matches the ratio of the overall size of the index to the size of the table it indexes, which almost exactly matches the number of heap tuples pointed to from within each leaf page. The <a href="https://www.postgresql.org/docs/current/static/view-pg-stats.html" target="_blank">logical/physical correlation</a> between index and table must be very close to 1.0. This is good for B-Tree index scans that read through many leaf pages.</div>
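The arithmetic behind those matching ratios can be checked directly (a back-of-the-envelope sketch; the 1,000,000 row count follows from pgbench's 100,000 accounts rows per scale-factor unit, and 8KB is the default page size):

```python
# Back-of-the-envelope check of the ratios described above, using the
# approximate sizes from this example: 128MB table, 21MB index.
PAGE_SIZE = 8192                      # default PostgreSQL block size
table_bytes = 128 * 1024 * 1024      # pgbench_accounts at scale factor 10
index_bytes = 21 * 1024 * 1024       # pgbench_accounts_pkey
rows = 10 * 100_000                  # 100,000 accounts rows per scale unit

heap_pages = table_bytes // PAGE_SIZE            # 16384 heap pages
rows_per_heap_page = rows / heap_pages           # ~61 heap tuples per page

# Each leaf page holds 367 items, so its pointers span about 6 heap
# blocks - the same ~6:1 ratio as table size to index size:
print(round(367 / rows_per_heap_page))           # 6
print(round(table_bytes / index_bytes))          # 6
```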
<ul>
<li>There are no duplicates within pages.</li>
</ul>
<div>
The lack of any duplicates is evidenced by the number of live items (not including the high key) exactly matching the number of distinct real items. They're almost always both 366 here (once again, the rightmost page is the only exception). Note that there is usually one more "live item" than real item (item with a real table pointer). The page high key is counted as a live item by pageinspect, but is not counted as a "real item value" by my query, since it's just metadata. This is a little confusing, but does make sense when considered in the broader context of how PostgreSQL B-Trees work.<br />
<br />
Even though this is a unique index, it might still have physical duplicates at some point in the future. Not right now, though.<br />
<br /></div>
<h3>
Production issues</h3>
I'm not aware that anyone has used a query like this to debug tricky production performance problems with index bloat before now. It could certainly help with that. Bloat can sometimes be localized to one part of an index, and feedback from this query could allow someone to tie it back to a problem in application code.<br />
<br />
The query might also help users provide information on production performance issues to a mailing list like pgsql-performance or pgsql-hackers, for the perusal of hackers like myself. It might be possible to use this kind of feedback to improve how VACUUM and related mechanisms handle index bloat. The query could take a while to execute for larger indexes. It probably wouldn't be unreasonable to run in production at an off-peak time with a moderately large index.<br />
<div>
<br /></div>
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-16858990923960830412016-05-10T11:50:00.001-07:002016-05-10T11:50:37.333-07:00amcheck: Verify the logical consistency of PostgreSQL B-Tree indexesI've created a project page on Github for amcheck, a tool for verifying the logical consistency of PostgreSQL B-Tree indexes:<br />
<br />
<a href="https://github.com/petergeoghegan/amcheck">https://github.com/petergeoghegan/amcheck</a><br />
<br />
The tool is primarily useful for detecting index corruption in production database systems. It can do this with low overhead; most verification requires only a non-disruptive lock on the index as it is verified. The <a href="http://www.postgresql.org/docs/current/static/explicit-locking.html#TABLE-LOCK-COMPATIBILITY" target="_blank">strength of the lock taken</a> on an index as it is verified matches that of simple SELECT statements (unless the highest level of verification is requested). The locking involved will generally not block concurrent reads or writes, and will not prevent VACUUM from running concurrently.<br />
<br />
<a name='more'></a><br />
<br />
amcheck is proposed as a contrib extension for PostgreSQL 9.7. This externally maintained version of the extension exists to support earlier versions of PostgreSQL (PostgreSQL 9.4+), and to make the tool available to those that need it sooner. While the level of verification is not totally comprehensive (in particular, there is no verification of indexes against underlying tables), the tool is still likely to detect many <i>subtle</i> problems in practice.<br />
<br />
amcheck verifies that certain <i>invariants</i> that must hold in the structure of B-Tree indexes actually do, in fact, hold. It's fairly exhaustive. One example of a problem that the tool can detect is inconsistency arising from the recent <a href="https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue" target="_blank">PostgreSQL 9.5 abbreviated keys glibc issue</a>, where the new-to-9.5 abbreviated keys performance optimization could lead to structurally inconsistent indexes due to a bug in some glibc versions. This issue created a need to get amcheck into the hands of users sooner rather than later.<br />
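To get a concrete feel for what an invariant check looks like, here is a toy sketch (a drastically simplified page model, not amcheck's actual code): items within each page must be in order, and no item may exceed the page's high key:

```python
# Toy sketch of two of the invariants a tool like amcheck verifies,
# over a simplified leaf page model: a list of keys plus an optional
# high key acting as an upper bound on everything in the page.

def verify_page(items, high_key=None):
    # Invariant: keys on a page appear in non-decreasing order.
    for prev, cur in zip(items, items[1:]):
        if prev > cur:
            raise AssertionError(f"items out of order: {prev} > {cur}")
    # Invariant: no key exceeds the page's high key (the rightmost page
    # on a level has no high key, represented here as None).
    if high_key is not None:
        for key in items:
            if key > high_key:
                raise AssertionError(f"{key} exceeds high key {high_key}")

verify_page([10, 20, 30], high_key=30)  # consistent page - passes
verify_page([40, 50])                   # rightmost page - passes
```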
<br />
It's not ideal that the tool is maintained externally, since there are <a href="https://github.com/petergeoghegan/amcheck/commit/94087531fbeb5565f07567742f87be31c52514de#diff-bb3c83dac7fda65c54cbe6b682c27df4R875" target="_blank">complex locking protocols</a> involved; the implementation must make sure that there cannot be false positives to be of much practical use, and so the tool ought to be considered whenever there is a question about these locking protocols. Unfortunately, we ran out of time to get amcheck into PostgreSQL 9.6. Technically there is no disadvantage to an externally maintained tool, but in my opinion amcheck should really be maintained alongside the B-Tree index code itself.<br />
<br />
<br />Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-14119907129790284592015-11-14T12:25:00.002-08:002015-11-14T13:03:39.570-08:00Suggesting a corrected column name/spelling in the event of a column misspelling<br />
One small PostgreSQL 9.5 feature I worked on is the new <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e529cd4ffa605c6f14f1391af5559b3a44da0336" target="_blank">hinting mechanism</a>, which sometimes hints, based on a score, at what you might have meant to type after misspelling a column name in an SQL query. The score heavily weighs <a href="https://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank">Levenshtein distance</a>. A <a href="http://www.postgresql.org/docs/current/static/error-style-guide.html" target="_blank">HINT message</a> is sent to the client, which psql and other client tools will display by default.<br />
<br />
It's common to not quite recall offhand if a column name is pluralized, or where underscores are used to break up words that make up the name of a column. This feature is targeted at that problem, providing guidance that allows the user to quickly adjust their query without mental context switching. For example:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">postgres=# select * from orders where order_id = 5;</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ERROR: 42703: column "order_id" does not exist</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">LINE 1: select * from orders where order_id = 5;</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ^</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">HINT: Perhaps you meant to reference the column "orders"."orderid".</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"></span><br />
<a name='more'></a><br />
<div>
<br /></div>
<div>
You may also see a hint in the case of two possible matches, provided both matches have the same score, and the score crosses a certain threshold of assumed usefulness:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderid where order_id = 5;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ERROR: 42703: column "order_id" does not exist</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">LINE 1: ...oin orderlines ol on o.orderid = ol.orderid where order_id =...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ^</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">HINT: Perhaps you meant to reference the column "o"."orderid" or the column "ol"."orderid".</span></div>
</div>
<div>
<br /></div>
<div>
If an alias was used here (which this query must have anyway), the hint becomes more specific:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderid where o.order_id = 5;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">ERROR: 42703: column o.order_id does not exist</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">LINE 1: ...oin orderlines ol on o.orderid = ol.orderid where o.order_id...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ^</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">HINT: Perhaps you meant to reference the column "o"."orderid".</span></div>
</div>
<div>
<br /></div>
<div>
This feature should make writing queries interactively in psql a bit more pleasant. Mental context switching to figure these incidental details out has a tendency to slow things down.</div>
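The distance measure at the heart of that score is simple to reproduce (a textbook dynamic-programming implementation in Python; the server's actual code is in C, and layers its own weights and thresholds on top):

```python
# Textbook dynamic-programming Levenshtein distance, computed row by row.
# The real hinting mechanism uses core PostgreSQL's C implementation.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# "order_id" is a single deletion away from "orderid", so it scores far
# better than an unrelated column name would:
print(levenshtein("order_id", "orderid"))    # 1
print(levenshtein("order_id", "customerid")) # considerably larger
```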
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com2tag:blogger.com,1999:blog-7609611625311126307.post-60640666411528999522015-10-02T11:36:00.001-07:002015-10-02T13:02:54.653-07:00Avoid naming a constraint directly when using ON CONFLICT DO UPDATEPostgreSQL 9.5 will have support for a feature that is popularly known as "UPSERT" - the ability to either insert or update a row according to whether an existing row with the same key exists. If such a row already exists, the implementation should update it. If not, a new row should be inserted. This is supported by way of a <a href="http://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT" target="_blank">new high level syntax</a> (a clause that extends the INSERT statement) that more or less relieves the application developer from having to give any thought to race conditions. This common operation for client applications is set to become far simpler and far less error-prone than <a href="http://www.postgresql.org/docs/devel/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE" target="_blank">legacy ad-hoc approaches to UPSERT involving subtransactions</a>.<br />
<br />
When <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=168d5805e4c08bed7b95d351bf097cff7c07dd65" target="_blank">we worked on UPSERT</a>, many edge-cases were carefully considered. A technique called "unique index inference" allows DML statement authors to be very explicit about what condition they want to take the alternative (UPDATE or NOTHING) path on. That alternative path can only be taken in the event of a would-be duplicate violation in an "arbiter" unique index (for the DO NOTHING variant, a would-be <a href="http://www.postgresql.org/docs/9.5/static/sql-createtable.html#SQL-CREATETABLE-EXCLUDE" target="_blank">exclusion violation</a> is also a possible reason to take the alternative NOTHING path). The ability to write UPSERT statements explicitly and safely while also having lots of flexibility is an important differentiator for PostgreSQL's UPSERT in my view.<br />
<br />
<a name='more'></a><br />
<br />
As the <a href="http://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT" target="_blank">9.5 INSERT documentation explains</a>, the inference syntax contains one or more index_column_name columns and/or index_expression expressions, and perhaps an optional index_predicate (for <a href="http://www.postgresql.org/docs/9.5/static/indexes-partial.html#INDEXES-PARTIAL-EX1" target="_blank">partial unique indexes</a>, which are technically not constraints at all). This is used internally to figure out which of any available unique indexes ought to be considered as an arbiter of taking the alternative path. If none can be found, the optimizer raises an error.<br />
<br />
The inference syntax is very flexible, and very tolerant of variations in column ordering, whether or not a partial unique index predicate is satisfied, and several other things. It can infer multiple unique indexes at a time, which is usually not necessary, but can be in the event of a migration. <a href="http://www.postgresql.org/docs/9.5/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY" target="_blank">CREATE INDEX CONCURRENTLY</a> supports creating unique indexes, and it's easy to imagine someone reasonably having two logically equivalent unique indexes (or equivalent in all the ways that matter to certain UPSERT statements) for a while. Plus, unique indexes cannot be named directly, since they happen to not be cataloged as constraints. We considered that. <i>Use the inference syntax, and you're unlikely to have any problems like this at all</i>.<br />
<br />
DML statement authors must be explicit when writing a statement using DO UPDATE in any case -- omitting some particular condition to take the UPDATE path on is simply disallowed (DO NOTHING does not have this restriction). We also added an escape hatch to name a constraint directly, ON CONFLICT ON CONSTRAINT <constraint_name>. This could be useful for exclusion constraints, <i>but its use is generally discouraged since it does not handle these edge-cases</i>.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-54108984634840545052015-04-04T09:19:00.002-07:002015-04-08T11:18:39.832-07:00Abbreviated keys for numeric to accelerate numeric sortsAndrew Gierth's numeric <a href="http://pgeoghegan.blogspot.com/2015/01/abbreviated-keys-exploiting-locality-to.html" target="_blank">abbreviated keys</a> patch was <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=abd94bcac4582903765be7be959d1dbc121df0d0" target="_blank">committed recently</a>. This commit added abbreviation/sortsupport for the <a href="http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-NUMERIC-DECIMAL" target="_blank">numeric type</a> (the PostgreSQL type which allows practically arbitrary precision, typically recommended for representing monetary values).<br />
<br />
The encoding scheme that Andrew came up with is rather clever - it has an excellent tendency to concentrate entropy from the original values into the generated abbreviated keys in real world cases. As far as accelerating sorts goes, numeric abbreviation is at least as effective as the original text abbreviation scheme. I easily saw improvements of 6x-7x with representative queries that did not spill to disk (i.e. that used quicksort). In essence, the patch makes sorting numeric values almost as cheap as sorting simple integers, since that is often all that is actually required during sorting proper (the abbreviated keys compare as integers, except that the comparison is inverted to comport with how abbreviation builds abbreviated values from numerics as tuples are copied into local memory ahead of sorting - see the patch for exact details).<br />
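The general shape of the technique can be modelled in a few lines (a hypothetical toy encoding, not Andrew's actual scheme: pack the sign, decimal exponent, and leading digits into one machine integer, and consult the full value only on ties):

```python
from decimal import Decimal

# Toy abbreviated-key scheme for arbitrary-precision decimals: a single
# int that preserves ordering for most comparisons. It is lossy on
# purpose - equal abbreviated keys get resolved with the full value.

def abbreviate(d: Decimal) -> int:
    if d == 0:
        return 0
    sign = -1 if d < 0 else 1
    exp = d.adjusted()                    # exponent of the leading digit
    lead = int(abs(d).scaleb(5 - exp))    # first six significant digits
    return sign * ((exp + 1000) * 10**6 + lead)

values = [Decimal("3.1415926535"), Decimal("-0.001"), Decimal("271828.18"),
          Decimal("3.14159"), Decimal("0")]
# Tuple keys compare the cheap int first; the Decimal itself is only
# consulted when two abbreviated keys collide (as the two pi values do):
ordered = sorted(values, key=lambda d: (abbreviate(d), d))
print(ordered == sorted(values))   # True
```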
<br />
<a name='more'></a><br />
<br />
Separately, over lunch at pgConf.US in New York, <a href="http://www.pgconf.us/2014/event/14/" target="_blank">Corey Huinker</a> complained about a slow, routine data warehousing <span style="font-family: Courier New, Courier, monospace;">CREATE INDEX</span> operation that took far too long. The indexes in question were built on a single text column. I suggested that Corey check out how PostgreSQL 9.5 performs, where this operation is accelerated by text abbreviation, often very effectively.<br />
<br />
Corey chose an organic set of data that could be taken as a reasonable proxy for how PostgreSQL behaves when he performs these routine index builds. In all cases <span style="font-family: Courier New, Courier, monospace;">maintenance_work_mem</span> was set to 64MB, meaning that an external tapesort is always required - those details were consistent. This was a table with 18 million rows. Apparently, on PostgreSQL 9.4, without abbreviation, the <span style="font-family: Courier New, Courier, monospace;">CREATE INDEX</span> took 10 minutes and 19 seconds in total. On PostgreSQL 9.5, with identical settings, it took only 51.3 seconds - a 12x improvement! This was a low cardinality pre-sorted column, but if anything that is a less compelling case for abbreviation - I think that the improvements could sometimes be even greater when using external sorts on big servers with fast CPUs. Further organic benchmarks of abbreviated key sorts are very welcome. Of course, there is every reason to imagine that abbreviation would now improve things just as much if not more with large <i>numeric</i> sorts that spill to disk.<br />
<br />
<h3>
Future work</h3>
With numeric abbreviation committed, and support for the "datum" case likely to be committed soon, you might assume that abbreviation as a topic on the pgsql-hackers development mailing list had more or less played out (the "datum" sort case is used by things like "<span style="font-family: Courier New, Courier, monospace;">SELECT COUNT(DISTINCT FOO) ...</span>" - this is Andrew Gierth's work again). You might now reasonably surmise that it would be nice to have support for the <a href="http://www.postgresql.org/docs/current/static/xindex.html#XINDEX-OPCLASS-DEPENDENCIES" target="_blank">default B-Tree opclasses</a> of one or two other types, like <span style="font-family: Courier New, Courier, monospace;">character(n)</span>, but that's about it, since clearly abbreviation isn't much use for complex/composite types - we're almost out of interesting types to abbreviate. However, I think that work on abbreviated keys is far from over. Abbreviation as a project is only more or less complete as a technique to <i>accelerate sorting</i>, but that's likely to only be half the story (Sorry Robert!).<br />
<br />
I intend to undertake research on using abbreviated keys within internal B-Tree pages in the next release cycle. Apart from amortizing the cost of comparisons that are required to service index scans, I suspect that they can greatly reduce the number of cache misses by storing abbreviated keys <a href="http://www.postgresql.org/docs/current/static/storage-page-layout.html#PAGE-TABLE" target="_blank">inline in the ItemId array</a> of internal B-Tree pages. Watch this space!<br />
<br />
<br />Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com3tag:blogger.com,1999:blog-7609611625311126307.post-22000524451030658862015-01-23T16:43:00.000-08:002018-08-31T15:20:45.680-07:00Abbreviated keys: exploiting locality to improve PostgreSQL's text sort performance<br />
On Monday, Robert Haas committed a patch of mine that considerably speeds up the sorting of <span style="font-family: "courier new" , "courier" , monospace;">text</span> in PostgreSQL. This was the last and the largest in a series of such patches, the patch that <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4ea51cdfe85ceef8afabceb03c446574daa0ac23" target="_blank">adds "abbreviated keys"</a>. PostgreSQL 9.5 will have big improvements in sort performance.<br />
<br />
In realistic cases, <span style="font-family: "courier new" , "courier" , monospace;">CREATE INDEX</span> operations on <span style="font-family: "courier new" , "courier" , monospace;">text</span> <a href="http://www.postgresql.org/message-id/CA+TgmoaTiYy9aaMRe7m71Z=mrNZ_aPqepspQHtSNHq8Wiafjow@mail.gmail.com" target="_blank">are <i>over 3 times faster</i></a> than in PostgreSQL 9.4. Not every such utility operation, or data warehousing query involving a big sort is sped up by that much, but <i>many</i> will be.<br />
<br />
This was a piece of work that I spent a considerable amount of time on over the past few months. It's easy to justify that effort, though: sorting <span style="font-family: "courier new" , "courier" , monospace;">text</span> is a very fundamental capability of any database system. Sorting is likely the dominant cost when creating B-Tree indexes, performing <span style="font-family: "courier new" , "courier" , monospace;">CLUSTER</span> operations, and, most obviously, for sort nodes that are required by many plans that are executed in the service of queries with <span style="font-family: "courier new" , "courier" , monospace;">ORDER BY</span> or <span style="font-family: "courier new" , "courier" , monospace;">DISTINCT</span> clauses, or aggregates using the <span style="font-family: "courier new" , "courier" , monospace;">GroupAggregate</span> strategy. Most of the utility statements that need to perform sorts must perform them with a very disruptive lock on the target relation (<span style="font-family: "courier new" , "courier" , monospace;">CREATE INDEX CONCURRENTLY</span> is a notable exception), so quite apart from the expense of the sort, the duration of sorts often strongly influences how long a production system is seriously disrupted.<br />
<br />
<a name='more'></a><br />
<a href="http://pgeoghegan.blogspot.com/2012/08/sorting-improvements-in-postgresql-92.html" target="_blank">My interest in sorting is not new</a>: I first worked on it in 2011. Early research on it back then prompted Robert Haas and Tom Lane to write the <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6e3ac11b60ac4a8942ab964252d51c1c0bd8845" target="_blank">SortSupport infrastructure</a>, which I've now extended here. Originally, the SortSupport infrastructure was all about providing alternative versions of comparators for use in sort routines, versions that avoided certain overhead otherwise inherent to calling functions that are generally accessible from SQL. As a highly extensible system, PostgreSQL requires that sort behavior be <a href="http://www.postgresql.org/docs/current/static/xindex.html#XINDEX-OPCLASS-DEPENDENCIES" target="_blank">defined in terms of a default B-Tree operator class</a>, which is itself defined in terms of SQL operators with underlying SQL-callable functions. These functions are written in C for built-in types, but in principle they could be written in a PL, like <a href="http://www.postgresql.org/docs/devel/static/plpython.html" target="_blank">PL/Python</a>, for example. When the underlying comparator can be expected to compile to just a few CPU instructions, "fmgr elision" (avoiding SQL function call overhead) is important - the time spent in the "fmgr" when not eliding it shows up prominently on profiles with certain types (or rather, it did in the past).<br />
<br />
Note that I <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5ea86e6e65dd2da3e9a3464484985d48328e7fe3" target="_blank">generalized SortSupport to work for more cases</a>, so B-Tree index builds will get a nice little boost in PostgreSQL 9.5, even for types like <span style="font-family: "courier new" , "courier" , monospace;">integer</span> and <span style="font-family: "courier new" , "courier" , monospace;">float8</span>. That's not what this blog post is really about, though. This blog post is about the interesting new direction that the SortSupport infrastructure has been taken in, beyond mere "fmgr elision" - abbreviation.<br />
<br />
A well known problem with the <span style="font-family: "courier new" , "courier" , monospace;">text</span> datatype in PostgreSQL is that it uses the operating system/C standard library <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strcoll.html" target="_blank">strcoll()</a> function to resolve comparisons as tuples are sorted, which is very expensive. It's at least a thousand times more expensive than comparing integers, for example. This general problem is something that <a href="http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html" target="_blank">Robert Haas has expressed concern about</a> in the past.<br />
<br />
The expense relates to a normalization process whereby string comparisons use complex tables to make sure that strings are compared according to the rules of some particular culture or nation (that is, some particular collation associated with a locale). Even in English speaking countries, this is important; for example, the <span style="font-family: "courier new" , "courier" , monospace;">en_US</span> collation considers difference in case (upper case versus lower case) after alphabetical ordering and diacritical differences, so case is considered last of all (the C locale, on the other hand, will sort upper case and lower case strings into two distinct batches, which is typically not desirable). In addition, while English usually doesn't have diacritics, <i>sometimes</i> it does. At work, I'm still <a href="http://help.hipchat.com/forums/138883-suggestions-ideas/suggestions/5163526-wrong-sort-order-c-locale-used-in-people-list" target="_blank">sometimes annoyed by the sort order of the Linux Hipchat client's user list</a>, which uses the C locale. Hi Ómar!<br />
<br />
It was always suspected that we could more effectively amortize the cost of these locale-aware comparisons, by performing a transformation of strings into binary keys using <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strxfrm.html" target="_blank">strxfrm()</a>, and sorting the keys instead (using a <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strcmp.html" target="_blank">strcmp()</a>-based comparator with the keys, which only considers raw byte ordering). This comparison will produce equivalent results to just using <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/strcoll.html" target="_blank">strcoll()</a> directly. But the binary keys are much larger than the original strings - typically almost 4x larger. Moreover, we'd still need to do a tie-breaker strcmp() comparison (to check for strict binary equality) using the original string, when strcoll() reports equality, because the historic idea of equality that the <span style="font-family: "courier new" , "courier" , monospace;">text</span> type offers is strict binary equality. There were <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad" target="_blank">some historic edge cases where a tie-breaker strcmp() was not performed following strcoll() returning '0'</a>, resulting in corrupt B-Tree indexes on a <a href="http://www.siao2.com/2005/11/13/491646.aspx" target="_blank">Hungarian</a> database. strcoll() could return 0 <i>despite not being passed a pair of bitwise-identical strings</i>.<br />
<br />
Having to keep around the original <span style="font-family: "courier new" , "courier" , monospace;">text</span> datum seemed like an additional burden on the whole idea of using strxfrm() blobs as sort keys. It seemed like using binary keys to sort had a lot of promise, but we couldn't quite work out how to exploit that idea - until recently.<br />
<br />
Abbreviated keys were committed:<br />
<br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">Use abbreviated keys for faster sorting of text datums.</span><br />
<br style="background-color: white; font-family: monospace; font-size: small;" />
<span style="background-color: white; font-family: monospace; font-size: x-small;">This commit extends the SortSupport infrastructure to allow operator</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">classes the option to provide abbreviated representations of Datums;</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">in the case of text, we abbreviate by taking the first few characters</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">of the strxfrm() blob. If the abbreviated comparison is insufficent</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">to resolve the comparison, we fall back on the normal comparator.</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">This can be much faster than the old way of doing sorting if the</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">first few bytes of the string are usually sufficient to resolve the</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">comparison.</span><br />
<br style="background-color: white; font-family: monospace; font-size: small;" />
<span style="background-color: white; font-family: monospace; font-size: x-small;">There is the potential for a performance regression if all of the</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">strings to be sorted are identical for the first 8+ characters and</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">differ only in later positions; therefore, the SortSupport machinery</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">now provides an infrastructure to abort the use of abbreviation if</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">it appears that abbreviation is producing comparatively few distinct</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">keys. HyperLogLog, a streaming cardinality estimator, is included in</span><br />
<span style="background-color: white; font-family: monospace; font-size: x-small;">this commit and used to make that determination for text.</span><br />
<br style="background-color: white; font-family: monospace; font-size: small;" />
<span style="background-color: white; font-family: monospace; font-size: x-small;">Peter Geoghegan, reviewed by me.</span><br />
<div>
<br /></div>
It's surprisingly effective to just store the first 8 bytes of a strxfrm() blob, and tie-break relatively infrequently by using a full old-style comparison, rather than the more obvious approach of sorting with pointers to strxfrm()-generated blobs (the approach that the C standard recommends for general purpose text sorting).<br />
<br />
<h3>
Entropy</h3>
<br />
<span style="font-size: small; font-weight: normal;">By storing just the first 8 bytes (on 64-bit platforms; 4 bytes on 32-bit platforms) of the strxfrm() blob in a field that would otherwise contain a pointer-to-text (since <span style="font-family: "courier new" , "courier" , monospace;">text</span> is a pass-by-reference type) - the same <a href="https://en.wikipedia.org/wiki/Type_punning" target="_blank">type-punned</a> field that <i>directly</i> stores the representation of pass-by-value types like <span style="font-family: "courier new" , "courier" , monospace;">integer</span> - comparisons can often be resolved using just those 8 bytes <i>directly, and without</i> pointer-chasing. At the same time, the cost of locale transformations is still quite effectively amortized: as always when using strxfrm(), sorting with binary keys performs the transformation O(n) times, rather than an average of O(n log n) times (the work performed by strxfrm() and strcoll() may not be exactly comparable, but it's close enough).</span><br />
<span style="font-size: small; font-weight: normal;"><br /></span>
<br />
<div>
It turns out that the large binary key blobs produced by strxfrm(), while much larger than the original strings, have a significant concentration of entropy towards the start of the blob (assuming the <a href="http://unicode.org/reports/tr10/#Step_3" target="_blank">use of the Unicode collation algorithm</a>, or an algorithm with similar properties). This is because the representation consists of a series of "levels". The "primary weights", which appear first, represent primary alphabetical ordering when using Latin scripts. So whitespace differences and punctuation differences are not represented at that level (nor are differences in case). For accented Latin characters, for example, diacritics are represented at a subsequent level, and so the abbreviated key representation typically won't vary if accents are added to or removed from a <span style="font-family: "courier new" , "courier" , monospace;">text</span> datum. This is important because languages that use accents extensively, like French or Spanish, will get a concentration of entropy in their 8 byte abbreviated keys that's about the same as if no accents were used, even though accented code points usually take 2 bytes of storage in UTF-8, rather than the 1 byte taken by unaccented Latin alphabet code points.</div>
<div>
<br /></div>
<h3>
Locality</h3>
<div>
<br /></div>
A more general problem with sort performance is the problem of cache misses. My earlier work on sorting targeted pass-by-value PostgreSQL types like <span style="font-family: "courier new" , "courier" , monospace;">integer</span> and <span style="font-family: "courier new" , "courier" , monospace;">float8</span>. These pass-by-value types naturally have great locality of reference. Their comparisons are integral operations, which are fast, but operating on a tightly packed representation is what makes sorting integers with Quicksort perhaps as fast as is practically possible for a comparison-based sort. Cache miss penalties are likely to be the dominant cost on modern CPUs, which are <a href="http://users.ece.cmu.edu/~omutlu/pub/mutlu_memory-scaling_imw13_invited-talk.pdf" target="_blank">more bottlenecked on memory bandwidth and latency in every successive generation</a> (PDF); sorting infrastructure must weigh this heavily.<br />
<br />
When I initially discussed the idea of abbreviated keys, there was a certain degree of skepticism from other Postgres hackers. What if most comparisons are not resolved by abbreviated comparisons, due to <span style="font-family: "courier new" , "courier" , monospace;">text</span> datums with a lot of redundant or repeated information at the beginning? Could all the strxfrm() work go to waste when that happens? Well, for one thing, low cardinality sets (tuples with <span style="font-family: "courier new" , "courier" , monospace;">text</span> columns that have a relatively low number of distinct values) are not a problem. That's because strcoll() is still a huge cost, and if we can have our authoritative tie-breaker comparator observe that the strings are identical, then no strcoll() comparison is ever needed - we can just exit early with a simple, cheap opportunistic binary comparison (memcmp()), which is almost as good. But what about when there are many different strings with differences towards the end of the string, past the 8th or so byte?<br />
<br />
CPU cache characteristics have presented complicated engineering trade-offs for sorting infrastructure for a long time. Database luminary Jim Gray proposed an abbreviation-like technique as early as 1994, in his <a href="http://www.vldb.org/journal/VLDBJ4/P603.pdf" target="_blank">AlphaSort paper</a> (PDF). He describes a "key-prefix sort" in the paper. Even back in 1994, Gray observed that memory latency was the dominant cost by a wide margin. The underlying trends in CPU performance characteristics have continued apace since then. Before his death in 2007, Gray officiated the <a href="http://sortbenchmark.org/" target="_blank">Sort Benchmark</a>. Among the rules for the "Daytona sort" category, which concerns the sort performance of general-purpose algorithms (which is what I'm interested in), it states that <a href="http://sortbenchmark.org/FAQ-2014.html#indy" target="_blank">Daytona sort entrants "must not be overly dependent on the uniform and random distribution of key values in the sort input"</a>. It's almost as if Gray was saying: "of course I expect you to use abbreviated keys, but don't push your luck!". And so it is for PostgreSQL. Some cases benefit much more than others, and some cases might even be slightly regressed.<br />
<h3>
<br class="Apple-interchange-newline" />Merge Joins</h3>
<div>
<br /></div>
An earlier piece of work for 9.5 <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e246b3d6eac09d0770e6f68e69f2368d02db88af" target="_blank">had conventional comparisons (not involving abbreviated keys) always try an opportunistic memcmp() tie-breaker</a>. This is likely to be particularly beneficial for merge joins (quite apart from any sort node that may feed the merge join), since they must "synchronize" relations using comparisons that can often be expected to indicate equality. Multi-column sorts on text are also considerably accelerated, where many leading column comparisons can be expected to indicate equality. It's also important for abbreviated keys, because as already mentioned we can still win big with low cardinality sets provided the full tie-breaker comparisons are resolved with a cheap memcmp().<br />
<br />
<h3>
Insurance</h3>
<br />
Fundamentally, when you do a cost/benefit analysis, abbreviated keys are very compelling. The upsides are clearly very large, and the break-even point for switching to using abbreviation is surprisingly far out. We <i>cannot</i> ignore the performance benefits of these techniques because some much rarer cases will be slightly regressed. But, as it happens, we have cheap worst case insurance: <a href="https://en.wikipedia.org/wiki/HyperLogLog" target="_blank">HyperLogLog</a> is used to cheaply and fairly accurately check the cardinality of both abbreviated keys and the original text values. If abbreviated cardinality is an effective proxy for full cardinality, then most comparisons will either use abbreviated comparisons, or use a cheap memcmp() tie-breaker, which is almost as good. Otherwise, we abort abbreviation before the sort proper is underway.<br />
<br />
<h3>
Future</h3>
<div>
<br /></div>
<div>
Abbreviated keys are just infrastructure. While text is the most compelling case, there are at least a few other datatypes that would greatly benefit from support for abbreviation. These include:</div>
<div>
<br /></div>
<div>
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>numeric</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>character(n)</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>uuid</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>bytea</strike></span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">citext</span> (case insensitive text, from <span style="font-family: "courier new" , "courier" , monospace;">contrib/citext</span>)</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">inet</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">cidr</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;"><strike>macaddr</strike></span></li>
</ul>
</div>
<b>Update:</b> A patch for numeric sortsupport with abbreviation <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=abd94bcac4582903765be7be959d1dbc121df0d0" target="_blank">was committed</a>!<br />
<br />
I welcome others with an interest in making sorting faster to work on the relevant opclass support for each of these types, and possibly others. Other people may be able to come up with novel encoding schemes for these types, that maximize the entropy within the finished abbreviated keys. Order-preserving compression is likely to be an area where text's support could be improved, by making comparisons resolved at the abbreviated key level more frequent. Hopefully the benefits of the abbreviated key infrastructure will not be limited to accelerating sorts on <span style="font-family: "courier new" , "courier" , monospace;">text</span>.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com10tag:blogger.com,1999:blog-7609611625311126307.post-32222464295633515212014-03-23T18:39:00.000-07:002014-06-20T11:46:46.268-07:00What I think of jsonbUnsurprisingly, there has been a lot of <a href="http://obartunov.livejournal.com/177247.html" target="_blank">interest in the jsonb type</a>, which <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d9134d0a355cfa447adc80db4505d5931084278a" target="_blank">made it into the upcoming 9.4 release of Postgres</a>. I was initially a reviewer of jsonb, although since I spent weeks polishing the code, I was ultimately credited as a co-author.<br />
<br />
Jsonb is a new datatype for Postgres. It is distinct from the <a href="http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json/" target="_blank">older json datatype</a> in that its internal representation is <i>binary</i>, and in that it is internally typed. It also makes sophisticated nested predicates within queries on jsonb <i>indexable</i>. I've occasionally described the internally-typed scalar values as having “shadow types” unknown to the core SQL parser. This has several implications. For example, if you sort two Jsonb values containing only scalar numbers, the implementation invokes the <a href="http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-NUMERIC-TABLE" target="_blank">numeric</a> comparator (which the jsonb default B-Tree opclass comparator is defined in terms of). The on-disk representation of jsonb includes the same representation as is used for, say, numerics (as the internal binary representation of JSON primitive numbers, for example). Plus, JSON objects are de-duplicated by key on input, and optimized for cheap binary searches within a single jsonb. Still, like the earlier json type, jsonb in every sense “speaks JSON”. There are some limitations on what can be represented as a jsonb number, but those are exactly the same limitations that apply to the core numeric type (plus some limitations imposed by the JSON RFC, such as not accepting NaN values). I hope it suffices to say that these limitations are virtually irrelevant, and that many implementations have similar or worse limitations. All of these minor implementation-defined restrictions are explicitly anticipated and allowed for by the recent <a href="http://rfc7159.net/rfc7159" target="_blank">JSON RFC-7159</a>.<br />
<br />
<a name='more'></a><br />
<br />
Jsonb is emphatically <i>not</i> like the BSON format used by MongoDB. That format accepts input in such a way as to be backwards compatible with JSON, but I believe that BSON isn't really a practical interchange format, because the software development community at large is presumably disinclined to buy into an interchange format that as yet is not described by any RFC, or any communiqué of a recognized standards body. In contrast, jsonb is a datatype that will only ever output valid textual JSON, and will only ever accept valid textual JSON (subject to the aforementioned obscure and practically irrelevant restrictions, and the caveat on automatically normalizing duplicate-keyed pairs within the same object). Jsonb also imposes an internal ordering on object pairs. Again, this is all anticipated and allowed for by the JSON RFC.<br />
<br />
It's possible that I'm mistaken, and that BSON or something else will emerge as an actual standard (either de facto or de jure), since I've heard that there is support in the works for database systems other than MongoDB. It's not impossible that pursuing something like BSON might be an interesting future direction for Postgres, since for one thing BSON supports more than the 4 standard JSON primitive types. In any case it's important to note that the protocol or on-disk binary representation of jsonb is an implementation detail; we're not in competition with BSON, and this isn't a new standard. It's just a new <i>Postgres</i> datatype, with new indexing capabilities. I think it's notable that BSON <i>doesn't</i> have a JSON-style universal number type. It has <a href="https://en.wikipedia.org/wiki/BSON" target="_blank">32-bit and 64-bit integer types, and double precision 64-bit IEEE 754 floating point numbers</a>. It strikes me that this omission <a href="http://www.exploringbinary.com/why-0-point-1-does-not-exist-in-floating-point/" target="_blank">tells me all I need to know about binary interchange formats</a>.<br />
<br />
To understand how the jsonb type works in more detail, <a href="http://www.postgresql.org/docs/devel/static/datatype-json.html" target="_blank">I suggest taking a look at the devel documentation</a>. It's worth taking a close look at <a href="http://www.postgresql.org/docs/devel/static/datatype-json.html#JSON-CONTAINMENT" target="_blank">containment semantics</a>, since that's the really compelling way of searching through jsonb documents.<br />
<br />
<h4>
Strategic significance</h4>
<div>
<br /></div>
There has been a little back and forth among senior community members about the significance of jsonb. <a href="http://www.databasesoup.com/2014/02/why-hstore2jsonb-is-most-important.html" target="_blank">Josh Berkus wrote</a> that he thought it was the most important 9.4 feature. <a href="http://rhaas.blogspot.com/2014/03/postgresql-now-has-logical-decoding.html" target="_blank">Robert Haas was skeptical</a>, preferring the logical decoding stuff. I've even seen one or two people in the comments section of various news articles grumble about Postgres jumping on the JSON bandwagon.<br />
<br />
I have to admit that relatively speaking, jsonb is not in and of itself all that technically complex. While it is a great feature, and while I think it puts Postgres in a very competitive position relative to certain other systems, it would be almost trivial to ship a jsonb extension that works with earlier versions of Postgres. However, without taking a position on what the best 9.4 feature is going to be (I like both jsonb and logical decoding, and contributed in various ways to both), I think that it's possible that Josh Berkus and Robert Haas are both more or less right at the same time, and their apparent disagreement reflects only their individual priorities for Postgres.<br />
<br />
It is very much to the credit of the principal jsonb authors, Oleg Bartunov and Teodor Sigaev, that with some help from Andrew Dunstan and myself they managed to define what I think is internally a solid nested, strongly-typed format for us to build on, with a textual output representation that just so happens to be the same one that has emerged as a standard for this kind of thing. But, to me, as a Postgres hacker, their previous work – and the previous work of Alexander Korotkov (who, due to an unfortunate oversight, was not credited in the jsonb commit message) – is the real story here. As the authors of GIN, Oleg and Teodor are perhaps most responsible for the foundation on which jsonb is built, a foundation built over many years. Alexander's excellent recent work on improving the GIN access method (with help in various areas from Heikki Linnakangas), which also made it into 9.4, is probably what will end up making jsonb really shine. This includes <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=36a35c550ac114caa423bcbe339d3515db0cd957" target="_blank">compression of GIN posting lists</a>, <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=dbc649fd773e7e16458bfbec2611bf15f4355bc4" target="_blank">speeding up "rare & frequent" type GIN queries</a>, <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e20c70cb0fa74d5bffa080e21a99b44bf0768667" target="_blank">multi-key GIN search skipping</a>, and <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=626a120656a75bf4fe64b1d0d83c23cb38d3771a" target="_blank">further optimizations to multi-key searching</a>. 
Alexander had earlier <a href="https://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf" target="_blank">reported some very impressive improvements in PostgreSQL full-text search performance as a result of all of this</a>, with performance apparently competing with that of external systems like Sphinx and Solr. It is likely that many of the same big performance improvements seen there concomitantly benefit the jsonb GIN opclasses.<br />
<h4>
<br class="Apple-interchange-newline" />jsonb_path_ops</h4>
<div>
<br /></div>
Having said that, Alexander's <a href="http://www.postgresql.org/docs/devel/static/datatype-json.html#JSON-OPCLASS" target="_blank">jsonb_path_ops alternate GIN operator class</a>, which was his contribution to the big jsonb patch, deserves an honorable mention. By combining GIN with hashing of either key/value pairs, or array elements, the resulting indexes can give great performance for sophisticated “containment” type queries against JSON documents. Indexes are a fraction of the size of the data indexed, index scans are incredibly fast, and yet these GIN indexes make very complex nested “containment” queries indexable. The results are so impressive that at last November's pgConf.EU conference, an <a href="http://www.sai.msu.su/~megera/postgres/talks/hstore-dublin-2013.pdf" target="_blank">EXPLAIN ANALYZE comparative example in a presentation</a> given by Oleg and Teodor was <a href="http://momjian.us/main/blogs/pgblog/2013.html#November_1_2013" target="_blank">greeted with sustained applause</a>.<br />
<br />
I'm really pleased that we worked towards making all of this as beneficial as possible to the largest possible number of people, but even as it puts Postgres in a very competitive position with respect to some non-relational systems, jsonb does not really represent any kind of pivot towards Postgres as a document store – <a href="http://www.postgresql.org/about/history/" target="_blank">Postgres has always been an object-relational system</a>. Rather, I think it is one particular outcome of a much bigger process that has been underway for many years.<br />
<br />
I'll watch the <a href="http://www.pgcon.org/2014/schedule/events/696.en.html" target="_blank">future development of the "VODKA" index access method</a> with interest, because at this early stage it is my understanding that it's intended to make searching nested, heterogeneous structures more flexible and better performing still. It seems likely that there will be a number of other applications for that infrastructure too, since like <a href="http://www.postgresql.org/docs/current/static/gist.html" target="_blank">GiST</a>, <a href="http://www.postgresql.org/docs/current/static/gin.html" target="_blank">GIN</a>, and <a href="http://www.postgresql.org/docs/current/static/spgist.html" target="_blank">SP-GiST</a>, it is intended to be an extensible infrastructure that serves many analogous needs in a general way.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com6tag:blogger.com,1999:blog-7609611625311126307.post-75062119210874522372013-01-21T10:29:00.000-08:002013-01-21T10:29:22.749-08:00Moving onToday was my last day at 2ndQuadrant.<br />
<br />
The experience of working with 2ndQuadrant in the last couple of years has been very positive. I just decided it was time for a change. Being able to work on interesting problems during my time at 2ndQuadrant, both as a Postgres developer and as a consultant, has been great fun, and very personally rewarding. I would like to acknowledge the invaluable support of both Simon Riggs and Greg Smith - thank you both. I wish the entire 2ndQuadrant staff all the best.<br />
<br />
I expect to remain active as a Postgres developer, and already have plans to relocate to work for another company that is well known within the community.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-20386472654641784772012-12-04T02:40:00.001-08:002012-12-04T08:57:24.151-08:00Finding plans in pg_stat_plans easily with pg_find_plansAs I<a href="http://pgeoghegan.blogspot.com/2012/10/first-release-of-pgstatplans.html" target="_blank"> recently blogged about</a>, <a href="http://www.2ndquadrant.com/en/pg_stat_plans/" target="_blank">pg_stat_plans</a> is a PostgreSQL satellite project I've been working on that aims to support earlier versions of Postgres that cannot use the new pg_stat_statements, and to track execution costs at the plan rather than the query granularity. It allows the user to easily explain each stored query text to see the plan for the entry, and has features that facilitate monitoring planner regressions.<br />
<br />
Since PostgreSQL 9.0, support for machine-readable EXPLAIN output has existed. I'm not aware that anyone else got around to actually doing something interesting with this capability, though. I knew that in order to get the most benefit from pg_stat_plans, it ought to be possible to leverage this capability to search for plans based on arbitrary criteria, directly from SQL.<br />
<br />
I've written an experimental submodule of pg_stat_plans, called <a href="https://github.com/2ndQuadrant/pg_stat_plans/tree/master/pg_find_plans" target="_blank">pg_find_plans</a>, that is designed to do just that - to quickly find plans and their execution costs, for those plans that, say, <a href="https://github.com/2ndQuadrant/pg_stat_plans/blob/master/pg_find_plans/samples.sql" target="_blank">perform a sequential scan on a known large table</a>.<br />
<br />
<a name='more'></a><br />
Here's the description of pg_find_plans from its documentation:<br />
<br />
<blockquote class="tr_bq">
<div style="background-color: white; border: 0px; color: #333333; font-family: Helvetica, arial, freesans, clean, sans-serif; font-size: 14px; line-height: 22px; margin-bottom: 15px; margin-top: 15px; padding: 0px;">
pg_find_plans is written in PL/Python and PL/PgSQL. It is intended to provide users with a better way to ask questions like "what are the execution costs of all plans tracked since last statistics reset that involve a sequential scan against <tt style="background-color: #f8f8f8; border-bottom-left-radius: 3px; border-bottom-right-radius: 3px; border-top-left-radius: 3px; border-top-right-radius: 3px; border: 1px solid rgb(234, 234, 234); font-family: Consolas, 'Liberation Mono', Courier, monospace; font-size: 12px; margin: 0px 2px; padding: 0px 5px;">mytable</tt>, and have more than 2 joins?". That might be written as:</div>
<pre style="background-color: #f8f8f8; border-bottom-left-radius: 3px; border-bottom-right-radius: 3px; border-top-left-radius: 3px; border-top-right-radius: 3px; border: 1px solid rgb(204, 204, 204); color: #333333; font-family: Consolas, 'Liberation Mono', Courier, monospace; font-size: 13px; line-height: 19px; margin-bottom: 15px; margin-top: 15px; overflow: auto; padding: 6px 10px;">mydb=# select
join_count(json_plan),
p.*
from
pg_stat_plans p
join
stored_plans sp on (p.userid=sp.userid and p.dbid=sp.dbid and p.planid=sp.planid)
where
from_our_database
and
join_count(json_plan) > 2
and
contains_node(json_plan, 'Seq Scan', 'mytable')
order by
1 desc nulls last;
</pre>
<div style="background-color: white; border: 0px; color: #333333; font-family: Helvetica, arial, freesans, clean, sans-serif; font-size: 14px; line-height: 22px; margin-bottom: 15px; margin-top: 15px; padding: 0px;">
Users should have a high degree of confidence that their queries on a plan's structure are free of detectable errors, and pg_find_plans ensures this by carefully sanitising user input. For example, if the node of interest was specified as 'seq scan' above, the query would raise an error - to do any less might result in a false sense of security about the actual costs of plans that sequentially scan the table <tt style="background-color: #f8f8f8; border-bottom-left-radius: 3px; border-bottom-right-radius: 3px; border-top-left-radius: 3px; border-top-right-radius: 3px; border: 1px solid rgb(234, 234, 234); font-family: Consolas, 'Liberation Mono', Courier, monospace; font-size: 12px; margin: 0px 2px; padding: 0px 5px;">mytable</tt>, since the implementation might then naively ignore sequential scan nodes, as a case-sensitive comparison is used internally. In general, making the interface hard to use incorrectly is even more important than making it easy to use correctly.</div>
<div style="background-color: white; border: 0px; color: #333333; font-family: Helvetica, arial, freesans, clean, sans-serif; font-size: 14px; line-height: 22px; margin-bottom: 15px; margin-top: 15px; padding: 0px;">
Strictly speaking, pg_find_plans is nothing more than a simple set of functions for storing JSON explain texts of plans that appear as pg_stat_plans entries into a dedicated table, and subsequently parsing those plans to answer interesting questions using SQL. However, pg_find_plans is a module that is likely to make pg_stat_plans much more useful than it might otherwise be. pg_find_plans is by no means feature complete or especially polished. The author's ambitions for the tool are described under "limitations" below.</div>
</blockquote>
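The actual pg_find_plans implementation isn't shown here, but the kind of recursive traversal of EXPLAIN (FORMAT JSON) output that a helper like contains_node implies can be sketched in Python (the function and field names below follow Postgres's JSON EXPLAIN output; the sketch simply returns False for a wrongly-cased node type, whereas the real function raises an error):

```python
import json

def contains_node(plan_json, node_type, relation=None):
    """Return True if an EXPLAIN (FORMAT JSON) plan tree contains a node
    of the given type, optionally restricted to one scanned relation.

    The comparison is deliberately case-sensitive: 'Seq Scan' matches,
    while 'seq scan' silently matches nothing."""
    def walk(node):
        if node.get("Node Type") == node_type and (
                relation is None or node.get("Relation Name") == relation):
            return True
        return any(walk(child) for child in node.get("Plans", []))

    doc = json.loads(plan_json) if isinstance(plan_json, str) else plan_json
    # EXPLAIN (FORMAT JSON) emits a list of {"Plan": ...} objects
    return any(walk(entry["Plan"]) for entry in doc)

sample = """[{"Plan": {"Node Type": "Sort", "Plans":
    [{"Node Type": "Seq Scan", "Relation Name": "mytable"}]}}]"""
```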
<br />
pg_find_plans is <a href="https://github.com/2ndQuadrant/pg_stat_plans/tags" target="_blank">distributed with pg_stat_plans, version 1.0 beta 3</a>, as a submodule. I'm reasonably confident that there will be a stable release of pg_stat_plans soon.<br />
<br />
While there are some problems with using a query text as a proxy for a plan that was once produced by that query text, these cases are handled reasonably well, though the "limitations" section of the pg_find_plans documentation should be understood by users. Still, pg_find_plans exists mostly to "test the waters" for a better-principled implementation. It remains to be seen just how much demand there is for this kind of functionality.<br />
<br />
I must say that working on this gave me a new-found appreciation for JSON as a data-interchange format - it dawned on me just why some people consider the Postgres 9.2 JSON datatype so compelling a feature. The last time I needed to write some code that used a lowest common denominator interchange format, that format was the ludicrously verbose XML. Having a format that maps almost perfectly onto scripting language data structures cut down on the amount of boilerplate required considerably. JSON interacts well with Python's dynamic, strong typing, because a piece of JSON data almost looks like a declaration of a nested Python data structure, and can be fairly easily made to be manipulated as one too.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com0tag:blogger.com,1999:blog-7609611625311126307.post-76422055871333693212012-11-16T04:31:00.000-08:002012-11-16T05:44:48.389-08:00Notes on index-only scansOne of the most important performance features in Postgres 9.2 is index-only scans: the ability for certain types of queries to be performed without retrieving data from tables, potentially greatly reducing the amount of I/O needed. I recently completely overhauled the <a href="https://wiki.postgresql.org/wiki/Index-only_scans" target="_blank">Index-only scans PostgreSQL wiki page</a>, so that the page is now targeted at experienced PostgreSQL users that hope to get the most out of the feature.<br />
<div>
<br /></div>
<div>
My apologies to the authors of the feature, Robert Haas, Ibrar Ahmed, Heikki Linnakangas and Tom Lane, if my handling of the topic seems to focus on the negatives. Any reasonable article about <i>any</i> given index-only scan implementation would have to extensively discuss that implementation's limitations. Any discussion of Postgres index-only scans that focussed on the positives would be much shorter, and would essentially just say: "Index-only scans can make some of your queries go much faster!".</div>
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com2tag:blogger.com,1999:blog-7609611625311126307.post-70994287603066132552012-10-20T07:53:00.000-07:002012-10-20T08:05:42.779-07:00First release of pg_stat_plansAnyone who attended my recent talk at <a href="http://postgresopen.org/" target="_blank">Postgres Open</a>, which was co-presented with my 2ndQuadrant colleague Greg Smith, <a href="https://www.postgresql.eu/events/schedule/pgconfeu2012/session/358-beyond-query-logging/" target="_blank">"Beyond Query Logging"</a>, will be aware that <a href="http://www.postgresql.org/docs/current/static/pgstatstatements.html" target="_blank">pg_stat_statements</a>, the standard contrib module that assigns execution costs to queries and makes them available from a view in the database, has been improved considerably in the recent 9.2 Postgres release. It has been improved in a way that we believe will alter the preferred approach to workload analysis on PostgreSQL databases away from log analysis tools, which just don't offer the performance, flexibility or granularity of this new approach.<br />
<br />
We also announced a new open source tool that addresses a related but slightly different problem (the analysis of <i>plan</i> execution costs, and planner regressions), as well as making most of the benefits of pg_stat_statements on 9.2 available to users stuck on earlier versions of Postgres. This new tool is called pg_stat_plans, and is itself based on pg_stat_statements.<br />
<br />
<a name='more'></a><br />
The 9.2 pg_stat_statements feature of particular importance - the ability to "normalise" the non-prepared statements that the large majority of applications use exclusively - is now brought to earlier versions (9.0 and 9.1, though pg_stat_plans works fine on 9.2 too). Since pg_stat_plans fingerprints plans rather than query trees, the way this works is slightly different to pg_stat_statements, and perhaps doesn't quite match people's intuitive expectations about how normalisation ought to behave in some cases. These differences have been extensively documented.<br />
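The net effect of normalisation can be illustrated with a crude textual approximation - the real pg_stat_statements fingerprints the query tree and pg_stat_plans fingerprints the plan, so their notion of "the same query" is far more robust than this regex sketch:

```python
import re

def normalise(query):
    """Crude textual approximation of query normalisation: replace
    string literals (handling doubled '' escapes), then bare numeric
    literals, with the '?' placeholder."""
    query = re.sub(r"'(?:[^']|'')*'", "?", query)
    return re.sub(r"\b\d+(?:\.\d+)?\b", "?", query)

print(normalise("UPDATE pgbench_tellers SET tbalance = tbalance + 42 WHERE tid = 7"))
# -> UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?
```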
<br />
pg_stat_plans also has the ability to <a href="http://www.postgresql.org/docs/current/static/sql-explain.html" target="_blank">EXPLAIN</a> a stored, representative SQL text, in order to facilitate deeper analysis of plan execution costs. Plan total_cost and startup_cost are tracked over time for each plan, for example, so that the "crossover point" at which the planner begins to prefer an alternative plan can sometimes be observed, and the planner's "reasoning" can perhaps be better understood.<br />
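Under the planner's simple linear cost model, that crossover point can even be solved for directly. The numbers below are purely hypothetical, chosen only to illustrate the idea:

```python
def crossover_rows(startup_a, per_row_a, startup_b, per_row_b):
    """Row count beyond which plan B (higher startup cost, cheaper per
    row) overtakes plan A, under a linear cost model:
        cost(n) = startup + per_row * n"""
    if per_row_a <= per_row_b:
        return None  # plan B never catches up
    return (startup_b - startup_a) / (per_row_a - per_row_b)

# hypothetical costs: plan A is free to start but costs 4 units per
# row; plan B pays 100 units up front but only 2 units per row
print(crossover_rows(0, 4, 100, 2))  # -> 50.0
```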
<br />
pg_stat_plans is distributed under the PostgreSQL licence. I'd originally hoped to offer plan fingerprinting within pg_stat_statements itself, and said as much at <a href="https://wiki.postgresql.org/wiki/PgCon_2012_Developer_Meeting">the PostgreSQL developer's meeting in May</a>, but I ultimately felt that due to the demand for this on earlier Postgres versions, the Postgres community would be best served by having the module as a satellite project. I'm naturally willing to accept third-party contributions through GitHub's "pull request" mechanism.<br />
<br />
I am pleased to announce the first release of pg_stat_plans, 1.0 beta 1.<br />
<br />
Those of you who missed the talk in Chicago can catch it at next week's <a href="http://2012.pgconf.eu/" target="_blank">PostgreSQL Conference Europe</a> in Prague. See you there!<br />
<br />
Official pg_stat_plans page: <a href="http://www.2ndquadrant.com/en/pg_stat_plans">http://www.2ndQuadrant.com/en/pg_stat_plans</a>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com5tag:blogger.com,1999:blog-7609611625311126307.post-41121461213217683872012-08-01T19:04:00.002-07:002012-08-02T05:29:43.158-07:00Sorting improvements in PostgreSQL 9.2: the case for micro-optimisationThere has been much discussion of performance improvements in the upcoming 9.2 release of PostgreSQL. Recently, I noticed that Regina Obe and Leo Hsu's new book, <a href="http://shop.oreilly.com/product/0636920025061.do">"PostgreSQL: Up and running"</a> prominently listed "Sorting improvements that improve in-memory sorting operations by as much as 20%" as a performance feature of that release. While they do get things about right there, I'm not sure that this improvement warrants such prominent placement, at least in sheer terms of its likely impact on the performance of production PostgreSQL systems - we packed <i>a lot</i> of great performance improvements into 9.2. The likely reason that it was picked up on in the book, and the real reason for this blogpost, is the story behind the development of the optimisation, which I for one find kind of interesting, and worth sharing. It's more interesting from the perspective of someone with a general interest in systems programming or PostgreSQL's design philosophy than a casual user, though. If you're a casual user, the short version is that simple queries that perform in-memory sorting of integers and floats will be about 23% faster.<br />
<br />
<a name='more'></a><br />
I wrote a rough prototype of the patch that embodied a number of ideas, and proved the viability of the approach. Principal among those ideas was <i>specialisation</i> of the quicksort code: formatting the code such that the compiler had compile-time knowledge of the comparison functions, with inlining used as an enabling optimisation, and a few variations of the sort routine produced. So rather than using complex indirection involving function pointers, a macro infrastructure was used to generate multiple specialisations, allowing the compiler to optimise the code more effectively as a result of being able to integrate everything.<br />
<br />
A secondary problem was that comparators (i.e. the comparison functions that all sorting within Postgres currently needs) were accessed in a roundabout way.<br />
<br />
Roughly speaking, tuplesort (the part of the code that deals with sorting tuples, perhaps as part of a query's execution, or perhaps as the first step in creating a new index) is very generic code. Code very similar to <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/qsort.html">the C standard library's qsort()</a> is used directly in 9.1 - our particular implementation was lifted from <a href="http://www.netbsd.org/">NetBSD</a>, as it happens. This qsort function is passed a function pointer. That function pointer pointed to a "tupleclass encapsulating comparator" (i.e. the comparator differs for "heap" (table) tuples and index tuples). This comparator in turn calls the "comparator proper" for each and every sortkey (i.e. ORDER BY column) to be sorted. The indirection doesn't stop there though.<br />
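The layering described above can be sketched in Python. The names here are illustrative rather than Postgres's own, but the structure is the same: a generic per-tuple comparator loops over the sort keys, calling the datatype-specific "comparator proper" for each one:

```python
from functools import cmp_to_key

# "comparator proper": the datatype-specific comparison
def int_cmp(a, b):
    return (a > b) - (a < b)

def text_cmp(a, b):
    return (a > b) - (a < b)

# "tupleclass encapsulating comparator": loops over the sort keys,
# invoking the comparator proper for each ORDER BY column in turn
def make_tuple_comparator(sortkeys):
    def tuple_cmp(t1, t2):
        for col, cmp, descending in sortkeys:
            r = cmp(t1[col], t2[col])
            if r != 0:
                return -r if descending else r
        return 0
    return tuple_cmp

# ORDER BY col0 ASC, col1 DESC
rows = [(1, "b"), (2, "a"), (1, "a")]
cmp_fn = make_tuple_comparator([(0, int_cmp, False), (1, text_cmp, True)])
rows.sort(key=cmp_to_key(cmp_fn))
print(rows)  # -> [(1, 'b'), (1, 'a'), (2, 'a')]
```

In 9.1, every one of these hops was an indirect call through a function pointer (or worse, as the next paragraph explains), which the compiler could not see through.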
<br />
Postgres comparators ("comparator propers" - the actual, datatype-specific functions that Postgres uses for sorting) can be called from SQL. While they are written in C in the case of all built-in types, in 9.1, we still accessed the comparators through <a href="http://doxygen.postgresql.org/fmgr_8h_source.html">the SQL function call machinery</a>, rather than a direct function call (or function call through a function pointer). This isn't ordinarily a big deal - we do use this in some other performance critical codepaths, such <a href="http://www.postgresql.org/docs/devel/static/indexam.html">as the index access method code</a> - PostgreSQL doesn't know anything about indexes other than what this "abstract interface" tells it, even when interacting with btree indexes. In this way, a module author can define a whole new index type (I don't just mean implementing a new indexing scheme based on GiST/GIN - I <i>mean</i> a <i>whole new index type</i>). Though not a big deal there, in a tight, frequently executed loop that repeatedly calls a comparator consisting of just one or two CPU instructions, the overhead of this fmgr trampoline starts to matter a lot.<br />
<br />
Tom Lane and Robert Haas fixed this problem. My work inspired their development of the <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c6e3ac11b60ac4a8942ab964252d51c1c0bd8845">SortSupport infrastructure</a> as a first phase/patch towards realising the full potential of these ideas; this is essentially just a way of letting datatype authors provide alternative versions of comparators that can be accessed more directly, through function pointers. Because Postgres has a long tradition of being highly extensible, it was necessary to generalise this entirely, so that even third-party module authors could use the infrastructure. It was likely to only be immediately useful for a few built-in datatypes like floats and integers, but it's hard to predict how people may choose to use this - there was some talk of novel applications of SortSupport.<br />
<br />
In the latter phase, we worked on producing the actual specialisations of the sort code. For heap tuples, we specialise on single sort keys and multiple sort keys, producing variant quicksort code for each. The "tupleclass encapsulating comparator" (the comparator for heap tuples in this case) is inlined. This is actually the more valuable of the two optimisations, counter-intuitive though that is.<br />
<br />
Why should simple inlining make such a large difference? Inlining certainly isn't just about eliminating the function call overhead (though there is that too); in fact, it's generally much more important as an <i>enabling transformation</i>. Inlining may enable <a href="http://en.wikipedia.org/wiki/Loop-invariant_code_motion">loop-invariant code motion</a>, <a href="http://en.wikipedia.org/wiki/Dead_code_elimination">dead code elimination</a>, or <a href="http://en.wikipedia.org/wiki/Induction_variable">induction variable elimination</a>. As a broad principle, the more the compiler knows, the greater leeway it has to apply its optimisations. These optimisations are not to be sniffed at - for quicksorting, a simple, isolated test case can show sorting 2.5 - 3 times faster when inlining of comparators occurs, rather than accessing comparators through function pointers (this indirection is necessary to support the interface of the C standard library's <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/qsort.html">qsort()</a>).<br />
<br />
Just to see where things stand after all of this, I performed this benchmark, where we measure the transactions per second after 45 seconds of executing the same simple ORDER BY query as many times as possible within a single session. I do this both with and without client overhead to see how important a factor that is (basically, no tuples are actually returned when we don't measure the overhead, and yet the underlying execution costs are the same - we do this by appending "OFFSET 10001" to the query).<br />
<pre class="SCREEN" style="-webkit-box-shadow: rgb(223, 223, 223) 3px 3px 5px; background-color: white; border-bottom-left-radius: 8px; border-bottom-right-radius: 8px; border-top-left-radius: 8px; border-top-right-radius: 8px; border: 1px solid rgb(207, 207, 207); box-shadow: rgb(223, 223, 223) 3px 3px 5px; font-size: 12px; margin-bottom: 2ex; margin-left: 2ex; margin-top: 2ex; padding: 2ex; text-align: left;"><span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># no client overhead, 9.1:</b>
[peter@peterlaptop tests]$ pgbench -f sort_no.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 12159
tps = 270.180571 (including connections establishing)
tps = 270.205386 (excluding connections establishing)
<span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># no client overhead, 9.2:</b>
[peter@peterlaptop tests]$ pgbench -f sort_no.sql -T 10
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 3687
tps = 368.683077 (including connections establishing)
tps = 368.890864 (excluding connections establishing)
<span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># client overhead, 9.1:</b>
[peter@peterlaptop tests]$ pgbench -f sort_o.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 7304
tps = 162.301377 (including connections establishing)
tps = 162.316128 (excluding connections establishing)
<span style="background-color: white;">[peter@peterlaptop tests]</span>$ <b># client overhead, 9.2:</b>
[peter@peterlaptop tests]$ pgbench -f sort_o.sql -T 45
starting vacuum...end.
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 8977
tps = 199.470582 (including connections establishing)
tps = 199.494630 (excluding connections establishing)</pre>
<br />
Boiling that down further:<br />
<br />
<table border="1">
<tbody>
<tr>
<td></td>
<td><b>Postgres 9.1</b></td>
<td><b>Postgres 9.2</b></td>
<td><b>Difference</b></td>
</tr>
<tr>
<td><b>No Client overhead</b></td>
<td>270.205 TPS</td>
<td>368.890 TPS</td>
<td><b>+ 36.5%</b></td>
</tr>
<tr>
<td><b>Client overhead</b></td>
<td>162.316 TPS</td>
<td>199.494 TPS</td>
<td><b>+ 22.9%</b></td>
</tr>
</tbody>
</table>
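As a sanity check, the percentages in the table follow directly from the TPS ratios:

```python
# TPS figures from the table above (excluding connection establishment)
no_client = 368.890 / 270.205 - 1  # no client overhead
client = 199.494 / 162.316 - 1     # with client overhead
print(f"{no_client:.1%}, {client:.1%}")  # -> 36.5%, 22.9%
```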
<br />
<div>
So that's an improvement in throughput of about 23% for this sympathetic, though representative query.<br />
<br />
To quote Linus Torvalds, "To some degree people say that you should not micro-optimise, but if you love micro-optimisation, that's what you should do". I don't know that I love micro-optimisation, but it can certainly be well worthwhile to perform re-jiggering, or specialisation, or lowering of cache miss rates, at least in the case of an important infrastructure project's innermost loops.<br />
<br />
I was happy that I was ultimately able to overcome objections about the possible distributed costs of creating specialisations - this can result in "binary bloat". Each specialisation needs to independently justify the resulting increase in object file size. In general it's very difficult to argue that the trade-off represented by any particular specialisation is generally worth it. There was a protracted discussion of the merits of doing all of this. <i>That</i> was the factor that made it stick out in people's minds, I suspect.</div>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com6tag:blogger.com,1999:blog-7609611625311126307.post-47482713942153487222012-06-04T11:00:00.001-07:002013-05-27T14:37:22.961-07:00Towards 14,000 write transactions per second on my laptop<div>
Postgres 9.2 will have many improvements to both read and write scalability. <a href="http://database-explorer.blogspot.com/">Simon Riggs</a> and I collaborated on a performance feature that greatly increased the throughput of small write transactions. Essentially, it accomplishes this by reducing the lock contention surrounding an internal lock called <a href="http://doxygen.postgresql.org/lwlock_8h_source.html#l00047"><span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span></a>. When an individual backend/connection holds this lock, it is empowered to write WAL from <span style="font-family: 'Courier New', Courier, monospace;"><a href="http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-BUFFERS">wal_buffers</a></span>, an area of shared memory that temporarily holds WAL until it is written, and ultimately flushed to persistent storage.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLprHrrUJif45SZWzOjy963pusQSLKL5ox7cm6Ku-jwMYq40i05XGLsllH5DqVDKN0lzyZ74eT8tZy7R2z3POuDk3Dd6Js5iPNo7niSO9qed8HnWgmg5WInzw9u_Y1J6HstuUQDZqcX2XR/s1600/new_group_commit.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLprHrrUJif45SZWzOjy963pusQSLKL5ox7cm6Ku-jwMYq40i05XGLsllH5DqVDKN0lzyZ74eT8tZy7R2z3POuDk3Dd6Js5iPNo7niSO9qed8HnWgmg5WInzw9u_Y1J6HstuUQDZqcX2XR/s400/new_group_commit.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Original <b>update.sql</b> "new group commit" benchmark, January 2012. This made it into Postgres 9.2. Here, we compare the performance of my original patch (red line) and Postgres master in January (green line). 9.1 performance on this benchmark would probably be very similar to that of the baseline seen here.</td></tr>
</tbody></table>
<br />
<br />
<a name='more'></a></div>
<div>
<br /></div>
<div>
With this patch, we don’t have the backends queue up for the <span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span> to write their WAL as before. Rather, they either immediately obtain the <span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span>, or else queue up for it. However, when the lock becomes available, no waiting backend actually immediately acquires the lock. Rather, each backend once again checks if WAL has been flushed up to the LSN that the transaction being committed needs to be flushed up to. Oftentimes, they will find that this has happened, and will be able to simply fastpath out of <a href="http://doxygen.postgresql.org/xlog_8c.html#ae146649e2c12d20c14a4fa9a4988e90e">the function that ensures that WAL is flushed</a> (a call to that function is required to honour transactional semantics). In fact, it is expected that only a small minority of backends (one at a time, dubbed “the leader”) will actually ever go through with flushing WAL. In this manner, we batch commits, resulting in a really large increase in throughput, as you can tell from the diagram above.</div>
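In very rough outline, the logic can be modelled like this. The sketch below is a deliberately simplified, single-threaded toy (the real code involves WALWriteLock queueing, re-checks after the lock wait, and careful LSN handling), meant only to show why one leader's flush satisfies many commits:

```python
class WALFlusher:
    """Toy model of the 9.2 group-commit logic: a backend whose commit
    LSN is already flushed skips the lock entirely, and the backend
    that does the flush (the "leader") flushes everything pending, so
    followers find their work already done."""

    def __init__(self):
        self.flushed_up_to = 0     # LSN flushed to persistent storage
        self.pending_up_to = 0     # highest LSN sitting in wal_buffers
        self.physical_flushes = 0  # how many real flushes happened

    def note_commit(self, lsn):
        self.pending_up_to = max(self.pending_up_to, lsn)

    def flush_up_to(self, lsn):
        if self.flushed_up_to >= lsn:
            return  # fastpath: someone else already flushed our WAL
        # (in Postgres, the backend would queue for WALWriteLock here,
        # and re-check the flushed LSN once the lock became available)
        self.physical_flushes += 1
        self.flushed_up_to = max(self.flushed_up_to, self.pending_up_to)

w = WALFlusher()
for lsn in range(1, 11):   # ten transactions commit...
    w.note_commit(lsn)
for lsn in range(1, 11):   # ...and each asks for its WAL to be flushed
    w.flush_up_to(lsn)
print(w.physical_flushes)  # -> 1: one leader flushed for all ten
```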
<div>
<br /></div>
<div>
In Postgres 9.2, this improvement automatically becomes available without any further configuration. This was one of the subjects of my recent talk, co-presented with <a href="http://blog.2ndquadrant.com/author/greg-smith/">Greg Smith</a> at PgCon 2012, <a href="http://www.pgcon.org/2012/schedule/events/453.en.html">“A Batch of Commit Batching”</a>. </div>
<div>
<br /></div>
<div>
There is some confusion about the semantics of group commit. Previously, the project sometimes referred to two settings - <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> and <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span> - as offering a group commit implementation. Certainly, the <i>intent</i> of those settings was to allow the DBA to trade off latency for throughput, which characterises the group commit feature of some well known proprietary RDBMSs (though not, I believe, any of the various MySQL flavours). The patch that we worked on for 9.2 did not add arbitrary latency at any point in the hope of increasing throughput, and was technically totally orthogonal to <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> and <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span>.</div>
<div>
<br /></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> was always something that we didn't really have much practical use for. It merely adds a latency of <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> microseconds just before each transaction goes to commit, provided there are at least <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span> concurrent transactions in progress, in the hope that, by the time each backend goes to flush WAL immediately after the delay, some other backend will already have flushed the sequential WAL stream past that backend's "flush up-to" point. When this happens, the backend can fastpath out of the function, often without ever having to acquire <span style="font-family: 'Courier New', Courier, monospace;">WALWriteLock</span>.</div>
<div>
<br /></div>
<div>
It is intuitively obvious that these are not hugely useful settings, generally just as likely to hurt performance as help it, and they were only really retained because they could improve throughput in certain narrow benchmarks. After all, whatever “certain other” backend was supposed to flush everyone else’s WAL was probably delayed too. With the new group commit implementation, the setting became more marginal than ever, and seemed to have next to no positive effect on commit speed. I argued for its removal in a dedicated thread on pgsql-hackers, and Greg and I were vocal in calling for its deprecation during our talk.</div>
<div>
<br /></div>
<div>
I subsequently realised that if we were to do that, there was still an uncomfortable tension. Even though <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> is almost completely ineffective, its purported purpose - to allow the DBA to trade off transaction latency to increase the throughput of the server as a whole - is still a quite reasonable use-case, and a use-case that we were not serving well, with or without <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span>. To give a good example of when throughput is more important than latency, Google’s F1 specialist relational database, built to replace their legacy AdWords MySQL cluster, very explicitly prioritises throughput over latency. What’s more, as I’ve already mentioned, this trade-off seems to be integral to the group commit implementation of a particularly expensive, proprietary RDBMS. It occurred to me that <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> really doesn’t work with the new group commit. It is an entirely separate piece of code, whose usefulness was never revisited in light of new group commit, that sometimes simply adds a sleep.<br />
<br />
I then had what turned out to be a valuable insight: by delaying only within the leader backend, and not within all connection backends, we can much more effectively stagger transactions so that they coincide and can be flushed together. What previously required serendipitous timing - which, if botched, could have the opposite effect to that desired - could now be mostly ensured. The leader alone would wait the full <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> before proceeding with flushing, but followers would continue to get in the queue behind the leader for that much longer. 
This was found to further dramatically improve transaction throughput, particularly in cases that the original “new group commit” did not do so well on, such as workloads where the server didn’t have so many connections, or transactions were not single-statement write statements. It also made a respectable dent even at the highest client counts:<br />
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzy35rf13gRpe03WeYzPHEkFIB8Sa653wGHJK0dBilMAYcxPPBJVnjGyqu7HKOeO2rGnETwWoTS40RVxqy9rkNJsdlYUy4CDauRzxnDfBhHenFCdpSY8EBIprEYFaQNEmvNHqDJHjzP9ti/s1600/insert_w_master_delay.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzy35rf13gRpe03WeYzPHEkFIB8Sa653wGHJK0dBilMAYcxPPBJVnjGyqu7HKOeO2rGnETwWoTS40RVxqy9rkNJsdlYUy4CDauRzxnDfBhHenFCdpSY8EBIprEYFaQNEmvNHqDJHjzP9ti/s400/insert_w_master_delay.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b>insert.sql</b> benchmark. Importantly, the patch to adjust commit_delay (green line) shows by far the largest improvements over master without a commit_delay (red line) at lower client counts that are more representative of the real world. In this example, setting commit_delay to the same value, 4,000, <i>without</i> the still-not-in-postgres adjustment/patch to commit_delay behaviour (blue line) actually hurts performance a bit.</td></tr>
</tbody></table>
</div>
<div>
<div class="separator" style="clear: both; text-align: -webkit-auto;">
<br /></div>
<blockquote class="tr_bq">
<b>Update</b>: The <span style="font-family: Courier New, Courier, monospace;">commit_delay</span> patch has since been committed, and will be available in Postgres 9.3.</blockquote>
<br />
Interestingly, a <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> of around 4,000 (microseconds) seems about optimal for the above benchmark, performed on my laptop; that is roughly half of the raw <span style="font-family: 'Courier New', Courier, monospace;"><a href="http://pubs.opengroup.org/onlinepubs/7908799/xsh/fdatasync.html">fdatasync()</a></span> (my <span style="font-family: 'Courier New', Courier, monospace;"><a href="http://www.postgresql.org/docs/current/static/runtime-config-wal.html#GUC-WAL-SYNC-METHOD">wal_sync_method</a></span>, the default for Linux) speed for the <a href="http://archives.postgresql.org/pgsql-hackers/2012-01/txt3Ezlyo8ZTK.txt">7200 RPM disk</a> in my Thinkpad, as reported by <a href="http://www.postgresql.org/docs/current/static/pgtestfsync.html">pg_test_fsync</a>. The below benchmark is for pgbench’s tpc-b benchmark, and compares against a 9.2 HEAD/new group commit baseline, which did not improve nearly so much as a result of our initial work on this:<br />
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN4fGVIWAiEzmoesm4Son0na6fqBntQSIDn8LH91cYxJZKeS8UJ3HDETu3s67Pe7Hgdq4NaCsKuYyfYdEnR7Wou0tGgIWChjJTcxAzaF7_ofixhCrVIIrZO7AGAqMPIvHeMG7E5hTHz2Z9/s1600/tpc_b_delay.png" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><span style="font-size: x-small;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN4fGVIWAiEzmoesm4Son0na6fqBntQSIDn8LH91cYxJZKeS8UJ3HDETu3s67Pe7Hgdq4NaCsKuYyfYdEnR7Wou0tGgIWChjJTcxAzaF7_ofixhCrVIIrZO7AGAqMPIvHeMG7E5hTHz2Z9/s400/tpc_b_delay.png" width="400" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Less dramatic, but still rather compelling improvements with the new commit_delay patch (green line) for a <b>tpc-b.sql</b> workload, as compared to setting commit_delay to 0 (red line), which is representative of master. This workload was previously not helped much by "new group commit".</td></tr>
</tbody></table>
<div style="text-align: center;">
<br /></div>
<br />
Experimentation shows that a <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> of around 4,000 <i>remains</i> roughly optimal here, even though the tpc-b transactions do rather more than the single-insert statement of the insert.sql benchmark’s transactions, indicating that <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> now shows some promise as a value that can be set in a well-principled way, <i>without</i> having to optimise for some synthetic workload to see these kinds of improvements, or having to do something else so baroque as to make the whole feature completely useless.<br />
<br />
Thanks to <a href="http://4caast.morfeo-project.org/">the 4CaaSt project</a> for funding this recent research into <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span> and <span style="font-family: 'Courier New', Courier, monospace;">commit_siblings</span>.</div>
<div>
<br /></div>
<div>
Unfortunately, I missed the 9.2 deadline for submitting this small but critical adjustment to <span style="font-family: 'Courier New', Courier, monospace;">commit_delay</span>. For that reason, you’ll have to wait for 9.3 to benefit from it.</div>
Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com2tag:blogger.com,1999:blog-7609611625311126307.post-38723856898480694872012-03-29T06:02:00.006-07:002012-08-05T11:11:56.649-07:00Much improved statement statistics coming to Postgres 9.2There is a tendency for people with an interest in improving databases performance to imagine that it mostly boils down to factors outside of their application - the hardware, operating system configuration, and database settings. While these are obviously crucially important, experience suggests that in most cases, by far the largest gains are to be had by optimising the application’s interaction with the database. Doing so invariably involves analysing what queries are being executed in production, their costs, and what the significance of the query is to the application or business process that the database supports.<br />
<br />
PostgreSQL has had a module available in contrib since version 8.4 - <a href="http://www.postgresql.org/docs/current/static/pgstatstatements.html">pg_stat_statements</a>, originally developed by Takahiro Itagaki. The module blames execution costs on queries, so that bottlenecks in production can be isolated to points in the application. It does so by providing a view that is continually updated, giving real-time statistical information. Here is an example from the Postgres 9.2 docs:<br />
<br />
<a name='more'></a><br />
<pre class="SCREEN" style="-webkit-box-shadow: rgb(223, 223, 223) 3px 3px 5px; background-color: white; border-bottom-color: rgb(207, 207, 207); border-bottom-left-radius: 8px; border-bottom-right-radius: 8px; border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(207, 207, 207); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(207, 207, 207); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(207, 207, 207); border-top-left-radius: 8px; border-top-right-radius: 8px; border-top-style: solid; border-top-width: 1px; box-shadow: rgb(223, 223, 223) 3px 3px 5px; font-size: 12px; margin-bottom: 2ex; margin-left: 2ex; margin-top: 2ex; padding-bottom: 2ex; padding-left: 2ex; padding-right: 2ex; padding-top: 2ex; text-align: left;">bench=# SELECT pg_stat_statements_reset();
$ pgbench -i bench
$ pgbench -c10 -t300 bench
bench=# \x
bench=# SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
-[ RECORD 1 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
calls | 3000
total_time | 9.60900100000002
rows | 2836
hit_percent | 99.9778970000200936
-[ RECORD 2 ]---------------------------------------------------------------------
query | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls | 3000
total_time | 8.015156
rows | 2990
hit_percent | 99.9731126579631345
-[ RECORD 3 ]---------------------------------------------------------------------
query | copy pgbench_accounts from stdin
calls | 1
total_time | 0.310624
rows | 100000
hit_percent | 0.30395136778115501520
-[ RECORD 4 ]---------------------------------------------------------------------
query | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
calls | 3000
total_time | 0.271741999999997
rows | 3000
hit_percent | 93.7968855088209426
-[ RECORD 5 ]---------------------------------------------------------------------
query | alter table pgbench_accounts add primary key (aid)
calls | 1
total_time | 0.08142
rows | 0
hit_percent | 34.4947735191637631</pre>
<br />
This is an extremely useful feature. However, its usefulness has historically been highly limited by the fact that it differentiated queries based solely on <i>their query string</i>. This effectively limited the use of the module to code that mostly or entirely used prepared statements. If non-prepared statements were used, what you ended up with was something that looked like this:<br />
<pre class="SCREEN" style="-webkit-box-shadow: rgb(223, 223, 223) 3px 3px 5px; background-color: white; border-bottom-color: rgb(207, 207, 207); border-bottom-left-radius: 8px; border-bottom-right-radius: 8px; border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(207, 207, 207); border-left-style: solid; border-left-width: 1px; border-right-color: rgb(207, 207, 207); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(207, 207, 207); border-top-left-radius: 8px; border-top-right-radius: 8px; border-top-style: solid; border-top-width: 1px; box-shadow: rgb(223, 223, 223) 3px 3px 5px; font-size: 12px; margin-bottom: 2ex; margin-left: 2ex; margin-top: 2ex; overflow-x: auto; overflow-y: auto; padding-bottom: 2ex; padding-left: 2ex; padding-right: 2ex; padding-top: 2ex; text-align: left;">-[ RECORD 1 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>5</b> WHERE bid = <b>3</b>;
calls | 1
total_time | 0.002
rows | 1
hit_percent | 99.9778970000200936
-[ RECORD 2 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>4343</b> WHERE bid = <b>42</b>;
calls | 1
total_time | 0.0044
rows | 1
hit_percent | 99.9778970000200936
-[ RECORD 3 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>-2329</b> WHERE bid = <b>4543</b>;
calls | 1
total_time | 0.003
rows | 1
hit_percent | 99.9778970000200936
-[ RECORD 4 ]---------------------------------------------------------------------
query | UPDATE pgbench_branches SET bbalance = bbalance + <b>9005</b> WHERE bid = <b>7392</b>;
calls | 1
total_time | 0.005
rows | 1
hit_percent | 99.9778970000200936
</pre>
Pretty soon, the shared memory area that stores these statements is filled with a great many entries, each perhaps differing only in the constants used in a single execution of the query, and with no way to aggregate the information to usefully inform optimisation efforts.<br />
<br />
Why not just always use prepared statements? Well, some applications do. Prepared statements have historically been well supported by JDBC, for example, so Java apps tended to be okay. However, there are practical reasons to avoid them. Prepared statements can easily suffer performance regressions compared to equivalent unprepared versions, since the optimizer won't necessarily be able to use as much statistical data (most common values and so on) as in the regular case. There are also limitations in some popular ORMs; Rails' popular ActiveRecord ORM, for example, only recently added prepared statement support. Ad-hoc queries are generally never prepared either. The traditional solution was log-parsing utilities, which analyse log files after the fact and output reports. I have found some of these tools to be rather awkward, with high resource requirements that often necessitate running log analysis on a separate server to process what can be hundreds of megabytes a day of verbose log output. Postgres logs are naturally very verbose when every single query's execution is logged.<br />
<br />
Enter pg_stat_statements normalisation. Last night, Tom Lane <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7313cc016344a5705eb3e6916d8c4ea849c57975">committed</a> the following patch of mine, which was sponsored by <a href="https://postgres.heroku.com/">Heroku</a>:<br />
<br />
<blockquote class="tr_bq">
Improve contrib/pg_stat_statements to lump "similar" queries together.<br />
<br />
pg_stat_statements now hashes selected fields of the analyzed parse tree to assign a "fingerprint" to each query, and groups all queries with the same fingerprint into a single entry in the pg_stat_statements view. In practice it is expected that queries with the same fingerprint will be equivalent except for values of literal constants. To make the display more useful, such constants are replaced by "?" in the displayed query strings.<br />
<br />
This mechanism currently supports only optimizable queries (SELECT, INSERT, UPDATE, DELETE). Utility commands are still matched on the basis of their literal query strings.<br />
<br />
There remain some open questions about how to deal with utility statements that contain optimizable queries (such as EXPLAIN and SELECT INTO) and how to deal with expiring speculative hashtable entries that are made to save the normalized form of a query string. However, fixing these issues should require only localized changes, and since there are other open patches involving contrib/pg_stat_statements, it seems best to go ahead and commit what we've got.<br />
<br />
Peter Geoghegan, reviewed by Daniel Farina</blockquote>
<br />
I would like to acknowledge the invaluable assistance of both Tom and Daniel in bringing this project to maturity.<br />
<br />
I have benchmarked this feature, and found that <a href="http://pgbenchstatstatements.staticloud.com/">it implies an overhead of about 1% - 2.5% on pgbench's default TPC-B style workload</a>. This is rather good, and only marginally worse than the hit taken when using prepared statements with classic pg_stat_statements.<br />
<br />
As if that wasn't good enough, pg_stat_statements has been made even more useful by the efforts of other people for the upcoming 9.2 release. A <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b4f346611431361339253203d486789e4babb02">patch by Ants Aasma</a> was recently committed to expose I/O timings at the query granularity through pg_stat_statements. <a href="http://rhaas.blogspot.com/">Robert Haas</a> wrote <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2254367435fcc4a31cc3b6d8324e33c5c30f265a">a patch</a> to expose blocks dirtied and written by statements through pg_stat_statements too.<br />
<br />
I expect the uptake of pg_stat_statements to really climb after the release of Postgres 9.2.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com7tag:blogger.com,1999:blog-7609611625311126307.post-18541083918990891942012-01-28T16:00:00.000-08:002012-01-30T10:08:29.084-08:00Power consumption in Postgres 9.2<div><b id="internal-source-marker_0.7957186095882207"><span style="font-family: inherit;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">One of the issues of major concern to CPU vendors is optimising the power consumption of their devices. In a world where increasingly, computing resources are purchased in terms of fairly abstract units of work, and where, when selecting the location of a major data-centre, the local price of a kilowatt hour is likely to be weighed just as heavily as the wholesale price of bandwidth, this is quite understandable. </span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Globally, data centres consumed between 1.1 and 1.5 percent of all electricity in 2010 (Source: <a href="http://www.analyticspress.com/datacenters.html">Koomey</a>). The economic and ecological importance of minimizing that number is fairly obvious.</span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></span></b><br />
<a name='more'></a><b id="internal-source-marker_0.7957186095882207"><span style="font-family: inherit;"><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The broad trend towards more and more computing being performed within large data centres, on consolidated infrastructure, and sold as a service rather than a product, is undeniable. Of course, the term “cloud computing” is often applied to this phenomenon. That’s a term that I try to avoid, as it’s fairly ambiguous. </span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><span style="font-family: inherit;"><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">There has been considerable effort to reduce wake-ups when idle in software in general, including everything from web browsers to word processors, which is related to the increasing importance of mobile and embedded platforms. However, this effort is most pronounced among developers of software that is expected to be deployed in virtualised environments on many servers, as wake-ups prevent CPUs from entering various idle states that allow them to save electricity, and when these wake-ups are multiplied by thousands of VM instances, they add up very quickly.</span></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">As part of 4CaaSt, a research project funded by the European Commission's Seventh Framework programme that brings together members of industry and academia with the collective goal of producing an innovative </span><span style="font-style: italic; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">platform-as-a-service</span><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> offering, I spent time reducing the idle wake-ups per second in PostgreSQL. Postgres services firm 2ndQuadrant, where I work as a database architect, has had the development of several PostgreSQL features sponsored by 4CaaSt in furtherance of that goal, of which this is only one.</span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Historically, PostgreSQL has been weak in this particular area. With a standard Postgres server, with no special configuration, I have measured the wake-ups when idle at 11.5 per second, using <a href="http://www.lesswatts.org/projects/powertop/">Intel’s powertop utility</a>, as of the current 9.1 release. This was thought to be unacceptably high for 4CaaSt, for other solutions that leverage virtualisation extensively, and for embedded systems too.</span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">CPUs have a number of methods of reducing power consumption. These are specified by the ACPI standard (which covers discoverability, configuration and power-management), which, in case you hadn't heard, is an open specification that makes minimal assumptions about the architecture or platform in use, and was written to help authors of operating system kernels.</span></span></b><br />
<span style="font-family: inherit; white-space: pre-wrap;"> </span></div><div><span style="font-family: inherit;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Briefly, ACPI describes the following states (I’ve avoided mentioning other states that have more to do with thin</span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">gs like managing laptop hibernation):</span></span><br />
<ul><b><span style="font-family: inherit;"> </span></b></ul><b><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b><br />
<ul><li style="list-style-type: disc; text-decoration: none; vertical-align: baseline;"><b><span style="font-family: inherit; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-weight: normal;">Performance states </span>P0<span style="font-weight: normal;"> through to </span>PN<span style="font-weight: normal;"> (i.e. the exact number of states is implementation-defined). Dynamic CPU frequency scaling states. This might be better known under marketing names for specific implementations, like “Intel SpeedStep technology”. Ever notice how the frequency reported for your CPU under /proc/cpuinfo varies from one moment to the next on Linux? This is why! This state tends to be a bit sticky, in that it might take a few seconds to observe changes in frequency, as it is increased to meet demand. </span></span></b></li>
<b><span style="font-family: inherit;"> </span></b></ul><b><span style="font-family: inherit;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></span></b><br />
<ul><li style="list-style-type: disc; text-decoration: none; vertical-align: baseline;"><span style="font-family: inherit;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Processor states </span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C0</span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> through to </span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C3</span><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. Processors will change this state very quickly, and we basically want to keep this as high as possible, as higher values are associated with using less power.</span></span></li>
<b><span style="font-family: inherit;"> <ul><li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C0 is the operating state.</span></li>
<li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C1, or the halt state, is a state where the processor is not executing instructions, but can return to an executing state essentially instantaneously.</span></li>
<li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C2, or the Stop-Clock state, is a state where the processor maintains all software-visible state, but may take longer to wake up. </span></li>
<li style="font-weight: normal; list-style-type: circle; text-decoration: none; vertical-align: baseline;"><span style="text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C3, or the sleep state, is a state where the processor does not need to maintain cache coherency, but does maintain some other state. There can even be graduations of how deep a sleep this state represents, depending on the implementation - the Intel Core i5 chip in my laptop has a C4 state, for example.</span></li>
</ul></span></b></ul><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Postgres has a multi-process architecture, which includes at a minimum a number of “auxiliary processes”: processes that each perform a single, well-defined task across the installation. There is also a process associated with each connection, and an autovacuum daemon. Out-of-the-box, you’ll see just the following processes, once the PostgreSQL server becomes idle:</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Postmaster</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. A “clearing-house” process that manages all other processes, and is minimally exposed to installation-wide failures, so that it has a good chance of recovering the server in the event of an unanticipated failure. To simulate such a failure, you can kill another auxiliary process, and watch as the postmaster starts it again.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Background writer</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. A process that is charged with writing out “dirty”, or unwritten buffers, in the hope of preventing individual connection backends from ever having to.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">WAL Writer</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. A process that writes out WAL, log files that describe changes made to data in PostgreSQL databases. This is part of a whole subsystem through which the server efficiently maintains its crash-safety/durability guarantees. </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Autovacuum launcher</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. This process notices if there is a need to vacuum dead rows, which are an artifact of the Postgres MVCC implementation. It launches autovacuum worker processes as needed, to perform this garbage collection.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Statistics collector</span><span style="vertical-align: baseline;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;">. This process collects statistics on tables and queries, both to guide how autovacuum </span></span><span style="white-space: pre-wrap;">apportions</span> work to vacuum dead rows and to build more detailed statistics for the planner, as well as for general instrumentation.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Checkpointer (new to 9.2)</span><span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. This process is responsible for managing checkpoints - smoothed writing of all data to disk, so that WAL files that describe those changes in sequence before a certain point can finally be truncated. This used to be an additional responsibility of the background writer.</span></div><div><span style="font-family: inherit;"><span style="white-space: pre-wrap;"> </span> <span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The reason that all these wake-ups had to occur within each auxiliary process was that they needed to check very regularly if the Postmaster was still alive, or if they had work assigned to them. If they took too long to notice that the Postmaster was dead (a major failure that necessitates all processes immediately exiting), they would take too long to detach from shared memory, which would prevent the DBA from starting a new instance, as Postgres will refuse to start when it notices this to avoid data corruption.</span></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The solution was to amend the latch - a low-level facility for sleeping until an event occurs, already used for synchronous replication - so that it also monitors Postmaster death. This infrastructure was committed first. I then proceeded to write patches for each auxiliary process, most recently the background writer, which was particularly tricky, though it accounted for most of the wake-ups when idle among auxiliary processes - usually 5 per second.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="vertical-align: baseline;"><span style="font-family: inherit; white-space: pre-wrap;">Some considerable progress has been made. Additional variability has been added to the number of wake-ups per second, but if you monitor the wake-ups per second using powertop at a sufficiently high granularity, it stabilises at:</span></span><br />
<span style="vertical-align: baseline;"><span style="font-family: 'Courier New', Courier, monospace; white-space: pre-wrap;"> </span></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 3.8% ( 35.0) SignalSender</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 3.0% ( 27.2) [kernel scheduler] Load balancing tick</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 2.8% ( 25.6) kworker/0:0</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><span style="font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 0.8% ( 7.6) postgres</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 0.6% ( 5.7) [TLB shootdowns] <kernel IPI></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 0.6% ( 5.6) [kernel core] hrtimer_start (tick_sched_ti</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">To give you some notion of how this relates to CPU states, this is an account of the time my laptop’s CPU spends in each of the states at one moment in time, according to powertop:</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Cn Avg residency P-states (frequencies)</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C0 (cpu running) ( 1.5%) Turbo Mode 3.0%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">polling 0.0ms ( 0.0%) 2.00 Ghz 0.1%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C1 mwait 1.0ms ( 1.3%) 1.80 Ghz 0.1%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C2 mwait 1.5ms ( 1.8%) 1200 Mhz 0.2%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C3 mwait 1.4ms ( 0.4%) 800 Mhz 96.7%</span><br />
<span style="font-family: 'Courier New', Courier, monospace; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">C4 mwait 7.9ms (95.0%)</span><br />
<span style="font-family: inherit; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<b><span style="font-family: inherit; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">There is still some more work to do though. Simon Riggs and I submitted a patch to add group commit to PostgreSQL, which is being reviewed in the ongoing commitfest. This feature is anticipated to be very valuable to workloads that are bound by their commit rate, and a number of benchmarks that have been performed are very promising. That patch included support for allowing the WAL Writer to sleep. However, the exact details of group commit’s implementation have yet to be agreed upon, and it is not yet completely clear how effectively we will be able to reduce the WAL writer's idle wake-ups. However, I am hopeful that we will be able to eliminate them entirely, bringing the total number down to 2.6 per second for an idle Postgres 9.2 installation with standard settings. The WAL writer, much like the background writer, accounts for a relatively large 5 wake-ups per second (assuming default settings), and is similarly a bit tricky to adjust in this way.</span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: inherit; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I’d previously measured the idle wake-ups per second for my distro’s mysqld at 2.2 (mysql-server version 5.1.56, Fedora 14), though when I check now, with mysql-server 5.5.19 on Fedora 16, that’s way up at consistently over 20 wake-ups per second. I’m not sure why that might be, but I welcome input as to what a fair, objective comparison would look like. I have made every effort to be fair here, and I'd speculate that this may have something to do with the storage engine in use in each case.</span></div>Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com1tag:blogger.com,1999:blog-7609611625311126307.post-55522486508494767082011-08-06T13:14:00.000-07:002012-08-05T11:14:03.699-07:00Clang now builds Postgres without additional warningsI'm happy to report that as of this evening, Clang builds PostgreSQL without any warnings, apart from a single remaining warning that also occurs when building with GCC, which is actually a bug in GNU Flex that the Flex developers don't seem to want to fix. On GCC 4.6, the warning looks like this:<br />
<br />
In file included from gram.y:12962:0:<br />
scan.c: In function ‘yy_try_NUL_trans’:<br />
scan.c:16246:23: warning: unused variable ‘yyg’ [-Wunused-variable]<br />
<br />
With Clang, however, it looks like this:<br />
<br />
<b>scan.c:16246:23: <span class="Apple-style-span" style="color: magenta;">warning:</span> unused variable 'yyg' [-Wunused-variable]</b><br />
struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be unused depending upon options. */<br />
<span class="Apple-style-span" style="color: lime;">^</span><br />
<span class="Apple-style-span">Note that the "</span><span class="Apple-style-span" style="color: lime;">^</span><span class="Apple-style-span">" is directly underneath the offending variable "yyg" on the terminal emulator that generated this warning.</span><br />
<div>
<br />
<a name='more'></a><br /></div>
Note also that Clang usefully gives the context of the warning, and as a result the displayed comment itself suggests that the warning is spurious.<br />
<br />
The Clang developers finally <a href="http://llvm.org/viewvc/llvm-project?view=rev&revision=136724">committed a fix to remove spurious warnings</a> that occurred when building Postgres. Those warnings were emitted because it was statically detected that there were assignments past what appeared to be the end of a single-element array at the end of a struct. The warning is now suppressed, though only under <a href="http://stackoverflow.com/questions/4559558/one-element-array-in-struct">circumstances exactly consistent with the use of a popular idiom</a> that is seen quite a bit in the Postgres code.<br />
<br />
In working towards removing all Clang warnings, we detected a genuine bug: an enum constant from one enum was being assigned to a variable of a different enum type, a potentially dangerous misuse of an abstraction that the Postgres code uses to represent nodes. This all occurred within a nested macro. Without Clang, it probably would have taken a long time for the problem to be noticed.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com5tag:blogger.com,1999:blog-7609611625311126307.post-79376670865748599512011-07-28T09:12:00.000-07:002012-08-05T11:13:01.357-07:00Could Clang displace GCC generally? Part II: Performance of PostgreSQL binariesThis is the second in a two-part series on Clang. If you haven't already, you'll want to read my original post on the topic, <a href="http://pgeoghegan.blogspot.com/2011/07/could-clang-displace-gcc-among.html">Could Clang displace GCC among PostgreSQL developers? Part I: Intro and compile times</a>.<br />
<br />
So, what about the performance of the PostgreSQL binaries themselves when built with each compiler? I had heard contradictory reports about the performance of Clang-built binaries. In Belgium, Chris Lattner said that Clang-built binaries could perform better, but a number of independent benchmarks suggested that Clang was generally behind, with some notable exceptions. I asked <a href="http://2ndquadrant.com/">2ndQuadrant colleague</a> and <a href="http://www.2ndquadrant.com/books/postgresql-9-0-high-performance/">PostgreSQL performance expert Greg Smith</a> to suggest a useful benchmark as a starting point for comparing the performance of Postgres built with Clang against Postgres built with GCC. He suggested that I apply Jeff Janes’ recent patch for pgbench, which he’d reviewed; it stresses the executor, and therefore the CPU, quite effectively, rather than table locks or IPC mechanisms. The results of this benchmark were very interesting.<br />
<br />
<a name='more'></a><br /><br />
Greg provided me with shell access to a beefy server, the same server that he used in his review of Jeff’s patch, which added the -P option: <a href="http://archives.postgresql.org/message-id/4DFE788F.5020704@2ndQuadrant.com">http://archives.postgresql.org/message-id/4DFE788F.5020704@2ndQuadrant.com</a>. I hacked together a shell script to run pgbench for this purpose. Binaries were built using GCC and Clang, each with exactly the same flags (Clang accepts the same flags as GCC). To smooth out the results and get a conclusive outcome, I settled on sixteen ten-minute -P runs with 4 connections, alternating between the two sets of binaries, for a total of 3 hours. Here’s a summary of the results:<br />
<br />
<b>1) GCC test:</b><br />
tps = 34.242839 (including connections establishing)<br />
<b>2) Clang test:</b><br />
tps = 34.370732 (including connections establishing)<br />
<b>3) GCC test:</b><br />
tps = 34.186687 (including connections establishing)<br />
<b>4) Clang test:</b><br />
tps = 34.922954 (including connections establishing)<br />
<b>5) GCC test:</b><br />
tps = 32.393383 (including connections establishing)<br />
<b>6) Clang test:</b><br />
tps = 34.994233 (including connections establishing)<br />
<b>7) GCC test:</b><br />
tps = 33.019546 (including connections establishing)<br />
<b>8) Clang test:</b><br />
tps = 34.234937 (including connections establishing)<br />
<b>9) GCC test:</b><br />
tps = 33.233653 (including connections establishing)<br />
<b>10) Clang test:</b><br />
tps = 35.233373 (including connections establishing)<br />
<b>11) GCC test:</b><br />
tps = 33.962637 (including connections establishing)<br />
<b>12) Clang test:</b><br />
tps = 33.869868 (including connections establishing)<br />
<b>13) GCC test:</b><br />
tps = 33.488347 (including connections establishing)<br />
<b>14) Clang test:</b><br />
tps = 33.005470 (including connections establishing)<br />
<b>15) GCC test:</b><br />
tps = 33.600023 (including connections establishing)<br />
<b>16) Clang test:</b><br />
tps = 34.770840 (including connections establishing)<br />
<br />
The total transactions per second with the Clang binaries was marginally ahead of that of the GCC binaries. While further analysis is certainly needed, it is a remarkable achievement for Clang to hold its own against, or even slightly outperform, a compiler as mature and popular as GCC here.<br />
<br />
So, is Clang ready for prime time? Well, not quite. Even my bleeding-edge Fedora 15 system only comes with Clang 2.8, and only Clang 2.9 is listed as supported for building PostgreSQL 9.1; the build-time figures reported in part one are only obtainable on very recent revisions of Clang. While I’m extremely encouraged by our pgbench benchmark results, I hesitate to recommend the use of Clang for building production PostgreSQL binaries just yet. The benchmark is quite synthetic, and may not be a great proxy for general performance, although it certainly wasn’t cherry-picked. I encourage others to independently reproduce my work here, and to suggest alternative benchmarks.<br />
<br />
I can heartily recommend Clang for hacking on PostgreSQL today. I was also impressed with how accessible the Clang developers are if you have a problem. Sometimes you have to be a bit persistent, but in general the Clang community is quite responsive to end users’ concerns.<br />
<br />
Watch this space.Peter Geogheganhttp://www.blogger.com/profile/02874568372191778321noreply@blogger.com15