Tuesday, May 10, 2016

amcheck: Verify the logical consistency of PostgreSQL B-Tree indexes

I've created a project page on GitHub for amcheck, a tool for verifying the logical consistency of PostgreSQL B-Tree indexes:

https://github.com/petergeoghegan/amcheck

The tool is primarily useful for detecting index corruption in production database systems. It can do this with low overhead; most verification requires only a non-disruptive lock on the index as it is verified. The strength of the lock taken on an index as it is verified matches that of simple SELECT statements (unless the highest level of verification is requested). The locking involved will generally not block concurrent reads or writes, and will not prevent VACUUM from running concurrently.
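
As a rough sketch of what running verification might look like, assuming the extension is installed under the name amcheck and exposes the bt_index_check() and bt_index_parent_check() functions (check the project page for the exact interface), and using a hypothetical index named orders_pkey:

```sql
-- Assumes the amcheck extension is available on this server.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Ordinary verification: the lock taken matches that of a simple SELECT.
-- "orders_pkey" is a hypothetical index name, used only for illustration.
SELECT bt_index_check('orders_pkey'::regclass);

-- Highest level of verification: takes a stronger lock on the index,
-- so it is better suited to a quiet period or a standby copy.
SELECT bt_index_parent_check('orders_pkey'::regclass);
```

Both calls are expected to return nothing when no problem is found and to raise an error when an inconsistency is detected.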

Saturday, November 14, 2015

Suggesting a corrected column name/spelling in the event of a column misspelling


One small PostgreSQL 9.5 feature I worked on is a new hinting mechanism. When you misspell a column name in an SQL query, it sometimes suggests, based on a score, the column you might have meant to type. The score heavily weighs Levenshtein distance. A HINT message is sent to the client, which psql and other client tools will display by default.

It's common to not quite recall offhand whether a column name is pluralized, or where underscores are used to break up the words that make up a column name. This feature is targeted at that problem, providing guidance that lets the user quickly adjust their query without a mental context switch. For example:

postgres=# select * from orders where order_id = 5;
ERROR:  42703: column "order_id" does not exist
LINE 1: select * from orders where order_id = 5;
                                   ^
HINT:  Perhaps you meant to reference the column "orders"."orderid".
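
To get a feel for the distance involved in that example, the fuzzystrmatch contrib extension offers a levenshtein() function. This only illustrates the metric itself; it is not the server's internal scoring, which weighs other factors as well:

```sql
-- Assumes the fuzzystrmatch contrib extension is available.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Edit distance between the misspelled name and the real column name.
-- Deleting the underscore is the only change needed, so the distance is 1.
SELECT levenshtein('order_id', 'orderid');
```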

Friday, October 2, 2015

Avoid naming a constraint directly when using ON CONFLICT DO UPDATE

PostgreSQL 9.5 will have support for a feature that is popularly known as "UPSERT" - the ability to either insert or update a row according to whether an existing row with the same key exists. If such a row already exists, the implementation should update it. If not, a new row should be inserted. This is supported by way of a new high-level syntax (a clause that extends the INSERT statement) that more or less relieves the application developer of having to give any thought to race conditions. This common operation for client applications is set to become far simpler and far less error-prone than legacy ad-hoc approaches to UPSERT involving subtransactions.

When we worked on UPSERT, many edge-cases were carefully considered. A technique called "unique index inference" allows DML statement authors to be very explicit about the condition under which they want to take the alternative (UPDATE or NOTHING) path. That alternative path can only be taken in the event of a would-be duplicate violation in an "arbiter" unique index (for the DO NOTHING variant, a would-be exclusion violation is also a possible reason to take the alternative NOTHING path). The ability to write UPSERT statements explicitly and safely while also having lots of flexibility is an important differentiator for PostgreSQL's UPSERT in my view.
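
A minimal sketch of the difference between inference and naming a constraint directly, using a hypothetical table upsert_demo (the table, its columns, and the values are assumptions made up for illustration):

```sql
-- Hypothetical table; the primary key gets the default constraint
-- name upsert_demo_pkey.
CREATE TABLE upsert_demo (
    id  integer PRIMARY KEY,
    val text
);

-- Unique index inference: name the column(s) and let the server
-- infer a suitable arbiter unique index.
INSERT INTO upsert_demo (id, val)
VALUES (1, 'first')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;

-- Naming the constraint directly: this also works, but it ties the
-- statement to one particular constraint name.
INSERT INTO upsert_demo (id, val)
VALUES (1, 'second')
ON CONFLICT ON CONSTRAINT upsert_demo_pkey DO UPDATE SET val = EXCLUDED.val;

-- DO NOTHING variant: skip the insert when a conflicting row exists.
INSERT INTO upsert_demo (id, val)
VALUES (1, 'ignored')
ON CONFLICT (id) DO NOTHING;
```

The inference form keeps working if the underlying index is later replaced by an equivalent one under a different name, which is one reason to prefer it.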

Saturday, April 4, 2015

Abbreviated keys for numeric to accelerate numeric sorts

Andrew Gierth's numeric abbreviated keys patch was committed recently. This commit added abbreviation/sortsupport for the numeric type (the PostgreSQL type which allows practically arbitrary precision, typically recommended for representing monetary values).

The encoding scheme that Andrew came up with is rather clever - it has an excellent tendency to concentrate entropy from the original values into the generated abbreviated keys in real world cases. As far as accelerating sorts goes, numeric abbreviation is at least as effective as the original text abbreviation scheme. I easily saw improvements of 6x-7x with representative queries that did not spill to disk (i.e. that used quicksort). In essence, the patch makes sorting numeric values almost as cheap as sorting simple integers, since that is often all that is actually required during sorting proper (the abbreviated keys compare as integers, except that the comparison is inverted to comport with how abbreviation builds abbreviated values from numerics as tuples are copied into local memory ahead of sorting - see the patch for exact details).