This enhancement is significant because it seems much more likely to catch corruption in the wild. In general, inconsistencies between a table and its indexes are more likely to occur than inconsistencies between blocks within an index for many reasons. There is simply a much larger window for an inconsistency to arise when something is amiss with database storage that breaks the assumptions PostgreSQL makes during crash recovery, for example.
The enhancement is also significant because it played a role in identifying a PostgreSQL data corruption bug that will be fixed in the next point release, scheduled for November 9th, 2017. This bug affects all supported PostgreSQL versions. It was informally dubbed the “freeze-the-dead” bug.
November 6 2017 update: The fix was reverted due to additional concerns that came to light. The community is working on a new, more comprehensive fix for the next point release.
December 15 2017 update: A new fix has been committed, and will appear in 9.3.21, 9.4.16, 9.5.11, 9.6.7, and 10.2 point releases, scheduled for February 8th, 2018.
Packages for v1.2 are available from the community Debian/Ubuntu apt repository, as well as packages from the community Redhat/CentOS/SLES yum repository. Full details on installing these packages are available from the README:
https://github.com/petergeoghegan/amcheck/
"Freeze-the-dead" corruption detection
amcheck is effective in detecting corruption caused by the “freeze-the-dead” bug because the corruption results in a logical inconsistency in a heap page. There could be multiple versions of the same row visible at once in cases where a prune never took place. In other scenarios, there could be a HOT chain that is pruned prematurely, leading to wrong answers to query plans that use an index scan, while sequential scan plans (plans that should get the same answer without using an index) still give correct answers. Even though amcheck was not written with these specific inconsistencies in mind, it still seems to reliably detect them.
Managing risk
I think that it will prove useful to have an immediate way of mechanically detecting corruption caused by the “freeze-the-dead” bug. We may actually hear reports of corruption that it has caused in the wild (if you happen to have been affected, please let the pgsql-hackers list know about it). I also believe that the newly enhanced amcheck will detect corruption caused by other historic bugs, including the CREATE INDEX CONCURRENTLY bug detected in February of 2017, as well as a similar CREATE INDEX CONCURRENTLY bug from back in 2012. Estimating the prevalence of corruption of this general nature is a very tricky business, though; it largely comes down to workload, and those are incredibly varied. And, even if an issue is on average very unlikely to strike, that doesn’t help those that are affected despite those odds. In short, there is no such thing as the average PostgreSQL database, and there are many practical problems with applying statistical models to complex domains.
I would venture to predict that a more detailed picture of how prevalent corruption like this is across all PostgreSQL installations will emerge over time, as amcheck is used more widely, and that that picture will be fairly boring. It still seems important to make every effort here, though. Going forward, we’ll have more to go on than educated guesses. And, those unlucky few that turn out to be affected by bugs that lead to corruption will have a relatively simple, non-disruptive tool to isolate the problem. DBAs can use amcheck to help with managing the risk of data corruption, including but not limited to corruption originating from bugs in PostgreSQL.
No comments:
Post a Comment