One of the issues of major concern to CPU vendors is optimising the power consumption of their devices. In a world where increasingly, computing resources are purchased in terms of fairly abstract units of work, and where, when selecting the location of a major data-centre, the local price of a kilowatt hour is likely to be weighed just as heavily as the wholesale price of bandwidth, this is quite understandable.
Globally, data centres consumed between 1.1 and 1.5 percent of electricity in 2010 (Source: Koomey). The economic and ecological importance of minimizing that number is fairly obvious.
The broad trend towards increasing amounts of computing being performed within large data centres, with consolidated infrastructure, sold as a service rather than a product is undeniable. Of course, the term “cloud computing” is often applied to this phenomenon. That’s a term that I try to avoid, as it’s fairly ambiguous.
There has been considerable effort to reduce wake-ups when idle in software in general, including everything from web browsers to word processors, which is related to the increasing importance of mobile and embedded platforms. However, this effort is most pronounced among developers of software that is expected to be deployed in virtualised environment on many servers, as wake-ups prevent CPUs from entering various idle states that allow them to save electricity, and when these wakeups are multiplied by thousands of VM instances, they add up very quickly.
As part of 4CaaSt, a research project funded by the European Commission's Seventh Framework programme, that brings together members of industry and academia with the collective goal of producing an innovative platform-as-as-service offering, I spent time reducing the idle wake-ups per second in PostgreSQL. Postgres services firm 2ndQuadrant, where I work as a database architect, has had the development of several PostgreSQL features sponsored by 4CaaSt in furtherance of that goal, of which this is only one.
Historically, PostgreSQL has been weak in this particular area. With a standard Postgres server, with no special configuration, I have measured the wake-ups when idle at 11.5 per second, using Intel’s powertop utility, as of the current 9.1 release. This was thought to be unacceptably high, for 4CaaSt, other solutions that leverage virtualisation extensively, and for embedded systems too.
CPUs have a number of methods of reducing power consumption. These are specified by the ACPI standard (which covers discoverability, configuration and power-management), which in case you hadn't heard, is an open specification that makes minimal assumptions about the architecture or platform in use, and was written to help authors of operating system kernels.
Globally, data centres consumed between 1.1 and 1.5 percent of electricity in 2010 (Source: Koomey). The economic and ecological importance of minimizing that number is fairly obvious.
The broad trend towards increasing amounts of computing being performed within large data centres, with consolidated infrastructure, sold as a service rather than a product is undeniable. Of course, the term “cloud computing” is often applied to this phenomenon. That’s a term that I try to avoid, as it’s fairly ambiguous.
There has been considerable effort to reduce wake-ups when idle in software in general, including everything from web browsers to word processors, which is related to the increasing importance of mobile and embedded platforms. However, this effort is most pronounced among developers of software that is expected to be deployed in virtualised environment on many servers, as wake-ups prevent CPUs from entering various idle states that allow them to save electricity, and when these wakeups are multiplied by thousands of VM instances, they add up very quickly.
As part of 4CaaSt, a research project funded by the European Commission's Seventh Framework programme, that brings together members of industry and academia with the collective goal of producing an innovative platform-as-as-service offering, I spent time reducing the idle wake-ups per second in PostgreSQL. Postgres services firm 2ndQuadrant, where I work as a database architect, has had the development of several PostgreSQL features sponsored by 4CaaSt in furtherance of that goal, of which this is only one.
Historically, PostgreSQL has been weak in this particular area. With a standard Postgres server, with no special configuration, I have measured the wake-ups when idle at 11.5 per second, using Intel’s powertop utility, as of the current 9.1 release. This was thought to be unacceptably high, for 4CaaSt, other solutions that leverage virtualisation extensively, and for embedded systems too.
CPUs have a number of methods of reducing power consumption. These are specified by the ACPI standard (which covers discoverability, configuration and power-management), which in case you hadn't heard, is an open specification that makes minimal assumptions about the architecture or platform in use, and was written to help authors of operating system kernels.
Briefly, ACPI describes the following states (I’ve avoided mentioning other states that have more to do with things like managing laptop hibernation):
Postmaster. A “clearing-house process”, that manages all other processes, and is minimally exposed to installation-wide failures, so that it has a good chance of recovering the server in the event of an unanticipated failure. To simulate this, you can kill another auxiliary process, and watch as the postmaster starts it again.
Background writer. A process that is charged with writing out “dirty”, or unwritten buffers, in the hope of preventing individual connection backends from ever having to.
WAL Writer. A process that writes out WAL, log files that describe changes made to data in PostgreSQL databases. This is part of a whole subsystem through which the server efficiently maintains its crash-safety/durability guarantees.
Autovacuum launcher. This process notices if there is a need to vacuum dead rows, which are an artifact of the Postgres MVCC implementation. It launches autovacuum worker processes as needed, to perform this garbage collection.
Statistics collector. This process collects statistics on tables and queries, both to facilitate how autovacuum apportions work to vacuum dead rows and build more detailed statistics for the planner, and for general instrumentation.
Checkpointer (new to 9.2). This process is responsible for managing checkpoints - smoothed writing of all data to disk, so that WAL files that describe those changes in sequence before a certain point can finally be truncated. This used to be an additional responsibility of the background writer.
- Performance states P0 through to PN (i.e. the exact number of states is implementation-defined). Dynamic CPU frequency scaling states. This might be better known under marketing names for specific implementations, like “Intel SpeedStep technology”. Ever notice how the frequency reported for your CPU under /proc/cpuinfo varies from one moment to the next on Linux? This is why! This state tends to be a bit sticky, in that it might take a few seconds to observe changes in frequency, as it is increased to meet demand.
- Processor states C0 through to C3. Processors will change this state very quickly, and we basically want to keep this as high as possible, as higher values are associated with using less power.
- C0 is the operating state.
- C1, or the halt state, is a state where the processor is not executing instructions, but can return to an executing state essentially instantaneously.
- C2, or the Stop-Clock state, is a state where the processor maintains all software-visible state, but may take longer to wake up.
- C3, or the sleep state, is a state where the processor does not need to maintain cache coherency, but does maintain some other state. There can even be graduations of how deep a sleep this state represents, depending on the implementation - the Intel Core i5 chip in my laptop has a C4 state, for example.
Postmaster. A “clearing-house process”, that manages all other processes, and is minimally exposed to installation-wide failures, so that it has a good chance of recovering the server in the event of an unanticipated failure. To simulate this, you can kill another auxiliary process, and watch as the postmaster starts it again.
Background writer. A process that is charged with writing out “dirty”, or unwritten buffers, in the hope of preventing individual connection backends from ever having to.
WAL Writer. A process that writes out WAL, log files that describe changes made to data in PostgreSQL databases. This is part of a whole subsystem through which the server efficiently maintains its crash-safety/durability guarantees.
Autovacuum launcher. This process notices if there is a need to vacuum dead rows, which are an artifact of the Postgres MVCC implementation. It launches autovacuum worker processes as needed, to perform this garbage collection.
Statistics collector. This process collects statistics on tables and queries, both to facilitate how autovacuum apportions work to vacuum dead rows and build more detailed statistics for the planner, and for general instrumentation.
Checkpointer (new to 9.2). This process is responsible for managing checkpoints - smoothed writing of all data to disk, so that WAL files that describe those changes in sequence before a certain point can finally be truncated. This used to be an additional responsibility of the background writer.
The reason that all these wake-ups had to occur within each auxiliary process was because they needed to check if the Postmaster was still alive very regularly, or if they had work assigned to them. If they took too long to notice that the Postmaster was dead (a major failure that neccessitates all processes immediately exiting), they would take too long to detach from shared memory, which would prevent the DBA from starting a new instance, as Postgres will refuse to start when it notices this to avoid data corruption.
The solution was to amend the latch, a low-level facility to wait-sleep on an event that was already used for synchronous replication, to also monitor Postmaster death. This infrastructure was committed first. I then proceeded to write patches for each auxiliary process, most recently the Background writer, which was particularly tricky, though accounted for most of the wake-ups when idle among auxiliary processes - usually 5 per second.
Some considerable progress has been made. Additional variability has been added to the number of wake-ups per second, but if you monitor the wake-ups per second using powertop at a sufficiently high granularity, it stabilises at:
3.8% ( 35.0) SignalSender
3.0% ( 27.2) [kernel scheduler] Load balancing tick
2.8% ( 25.6) kworker/0:0
0.8% ( 7.6) postgres
0.6% ( 5.7) [TLB shootdowns] <kernel IPI>
0.6% ( 5.6) [kernel core] hrtimer_start (tick_sched_ti
To give you some notion of how this relates to CPU states, this is an account of the time my laptop’s CPU spends in each of the states at one moment in time, according to powertop:
Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 1.5%) Turbo Mode 3.0%
polling 0.0ms ( 0.0%) 2.00 Ghz 0.1%
C1 mwait 1.0ms ( 1.3%) 1.80 Ghz 0.1%
C2 mwait 1.5ms ( 1.8%) 1200 Mhz 0.2%
C3 mwait 1.4ms ( 0.4%) 800 Mhz 96.7%
C4 mwait 7.9ms (95.0%)
There is still some more work to do though. Simon Riggs and I submitted a patch to add group commit to PostgreSQL, which is being reviewed in the ongoing commitfest. This feature is anticipated to be very valuable to workloads that are bound by their commit rate, and a number of benchmarks that have been performed are very promising. That patch included support for allowing the WAL Writer to sleep. However, the exact details of group commit’s implementation have yet to be agreed upon, and it is not yet completely clear how effectively we will be able to reduce the WAL writer's idle wake-ups. However, I am hopeful that we will be able to eliminate them entirely, bringing the total number down to 2.6 per second for an idle Postgres 9.2 installation with standard settings. The WAL writer, much like the background writer, accounts for a relatively large 5 wake-ups per second (assuming default settings), and is similarly a bit tricky to adjust in this way.
I’d previously measured the idle wake-ups per second for my distro’s mysqld at 2.2 (mysql-server version 5.1.56, Fedora 14), though when I check now, with mysql-server 5.5.19 on Fedora 16, that’s way up at consistently over 20 wake-ups per second. I’m not sure why that might be, but I welcome input as to what a fair, objective comparison would look like. I have made every effort to be fair here, and I'd speculate that this may have something to do with the storage engine in use in each case.
The solution was to amend the latch, a low-level facility to wait-sleep on an event that was already used for synchronous replication, to also monitor Postmaster death. This infrastructure was committed first. I then proceeded to write patches for each auxiliary process, most recently the Background writer, which was particularly tricky, though accounted for most of the wake-ups when idle among auxiliary processes - usually 5 per second.
Some considerable progress has been made. Additional variability has been added to the number of wake-ups per second, but if you monitor the wake-ups per second using powertop at a sufficiently high granularity, it stabilises at:
3.8% ( 35.0) SignalSender
3.0% ( 27.2) [kernel scheduler] Load balancing tick
2.8% ( 25.6) kworker/0:0
0.8% ( 7.6) postgres
0.6% ( 5.7) [TLB shootdowns] <kernel IPI>
0.6% ( 5.6) [kernel core] hrtimer_start (tick_sched_ti
To give you some notion of how this relates to CPU states, this is an account of the time my laptop’s CPU spends in each of the states at one moment in time, according to powertop:
Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 1.5%) Turbo Mode 3.0%
polling 0.0ms ( 0.0%) 2.00 Ghz 0.1%
C1 mwait 1.0ms ( 1.3%) 1.80 Ghz 0.1%
C2 mwait 1.5ms ( 1.8%) 1200 Mhz 0.2%
C3 mwait 1.4ms ( 0.4%) 800 Mhz 96.7%
C4 mwait 7.9ms (95.0%)
There is still some more work to do though. Simon Riggs and I submitted a patch to add group commit to PostgreSQL, which is being reviewed in the ongoing commitfest. This feature is anticipated to be very valuable to workloads that are bound by their commit rate, and a number of benchmarks that have been performed are very promising. That patch included support for allowing the WAL Writer to sleep. However, the exact details of group commit’s implementation have yet to be agreed upon, and it is not yet completely clear how effectively we will be able to reduce the WAL writer's idle wake-ups. However, I am hopeful that we will be able to eliminate them entirely, bringing the total number down to 2.6 per second for an idle Postgres 9.2 installation with standard settings. The WAL writer, much like the background writer, accounts for a relatively large 5 wake-ups per second (assuming default settings), and is similarly a bit tricky to adjust in this way.
I’d previously measured the idle wake-ups per second for my distro’s mysqld at 2.2 (mysql-server version 5.1.56, Fedora 14), though when I check now, with mysql-server 5.5.19 on Fedora 16, that’s way up at consistently over 20 wake-ups per second. I’m not sure why that might be, but I welcome input as to what a fair, objective comparison would look like. I have made every effort to be fair here, and I'd speculate that this may have something to do with the storage engine in use in each case.
I know nothing about Postgres internals so this is written from a point of ignorance...
ReplyDeleteIt seems the parties communicate over shared memory for the most part because you mentioned the need to detach on postmaster death to allow a restart, and it's this polling for postmaster death that's an issue.
Could each poller not have their own pipe to read from postmaster, which never writes anything, and then the OS, on postmaster death, would provide an EOF to each reader for each pipe and that is sufficient to make select(2) et al return?