Time: 2021-07-01 10:21:17
Thanks to this there is no locking on writes to WAL, just simple, plain appending of data.
Now, if we continued this process for a long time, we would have lots of modified pages in memory, and lots of records in WAL.
So, when is data written to actual disk pages of tables?
Two situations: page swap and checkpoint.
Page swap is a very simple process. Let's assume we had shared_buffers set to 10, all of these buffers are taken by 10 different pages, and all are modified. Now, due to user activity, PostgreSQL has to load another page to get data from it. What will happen? It's simple: one of the pages will get evicted from memory, and the new page will be loaded. If the page that got removed was "dirty" (which means that there were some changes in it that weren't yet saved to the table file), it will first be written to the table file.
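The page swap described above can be sketched as a toy buffer pool. This is only an illustration under stated assumptions: the class and its methods are made up for this example, and the victim is chosen by plain least-recently-used order, which is simpler than what PostgreSQL really does.

```python
from collections import OrderedDict


class ToyBufferPool:
    """Hypothetical model of shared_buffers; not PostgreSQL's real algorithm."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> (data, dirty flag)
        self.table_file = {}         # stands in for the on-disk table file
        self.writes = []             # page ids written back during eviction

    def _load(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # mark as recently used
            return
        if len(self.pages) >= self.capacity:
            # No free buffer: evict the least recently used page.
            victim, (data, dirty) = self.pages.popitem(last=False)
            if dirty:
                # A dirty page must reach the table file before it is dropped.
                self.table_file[victim] = data
                self.writes.append(victim)
        self.pages[page_id] = (self.table_file.get(page_id), False)

    def read(self, page_id):
        self._load(page_id)
        return self.pages[page_id][0]

    def modify(self, page_id, data):
        self._load(page_id)
        self.pages[page_id] = (data, True)    # changed in memory only
```

With a capacity of 10 and ten modified pages, reading an eleventh page forces exactly one write to the table file: the evicted dirty page.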
Checkpoint is much more interesting. Before we go into what it is, let's think about a theoretical scenario. You have a database that is 1GB in size, and your server has 10GB of RAM. Clearly you can keep all pages of the database in memory, so the page swap never happens.
What will happen if you let the database run, with writes, changes, additions, for a long time? Theoretically all would be OK: all changes would get logged to WAL, and memory pages would be modified, all good. Now imagine that after 24 hours of work the system gets killed, again by a power failure.
On next start PostgreSQL would have to read, and apply, all changes from all WAL segments written in the last 24 hours! That's a lot of work, and it would make the startup of PostgreSQL take a loooooooong time.
To solve this problem, we get checkpoints. These usually happen automatically, but you can force one at will by issuing the CHECKPOINT command.
So, what is a checkpoint? A checkpoint does a very simple thing: it writes all dirty pages from memory to disk, marks them as "clean" in shared_buffers, and stores the information that all of WAL up to now is applied. This happens without any locking, of course. So, the immediate conclusion is that the amount of work a newly started PostgreSQL has to do depends on how much time passed between the last CHECKPOINT and the moment PostgreSQL got stopped.
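To make the relation between checkpoints and recovery work concrete, here is a minimal sketch. It assumes a heavily simplified model (WAL as a Python list, pages as dictionary entries, no concurrency); the class and method names are hypothetical and exist only for this example.

```python
class ToyDatabase:
    """Hypothetical model: WAL is a list, pages are dict entries."""

    def __init__(self):
        self.wal = []            # append-only list of (page_id, value)
        self.memory = {}         # pages held in shared_buffers
        self.dirty = set()       # page ids changed since the last flush
        self.disk = {}           # stands in for the table files
        self.redo_location = 0   # WAL index where recovery must start

    def write(self, page_id, value):
        self.wal.append((page_id, value))   # log the change first ...
        self.memory[page_id] = value        # ... then modify the page in memory
        self.dirty.add(page_id)

    def checkpoint(self):
        for page_id in self.dirty:
            self.disk[page_id] = self.memory[page_id]
        self.dirty.clear()
        self.redo_location = len(self.wal)  # everything before this is on disk

    def crash_and_recover(self):
        """Lose memory, replay WAL from the REDO location.

        Returns the number of WAL records that had to be replayed."""
        self.memory, self.dirty = {}, set()
        to_replay = self.wal[self.redo_location:]
        for page_id, value in to_replay:
            self.disk[page_id] = value
        return len(to_replay)
```

In this model, 100 writes followed by a checkpoint and 2 more writes leave only 2 records to replay after a crash; without the checkpoint it would be all 102.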
This brings us back to when it happens. Manual checkpoints are not common, usually one doesn't even think about them, it all happens in the background. How does PostgreSQL know when to checkpoint, then? Simple, thanks to two configuration parameters: checkpoint_segments and checkpoint_timeout.
And here, we have to learn a bit about segments.
As I wrote earlier, WAL is (in theory) an infinite file that only gets new data appended, and never overwrites anything.
While this is nice in theory, practice is a bit more complex. For example, there is not really any use for WAL data that was logged before the last checkpoint. And files of infinite size are (at least for now) not possible.
PostgreSQL developers decided to split this infinite WAL into segments. Each segment has its consecutive number, and is 16MB in size. When one segment is full, PostgreSQL simply switches to the next.
Now that we know what segments are, we can understand what checkpoint_segments is about. It is a number (default: 3) which means: if that many segments got filled since the last checkpoint, issue a new checkpoint.
With defaults, it means that if you inserted data that takes (in PostgreSQL format) 6144 pages (a 16MB segment is 2048 pages of 8kB, so 3 segments are 6144 pages), it would automatically issue a checkpoint.
The second parameter, checkpoint_timeout, is a time interval (defaults to 5 minutes): if this much time passes since the last checkpoint, a new checkpoint will be issued. It has to be understood that (generally) the more often you make checkpoints, the less invasive they are.
This comes from a simple fact: generally, over time, more and more different pages get dirty. If you checkpoint every minute, only pages from that minute have to be written to disk. 5 minutes: more pages. 1 hour: even more pages.
While checkpointing doesn't lock anything, it has to be understood that a checkpoint of (for example) 30 segments will cause a high-intensity write of 480 MB of data to disk. And this might cause some slowdowns for concurrent reads.
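The arithmetic and the two triggers above can be written down as a small sketch. The function names and the returned labels are made up for illustration; the 16MB segment size, the 8kB page size and the defaults (3 segments, 5 minutes) are the values quoted in this article.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024   # 16MB per WAL segment
PAGE_BYTES = 8 * 1024                  # 8kB pages


def pages_in_segments(segments):
    """How many pages' worth of data fit into this many WAL segments."""
    return segments * WAL_SEGMENT_BYTES // PAGE_BYTES


def megabytes_in_segments(segments):
    return segments * WAL_SEGMENT_BYTES // (1024 * 1024)


def checkpoint_due(segments_filled, seconds_since_last,
                   checkpoint_segments=3, checkpoint_timeout=300):
    """Return why a checkpoint is due, or None if neither limit is reached.

    The labels "segments" and "timeout" are illustrative only."""
    if segments_filled >= checkpoint_segments:
        return "segments"
    if seconds_since_last >= checkpoint_timeout:
        return "timeout"
    return None
```

For example, pages_in_segments(3) gives 6144 and megabytes_in_segments(30) gives 480, matching the numbers in the text.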
So far, I hope, it‘s pretty clear.
Now, the next part of the jigsaw: WAL segments.
These files (WAL segments) reside in the pg_xlog/ directory of PostgreSQL PGDATA:
=$ ls -l data/pg_xlog/
total 131076
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:39 000000010000000800000058
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 000000010000000800000059
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 00000001000000080000005A
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005B
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005C
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005D
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005E
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005F
drwx------ 2 pgdba pgdba     4096 2011-07-14 01:14 archive_status/
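Each segment name contains 3 blocks of 8 hex digits. For example, in 00000001000000080000005C the first block (00000001) is the timeline, and the second (00000008) and third (0000005C) blocks together identify the segment.
The last part goes only from 00000000 to 000000FE (not FF!).
The 2nd part of the filename plus the 2 characters at the end of the 3rd part give us the location in this theoretical infinite WAL file.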
Within PostgreSQL we can always check what the current WAL location is:
$ select pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 8/584A62E0
(1 row)
This means that we are now using file xxxxxxxx0000000800000058, and PostgreSQL is writing at offset 4A62E0 in it, which is 4874976 in decimal. Since a WAL segment is 16MB (16777216 bytes), this means that the segment is now ~29% full.
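The mapping from a WAL location to a file name and an offset can be sketched as follows. The helper names are made up for this example, and the code only follows the naming scheme as described in this article (timeline, 2nd part, then the top two hex digits of the location's second half).

```python
SEGMENT_BYTES = 16 * 1024 * 1024   # 16777216, "Bytes per WAL segment"


def parse_location(location):
    """Split 'X/Y' (both hexadecimal) into two integers."""
    high, low = location.split("/")
    return int(high, 16), int(low, 16)


def segment_file_name(location, timeline=1):
    """File name of the segment that contains the given WAL location."""
    high, low = parse_location(location)
    segment = low // SEGMENT_BYTES   # top two hex digits of the low part
    return "%08X%08X%08X" % (timeline, high, segment)


def offset_in_segment(location):
    """Byte offset of the location within its 16MB segment."""
    _, low = parse_location(location)
    return low % SEGMENT_BYTES
```

For 8/584A62E0 on timeline 1 this yields the file 000000010000000800000058 and the offset 4874976 (hex 4A62E0).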
The most mysterious thing is the timeline. Timeline starts from 1, and increments (by one) every time you make a WAL slave from a server and this slave is promoted to standalone. Generally, within a given working server this value doesn't change.
All of this information we can also get using the pg_controldata program:
=$ pg_controldata data
pg_control version number:            903
Catalog version number:               201107031
Database system identifier:           5628532665370989219
Database cluster state:               in production
pg_control last modified:             Thu 14 Jul 2011 01:49:12 AM CEST
Latest checkpoint location:           8/584A6318
Prior checkpoint location:            8/584A6288
Latest checkpoint's REDO location:    8/584A62E0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/33611
Latest checkpoint's NextOID:          28047
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        727
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  33611
Time of latest checkpoint:            Thu 14 Jul 2011 01:49:12 AM CEST
Minimum recovery ending location:     0/0
Backup start location:                0/0
Current wal_level setting:            hot_standby
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
This has some interesting information, for example the location (in the infinite WAL file) of the last checkpoint, the previous checkpoint, and the REDO location.
The REDO location is very important: this is the place in WAL that PostgreSQL will have to start reading from if it got killed and restarted.
The values above don't differ much because this is my test system, which doesn't have any traffic now, but we can see the difference on another machine:
=> pg_controldata data/
...
Latest checkpoint location:           623C/E07AC698
Prior checkpoint location:            623C/DDD73588
Latest checkpoint's REDO location:    623C/DE0915B0
...
The last important thing is to understand what happens with obsolete WAL segments, and how "new" WAL segments are created.
Let me show you one thing again:
=$ ls -l data/pg_xlog/
total 131076
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:39 000000010000000800000058
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 000000010000000800000059
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 00000001000000080000005A
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005B
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005C
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005D
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005E
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005F
drwx------ 2 pgdba pgdba     4096 2011-07-14 01:14 archive_status/
This was on a system with no writes, and a REDO location of 8/584A62E0.
Since on start PostgreSQL will need to read from this location, all WAL segments before 000000010000000800000058 (i.e. 000000010000000800000057, 000000010000000800000056 and so on) are obsolete.
On the other hand, please note that we have seven files ready for future use.
PostgreSQL works this way: whenever a WAL segment becomes obsolete (i.e. the REDO location is later in WAL than this segment), the file is renamed. That's right. It's not removed, it's renamed. Renamed to what? To the next file in WAL. So when I do some writes, and then there is a checkpoint at an 8/59* location, file 000000010000000800000058 will get renamed to 000000010000000800000060.
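The choice of the new name can be sketched like this. It is an illustration only: the function names are hypothetical, and the wrap-around rule is taken from the statement earlier in this article that the last part of the name stops at FE.

```python
LAST_SEGMENT = 0xFE   # per the text above, the last part never reaches FF


def next_segment_name(name):
    """Return the WAL file name that follows `name` on the same timeline."""
    timeline = name[0:8]
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    if seg >= LAST_SEGMENT:
        log, seg = log + 1, 0   # move on to the next value of the 2nd part
    else:
        seg += 1
    return "%s%08X%08X" % (timeline, log, seg)


def recycle_target(existing_names):
    """Name an obsolete segment would be renamed to: one past the highest.

    Fixed-width uppercase hex names sort correctly as plain strings."""
    return next_segment_name(max(existing_names))
```

Given the eight files from the listing above (…58 through …5F), recycle_target returns 000000010000000800000060, which is the rename described in the text.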
This is one of the reasons why your checkpoint_segments shouldn't be too low.
Let's think for a while about what would happen if we had a very long checkpoint_timeout, and we filled all checkpoint_segments. To record a new write PostgreSQL would have to do a checkpoint (which it will do), but at the same time it wouldn't have any ready segments left to use. So it would have to create a new file. A new file means 16MB of data (\x00 probably) that has to be written to disk before PostgreSQL can write anything that the user requested. Which means that if you ever reach checkpoint_segments, concurrent user activity will be slowed down, because PostgreSQL will have to create new files to accommodate the writes requested by users.
Usually it's not a problem, you just set checkpoint_segments to some relatively high number, and you're done.
Anyway. When looking at the pg_xlog/ directory, the current WAL segment (the one that gets the writes) is usually somewhere in the middle. This might cause some confusion, because the mtime of the files will not change in the same direction as the numbers in the filenames. Like here:
$ ls -l
total 704512
-rw------- 1 postgres postgres 16777216 Jul 13 16:51 000000010000002B0000002A
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002B
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002C
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002D
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002E
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002F
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B00000030
-rw------- 1 postgres postgres 16777216 Jul 13 17:01 000000010000002B00000031
-rw------- 1 postgres postgres 16777216 Jul 13 17:16 000000010000002B00000032
-rw------- 1 postgres postgres 16777216 Jul 13 17:21 000000010000002B00000033
-rw------- 1 postgres postgres 16777216 Jul 13 14:31 000000010000002B00000034
-rw------- 1 postgres postgres 16777216 Jul 13 14:32 000000010000002B00000035
-rw------- 1 postgres postgres 16777216 Jul 13 14:19 000000010000002B00000036
-rw------- 1 postgres postgres 16777216 Jul 13 14:36 000000010000002B00000037
-rw------- 1 postgres postgres 16777216 Jul 13 14:37 000000010000002B00000038
-rw------- 1 postgres postgres 16777216 Jul 13 14:38 000000010000002B00000039
-rw------- 1 postgres postgres 16777216 Jul 13 14:39 000000010000002B0000003A
-rw------- 1 postgres postgres 16777216 Jul 13 14:40 000000010000002B0000003B
-rw------- 1 postgres postgres 16777216 Jul 13 14:41 000000010000002B0000003C
-rw------- 1 postgres postgres 16777216 Jul 13 14:41 000000010000002B0000003D
-rw------- 1 postgres postgres 16777216 Jul 13 14:42 000000010000002B0000003E
-rw------- 1 postgres postgres 16777216 Jul 13 14:43 000000010000002B0000003F
-rw------- 1 postgres postgres 16777216 Jul 13 14:33 000000010000002B00000040
-rw------- 1 postgres postgres 16777216 Jul 13 14:34 000000010000002B00000041
-rw------- 1 postgres postgres 16777216 Jul 13 14:45 000000010000002B00000042
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000043
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000044
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000045
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000046
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000047
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000048
-rw------- 1 postgres postgres 16777216 Jul 13 15:09 000000010000002B00000049
-rw------- 1 postgres postgres 16777216 Jul 13 15:25 000000010000002B0000004A
-rw------- 1 postgres postgres 16777216 Jul 13 15:35 000000010000002B0000004B
-rw------- 1 postgres postgres 16777216 Jul 13 15:51 000000010000002B0000004C
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B0000004D
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B0000004E
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B0000004F
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B00000050
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B00000051
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B00000052
-rw------- 1 postgres postgres 16777216 Jul 13 16:19 000000010000002B00000053
-rw------- 1 postgres postgres 16777216 Jul 13 16:35 000000010000002B00000054
drwx------ 2 postgres postgres       96 Jun  4 23:28 archive_status
Please note that the newest file, 000000010000002B00000033, is neither the first nor the last. And the oldest file, 000000010000002B00000036, comes quite soon after the newest.
This is all natural. All files before the current one are the ones that are still needed, and their mtimes go in the same direction as the WAL segment numbering.
The last file (based on filenames), *54, has an mtime just before that of *2A, which tells us that it previously was *29, but got renamed when the REDO location moved somewhere into file *2A.
I hope that it's clear from the above explanation; if not, please state your questions/concerns in the comments.
So, to wrap it up. WAL exists to save your bacon in case of emergency. Thanks to WAL it is very hard to get any problems with data. I would even say impossible, but it's still possible in case your hardware misbehaves, for example by lying about actual disk writes.
WAL is stored in a number of files in the pg_xlog/ directory, and the files get reused, so the directory shouldn't grow. The number of these files is usually 2 * checkpoint_segments + 1.
Whoa? Why 2* checkpoint_segments?
The reason is very simple. Let's assume you have checkpoint_segments set to 5. You filled them all, and a checkpoint is called. The checkpoint is called in WAL segment #x. In #x + 5 we will have another checkpoint. But PostgreSQL always keeps (at least) checkpoint_segments segments ahead of the current location, to avoid the need to create new segments for data from user queries. So, at any given moment, you might have: the current segment, up to checkpoint_segments segments filled since the REDO location, and checkpoint_segments segments ready ahead of the current location.
Sometimes you have more writes than checkpoint_segments can hold, in which case PostgreSQL will create new segments (as I described above). This will inflate the number of files in pg_xlog/. But this will get restored after some time: some obsolete segments will simply not get renamed, but will be removed instead.
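The usual 2 * checkpoint_segments + 1 count can be spelled out term by term. This is just the arithmetic from the explanation above, with a made-up function name; it describes the typical steady state, not a hard limit.

```python
def expected_wal_files(checkpoint_segments):
    """Typical number of files in pg_xlog/, per the reasoning above."""
    since_last_checkpoint = checkpoint_segments  # may still be needed for recovery
    current = 1                                  # the segment being written now
    ready_ahead = checkpoint_segments            # recycled segments kept for future writes
    return since_last_checkpoint + current + ready_ahead
```

With checkpoint_segments set to 5 this gives 11 files, and with the default of 3 it gives 7.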
Finally, the last thing: the GUC "checkpoint_warning". It is also (like checkpoint_timeout) an interval, usually much shorter, by default 30 seconds. It is used to log (not to WAL, but to the normal log) information if the automated checkpoints happen too often.
Since checkpoint_timeout is supposedly larger than checkpoint_warning, checkpoints triggered by the timeout can never be that close together. So this usually means that it alerts if you filled checkpoint_segments worth of WAL in less than checkpoint_warning time.
Such information looks like this:
2011-07-14 01:03:22.160 CEST @ 7370 LOG: checkpoint starting: xlog
2011-07-14 01:03:26.175 CEST @ 7370 LOG: checkpoint complete: wrote 1666 buffers (40.7%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=3.225 s, sync=0.720 s, total=4.014 s; sync files=5, longest=0.292 s, average=0.144 s
2011-07-14 01:03:34.904 CEST @ 7370 LOG: checkpoints are occurring too frequently (12 seconds apart)
2011-07-14 01:03:34.904 CEST @ 7370 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-07-14 01:03:34.904 CEST @ 7370 LOG: checkpoint starting: xlog
2011-07-14 01:03:39.239 CEST @ 7370 LOG: checkpoint complete: wrote 1686 buffers (41.2%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=3.425 s, sync=0.839 s, total=4.334 s; sync files=5, longest=0.267 s, average=0.167 s
2011-07-14 01:03:48.077 CEST @ 7370 LOG: checkpoints are occurring too frequently (14 seconds apart)
2011-07-14 01:03:48.077 CEST @ 7370 HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-07-14 01:03:48.077 CEST @ 7370 LOG: checkpoint starting: xlog
Please note the "HINT" lines.
These are only hints (that is, not warnings or fatals) because a too low checkpoint_segments doesn't cause any risk to your data. It just might slow down interaction with clients, if a user sends a modification query that has to wait for a new WAL segment to be created (i.e. 16MB written to disk).
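If you want to check checkpoint spacing yourself, for example from timestamps collected out of the log, a rough sketch could look like the one below. The function name and the timestamp format (date, time, milliseconds, without the time zone) are assumptions made for this example. The server measures the interval in its own way, so the figures need not match the numbers in its messages exactly.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"   # assumed format, time zone stripped


def short_gaps(checkpoint_starts, checkpoint_warning=30.0):
    """Return gaps (in seconds) between consecutive checkpoint starts
    that are shorter than checkpoint_warning."""
    parsed = [datetime.strptime(s, TIMESTAMP_FORMAT) for s in checkpoint_starts]
    gaps = []
    for previous, current in zip(parsed, parsed[1:]):
        gap = (current - previous).total_seconds()
        if gap < checkpoint_warning:
            gaps.append(gap)
    return gaps
```

Feeding it the three "checkpoint starting" timestamps from the log above returns two gaps, both well under the default 30 seconds.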
As a last note: if you have some kind of monitoring of your PostgreSQL (like cacti, ganglia, munin, or something commercial like circonus), you might want to add a graph that shows your WAL progress over time.
To do it, you need to convert the current xlog location to a normal decimal number, and then draw the differences. For example like this:
=$ psql -qAtX -c "select pg_current_xlog_location()" | awk -F/ 'BEGIN{print "ibase=16"} {printf "%s%08s\n", $1, $2}' | bc
108018127223360
Or, if the numbers get too big, just the decimal "number" of the file:
=$ ( echo "ibase=16"; psql -qAtX -c "select pg_xlogfile_name(pg_current_xlog_location())" | cut -b 9-16,23-24 ) | bc
6438394
Drawing increments of the 2nd value (6438394) in 5 minute intervals will tell you what the optimal checkpoint_segments is (although always remember to make it a bit larger than actually needed, just in case of a sudden spike in traffic).
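The same conversions can be done without awk, cut and bc. The sketch below mirrors the two shell pipelines above in Python; the function names are made up, and the sample values in the usage note are invented purely to show the shape of the result.

```python
def location_to_int(location):
    """Mirror the awk/bc pipeline: join 'X' and zero-padded 'Y', read as hex."""
    high, low = location.split("/")
    return int(high + low.rjust(8, "0"), 16)


def file_number(wal_file_name):
    """Mirror `cut -b 9-16,23-24`: 2nd block plus the last two hex digits."""
    return int(wal_file_name[8:16] + wal_file_name[22:24], 16)


def segments_per_interval(samples):
    """Differences between file numbers sampled at a fixed interval."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]
```

For example, file numbers sampled every 5 minutes as 6438394, 6438400 and 6438403 give differences of 6 and 3, which suggests a checkpoint_segments somewhat above 6 for that (hypothetical) workload.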
Reference:
http://www.depesz.com/2011/07/14/write-ahead-log-understanding-postgresql-conf-checkpoint_segments-checkpoint_timeout-checkpoint_warning/
UNDERSTANDING POSTGRESQL.CONF: CHECKPOINT_SEGMENTS, CHECKPOINT_TIMEOUT, CHECKPOINT_WARNING