Discussion:
Curiosity: what is PostgreSQL doing with data when "nothing" is happening?
Gavan Schneider
2012-11-24 01:52:17 UTC
Context:
After an unfortunate experience I have implemented WAL archive
backup strategy as per documentation with the minor extensions that:
1. The archive files are compressed;
2. Only 8MB WAL files (it's a low-activity setup);
3. Purge WAL segments every 10 minutes (still under examination); and,
4. Able to rotate the archive directory when desired, i.e., when doing a new baseline archive.
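
In outline, the archive script behind points 1 and 4 does something
like this (a simplified sketch, not the actual script; the paths and
the ARCHIVE variable are illustrative):

#!/bin/sh
# Sketch of a compressing archive_command script (illustrative paths).
# Invoked by PostgreSQL as: archive_command = '/path/to/script %p %f'
set -e
WAL_PATH="$1"   # %p: segment path relative to the data directory
WAL_FILE="$2"   # %f: segment file name only
ARCHIVE="/Volumes/SQL/postgres/Archive/current"   # rotated for a new baseline

# required archiving semantics: never overwrite an existing archive file
[ ! -f "$ARCHIVE/$WAL_FILE.bz2" ] || exit 1

# compress to a temp name, then rename, so a crash can't leave a half file
bzip2 -c "$WAL_PATH" > "$ARCHIVE/$WAL_FILE.bz2.tmp"
mv "$ARCHIVE/$WAL_FILE.bz2.tmp" "$ARCHIVE/$WAL_FILE.bz2"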

With this debugged and working I restored my salvaged data.

Assumptions (please correct me here):

From my reading I understand the WAL file sequence as recording all
changes to the cluster, such that replaying them will recreate the
cluster in a consistent state as of the time the last WAL file was
written to disk. From this I have extrapolated that no activity
should mean no need to put alterations into a WAL file, i.e.,
minimal information in the WAL files while PostgreSQL is idle.
I am also assuming the compressed size of a WAL file is a rough
measure of its real information content, as opposed to padding. And
I did see compressed WAL files of less than 300 bytes (0.3KB!)
before my data was loaded into the database, and much bigger
(compressed) files while the data was being imported. All very
sensible and understandable.

My surprise is that with the data on board, and nothing
happening, those WAL files now appear to have a lot more real
content, i.e., I am assuming this means the data and/or
configuration are getting altered even when there are no
connections. I know vacuum can change things but surely it
should run out of things to do after a few passes.
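
(One way to check whether vacuum really has run out of work -- a
sketch, "mydb" being whichever database is of interest:)

# when autovacuum last touched each table (pg_stat_user_tables, 9.x)
psql -d mydb -c "SELECT relname, last_autovacuum, last_autoanalyze
                   FROM pg_stat_user_tables
                  ORDER BY last_autovacuum DESC NULLS LAST;"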

This is a snapshot of the WAL archive files. There is a new file
every ten minutes. No data has been changed for 12 hours, and no
connections have been open to the database for at least 30 minutes.
There are no automatic processes other than the default vacuum
settings, etc. The archiver always seems to be 6-8 files behind.


dir:Archive postgres$ ls -ls *21 active_wal_files/
20121121:
total 351744
...
584 -rw-------+ 1 postgres _postgres 295904 Nov 24 10:46 000000010000000100000087.bz2
232 -rw-------+ 1 postgres _postgres 116480 Nov 24 10:56 000000010000000100000088.bz2
3104 -rw-------+ 1 postgres _postgres 1585487 Nov 24 11:06 000000010000000100000089.bz2
1224 -rw-------+ 1 postgres _postgres 624755 Nov 24 11:16 00000001000000010000008A.bz2
744 -rw-------+ 1 postgres _postgres 377394 Nov 24 11:26 00000001000000010000008B.bz2
80 -rw-------+ 1 postgres _postgres 38856 Nov 24 11:36 00000001000000010000008C.bz2
32 -rw-------+ 1 postgres _postgres 16147 Nov 24 11:46 00000001000000010000008D.bz2
48 -rw-------+ 1 postgres _postgres 21710 Nov 24 11:56 00000001000000010000008E.bz2
6816 -rw-------+ 1 postgres _postgres 3487066 Nov 21 07:56 base_archive_20121121_075602.tbz

active_wal_files/:
total 131080
8 -rw-------+ 1 postgres _postgres 381 Nov 21 07:56 0000000100000000000000C5.00000020.backup
16384 -rw-------+ 1 postgres _postgres 8388608 Nov 24 11:46 00000001000000010000008D
16384 -rw-------+ 1 postgres _postgres 8388608 Nov 24 11:56 00000001000000010000008E
16384 -rw-------+ 1 postgres _postgres 8388608 Nov 24 10:46 00000001000000010000008F
16384 -rw-------+ 1 postgres _postgres 8388608 Nov 24 10:56 000000010000000100000090
16384 -rw-------+ 1 postgres staff 8388608 Nov 24 11:06 000000010000000100000091
16384 -rw-------+ 1 postgres staff 8388608 Nov 24 11:16 000000010000000100000092
16384 -rw-------+ 1 postgres _postgres 8388608 Nov 24 11:26 000000010000000100000093
16384 -rw-------+ 1 postgres _postgres 8388608 Nov 24 11:36 000000010000000100000094
0 drwx------+ 2 postgres staff 170 Nov 24 11:56 archive_status
dir:Archive postgres$


Question: what is behind the idle content in the WAL files?

Thank you for your time.
Regards
Gavan
--
Gavan Schneider <***@galeel.org>
Ph. +61 2 6355 6256 "Pendari" Anarel Road
Mb. +61 405 124 883 Sodwalls NSW 2790
Australia
Tom Lane
2012-11-24 17:18:26 UTC
Post by Gavan Schneider
My surprise is that with the data on board, and nothing
happening, those WAL files now appear to have a lot more real
content, i.e., I am assuming this means the data and/or
configuration are getting altered even when there are no
connections. I know vacuum can change things but surely it
should run out of things to do after a few passes.
In the base configuration, an idle Postgres installation won't generate
any fresh WAL; but this isn't true if you've got hot-standby replication
configured, because of the effect described here:
http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php

The quick-and-dirty patch that was proposed in that thread was later
found to have problems of its own and reverted, so right at the moment
we're back to the behavior Fujii-san described. I'm not sure if anyone
is thinking about other ways to avoid it; a few bytes of WAL every ten
minutes is not the end of the world exactly, especially since hot
standby is typically used in installations that are far from idle.

However, you've got a bigger problem because it appears you're getting
a WAL file switch after each checkpoint, which does *not* happen in a
minimal replication configuration. I suspect you have archive_timeout
activated. You might want to reconsider those settings: the combination
of hot standby and archive timeout seems a bit redundant to me. If
you're relying on a hot standby then you don't really need to worry
about keeping the WAL archive fully up-to-date, and contrarily if you
aren't using a hot standby then you should back off the wal_level
setting to eliminate unnecessary WAL traffic.
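
(A quick way to confirm what is actually in effect -- a sketch, run
against any database:)

psql -c "SELECT name, setting FROM pg_settings
          WHERE name IN ('wal_level', 'archive_mode', 'archive_timeout');"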

regards, tom lane
Gavan Schneider
2012-11-25 12:54:45 UTC
Post by Tom Lane
Post by Gavan Schneider
My surprise is that with the data on board, and nothing
happening, those WAL files now appear to have a lot more real
content, i.e., I am assuming this means the data and/or
configuration are getting altered even when there are no
connections. I know vacuum can change things but surely it
should run out of things to do after a few passes.
In the base configuration, an idle Postgres installation won't generate
any fresh WAL; but this isn't true if you've got hot-standby replication
http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php
....
Post by Tom Lane
However, you've got a bigger problem because it appears you're getting
a WAL file switch after each checkpoint, which does *not* happen in a
minimal replication configuration. I suspect you have archive_timeout
activated.
Correct
Post by Tom Lane
You might want to reconsider those settings: the combination
of hot standby and archive timeout seems a bit redundant to me. If
you're relying on a hot standby then you don't really need to worry
about keeping the WAL archive fully up-to-date, and contrarily if you
aren't using a hot standby then you should back off the wal_level
setting to eliminate unnecessary WAL traffic.
But no hot standby...

My current settings (with some bulk stripped out):

./configure \
--prefix=${VOL} \
--datarootdir=${VOL}/configuration \
--htmldir=${VOL} \
--with-pgport=${PORT} \
--with-perl --with-tcl --with-python \
--with-openssl --with-pam --with-ldap \
--with-bonjour \
--with-libxml --with-libxslt \
--with-system-tzdata=/usr/share/zoneinfo \
--with-wal-segsize=8

#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------
wal_level = archive                     # minimal, archive, or hot_standby
#fsync = on                             # turns forced synchronization on or off
#synchronous_commit = on                # synchronization level;
#wal_sync_method = fsync                # the default is the first option
#full_page_writes = on                  # recover from partial page writes
#wal_buffers = -1                       # min 32kB, -1 sets based on shared_buffers
#wal_writer_delay = 200ms               # 1-10000 milliseconds
#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000
#checkpoint_segments = 3                # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min              # range 30s-1h
#checkpoint_completion_target = 0.5     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s               # 0 disables
archive_mode = on                       # allows archiving to be done
archive_command = '/Volumes/SQL/postgres/Archive/bin/archive_command %p %f'
archive_timeout = 600                   # force a logfile segment switch after this

#------------------------------------------------------------------------------
# REPLICATION
#------------------------------------------------------------------------------
#max_wal_senders = 0                    # max number of walsender processes
#wal_keep_segments = 0                  # in logfile segments, 16MB each; 0 disables
#replication_timeout = 60s              # in milliseconds; 0 disables
#synchronous_standby_names = ''         # standby servers that provide sync rep
#vacuum_defer_cleanup_age = 0           # number of xacts by which cleanup is delayed
#hot_standby = off                      # "on" allows queries during recovery
#max_standby_archive_delay = 30s        # max delay before canceling queries
#max_standby_streaming_delay = 30s      # max delay before canceling queries
#wal_receiver_status_interval = 10s     # send replies at least this often
#hot_standby_feedback = off             # send info from standby to prevent query conflicts

I think my observation of WAL content is explained by Depesz:
<http://www.depesz.com/2011/07/14/write-ahead-log-understanding-postgresql-conf-checkpoint_segments-checkpoint_timeout-checkpoint_warning/>
or <http://tinyurl.com/bpvjv6z>

In summary: I now realize the WAL files are recycled and not zeroed
between rewrites. So what I am seeing is past activity, possibly
overwritten by new activity, and there is always going to be
something in those files from now on. When the server is idle, most
of the file will be old content, with a minimal amount of new
material written over it at the beginning.
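
A quick way to convince yourself of this from the shell (a sketch;
sizes chosen to mimic my 8MB segments):

# an all-zero 8MB "segment" (pure padding) compresses to a few dozen bytes
dd if=/dev/zero bs=1048576 count=8 2>/dev/null | bzip2 -c | wc -c

# a recycled "segment" -- a little new data followed by a stale (here:
# random) tail -- compresses hardly at all, despite the tiny new content
{ dd if=/dev/zero bs=1048576 count=1
  dd if=/dev/urandom bs=1048576 count=7; } 2>/dev/null | bzip2 -c | wc -c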

Therefore I probably need to turn off the archive_timeout and
let each WAL file fill with relevant new material before it cycles.

This leaves me with the need to find the best settings to allow
the WAL files to be properly filled, but not lose (too much)
data if the machine goes down before the WAL file is rotated.

Since this is mostly a low volume setup I figure a smaller WAL
file is sensible.

Therefore I wonder if I should just go down to 4MB, no
archive_timeout, and accept the (max) 4MB of updated pages as my
(worst case) loss if everything should go wrong at once? And,
maybe do the segment downsize when a new version needs
installing :)
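
(For the record, the segment size is fixed at build time, so the
downsize does have to wait for a rebuild; roughly:)

# WAL segment size is a compile-time option (power of 2, 1-64 MB);
# changing it means a rebuild plus initdb of a fresh cluster
./configure --with-wal-segsize=4 ...    # other flags as before
make && make install
# then initdb a new cluster and restore the data into it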

Regards
Gavan
Tom Lane
2012-11-25 19:36:43 UTC
Post by Gavan Schneider
Post by Tom Lane
In the base configuration, an idle Postgres installation won't generate
any fresh WAL; but this isn't true if you've got hot-standby replication
http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php
....
You might want to reconsider those settings: the combination
of hot standby and archive timeout seems a bit redundant to me. If
you're relying on a hot standby then you don't really need to worry
about keeping the WAL archive fully up-to-date, and contrarily if you
aren't using a hot standby then you should back off the wal_level
setting to eliminate unnecessary WAL traffic.
But no hot standby...
Yeah, I took a second look and realized that you'd get this effect even
without hot_standby --- it's the alternation of checkpoint records and
forced segment switches from archive_timeout that creates the effect.

If you're concerned about minimizing WAL traffic at idle, you might want
to think about setting up a streaming replication arrangement instead of
relying on archive segment switches. archive_timeout is really kind
of a hack.
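
(For reference, the minimal moving parts of a 9.1/9.2-era streaming
setup are roughly as follows; host, user, and address are placeholders:)

# on the primary, postgresql.conf:
wal_level = archive                 # archive suffices for streaming alone
max_wal_senders = 1

# on the primary, pg_hba.conf -- allow the standby to connect:
host  replication  repuser  192.0.2.10/32  md5

# on the standby, recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=primary.example port=5432 user=repuser'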

regards, tom lane
Kevin Grittner
2012-11-25 17:27:38 UTC
Post by Gavan Schneider
Therefore I wonder if I should just go down to 4MB, no
archive_timeout, and accept the (max) 4MB of updated pages as my
(worst case) loss if everything should go wrong at once? And,
maybe do the segment downsize when a new version needs
installing :)
You might want to consider grabbing the source for pg_clearxlogtail
from pgfoundry and piping through that followed by gzip in your
archive script. An "empty" WAL file tends to be about 16KB that
way, rather than (for us) ranging between 4MB and 8MB. That way you
can keep the archive_timeout fairly short.
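
In archive_command form the pipeline is just (a sketch; the /archive
path is a placeholder -- pg_clearxlogtail reads a segment on stdin
and writes it to stdout with the unused tail zeroed):

archive_command = 'pg_clearxlogtail < %p | gzip > /archive/%f.gz'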

-Kevin
Gavan Schneider
2012-11-26 06:07:33 UTC
Tom & Kevin

Thank you both for your inputs
Post by Tom Lane
If you're concerned about minimizing WAL traffic at idle, you might want
to think about setting up a streaming replication arrangement instead of
relying on archive segment switches. archive_timeout is really kind
of a hack.
You might want to consider grabbing the source for pg_clearxlogtail
from pgfoundry and piping through that followed by gzip in your
archive script. An "empty" WAL file tends to be about 16KB that
way, rather than (for us) ranging between 4MB and 8MB. That way you
can keep the archive_timeout fairly short.
3096 -rw-------+ 1 postgres _postgres 1583556 Nov 26 13:46 0000000100000001000001B9.bz2
1216 -rw-------+ 1 postgres _postgres  619973 Nov 26 13:56 0000000100000001000001BA.bz2
 736 -rw-------+ 1 postgres _postgres  375341 Nov 26 14:06 0000000100000001000001BB.bz2
  80 -rw-------+ 1 postgres _postgres   37941 Nov 26 14:16 0000000100000001000001BC.bz2
  88 -rw-------+ 1 postgres _postgres   42718 Nov 26 14:26 0000000100000001000001BD.bz2
  48 -rw-------+ 1 postgres _postgres   21046 Nov 26 14:36 0000000100000001000001BE.bz2
 584 -rw-------+ 1 postgres _postgres  298291 Nov 26 14:46 0000000100000001000001BF.bz2
 240 -rw-------+ 1 postgres _postgres  120852 Nov 26 14:56 0000000100000001000001C0.bz2
3096 -rw-------+ 1 postgres _postgres 1583557 Nov 26 15:06 0000000100000001000001C1.bz2
   8 -rw-------+ 1 postgres _postgres     161 Nov 26 15:26 0000000100000001000001C2.bz2
   8 -rw-------+ 1 postgres _postgres     158 Nov 26 15:26 0000000100000001000001C3.bz2
   8 -rw-------+ 1 postgres _postgres     162 Nov 26 15:36 0000000100000001000001C4.bz2
   8 -rw-------+ 1 postgres _postgres     157 Nov 26 15:46 0000000100000001000001C5.bz2
   8 -rw-------+ 1 postgres _postgres     163 Nov 26 15:56 0000000100000001000001C6.bz2
   8 -rw-------+ 1 postgres _postgres     162 Nov 26 16:06 0000000100000001000001C7.bz2
   8 -rw-------+ 1 postgres _postgres     165 Nov 26 16:16 0000000100000001000001C8.bz2
   8 -rw-------+ 1 postgres _postgres     165 Nov 26 16:26 0000000100000001000001C9.bz2

PostgreSQL server idle throughout, and spitting out a WAL file
for archive every 10 minutes:
archive_timeout = 600                   # force a logfile segment switch after this

Impressive what happens when pg_clearxlogtail gets into the mix
halfway through! :)


Basically this resolves my original question about WAL output
when the system is idle... there isn't any. And Kevin's WAL-tail
docker nicely isolates the good data from the rubbish. I can now
move forward exploring options for incremental uploads to an
internet (i.e., well off-site) data repository, since the data
traffic will only carry stuff that actually matters.


Getting philosophical...

.... and taking Tom at his word that "archive_timeout is really
kind of a hack"
.... and noting Kevin's pg_clearxlogtail is something of the
counter hack,

Would the universe as we know it be upset if there was a
postgresql.conf option such as:

archive_zero_fill_on_segment_switch = on|off # default off

This would achieve the idle compression result much more
elegantly (I know, it's still a hack) for those who have the
need, without, as far as I can tell, breaking anything else.

Regards
Gavan Schneider
Kevin Grittner
2012-11-26 13:15:45 UTC
Post by Kevin Grittner
You might want to consider grabbing the source for
pg_clearxlogtail from pgfoundry and piping through that followed
by gzip in your archive script. An "empty" WAL file tends to be
about 16KB that way, rather than (for us) ranging between 4MB
and 8MB. That way you can keep the archive_timeout fairly short.
  240 -rw-------+ 1 postgres _postgres  120852 Nov 26 14:56 0000000100000001000001C0.bz2
 3096 -rw-------+ 1 postgres _postgres 1583557 Nov 26 15:06 0000000100000001000001C1.bz2
    8 -rw-------+ 1 postgres _postgres     161 Nov 26 15:26 0000000100000001000001C2.bz2
    8 -rw-------+ 1 postgres _postgres     158 Nov 26 15:26 0000000100000001000001C3.bz2
PostgreSQL server idle throughout, and spitting out a WAL file
for archive every 10 minutes
Impressive what happens when pg_clearxlogtail gets into the mix
halfway through! :)
I think I can spot where you changed the script.  :-)

Impressive that bzip2 does two orders of magnitude better than gzip
with this. In my benchmarks I found that piping WAL files through
pg_clearxlogtail followed by gzip was consistently faster than just
gzip -- even when the file was full. I put it down to a difference
in how buffering was handled. I'll have to check the performance of
bzip2 compared to gzip.
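
(Easy enough to reproduce against any full segment, something like:)

SEG=/pgdata/pg_xlog/000000010000000100000089    # path is illustrative
time gzip -c < "$SEG" > /dev/null
time sh -c "pg_clearxlogtail < $SEG | gzip -c > /dev/null"
time bzip2 -c < "$SEG" > /dev/null
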
Would the universe as we know it be upset if there was a
postgresql.conf option such as:
 archive_zero_fill_on_segment_switch = on|off # default off
This would achieve the idle compression result much more
elegantly (I know, it's still a hack) for those who have the
need, without, as far as I can tell, breaking anything else.
The problem is that this would involve 16MB of writes to the OS for
every WAL-file switch, which might even result in that much actual
disk activity in many environments. The filter I wrote doesn't add
any disk I/O and (as I mentioned above), actually decreased CPU
time, at least in my benchmarks on SLES 10. I guess in environments
where the hit wouldn't be noticed and the WAL files were being
compressed it might be convenient, but the difference between that
and piping through the extra filter in the archive script is
minimal enough to make it questionable whether it is worth it.

-Kevin
Gavan Schneider
2012-11-26 23:57:21 UTC
Post by Kevin Grittner
Impressive that bzip2 does two orders of magnitude better than gzip
with this. ...
bzip2 has a few bytes of overhead for each additional large
block of the original file, so some or all of the difference may
only reflect my 8MB vs the usual 16MB WAL files. And this is
not a real saving until the results are put into a tarball, since
each tiny/not-so-tiny file still consumes an inode and a disk segment.

The differences that matter are the partly filled WAL files with
real data... I didn't test; rather, I'm going on my experience that
bzip2 distribution archives are always smaller than their
gzip alternatives. And bzip2 was available.
Post by Kevin Grittner
Post by Gavan Schneider
Would the universe as we know it be upset if there was a
 archive_zero_fill_on_segment_switch = on|off # default off
This would achieve the idle compression result much more
elegantly (I know, it's still a hack) for those who have the
need, without, as far as I can tell, breaking anything else.
The problem is that this would involve 16MB of writes to the OS for
every WAL-file switch, which might even result in that much actual
disk activity in many environments. The filter I wrote doesn't add
any disk I/O ...
Point taken.


More musings...

Maybe an optimisation that could work for many would be for the
initial block of the WAL file to carry information about how much
of the file holds useful data, with short (i.e., valid-data-only)
WAL files acceptable to postgres on restore/replication.
(Obviously this is not for the cluster's pool of
working/overwritten WAL files.)

During normal operations, while the WAL file is being overwritten,
the proposed flag in the initial file segment would be set to
zero to indicate that all page segments have to be checked on
replay (i.e., existing behavior in crash recovery); but at WAL
switchover the first file segment would be updated so the flag
specifies the extent of valid data that follows. Only the
indicated data would be restored/replicated when the file is read
by postgres. Tools such as pg_clearxlogtail would then only need
to inspect the first part of the WAL file, calculate the
end-of-valid-data offset, and copy exactly that much to output.
This would save reading/checking the rest of the WAL file and
outputting the padding zeros.
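
To make that concrete with an entirely hypothetical sketch (no such
header field exists today, and read_valid_pages is invented purely
for illustration), the archive step would reduce to copying a known
prefix:

# hypothetical: the first WAL page records how many 8KB pages hold
# valid data; archiving then copies just that prefix
VALID_PAGES=$(read_valid_pages "$WAL_PATH")     # invented helper
dd if="$WAL_PATH" bs=8192 count="$VALID_PAGES" 2>/dev/null \
    | bzip2 -c > "$ARCHIVE/$WAL_FILE.bz2"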

The big world people could benefit since smaller update files
(esp. if compressed) can move around a network a lot faster in
replication environments.

AFAIK the downside would be one extra disc write per WAL
changeover. The worst case scenario is only the status quo,
i.e., the whole WAL file has to be processed since it has been
filled with data. Note in this case you may as well leave the
proposed flag in the first file segment as zero (no extra write
needed) since this correctly indicates the whole file has to be processed.

The upside, at least for those who need to roll out incomplete
WAL files to satisfy timing needs, is that they could work with
files that are only as big as they need to be.

And a postgresql.conf switch could isolate the cost of that
minor extra file update per WAL changeover to those who can
benefit from it.

Worth any further thoughts?

Regards
Gavan Schneider