The Write Ahead Log: Essentials

Updated: 08/08/2022

By: Joshua D. Drake

What is WAL

WAL (Write Ahead Log) is the mechanism that Postgres uses to implement durability (the D in ACID) of data changes in the face of a system crash. WAL is also a critical component for Postgres to provide binary replication, Point in Time Recovery as well as online binary backups.

The central philosophy of Postgres durability is that data must be written to the WAL and marked as written to permanent storage before the data is written to data pages. The guarantee of this durability is accomplished by the fsync system call or equivalent mechanism and is controlled by three primary postgresql.conf parameters: fsync, wal_sync_method, and synchronous_commit. The benefit of this process is that Postgres does not need to write data pages to permanent storage on every transaction commit because in the event of a crash we will be able to recover Postgres to a consistent state using the WAL. If a crash occurs, pages (data) that have not yet been written but are consistent will be recovered from the WAL and written to the data pages.

Note: WAL files reside in the directory $PGDATA/pg_wal and data pages reside under the directory structure $PGDATA/base/$database/$datapages

A performance benefit of the WAL design is a significant reduction in the number of disk flushes (writes) as the WAL only needs to be flushed to permanent storage at the time of transaction commit. In a multiuser environment, commits of many transactions may be accomplished with a single flush of the WAL file. As the log file is written sequentially, the cost of the flushing is much less than that cost of writing to the data pages. This is because data pages are written in a random-write fashion. This performance benefit is especially true in instances where you have a lot of small transactions working on different parts of the database. Fortunately, with the prevalence of memory based permanent storage (SSD, NVME, etc…) the benefit of WAL being written sequentially is no longer often realized.

Configuring WAL

The five primary parameters for configuring WAL are wal_level, fsync, wal_sync_method, synchronous_commit, and full_page_writes.

wal_level

The wal_level determines the level of information that will be written to the WAL. Different levels provide different feature capabilities within Postgres. The three levels of WAL are minimal, replica and logical. It requires a Postgres full restart to change this parameter.

minimal

This is the minimum level of WAL which only keeps enough information to recover from a crash. You will be unable to use the Binary Replication, Logical Replication or Point In Time Recovery features of Postgres with wal_level configured to this setting. Use this setting only when working locally as you significantly limit further capabilities. A benefit of this setting is that the WAL will record less information and thus require less I/O than the other two settings.

replica

This level provides enough information for all Binary Replication and Point in Time Recovery but not Logical Replication.

logical

This level records the most information to the WAL. It also enables the feature Logical Replication and Logical Decoding. If you have the I/O it is recommended that you use this setting as a default. Logical Decoding allows the use of external capabilities such as replicating from Postgres or WAL2JSON.

fsync

This parameter determines if Postgres will call an fsync() on every commit of a transaction.

The options for this parameter are [on|off]. Leave this on at all times. Turning this parameter off will eventually lead to data corruption as Postgres will have zero guarantees if the data being written to disk properly.

wal_sync_method

This is the type of fsync() that will be called. Generally speaking it is safe to leave this alone as the initialization of the Postgres cluster by initdb will perform a quick test to see which one is the most efficient. You can also test your platform for the best setting by using pg_test_sync. When using Linux derived operating systems, open_datasync is generally the most efficient.

Example use of pg_test_sync

The following is the output of using pg_test_sync on a laptop running Ubuntu 22.04 with an NVME drive. It is a consumer level drive so it is nowhere near as fast as what you would expect in a server based installation. Of note is that the test uses 8kB and 16kB sizes for writes. This is because PostgreSQL shared_buffers and page size are 8kB. The higher the ops/sec the better.

./pg_test_fsync -f /tmp/pg_test_sync.out

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
       open_datasync                      3127.328 ops/sec     320 usecs/op
       fdatasync                          2236.094 ops/sec     447 usecs/op
       fsync                              1972.384 ops/sec     507 usecs/op
       fsync_writethrough                              n/a
       open_sync                          1962.019 ops/sec     510 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
       open_datasync                      1390.982 ops/sec     719 usecs/op
       fdatasync                          2160.199 ops/sec     463 usecs/op
       fsync                              1915.200 ops/sec     522 usecs/op
       fsync_writethrough                              n/a
       open_sync                           957.680 ops/sec    1044 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
        1 * 16kB open_sync write          1930.955 ops/sec     518 usecs/op
        2 *  8kB open_sync writes          969.495 ops/sec    1031 usecs/op
        4 *  4kB open_sync writes          469.447 ops/sec    2130 usecs/op
        8 *  2kB open_sync writes          234.476 ops/sec    4265 usecs/op
       16 *  1kB open_sync writes          111.828 ops/sec    8942 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
       write, fsync, close                1909.474 ops/sec     524 usecs/op
       write, close, fsync                1910.523 ops/sec     523 usecs/op

Non-sync'ed 8kB writes:
       write                            876074.096 ops/sec       1 usecs/op


The Non-sync’ed result can be tempting. It means that an fsync() was not called after each write and is the effective result of running with fsync = off. It is critical that one understand that this is not a safe way to operate PostgreSQL, you will eventually corrupt your database.

sychronous_commit

This parameter determines if Postgres is going to write to WAL in a synchronous or asynchronous fashion. There are multiple valid options: remote_apply, on (the default), remote_write, local, and off.

We are only going to discuss the options of on or off. The other options are for an environment where Postgres is utilizing replication. The most significant user-visible difference when writing data in synchronous (on) or asynchronous (off) fashion is two-fold. First is performance: writing synchronously is slower, however it guarantees that your data is written to permanent storage at transaction commit. While synchronous_commit is enabled, the Postgres backend that is submitting data will wait for a return of TRUE that the WAL process has flushed the data to disk. When synchronous_commit is off, the Postgres process will only wait for the WAL process to confirm that it has the data, not that it has been written to disk. Although rare, if you write asynchronously there is an opportunity that you may lose data that was considered committed by your application on a crash of the Postgres server.

Consistency

Whether synchronous_commit is on or off does not affect the consistency or validity of your data. This means that even in the event of a database crash all of your foreign keys, rows, index entries etc.. are protected from violation or corruption by the WAL.

Full page writes

By default and for data safety reasons, Postgres will write the complete page to WAL if it is the first modification of the page. This is necessary because if Postgres crashes during a page write, Postgres may end up with a page that contains old and new data. The data stored in WAL if full_page_writes is off will not be enough to properly restore the page during crash recovery. This parameter is similar to fsync in that turning it off will increase performance with inevitable data corruption and worse, it may be silent corruption.

Checkpoints

A checkpoint is a point in the transaction sequence where Postgres will write all data in memory and changes that have been reflected in WAL to the data files. In a practical sense, the checkpoint process is used to help manage the size of WAL and to ensure that at some point in time, data is guaranteed to be written to the data files.

There are four (of six) primary parameters that control checkpoint behavior. They are checkpoint_timeout, checkpoint_completion_target, max_wal_size, and min_wal_size.

checkpoint_timeout

This is the total time between an automatic checkpoint. It has a valid range of 30 seconds to one day. The default is five minutes. If you have an application that utilizes a lot of writes (insert/update/delete) it is almost always best to change the default. A reasonable recommendation is to set this to 30 to 60 minutes. However, increasing this parameter will increase the WAL size on disk, potentially causing an overflow of max_wal_size and thus forcing a checkpoint.

checkpoint_completion_timeout

This is the target of completing a checkpoint as a fraction of time between checkpoints. The default is 0.9 and should not be changed. In earlier versions of Postgres the default was 0.5 and almost universally caused performance degradation as you would essentially perform two checkpoints between checkpoint_timeout. Decreasing this parameter will increase your overall I/O load.

max_wal_size

This is the total amount of disk space that WAL may consume before a checkpoint is forced. The default is 1GB. This is a soft limit as there are a number of parameters and workload conditions that can cause the WAL to exceed the value without causing a checkpoint. Some of this include a failing archive_command, heavy server load, or use of the deprecated wal_keep_segments setting. If you have available storage this is a reasonable setting to increase. However, without proper tuning of the wal writer this can cause an unreasonable amount of data to be written during a checkpoint which may cause performance issues.

Note: Increasing the size of max_wal_size will increase recovery times if Postgres crashes

min_wal_size

Postgres will recycle WAL files for future use during a checkpoint rather than remove them if the storage space used for WAL is below this size on disk. This can help reduce unexpected spikes in WAL usage. For example, when running long transactions with heavy writes (batch jobs). If you specify a value without units, it is interpreted as megabytes. The default is 80MB. If your transactional load often contains heavy writes within a single transaction, increasing this parameter may help your overall write performance.

References

All of the information found in this article can also be found within the official Postgres documentation: http://www.postgresql.org/docs/current/static/wal-configuration.html and links therein.

The original author of this blog was Alvaro Herrera. Substantial changes have been made since then.