Tuesday, 21 May 2019

You should NEVER use pg_resetwal / pg_resetxlog

I realise it's a blanket statement to say NEVER to use pg_resetwal / pg_resetxlog, but in almost 100% of cases this will be true.  One common reason cited is:

"My disk filled up and my database crashed, so I ran pg_resetwal to clear some space."





If you have seen this:

postgres=# INSERT INTO full_disk SELECT generate_series(1,1000000);
PANIC:  could not write to file "pg_wal/xlogtemp.11818": No space left on device
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.


or this:

2019-05-21 13:30:20.080 BST [11818] PANIC:  could not write to file "pg_wal/xlogtemp.11818": No space left on device
2019-05-21 13:30:20.080 BST [11818] STATEMENT:  insert into full_disk select generate_series(1,1000000);
2019-05-21 13:30:20.100 BST [11708] LOG:  server process (PID 11818) was terminated by signal 6: Aborted
2019-05-21 13:30:20.100 BST [11708] DETAIL:  Failed process was running: insert into full_disk select generate_series(1,1000000);
2019-05-21 13:30:20.100 BST [11708] LOG:  terminating any other active server processes
2019-05-21 13:30:20.101 BST [11715] WARNING:  terminating connection because of crash of another server process
2019-05-21 13:30:20.101 BST [11715] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-05-21 13:30:20.101 BST [11715] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-05-21 13:30:20.103 BST [11708] LOG:  all server processes terminated; reinitializing
2019-05-21 13:30:20.117 BST [11849] LOG:  database system was interrupted; last known up at 2019-05-21 13:27:43 BST
2019-05-21 13:30:20.119 BST [11850] FATAL:  the database system is in recovery mode
2019-05-21 13:30:20.177 BST [11849] LOG:  database system was not properly shut down; automatic recovery in progress
2019-05-21 13:30:20.179 BST [11849] LOG:  redo starts at 0/16B0250
2019-05-21 13:30:20.468 BST [11849] LOG:  redo done at 0/1FFFF88
2019-05-21 13:30:20.468 BST [11849] LOG:  last completed transaction was at log time 2019-05-21 13:29:41.186055+01
2019-05-21 13:30:20.500 BST [11849] PANIC:  could not write to file "pg_wal/xlogtemp.11849": No space left on device
2019-05-21 13:30:20.522 BST [11708] LOG:  startup process (PID 11849) was terminated by signal 6: Aborted
2019-05-21 13:30:20.522 BST [11708] LOG:  aborting startup due to startup process failure

2019-05-21 13:30:20.527 BST [11708] LOG:  database system is shut down



DO NOT RUN pg_resetwal!


This will result in a corrupted database and lost data.  To fix a full disk situation you should try these, in order of preference:

1) Remove non-PostgreSQL files on the filesystem
2) Add storage to the filesystem
3) Move/remove files in the "pg_log" or "log" directory, after confirming they are all postgresql logs, and not a part of the database itself.

... then restart the database and will do crash recovery to ensure you once again have a consistent database.  Upon startup you will see crash recovery happen, and the database will then come up:

2019-05-21 13:35:02.361 BST [11977] LOG:  database system shutdown was interrupted; last known up at 2019-05-21 13:30:20 BST
2019-05-21 13:35:02.379 BST [11977] LOG:  database system was not properly shut down; automatic recovery in progress
2019-05-21 13:35:02.380 BST [11977] LOG:  redo starts at 0/16B0250
2019-05-21 13:35:02.678 BST [11977] LOG:  redo done at 0/1FFFF88
2019-05-21 13:35:02.678 BST [11977] LOG:  last completed transaction was at log time 2019-05-21 13:29:41.186055+01

2019-05-21 13:35:02.734 BST [11973] LOG:  database system is ready to accept connections


The same goes for deleting files in the pg_wal / pg_log directories, which happened so frequently that it was the motivation to rename everything from "xlog" to "wal".

The only reason you should ever consider running pg_resetwal is if you have an already corrupted database, and you have no backups available.  This will allow you to start the database, and pg_dump what's left... with the understanding that it may not be consistent.

Even if you have an already corrupt database, and you've decided that pg_resetwal is the one and only solution... ALWAYS take a file-level backup (tar, etc). of the database and all related files (tablespaces, wal) before doing anything.  It's a one-way command, that can actually make things worse.

New PostgreSQL RDS versions - v10.14 , v11.9, and v12.4

It appears that new patch versions of PostgreSQL RDS are being rolled out in AWS.. so we should expect an announcement soon: REGION: us-east...