Wednesday 28 March 2018

Don't use "kill" on PostgreSQL... or you're gonna have a bad time!

Have you ever been tempted to "kill -9a non-behaving PostgreSQL client session that won't die via "pg_cancel_backend()" or "pg_terminate_backend()"?  Let's just say that's at first you may believe (incorrectly) that it will affect only the single session you've killed, but in reality it's an incredibly bad idea... as the result will be ALL of your client sessions being reset and the database crashing.





Here's an example.  Let's say there is one session you need to kill, so you use kill -9:

postgres=# select * from pg_stat_activity;
 datid | datname  |  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event | state  | backend_xid | backend_xmin |              query
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+--------+-------------+--------------+---------------------------------
 13323 | postgres | 15658 |       10 | postgres | psql             |             |                 |          -1 | 2018-03-28 16:46:27.924226+00 | 2018-03-28 16:46:34.663799+00 | 2018-03-28 16:46:34.663799+00 | 2018-03-28 16:46:34.663808+00 |                 |            | active |             |         1757 | select * from pg_stat_activity;
(1 row)


postgres=# \! kill -9 15658


The postmaster detects that a process has exited abnormally:

16:46:44 UTC [15642]: LOG:  server process (PID 15658) was terminated by signal 9: Killed
16:46:44 UTC [15642]: DETAIL:  Failed process was running: select * from pg_stat_activity;


And even though it wasn't in a transaction and was only running a simple select, the postmaster terminates "any other active service processes":


16:46:44 [15642]: LOG:  terminating any other active server processes


The end result is that the entire PostgreSQL cluster/server resets itself, and does crash recovery to cleanup.  The time taken to do this can be significant, depending on how busy your database is and how long it's been since the last checkpoint.

16:46:44 [15648]: DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
16:46:44 [15648]: HINT:  In a moment you should be able to reconnect to the database and repeat your command.
16:46:44 [15642]: LOG:  archiver process (PID 15649) exited with exit code 1
16:46:44 [15642]: LOG:  all server processes terminated; reinitializing
16:46:44 [15661]: LOG:  database system was interrupted; last known up at 2018-03-28 16:46:16 UTC
16:46:44 [15661]: LOG:  database system was not properly shut down; automatic recovery in progress
16:46:44 [15661]: LOG:  redo starts at 0/155C230
16:46:44 [15661]: LOG:  invalid record length at 0/155C2A0: wanted 24, got 0
16:46:44 [15661]: LOG:  redo done at 0/155C268
16:46:44 [15661]: LOG:  checkpoint starting: end-of-recovery immediate
16:46:45 [15661]: LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.030 s, sync=0.000 s, total=0.035 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
16:46:45 [15661]: LOG:  MultiXact member wraparound protections are now enabled
16:46:45 [15665]: LOG:  autovacuum launcher started

16:46:45 [15642]: LOG:  database system is ready to accept connections


So in short, if you can't kill a session using the standard PostgreSQL Server Signalling Functions, then the only sensible option is to issue a restart during a maintenance window:

psql -c "checkpoint;"
pg_ctl restart -m fast

... or if that still won't work, then the last resort is:

pg_ctl restart -m immediate


Finally, it may seem like a more complicated process to follow when your database has gone awry, but if you don't use the PostgreSQL supported methods then your users will end up in a heap... just like they used to back in the Ingres days:



Stay safe, and keep your fingers off the trigger!

Greg.

No comments:

Post a Comment

New PostgreSQL RDS versions - v10.14 , v11.9, and v12.4

It appears that new patch versions of PostgreSQL RDS are being rolled out in AWS.. so we should expect an announcement soon: REGION: us-east...