As a PostgreSQL Database Administrator, there are the everyday expectations to check on backups, apply DDL changes, make sure the logs don’t have any game breaking ERROR’s, and answer panicked calls from developers who’s reports are running twice as long as normal and they have a meeting in ten minutes.
Even with a good understanding of the health of managed databases, there will always be new cases and new issues popping up relating to performance and how the database “feels”. Whether it’s a panicked email, or an open ticket for “the database feels slow”, this common task can generally be followed with a few steps to check whether or not there is a problem with PostgreSQL, and what that problem may be.
This is by no extent an exhaustive guide, nor do the steps need to be done in any specific order. But it’s rather a set of initial steps that can be taken to help find the common offenders quickly, as well as gain new insight as to what the issue may be. A developer may know how the application acts and responds, but the Database Administrator knows how the database acts and responds to the application, and together, the issue can be found.
NOTE: The queries to be executed should be done as a superuser, such as ‘postgres’ or any database user granted the superuser permissions. Limited users will either be denied or have data omitted.
Step 0 - Information Gathering
Get as much information as possible from whoever says the database seems slow; specific queries, applications connected, timeframes of the performance slowness, etc. The more information they give the easier it will be to find the issue.
Step 1 - Check pg_stat_activity
The request may come in many different forms, but if “slowness” is the general issue, checking pg_stat_activity is the first step to understand just what’s going on. The pg_stat_activity view (documentation for every column in this view can be found here) contains a row for every server process / connection to the database from a client. There is a handful of useful information in this view that can help.
NOTE: pg_stat_activity has been known to change structure over time, refining the data it presents. Understanding of the columns themselves will help build queries dynamically as needed in the future.
Notable columns in pg_stat_activity are:
- query: a text column showing the query that’s currently being executed, waiting to be executed, or was last executed (depending on the state). This can help identify what query / queries a developer may be reporting are running slowly.
- client_addr: The IP address for which this connection and query originated from. If empty (or Null), it originated from localhost.
- backend_start, xact_start, query_start: These three provide a timestamp of when each started respectively. Backend_start represents when the connection to the database was established, xact_start is when the current transaction started, and query_start is when the current (or last) query started.
- state: The state of the connection to the database. Active means it’s currently executing a query, ‘idle’ means it’s waiting further input from the client, ‘idle in transaction’ means it’s waiting for further input from the client while holding an open transaction. (There are others, however their likelihood is rare, consult the documentation for more information).
- datname: The name of the database the connection is currently connected to. In multiple database clusters, this can help isolate problematic connections.
- wait_event_type and wait_event: These columns will be null when a query isn’t waiting, but if it is waiting they will contain information on why the query is waiting, and exploring pg_locks can identify what it’s waiting on. (PostgreSQL 9.5 and before only has a boolean column called ‘waiting’, true if waiting, false if not.
1.1. Is the query waiting / blocked?
If there is a specific query or queries that are “slow” or “hung”, check to see if they are waiting for another query to complete. Due to relation locking, other queries can lock a table and not let any other queries to access or change data until that query or transaction is done.
PostgreSQL 9.5 and earlier:
SELECT * FROM pg_stat_activity WHERE waiting = TRUE;
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;
PostgreSQL 10 and later (?):
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL AND backend_type = 'client backend';
The results of this query will show any connections currently waiting on another connection to release locks on a relation that is needed.
If the query is blocked by another connection, there are some ways to find out just what they are. In PostgreSQL 9.6 and later, the function pg_blocking_pids() allows the input of a process ID that’s being blocked, and it will return an array of process ID’s that are responsible for blocking it.
PostgreSQL 9.6 and later:
SELECT * FROM pg_stat_activity WHERE pid IN (SELECT pg_blocking_pids(<pid of blocked query>));
PostgreSQL 9.5 and earlier:
SELECT blocked_locks.pid AS blocked_pid, blocked_activity.usename AS blocked_user, blocking_locks.pid AS blocking_pid, blocking_activity.usename AS blocking_user, blocked_activity.query AS blocked_statement, blocking_activity.query AS current_statement_in_blocking_process FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED;
(Available from the PostgreSQL Wiki).
These queries will point to whatever is blocking a specific PID that’s provided. With that, a decision can be made to kill the blocking query or connection, or let it run.
Step 2 - If the queries are running, why are they taking so long?
2.1. Is the planner running queries efficiently?
If a query (or set of queries) in question has the status of ‘active’, then it’s actually running. If the whole query isn’t available in pg_stat_activity, fetch it from the developers or the postgresql log and start exploring the query planner.
EXPLAIN SELECT * FROM postgres_stats.table_stats t JOIN hosts h ON (t.host_id = h.host_id) WHERE logged_date >= '2018-02-01' AND logged_date < '2018-02-04' AND t.india_romeo = 569; Nested Loop (cost=0.280..1328182.030 rows=2127135 width=335) -> Index Scan using six on victor_oscar echo (cost=0.280..8.290 rows=1 width=71) Index Cond: (india_romeo = 569) -> Append (cost=0.000..1306902.390 rows=2127135 width=264) -> Seq Scan on india_echo romeo (cost=0.000..0.000 rows=1 width=264) Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569)) -> Seq Scan on juliet victor_echo (cost=0.000..437153.700 rows=711789 width=264) Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569)) -> Seq Scan on india_papa quebec_bravo (cost=0.000..434936.960 rows=700197 width=264) Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569)) -> Seq Scan on two oscar (cost=0.000..434811.720 rows=715148 width=264) Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
This example shows a query plan for a two table join that also hits a partitioned table. We’re looking for anything that can cause the query to be slow, and in this case the planner is doing several Sequential Scans on partitions, suggesting that they are missing indexes. Adding indexes to these tables for column ‘india_romeo’ will instantly improve this query.
Things to look for are sequential scans, nested loops, expensive sorting, etc. Understanding the query planner is crucial to making sure queries are performing the best way possible, official documentation can be read for more information here.
2.2. Are the tables involved bloated?
If the queries are still feeling slow without the query planner pointing at anything obvious, it’s time to check the health of the tables involved. Are they too big? Are they bloated?
SELECT n_live_tup, n_dead_tup from pg_stat_user_tables where relname = ‘mytable’; n_live_tup | n_dead_tup ------------+------------ 15677 | 8275431 (1 row)
Here we see that there are many times more dead rows than live rows, which means to find the correct rows, the engine must sift through data that’s not even relevant to find real data. A vacuum / vacuum full on this table will increase performance significantly.
Step 3 - Check the logs
If the issue still can’t be found, check the logs for any clues.
FATAL / ERROR messages:
Look for messages that may be causing issues, such as deadlocks or long wait times to gain a lock.
Hopefully log_checkpoints is set to on, which will write checkpoint information to the logs. There are two types of checkpoints, timed and requested (forced). If checkpoints are being forced, then dirty buffers in memory must be written to disk before processing more queries, which can give a database system an overall feeling of “slowness”. Increasing checkpoint_segments or max_wal_size (depending on the database version) will give the checkpointer more room to work with, as well as help the background writer take some of the writing load.
Step 4 - What’s the health of the host system?
If there’s no clues in the database itself, perhaps the host itself is overloaded or having issues. Anything from an overloaded IO chanel to disk, memory overflowing to swap, or even a failing drive, none of these issues would be apparent with anything we looked at before. Assuming the database is running on a *nix based operating system, here are a few things that can help.
4.1. System load
Using ‘top’, look at the load average for the host. If the number is approaching or exceeding the number of cores on the system, it could be simply too many concurrent connections hitting the database bringing it to a crawl to catch up.
load average: 3.43, 5.25, 4.85
4.2. System memory and SWAP
Using ‘free’, check to see if SWAP has been used at all. Memory overflowing to SWAP in a PostgreSQL database environment is extremely bad for performance, and many DBA’s will even eliminate SWAP from database hosts, as an ‘out of memory’ error is more preferable than a sluggish system to many.
If SWAP is being used, a reboot of the system will clear it out, and increasing total system memory or re-configuring memory usage for PostgreSQL (such as lowering shared_buffers or work_mem) may be in order.
[postgres@livedb1 ~]$ free -m total used free shared buff/cache available Mem: 7986 225 1297 12 6462 7473 Swap: 7987 2048 5939
4.3. Disk access
PostgreSQL attempts to do a lot of its work in memory, and spread out writing to disk to minimize bottlenecks, but on an overloaded system with heavy writing, it’s easily possible to see heavy reads and writes cause the whole system to slow as it catches up on the demands. Faster disks, more disks and IO channels are some ways to increase the amount of work that can be done.
Tools like ‘iostat’ or ‘iotop’ can help pinpoint if there is a disk bottleneck, and where it may be coming from.
4.4. Check the logs
If all else fails, or even if not, logs should always be checked to see if the system is reporting anything that’s not right. We already discussed checking the postgresql.logs, but the system logs can give information about issues such as failing disks, failing memory, network problems, etc. Any one of these issues can cause the database to act slow and unpredictable, so a good understanding of perfect health can help find these issues.
Step 5 - Something still not make sense?
Even the most seasoned administrators will run into something new that doesn’t make sense. That’s where the global PostgreSQL community can come in to help out. Much like step #0, the more clear information given to the community, the easier they can help out.
5.1. PostgreSQL Mailing Lists
Since PostgreSQL is developed and managed by the open source community, there are thousands of people who talk through the mailing lists to discuss countless topics including features, errors, and performance issues. The mailing lists can be found here, with pgsql-admin and pgsql-performance being the most important for looking for help with performance issues.
Freenode hosts several PostgreSQL channels with developers and administrators all over the world, and it’s not hard to find a helpful person to track down where issues may be coming from. More information can be found on the PostgreSQL IRC page.