[Postgres] - tips to understand and optimize queries in Postgres
Truong Bui
TABLE OF CONTENTS
Explain analyze
// Source: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121
rows=10 loops=1)
Recheck Cond: (unique1 < 10)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024
rows=10 loops=1)
Index Cond: (unique1 < 10)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual
time=0.021..0.022 rows=1 loops=10)
Index Cond: (unique2 = t1.unique2)
Planning time: 0.181 ms
Execution time: 0.501 ms
Explain:
- cost=0.29..7.91: estimated start-up cost (0.29) and estimated total cost (7.91).
- rows=1: estimated number of rows output by this plan node.
- width=244: estimated average width of rows output by this plan node (in bytes).
- actual time=0.021..0.022: actual start-up time and total time, averaged per loop. The node's total contribution is roughly 0.022 ms × 10 loops ≈ 0.22 ms.
- rows=1: actual rows returned, averaged per loop.
- loops=10: the index scan ran 10 times (once per outer row).
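Beyond ANALYZE, EXPLAIN accepts further options; the sketch below uses the standard VERBOSE, SETTINGS, and FORMAT options against the same sample tables:

```sql
-- VERBOSE adds output column lists, SETTINGS shows planner settings changed
-- from their defaults, and FORMAT JSON produces machine-readable output
-- suitable for plan-visualization tools.
EXPLAIN (ANALYZE, VERBOSE, SETTINGS, FORMAT JSON)
SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
```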
// Source: https://www.postgresql.org/docs/current/using-explain.html
BEGIN;
EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 < 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Update on tenk1 (cost=5.08..230.08 rows=0 width=0) (actual time=3.791..3.792 rows=0 loops=1)
-> Bitmap Heap Scan on tenk1 (cost=5.08..230.08 rows=102 width=10) (actual time=0.069..0.513
rows=100 loops=1)
Recheck Cond: (unique1 < 100)
Heap Blocks: exact=90
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.05 rows=102 width=0) (actual
time=0.036..0.037 rows=300 loops=1)
Index Cond: (unique1 < 100)
Planning Time: 0.113 ms
Execution Time: 3.850 ms
ROLLBACK;
Buffers
// Source: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=25.08..60.21 rows=10 width=244) (actual time=0.323..0.342 rows=10
loops=1)
Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
Buffers: shared hit=15
-> BitmapAnd (cost=25.08..25.08 rows=10 width=0) (actual time=0.309..0.309 rows=0 loops=1)
Buffers: shared hit=7 read=1
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0) (actual time=0.043..0.043
rows=100 loops=1)
Index Cond: (unique1 < 100)
Buffers: shared hit=2
-> Bitmap Index Scan on tenk1_unique2 (cost=0.00..19.78 rows=999 width=0) (actual
time=0.227..0.227 rows=999 loops=1)
Index Cond: (unique2 > 9000)
Buffers: shared hit=5 read=2
Planning time: 0.088 ms
Execution time: 0.423 ms
Explain:
- hit=7: 7 blocks were found in the Postgres shared buffer cache.
- read=1: 1 block had to be read from outside the cache (OS page cache or disk).
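Beyond per-query buffer counts, the cumulative statistics views give a table-level view of cache behavior; a sketch using the standard pg_statio_user_tables view:

```sql
-- Heap-block cache hit ratio per table: hit = found in shared buffers,
-- read = fetched from outside the cache (OS page cache or disk).
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_hit + heap_blks_read DESC;
```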
Scan Strategies
Index
- An index is unlikely to be useful when a condition matches a large fraction of the table's rows (e.g. a boolean flag); in that case a sequential scan is usually cheaper.
- Create an index for each foreign key column: a foreign key constraint does not automatically create an index on the referencing side.
A nested loop join can use that index to optimize the join; we will see this in the join section.
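A minimal sketch (table and column names are illustrative): the REFERENCES constraint below creates no index on the orders side, so the index has to be created explicitly:

```sql
CREATE TABLE orders (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id bigint NOT NULL REFERENCES users (id)  -- FK constraint: no index created here
);

-- Without this, joins on user_id (and cascading deletes on users)
-- fall back to scanning orders.
CREATE INDEX orders_user_id_idx ON orders (user_id);
```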
- A B-tree index can support searches by equality, greater-than, less-than, and BETWEEN conditions.
- Use a functional (expression) index for column transformations.
Given the index below:
CREATE INDEX users_last_name_idx ON users (last_name);
the following query won't be able to take advantage of it:
SELECT * FROM users WHERE lower(last_name) = 'bui';
To make the query indexable, create a functional index on the expression instead:
CREATE INDEX users_lower_last_name_idx ON users (lower(last_name));
- Index for the LIKE operator
We have a query like the one below:
SELECT * FROM users WHERE lower(last_name) LIKE 'bui%';
The problem is that it can't take advantage of a plain index on lower(last_name): under a non-C collation, a B-tree index cannot match LIKE prefix patterns.
To fix this, create the index with the text_pattern_ops operator class:
CREATE INDEX users_last_name_idx ON users (lower(last_name) text_pattern_ops);
You can read more about it at Index opclass
- Column order matters in a compound (multicolumn) index; the first column is the most important.
An index on (a, b, c) can be used when we search on a, (a, b), (a, b, c), and even (a, c) (c is checked as a filter inside the index), but not efficiently on b alone or on (b, c).
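As a sketch (table and column names hypothetical), with an index on (a, b, c):

```sql
CREATE INDEX t_a_b_c_idx ON t (a, b, c);

SELECT * FROM t WHERE a = 1;              -- can use the index
SELECT * FROM t WHERE a = 1 AND b = 2;    -- can use the index
SELECT * FROM t WHERE a = 1 AND c = 3;    -- scans on a; c is filtered inside the index
SELECT * FROM t WHERE b = 2;              -- cannot use the index efficiently
SELECT * FROM t WHERE b = 2 AND c = 3;    -- cannot use the index efficiently
```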
- Covering indexes
Create an index using INCLUDE to enable index-only scans; see Index-Only Scans and Covering Indexes.
All indexes in PostgreSQL are secondary indexes, meaning each index is stored separately from the table's main data area (called the table's heap in PostgreSQL terminology). In an ordinary index scan, each row retrieval therefore requires fetching data from both the index and the heap.
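A covering-index sketch (names illustrative): INCLUDE stores extra columns in the index leaf pages, so the query below can be answered by an index-only scan without touching the heap:

```sql
CREATE INDEX users_email_covering_idx
    ON users (email) INCLUDE (first_name, last_name);

-- All referenced columns live in the index, so no heap fetch is needed
-- (as long as the visibility map shows the pages as all-visible).
SELECT first_name, last_name
FROM users
WHERE email = 'truong@example.com';
```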
Join Strategies
- Join strategies and performance in postgresql
- Understanding join strategies in postgres
- Series explaining the unexplainable of depesz
- Queries in PostgreSQL: 5. Nested loop
- Queries in PostgreSQL: 6. Hashing
- Queries in PostgreSQL: 7. Sort and merge
- SQL JOIN - WHERE clause vs. ON clause
- How To Write A High Performance Sql Join: Implementation And Best Practices Of Joins
- Subqueries and performance in postgresql
Join order, join_collapse_limit and geqo_threshold
When a query only involves two or three tables, there aren't many join orders to worry about. But the number of possible join orders grows exponentially as the number of tables expands. Beyond ten or so input tables it's no longer practical to do an exhaustive search of all the possibilities, and even for six or seven tables planning might take an annoyingly long time. When there are too many input tables, the PostgreSQL planner will switch from exhaustive search to a genetic probabilistic search through a limited number of possibilities. (The switch-over threshold is set by the geqo_threshold run-time parameter.) The genetic search takes less time, but it won't necessarily find the best possible plan.
When the query involves outer joins, the planner has less freedom than it does for plain (inner) joins. For example, consider:
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
A row must be emitted for each row of A that has no matching row in the join of B and C. Therefore the planner has no choice of join order here: it must join B to C and then join A to that result. Accordingly, this query takes less time to plan than the previous query. In other cases, the planner might be able to determine that more than one join order is safe. For example, given:
SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);
it is valid to join A to either B or C first. Currently, only FULL JOIN completely constrains the join order. Most practical cases involving LEFT JOIN or RIGHT JOIN can be rearranged to some extent.
Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the same as listing the input relations in FROM, so it does not constrain the join order.
PostgreSQL's query planner by default exhausts every possibility trying to determine the fastest way to execute the query you request. The complexity of this grows exponentially with the number of joins. When the join order is not constrained, planning the join of many tables can take longer than executing the actual query.
To avoid this analysis paralysis, PostgreSQL also has a genetic-algorithm-based query optimizer, which attempts to find a heuristic solution. Most of the time it's pretty good, but sometimes the results are pretty bad. The decision to use heuristic planning instead of deterministic planning is controlled by geqo_threshold. The amount of work the heuristic planner does is determined by geqo_effort, a simple 1-10 scale with a default of 5.
Additionally, it can pay to increase the values of join_collapse_limit and from_collapse_limit. These parameters determine the maximum number of tables for which PostgreSQL is willing to optimize the join order. If you join many tables, the extra time spent getting the query plan right can save far more time executing the query.
-- Optimizing with the PostgreSQL deterministic query planner --
The optimizer rearranges the join order of a query. With an inner join of two tables, there are usually seven choices: PostgreSQL can opt for a nested loop, hash or merge join, and for the first two of these, the order of the tables makes a difference as well. With more tables, the number of options explodes, since the result of an inner join is independent of the order in which you join the tables. For three tables, there can be up to 147 combinations.
However, while the optimizer tries to find the best execution plan, it is also important that it does not take too long for planning. After all, PostgreSQL normally does not cache execution plans. To keep planning time moderate, the optimizer draws the line somewhere: if a query joins many tables, the optimizer will only consider all possible combinations for the first eight tables. It joins the remaining tables just like you wrote them in the statement. You can adjust that limit with the parameters join_collapse_limit and from_collapse_limit. The first one is for statements written with the explicit JOIN syntax, and the second applies to joins written in the form
FROM a, b WHERE a.col1 = b.col2
If the number of tables reaches 12 (the default value of the parameter geqo_threshold ), PostgreSQL uses an entirely different approach: it randomly generates a number of query plans and plays evolution by recombining the most promising plans over several generations. This genetic query optimizer can result in non-deterministic query plans, which is not always what you want.
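These are ordinary run-time parameters, so they can be inspected and raised per session; the defaults noted in the comments are the stock PostgreSQL defaults:

```sql
SHOW geqo_threshold;        -- default 12: table count at which the genetic optimizer kicks in
SHOW join_collapse_limit;   -- default 8
SHOW from_collapse_limit;   -- default 8

-- Let the planner reorder joins across more tables for this session,
-- trading longer planning time for a potentially better plan:
SET join_collapse_limit = 12;
SET from_collapse_limit = 12;
```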
- Always keep the number of rows each plan node needs to compute its output small, no matter how large the involved tables are.
- This means the optimization goal is to reduce the size of the result set as early as possible. When executing a JOIN, prioritize first the join that reduces the number of rows in the output the most.
- In practice this usually means joining a small (or heavily filtered) table first, so that the intermediate result set stays small even as subsequent larger tables are processed. Then join the next smallest table, and so on.
- When using an ORM, remember that PostgreSQL only exhaustively optimizes the join order for a limited number of tables (before the optimizer switches to the genetic approach). Check the EXPLAIN output to see whether the join order is efficient.
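The reverse trick also works: setting join_collapse_limit to 1 makes the planner keep the join order exactly as written, which lets you hand-pick the order when you know better than the planner (table names hypothetical):

```sql
SET join_collapse_limit = 1;  -- planner follows the textual join order

-- Written so the most selective join runs first:
SELECT *
FROM small_filtered s
JOIN medium m ON m.s_id = s.id
JOIN large  l ON l.m_id = m.id;
```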
Query execution stages
Statistics
I hope you have gained an overview of how queries run in PostgreSQL; there is still much more to learn on this topic. However, I believe that with the foundational knowledge above, you can easily delve into more advanced concepts. And always remember: before optimizing anything, measure it first.
Hope you find this article useful. If you have any questions, please let me know in the comment section below. 👇