Problem: The Queried Table Did Not Appear in the Execution Plan#
SQL:
SELECT
*
FROM
(
SELECT
A.column1 as "column1",
-- many A columns omitted in between
A.column99 as "column99"
from
table_a A
left join (
SELECT
lzl_id
from
table_a AA
inner join table_b BB ON AA.lzl_key = BB.lzl_id
where
AA.column_code = '1'
GROUP BY
lzl_id
) B ON B.lzl_id = A.lzl_key
where
A.flagflagflag = '1'
AND A.typetypetype = '2'
) TEMP
limit
100
offset
1000Execution plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2.84..5.68 rows=1 width=1105) (actual time=0.038..0.039 rows=0 loops=1)
Buffers: shared hit=2
-> Seq Scan on table_a a (cost=0.00..2.84 rows=1 width=1105) (actual time=0.036..0.037 rows=0 loops=1)
Filter: (((flagflagflag)::text = '1'::text) AND ((typetypetype)::text = '2'::text))
Rows Removed by Filter: 38
Buffers: shared hit=2
Planning Time: 0.184 ms
Execution Time: 0.066 msAs you can see, the SQL itself is fairly complex. Logically, the SQL queries 3 tables / accesses 2 tables total. I can understand table_a appearing in the execution plan, but table_b, which needed to be queried, wasn’t in the execution plan at all! The execution plan was simply a sequential scan of table_a.
The Analytical Journey#
In the middle of the analysis, I actually considered many possibilities, but the most likely one was logical optimization — that is, the PostgreSQL optimizer determined that table_b didn’t need to be queried.
Observing the SQL, I noticed that the final query only selected columns from table_a, without any columns from table_b. Adding any column from the intermediate table B made the SQL execution plan appear “normal” — it accessed table_b:
explain SELECT
*
FROM
(
SELECT
A.column1 as "column1",
-- many A columns omitted in between
A.column99 as "column99",
B.lzl_id -- added a column from intermediate table B
from
table_a A
left join (
SELECT
lzl_id
from
table_a AA
inner join table_b BB ON AA.lzl_key = BB.lzl_id
where
AA.column_code = '1'
GROUP BY
lzl_id
) B ON B.lzl_id = A.lzl_key
where
A.flagflagflag = '1'
AND A.typetypetype = '2'
) TEMP
limit
100
offset
1000---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=14.69..17.67 rows=1 width=1113)
-> Nested Loop Left Join (cost=11.72..14.69 rows=1 width=1113)
Join Filter: (bb.lzl_id = a.lzl_key)
-> Seq Scan on table_a a (cost=0.00..2.84 rows=1 width=1113)
Filter: (((flagflagflag)::text = '1'::text) AND ((typetypetype)::text = '2'::text))
-> Group (cost=11.72..11.74 rows=5 width=8)
Group Key: bb.lzl_id
-> Sort (cost=11.72..11.73 rows=5 width=8)
Sort Key: bb.lzl_id
-> Nested Loop (cost=0.15..11.66 rows=5 width=8)
-> Seq Scan on table_a aa (cost=0.00..2.70 rows=1 width=8)
Filter: ((company_code)::text = '1'::text)
-> Index Only Scan using idx_table_b_lzl_id on table_b bb (cost=0.15..8.83 rows=13 width=8)
Index Cond: (lzl_id = aa.lzl_key)This seems related to LEFT JOIN, but a quick thought makes it seem incorrect — after all, the results from the right table should affect the final query result, so the right table shouldn’t be skipped. Let’s try a simple LEFT JOIN:
explain select lzlleft.a from lzlleft left join lzlright on lzlleft.a=lzlright.a;
QUERY PLAN
--------------------------------------------------------------------
Hash Left Join (cost=1.04..15.47 rows=320 width=4)
Hash Cond: (lzlleft.a = lzlright.a)
-> Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4)
-> Hash (cost=1.02..1.02 rows=2 width=4)
-> Seq Scan on lzlright (cost=0.00..1.02 rows=2 width=4)The right table is scanned. But, in intermediate table B, there’s the keyword GROUP BY. If we remove GROUP BY, then table_b is accessed regardless of whether we query columns from B.
Let’s add a GROUP BY in our test table and see the result:
> select * from lzlleft;
a | b
---+-----
1 | zzz
(1 row)
Time: 0.259 ms
> select * from lzlright;
a | b
---+-------
1 | qwer
1 | poiuy
> select lzlright.b from lzlleft full join lzlright on lzlleft.b=lzlright.b group by lzlright.b;
b
--------
[null]
poiuy
qwer
(3 rows)This is where I realized that the result set from GROUP BY must have a certain property — uniqueness.
Let’s add GROUP BY in the test table:
explain select lzlleft.a from lzlleft left join (select a from lzlright group by a) c on lzlleft.a=c.a;
QUERY PLAN
----------------------------------------------------------
Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4)The right table is not queried!
Based on the principle of right-table uniqueness, we can also have some fun variations:
-- distinct ensures right-table uniqueness
> explain select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a;
QUERY PLAN
----------------------------------------------------------
Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4) -- unique index ensures right-table uniqueness, even with just select a from lzlright
> explain select lzlleft.a from lzlleft left join (select a from lzlright) c on lzlleft.a=c.a;
QUERY PLAN
-----------------------------------------------------------------------
Hash Left Join (cost=17.20..49.12 rows=512 width=4)
Hash Cond: (lzlleft.a = lzlright.a)
-> Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4)
-> Hash (cost=13.20..13.20 rows=320 width=4)
-> Seq Scan on lzlright (cost=0.00..13.20 rows=320 width=4)
(5 rows)
Time: 0.510 ms
> create unique index idx_right on lzlright(a);
CREATE INDEX
Time: 3.576 ms
> explain select lzlleft.a from lzlleft left join (select a from lzlright) c on lzlleft.a=c.a;
QUERY PLAN
----------------------------------------------------------
Seq Scan on lzlleft (cost=0.00..13.20 rows=320 width=4)
(1 row)Here’s a summary of the analysis: when the right table’s data is unique and only the left table’s data is being queried, there’s no need to actually access the right table. So this is not a bug, but a feature of the PostgreSQL optimizer — and it makes logical sense.
Source Code Analysis#
No source code analysis this time~
The optimizer source code is just too difficult. I only looked at some optimizer source code comments. Search for the keyword unique-ify, and you’ll find this:
* Also, this routine and others in this module accept the special JoinTypes
* JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER to indicate that we should
* unique-ify the outer or inner relation and then apply a regular inner
* join. These values are not allowed to propagate outside this module,
* however. Path cost estimation code may need to recognize that it's
* dealing with such a case --- the combination of nominal jointype INNER
* with sjinfo->jointype == JOIN_SEMI indicates that.
Special JoinTypes: JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER — they try to unique-ify the outer and inner relations and then treat them as an inner join. Path cost estimation needs to consider this scenario.
Comparison with Oracle and MySQL Optimizers#
Let’s compare whether Oracle and MySQL optimizers have similar logical optimization improvements.
-- Oracle
create table lzlleft(a number);
create table lzlright(a number);
select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a;-- GROUP BY uniqueness
SQL> select lzlleft.a from lzlleft left join (select a from lzlright group by a) c on lzlleft.a=c.a;
no rows selected
Execution Plan
----------------------------------------------------------
Plan hash value: 3533354041
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 5 (20)| 00:00:01 |
|* 1 | HASH JOIN OUTER | | 1 | 26 | 5 (20)| 00:00:01 |
| 2 | TABLE ACCESS FULL | LZLLEFT | 1 | 13 | 2 (0)| 00:00:01 |
| 3 | VIEW | | 1 | 13 | 3 (34)| 00:00:01 |
| 4 | HASH GROUP BY | | 1 | 13 | 3 (34)| 00:00:01 |
| 5 | TABLE ACCESS FULL| LZLRIGHT | 1 | 13 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("LZLLEFT"."A"="C"."A"(+))-- DISTINCT uniqueness
SQL> select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a;
no rows selected
Execution Plan
----------------------------------------------------------
Plan hash value: 3859658234
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 5 (20)| 00:00:01 |
|* 1 | HASH JOIN OUTER | | 1 | 26 | 5 (20)| 00:00:01 |
| 2 | TABLE ACCESS FULL | LZLLEFT | 1 | 13 | 2 (0)| 00:00:01 |
| 3 | VIEW | | 1 | 13 | 3 (34)| 00:00:01 |
| 4 | HASH UNIQUE | | 1 | 13 | 3 (34)| 00:00:01 |
| 5 | TABLE ACCESS FULL| LZLRIGHT | 1 | 13 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("LZLLEFT"."A"="C"."A"(+))-- MySQL
create table lzlleft(a int primary key);
create table lzlright(a int primary key);-- GROUP BY uniqueness
explain select lzlleft.a from lzlleft left join (select a from lzlright group by a) c on lzlleft.a=c.a;
+----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+
| 1 | PRIMARY | lzlleft | NULL | index | NULL | PRIMARY | 4 | NULL | 1 | 100.00 | Using index |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | lzldb.lzlleft.a | 2 | 100.00 | Using index |
| 2 | DERIVED | lzlright | NULL | index | PRIMARY | PRIMARY | 4 | NULL | 1 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+-- DISTINCT uniqueness
explain select lzlleft.a from lzlleft left join (select distinct a from lzlright) c on lzlleft.a=c.a;
+----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+
| 1 | PRIMARY | lzlleft | NULL | index | NULL | PRIMARY | 4 | NULL | 1 | 100.00 | Using index |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | lzldb.lzlleft.a | 2 | 100.00 | Using index |
| 2 | DERIVED | lzlright | NULL | index | PRIMARY | PRIMARY | 4 | NULL | 1 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+-------------+---------+-----------------+------+----------+-------------+In summary, neither Oracle nor MySQL performs the optimization of eliminating the right table in a LEFT JOIN when only left-table columns are queried and the right table is unique — they both access the right table.
The PostgreSQL optimizer really has some impressive tricks.