Discussion:
indexes across joins not used for count
Jeremy Wells
2012-10-16 03:08:20 UTC
I'm running a count query with two joins in it. I've added indexes on
the join columns, but the EXPLAIN output shows the query isn't using
them:

Table 1:
invites (id:int)

Table 2:
sms_requests (id:int, invoker_id:int, invoker_type:string,
sms_message_id:int)
Indexes:
"sms_requests_pkey" PRIMARY KEY, btree (id)
"index_sms_requests_on_invoker_id_and_invoker_type" btree
(invoker_id, invoker_type)
"index_sms_requests_on_sms_message_id" btree (sms_message_id)

Table 3:
sms_messages (id:int, sent_at:timestamp)
Indexes:
"sms_messages_pkey" PRIMARY KEY, btree (id)
"index_sms_messages_on_sent_at_partial" btree (sent_at) WHERE
sent_at IS NULL
"index_sms_messages_on_sent_at" btree (sent_at)

Query:

SELECT COUNT(*) FROM "invites" INNER JOIN "sms_requests" ON
"sms_requests"."invoker_id" = "invites"."id" AND
"sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON
"sms_messages"."id" = "sms_requests"."sms_message_id" WHERE
"sms_messages"."sent_at" IS NOT NULL

Explain:

Aggregate (cost=165914.42..165914.43 rows=1 width=0)
-> Hash Join (cost=92326.82..163534.87 rows=951821 width=0)
Hash Cond: (sms_requests.sms_message_id = sms_messages.id)
-> Hash Join (cost=32692.53..83674.38 rows=951821 width=4)
Hash Cond: (invites.id = sms_requests.invoker_id)
-> Seq Scan on invites (cost=0.00..27525.48
rows=1238948 width=4)
-> Hash (cost=20794.76..20794.76 rows=951821 width=8)
-> Seq Scan on sms_requests (cost=0.00..20794.76
rows=951821 width=8)
Filter: ((invoker_type)::text = 'Invite'::text)
-> Hash (cost=48180.24..48180.24 rows=916324 width=4)
-> Seq Scan on sms_messages (cost=0.00..48180.24
rows=916324 width=4)
Filter: (sent_at IS NOT NULL)

This is pretty slow, ~5000ms on my development machine. I would have
expected it to be able to make use of the indexes I've created. Any
ideas on what I can do to make this perform better?

Jeremy
--
Sent via pgsql-novice mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-novice
Jeff Davis
2012-10-20 18:21:22 UTC
Post by Jeremy Wells
I'm running a query to do a count with two joins in it. I've added
indexes to the tables for the join columns, but the explain of the query
Can you post the output of EXPLAIN ANALYZE? Did you do an ANALYZE of the
tables already?

You can often force an index scan by doing:

SET enable_seqscan=false;

So also try setting that, and run EXPLAIN ANALYZE again, and see if it
uses the indexes, and if so, if it's faster.

Regards,
Jeff Davis
Jeremy Wells
2012-10-20 19:37:51 UTC
Post by Jeff Davis
Can you post the output of EXPLAIN ANALYZE? Did you do an ANALYZE of the
tables already?
EXPLAIN ANALYZE SELECT COUNT(*) FROM "invites" INNER JOIN "sms_requests"
ON "sms_requests"."invoker_id" = "invites"."id" AND
"sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON
"sms_messages"."id" = "sms_requests"."sms_message_id" WHERE
"sms_messages"."sent_at" IS NOT NULL;


QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=129556.14..129556.14 rows=1 width=0) (actual
time=21530.146..21530.147 rows=1 loops=1)
-> Hash Join (cost=81261.64..128882.20 rows=1347872 width=0)
(actual time=6961.957..20437.245 rows=1340639 loops=1)
Hash Cond: (sms_requests.sms_message_id = sms_messages.id)
-> Hash Join (cost=22033.10..62240.37 rows=1347872 width=4)
(actual time=3618.615..11410.322 rows=1340639 loops=1)
Hash Cond: (invites.id = sms_requests.invoker_id)
-> Seq Scan on invites (cost=0.00..33216.78
rows=1683927 width=4) (actual time=0.009..1983.440 rows=1683927 loops=1)
-> Hash (cost=17315.55..17315.55 rows=1347872 width=8)
(actual time=3617.847..3617.847 rows=1347872 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 52652kB
-> Seq Scan on sms_requests (cost=0.00..17315.55
rows=1347872 width=8) (actual time=0.027..1651.786 rows=1347872 loops=1)
Filter: ((invoker_type)::text = 'Invite'::text)
-> Hash (cost=55408.40..55408.40 rows=1091467 width=4)
(actual time=3342.029..3342.029 rows=1107628 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 38941kB
-> Seq Scan on sms_messages (cost=0.00..55408.40
rows=1091467 width=4) (actual time=0.009..1737.765 rows=1107628 loops=1)
Filter: (sent_at IS NOT NULL)
Total runtime: 21530.247 ms
Post by Jeff Davis
SET enable_seqscan=false;
So also try setting that, and run EXPLAIN ANALYZE again, and see if it
uses the indexes, and if so, if it's faster.
=> SET enable_seqscan=false;
SET
=> EXPLAIN ANALYZE SELECT COUNT(*) FROM "invites" INNER JOIN
"sms_requests" ON "sms_requests"."invoker_id" = "invites"."id" AND
"sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON
"sms_messages"."id" = "sms_requests"."sms_message_id" WHERE
"sms_messages"."sent_at" IS NOT NULL;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=198896.67..198896.67 rows=1 width=0) (actual
time=21923.378..21923.378 rows=1 loops=1)
-> Hash Join (cost=90165.47..198222.73 rows=1347872 width=0)
(actual time=5465.116..20552.881 rows=1340639 loops=1)
Hash Cond: (sms_requests.invoker_id = invites.id)
-> Merge Join (cost=7.29..101325.20 rows=1347872 width=4)
(actual time=0.035..9630.580 rows=1347872 loops=1)
Merge Cond: (sms_messages.id = sms_requests.sms_message_id)
-> Index Scan using sms_messages_pkey on sms_messages
(cost=0.00..69875.42 rows=1091467 width=4) (actual time=0.011..3009.455
rows=1107628 loops=1)
Filter: (sent_at IS NOT NULL)
-> Index Scan using
index_sms_requests_on_sms_message_id on sms_requests
(cost=0.00..26190.25 rows=1347872 width=8) (actual time=0.014..2257.271
rows=1347872 loops=1)
Filter: ((invoker_type)::text = 'Invite'::text)
-> Hash (cost=84264.43..84264.43 rows=1683927 width=4)
(actual time=5462.709..5462.709 rows=1683927 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 59201kB
-> Index Scan using invites_pkey on invites
(cost=0.00..84264.43 rows=1683927 width=4) (actual time=0.159..2763.626
rows=1683927 loops=1)
Total runtime: 21923.504 ms
Jeff Davis
2012-10-21 06:43:08 UTC
Post by Jeremy Wells
Post by Jeff Davis
Can you post the output of EXPLAIN ANALYZE? Did you do an ANALYZE of the
tables already?
EXPLAIN ANALYZE SELECT COUNT(*) FROM "invites" INNER JOIN
"sms_requests" ON "sms_requests"."invoker_id" = "invites"."id" AND
"sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON
"sms_messages"."id" = "sms_requests"."sms_message_id" WHERE
"sms_messages"."sent_at" IS NOT NULL;
Interesting... it looks like the two plans take about the same amount of
time, so the planner is not making a mistake.

I tried making three tables about the same size as yours, and then doing
a three-way join, and on my machine it took closer to a second. So
there's clearly something else going on with your data. Maybe the data
distribution is skewed (some values having many matches in another
table, others having none)?
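One way to check for that kind of skew against the schema from the
original post (a sketch only, not something I've run against your data):

```sql
-- Count how many sms_requests rows point at each invite, to see
-- whether a handful of invites account for most of the join matches.
SELECT invoker_id, count(*) AS n
FROM sms_requests
WHERE invoker_type = 'Invite'
GROUP BY invoker_id
ORDER BY n DESC
LIMIT 10;
```

If the top few counts dwarf the rest, the hash join is doing far more
probe work on those keys than the planner's uniform-distribution
estimate assumes.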

Regards,
Jeff Davis
Jeremy Wells
2012-10-21 08:49:12 UTC
Post by Jeff Davis
I tried making three tables about the same size as yours, and then doing
a three-way join, and on my machine it took closer to a second. So
there's clearly something else going on with your data. Maybe the data
distribution is skewed (some values having many matches in another
table, others having none)?
Every row in sms_requests will match a row in sms_messages (though
possibly the same row, since the relationship is sms_requests n-to-1
sms_messages). Each row in invites will have either 0 or 1 rows in
sms_requests.

Given that the number of sms_messages with sent_at IS NULL will remain
pretty constant, will the query continue to run at about the same speed,
or will it slow significantly as each table grows? Although the count is
slow, it's not causing me problems yet. Can I speed it up? Would it make
sense to denormalize the sent_at time into a boolean on the
sms_requests table?
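A denormalization along those lines might look like this (a sketch
only; the `sent` column and trigger names are made up, and the trigger
assumes sent_at only transitions from NULL to non-NULL):

```sql
-- Hypothetical: add a boolean flag to sms_requests and keep it in
-- sync with sms_messages.sent_at via a trigger.
ALTER TABLE sms_requests ADD COLUMN sent boolean NOT NULL DEFAULT false;

CREATE OR REPLACE FUNCTION sync_sent_flag() RETURNS trigger AS $$
BEGIN
    IF NEW.sent_at IS NOT NULL THEN
        UPDATE sms_requests SET sent = true
        WHERE sms_message_id = NEW.id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sms_messages_sync_sent
    AFTER UPDATE OF sent_at ON sms_messages
    FOR EACH ROW EXECUTE PROCEDURE sync_sent_flag();
```

That would turn the three-way join into a single filtered count on
sms_requests, at the cost of trigger overhead on every send.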

Thanks for your help so far,
Jeremy
Tom Lane
2012-10-21 16:59:16 UTC
Post by Jeremy Wells
Given that the number of sms_messages with sent_at IS NULL will remain
pretty constant will the query continue to run at about the same speed
or will it slow significantly as each table grows in size?
At some point the IS NULL condition will become selective enough that
it's worth using an index on sent_at to pull out just those rows.
The planner evidently thinks you are not there yet. (Or, perhaps, you
are using a version too old to be able to use an index for IS NULL?
You didn't mention your PG version, which is a fundamental mistake in
any performance-related question.)

My overall take on this is that it's folly to imagine that indexes are a
magic bullet for full-table joins: if the query has to read most/all of
the table anyway, an index doesn't help much. The plans you've shown
look perfectly sane from here.

If you were using 9.2, and the tables were relatively static, index-only
scans might make sense. But they're no magic bullet either.
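To illustrate that last point, a partial index matching the query's
predicate (a sketch, not tested against this schema; it would only pay
off on 9.2+ with a reasonably current visibility map) could let the
planner satisfy both the sent_at filter and the join probe from the
index alone:

```sql
-- Hypothetical partial index: covers the join column and bakes the
-- WHERE clause into the index definition, enabling index-only scans.
CREATE INDEX index_sms_messages_id_sent
    ON sms_messages (id)
    WHERE sent_at IS NOT NULL;
```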

regards, tom lane
Jeremy Wells
2012-10-21 21:30:39 UTC
Post by Tom Lane
(Or, perhaps, you
are using a version too old to be able to use an index for IS NULL?
You didn't mention your PG version, which is a fundamental mistake in
any performance-related question.)
I'm using 9.1