07-14-2022 07:28 AM
I have a large table that contains a date_time column.
The table also contains two generated columns, year and month, which are extracted from the date_time values and used for partitioning.
I have the following question.
If I run the query
SELECT *
FROM table
WHERE date_time > '2022-07-01' AND date_time < '2022-07-09'
This query scans all the files.
If I modify the query to
SELECT *
FROM table
WHERE date_time > '2022-07-01' AND date_time < '2022-07-09'
AND year = 2022 AND month = 7
Now pruning is applied and the query runs about 20 times faster.
Since a relationship is defined between date_time and the generated columns year and month, I would expect pruning to be applied even when only date_time is specified in the WHERE clause.
Am I missing something in my config or is my understanding incorrect?
Thanks,
Andrej
07-14-2022 07:56 AM
Partition pruning will only happen when the generated columns, i.e. 'year' and 'month', are used as predicates.
You can also consider file pruning by Z-ordering or by using a Bloom filter index.
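If the partition predicates have to be written by hand, they can be derived from the date range rather than typed out for each query. The following is only a minimal sketch: it assumes the partition columns are named year and month as in the question, and the helper name partition_predicate is hypothetical, not part of any Databricks or Delta Lake API.

```python
from datetime import date


def partition_predicate(start: date, end: date) -> str:
    """Build a year/month partition filter covering every month in [start, end].

    Assumes partition columns named `year` and `month` (as in the question).
    The function name is hypothetical, chosen for this illustration only.
    """
    if start > end:
        raise ValueError("start must not be after end")

    # Walk month by month from the start month to the end month (inclusive).
    pairs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        pairs.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1

    # One "(year = Y AND month = M)" clause per partition, OR-ed together.
    clauses = [f"(year = {y} AND month = {m})" for y, m in pairs]
    return "(" + " OR ".join(clauses) + ")"


# Date range from the question: both bounds fall in July 2022.
predicate = partition_predicate(date(2022, 7, 1), date(2022, 7, 9))
query = (
    "SELECT * FROM table "
    "WHERE date_time > '2022-07-01' AND date_time < '2022-07-09' "
    f"AND {predicate}"
)
print(query)
```

The date_time condition stays in the query so the exact bounds are still enforced; the generated clause only narrows which partitions are read. A range that crosses a month or year boundary yields one clause per month.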
07-14-2022 07:58 AM
No, your understanding is correct.
However, there are some restrictions, which you can find here (the interesting part starts at the paragraph beginning with "In Databricks Runtime 8.4 and above with Photon support, Delta Lake may be able to generate partition filters...").
07-14-2022 08:14 AM
Hi, thank you for the replies.
@Werner Stinckens I read that exact article, but after re-reading it I realise that Photon support is required.
I will try again with that. Thanks!
09-04-2022 07:04 AM
Hi @Andrej Znidarsic
Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!