Databricks Community

andrej · ‎07-14-2022

I have a large table which contains a date_time column.

The table has two generated columns, year and month, which are extracted from the date_time values and used for partitioning.
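For readers following along, a table of this shape could be declared roughly as below. This is only a sketch: the table name events is a placeholder, and the thread does not show the actual schema or any other columns.

```sql
-- Hypothetical table name and schema; only date_time, year and month
-- are mentioned in the thread.
CREATE TABLE events (
  date_time TIMESTAMP,
  -- Generated columns derived from date_time
  year  INT GENERATED ALWAYS AS (YEAR(date_time)),
  month INT GENERATED ALWAYS AS (MONTH(date_time))
)
USING DELTA
PARTITIONED BY (year, month);
```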

I have the following question.

If I run the query

SELECT *
FROM table
WHERE date_time > '2022-07-01' AND date_time < '2022-07-09'

This query will scan all the files.

If I modify the query to

SELECT *
FROM table
WHERE date_time > '2022-07-01' AND date_time < '2022-07-09'
AND year = 2022 AND month = 7

Now partition pruning is applied and the query runs about 20 times faster.

Since a relationship is defined between date_time and the columns year and month, I would expect pruning to be applied even if only date_time is specified in the WHERE clause.

Am I missing something in my config or is my understanding incorrect?

Thanks,

Andrej

Anonymous · ‎07-14-2022

Partition pruning will only happen when using the generated columns i.e. ‘year’ and ‘month’ as predicates.

You can also consider file pruning, either by Z-ordering or by using a bloom filter index.
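As an illustration of the Z-ordering suggestion, a minimal sketch follows. It assumes a Delta table with the placeholder name events and chooses date_time as the Z-order column because that is the column filtered on in the question; neither choice comes from the thread itself.

```sql
-- Hypothetical table name. Rewrites the data files so that rows with
-- nearby date_time values are co-located, which lets file-level
-- min/max statistics skip more files for range predicates.
OPTIMIZE events
ZORDER BY (date_time);
```

Z-ordering cannot be applied to the partition columns themselves, so it complements the year/month partitioning rather than replacing it.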

-werners- · ‎07-14-2022

No, your understanding is correct.

However, there are some restrictions, which you can find here (the interesting part starts at the paragraph starting with "In Databricks Runtime 8.4 and above with Photon support, Delta Lake may be able to generate partition filters...")
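One way to check whether partition filters are actually being generated is to inspect the query plan. A sketch, again assuming the placeholder table name events; the exact plan output differs between runtimes and is not reproduced here.

```sql
-- Hypothetical table name. Shows the physical plan for the query
-- that filters on date_time only.
EXPLAIN
SELECT *
FROM events
WHERE date_time > '2022-07-01' AND date_time < '2022-07-09';
```

In the scan node of the plan, look for partition filters that reference year and month. If only the date_time predicate shows up as a data filter, the partition filters were not derived and all partitions will be read.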

andrej · ‎07-14-2022

Hi, thank you for the replies.

@Werner Stinckens I read that exact article, but after re-reading it I realise that Photon support is required.

Will try again with that. Thanks!

Vidula · ‎09-04-2022

Hi @Andrej Znidarsic

Hope all is well! Just wanted to check in: were you able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!