(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question.)

Setting: Delta Lake, with Databricks SQL compute used by Power BI.
I am wondering about the following scenario: we have a column `timestamp` and a derived column `date` (the date part of `timestamp`), and we choose to partition by `date`. When we query, we filter on `timestamp`, **not** on `date`.

My understanding is that Databricks won't, a priori, connect `timestamp` and `date`, so it seemingly gets no advantage from the partitioning. But since the files are in fact partitioned by timestamp (implicitly), when Databricks looks at the min/max `timestamp` statistics of the files, it should find that it can skip most of them after all. So it seems we can get quite a lot of benefit from partitioning even on a column we don't *explicitly* use in the query.
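For concreteness, the setup I have in mind looks roughly like this (table and column names are made up for illustration; `date` here is assumed to be populated at write time rather than being a Delta generated column):

```sql
-- Table partitioned on a date column derived from the event timestamp.
-- `date` is written explicitly as CAST(`timestamp` AS DATE) when loading.
CREATE TABLE events (
  `timestamp` TIMESTAMP,
  `date`      DATE
)
USING DELTA
PARTITIONED BY (`date`);

-- Queries filter on the timestamp, not on the partition column:
SELECT *
FROM events
WHERE `timestamp` >= '2024-01-01 00:00:00'
  AND `timestamp` <  '2024-01-02 00:00:00';
```

The question is whether the second query can still prune most files, given that every file in a `date` partition necessarily has a narrow `timestamp` min/max range.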
1. Is this correct, or have I misunderstood something?
2. What is the (rough) performance cost of filtering away files via min/max statistics this way, versus using the partitioning directly?
3. Will Databricks keep all the min/max statistics in memory, or does it have to go out and read them for each query?