Genrated partition column not being used by optimizer

AlexDavies — Thu, 19 Aug 2021 16:16:57 GMT

We have created a table using the new generated column feature (https://docs.microsoft.com/en-us/azure/databricks/delta/delta-batch#deltausegeneratedcolumns)

CREATE TABLE ingest.MyEvent(
 data  binary,
 topic string,
 timestamp timestamp,
 date date GENERATED ALWAYS AS (CAST(timestamp AS DATE))
)
USING DELTA
PARTITIONED BY (topic,date)
LOCATION '/mnt/data/myevent'

When running

SELECT * FROM WHERE timestamp > '2021-06-01T12:34'

We were expecting it to push down a filter "date >= '2021-06-01'" so that it can make use of the partitioned column without having to explicit query it. However it does not push down a filter on date, so doesn't benefit from partition filtering

Is it expected that generated columns would be used to optimize queries? Is it possible to make generated columns work the way i'm expecting or am I expecting too much?

Re: Genrated partition column not being used by optimizer

-werners- — Mon, 23 Aug 2021 09:30:44 GMT

I think you have to pass a date in your select query instead of a timestamp.

The generated column will indeed derive a data from the timestamp and partition by it.

But the docs state:

When you write to a table with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes the values.

So the date column will be filled while writing. But afaik it is not the case with reading.

So "Select timestamp" will not be translated to "Select date".

When you query on the timestamp column, spark will read the timestamp column, and the table is not partitioned on this.

try this f.e.:

SELECT * FROM WHERE date  = '2021-06-01'

topic Genrated partition column not being used by optimizer in Data Engineering

Genrated partition column not being used by optimizer

Re: Genrated partition column not being used by optimizer