Databricks Community

RyanHager · ‎04-28-2022

Benefit: This would simplify the WHERE clauses for consumers of the tables. If I need all the data for a day, I could just query on the main date field instead of an extra day field we had to create.

Hubert-Dudek · ‎05-07-2022

@Ryan Hager, yes, it is possible using generated columns since Delta Lake 1.2.

For example, you can automatically generate a date column (for partitioning the table by date) from the timestamp column; any writes into the table need only specify the data for the timestamp column.

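from delta.tables import DeltaTable
from pyspark.sql.types import DateType

# Assumes an active SparkSession named `spark`.
# dateOfBirth is computed from birthDate on every write and used as the partition column.
(DeltaTable.create(spark)
    .tableName("default.people10m")
    .addColumn("id", "INT")
    .addColumn("birthDate", "TIMESTAMP")
    .addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)")
    .partitionedBy("dateOfBirth")
    .execute())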
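To make "writes need only specify the timestamp column" a bit more concrete, here is a rough plain-Python model of what happens on write. It does not use Spark or Delta at all; `with_generated_date` and `write_rows` are hypothetical helpers invented for this illustration, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import date, datetime


def with_generated_date(row):
    """Return a copy of the row with dateOfBirth derived from birthDate,
    mirroring the expression CAST(birthDate AS DATE)."""
    return {**row, "dateOfBirth": row["birthDate"].date()}


def write_rows(rows):
    """Group rows by the generated dateOfBirth value, loosely mimicking
    how a table partitioned by dateOfBirth lays out its data."""
    partitions = defaultdict(list)
    for row in rows:
        full_row = with_generated_date(row)
        partitions[full_row["dateOfBirth"]].append(full_row)
    return dict(partitions)


# The writer supplies only id and birthDate; dateOfBirth is filled in for them.
incoming = [
    {"id": 1, "birthDate": datetime(2022, 5, 1, 9, 30)},
    {"id": 2, "birthDate": datetime(2022, 5, 1, 23, 59)},
    {"id": 3, "birthDate": datetime(2022, 5, 2, 0, 0)},
]
partitions = write_rows(incoming)
```

In this toy model the three rows end up in two partitions (2022-05-01 and 2022-05-02), even though no row supplied a `dateOfBirth` value.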

RyanHager · ‎05-13-2022

Does this mean the execution plan for the following query that uses the original timestamp column will only scan 3 partitions and we don't have to use the dateOfBirth column in the where clause?

select id, birthDate from default.people10m
where birthDate > cast('2022-05-01 08:00:00.000000 America/Chicago' as timestamp)
and birthDate < cast('2022-05-03 08:00:00.000000 America/Chicago' as timestamp)

RyanHager · ‎08-19-2022

@Kaniz Fatma Can you help me get clarification on this?

RyanHager · ‎02-27-2023

Just to update the post, this does work:

1. You have a timestamp column.
2. Generate a date column from the timestamp column.
3. Partition on that generated date column.
4. Write a query that filters on the original timestamp column.

Result: Databricks will only scan partitions within the date range.
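For anyone wondering why the query earlier in the thread touches three partitions, here is a simplified plain-Python sketch of the idea: a range filter on the timestamp column is turned into an inclusive range on the generated date column. This is only a model of the behavior, not Delta's actual code. `partitions_to_scan` is a hypothetical helper, the session time zone is assumed to be UTC, and America/Chicago is written as a fixed UTC-5 offset (daylight time in May) to keep the example stdlib-only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: America/Chicago is on daylight time (UTC-5) in early May.
CDT = timezone(timedelta(hours=-5))
# Assumption: the session time zone used for CAST(... AS DATE) is UTC.
SESSION_TZ = timezone.utc


def partitions_to_scan(lower, upper, session_tz=SESSION_TZ):
    """Return the dateOfBirth partitions that could hold rows with
    lower < birthDate < upper, using inclusive date bounds."""
    first = lower.astimezone(session_tz).date()
    last = upper.astimezone(session_tz).date()
    day_count = (last - first).days + 1
    return [first + timedelta(days=i) for i in range(day_count)]


# Bounds taken from the query in this thread.
lower = datetime(2022, 5, 1, 8, 0, tzinfo=CDT)
upper = datetime(2022, 5, 3, 8, 0, tzinfo=CDT)
scan = partitions_to_scan(lower, upper)
```

Under these assumptions `scan` holds the dates 2022-05-01, 2022-05-02, and 2022-05-03. To check the real behavior on your own table, run EXPLAIN on the query and look for partition filters on dateOfBirth in the plan.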