Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Add CreatedDate to Delta Live Table

tommyhmt
New Contributor II

Hi all,

I have a very simple DLT set up using the following code:

import dlt
from pyspark.sql.functions import col

@dlt.view(
  name="view_name",
  comment="comments"
)
def vw_DLT():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .load(file_location)
  )

dlt.create_streaming_table(
  name="table_name",
  comment="comments"
)

dlt.apply_changes(
  target = "table_name", 
  source = "view_name",
  keys = ["id"], 
  sequence_by = col("Date")
)

I would like to add an extra column called CreatedDate holding the timestamp when the record was first inserted into the table. I can't simply put withColumn("CreatedDate", current_timestamp()) into vw_DLT() above, because that value would be refreshed on every load. So I've tried a couple of things:

First I added a schema to dlt.create_streaming_table which includes CreatedDate TIMESTAMP GENERATED ALWAYS AS (NOW()), but that fails with this error when I run the DLT pipeline:

(screenshot of the pipeline error — tommyhmt_0-1723758970190.png)

Secondly, I was able to:

1. Alter the __apply_changes_storage_ table to add the CreatedDate column

2. Alter the CreatedDate column to set its default to current_timestamp()

3. Update the existing records to current_timestamp() manually
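In SQL, those three notebook steps would look roughly like this (a sketch, not verified against a live pipeline; `<table>` stands in for the backing-table name suffix elided in the post, and non-constant column defaults assume the Delta 'allowColumnDefaults' table feature is enabled):

```sql
-- 1. Add the column to the apply_changes backing table
ALTER TABLE __apply_changes_storage_<table>
  ADD COLUMN CreatedDate TIMESTAMP;

-- 2. Give the column a default so new records are stamped on insert
--    (assumes the 'allowColumnDefaults' table feature is enabled)
ALTER TABLE __apply_changes_storage_<table>
  ALTER COLUMN CreatedDate SET DEFAULT current_timestamp();

-- 3. Backfill the records that already exist
UPDATE __apply_changes_storage_<table>
SET CreatedDate = current_timestamp()
WHERE CreatedDate IS NULL;
```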

All of this works in a notebook: when I run the pipeline again, new records do indeed get a CreatedDate value, and old records keep theirs unchanged. But I'm not able to run that notebook as part of the DLT job, since I get this error:

(screenshot of the job error — tommyhmt_1-1723759255869.png)

If that works, then all I'd need to do is run the notebook just once after the initial load, and the rest will work fine.
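Independent of DLT, the behavior being asked for can be sketched in plain Python: an upsert that stamps CreatedDate only the first time a key is seen and preserves it on later updates. (The merge logic here is a hypothetical stand-in for what apply_changes does internally, using the column names from the post.)

```python
from datetime import datetime, timezone

def upsert(target, rows):
    """Upsert rows into target (a dict keyed by id), stamping CreatedDate
    only on first insert and preserving it on subsequent updates."""
    for row in rows:
        existing = target.get(row["id"])
        if existing is None:
            # First time this key is seen: stamp CreatedDate.
            target[row["id"]] = {**row, "CreatedDate": datetime.now(timezone.utc)}
        else:
            # Key already present: replace the data columns but carry the
            # original CreatedDate forward untouched.
            target[row["id"]] = {**row, "CreatedDate": existing["CreatedDate"]}

table = {}
upsert(table, [{"id": 1, "Date": "2024-08-15"}])
first_stamp = table[1]["CreatedDate"]

upsert(table, [{"id": 1, "Date": "2024-08-16"}])  # update of the same key
assert table[1]["CreatedDate"] == first_stamp      # original stamp preserved
assert table[1]["Date"] == "2024-08-16"            # data column updated
```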

2 REPLIES

Mounika_Tarigop
Databricks Employee

To add a CreatedDate column that captures the timestamp when a record is first inserted into the table, you can modify your Delta Live Tables (DLT) pipeline setup as follows:

1) Define the schema for your streaming table to include the CreatedDate column. This column will be populated with the current timestamp when a new record is inserted.

dlt.create_streaming_table(
  name="table_name",
  comment="comments",
  schema="""
    id STRING,
    Date TIMESTAMP,
    CreatedDate TIMESTAMP
  """
)

2) Ensure that the apply_changes function does not overwrite the CreatedDate column on subsequent updates.

dlt.apply_changes(
  target="table_name",
  source="view_name",
  keys=["id"],
  sequence_by=col("Date"),
  except_column_list=["CreatedDate"]
)
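For intuition, except_column_list names source columns that apply_changes should leave out when writing to the target. A toy stand-in in plain Python (not the actual DLT implementation) shows the effect:

```python
def apply_changes_sim(target, rows, except_column_list=()):
    """Toy stand-in for apply_changes: upsert rows into target (a dict
    keyed by id), dropping any columns named in except_column_list
    before writing -- so the change feed never touches those columns."""
    for row in rows:
        incoming = {k: v for k, v in row.items() if k not in except_column_list}
        target[row["id"]] = {**target.get(row["id"], {}), **incoming}

# A record whose CreatedDate was already set by some other mechanism:
table = {1: {"id": 1, "Date": "2024-08-15", "CreatedDate": "2024-08-15T09:00"}}

apply_changes_sim(
    table,
    [{"id": 1, "Date": "2024-08-16", "CreatedDate": "ignored"}],
    except_column_list=["CreatedDate"],
)
assert table[1]["CreatedDate"] == "2024-08-15T09:00"  # not overwritten
assert table[1]["Date"] == "2024-08-16"               # data column updated
```

Note that excluding the column only stops the feed from overwriting it; nothing in the snippet above populates CreatedDate for newly inserted rows, which is what the follow-up question below is getting at.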

tommyhmt
New Contributor II

Hi Mounika,

Why would defining the CreatedDate in the schema automatically mean "the column will be populated with the current timestamp when a new record is inserted"?
