Simple append for a DLT
02-25-2025 02:34 PM
Looking for some help getting unstuck with appending to DLTs in Databricks. I have successfully extracted data via an API endpoint, done some initial data cleaning/processing, and stored that data in a DLT. Great start. But I noticed that each time the pipeline runs, all of the previous rows are overwritten. The AI assistant and separate Google searches have so far not helped me understand why I cannot simply append each run's data to the DLT. I manually added a timestamp column to ensure that each run's data is unique, and each time the pipeline runs I can verify that the data is fresh; I just only see the new data (the old data is overwritten). According to my research, append is supposedly the default behavior when writing to a DLT, but that's not happening and I don't understand why. Attempts to explicitly define append properties for the DLT (both in the notebook and in the pipeline settings) have not helped. Here is a simple example of what I'm trying (and failing) to do.
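A minimal sketch of the pattern, with generate_data() standing in for the real API extraction and cleaning step (the table and column names here are illustrative):

import dlt
from pyspark.sql.functions import current_timestamp

# Stand-in for the real API call + cleaning; returns a small batch DataFrame
def generate_data():
    return spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value"])

@dlt.table(name="api_results")
def api_results():
    # tag each pipeline run so rows from different runs are distinguishable
    return generate_data().withColumn("run_timestamp", current_timestamp())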
a month ago
Hi @jrod123, can you please try the method below?
1. Create a DLT view to store the API data first. If possible, get only incremental data from the API.

@dlt.view
def api_data_view():
    return api_df  # api_df: DataFrame built from the API response

@dlt.table
def target_table():
    df = spark.read.table("api_data_view")  # append the view's data
    return df

This way we separate the API transformations into a view and then append the data.
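If the unqualified view name doesn't resolve inside the pipeline, the same dependency can also be expressed with dlt.read(), the older DLT idiom for referencing a dataset defined in the same pipeline; a small sketch of an equivalent table definition:

import dlt

@dlt.table
def target_table():
    # dlt.read() references another dataset defined in the same DLT pipeline
    return dlt.read("api_data_view")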
a month ago
Creating a view first & then a table as you suggested still produces the same result: data in the table is overwritten (rather than appended) with each run of the pipeline. Here's a simple code example that I used:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, current_timestamp
import datetime
import dlt

# Initialize Spark session
spark = SparkSession.builder.appName("Data Ingestion").getOrCreate()

# Function to generate sample data
def generate_data():
    data = [
        (1, "A"),
        (2, "B"),
        (3, "C"),
    ]
    df = spark.createDataFrame(data, ["id", "value"])
    df = df.withColumn("timestamp", lit(datetime.datetime.now()))
    return df
# Define DLT view and table
@dlt.view(
    name="example_view"
)
def create_example_view():
    return generate_data()

# Define the Delta Live Table
@dlt.table(
    name="example_table"
)
def create_example_table():
    df = spark.read.table("example_view")  # read the view defined above
    return df
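One thing worth noting: a @dlt.table function that returns a batch DataFrame (as above) defines a materialized view, whose contents always reflect the result of that query, so each pipeline update replaces the previous rows rather than appending to them. Append behavior comes from streaming tables, i.e. a @dlt.table function that returns a streaming DataFrame, which processes only new input on each update. A minimal sketch, assuming the API results were first landed as JSON files in a (hypothetical) Unity Catalog volume path:

import dlt

@dlt.table(name="example_table_streaming")
def example_table_streaming():
    return (
        spark.readStream
             .format("cloudFiles")                 # Auto Loader incremental ingestion
             .option("cloudFiles.format", "json")  # assumed landing-file format
             .load("/Volumes/tabular/dataexpert/landing/")  # hypothetical volume path
    )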
a month ago
For reference, here are the JSON pipeline settings:
{
"id": "96e670ba-....",
"pipeline_type": "WORKSPACE",
"development": true,
"continuous": false,
"channel": "CURRENT",
"photon": true,
"libraries": [
{
"notebook": {
"path": "/Users/.../dummy_dlt"
}
}
],
"name": "dlt_view_to_table",
"serverless": true,
"catalog": "tabular",
"schema": "dataexpert",
"data_sampling": false
}
2 weeks ago
I am likewise struggling with this. All DLT configurations that I've tried (including

