Simple append for a DLT

jrod123 — Tue, 25 Feb 2025 22:34:44 GMT

Looking for some help getting unstuck re: appending to DLTs in Databricks. I have successfully extracted data via an API endpoint, done some initial data cleaning/processing, and subsequently stored that data in a DLT. Great start. But I noticed that each time the pipeline runs, all of the previous rows are overwritten. The AI assistant and separate Google searches have so far been no help in understanding why I cannot simply append data from each run to the DLT. I manually added a timestamp column to ensure that each run's data is unique, and each time it runs I can verify that the data is fresh. I only ever see the new data (the old rows are overwritten). According to my research, append is supposedly the default behavior when writing to a DLT, but that's not happening and I don't understand why. Attempts to explicitly define the append properties for the DLT (both in the notebook and in the pipeline settings) have not helped. Here is a simple example of what I'm trying (and failing) to do:

import dlt
from pyspark.sql.functions import current_timestamp

# Function to generate sample data
def generate_data():
    data = [
        (1, "A"),
        (2, "B"),
        (3, "C")
    ]
    df = spark.createDataFrame(data, ["id", "value"])
    df = df.withColumn("timestamp", current_timestamp())
    return df

# Define the Delta Live Table
@dlt.table(
    name="example_table",
    comment="A simple example table",
    table_properties={"pipelines.appendOnly": "true"}
)
def create_example_table():
    return generate_data()

Re: Simple append for a DLT

KaranamS — Wed, 26 Feb 2025 23:43:10 GMT

Hi @jrod123, can you please try the method below?

1. Create a DLT view to store the API data first. If possible, get only incremental data from the API.

@dlt.view
def api_data_view():
    return api_df

2. Define your DLT table and append the view to your target table

@dlt.table
def target_table():
    df = spark.read.table("api_data_view")  # append view data
    return df

This way, the API transformations are kept in a view, and the table then appends the data from it.

Re: Simple append for a DLT

jrod123 — Thu, 27 Feb 2025 05:33:17 GMT

Creating a view first & then a table as you suggested still produces the same result: data in the table is overwritten (rather than appended) with each run of the pipeline. Here's a simple code example that I used:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
import datetime
import dlt

# Initialize Spark session
spark = SparkSession.builder.appName("Data Ingestion").getOrCreate()

from pyspark.sql.functions import current_timestamp

# Function to generate sample data
def generate_data():
    data = [
        (1, "A"),
        (2, "B"),
        (3, "C")
    ]
    df = spark.createDataFrame(data, ["id", "value"])
    df = df.withColumn("timestamp", lit(datetime.datetime.now()))
    return df

# Define DLT view and table

@dlt.view(
    name="example_view"
)
def create_example_view():
    return generate_data()

# Define the Delta Live Table
@dlt.table(
    name="example_table"
)
def create_example_table():
    df = spark.read.table("example_view")
    return generate_data()
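
A note on why both attempts may behave the same: as far as the documented semantics go, a `@dlt.table` function that returns a batch DataFrame defines a materialized view, and its contents are recomputed from the query on every update. Both examples in this thread return a batch DataFrame (the second one reads the view into `df` but still returns `generate_data()`), so each run would be expected to replace the rows rather than add to them. The toy model below is plain Python, not the DLT API; `RecomputedTable`, `AppendedTable`, and `generate_rows` are made-up names used only to contrast the two behaviors.

```python
from datetime import datetime, timezone


def generate_rows(run_time):
    """Stand-in for generate_data(): three fresh rows stamped with the run time."""
    return [(1, "A", run_time), (2, "B", run_time), (3, "C", run_time)]


class RecomputedTable:
    """Toy model: every update replaces the contents with the query result."""

    def __init__(self):
        self.rows = []

    def update(self, query_result):
        self.rows = list(query_result)


class AppendedTable:
    """Toy model: every update adds the new rows to the existing ones."""

    def __init__(self):
        self.rows = []

    def update(self, query_result):
        self.rows.extend(query_result)


recomputed, appended = RecomputedTable(), AppendedTable()
for run in range(3):
    # A distinct timestamp for each simulated pipeline run
    stamp = datetime(2025, 2, 25 + run, tzinfo=timezone.utc)
    batch = generate_rows(stamp)
    recomputed.update(batch)
    appended.update(batch)

print(len(recomputed.rows))  # 3: only the latest run's rows remain
print(len(appended.rows))    # 9: three runs of three rows each
```

The first count matches what is reported here: fresh timestamps on every run, but never more than one run's worth of rows.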

Re: Simple append for a DLT

jrod123 — Thu, 27 Feb 2025 05:38:43 GMT

For reference, here are the JSON pipeline settings:

{
    "id": "96e670ba-....",
    "pipeline_type": "WORKSPACE",
    "development": true,
    "continuous": false,
    "channel": "CURRENT",
    "photon": true,
    "libraries": [
        {
            "notebook": {
                "path": "/Users/.../dummy_dlt"
            }
        }
    ],
    "name": "dlt_view_to_table",
    "serverless": true,
    "catalog": "tabular",
    "schema": "dataexpert",
    "data_sampling": false
}

Re: Simple append for a DLT

tastefulSamurai — Fri, 14 Mar 2025 22:51:55 GMT

I am likewise struggling with this. All DLT configurations that I've tried (including spark_conf={"pipelines.autoOptimize.appendOnly": "true"}) just yield overwrites of the existing data.

Any luck, @jrod123?
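
For later readers: one documented route to append behavior is a streaming table fed by an append flow, instead of a batch query. The following is a minimal sketch under assumptions that are not in the thread: that each run's API response is first written as JSON files to some landing location (`landing_path` is a hypothetical parameter), and that Auto Loader (`cloudFiles`) then reads those files incrementally. `define_append_pipeline` is a made-up helper name. `dlt` and `spark` are passed in as arguments so the function can be defined anywhere; in a pipeline notebook they would be the imported `dlt` module and the provided `spark` session. It has not been run against the pipeline settings posted above.

```python
def define_append_pipeline(dlt, spark, landing_path):
    """Register a streaming table and an append flow that feeds it.

    Assumption: landing_path points at a location where each run's API
    output has been written as JSON files (hypothetical, not from the thread).
    """
    # A streaming table keeps the rows written by earlier updates
    dlt.create_streaming_table(
        name="example_table",
        comment="Rows accumulated across pipeline runs",
    )

    @dlt.append_flow(target="example_table")
    def append_api_rows():
        # An append flow must return a streaming DataFrame; Auto Loader
        # reads only files it has not already processed
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(landing_path)
        )
```

In a pipeline notebook this would be called once at the top level, after `import dlt`, with whatever location the API output is written to. The design point is that the target is defined by a streaming read, so each update processes only new input and adds it to the table.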