According to the documentation at https://docs.databricks.com/en/structured-streaming/delta-lake.html#complete-mode, the “complete” output mode is supposed to “replace the entire table with every batch”. However, it does not behave that way in my case.
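For reference, my understanding of complete mode comes from plain Structured Streaming, where it is set on the writer, typically together with an aggregation. A minimal sketch of that understanding (paths, schema, and names below are made up for illustration):

```python
# Minimal non-DLT sketch of how I understand "complete" mode: on every trigger,
# the sink receives the full re-computed aggregation result, replacing what is there.
# Paths, schema, and names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dept_counts = (
    spark.readStream
    .format("csv")
    .option("header", "true")
    .schema("id INT, department STRING")   # streaming file sources need an explicit schema
    .load("/tmp/employees/")               # made-up source directory
    .groupBy("department")
    .count()
)

query = (
    dept_counts.writeStream
    .outputMode("complete")                # complete mode is set on the writer
    .format("memory")                      # in-memory sink, just for illustration
    .queryName("department_counts")
    .start()
)
```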
Here is how I reproduce the issue:
First, I placed a single file named `employee_01.csv` in ADLS. Then I used the following Python code to read it and generate a table:
```python
import dlt

outputMode = "complete"

default_spark_options = {
    "cloudFiles.format": "csv",
    "delimiter": "\x01",
    "inferSchema": "true",
}

@dlt.table(
    name=table_01,  # table_01 and source_path are defined elsewhere in my notebook
)
def create_raw_table():
    path = source_path
    df = (
        spark.readStream
        .outputMode(outputMode)
        .format("cloudFiles")
        .options(**default_spark_options)
        .load(path)
    )
    return df
```
The data loads and the table is created successfully on the first run.
Then I uploaded another file to ADLS and triggered the DLT pipeline again.
However, when the DLT pipeline finished, the resulting table contained the combined data from both runs.
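To illustrate what I mean, a count check along these lines (the table name below is just a stand-in for the actual published name) would show the rows of `employee_01.csv` plus the newly uploaded file together, rather than only the latest file:

```python
# Hypothetical check after the second run: "my_schema.table_01" stands in for the
# actual published table name. The count covers rows from both CSV files, i.e.
# the second run appended to the table instead of replacing it.
spark.table("my_schema.table_01").count()
```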
Am I understanding the `complete` output mode incorrectly?