def ingest():
    # spark, schema, and landing_folder_path are assumed to be defined earlier in the notebook
    source_df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("timestampFormat", "d-M-y H.m")
        .option("cloudFiles.schemaLocation", f"{landing_folder_path}/Opportunity_schema")
        # .option("cloudFiles.inferColumnTypes", "true")
        .schema(schema)  # Explicitly defined schema to avoid invalid characters
        .load(f"{landing_folder_path}/Opportunity")
    )
    source_df = source_df.filter("Id IS NOT NULL")  # Example of filtering out corrupt rows

    write_query = (source_df.writeStream
        .format("delta")
        .option("checkpointLocation", f"{landing_folder_path}/Opportunity/checkpoint")
        .option("mergeSchema", "false")
        .outputMode("append")
        .trigger(availableNow=True)
        .toTable("dev.demo_db.Opportunity_raw")
    )
    write_query.awaitTermination()  # Block until the availableNow batch finishes

ingest()
I have an issue with Auto Loader: I'm not getting an incremental load with this setup. If I rerun ingest(), some unstructured data gets ingested into Opportunity_raw again.
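
Not a confirmed fix, but one thing worth ruling out: the checkpointLocation sits inside the same folder that cloudFiles is monitoring (f"{landing_folder_path}/Opportunity"), so the metadata files Spark writes under checkpoint/ may themselves get listed as new input on a rerun and land in Opportunity_raw as malformed rows. A minimal sketch of keeping the checkpoint outside the monitored folder, reusing the variables above; the path Opportunity_checkpoint is just an illustrative name, not something from the original setup:

    write_query = (source_df.writeStream
        .format("delta")
        # hypothetical path: keep streaming metadata outside the folder Auto Loader watches
        .option("checkpointLocation", f"{landing_folder_path}/Opportunity_checkpoint")
        .option("mergeSchema", "false")
        .outputMode("append")
        .trigger(availableNow=True)
        .toTable("dev.demo_db.Opportunity_raw")
    )
    write_query.awaitTermination()

Note that pointing at a new checkpoint location resets the stream's file-tracking state, so the first run after such a change would reprocess all existing files once before incremental behavior resumes.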