08-31-2024 12:02 PM
I am using Delta Live Tables and have my pipeline defined using the code below. My understanding is that a checkpoint is automatically set when using Delta Live Tables. I am using the Unity Catalog and Schema settings in the pipeline as the storage destination.
Since I am reading JSON messages and many files are being created, I want to eventually run a cleanup process that deletes the old files that have already been written to the streaming table. I thought I could do this by looking at the checkpoint, but I am unable to find where the checkpoints are being written or how I can access them. When I try to manually set a checkpoint directory, nothing gets created when the pipeline runs.
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(  # the DLT decorator is lowercase: dlt.table
    name="newdata_raw",
    table_properties={"quality": "bronze"},
    temporary=False,
)
def create_table():
    # Incrementally ingest JSON files with Auto Loader; `schema` and
    # `sink_dir` are defined elsewhere in the notebook.
    return (
        spark.readStream.format("cloudFiles")
        .schema(schema)
        .option("cloudFiles.format", "json")
        .load(sink_dir + "partition=*/")
        .selectExpr("newRecord.*")
        # current_timestamp() already returns a timestamp, so no cast is needed
        .withColumn("LOAD_DT", current_timestamp())
    )
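For reference, the lookup I have in mind would be something like the sketch below. It assumes the cloud_files_state table-valued function (available on recent Databricks Runtime versions) and a hypothetical checkpoint path, which is exactly the part I cannot find:

# Sketch only: the checkpoint path below is hypothetical -- finding the
# real one is the open question in this post.
checkpoint_path = "<storage_location>/checkpoints/newdata_raw/0"

# cloud_files_state exposes the files Auto Loader has recorded in a
# checkpoint; rows with a non-null commit_time were fully written to
# the streaming table and would be safe candidates for cleanup.
ingested = spark.sql(
    f"SELECT path, commit_time FROM cloud_files_state('{checkpoint_path}')"
)
ingested.where("commit_time IS NOT NULL").show(truncate=False)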
09-01-2024 12:26 PM
Hi @ggsmith,
If you use Delta Live Tables, checkpoints are stored under the storage location specified in the DLT pipeline settings. Each table gets a dedicated directory under <storage_location>/checkpoints/<dlt_table_name>.
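For example, you can browse it with dbutils. This is just a sketch; the storage path below is hypothetical and should be whatever you configured in the pipeline settings:

# Replace with the storage location from your DLT pipeline settings
# (the path below is a made-up example).
storage_location = "abfss://container@account.dfs.core.windows.net/pipelines/my_pipeline"

# List the per-table checkpoint directory for the bronze table from the question
for entry in dbutils.fs.ls(f"{storage_location}/checkpoints/newdata_raw"):
    print(entry.path)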
Friday
@szymon_dybczak how can I access the checkpoint? Is there any way I can delete the checkpoints stored in the storage location? The reason I want to clean up the checkpoints is that a change to spark.sql.shuffle.partitions is not taking effect, and according to some discussions in the community, a change to this parameter only takes effect after the existing checkpoints are cleaned up, since its value is saved in the checkpoint.
Friday
Hi @PushkarDeole,
You can just go to that location and delete it manually, or you can use dbutils. Whichever you prefer.
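With dbutils it would look something like the sketch below. Stop the pipeline first, and keep in mind that removing a checkpoint generally makes the stream start over from scratch on the next update; the path shown is hypothetical:

# Hypothetical checkpoint directory -- verify the path before deleting.
checkpoint_dir = "abfss://container@account.dfs.core.windows.net/pipelines/my_pipeline/checkpoints/newdata_raw"

# Recursively delete the checkpoint; the next pipeline run rebuilds it,
# picks up spark.sql.shuffle.partitions again, and reprocesses the source.
dbutils.fs.rm(checkpoint_dir, recurse=True)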
Friday
Thanks for the quick response @szymon_dybczak, I appreciate it. I am probably missing something. I will look into using dbutils to access the location.
However, on your first point, I am not sure how I can go directly to the location and delete it manually. That is really my main question: how can I access that location directly, without using any utility?
Friday
We are using Unity Catalog, so I don't see the storage location option, just the catalog and target schema.