Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT pipeline observability questions (and maybe suggestions)

guangyi
Contributor III

All my questions are about this code block:

import dlt

@dlt.append_flow(target="target_table")
def flow_01():
  return spark.readStream.table("table_01")

@dlt.append_flow(target="target_table")
def flow_02():
  return spark.readStream.table("table_02")

The first question: can I manually inspect, update, or delete the checkpoint of the streaming table reads in the code above?

I suppose there is no way of specifying the checkpoint location, because I cannot find any documentation for such a feature. I looked under Structured Streaming checkpoints and Configure a Delta Live Tables pipeline, but found nothing there.

The reason I want this is troubleshooting. For example, I want to monitor the daily reads from a specific streaming table: how much data was read this run, and from which offset it started and ended. Also, if something goes wrong, I could delete the checkpoint as a reset measure. A full refresh can solve that problem, but direct access to the checkpoint would give me a lot of insight into how the pipeline is running.
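For context, this is the kind of control I mean: outside DLT, a plain Structured Streaming job pins its checkpoint explicitly, so it can be inspected or deleted to reset the stream. The path and table names below are hypothetical examples, not something from my pipeline:

```python
checkpoint_path = "/tmp/checkpoints/flow_01"  # hypothetical user-managed path

def start_flow(spark):
    # Plain Structured Streaming (not DLT): the checkpoint location is
    # user-chosen, so its contents (offsets, commits) can be examined
    # directly, or the directory deleted to restart from scratch.
    return (spark.readStream.table("table_01")
            .writeStream
            .option("checkpointLocation", checkpoint_path)
            .toTable("target_table"))
```

In DLT the pipeline manages this location itself, which is exactly why I am asking whether it can still be reached.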

The second question is similar: is there a way to monitor the append_flow behavior, such as how much data flows from each source table to the target table in the daily job?

The reason I want this is that after all the append flows complete, all I get from the target table is a single total incremental row count for the run. I cannot tell how much data each individual flow contributed, or how long each one took.
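What I was hoping for is something like a per-flow breakdown from the DLT event log. As a sketch of what I mean (the table name is a placeholder, and I am assuming the event log exposes `flow_progress` events with a `num_output_rows` metric, per my reading of the docs):

```python
# Sketch: query the DLT event log for per-flow row counts.
# `my_catalog.my_schema.target_table` is a placeholder for a real
# streaming table managed by the pipeline.
per_flow_query = """
    SELECT timestamp,
           origin.flow_name,
           details:flow_progress.metrics.num_output_rows AS rows_written
    FROM event_log(TABLE(my_catalog.my_schema.target_table))
    WHERE event_type = 'flow_progress'
    ORDER BY timestamp
"""

def per_flow_rows(spark):
    # Must run inside a Databricks workspace with access to the
    # pipeline's event log.
    return spark.sql(per_flow_query)
```

If this is already the intended way to answer my question, a pointer to the per-flow timing fields would also help.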

Is there a feature I have overlooked, or one I could refer to, that accomplishes my goal? Or is there any plan to add these features in the future?

 

