Data Engineering

What is the best practice for tracing Databricks observe and writeStream data record flow?

sbux
New Contributor

I'm trying to connect the dots on the method below, from a new event on Azure Event Hubs, through storage, partitions, and Avro records (those I can monitor), to my Delta table. How do I trace observe, writeStream, and the trigger?

...
elif TABLE_TYPE == "live":
    print("DEBUG: TABLE_TYPE is live observe table")
    print(f"DEBUG: observe {YEAR}, {MONTH}, {DATE} writeStream queryName {EVENTHUB_NAME} "
          f"CHECKPOINT_PATH {CHECKPOINT_PATH} start ADLS_MOUNT_PATH {ADLS_MOUNT_PATH}")
    (table
        .observe("metric", lit(f"{YEAR}-{MONTH}-{DATE}").alias("batchTime"))
        .writeStream
        .queryName(EVENTHUB_NAME)
        .format("delta")
        .trigger(processingTime="210 seconds")
        .option("checkpointLocation", CHECKPOINT_PATH)
        .start(ADLS_MOUNT_PATH))

I've verified that my upstream app events are captured by the target Azure Event Hub, and I can see the new Avro files in Azure Storage, but the streaming snippet above does not write new events. With the code below I can write that same event data in batch mode. I'm looking for help and suggestions on the best way to trace and troubleshoot getting live streaming to work.

print("DEBUG: test this write to test_live target")
spark.catalog.refreshTable(TARGET_TABLE)
table.write.format("delta").mode("overwrite").option("mergeSchema", "true").saveAsTable(TARGET_TABLE)
 

Thanks, New Databricks dev

David

2 REPLIES

Anonymous
Not applicable

@David Martin:

To troubleshoot and debug your streaming pipeline, you can use the following steps:

  1. Check the logs: The print statements in the code you provided surface in the driver logs. You can read them in the Databricks workspace (cluster driver logs) or ship them to a log aggregation tool like Azure Monitor.
  2. Monitor the streaming query: A running query in a notebook shows a live dashboard with input rate and batch duration, and the Spark UI has a "Structured Streaming" tab with per-batch input and processed row counts. You can also poll lastProgress on the handle returned by start().
  3. Verify the schema: Make sure the schema of the parsed Event Hub (Avro) records matches the schema of the target Delta table. Compare df.printSchema() on the stream against DESCRIBE TABLE on the target.
  4. Check the checkpoint location: Verify that the checkpoint location specified in the code exists and has the correct permissions, and that it is not a stale checkpoint left over from an earlier run (old committed offsets can make a query look idle). You can browse it with dbutils.fs.ls.
  5. Test the trigger: processingTime="210 seconds" means a micro-batch fires at most every 3.5 minutes, so let the query run well past one interval before concluding nothing is written. query.status tells you whether a trigger is currently active and whether data is available.
  6. Check the partitioning: Verify that events are actually arriving on the Event Hub partitions your consumer reads, and that the data is reasonably distributed among the partitions.
  7. Check the writeStream output: Verify that the writeStream method is writing the data to the expected output location (ADLS_MOUNT_PATH), for example by running spark.read.format("delta").load(ADLS_MOUNT_PATH) after a trigger interval has elapsed.
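
For steps 2 and 7 above, the handle returned by .start() exposes lastProgress, a dict that also carries any metrics registered via DataFrame.observe(). A minimal sketch of pulling out the fields that matter for tracing (the sample dict below is illustrative, shaped like a real progress report; in a notebook you would pass query.lastProgress instead):

```python
# Sketch: summarizing StreamingQuery progress to trace the observe() metric.
# In a Databricks notebook you would feed this query.lastProgress (a dict);
# here an illustrative progress dict stands in, so the sketch runs anywhere.

def summarize_progress(progress):
    """Pull the fields that matter for tracing: input volume, batch id,
    and any metrics registered via DataFrame.observe()."""
    if progress is None:  # no micro-batch has completed yet
        return None
    return {
        "query": progress.get("name"),
        "batchId": progress.get("batchId"),
        "numInputRows": progress.get("numInputRows", 0),
        "observedMetrics": progress.get("observedMetrics", {}),
    }

# Illustrative shape of query.lastProgress for the query in your snippet
sample = {
    "name": "my-eventhub",          # set by .queryName(EVENTHUB_NAME)
    "batchId": 7,
    "numInputRows": 120,
    "observedMetrics": {"metric": {"batchTime": "2023-05-01"}},
}

print(summarize_progress(sample))
```

If numInputRows stays at 0 across batches, the source is not delivering rows, which points the investigation upstream rather than at the Delta sink.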

By following these steps and monitoring the pipeline at each stage, you can identify where the records stop flowing and get the streaming query working with live data.

Does this help? Please let us know.

Anonymous
Not applicable

Hi @David Martin

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!