Data Volume Read/Processed for a Databricks Workflow Job
01-16-2025 10:25 PM
Hello All, I have a Databricks (DBX) instance hosted on Azure, and I am using Diagnostic Settings to collect Databricks Jobs logs in a Log Analytics workspace. So far, from the DatabricksJobs table in Azure Log Analytics, I am able to fetch basic job data such as status, duration, etc. I would also like to gather insights about the total volume of data read/processed, or the throughput of a job run, similar to what the ADFActivityRun table provides for ADF pipeline runs (Copy activities).
I need help to understand the following:
- Is it even reasonable to expect this kind of data for a DBX job? If yes, how can I fetch it?
- If not, how else can I gain read/processed data-volume insights for Databricks entities? Does this apply only to Delta tables, or something like that?
Also, please note: for the time being I am running the jobs on a multi-node cluster that is not Unity Catalog-enabled, but if any solution has specific cluster requirements, please do let me know.
01-17-2025 01:02 AM - edited 01-17-2025 01:03 AM
You can opt for one of the following approaches:
1. Enable Spark Metrics:
- Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics with a metrics sink configuration such as:
  "spark.metrics.conf.*.sink.azureloganalytics.class": "org.apache.spark.metrics.sink.AzureLogAnalyticsSink",
  "spark.metrics.conf.*.sink.azureloganalytics.workspaceId": "<your-log-analytics-workspace-id>",
  "spark.metrics.conf.*.sink.azureloganalytics.primaryKey": "<your-log-analytics-primary-key>",
  "spark.metrics.conf.*.sink.azureloganalytics.period": "10"
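A quick sanity check (a minimal sketch) is to read the settings back from the running session. Note that on Databricks, metric-sink keys like these are normally set in the cluster's Spark config (Advanced options > Spark) rather than in notebook code, because the metrics system reads them at JVM startup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks the session already exists

# Print whatever spark.metrics.conf.* settings the cluster was started with.
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith("spark.metrics.conf"):
        print(key, "=", value)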
2. Use a Spark Listener:
- Implement a custom Spark listener to capture detailed metrics about the data read/processed during job execution.
# NOTE: SparkListener is a JVM (Scala/Java) API and is not exposed directly in PySpark.
# spark.extraListeners must point at a compiled listener class that is on the cluster's
# classpath (e.g. a small Scala/Java JAR attached to the cluster). Conceptually, the
# listener overrides onTaskEnd and logs taskEnd.taskMetrics.inputMetrics.bytesRead
# for every finished task.
from pyspark.sql import SparkSession

# "com.example.CustomSparkListener" is a placeholder for your compiled listener class.
spark = (
    SparkSession.builder
    .appName("CustomSparkListener")
    .config("spark.extraListeners", "com.example.CustomSparkListener")
    .getOrCreate()
)
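If compiling a JVM listener is not an option, a pure-Python fallback (a sketch, not a guaranteed recipe) is to poll Spark's monitoring REST API for per-stage input/output volume after the job's actions have run. The endpoint path and fields are standard Spark, but sc.uiWebUrl may not be directly reachable on Databricks because the UI is proxied, so verify this on your cluster first:

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Standard Spark monitoring REST API: per-stage metrics include inputBytes/outputBytes.
base_url = f"{sc.uiWebUrl}/api/v1/applications/{sc.applicationId}"
stages = requests.get(f"{base_url}/stages").json()

total_input = sum(stage.get("inputBytes", 0) for stage in stages)
total_output = sum(stage.get("outputBytes", 0) for stage in stages)
print(f"Bytes read so far: {total_input}, bytes written: {total_output}")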
3. Use Databricks REST API:
- Use the Databricks REST API to fetch detailed metrics and logs for job runs.
import requests

databricks_instance = "https://<databricks-instance>"
token = "<your-databricks-token>"
job_id = "<your-job-id>"

headers = {"Authorization": f"Bearer {token}"}

# List the runs of a job. Note that /jobs/runs/get expects a run_id,
# so to query by job_id use /jobs/runs/list instead.
response = requests.get(
    f"{databricks_instance}/api/2.0/jobs/runs/list",
    headers=headers,
    params={"job_id": job_id},
)
response.raise_for_status()
job_run_details = response.json()
print(job_run_details)
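To drill into a single run once you have its run_id from the list above, /jobs/runs/get returns the state and timings for that run. A sketch reusing the variables defined above; keep in mind the Jobs API reports run states and durations, not data volume, so it complements rather than replaces the Spark metrics:

# Pick the most recent run from the runs/list response and fetch its details.
runs = job_run_details.get("runs", [])
if runs:
    run_id = runs[0]["run_id"]
    run_resp = requests.get(
        f"{databricks_instance}/api/2.0/jobs/runs/get",
        headers=headers,
        params={"run_id": run_id},
    )
    run_resp.raise_for_status()
    print(run_resp.json())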
4. Monitor Delta Tables:
- If you are using Delta tables, you can monitor Delta Lake transaction logs to gather insights about data read/processed.
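For Delta tables specifically, the table history exposes per-operation metrics (for example numFiles, numOutputRows, and numOutputBytes on writes). A minimal sketch, assuming a Delta table named my_db.my_table (placeholder); note that the transaction log records write/merge metrics only, so read volume still has to come from the Spark-side metrics above:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# operationMetrics is a map column, e.g. numFiles / numOutputRows / numOutputBytes for WRITE.
history_df = DeltaTable.forName(spark, "my_db.my_table").history()
(history_df
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(truncate=False))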
01-19-2025 06:43 PM - edited 01-19-2025 06:44 PM
Hi @saurabh18cs , thank you for your response.
Could you please share documentation on the first method you recommended: "Enable Spark Metrics: Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics."?
I am a bit unsure about the usage of this, so some reference material would really help!
01-20-2025 05:23 AM
Hi @sahasimran98, I think you're right; that configuration is more applicable to Synapse, where such a sink exists. You can still give it a try on Databricks and let us know the results here. Otherwise, try to find a spark-monitoring package for Databricks on GitHub.

