Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Data Volume Read/Processed for a Databricks Workflow Job

sahasimran98
New Contributor II

Hello All, I have a DBx instance hosted on Azure, and I am using Diagnostic Settings to collect Databricks Jobs logs in a Log Analytics workspace. So far, from the DatabricksJobs table in Azure Log Analytics, I am able to fetch basic job-related data such as status and duration. I am also looking to gather insights about the total volume of data read/processed, or the throughput of a job run, similar to what the ADFActivityRun table provides for ADF pipeline runs (Copy activities).
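
For context, this is roughly how I query those basic fields from the DatabricksJobs table in Python (a minimal sketch assuming the azure-monitor-query and azure-identity packages; the workspace ID is a placeholder and the projected columns should be adjusted to the actual diagnostic schema):

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Query the DatabricksJobs diagnostic table in the Log Analytics workspace.
client = LogsQueryClient(DefaultAzureCredential())

query = """
DatabricksJobs
| where TimeGenerated > ago(1d)
| project TimeGenerated, ActionName, RequestParams, Response
"""

response = client.query_workspace(
    workspace_id="<your-log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)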

I need help to understand the following:

  1. Is expecting such data for a DBx job even appropriate? If yes, how can I fetch this kind of data?
  2. If not, how can I gain such data read/processed insights for Databricks entities? Does this apply only to Delta tables, or something like that?

Also, please note: for the time being I am running the jobs on a non-Unity-Catalog-enabled, multi-node cluster, but if there are any specific cluster requirements to implement any kind of solution, please do let me know.

3 REPLIES

saurabh18cs
Valued Contributor III

Hi @sahasimran98 

You can opt for one of the following approaches:

  1. Enable Spark Metrics:

    • Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics.

      "spark.metrics.conf.*.sink.azureloganalytics.class": "org.apache.spark.metrics.sink.AzureLogAnalyticsSink",
      "spark.metrics.conf.*.sink.azureloganalytics.workspaceId": "<your-log-analytics-workspace-id>",
      "spark.metrics.conf.*.sink.azureloganalytics.primaryKey": "<your-log-analytics-primary-key>",
      "spark.metrics.conf.*.sink.azureloganalytics.period": "10"

  2. Use Spark Listener:

    • Implement a custom Spark listener to capture detailed metrics about data read/processed during job execution.

from pyspark.sql import SparkSession

# Note: Spark listeners run on the JVM, so the listener itself has to be written
# in Scala/Java (extending org.apache.spark.scheduler.SparkListener), packaged as
# a JAR, and attached to the cluster. In its onTaskEnd callback it can log, for
# example, taskEnd.taskMetrics.inputMetrics.bytesRead for each finished task.
# The Python session below only registers that compiled listener class.
spark = (
    SparkSession.builder
    .appName("CustomSparkListener")
    # com.example.CustomSparkListener must be available on the cluster classpath
    .config("spark.extraListeners", "com.example.CustomSparkListener")
    .getOrCreate()
)

  3. Use Databricks REST API:

    • Use the Databricks REST API to fetch detailed metrics and logs for job runs.

import requests

databricks_instance = "https://<databricks-instance>"
token = "<your-databricks-token>"
run_id = "<your-run-id>"

headers = {
    "Authorization": f"Bearer {token}"
}

# jobs/runs/get returns the details of a single run and takes run_id (not job_id)
response = requests.get(
    f"{databricks_instance}/api/2.0/jobs/runs/get?run_id={run_id}",
    headers=headers,
)
job_run_details = response.json()

print(job_run_details)
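
If you only have the job_id, you can first list that job's runs to pick up run_id values (a small sketch continuing the snippet above; the Jobs API 2.0 runs/list endpoint accepts job_id):

# Continuing from the snippet above: list recent runs of a job to get run_id values.
job_id = "<your-job-id>"

list_response = requests.get(
    f"{databricks_instance}/api/2.0/jobs/runs/list",
    headers=headers,
    params={"job_id": job_id, "limit": 5},
)
for run in list_response.json().get("runs", []):
    print(run["run_id"], run["state"].get("life_cycle_state"))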

  4. Monitor Delta Tables:

    • If you are using Delta tables, you can monitor Delta Lake transaction logs to gather insights about data read/processed.

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "path/to/delta/table")
history = delta_table.history()
history.show()
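
The history() output also carries an operationMetrics map, so, continuing the snippet above, you can pull per-operation write volumes (a sketch; the available keys such as numOutputRows and numOutputBytes depend on the operation type):

from pyspark.sql.functions import col

# operationMetrics is a map<string, string>; WRITE/MERGE operations typically
# expose keys like numOutputRows and numOutputBytes.
(history
    .select(
        "version",
        "timestamp",
        "operation",
        col("operationMetrics")["numOutputRows"].alias("num_output_rows"),
        col("operationMetrics")["numOutputBytes"].alias("num_output_bytes"),
    )
    .show(truncate=False))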

sahasimran98
New Contributor II

Hi @saurabh18cs, thank you for your response.

Could you please share documentation on the first method you recommended: "Enable Spark Metrics: Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics."?

I am a bit unsure about the usage of this, so some reference material would really help!

saurabh18cs
Valued Contributor III

Hi @sahasimran98, I think you're right: this is more applicable to Synapse, where such a configuration exists, but you can still give it a try on Databricks and let us know the results here. Otherwise, look for a spark-monitoring package for Databricks on GitHub (for example, the mspnp/spark-monitoring library).
