topic Re: Data Volume Read/Processed for a Databricks Workflow Job in Data Engineering

Data Volume Read/Processed for a Databricks Workflow Job

sahasimran98 — Fri, 17 Jan 2025 06:25:14 GMT

Hello All, I have a DBx instance hosted on Azure and I am using Diagnostic Settings to collect Databricks Jobs related logs in a Log Analytics workspace. So far, from the DatabricksJobs table in Azure Log Analytics, I am able to fetch basic job related data like status, duration etc. I would also like to gather some insights about the total volume of data read/processed, or the throughput of a Job run, something similar to what we get in the ADFActivityRun table for ADF pipeline runs (Copy activities).

I need help to understand the following:

Is my expectation of such data w.r.t. a DBx job even appropriate? If yes, how can I fetch this kind of data?
If not, how can I gain such data read/processed insights w.r.t. Databricks entities? Is it applicable only to Delta tables, or something like that?

Also please note: I am using a non Unity-Catalog enabled, multi-node cluster for running the Jobs on (for the time being), but if there are any specific cluster requirements to implement any kind of solution, please do let me know about it.

Re: Data Volume Read/Processed for a Databricks Workflow Job

saurabh18cs — Fri, 17 Jan 2025 09:03:55 GMT

Hi @sahasimran98

You can try one of the following approaches:

1. Enable Spark Metrics:

Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics.

  "spark.metrics.conf.*.sink.azureloganalytics.class": "org.apache.spark.metrics.sink.AzureLogAnalyticsSink",
  "spark.metrics.conf.*.sink.azureloganalytics.workspaceId": "<your-log-analytics-workspace-id>",
  "spark.metrics.conf.*.sink.azureloganalytics.primaryKey": "<your-log-analytics-primary-key>",
  "spark.metrics.conf.*.sink.azureloganalytics.period": "10"

2. Use Spark Listener:

Implement a custom Spark listener to capture detailed metrics about data read/processed during job execution.

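# Illustrative only: SparkListener is a JVM (Scala/Java) class and PySpark does not
# provide a Python base class for it, so treat the class below as pseudocode for a
# listener that would be written in Scala or Java and attached to the cluster as a JAR.
from pyspark.sql import SparkSession

class CustomSparkListener(SparkListener):
    def onTaskEnd(self, taskEnd):
        metrics = taskEnd.taskMetrics()
        print(f"Task {taskEnd.taskInfo().taskId()} read {metrics.inputMetrics().bytesRead()} bytes")

# spark.extraListeners expects the fully qualified name of a compiled JVM class.
# It is read when the SparkContext starts, so on Databricks it would normally be set
# in the cluster's Spark config rather than in a builder call inside a notebook.
spark = SparkSession.builder \
    .appName("CustomSparkListener") \
    .config("spark.extraListeners", "com.example.CustomSparkListener") \
    .getOrCreate()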
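
If writing a JVM listener is not an option, similar per-stage I/O counters are also exposed by Spark's monitoring REST API (`/api/v1/applications/<app-id>/stages`). Below is a minimal sketch that sums them. It assumes the Spark UI endpoint is reachable from the driver, which may not hold on every Databricks setup; `fetch_stages` and `summarize_stage_io` are hypothetical helper names, and the sample records are hand-written, not real output.

```python
import json
from urllib.request import urlopen

# Byte/record counters reported per stage by Spark's monitoring REST API.
IO_FIELDS = ("inputBytes", "inputRecords", "outputBytes", "outputRecords")


def fetch_stages(ui_url, app_id):
    """Fetch the stage list for one application (assumes ui_url is reachable)."""
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/stages") as resp:
        return json.load(resp)


def summarize_stage_io(stages):
    """Sum the I/O counters over all stages; missing fields count as 0."""
    totals = dict.fromkeys(IO_FIELDS, 0)
    for stage in stages:
        for field in IO_FIELDS:
            totals[field] += stage.get(field, 0)
    return totals


# Two hand-written stage records standing in for an API response.
sample = [
    {"stageId": 0, "inputBytes": 1024, "inputRecords": 10, "outputBytes": 0, "outputRecords": 0},
    {"stageId": 1, "inputBytes": 2048, "inputRecords": 20, "outputBytes": 512, "outputRecords": 5},
]
print(summarize_stage_io(sample))
```

Inside a job, the inputs could plausibly come from `spark.sparkContext.uiWebUrl` and `spark.sparkContext.applicationId`. Note that retried stage attempts appear as separate entries, so the totals can include re-read data.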

3. Use Databricks REST API:

Use the Databricks REST API to fetch detailed metrics and logs for job runs.

import requests

databricks_instance = "https://<databricks-instance>"
token = "<your-databricks-token>"
job_id = "<your-job-id>"

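headers = {
    "Authorization": f"Bearer {token}"
}

# runs/list accepts a job_id; runs/get expects a run_id instead.
response = requests.get(
    f"{databricks_instance}/api/2.0/jobs/runs/list",
    headers=headers,
    params={"job_id": job_id},
)
response.raise_for_status()
job_run_details = response.json()

print(job_run_details)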
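
The JSON from a `jobs/runs/list` call is fairly verbose, so it can help to reduce it to a few fields per run. The sketch below does that on a hand-written sample payload. The field names (`runs`, `run_id`, `state.result_state`, `start_time`, `end_time`) follow the Jobs API response as I understand it, and `summarize_runs` is a hypothetical helper. Note that this gives run metadata such as state and timings, not bytes read.

```python
def summarize_runs(payload):
    """Return run_id, result state and wall-clock duration (seconds) per run."""
    summary = []
    for run in payload.get("runs", []):
        start, end = run.get("start_time"), run.get("end_time")
        # Timestamps are epoch milliseconds; end_time is 0 or absent while running.
        duration = (end - start) / 1000 if start and end else None
        summary.append({
            "run_id": run.get("run_id"),
            "result_state": run.get("state", {}).get("result_state"),
            "duration_s": duration,
        })
    return summary


# Hand-written sample shaped like a runs/list response (not real output).
sample_payload = {
    "runs": [
        {"run_id": 1, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
         "start_time": 1737100000000, "end_time": 1737100090000},
        {"run_id": 2, "state": {"life_cycle_state": "RUNNING"},
         "start_time": 1737100100000, "end_time": 0},
    ]
}
for row in summarize_runs(sample_payload):
    print(row)
```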

4. Monitor Delta Tables:

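If you are using Delta tables, you can inspect the Delta Lake transaction log through the table history, which records per-commit operation metrics (for example rows and bytes written).

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "path/to/delta/table")

# One row per commit; the operationMetrics column holds the per-operation counters.
history = delta_table.history()

history.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)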
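
To turn the history into a single volume figure, the `operationMetrics` map of each commit can be summed. A minimal sketch follows, assuming the rows were first converted to plain dicts (for example with `[r.asDict() for r in history.collect()]`) and that the write operations of interest report `numOutputRows` and `numOutputBytes`; which keys appear depends on the operation type, and `total_written` is a hypothetical helper. These are write-side counters, so they describe data written to the table rather than data read by the job.

```python
def total_written(history_rows, keys=("numOutputRows", "numOutputBytes")):
    """Sum selected operationMetrics counters across commits.

    Metric values are stored as strings in the Delta history, so they are
    converted to int; commits without a given key contribute 0.
    """
    totals = dict.fromkeys(keys, 0)
    for row in history_rows:
        metrics = row.get("operationMetrics") or {}
        for key in keys:
            totals[key] += int(metrics.get(key, 0))
    return totals


# Hand-written rows shaped like DeltaTable.history() output (not real data).
sample_history = [
    {"version": 2, "operation": "WRITE",
     "operationMetrics": {"numFiles": "1", "numOutputRows": "100", "numOutputBytes": "2048"}},
    {"version": 1, "operation": "OPTIMIZE",
     "operationMetrics": {"numAddedFiles": "1", "numRemovedFiles": "3"}},
    {"version": 0, "operation": "WRITE",
     "operationMetrics": {"numFiles": "1", "numOutputRows": "50", "numOutputBytes": "1024"}},
]
print(total_written(sample_history))
```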

Re: Data Volume Read/Processed for a Databricks Workflow Job

sahasimran98 — Mon, 20 Jan 2025 02:44:19 GMT

Hi @saurabh18cs , thank you for your response.

Could you please share documentation on the first method you recommended: "Enable Spark Metrics: Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics."?

I am a bit unsure about the usage of this, so some reference material would really help!

Re: Data Volume Read/Processed for a Databricks Workflow Job

saurabh18cs — Mon, 20 Jan 2025 13:23:18 GMT

Hi @sahasimran98, I think you're right: this is more applicable to Synapse, where such a configuration exists. You can still give it a try on Databricks and let us know the results here. Otherwise, try to find a spark-monitoring package on GitHub for Databricks.