Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Data Volume Read/Processed for a Databricks Workflow Job

sahasimran98
New Contributor II

Hello All, I have a DBx instance hosted on Azure, and I am using Diagnostic Settings to collect Databricks Jobs logs in a Log Analytics workspace. So far, from the DatabricksJobs table in Azure Log Analytics, I am able to fetch basic job-related data such as status and duration. I am also looking to gather insights about the total volume of data read/processed, or the throughput of a job run, similar to what the ADFActivityRun table provides for ADF pipeline runs (Copy activities).
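
For context, this is roughly how I query those basic fields from the DatabricksJobs table in Python (a minimal sketch assuming the azure-monitor-query and azure-identity packages; the workspace ID is a placeholder and the projected columns should be adjusted to the actual diagnostic schema):

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Query the DatabricksJobs diagnostic table in the Log Analytics workspace.
client = LogsQueryClient(DefaultAzureCredential())

query = """
DatabricksJobs
| where TimeGenerated > ago(1d)
| project TimeGenerated, ActionName, RequestParams, Response
"""

response = client.query_workspace(
    workspace_id="<your-log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)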

I need help to understand the following:

  1. Is expecting such data for a DBx job even appropriate? If yes, how can I fetch this kind of data?
  2. If not, how can I gain such data read/processed insights for Databricks entities? Does this apply only to Delta tables, or something like that?

Also, please note: for the time being I am running the jobs on a non-Unity-Catalog-enabled, multi-node cluster, but if there are any specific cluster requirements to implement any kind of solution, please do let me know.

3 REPLIES

saurabh18cs
Valued Contributor III

Hi @sahasimran98 

You can opt for one of the following approaches:

  1. Enable Spark Metrics:

    • Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics.

      "spark.metrics.conf.*.sink.azureloganalytics.class": "org.apache.spark.metrics.sink.AzureLogAnalyticsSink",
      "spark.metrics.conf.*.sink.azureloganalytics.workspaceId": "<your-log-analytics-workspace-id>",
      "spark.metrics.conf.*.sink.azureloganalytics.primaryKey": "<your-log-analytics-primary-key>",
      "spark.metrics.conf.*.sink.azureloganalytics.period": "10"

  2. Use Spark Listener:

    • Implement a custom Spark listener to capture detailed metrics about data read/processed during job execution.

from pyspark.sql import SparkSession

# Note: Spark listeners run on the JVM, so the listener itself has to be written
# in Scala/Java (extending org.apache.spark.scheduler.SparkListener), packaged as
# a JAR, and attached to the cluster. In its onTaskEnd callback it can log, for
# example, taskEnd.taskMetrics.inputMetrics.bytesRead for each finished task.
# The Python session below only registers that compiled listener class.
spark = (
    SparkSession.builder
    .appName("CustomSparkListener")
    # com.example.CustomSparkListener must be available on the cluster classpath
    .config("spark.extraListeners", "com.example.CustomSparkListener")
    .getOrCreate()
)

  3. Use Databricks REST API:

    • Use the Databricks REST API to fetch detailed metrics and logs for job runs.

import requests

databricks_instance = "https://<databricks-instance>"
token = "<your-databricks-token>"
run_id = "<your-run-id>"

headers = {
    "Authorization": f"Bearer {token}"
}

# jobs/runs/get returns the details of a single run and takes run_id (not job_id)
response = requests.get(
    f"{databricks_instance}/api/2.0/jobs/runs/get?run_id={run_id}",
    headers=headers,
)
job_run_details = response.json()

print(job_run_details)
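
If you only have the job_id, you can first list that job's runs to pick up run_id values (a small sketch continuing the snippet above; the Jobs API 2.0 runs/list endpoint accepts job_id):

# Continuing from the snippet above: list recent runs of a job to get run_id values.
job_id = "<your-job-id>"

list_response = requests.get(
    f"{databricks_instance}/api/2.0/jobs/runs/list",
    headers=headers,
    params={"job_id": job_id, "limit": 5},
)
for run in list_response.json().get("runs", []):
    print(run["run_id"], run["state"].get("life_cycle_state"))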

  4. Monitor Delta Tables:

    • If you are using Delta tables, you can monitor Delta Lake transaction logs to gather insights about data read/processed.

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "path/to/delta/table")
history = delta_table.history()
history.show()
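
The history() output also carries an operationMetrics map, so, continuing the snippet above, you can pull per-operation write volumes (a sketch; the available keys such as numOutputRows and numOutputBytes depend on the operation type):

from pyspark.sql.functions import col

# operationMetrics is a map<string, string>; WRITE/MERGE operations typically
# expose keys like numOutputRows and numOutputBytes.
(history
    .select(
        "version",
        "timestamp",
        "operation",
        col("operationMetrics")["numOutputRows"].alias("num_output_rows"),
        col("operationMetrics")["numOutputBytes"].alias("num_output_bytes"),
    )
    .show(truncate=False))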

sahasimran98
New Contributor II

Hi @saurabh18cs, thank you for your response.

Could you please share documentation on the first method you recommended: "Enable Spark Metrics: Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics."?

I am a bit unsure about the usage of this, so some reference material would really help!

saurabh18cs
Valued Contributor III

Hi @sahasimran98, I think you're right: this is more applicable to Synapse, where such a configuration exists, but you can still give it a try on Databricks and let us know the results here. Otherwise, look for a spark-monitoring package for Databricks on GitHub (for example, the mspnp/spark-monitoring library).
