Hi @sahasimran98
you can try one of the following approaches:
1. Enable Spark Metrics:
- Databricks exposes detailed metrics for Spark jobs, stages, and tasks, and you can forward them to Azure Log Analytics through a metrics sink set in the cluster's Spark config. Note that a Log Analytics sink is not part of core Spark; it comes from a monitoring library (for example, Azure's spark-monitoring project), so match the class name below to the library you attach.
"spark.metrics.conf.*.sink.azureloganalytics.class":"org.apache.spark.metrics.sink.AzureLogAnalyticsSink",
"spark.metrics.conf.*.sink.azureloganalytics.workspaceId": "<your-log-analytics-workspace-id>",
"spark.metrics.conf.*.sink.azureloganalytics.primaryKey": "<your-log-analytics-primary-key>",
"spark.metrics.conf.*.sink.azureloganalytics.period": "10"
2. Use a Spark Listener:
- Implement a custom Spark listener to capture detailed metrics about data read/processed during job execution.
# SparkListener is a JVM API, so the listener itself is normally written in
# Scala/Java, compiled into a JAR, and attached to the cluster. Scala sketch:
#   package com.example
#   import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
#   class CustomSparkListener extends SparkListener {
#     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
#       println(s"Task ${taskEnd.taskInfo.taskId} read " +
#         s"${taskEnd.taskMetrics.inputMetrics.bytesRead} bytes")
#   }
# Register the compiled class from PySpark (it must be on the cluster classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("CustomSparkListener") \
    .config("spark.extraListeners", "com.example.CustomSparkListener") \
    .getOrCreate()
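The listener runs on the driver, so its println output should show up in the cluster's driver log (stdout), which you can open from the cluster's Driver Logs tab in Databricks.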
3. Use the Databricks REST API:
- Use the Jobs REST API to fetch run details (state, timings, task results) for job runs.
import requests

databricks_instance = "https://<databricks-instance>"
token = "<your-databricks-token>"
job_id = "<your-job-id>"
headers = {
    "Authorization": f"Bearer {token}"
}
# /jobs/runs/get expects a run_id, so list the runs of the job first
response = requests.get(
    f"{databricks_instance}/api/2.0/jobs/runs/list",
    headers=headers,
    params={"job_id": job_id},
)
job_runs = response.json()
print(job_runs)
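From that list you can drill into a single run with /jobs/runs/get; a small sketch, assuming the response above contains at least one run:
# Fetch full details for the most recent run returned by runs/list
runs = job_runs.get("runs", [])
if runs:
    run_id = runs[0]["run_id"]
    detail = requests.get(
        f"{databricks_instance}/api/2.0/jobs/runs/get",
        headers=headers,
        params={"run_id": run_id},
    )
    print(detail.json())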
4. Monitor Delta Tables:
- If you are using Delta tables, you can monitor Delta Lake transaction logs to gather insights about data read/processed.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "path/to/delta/table")
history = delta_table.history()  # commit history of the table as a DataFrame
history.show()
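Each history entry also carries an operationMetrics map (e.g. numFiles, numOutputRows for writes), which you can project out directly; a small sketch against the table above:
# Inspect per-commit operation metrics recorded in the Delta transaction log
history.select("version", "timestamp", "operation", "operationMetrics") \
    .show(truncate=False)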