Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Getting data from the Spark query profiler

IONA
New Contributor III

When you navigate to Compute > Select Cluster > Spark UI > JDBC/ODBC, you can see grids of Session stats and SQL stats. Is there any way to get this data in a query so that I can do some analysis?

 

Thanks


6 REPLIES

BigRoux
Databricks Employee

Hello Iona,

You cannot natively query the exact Session stats and SQL stats from the JDBC/ODBC Spark UI via a simple SQL statement in Databricks today. However, advanced users and admins can access some of the underlying data via log tables (like prod.thrift_statements), the Query History API, or specialized REST endpoints. For practical analysis, using the Query History API and parsing the results into Python or SQL is the closest workaround currently available.
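For illustration, here is a minimal sketch of that workaround: it calls the Query History REST API and loads the results into pandas for a simple per-user analysis. The notebook-context authentication, the max_results value, and the user_name/duration field names are assumptions made for this example, not a confirmed recipe.

# Minimal sketch: pull recent query history via the REST API and analyze it in pandas.
# Assumes this runs in a Databricks notebook, so host and token come from the REPL context;
# adjust authentication (e.g. a PAT) for other environments.
import requests
import pandas as pd
from dbruntime.databricks_repl_context import get_context

ctx = get_context()
resp = requests.get(
    f"https://{ctx.browserHostName}/api/2.0/sql/history/queries",
    headers={"Authorization": f"Bearer {ctx.apiToken}"},
    params={"max_results": 100},
)
resp.raise_for_status()

# The response body holds the queries under the "res" key.
df = pd.DataFrame(resp.json().get("res", []))

# Example analysis: average query duration (milliseconds) per user.
if not df.empty and {"user_name", "duration"} <= set(df.columns):
    print(df.groupby("user_name")["duration"].mean().sort_values(ascending=False))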

 

Hope this helps, Louis.

IONA
New Contributor III

Great info. Thank you ever so much.

My actual need is to find out, in a programmatic manner, which tables in Databricks are being used by a Power BI dashboard. If you open the Power BI report itself you can see the data model and list the tables, and I would have thought one of the Power BI REST API endpoints would give this info, since you can do things through it such as set off a refresh. But it seems that is not the case.

So another approach would be to start at the Databricks end and examine what requests are made of it. Looking at the Spark info I can see the queries hitting the database, and by looking at the user/service principal I can see what is making those requests. By parsing the SQL statement, which for a refresh will be "Select * from <<SometableInPowerBI>>", I will be able to say: aha, that's a table of interest to me. My aim is then to monitor these tables and check they are being refreshed, so that I know all the data in our dashboard is up to date.
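For illustration, a rough sketch of that parsing idea (the service principal name is a placeholder, and the executed_by/statement_text/start_time column names are assumed from the system.query.history schema) might look like this:

import re

# Placeholder: the service principal Power BI uses to connect to Databricks.
PBI_PRINCIPAL = "powerbi-service-principal@example.com"

# Pull the SQL text of recent queries run by that principal from the query history system table.
history = spark.sql(f"""
    SELECT statement_text
    FROM system.query.history
    WHERE executed_by = '{PBI_PRINCIPAL}'
      AND start_time >= current_timestamp() - INTERVAL 1 DAY
""").collect()

# Very rough extraction of "FROM <table>" targets; a real SQL parser would be more robust.
tables = set()
for row in history:
    tables.update(re.findall(r"\bFROM\s+([\w`.]+)", row.statement_text, flags=re.IGNORECASE))

print(sorted(tables))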

WiliamRosa
New Contributor II

Hi @IONA,

Totally agree with @BigRoux. To make his point actionable, here are official docs you can use:

Query history system table (system.query.history): https://docs.databricks.com/aws/en/admin/system-tables/query-history

Query History (overview/UI): https://docs.databricks.com/aws/en/sql/user/queries/query-history

Query History REST API: https://docs.databricks.com/api/workspace/queryhistory/list

Spark Thrift Server (why JDBC/ODBC UI grids aren't exposed as tables; retention settings): https://spark.apache.org/docs/latest/configuration.html#spark-sql

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

IONA
New Contributor III

This is great thanks. I will share this knowledge with my team as well.

szymon_dybczak
Esteemed Contributor III

 

Hi @IONA,

As @BigRoux correctly suggested, there is no native way to get these stats from the JDBC/ODBC Spark UI.

1. You can try the query history system table, but it exposes a limited number of metrics:

 

%sql
SELECT *
FROM system.query.history

 

2. You can use the /api/2.0/sql/history/queries endpoint with the include_metrics flag enabled, which should return a payload like the following:

[Screenshot: example response payload showing per-query metrics]
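As a rough illustration of that call (not an official snippet: the notebook-context auth and the specific fields named in the comment are assumptions), fetching one recent query with include_metrics and inspecting its nested metrics object could look like this:

from dbruntime.databricks_repl_context import get_context
import requests

# Minimal sketch: request one recent query with include_metrics=true
# and print its nested "metrics" object.
ctx = get_context()
resp = requests.get(
    f"https://{ctx.browserHostName}/api/2.0/sql/history/queries",
    headers={"Authorization": f"Bearer {ctx.apiToken}"},
    params={"max_results": 1, "include_metrics": "true"},
)
resp.raise_for_status()

for q in resp.json().get("res", []):
    # Fields such as total_time_ms, read_bytes and rows_produced_count
    # are typically found under "metrics".
    print(q.get("query_text"), q.get("metrics"))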

 

3. Metrics can also be obtained for the following (a rough Jobs API sketch follows this list):

  • Cluster metrics - you can export these with cluster logging. It's worth noting that Ganglia is deprecated for newer runtimes.
  • Warehouse metrics - available through the API for query metrics.
  • Jobs performance - you can use the Jobs API.
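For the jobs bullet, here is a minimal sketch (assuming the Jobs API 2.1 runs/list endpoint and notebook-context auth; field names are an assumption and may differ slightly by API version) that lists recent runs with their execution durations:

from dbruntime.databricks_repl_context import get_context
import requests

# Minimal sketch: list recent job runs and print their execution durations.
ctx = get_context()
resp = requests.get(
    f"https://{ctx.browserHostName}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {ctx.apiToken}"},
    params={"limit": 25},
)
resp.raise_for_status()

for run in resp.json().get("runs", []):
    # execution_duration is reported in milliseconds.
    print(run.get("run_id"), run.get("state", {}).get("result_state"), run.get("execution_duration"))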

4. And lastly, you can use the Apache Spark REST API monitoring endpoints, which give you access to many different metrics. Here, just for the sake of an example, I'm using it to get the environment configuration of my cluster, but there are many, many more. You can find the full list at the location below:

Monitoring and Instrumentation - Spark 4.0.0 Documentation

 

 

from dbruntime.databricks_repl_context import get_context
import requests

# Grab the workspace host, cluster id and API token from the notebook context.
context = get_context()
host = context.browserHostName
cluster_id = context.clusterId

# Spark's monitoring REST API, reached via the Databricks driver proxy (port 40001 here).
spark_ui_base_url = f"https://{host}/driver-proxy-api/o/0/{cluster_id}/40001/api/v1/"
# Application id of this cluster's Spark application, as shown in the Spark UI.
endpoint = 'applications/local-1756797804565/environment'

response = requests.get(
    spark_ui_base_url + endpoint,
    headers={"Authorization": f"Bearer {context.apiToken}"}
)

if response.status_code == 200:
    try:
        data = response.json()
        print(data)
    except requests.exceptions.JSONDecodeError:
        print("Response is not valid JSON:")
        print(response.text)
else:
    print(f"Request failed with status code: {response.status_code}")
    print(f"Response: {response.text}")

 

 

 

IONA
New Contributor III

That is great. Thanks
