Data Governance
Join discussions on data governance practices, compliance, and security within the Databricks Community. Exchange strategies and insights to ensure data integrity and regulatory compliance.

child Subqueries/sub-statements history metrics, from a parent [CALL...] statement in QueryHistory

ADBricksExplore
New Contributor II

Hi,
So far I cannot find a way to programmatically (SQL/Python) retrieve the Subquery/Sub-statement execution history records, shown in the ADBricks UI Query History/Profile, that were executed during a TaskRun of a Job, as shown in the [red boxes] on the attached screenshot:
Subqueries-ADBricks-UI (screenshot)

 


What I've tried so far, without any luck:
- mimicking the browser client's ADBricks UI requests, like [POST https://*.azuredatabricks.net/graphql/HistoryStatementPlanMetadata] with {operationName:"HistoryStatementPlanMetadata",variables:{metadataInput:{queryId="GUIDGUID-GUID-GUID-GUID-...bla-bla...}}query="query HistoryStatementPlanMetadata($metadataInput:...bla-bla...}
-- this never works reliably, returning 400 "Graph request not authentic" replies, even though the mimicked request is identical to the UI client's request and is correctly authenticated & authorized within the same browser session, with the same request headers [Cookie:workspace-url=adb-***.azuredatabricks.net;DBAUTH=dbws.1.***]

- exploring the [system.access.*/(audit,table_lineage)] and [system.query.*] tables with Notebook queries [SELECT ... JOIN ...]

What I've read about, though it looks not tightly related to historical traces:
- one may try "injected" Python code (spark.QueryExecutionListener + spark.DataFrame + observe()) as a wrapper/neighbor statement around a [CALL] statement, and immediately after the [CALL ...] invocation, at runtime, inspect (via DataFrame.explain() in the injected code) the object model of the returned [Plan*]s for the expected metrics information;
-- this additionally needs some "automatic code-injection/-binding mechanism" into each SQL statement of the code (generated by [CREATE PROCEDURE...]) of a function residing in UnityCatalog, injected/bound right before it is invoked by a Job TaskRun.
- one may also try to dig into the SparkUI logs, but for [Serverless] computes SparkUI is announced as not available.

So, in addition to the already known way of retrieving the originating parent query's ([CALL...] statement's) execution history/metrics programmatically from "Query History" (via [system.*] tables, or via the API):
 - could someone please advise a way to programmatically get the child Subquery/Sub-statement execution history/metrics records, in a native way using released/public ADBricks tools, or via some alternative?
A way that can be "put into code" for periodic invocation in a Job TaskRun, to incrementally gather copies of Subquery/sub-statement metrics records into an organization's own, let's say, [SQL/ETL-Log/History] flat table?

Thanks in advance


Louis_Frolio
Databricks Employee

Greetings @ADBricksExplore , 

Short answer: there isn’t a supported public API that returns the “Substatements / Subqueries” panel you see in the Query History or Profile UI. The GraphQL endpoints the UI relies on are internal and not stable or supported for automation.

What you can do—reliably and with supported surfaces—is reconstruct those child statements using the Query History system table or the Query History API / CLI, keyed off the parent statement’s session and run metadata.

What works today (supported and durable)

  • Use the system.query.history system table to pull every statement executed in the same job task run and/or session as the parent CALL. The parent CALL and each child SQL statement show up as separate rows with their own statement_id values.

  • system.query.history includes statement_id, session_id, and a rich query_source struct with job metadata, including job_task_run_id, which is the cleanest way to correlate children back to the parent.

  • You can jump from any system.query.history row to the UI profile using statement_id, but there’s no public API to programmatically fetch the operator-level DAG or profile JSON beyond manual UI export.

  • If you prefer APIs or the CLI, the Query History REST API or the databricks query-history list --include-metrics command works well; you just filter client-side by session, time window, or job_task_run_id.

  • Query History system table retention is currently published as 365 days, so plan periodic ingestion if you need longer-term analysis.

  • In serverless workspaces, Query History captures SQL and Python statements from notebooks and jobs as well.
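If you pull rows from the Query History API instead of the system table, the same grouping can be done client-side. A minimal Python sketch of that idea, assuming each result row is a dict shaped like a system.query.history record (the field names here are illustrative, not a guaranteed payload schema):

```python
# Hypothetical helper: group Query History rows by job_task_run_id so the
# parent CALL and its child statements land in the same bucket.
from collections import defaultdict

def group_by_task_run(rows):
    """Group statement rows by their job task run id.

    Rows are dicts with a nested query_source.job_info struct, mirroring
    system.query.history. Rows without job metadata collect under None.
    """
    groups = defaultdict(list)
    for row in rows:
        job_info = (row.get("query_source") or {}).get("job_info") or {}
        groups[job_info.get("job_task_run_id")].append(row)
    # Sort each group chronologically so the parent CALL comes first.
    for stmts in groups.values():
        stmts.sort(key=lambda r: r["start_time"])
    return dict(groups)
```

Within each group, the first row by start_time is typically the parent CALL; everything after it is a candidate child statement.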

How to pull child statements for a given CALL

  1. Look up the parent context

    Once you have the parent statement_id (from the UI or logs), pull its execution metadata:

 

SELECT
  statement_id,
  session_id,
  start_time,
  end_time,
  executed_by,
  compute,
  query_source
FROM system.query.history
WHERE statement_id = '<PARENT_STATEMENT_ID>';

session_id identifies the execution session.

query_source.job_info.job_task_run_id, when present, identifies the specific Job Task Run.

2A) Preferred: correlate by Job Task Run

If job_task_run_id is populated, use it to fetch everything executed by that task and exclude the parent:

WITH parent AS (
  SELECT
    query_source.job_info.job_task_run_id AS jtr_id,
    statement_id AS parent_id
  FROM system.query.history
  WHERE statement_id = '<PARENT_STATEMENT_ID>'
)
SELECT h.*
FROM system.query.history h
JOIN parent p
  ON h.query_source.job_info.job_task_run_id = p.jtr_id
WHERE h.statement_id <> p.parent_id
ORDER BY h.start_time;

This mirrors how the UI groups sub-statements under a CALL within a job task run and is the most robust approach.

2B) Fallback: session_id plus time window

If job_task_run_id isn’t available, fall back to session and time bounds:

WITH parent AS (
  SELECT
    session_id,
    start_time AS t0,
    end_time   AS t1,
    statement_id AS parent_id
  FROM system.query.history
  WHERE statement_id = '<PARENT_STATEMENT_ID>'
)
SELECT h.*
FROM system.query.history h
JOIN parent p
  ON h.session_id = p.session_id
WHERE h.start_time >= p.t0
  AND (p.t1 IS NULL OR h.end_time <= p.t1)
  AND h.statement_id <> p.parent_id
ORDER BY h.start_time;
  3. Make it incremental and durable

    For ongoing analysis, ingest Query History deltas into your own table using update_time or start_time:

 

INSERT INTO my_org.sql_substatement_log
SELECT
  h.account_id,
  h.workspace_id,
  h.statement_id,
  h.session_id,
  h.execution_status,
  h.statement_text,
  h.statement_type,
  h.start_time,
  h.end_time,
  h.execution_duration_ms,
  h.read_rows,
  h.read_bytes,
  h.produced_rows,
  h.written_bytes,
  h.query_source
FROM system.query.history h
WHERE h.update_time > TIMESTAMPADD(MINUTE, -15, CURRENT_TIMESTAMP);

Helpful realities to be aware of

  • There is no documented public API to fetch the UI’s “Substatements” block or the full operator-level DAG programmatically. Manual UI export is the only supported way to get that JSON today.

  • For lightweight pulls, the CLI is convenient:

    databricks query-history list --include-metrics --max-results 1000 …

  • You can filter Query History by compute, user, time range, and more via UI or API, then jump to the profile using statement_id when you need deeper inspection.

  • Profiles are not generated for results served entirely from cache. If you need a profile, rerun the query with a trivial change.

  • Make sure system tables are enabled in your account and metastore; otherwise system.query.history won’t be available.

 

End-to-end pattern (SQL)

Given a parent CALL statement_id, fetch its children and persist them:

WITH parent AS (
  SELECT
    statement_id AS parent_id,
    coalesce(query_source.job_info.job_task_run_id, '') AS jtr_id,
    session_id,
    start_time AS t0,
    end_time   AS t1
  FROM system.query.history
  WHERE statement_id = '<PARENT_STATEMENT_ID>'
),
children AS (
  SELECT h.*
  FROM system.query.history h
  JOIN parent p
    ON (
         (p.jtr_id <> '' AND h.query_source.job_info.job_task_run_id = p.jtr_id)
         OR
         (p.jtr_id = ''  AND h.session_id = p.session_id
                       AND h.start_time >= p.t0
                       AND (p.t1 IS NULL OR h.end_time <= p.t1))
       )
  WHERE h.statement_id <> p.parent_id
)
INSERT INTO my_org.sql_substatement_log
SELECT
  statement_id,
  session_id,
  statement_type,
  statement_text,
  execution_status,
  start_time,
  end_time,
  execution_duration_ms,
  read_rows,
  read_bytes,
  produced_rows,
  written_bytes
FROM children
ORDER BY start_time;

API and CLI equivalents

  • REST: /api/2.0/sql/history/queries, filter by time or warehouse, then correlate by session_id or job_task_run_id after identifying the parent.

  • CLI: databricks query-history list --include-metrics, followed by client-side filtering.

Why the GraphQL approach doesn’t work

The GraphQL endpoints visible in browser network traces are internal UI interfaces. They are not supported for automation and won’t work reliably outside the browser context. The supported surfaces are the UI, REST API, CLI, and system tables—with system.query.history being the recommended option for scalable, repeatable ingestion.

 

Hope this helps, Louis.

Hi @Louis_Frolio  ,
thank you for all the thorough details you provided, but the reality in our org's existing setup (UC, WHs/Computes, etc.) is not as bright as you describe 😕

I've prepared some PoC/test case to check your provided approaches:
- (never cached before) code units:
-- [PROCEDURE]+[JOB]+[TABLE] for ServerLess run
--- [TASK]+[QUERY] invoke the above
--- [TASK]+[FILE] invoke the above
- (never cached before) code units:
-- [PROCEDURE]+[JOB]+[TABLE] for WH-SQL-Pro run
--- [TASK]+[QUERY] invoke the above 
--- [TASK]+[FILE] invoke the above

After test runs of Jobs, we have:
- both [TASK]+[FILE] invocations went OK, but the traces left in the "programmatically accessible logs" are the same as they were for [TASK]+[NOTEBOOK] in my initial question:
-- only the initial [CALL...] statement invocations are recorded, no Subqueries/Sub-statements, even though I used all 3 suggested approaches for chaining/joining the parent "visible-QueryHistory-record" with the "invisibly-expected-Subqueries-records" - via [statement_id], via [session_id], via [query_source.job_info.*]; please see the screen:
image.png

- both [TASK]+[QUERY] failed, so I've created a post:
-- https://community.databricks.com/t5/data-engineering/code-unit-1-line-1-statement-stored-as-query-co...

- the subqueries/sub-statements actually got executed; both [TABLE] units were filled up, as seen on the screen:
image.png

-- JFYI (off-topic observation): if part of the code from the underlying [CALL...] is run manually from a NOTEBOOK on SERVERLESS, in comparison to the above runs from TASK[SQL\File] (ServerLess/WH-SQL-Pro), then the *_version() struct components (.dbr_* & .dbsql_*) produce different results:
image.png

So, @Louis_Frolio, if you could be so kind as to try such a PoC/test setup of code units in your ADBricks environment, to show resulting screens of how it "works today (supported and durable)" - please do; I would appreciate your help in understanding what differences in our org environment prevent your suggested approaches from working...
providing below a code template for fast "deployment":

/*
use catalog user_org_com_sandbox;use schema migration_test;
create or replace table srvless_nocache_table (time timestamp, col string, ver string);
create or replace table wrhouse_nocache_table (time timestamp, col string, ver string);
*/
/*
create or replace procedure user_org_com_sandbox.migration_test.srvless_nocache_procedure()
LANGUAGE SQL SQL SECURITY INVOKER
as begin
  declare _srvless_nocache_value string = 'srvless_nocache_record';
  use catalog user_org_com_sandbox;
  use schema migration_test;
  insert into srvless_nocache_table values (
    current_timestamp(),
    _srvless_nocache_value,
    array_join(array(current_version().dbr_version, current_version().dbsql_version,
                     current_version().u_build_hash, current_version().r_build_hash), '|', ''));
end;
*/
/*
create or replace procedure user_org_com_sandbox.migration_test.wrhouse_nocache_procedure()
LANGUAGE SQL SQL SECURITY INVOKER
as begin
  declare _wrhouse_nocache_value string = 'wrhouse_nocache_record';
  use catalog user_org_com_sandbox;
  use schema migration_test;
  insert into wrhouse_nocache_table values (
    current_timestamp(),
    _wrhouse_nocache_value,
    array_join(array(current_version().dbr_version, current_version().dbsql_version,
                     current_version().u_build_hash, current_version().r_build_hash), '|', ''));
end;
*/
/*
/Workspace/Users/user@org.com/Query-SrvLess-Log-Subquery-Test.dbquery.ipynb
CALL user_org_com_sandbox.migration_test.srvless_nocache_procedure();
/Workspace/Users/user@org.com/Query-WrHouse-Log-Subquery-Test.dbquery.ipynb
CALL user_org_com_sandbox.migration_test.wrhouse_nocache_procedure();
*/

/*
/Workspace/Users/user@org.com/File-SrvLess-Log-Subquery-Test.sql
CALL user_org_com_sandbox.migration_test.srvless_nocache_procedure();
/Workspace/Users/user@org.com/File-WrHouse-Log-Subquery-Test.sql
CALL user_org_com_sandbox.migration_test.wrhouse_nocache_procedure();
*/

/*Job_SrvLess_Log_Subquery_Test = {
name:"Job-SrvLess-Log-Subquery-Test",
tasks:[
  {task_key:"Task-SrvLess-Log-Subquery-Test",
  sql_task:{query:{
      query_id:"GuidGuid-Guid-Guid-Guid-GuidGuidGuid"}, --[Query-SrvLess-Log-Subquery-Test.dbquery.ipynb]
    warehouse_id:"GuidGuidGuidGuid"}},  --[wh serverless 2xs]
  {task_key:"Task-File-SrvLess-Log-Subquery-Test",
  sql_task:{file:{
      path:"/Workspace/Users/user@org.com/File-SrvLess-Log-Subquery-Test.sql",
      source:"WORKSPACE"},
    warehouse_id:"GuidGuidGuidGuid"}}  --[wh serverless 2xs]
]}*/
/*Job-WrHouse-Log-Subquery-Test = {
name:"Job-WrHouse-Log-Subquery-Test",
tasks:[
  {task_key:"Task-WrHouse-Log-Subquery-Test",
  sql_task:{query:{
      query_id:"GuidGuid-Guid-Guid-Guid-GuidGuidGuid"}, --[Query-WrHouse-Log-Subquery-Test.dbquery.ipynb]
    warehouse_id:"GuidGuidGuidGuid"}},  --[wh 2xs pro test]
  {task_key:"Task-File-WrHouse-Log-Subquery-Test",
  sql_task:{file:{
      path:"/Workspace/Users/user@org.com/File-WrHouse-Log-Subquery-Test.sql",
      source:"WORKSPACE"},
    warehouse_id:"GuidGuidGuidGuid"}}  --[wh 2xs pro test]
]}*/

 

SteveOstrowski
Databricks Employee

Hi @ADBricksExplore,

The sub-statements panel you see in the Query History UI (the child queries spawned by a CALL or multi-statement execution) is rendered from an internal GraphQL endpoint that is not part of the public Databricks API surface. That is why mimicking those browser requests returns inconsistent 400 errors. Those endpoints are undocumented, unsupported for automation, and subject to change without notice, so building on top of them is not recommended.

That said, there are supported approaches you can use to correlate child statement executions with their parent job context.

APPROACH 1: CORRELATE VIA job_task_run_id IN system.query.history

The system.query.history system table (Public Preview) contains a query_source struct with nested job_info fields: job_id, job_run_id, and job_task_run_id. When a CALL statement or stored procedure runs inside a job task, all the individual SQL statements it spawns share the same job_task_run_id. You can use this to group them:

SELECT
  statement_id,
  statement_type,
  statement_text,
  start_time,
  end_time,
  total_duration_ms,
  execution_duration_ms,
  read_rows,
  produced_rows,
  read_bytes,
  written_bytes,
  execution_status
FROM system.query.history
WHERE query_source.job_info.job_task_run_id = '<your_task_run_id>'
ORDER BY start_time;

This returns every statement that executed within that specific task run, effectively giving you the parent CALL plus all child statements it triggered. You can further distinguish the parent from the children by filtering on statement_type (the CALL will have statement_type = 'CALL' or similar, while children will be SELECT, INSERT, MERGE, etc.).

APPROACH 2: CORRELATE VIA session_id + TIME WINDOW

If your workload does not run inside a job (for example, interactive notebook or SQL editor execution), you can fall back to correlating by session_id and a time window:

SELECT
  statement_id,
  statement_type,
  statement_text,
  start_time,
  total_duration_ms,
  read_rows,
  produced_rows,
  execution_status
FROM system.query.history
WHERE session_id = '<your_session_id>'
  AND start_time BETWEEN '<parent_start>' AND '<parent_end>'
ORDER BY start_time;

The session_id ties all statements from the same Spark session together, and the time window scopes it to the execution window of the parent statement.

APPROACH 3: REST API WITH FILTERS

The Query History REST API (GET /api/2.0/sql/history/queries) also supports filtering by time range and returns metrics like duration, rows read/produced, bytes read/written, and execution status. You can also use the Databricks CLI:

databricks query-history list --include-metrics --filter-by '{"statuses":["FINISHED"]}'

However, note that this endpoint does not expose a parent_statement_id or explicit parent-child linking either. You would still need to correlate by time window or by matching the warehouse_id and session context.
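That client-side correlation can be sketched in a few lines of Python. This is a hypothetical helper, assuming each API row carries statement_id, session_id, start_time, and end_time fields (adjust the names to the actual payload your workspace returns):

```python
def correlate_by_window(rows, parent):
    """Return statements that share the parent's session and started
    within the parent's execution window, excluding the parent itself.

    rows/parent are dicts shaped like Query History API results; the
    field names are assumptions, not the guaranteed payload schema.
    """
    t0, t1 = parent["start_time"], parent.get("end_time")
    children = [
        r for r in rows
        if r["statement_id"] != parent["statement_id"]
        and r.get("session_id") == parent.get("session_id")
        # Keep only statements that start after the parent and, when the
        # parent's end time is known, finish before it.
        and r["start_time"] >= t0
        and (t1 is None or (r.get("end_time") is not None and r["end_time"] <= t1))
    ]
    return sorted(children, key=lambda r: r["start_time"])
```

The same window logic applies whether the rows come from the REST API, the CLI's JSON output, or an export of system.query.history.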

APPROACH 4: CUSTOM TAGGING WITH query_tags

If you control the code that runs inside your CALL statements, you can add custom query tags to each child query. This makes correlation straightforward regardless of execution context:

SET query_tags = map('parent_call', 'my_procedure_v1', 'run_id', '<unique_run_id>');
-- your child queries here

Then query:

SELECT *
FROM system.query.history
WHERE query_tags['parent_call'] = 'my_procedure_v1'
AND query_tags['run_id'] = '<unique_run_id>'

WHAT IS NOT AVAILABLE TODAY

- There is no parent_statement_id column in system.query.history that directly links a child statement to its parent CALL statement.
- The operator-level DAG (the query profile visualization) is not available via any public API. The GraphQL endpoints that power the UI are internal only.
- SparkUI logs are not accessible on serverless compute, as you noted.

For ongoing monitoring, the recommended pattern is to build an incremental ingestion pipeline that queries system.query.history using the update_time column as a watermark. This gives you a durable, queryable history of all statement executions with their job context and metrics.
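The watermark step of such a pipeline can be sketched in Python (illustrative only; in practice the filter would run against system.query.history and the watermark would be persisted in your own log table between job runs):

```python
def take_new_rows(rows, watermark):
    """Select rows not yet ingested (update_time strictly greater than
    the last watermark) and return them with the advanced watermark."""
    fresh = [r for r in rows if watermark is None or r["update_time"] > watermark]
    # Advance the watermark to the newest row seen; keep the old value
    # if nothing new arrived.
    new_watermark = max((r["update_time"] for r in fresh), default=watermark)
    return fresh, new_watermark
```

Using update_time rather than start_time as the watermark avoids missing long-running statements whose rows are updated after they finish.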

REFERENCES

- System tables: Query history

https://docs.databricks.com/aws/en/admin/system-tables/query-history

- Monitor query history with the UI

https://docs.databricks.com/aws/en/sql/admin/query-history

- Query History REST API

https://docs.databricks.com/api/workspace/queryhistory

- Query Profile

https://docs.databricks.com/aws/en/sql/admin/query-profile

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.