cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Governance
Join discussions on data governance practices, compliance, and security within the Databricks Community. Exchange strategies and insights to ensure data integrity and regulatory compliance.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Direct access to Databricks query history database

WWoman
New Contributor III

Hello,

I would like to know if there is direct access to the Databricks query history tables. For compliance issues, I would like to be able to create reports  for something like: who has accessed a particular column in a table in the past 6 months. The query history web interface is quite limited.  I would ideally like wo use SQL to query the history table. Is this possible?

 

9 REPLIES 9

Kaniz_Fatma
Community Manager
Community Manager

Hi @WWomanTo address your compliance requirements and track who has accessed specific columns in a table over the past 6 months, you can follow these steps:

  1. Databricks Table Access Control (TAC):
  • Databricks provides an audit and control feature called Table Access Control (TAC). TAC allows you to monitor and manage access to tables within Databricks.
  • By enabling TAC, you can audit table access and view logs related to table usage. These logs are stored in the โ€œDatabricksTableAccessControlโ€ log table.
  • To set up TAC, follow these steps:
  1. Query History:
    • Databricks also maintains a query history that records executed SQL queries.
    • To access the query history:

Remember that these features provide valuable insights into user activity, but they should be used responsibly and in accordance with your organizationโ€™s policies.

WWoman
New Contributor III

I would like to be able to query the query history tables by running my own queries. I do not want to use the Query History interface supplied by Databricks;  I want to be able to create python scripts that access the underling tables/views for TAC and Query history.   From your response, it seems like this is not possible. Can you confirm that?

josh_melton
Contributor II
Contributor II
%pip install databricks-sdk
dbutils.library.restartPython()
warehouse_id = "bc1af43449227761"
hours_back_to_check = 2
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import sql
from pyspark.sql.types import StructType, StructField, StringType, LongType
import time

w = WorkspaceClient()
current_time_ms = int(round(time.time() * 1000))
start_time = current_time_ms - (3600000 * hours_back_to_check)
# Filtering options:
# https://databricks-sdk-py.readthedocs.io/en/latest/dbdataclasses/sql.html#databricks.sdk.service.sql.QueryFilter
query_filter = sql.QueryFilter(
    query_start_time_range=sql.TimeRange(start_time_ms=start_time, end_time_ms=current_time_ms),
    warehouse_ids=[warehouse_id]
)
query_ls = [query for query in w.query_history.list(filter_by=query_filter, include_metrics=True)]
schema = StructType([
    StructField("duration", LongType(), True),
    StructField("query_start_time_ms", LongType(), True),
    StructField("query_end_time_ms", LongType(), True),
    StructField("executed_as_user_name", StringType(), True),
    StructField("query_text", StringType(), True),
])

# Extract the relevant fields from the BaseRun objects
df = spark.createDataFrame([
    (query.duration, query.query_start_time_ms, query.query_end_time_ms, query.executed_as_user_name, query.query_text) 
    for query in query_ls
    ], schema)
df.display()
print([query.metrics for query in query_ls])

This provides query history. You may also be interested in System Tables - for compliance purposes check out Audit Logs:

https://docs.databricks.com/en/administration-guide/system-tables/audit-logs.html

For posterity, there is a query history system table which contains all of this information which is in preview at the time of me writing this. If you're reading this later than May 2024, please check the documentation for the query metrics table.

WWoman
New Contributor III

A quick question on this... (First of all thanks so much for the sample code!). I'm playing around with this and I would like to get the statement_type and status. I see that duration, query_start_time_ms and query_end_time_ms are int data type and defines as LongType(); executed_as_user_name and query_text are str datatype and defined as StringType. statement_type and status are listed as data types QueryStatementType and QueryStatus respectively. How would I define the StructType for these fields?

Sorry, missed this - try the system table and query it in SQL instead, much simpler than defining StructTypes!

WWoman
New Contributor III

The problem with that is that I do not have access to the raw system tables... I am writing a python script to load data into a table, so that we can run queries against it

ReiskaS
New Contributor II

Thanks @josh_melton ! I was wondering right now about this (one day after your post!) since I only found the UI and API in the documentation and was really puzzled that there is no equivalent in unity to the Snowflake query_history table.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group