cancel
Showing results for 
Search instead for 
Did you mean: 
Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.
cancel
Showing results for 
Search instead for 
Did you mean: 

Restrict certain queries on SQL Warehouse

sparkplug
Contributor

We manage a databricks workspace where majority of the users are working, we haven't switched over to Serverless yet because there is lot of observability and control that is lacking. So we are still on one classic SQL warehouse which is shared among everyone, how can I implement restrictions such as
- Rate limit user queries in a short amount of time

- Not allowing certain queries to be run - example SELECT 1 -  people do this to keep the warehouse running

- Add a global restriction on statement_timeout - we have set this but it's easily overridden by the user

- Cancelling duplicate queued queries  

1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @sparkplug,

If you are using a shared classic SQL warehouse, I would set expectations that the native controls are fairly limited. The publicly documented knobs are mainly around warehouse sizing and queuing behaviour, warehouse monitoring, and statement timeout, rather than per-user governance controls like rate limiting or SQL pattern blocking. You can see the documented warehouse behaviour here, the monitoring options here, and the STATEMENT_TIMEOUT parameter here.

More specifically, I do not see a documented built-in setting on classic SQL warehouses to rate limit users based on how many queries they submit in a short period, to block arbitrary read-only statements such as SELECT 1, or to automatically cancel duplicate queued queries before they run. What you can do natively is monitor queued and running queries, adjust warehouse size and cluster counts, and set a statement timeout. That timeout can also be set at the session level, so it is not a hard, admin-only ceiling in the way you are looking for.

If you need stricter controls, the pragmatic approach is usually to put a service or proxy in front of the warehouse and have that layer enforce per-user throttling, reject known keepalive patterns before submission, and detect duplicate queued requests. If you are submitting through the Statement Execution API, you can also cancel statements programmatically using the cancel endpoint once your policy determines they should be stopped.

In simple terms... on a shared classic warehouse, this is not something you can fully solve with warehouse settings alone. You can get part of the way there with timeout, monitoring, and right-sizing, but the more opinionated controls you listed generally need to be enforced outside the warehouse itself.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

View solution in original post

3 REPLIES 3

balajij8
Contributor III

You can implement a monitoring workflow that periodically scans query history for users issuing SET STATEMENT_TIMEOUT, SELECT Alive, Duplicate Query & Rate Limit statements. This can help identify users who are bypassing the globally configured default timeout allowing you to review their usage patterns and if necessary, remove access to the affected SQL Warehouse or move them to a separate warehouse with different governance controls. You can stop the long running / duplicate queries using the Execution_api interfaces in the workflow if required.

You can move to Serverless instead of Classic warehouses to avail cost & performance benefits.

  • Statement Timeout Offenders
SELECT
    executed_by,
    statement_id,
    start_time,
    statement_text
FROM system.query.history
WHERE lower(statement_text) RLIKE 'set\\s+statement_timeout'
ORDER BY start_time DESC;
  • Select 1 statement Offenders
SELECT 
  statement_id, 
  executed_by, 
  warehouse_id
FROM 
  system.query.history
WHERE 
  AND REGEXP_LIKE(TRIM(LOWER(statement_text)), '^select\s+1\s*;?$')
  •  Duplicate Query Offenders
WITH queued_queries AS (    
    SELECT 
      statement_id,
      executed_by,
      MD5(TRIM(LOWER(statement_text))) as query_hash,
      start_time
    FROM system.query.history
    WHERE execution_status = 'QUEUED'
)
SELECT 
  statement_id
FROM (
  SELECT 
    statement_id,
    ROW_NUMBER() OVER(PARTITION BY executed_by, query_hash ORDER BY start_time ASC) as occurrence
  FROM queued_queries
)
WHERE occurrence > 1
  • Rate Limit Users
SELECT executed_by, COUNT(*) as queries_last_10_mins
FROM system.query.history
WHERE start_time >= NOW() - INTERVAL 10 MINUTE
GROUP BY executed_by
HAVING queries_last_10_mins > 100;

sparkplug
Contributor

These information is already available in the warehouse monitoring view that's why I was able to tell that users are running SELECT 1 queries.

 

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @sparkplug,

If you are using a shared classic SQL warehouse, I would set expectations that the native controls are fairly limited. The publicly documented knobs are mainly around warehouse sizing and queuing behaviour, warehouse monitoring, and statement timeout, rather than per-user governance controls like rate limiting or SQL pattern blocking. You can see the documented warehouse behaviour here, the monitoring options here, and the STATEMENT_TIMEOUT parameter here.

More specifically, I do not see a documented built-in setting on classic SQL warehouses to rate limit users based on how many queries they submit in a short period, to block arbitrary read-only statements such as SELECT 1, or to automatically cancel duplicate queued queries before they run. What you can do natively is monitor queued and running queries, adjust warehouse size and cluster counts, and set a statement timeout. That timeout can also be set at the session level, so it is not a hard, admin-only ceiling in the way you are looking for.

If you need stricter controls, the pragmatic approach is usually to put a service or proxy in front of the warehouse and have that layer enforce per-user throttling, reject known keepalive patterns before submission, and detect duplicate queued requests. If you are submitting through the Statement Execution API, you can also cancel statements programmatically using the cancel endpoint once your policy determines they should be stopped.

In simple terms... on a shared classic warehouse, this is not something you can fully solve with warehouse settings alone. You can get part of the way there with timeout, monitoring, and right-sizing, but the more opinionated controls you listed generally need to be enforced outside the warehouse itself.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***