
Lakehouse Monitoring API – Timeout Error When Enabling Monitors for All Tables in a Catalog

esalazar
New Contributor II

I'm using the Databricks Lakehouse Monitoring API to enable monitoring across every table in a catalog. I wrote a script that loops through all schemas and tables and calls the `create_monitor` API for each one. However, when running the script from a notebook, I consistently get a `Timed out after 0:05:00` error.

It seems like enabling monitors sequentially for a large number of tables is exceeding the execution timeout, especially if each API call takes a few seconds.

Questions:

  • Is there a recommended way to avoid this timeout when enabling monitors at scale?

  • Should I implement parallelism or batching in the script?

  • Is there a way to increase the execution timeout in a Databricks notebook?

Any guidance or best practices would be appreciated!
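
For reference, here is a minimal sketch of the kind of sequential loop described above, assuming the Databricks Python SDK (recent versions expose monitors as `quality_monitors`; older versions used `lakehouse_monitors`) and a simple snapshot profile. The catalog name and assets directory are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()
catalog = "my_catalog"  # placeholder catalog name

# Sequentially enable a snapshot monitor on every table in the catalog.
# Each create() call can take several seconds, so a large catalog will
# easily exceed a notebook's execution timeout when run this way.
for schema in w.schemas.list(catalog_name=catalog):
    if schema.name == "information_schema":  # skip system schemas
        continue
    for table in w.tables.list(catalog_name=catalog, schema_name=schema.name):
        w.quality_monitors.create(
            table_name=table.full_name,
            assets_dir=f"/Shared/lakehouse_monitoring/{table.full_name}",  # placeholder
            output_schema_name=f"{catalog}.{schema.name}",
            snapshot=MonitorSnapshot(),
        )
```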

1 ACCEPTED SOLUTION


BigRoux
Databricks Employee
Key Recommendations for Enabling Lakehouse Monitoring at Scale Without Notebook Timeouts
1. Use Parallelism and Batching
- Avoid sequential API calls for many tables; they are slow and will likely hit execution limits.
- Implement batching and use parallel threads or asynchronous calls (such as with Python's ThreadPoolExecutor) to enable multiple monitors at once (see the retry-and-concurrency sketch after this list).
- Begin with a modest number of parallel tasks (e.g., 5–10) to avoid API rate limits and Databricks backend overload.
2. Prefer Jobs or External Automation Over Notebooks
- Databricks notebooks have execution timeouts and are not intended for large-scale bulk operations.
- For many tables, automate the process using a Databricks Job or an external orchestrator (such as Terraform or custom REST API scripts) for longer or more robust execution.
3. Adjust Execution Timeout If Needed
- If using notebooks, increase the timeout if supported by your compute environment:

  ```python
  spark.conf.set("spark.databricks.execution.timeout", "18000")  # seconds
  ```

- Not all compute types honor this setting; serverless and job clusters are most likely to support overrides.
- For SQL Warehouses, adjust the STATEMENT_TIMEOUT parameter instead.
4. Optimize Monitor Creation
- Enable Change Data Feed (CDF) on tables and use recommended profiles for more efficient, scalable monitoring.
- Use the latest Databricks SDK (>= 0.28.0) for improved API handling.
5. Handle API Rate Limits and Errors
- Watch for errors like PENDING_UPDATE or rate-limit responses and implement retry logic when they occur (see the sketch after this list).
- Increase concurrency cautiously, watching for API throttling or backend queuing.
6. There Is No "Batch Create" API
- Each table must be enabled via an individual API call; batching and parallelizing must be done client-side.
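
As referenced in items 1 and 5 above, here is a minimal client-side sketch combining a bounded thread pool with retries and exponential backoff. `create_monitor_for_table` and `list_of_tables` are placeholders (the same names used in the example further down), and the strings used to detect transient errors are illustrative, not exact API error codes:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

list_of_tables = ["main.sales.orders"]  # placeholder list of full table names
MAX_RETRIES = 5

def create_monitor_for_table(table_name):
    # Placeholder: your quality_monitors.create(...) call goes here.
    pass

def create_with_retry(table_name):
    """Create one monitor, retrying transient failures with exponential backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return create_monitor_for_table(table_name)
        except Exception as e:
            transient = any(s in str(e) for s in ("PENDING_UPDATE", "429", "RESOURCE_EXHAUSTED"))
            if transient and attempt < MAX_RETRIES - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
                continue
            raise

# Keep concurrency modest (5-10 workers) to stay under API rate limits.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(create_with_retry, t): t for t in list_of_tables}
    for future in as_completed(futures):
        table = futures[future]
        try:
            future.result()
            print(f"Enabled monitor for {table}")
        except Exception as e:
            print(f"Failed for {table}: {e}")
```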

Summary Table

| Problem Area | Solution/Best Practice |
| --- | --- |
| Timeouts (notebook/script) | Increase execution timeout if possible |
| Bulk/slow enablement | Use batching & parallelism (reasonable limits) |
| Operational scale | Run as a Job, or orchestrate externally |
| API throttling/errors | Implement retry/error handling |
| Efficiency | Enable CDF, use proper profiles, latest SDK |

Example (Parallel Creation):

```python
from concurrent.futures import ThreadPoolExecutor

def create_monitor_for_table(table_info):
    # API call logic here
    pass

with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(create_monitor_for_table, list_of_tables)
```

_Adjust `max_workers` based on observed performance and API constraints._

In summary:
To avoid timeouts, use parallelism, increase execution timeouts when possible, handle API limits/errors gracefully, and prefer running long bulk operations via jobs or automation frameworks rather than interactive notebooks. There’s no built-in batch operation; you must implement batching/parallelism on the client side.
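
For the jobs route, one possible sketch using the Databricks Python SDK is below; the job name, notebook path, and cluster ID are placeholders, and `timeout_seconds=0` disables the job-level timeout so a long bulk run can complete:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Create a job that runs the enablement notebook on an existing cluster.
job = w.jobs.create(
    name="enable-lakehouse-monitors",  # placeholder job name
    timeout_seconds=0,                 # no job-level timeout
    tasks=[
        jobs.Task(
            task_key="enable_monitors",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/me/enable_monitors"  # placeholder path
            ),
            existing_cluster_id="<cluster-id>",  # placeholder cluster
        )
    ],
)

# Trigger a run; the bulk enablement now executes outside the notebook session.
w.jobs.run_now(job_id=job.job_id)
print(f"Triggered job {job.job_id}")
```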
 
Hope this helps, Lou.


3 REPLIES


bhanu_gautam
Valued Contributor II

@BigRoux, thank you, this is very well explained and really helpful.

Regards
Bhanu Gautam

Kudos are appreciated

esalazar
New Contributor II

Thank you, this is really helpful!
