
Lakehouse Monitoring API – Timeout Error When Enabling Monitors for All Tables in a Catalog

esalazar
New Contributor II

I'm using the Databricks Lakehouse Monitoring API to enable monitoring across every table in a catalog. I wrote a script that loops through all schemas and tables and calls the `create_monitor` API for each one. However, when running the script from a notebook, I consistently get a `Timed out after 0:05:00` error.

It seems like enabling monitors sequentially for a large number of tables is exceeding the execution timeout, especially if each API call takes a few seconds.

Questions:

  • Is there a recommended way to avoid this timeout when enabling monitors at scale?

  • Should I implement parallelism or batching in the script?

  • Is there a way to increase the execution timeout in a Databricks notebook?

Any guidance or best practices would be appreciated!
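
For reference, here is a minimal sketch of the kind of sequential loop described above, assuming the Databricks Python SDK (recent versions expose monitors as `quality_monitors`; older versions used `lakehouse_monitors`) and a simple snapshot profile. The catalog name and assets directory are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()
catalog = "my_catalog"  # placeholder catalog name

# Sequentially enable a snapshot monitor on every table in the catalog.
# Each create() call can take several seconds, so a large catalog will
# easily exceed a notebook's execution timeout when run this way.
for schema in w.schemas.list(catalog_name=catalog):
    if schema.name == "information_schema":  # skip system schemas
        continue
    for table in w.tables.list(catalog_name=catalog, schema_name=schema.name):
        w.quality_monitors.create(
            table_name=table.full_name,
            assets_dir=f"/Shared/lakehouse_monitoring/{table.full_name}",  # placeholder
            output_schema_name=f"{catalog}.{schema.name}",
            snapshot=MonitorSnapshot(),
        )
```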

1 ACCEPTED SOLUTION


BigRoux
Databricks Employee
Key Recommendations for Enabling Lakehouse Monitoring at Scale Without Notebook Timeouts
1. Use Parallelism and Batching
- Avoid sequential API calls for many tables; they are slow and will likely hit execution limits.
- Implement batching and use parallel threads or asynchronous calls (such as with Python's ThreadPoolExecutor) to enable multiple monitors at once (see the retry-and-concurrency sketch after this list).
- Begin with a modest number of parallel tasks (e.g., 5–10) to avoid API rate limits and Databricks backend overload.
2. Prefer Jobs or External Automation Over Notebooks
- Databricks notebooks have execution timeouts and are not intended for large-scale bulk operations.
- For many tables, automate the process using a Databricks Job or an external orchestrator (such as Terraform or custom REST API scripts) for longer or more robust execution.
3. Adjust Execution Timeout If Needed
- If using notebooks, increase the timeout if supported by your compute environment:

  ```python
  spark.conf.set("spark.databricks.execution.timeout", "18000")  # seconds
  ```

- Not all compute types honor this setting; serverless and job clusters are most likely to support overrides.
- For SQL Warehouses, adjust the STATEMENT_TIMEOUT parameter instead.
4. Optimize Monitor Creation
- Enable Change Data Feed (CDF) on tables and use recommended profiles for more efficient, scalable monitoring.
- Use the latest Databricks SDK (>= 0.28.0) for improved API handling.
5. Handle API Rate Limits and Errors
- Watch for errors like PENDING_UPDATE or rate-limit responses and implement retry logic when they occur (see the sketch after this list).
- Increase concurrency cautiously, watching for API throttling or backend queuing.
6. There Is No "Batch Create" API
- Each table must be enabled via an individual API call; batching and parallelizing must be done client-side.
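
As referenced in items 1 and 5 above, here is a minimal client-side sketch combining a bounded thread pool with retries and exponential backoff. `create_monitor_for_table` and `list_of_tables` are placeholders (the same names used in the example further down), and the strings used to detect transient errors are illustrative, not exact API error codes:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

list_of_tables = ["main.sales.orders"]  # placeholder list of full table names
MAX_RETRIES = 5

def create_monitor_for_table(table_name):
    # Placeholder: your quality_monitors.create(...) call goes here.
    pass

def create_with_retry(table_name):
    """Create one monitor, retrying transient failures with exponential backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            return create_monitor_for_table(table_name)
        except Exception as e:
            transient = any(s in str(e) for s in ("PENDING_UPDATE", "429", "RESOURCE_EXHAUSTED"))
            if transient and attempt < MAX_RETRIES - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
                continue
            raise

# Keep concurrency modest (5-10 workers) to stay under API rate limits.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(create_with_retry, t): t for t in list_of_tables}
    for future in as_completed(futures):
        table = futures[future]
        try:
            future.result()
            print(f"Enabled monitor for {table}")
        except Exception as e:
            print(f"Failed for {table}: {e}")
```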

Summary Table

| Problem Area | Solution/Best Practice |
| --- | --- |
| Timeouts (notebook/script) | Increase execution timeout if possible |
| Bulk/slow enablement | Use batching & parallelism (reasonable limits) |
| Operational scale | Run as a Job, or orchestrate externally |
| API throttling/errors | Implement retry/error handling |
| Efficiency | Enable CDF, use proper profiles, latest SDK |

Example (Parallel Creation):

```python
from concurrent.futures import ThreadPoolExecutor

def create_monitor_for_table(table_info):
    # API call logic here
    pass

with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(create_monitor_for_table, list_of_tables)
```

_Adjust `max_workers` based on observed performance and API constraints._

In summary:
To avoid timeouts, use parallelism, increase execution timeouts when possible, handle API limits/errors gracefully, and prefer running long bulk operations via jobs or automation frameworks rather than interactive notebooks. There’s no built-in batch operation; you must implement batching/parallelism on the client side.
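
For the jobs route, one possible sketch using the Databricks Python SDK is below; the job name, notebook path, and cluster ID are placeholders, and `timeout_seconds=0` disables the job-level timeout so a long bulk run can complete:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Create a job that runs the enablement notebook on an existing cluster.
job = w.jobs.create(
    name="enable-lakehouse-monitors",  # placeholder job name
    timeout_seconds=0,                 # no job-level timeout
    tasks=[
        jobs.Task(
            task_key="enable_monitors",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/me/enable_monitors"  # placeholder path
            ),
            existing_cluster_id="<cluster-id>",  # placeholder cluster
        )
    ],
)

# Trigger a run; the bulk enablement now executes outside the notebook session.
w.jobs.run_now(job_id=job.job_id)
print(f"Triggered job {job.job_id}")
```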
 
Hope this helps, Lou.


3 REPLIES


bhanu_gautam
Valued Contributor II

@BigRoux, thank you, this is very well explained and really helpful.

Regards
Bhanu Gautam

Kudos are appreciated

esalazar
New Contributor II

Thank you, this is really helpful!
