Re: Databricks workflows for APIs with different f...

emma_s · 04-08-2026

You're right that job clusters are the wrong fit here. Cold start time (even on serverless, where it is still 25-50s) makes any frequency under 5 minutes impractical when the cluster terminates between runs.

The simplest approach: all-purpose cluster + scheduling loop in a single notebook.

You already have a config view with API paths and frequencies, so you're most of the way there. The idea is to run one notebook on an always-on all-purpose cluster that ticks every 60 seconds and checks which APIs are due.

Step 1: Add a last_called column to your config table

You need to track when each API was last called so the scheduler knows what's due. A Delta table works well for this:

CREATE TABLE api_call_config (
  api_name STRING,
  api_path STRING,
  frequency_seconds INT,
  last_called TIMESTAMP
);

-- Populate from your existing view
INSERT INTO api_call_config
SELECT api_name, api_path, apicallfreq, NULL AS last_called
FROM your_existing_config_view;

Step 2: The scheduling notebook

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
import time
from datetime import datetime

def call_api(row):
    """Call a single API and return the result."""
    try:
        # requests does not raise on 4xx/5xx responses, so any HTTP response counts as success here
        response = requests.get(row["api_path"], timeout=30)
        return {"api_name": row["api_name"], "status": response.status_code, "success": True}
    except Exception as e:
        return {"api_name": row["api_name"], "status": str(e), "success": False}

while True:
    tick_start = time.time()

    # Find which APIs are due
    due_apis = spark.sql("""
        SELECT api_name, api_path, frequency_seconds
        FROM api_call_config
        WHERE last_called IS NULL
           OR TIMESTAMPDIFF(SECOND, last_called, current_timestamp()) >= frequency_seconds
    """).collect()

    if due_apis:
        # Call them in parallel; the with block waits for every call to finish
        with ThreadPoolExecutor(max_workers=10) as pool:
            futures = {pool.submit(call_api, row.asDict()): row for row in due_apis}
            successful = []
            for future in as_completed(futures):
                result = future.result()
                if result["success"]:
                    successful.append(result["api_name"])

        # Update last_called for successful calls
        if successful:
            names = ",".join([f"'{n}'" for n in successful])
            spark.sql(f"""
                UPDATE api_call_config
                SET last_called = current_timestamp()
                WHERE api_name IN ({names})
            """)

        print(f"[{datetime.now():%H:%M:%S}] Called {len(due_apis)} APIs, {len(successful)} succeeded")

    # Sleep the remainder of the minute
    elapsed = time.time() - tick_start
    time.sleep(max(0, 60 - elapsed))
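
If you want to sanity-check the due-check rule without a cluster, the WHERE clause can be mirrored in plain Python. This is only a sketch: is_due is a hypothetical helper (it is not part of the notebook above) and it assumes timezone-naive timestamps.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due(last_called: Optional[datetime], frequency_seconds: int, now: datetime) -> bool:
    """Mirror of the SQL filter: due if never called, or if at least
    frequency_seconds have elapsed since last_called."""
    if last_called is None:
        return True
    return (now - last_called).total_seconds() >= frequency_seconds


# Fixed "now" so the example is deterministic
now = datetime(2026, 4, 8, 12, 0, 0)
print(is_due(None, 60, now))                         # never called -> True
print(is_due(now - timedelta(seconds=59), 60, now))  # 59s ago -> False
print(is_due(now - timedelta(seconds=60), 60, now))  # 60s ago -> True
```

The boundary case matters: with >=, an API with a 60-second frequency is picked up on the tick where exactly 60 seconds have passed rather than one tick later.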

Step 3: Wrap it in a job for resilience

Run this notebook as a Databricks job with a continuous trigger on an all-purpose cluster. This gives you:
- Auto-restart if the notebook crashes
- Email/webhook alerts on failure
- Run history and logs

In the job config, set the compute to your all-purpose cluster (not a job cluster) and set the trigger to continuous with no pause.
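
If you'd rather define the job as code than click through the UI, the settings could look roughly like the sketch below. The field names follow the Jobs API 2.1 as I understand it, so check them against the current API reference before relying on this. The job name, task key, cluster ID, notebook path and email address are all placeholders I made up for illustration.

```python
import json


def continuous_job_settings(cluster_id: str, notebook_path: str, alert_email: str) -> dict:
    """Build a Jobs API-style settings payload for a continuous job
    that runs one notebook task on an existing all-purpose cluster."""
    return {
        "name": "api-scheduler",  # hypothetical job name
        # Continuous trigger, not paused
        "continuous": {"pause_status": "UNPAUSED"},
        # Alert on failed runs
        "email_notifications": {"on_failure": [alert_email]},
        "tasks": [
            {
                "task_key": "scheduler_loop",  # hypothetical task key
                # Point at the all-purpose cluster instead of defining a job cluster
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }


settings = continuous_job_settings(
    "<your-cluster-id>", "/path/to/scheduler_notebook", "you@example.com"
)
print(json.dumps(settings, indent=2))
```

This only builds and prints the payload; submitting it (REST call, CLI, or SDK) is left out on purpose.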

Why this works better than the other suggestions:

- No cold starts since the cluster stays running
- No separate dispatcher/queue since your config table already stores the frequencies and last_called handles the scheduling
- Easy to change frequencies by just updating the config table, no job redefinition needed
- Parallel execution via ThreadPoolExecutor so slow APIs don't block fast ones
- Adding new APIs is just an INSERT into the config table

One thing to watch: size the max_workers on the ThreadPoolExecutor based on how many APIs might be due at once. If all 70 fire on the same tick (e.g. at startup when last_called is NULL), you might want to cap it and batch them. Also consider what happens if an API call takes longer than 60 seconds. As written, the loop waits for every call in a tick to finish, so one slow API delays the next tick for everything else. If you change the loop to submit calls without waiting, the next tick will try to call the slow API again. A simple fix for that is to update last_called before the call (optimistic) rather than after, or add an in_progress flag.
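
Here is a rough sketch of the in_progress idea, kept in memory instead of as a column on the config table. That choice, and the names InFlightGuard and submit_due, are my assumptions for illustration. It only becomes relevant if calls are submitted without waiting for them to finish within the tick.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InFlightGuard:
    """Tracks which APIs are currently being called so a later tick can skip them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._names = set()

    def try_start(self, name):
        """Return True and mark the API as running, or False if it already is."""
        with self._lock:
            if name in self._names:
                return False
            self._names.add(name)
            return True

    def finish(self, name):
        with self._lock:
            self._names.discard(name)


def submit_due(pool, guard, due_rows, call):
    """Submit only rows that are not already in flight; return the names submitted."""
    submitted = []
    for row in due_rows:
        name = row["api_name"]
        if not guard.try_start(name):
            continue  # still running from an earlier tick
        future = pool.submit(call, row)
        # Clear the flag when the call ends, whether it succeeded or raised
        future.add_done_callback(lambda _f, n=name: guard.finish(n))
        submitted.append(name)
    return submitted


# Demo with a stand-in for a slow API: calls block until release is set
release = threading.Event()

def slow_call(row):
    release.wait(timeout=5)
    return {"api_name": row["api_name"], "success": True}

guard = InFlightGuard()
rows = [{"api_name": "a"}, {"api_name": "b"}]
with ThreadPoolExecutor(max_workers=4) as pool:
    first_tick = submit_due(pool, guard, rows, slow_call)
    second_tick = submit_due(pool, guard, rows, slow_call)  # both calls still running
    release.set()

print(first_tick, second_tick)
```

An in-memory set is lost if the notebook restarts, which is fine here because nothing is in flight after a restart. A flag column in the Delta table would need explicit cleanup for calls that were interrupted mid-flight.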

I tested the SQL frequency-check logic and the ThreadPoolExecutor pattern on a live Databricks workspace and both work as expected.
