Hi Ellie,
What you're seeing is actually quite common. The same script can behave slightly differently when:
- run interactively in a notebook on a cluster, vs
- run as a job / via API trigger (or from a mobile wrapper hitting that API).
It's usually not "Databricks being random", but a mix of environment and lifecycle differences. A few typical causes:
- 1. Different cluster types & configs
In many setups:
Notebook runs → on an all-purpose (interactive) cluster
API / job runs → on a job cluster or a different pool
Those clusters can differ in:
- Runtime version (DBR), Spark / Scala / Python versions
- Node type / size, autoscaling configs
- Spark configs (shuffle, partitions, broadcast thresholds, timeouts, etc.)
- Installed libraries / init scripts
Even small config differences can change timing and sometimes behaviour (e.g. broadcast vs shuffle join strategy, shuffle partitioning, timeouts).
Check: compare the cluster JSON or the Spark UI → Environment tab for both runs; a sketch of doing this via the API follows below.
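If you want that comparison programmatically, here is a rough sketch against the Clusters REST API; the host, token and the two cluster IDs are placeholders you'd fill in for your workspace:

```python
import requests

HOST = "https://<workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                    # placeholder

def cluster_spec(cluster_id: str) -> dict:
    # Clusters API: returns the full cluster definition (DBR, node type, spark_conf, ...)
    r = requests.get(
        f"{HOST}/api/2.0/clusters/get",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"cluster_id": cluster_id},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()

interactive = cluster_spec("<interactive-cluster-id>")   # placeholder IDs
job = cluster_spec("<job-cluster-id>")

# Diff the fields that most often explain behavioural differences
for key in ("spark_version", "node_type_id", "num_workers", "autoscale", "spark_conf"):
    if interactive.get(key) != job.get(key):
        print(key, "differs:", interactive.get(key), "vs", job.get(key))
```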
- 2. Stateful notebook vs stateless job
Notebook sessions are stateful:
- You may have cached tables, temp views, broadcast variables already loaded
- Python / Scala variables defined in earlier cells
- Spark configs changed during experimentation
- Data cached in memory or on local disk
When you trigger via API, the job run usually starts with a clean, fresh context:
- No prior caches, temp views, or globals
- No "warm" JVM / Python runtime; everything has to spin up from scratch
That alone can easily explain different timings and sometimes corner-case behaviour if the script accidentally relies on prior state.
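As a concrete illustration (the view and table names are made up, and `spark` is the session Databricks provides in both notebooks and jobs): code that works interactively because an earlier cell created a temp view will fail in a fresh job context unless it rebuilds that state itself:

```python
from pyspark.sql.utils import AnalysisException

# "events_tmp" is a hypothetical temp view created by an earlier notebook cell.
try:
    df = spark.table("events_tmp")  # works in the interactive session
except AnalysisException:
    # A job/API run starts with a clean context, so the view doesn't exist yet;
    # rebuild it explicitly from a fully qualified source table (hypothetical name).
    df = spark.table("main.analytics.events")
    df.createOrReplaceTempView("events_tmp")
```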
- 3. Identity, permissions & environment variables
API-triggered runs often use:
- A service principal or technical user
- Potentially a different default catalog / schema, workspace, or set of permissions
- Different secret scopes / environment variables (set in the job config vs on the cluster)
If your script reads from:
- `dbutils.secrets.get(...)`
- `os.environ[...]`
- default database / catalog (without fully qualifying paths)
…those differences can influence output or error paths.
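A small sketch of making those dependencies explicit (the scope, key and variable names are placeholders; `spark` and `dbutils` are the objects Databricks provides in both notebooks and jobs):

```python
import os

# Which identity is this run actually executing as?
print("current_user:", spark.sql("SELECT current_user()").first()[0])

# Environment variables: read with an explicit default instead of assuming
# the cluster has them set (MY_FEATURE_FLAG is a placeholder name)
feature_flag = os.environ.get("MY_FEATURE_FLAG", "false")

# Secrets: this raises if the scope/key isn't visible to the current identity,
# so permission differences fail loudly instead of silently producing wrong values
api_key = dbutils.secrets.get(scope="my-scope", key="api-key")
```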
- 4. Cluster lifecycle & "warm-up"
Job clusters or on-demand clusters go through:
- Cold start: spin up nodes, start Spark, load libraries
- JIT warm-up: JVM + Python processes optimising code paths during execution
The first run (especially via API) can be noticeably slower than:
- a long-running interactive cluster that's already "hot"
- a notebook where you've already run some heavy cells
If you're measuring timing precisely, cold vs warm states will show up.
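One simple way to see this in your numbers (a sketch, assuming the usual Databricks `spark` session) is to timestamp the start of the script and the first Spark action separately from the real workload:

```python
import time
from datetime import datetime, timezone

# The gap between the job's submit time (visible in the Jobs UI) and this
# timestamp is mostly cluster spin-up + library installation on a cold cluster.
print("script start:", datetime.now(timezone.utc).isoformat())

# The first Spark action additionally pays JVM / Python warm-up costs.
t0 = time.time()
spark.range(10).count()
print(f"first trivial action took {time.time() - t0:.2f}s")
```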
- 5. API gateway / orchestration differences
When you go through a mobile app → API gateway → Databricks Jobs API chain,
you also introduce:
- HTTP timeouts / retries
- Slightly different error handling
- Extra latency before the job even starts
This doesn't usually change results, but it does change timings; if downstream systems have strict time budgets or expect logs in a specific order, you may notice differences.
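If the API path is yours to control, it also helps to be explicit about timeouts and retries rather than relying on whatever the gateway defaults to. A rough sketch using the Jobs API `run-now` endpoint (host, token and job ID are placeholders; the idempotency token is there so a retried request doesn't launch a duplicate run):

```python
import uuid
import requests
from requests.adapters import HTTPAdapter, Retry

HOST = "https://<workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<token>"                                    # placeholder
JOB_ID = 123                                         # placeholder

session = requests.Session()
# Retry transient failures; POST has to be explicitly allowed for retries
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 503],
                allowed_methods=["POST"])
session.mount("https://", HTTPAdapter(max_retries=retries))

resp = session.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID, "idempotency_token": str(uuid.uuid4())},
    timeout=30,  # explicit client-side timeout instead of gateway defaults
)
resp.raise_for_status()
print("run_id:", resp.json()["run_id"])
```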
How I'd debug / stabilise this
- Log the environment at the start of the script (for both notebook & API runs):
- `spark.version`, DBR runtime
- `spark.conf.getAll` (Scala) / `spark.sparkContext.getConf().getAll()` (Python), or at least the key configs
- `os.environ` subset (env vars your script uses)
- current user / service principal (`spark.sql("SELECT current_user()")`)
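For example, a minimal logging preamble in Python (assuming the Databricks-provided `spark` session; the config prefixes and env var names are placeholders to adapt to what your script actually uses):

```python
import os
import json

print("spark.version:", spark.version)

# DBR version as reported by the cluster usage tags (falls back to "n/a" if unset)
print("DBR:", spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "n/a"))

# Key Spark configs (SparkConf.getAll is available from Python via the SparkContext)
for k, v in sorted(spark.sparkContext.getConf().getAll()):
    if k.startswith(("spark.sql.shuffle", "spark.sql.autoBroadcastJoinThreshold")):
        print(k, "=", v)

# Only the env vars your script actually reads (placeholder names)
print(json.dumps({k: os.environ.get(k) for k in ("ENV", "REGION")}, indent=2))

# Identity the run executes as
print("current_user:", spark.sql("SELECT current_user()").first()[0])
```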
- Make the code stateless & parameterised:
- Don't rely on notebook globals or earlier cells
- Don't rely on "whatever catalog/schema I happen to be in" → fully qualify tables & paths (see the parameterisation sketch below)
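One common pattern (all names here are illustrative): drive everything through job parameters / widgets and fully qualified names, so both execution paths feed the script identically:

```python
# Job parameters passed via the API and notebook widgets share the same mechanism,
# so interactive and API-triggered runs are configured the same way.
dbutils.widgets.text("catalog", "main")
dbutils.widgets.text("schema", "analytics")
dbutils.widgets.text("run_date", "2024-01-01")

catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")
run_date = dbutils.widgets.get("run_date")

# Fully qualified three-level name: no dependence on the session's default catalog/schema
df = spark.table(f"{catalog}.{schema}.events").where(f"event_date = DATE'{run_date}'")
```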
- Use the same cluster config for both paths:
- Either run the notebook on the same job cluster
- Or configure the job to use the same all-purpose cluster, just for testing
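For the first option, the job's `new_cluster` block can simply mirror the interactive cluster's definition; the values below are placeholders, the keys are the standard cluster-spec fields:

```python
# A job-cluster spec that mirrors the interactive cluster, used when creating
# or updating the job (via the Jobs API, Terraform, or the UI's JSON editor).
new_cluster = {
    "spark_version": "13.3.x-scala2.12",  # same DBR as the interactive cluster
    "node_type_id": "Standard_DS3_v2",    # same node type
    "num_workers": 2,                     # same size (or copy the autoscale block)
    "spark_conf": {
        "spark.sql.shuffle.partitions": "200",  # copy any non-default configs
    },
}
```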
- Warm-up if necessary:
For very time-sensitive workflows, consider a small "warm-up" call before the real workload (or keep a cluster/pool warm).
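The warm-up can be as simple as a trivial Spark action whose result you discard, for example:

```python
import time

# Trivial action: forces executors to start, libraries to load, and common
# code paths to be JIT-compiled before the real workload is measured.
t0 = time.time()
spark.range(1_000_000).selectExpr("sum(id)").collect()
print(f"warm-up took {time.time() - t0:.1f}s")
```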
Is this Databricks-specific or universal?
What you observed with other platforms and mobile script executors is spot on:
this is largely a universal "environment + lifecycle" behaviour, not unique to Databricks.
Databricks just makes the differences more visible because:
- interactive clusters are long-lived and stateful
- job / API runs are short-lived and stateless by design
If you can share a minimal example (e.g. same script, cluster configs, and rough timing logs), the community can help narrow down exactly which of the above is biting you most.
Hope this helps clarify what to look at!