11-03-2025 12:43 PM
Greetings @jigar191089, I did some digging and here are some ideas to think about.
This smells like a shared-state/import-path issue on an interactive cluster under concurrency.
What likely happened
- Your notebook imports Python modules from /dbfs by adding it to sys.path and then importing. Databricks advises against storing or importing sensitive/production code from DBFS root (dbfs:/ and the /dbfs FUSE mount) and recommends workspace files, wheels installed as libraries, or Unity Catalog Volumes instead, to avoid reliability and governance pitfalls. Moving code off DBFS reduces odd behavior under concurrency, including import stalls.
- Because your Azure Data Factory activity points at an existing interactive cluster, concurrent runs were attaching to the same compute. Each run gets its own notebook session, but they still share the cluster's driver/executors and any compute-scoped state, so contention or environment mutation at the cluster level can surface as hangs during import (especially when multiple tasks import from the same location on /dbfs at once). For isolation, using a per-run job cluster is preferred; separate clusters avoid cross-run interference entirely.
- After the cluster restart, the Python processes and file-system caches were reset, so subsequent imports worked, which is consistent with a transient interpreter/path/import cache issue. Resetting the notebook session or cluster is the documented way to clear Python state when things get wedged.
Will concurrent runs share environment state on an interactive cluster?
Short answer: partially.
- Each notebook run has its own session. If you use notebook-scoped libraries via %pip, those installs are isolated to that notebook session and do not affect other notebooks, even on the same cluster.
- Anything compute-scoped (cluster libraries, init scripts, Spark configuration, cached data, etc.) is shared by all notebooks/jobs on that cluster. Installing a library at the cluster level makes it available to all attached notebooks and jobs, which can cause interference across concurrent runs if versions or side effects conflict.
- Databricks recommends running with modern access modes (standard/shared) and avoiding legacy "no-isolation shared." "Standard access mode" is the recommended default for most workloads.
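One way to see what each concurrent run is actually using is to log a small environment fingerprint at the top of the notebook and compare it across runs. This is a sketch using only the Python standard library; the function name is hypothetical and "my_lib" is a placeholder distribution name, not something from your setup.

```python
import sys
from importlib import metadata


def environment_fingerprint(package_names):
    """Collect interpreter and package details for the current Python session."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python_executable": sys.executable,
        "python_prefix": sys.prefix,
        "packages": versions,
    }


# "my_lib" is a placeholder; list the distributions your runs depend on.
print(environment_fingerprint(["my_lib"]))
```

If two overlapping runs report different versions for the same package, that points at session-level installs; identical output suggests they are both relying on the compute-scoped libraries.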
Concrete fixes and best practices
Adopt as many of these as you can; the first two are the big ones.
- Move code off /dbfs:
  - Store Python modules as workspace files under /Workspace (or in a Git folder/Repos) and import them normally; or package as a wheel (.whl) and install as a library. Both patterns are recommended over importing from DBFS paths.
- Prefer job clusters for ADF-triggered runs (or Databricks Workflows), especially when runs can overlap. A new cluster per run eliminates shared-state interference by construction.
- If you must share an interactive cluster:
  - Use %pip in the first cell to create a notebook-scoped environment for each run; this limits cross-run impacts on the same cluster.
- Keep libraries that need to be shared stable as compute-scoped libraries and avoid mutating them during job execution to reduce race conditions.
- Avoid star imports (from utils import *). Import explicit symbols so that name collisions and unintended overrides are easy to spot and each dependency is traceable.
- If you deploy code updates, start a new session (or restart the cluster) to clear Python/import caches before the next run.
- Use Workspace Files or Volumes paths instead of /dbfs when importing:
  - Keep your package alongside the notebooks under /Workspace and import it with normal Python module paths; Databricks recommends workspace files for source code and Volumes for larger artifacts.
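On the star-import point, one concrete risk is silent name shadowing: whichever module is star-imported last wins. The sketch below uses two throwaway in-memory modules (demo_utils and demo_helpers are made-up names, not your modules) so it can run anywhere.

```python
import sys
import types


def make_module(name, **attrs):
    """Register a small in-memory module so it can be imported by name."""
    module = types.ModuleType(name)
    module.__dict__.update(attrs)
    sys.modules[name] = module
    return module


# Two hypothetical modules that both happen to define `parse`.
make_module("demo_utils", parse=lambda s: s.strip(), VERSION="1")
make_module("demo_helpers", parse=lambda s: s.upper())

# Star imports: the second import silently replaces `parse` from the first.
star_ns = {}
exec("from demo_utils import *\nfrom demo_helpers import *", star_ns)
star_result = star_ns["parse"](" abc ")

# Explicit import: it is obvious which `parse` is in use.
explicit_ns = {}
exec("from demo_utils import parse", explicit_ns)
explicit_result = explicit_ns["parse"](" abc ")

print(repr(star_result), repr(explicit_result))
```

Note that either import style still executes the whole module on first import; the gain from explicit imports is a predictable namespace.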
Minimal migration example
Replace:
```python
import sys
sys.path.insert(0, '/dbfs/DataEnabling/Pyspark')
from utils import *
from trigger_processing_framework.insert_trigger import InsertTrigger
```
With one of:
1) Workspace files (recommended during development):
- Place your Python package as files under /Workspace/… (optionally in a Git folder/Repos).
- Use normal import package.module without touching sys.path.
2) Wheel library (recommended for production):
- Build a wheel and install it either as a compute-scoped library or with %pip install /Volumes/.../my_lib.whl for a notebook-scoped env, then import normally.

If you keep using a shared interactive cluster in the short term, try running a single ADF-triggered run at a time to validate the import path change first, then increase concurrency.
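For that validation run, a pre-flight cell could check where each module would resolve from before the real imports happen. This is a minimal sketch, assuming the module names from your snippet and that anything still under /dbfs/ counts as a problem; the helper names are hypothetical.

```python
import importlib.util


def resolve_origin(module_name):
    """Return the file a module would be imported from, or None if unresolved.

    Note: for a dotted name, find_spec imports the parent package first.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


def check_not_on_dbfs(module_names, banned_prefix="/dbfs/"):
    """Map module name -> problem description; an empty dict means all clear."""
    problems = {}
    for name in module_names:
        origin = resolve_origin(name)
        if origin is None:
            problems[name] = "not found"
        elif origin.startswith(banned_prefix):
            problems[name] = f"still resolves from {origin}"
    return problems


# Module names taken from the snippet above; adjust to your package layout.
print(check_not_on_dbfs(["utils", "trigger_processing_framework.insert_trigger"]))
```

An empty result on the single validation run is a reasonable signal that the path change took effect before you raise concurrency. Namespace packages have no origin file, so this sketch reports them as "not found".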
Hope this helps, Louis.