Data Engineering

Multiple concurrent jobs using interactive cluster

jigar191089
New Contributor III

Hi All,

I have a notebook in Databricks. This notebook is executed from an Azure Data Factory pipeline through a Databricks notebook activity whose linked service is connected to an interactive cluster.

When multiple concurrent runs of this pipeline are created, I am observing that the notebook job gets stuck in what looks like an endless loop.

The command where this happens is below:

# standard library imports
import inspect
import json
import sys

# make the shared module directory on DBFS importable
sys.path.insert(0, '/dbfs/DataEnabling/Pyspark')

# import user-defined libraries
from utils import *
from trigger_processing_framework.insert_trigger import InsertTrigger
from trigger_processing_framework.update_trigger import UpdateTrigger

Later, after restarting the cluster and retrying the same scenario, the jobs complete as expected.

Any idea what may have gone wrong in the first attempt?

Also, when multiple runs are created for the same notebook, will each run have its own environment state while running, or will concurrent jobs interfere with each other since I am using an interactive cluster?

 

 

12 REPLIES

nikhilj0421
Databricks Employee

Hi @jigar191089, are all the jobs writing to the same location? What is the DBR version you're using? Do you notice any load on the cluster?

Each run will use its own environment; runs won't interfere with each other.

jigar191089
New Contributor III

@nikhilj0421, the jobs are writing to different locations. The DBR version is 14.3 LTS ML. I am not sure how to check the load, but as you can see from the code above, it is just import statements.

@jigar191089, you can monitor the metrics section of the cluster to check the load on the cluster.
Also, check the cluster's event log: if the driver is under memory pressure, you will see "driver is up but not responsive due to GC" messages there.

Can you share the stdout.txt and stderr.txt files from when the job gets stuck?

jigar191089
New Contributor III

Hi @nikhilj0421, I am not able to attach a .txt file here. These are the screenshots:

jigar191089_0-1747987870677.png

 

jigar191089_1-1747987906107.png
jigar191089_2-1747987944695.png

 

 

nikhilj0421
Databricks Employee

The event logs confirm that it isn't because the driver is under memory pressure.

Checking the stdout will be very helpful here. Could you please share a screenshot of what you see after Ctrl + C does not work?

Also, are you seeing any library issues in your stderr or stdout?
 

jigar191089_0-1748260472106.png


No observations in stderr and stdout. How can I share the log, since only attachments of .jpg, .gif, .png, and .pdf are allowed?

Do you see any attachment option?

nikhilj0421
Databricks Employee

What libraries are you installing in your cluster?

There are a few custom-built libraries, and a few libraries available on PyPI.

nikhilj0421
Databricks Employee

Can you share screenshots of the PyPI libraries installed on your cluster?

jigar191089
New Contributor III

jigar191089_0-1748265306139.png

jigar191089_1-1748265321800.png



Do note: these libraries are installed using the cluster init script.

Louis_Frolio
Databricks Employee

Greetings @jigar191089 , I did some digging and here are some ideas to think about.

 

This smells like a shared-state/import-path issue on an interactive cluster under concurrency.
 

What likely happened

  • Your notebook imports Python modules from /dbfs by adding it to sys.path and then doing imports. DBFS root (dbfs:/ and the /dbfs FUSE mount) isn’t recommended for storing or importing production code; instead Databricks recommends keeping source code as workspace files (or packaging as wheels and installing as libraries) to avoid reliability and governance pitfalls. Moving code off DBFS reduces odd behaviors under concurrency, including import stalls. Databricks specifically advises against using DBFS root for sensitive/production code and recommends workspace files or Unity Catalog Volumes instead.
  • Because your Azure Data Factory activity points at an existing interactive cluster, concurrent runs were attaching to the same compute. Each run gets its own notebook session, but they still share the cluster’s driver/executors and any compute‑scoped state, so contention or environment mutation at the cluster level can surface as hangs during import (especially when multiple tasks import from the same location on /dbfs at once). For isolation, using a per‑run job cluster is preferred; separate clusters avoid cross‑run interference entirely.
  • After the cluster restart, the Python processes and file-system caches were reset, so subsequent imports worked—consistent with a transient interpreter/path/import cache issue. Resetting the notebook session or cluster is the documented way to clear Python state when things get wedged.
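To illustrate that last point, here is a minimal sketch of clearing cached Python import state from a notebook without a full cluster restart (it assumes a Databricks notebook session where dbutils is available; the module names are taken from your snippet):

import importlib
import sys

# Drop previously imported copies of the custom modules so the next import
# re-reads the files instead of reusing a stale cached module object.
for name in list(sys.modules):
    if name == "utils" or name.startswith("trigger_processing_framework"):
        del sys.modules[name]
importlib.invalidate_caches()

# Heavier option: restart this notebook session's Python process, which clears
# all notebook-scoped Python state (available on recent DBR versions).
dbutils.library.restartPython()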

Will concurrent runs share environment state on an interactive cluster?

Short answer: partially.
  • Each notebook run has its own session. If you use notebook‑scoped libraries via %pip, those installs are isolated to that notebook session and do not affect other notebooks, even on the same cluster (see the %pip sketch after this list).
  • Anything compute‑scoped (cluster libraries, init scripts, Spark configuration, cached data, etc.) is shared by all notebooks/jobs on that cluster. Installing a library at the cluster level makes it available to all attached notebooks and jobs, which can cause interference across concurrent runs if versions or side effects conflict.
  • Databricks recommends running with modern access modes (standard/shared) and avoiding legacy “no‑isolation shared.” “Standard access mode” is the recommended default for most workloads.
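For example, a first cell along these lines gives each run a notebook‑scoped environment on the shared cluster (the package name and version are placeholders, not your actual library):

# First cell of the notebook: a notebook-scoped install that is visible only to
# this notebook session, not to other notebooks attached to the same cluster.
# Keep %pip commands at the top of the notebook, before any imports.
%pip install trigger-processing-framework==0.1.0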

Concrete fixes and best practices

Adopt as many of these as you can; the first two are the big ones.
  • Move code off /dbfs:
    • Store Python modules as workspace files under /Workspace (or in a Git folder/Repos) and import them normally; or package as a wheel (.whl) and install as a library. Both patterns are recommended over importing from DBFS paths.
  • Prefer job clusters for ADF‑triggered runs (or Databricks Workflows), especially when runs can overlap. A new cluster per run eliminates shared-state interference by construction.
  • If you must share an interactive cluster:
    • Use %pip in the first cell to create a notebook‑scoped environment for each run; this limits cross‑run impacts on the same cluster.
    • Keep libraries that need to be shared stable as compute‑scoped libraries and avoid mutating them during job execution to reduce race conditions.
    • Avoid star imports (from utils import *). Import explicit symbols to reduce import‑time side effects.
    • If you deploy code updates, start a new session (or restart the cluster) to clear Python/import caches before the next run.
  • Use Workspace Files or Volumes paths instead of /dbfs when importing:
    • For workspace source files, reference them as workspace files (for example, keep your package alongside notebooks under /Workspace and import with normal Python module paths); Databricks recommends workspace files for source code and Volumes for larger artifacts.
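To make the workspace‑files option concrete, here is a minimal sketch of one possible layout and the corresponding imports (the folder and file names are illustrative, not your actual repo structure):

# Illustrative layout under /Workspace (for example, inside a Git folder):
#
#   /Workspace/Repos/<user>/data-enabling/
#     process_trigger                      <- this notebook
#     utils.py
#     trigger_processing_framework/
#       __init__.py
#       insert_trigger.py
#       update_trigger.py
#
# With the package stored alongside the notebook as workspace files, recent DBR
# versions (14.3 included) put the notebook's directory on sys.path, so no
# sys.path.insert('/dbfs/...') is needed:
import utils  # import the module explicitly instead of "from utils import *"
from trigger_processing_framework.insert_trigger import InsertTrigger
from trigger_processing_framework.update_trigger import UpdateTrigger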

Minimal migration example

Replace:

import sys
sys.path.insert(0, '/dbfs/DataEnabling/Pyspark')
from utils import *
from trigger_processing_framework.insert_trigger import InsertTrigger

With one of:

1) Workspace files (recommended during development):
  • Place your Python package as files under /Workspace/… (optionally in a Git folder/Repos).
  • Use a normal import package.module without touching sys.path.
2) Wheel library (recommended for production):
  • Build a wheel and install it either as a compute‑scoped library or with %pip install /Volumes/.../my_lib.whl for a notebook‑scoped environment, then import normally.
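As a sketch of option 2 (the wheel path, package name, and version below are placeholders, not your actual artifact):

# On a build machine or CI job, from the project root containing pyproject.toml:
#   python -m build   ->  dist/trigger_processing_framework-0.1.0-py3-none-any.whl
#
# Upload the wheel to a Unity Catalog Volume, then install it in the first
# notebook cell as a notebook-scoped library:
%pip install /Volumes/main/default/libs/trigger_processing_framework-0.1.0-py3-none-any.whl

# Import normally in later cells:
from trigger_processing_framework.insert_trigger import InsertTrigger
from trigger_processing_framework.update_trigger import UpdateTrigger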
If you keep using a shared interactive cluster in the short term, try running a single concurrent ADF trigger to validate the import path change first, then increase concurrency.
 
Hope this helps, Louis.
