Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Python modules run via %sh lose access to Spark

ajay_wavicle
Contributor

Python modules run via %sh lose access to Spark. How do I regain the Spark session and access the Databricks tables?

4 REPLIES

soloengine
New Contributor II

%sh runs a shell command on the driver node’s OS, not inside the notebook’s Python/Spark runtime. It basically opens a separate Linux process on the driver machine.

The Spark session, on the other hand, is attached to the notebook runtime. So when you use normal Python cells, you’re inside the Spark-enabled environment.
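You can see this separation with plain Python, no Databricks required: a subprocess is roughly what %sh spawns, and it gets a fresh interpreter with none of the parent's variables (a minimal sketch; `session_marker` here stands in for the notebook's `spark` object):

```python
import subprocess
import sys

# A variable defined in this interpreter, like `spark` in a notebook cell.
session_marker = "I live in the notebook runtime"

# A child process (roughly what %sh creates) gets a fresh interpreter
# and cannot see any of the parent's variables:
result = subprocess.run(
    [sys.executable, "-c", "print('session_marker' in globals())"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # False: separate process, separate state
```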

May I know what you are running using %sh?

@soloengine We use it to run existing Python notebooks. We are trying to make them support Spark as well.

saurabh18cs
Honored Contributor III

Hi @ajay_wavicle 

The Spark session never disappears; %sh simply runs outside it.

Do this (import the module in a Python cell and pass the session in):

from mymodule import myfunc
myfunc(spark)

Don't do this (the shell subprocess has no SparkSession):

%sh
python my_script.py

SteveOstrowski
Databricks Employee

Hi @ajay_wavicle,

Thanks for the detailed writeup. The reason you lose access to Spark when using %sh is that it launches a completely separate Linux process on the driver node. That process runs outside the notebook runtime, so it has no connection to the SparkSession, dbutils, or any of the variables defined in your notebook cells.

There are several approaches to run your existing Python code while keeping full Spark access. Here is a rundown from simplest to most flexible.


OPTION 1: USE %run TO EXECUTE NOTEBOOKS INLINE

If your existing Python code is already in other Databricks notebooks, the easiest approach is %run. This executes the target notebook in the same Spark session, so all Spark APIs, tables, and dbutils are fully available.

%run /path/to/your_notebook

Any functions and variables defined in the called notebook become available in the calling notebook. One constraint: %run must be the only content in the cell.

Documentation: https://docs.databricks.com/en/notebooks/notebook-workflows.html


OPTION 2: IMPORT PYTHON FILES AS MODULES (RECOMMENDED)

On Databricks Runtime 11.3 LTS and above, you can store .py files directly in the workspace alongside your notebooks and import them as regular Python modules. This is the cleanest approach for reusing existing Python code with Spark.

1. Upload or create your Python files in the workspace (for example, alongside your notebook or in a subfolder).

2. If needed, add the directory to your Python path:

import sys
import os
sys.path.append(os.path.abspath('/Workspace/path/to/your/modules'))

3. Import and call your functions, passing the spark session explicitly:

from my_module import my_function
result = my_function(spark)

Inside my_module.py, your function receives the active SparkSession:

def my_function(spark):
    df = spark.table("my_catalog.my_schema.my_table")
    # do your processing
    return df

On Databricks Runtime 14.0 and above, the current working directory defaults to the directory containing the notebook, so relative imports are even simpler.
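The whole pattern can be exercised locally with a stand-in for the SparkSession (a sketch only: the temp-dir module, `my_module`/`my_function`, and the `FakeSession` class are illustrative, not Databricks APIs):

```python
import os
import sys
import tempfile
import textwrap

# Step 1: write a tiny module to a temp dir, mimicking a .py file
# stored in the workspace next to your notebook.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "my_module.py"), "w") as f:
    f.write(textwrap.dedent("""\
        def my_function(spark):
            # `spark` is whatever session object the caller passes in
            return spark.table("my_catalog.my_schema.my_table")
    """))

# Step 2: make the directory importable.
sys.path.append(tmpdir)

# Step 3: import the function and pass the session in explicitly.
from my_module import my_function

class FakeSession:
    """Minimal stand-in for a SparkSession, just to show the wiring."""
    def table(self, name):
        return f"DataFrame<{name}>"

print(my_function(FakeSession()))  # DataFrame<my_catalog.my_schema.my_table>
```

On Databricks, you would skip the `FakeSession` and pass the notebook's real `spark` object instead.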

During development, you can enable autoreload so changes to your modules are picked up without restarting the kernel:

%load_ext autoreload
%autoreload 2

Documentation: https://docs.databricks.com/en/files/workspace-modules.html


OPTION 3: USE dbutils.notebook.run() FOR ORCHESTRATION

If you need to run a notebook as a separate job (for example, with different parameters or in a workflow), use dbutils.notebook.run(). This launches the notebook as a new job run with its own Spark session and full access to tables.

result = dbutils.notebook.run(
    "/path/to/your_notebook",
    timeout_seconds=600,
    arguments={"param1": "value1"},
)

The called notebook can return a string result and create global temporary views to share data back.
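Since the return value is always a string, a common pattern is to serialize structured results as JSON in the called notebook and parse them in the caller (a sketch: the `rows_processed` payload is illustrative, and the stand-in below replaces the actual `dbutils.notebook.run` call, which only works inside Databricks):

```python
import json

# The called notebook would end with something like:
#   dbutils.notebook.exit(json.dumps({"rows_processed": 123}))
# Here we use a literal string as a stand-in for the value
# that dbutils.notebook.run(...) would return to the caller.
raw_result = json.dumps({"rows_processed": 123})

# The caller parses the string back into structured data:
payload = json.loads(raw_result)
print(payload["rows_processed"])  # 123
```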

Documentation: https://docs.databricks.com/en/notebooks/notebook-workflows.html


OPTION 4: DATABRICKS CONNECT (FOR EXTERNAL SCRIPTS)

If you truly need to run standalone Python scripts (outside the notebook environment) that connect to Spark on a Databricks cluster, Databricks Connect is the right tool. It lets external Python processes establish a remote SparkSession.

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
df = spark.table("my_catalog.my_schema.my_table")

This is ideal if you have a large codebase developed locally or in a CI/CD pipeline and need cluster-backed Spark execution.

Documentation: https://docs.databricks.com/en/dev-tools/databricks-connect/python/index.html


QUICK SUMMARY

- Do not use %sh to run Python scripts that need Spark. The shell process cannot access the SparkSession.
- For notebook-to-notebook calls, use %run (shares the same session) or dbutils.notebook.run() (new session with parameters).
- For .py file reuse, import them as Python modules and pass the spark object explicitly.
- For external/standalone scripts, use Databricks Connect.

Hope this helps you get your existing Python code running with full Spark access. Let us know which approach works best for your use case.

* This reply was drafted with an agent system I built, which researches responses from the documentation I have available and from previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.