How can I start SparkSession out of Notebook?
08-08-2023 04:11 PM
Hi community,
How can I start SparkSession out of Notebook?
I want to split my Notebook into small Python modules, and I want to let some of them call Spark functionality.
08-09-2023 12:14 AM
Can you elaborate a bit more?
Are you going to call those modules in a notebook and want to use Spark functions in them?
Or do you want to explicitly start a separate SparkSession for each module?
08-09-2023 03:21 AM
Hello,
To start a SparkSession outside of a notebook, you can follow these steps to split your code into small Python modules and utilize Spark functionality:
- Import Required Libraries: In your Python module, import the necessary libraries for Spark:
from pyspark.sql import SparkSession
- Create SparkSession: Initialize the SparkSession at the beginning of your module:
spark = SparkSession.builder \
.appName("YourAppName") \
.config("spark.some.config.option", "config-value") \
.getOrCreate()
Customize the configuration options as needed.
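As a quick sanity check (a minimal sketch; the config key above and the app name are placeholders), you can verify the session after getOrCreate(). Note that on a Databricks cluster, getOrCreate() returns the session the platform already created, so config options passed to the builder may not all take effect on that existing session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("YourAppName").getOrCreate()

# Confirm the session is live and inspect its effective configuration:
print(spark.version)
print(spark.conf.get("spark.app.name"))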
08-09-2023 05:22 AM
Databricks provides a SparkSession out of the box; you just use the variable "spark".
To use it in other modules, pass the spark variable as a parameter to those modules.
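A minimal sketch of that pattern (the module, function, and path names here are purely illustrative):

# my_module.py (hypothetical module)
from pyspark.sql import SparkSession, DataFrame

def load_events(spark: SparkSession, path: str) -> DataFrame:
    # The notebook owns the session; this module only borrows it.
    return spark.read.json(path)

# In the notebook, where "spark" already exists:
# from my_module import load_events
# df = load_events(spark, "/mnt/data/events")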
08-09-2023 07:55 AM
Thank you for all replies.
@-werners- I want to use the Spark session in modules that are called from the Notebook.
@sakhulaz How can I get the config options needed to attach to the Databricks data?
@Tharun-Kumar Thank you. That approach definitely works for my situation!
08-09-2023 08:03 AM
In general (as already stated), a notebook automatically gets a SparkSession.
You don't have to do anything.
If you specifically need separate sessions (isolation), you should run different notebooks (or plan different jobs), as these get a new session (one session per notebook/job).
Magic commands like %scala, %run, etc. use the same SparkSession, so there is no isolation there.
06-29-2024 01:41 AM - edited 06-29-2024 01:42 AM
To start a SparkSession outside of a Jupyter Notebook and enable its use in multiple Python modules, follow these steps:
- Install Apache Spark: Ensure Spark is installed on your system. You can download it from the Apache Spark website and set it up with Hadoop or use a standalone cluster.
- Set Up Environment Variables: Configure the necessary environment variables (SPARK_HOME, JAVA_HOME, and PYTHONPATH) to point to the correct locations.
- Create a Spark Configuration Module: Create a Python file (e.g., spark_config.py) to set up the SparkSession:

from pyspark.sql import SparkSession

def create_spark_session(app_name="MyApp"):
    spark = SparkSession.builder \
        .appName(app_name) \
        .getOrCreate()
    return spark

- Initialize SparkSession in Your Modules: Import and use the create_spark_session function in your Python modules to get the SparkSession:

from spark_config import create_spark_session

spark = create_spark_session("ModuleName")
# Now you can use Spark functionality, e.g.:
df = spark.read.csv("path/to/data.csv")
df.show()

- Run Your Modules: Execute your Python scripts or modules from the command line or within a larger application, and the Spark session will be initialized and used as needed.
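As a hedged sketch of that last step (building on the spark_config module above; the file name and sample data are made up), a module can also be made directly runnable with a main guard:

# run_job.py (hypothetical file name)
from spark_config import create_spark_session

def main():
    spark = create_spark_session("StandaloneRun")
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.show()
    spark.stop()  # only stop a session this script created itself

if __name__ == "__main__":
    main()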
06-30-2024 02:39 AM
Just take over the SparkSession that Databricks already started:
from pyspark.sql import SparkSession
spark = SparkSession.getActiveSession()
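One caveat: getActiveSession() returns None when no session is active (e.g., when the module runs outside Databricks), so a small fallback makes modules more portable. A minimal sketch, assuming a hypothetical helper of your own:

from pyspark.sql import SparkSession

def get_spark() -> SparkSession:
    # Reuse the active session if one exists (as on Databricks);
    # otherwise create a local one so the module also runs standalone.
    active = SparkSession.getActiveSession()
    if active is not None:
        return active
    return SparkSession.builder.appName("local-fallback").getOrCreate()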