How can I start a SparkSession outside of a notebook?

NCat · ‎08-08-2023

Hi community,

How can I start a SparkSession outside of a notebook?
I want to split my notebook into small Python modules, and I want some of them to call Spark functionality.

-werners- · ‎08-09-2023

Can you elaborate a bit more?
Are you going to call those modules in a notebook, and want to use spark functions in them?
Or do you want to explicitly start a separate sparksession for each module?

sakhulaz · ‎08-09-2023

Hello,

To start a SparkSession outside of a notebook, you can follow these steps to split your code into small Python modules and utilize Spark functionality:

Import Required Libraries:

In your Python module, import the necessary libraries for Spark:

from pyspark.sql import SparkSession

Create SparkSession:

Initialize the SparkSession at the beginning of your module:

spark = SparkSession.builder \
    .appName("YourAppName") \
    .config("spark.some.config.option", "config-value") \
    .getOrCreate()

Customize the configuration options as needed.

Tharun-Kumar · ‎08-09-2023

@NCat

Databricks provides a SparkSession out of the box. You just have to use the variable "spark".

To use it in other modules, pass the spark variable as a parameter to the functions in those modules.
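As a minimal sketch of that approach (the module name helpers.py and the function load_csv are hypothetical, chosen only for illustration), a module can accept the notebook's session as an argument instead of creating its own:

```python
# helpers.py (hypothetical module name, used only for illustration)
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed only for type hints; this module never creates a session itself.
    from pyspark.sql import DataFrame, SparkSession


def load_csv(spark: "SparkSession", path: str) -> "DataFrame":
    """Read a CSV file with a header row, using the session passed in by the caller."""
    return spark.read.csv(path, header=True)
```

In the notebook, this would be called as load_csv(spark, "path/to/data.csv"), where spark is the variable Databricks already provides.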

NCat · ‎08-09-2023

Thank you for all the replies.
@-werners- I want to use the Spark session in modules that are called from a notebook.

@sakhulaz Which config options do I need to connect to the Databricks data?

@Tharun-Kumar Thank you. That approach definitely works for my situation!

-werners- · ‎08-09-2023

In general (as already stated), a notebook automatically gets a SparkSession.
You don't have to do anything.
If you specifically need separate sessions (isolation), you should run different notebooks (or schedule different jobs), as each of these gets a new session (one session per notebook/job).
Magic commands such as %scala and %run use the same SparkSession, so there is no isolation there.

benrich · ‎06-29-2024

To start a SparkSession outside of a Jupyter Notebook and enable its use in multiple Python modules, follow these steps:

Install Apache Spark: Ensure Spark is installed on your system. You can download it from the Apache Spark website and set it up with Hadoop or use a standalone cluster.
Set Up Environment Variables: Configure the necessary environment variables (SPARK_HOME, JAVA_HOME, and PYTHONPATH) to point to the correct locations.
Create a Spark Configuration Module: Create a Python file (e.g., spark_config.py) to set up the SparkSession:

from pyspark.sql import SparkSession

def create_spark_session(app_name="MyApp"):
    spark = SparkSession.builder \
        .appName(app_name) \
        .getOrCreate()
    return spark

Initialize SparkSession in Your Modules: Import and use the create_spark_session function in your Python modules to get the SparkSession:

from spark_config import create_spark_session

spark = create_spark_session("ModuleName")

# Now you can use Spark functionality, e.g.:
df = spark.read.csv("path/to/data.csv")
df.show()

Run Your Modules: Execute your Python scripts or modules from the command line or within a larger application, and the Spark session will be initialized and used as needed.

jacovangelder · ‎06-30-2024

Just take over the Databricks SparkSession.

from pyspark.sql import SparkSession
spark = SparkSession.getActiveSession()
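
One caveat: getActiveSession() returns None when no session is active in the current thread. A small helper (get_spark is a hypothetical name, not part of PySpark) could fall back to the builder in that case. This is only a sketch and assumes PySpark 3.0 or later:

```python
def get_spark():
    """Return the session that is already running, or create one if none is active."""
    # Imported inside the function so that importing this module has no Spark side effects.
    from pyspark.sql import SparkSession

    active = SparkSession.getActiveSession()
    if active is not None:
        # In a Databricks notebook this should be the session the notebook already uses.
        return active
    # No active session (e.g. the module runs outside a notebook): build or reuse one.
    return SparkSession.builder.getOrCreate()
```

Any module can then call get_spark() without the notebook having to pass the session around.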