How can I start SparkSession out of Notebook?

NCat
New Contributor III

Hi community,

How can I start SparkSession out of Notebook?
I want to split my Notebook into small Python modules, and I want some of them to call Spark functionality.

-werners-
Esteemed Contributor III

can you elaborate a bit more?
Are you going to call those modules in a notebook, and want to use spark functions in them?
Or do you want to explicitly start a separate sparksession for each module?

sakhulaz
New Contributor II

Hello,

To start a SparkSession outside of a notebook, you can follow these steps to split your code into small Python modules and utilize Spark functionality:

  1. Import Required Libraries:

In your Python module, import the necessary libraries for Spark:

from pyspark.sql import SparkSession

  2. Create SparkSession:

Initialize the SparkSession at the beginning of your module:

spark = SparkSession.builder \
    .appName("YourAppName") \
    .config("spark.some.config.option", "config-value") \
    .getOrCreate()

Customize the configuration options as needed.

Tharun-Kumar
Databricks Employee

@NCat 

Databricks provides a SparkSession out of the box. You just have to use the variable "spark".

In order to use it in other modules, you have to pass the spark variable as a parameter to the other modules.
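
As a rough sketch of what passing the session around can look like: the module name helpers.py and the functions count_rows and load_csv below are made up for illustration and do not come from this thread. The module never creates a session itself; it only uses whatever the notebook hands over.

```python
# helpers.py (hypothetical module name, used here only for illustration)

def count_rows(spark, table_name):
    """Count rows in a table using the SparkSession supplied by the caller."""
    return spark.table(table_name).count()


def load_csv(spark, path, header=True):
    """Read a CSV file into a DataFrame using the caller's SparkSession."""
    return spark.read.csv(path, header=header)


# In the notebook, where Databricks already defines `spark`:
#   from helpers import count_rows
#   count_rows(spark, "my_table")   # "my_table" is a placeholder name
```

Because the session is an ordinary argument, the same functions can also be called from a unit test with a locally created session.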

NCat
New Contributor III

Thank you for all the replies.
@-werners- I want to use the Spark Session in modules which are called from a Notebook.

@sakhulaz How can I get the config options to attach to the Databricks data?

@Tharun-Kumar Thank you. That approach definitely works for my situation!

-werners-
Esteemed Contributor III

in general (as already stated) a notebook automatically gets a sparksession.
You don't have to do anything.
If you specifically need to have separate sessions (isolation), you should run different notebooks (or plan different jobs) as these get a new session (a session per notebook/job).
Magic commands like %scala, %run, etc. use the same sparksession, so there is no isolation there.

benrich
New Contributor II

To start a SparkSession outside of a Jupyter Notebook and enable its use in multiple Python modules, follow these steps:

  1. Install Apache Spark: Ensure Spark is installed on your system. You can download it from the Apache Spark website and set it up with Hadoop or use a standalone cluster.

  2. Set Up Environment Variables: Configure the necessary environment variables (SPARK_HOME, JAVA_HOME, and PYTHONPATH) to point to the correct locations.

  3. Create a Spark Configuration Module: Create a Python file (e.g., spark_config.py) to set up the SparkSession:

    from pyspark.sql import SparkSession

    def create_spark_session(app_name="MyApp"):
        spark = SparkSession.builder \
            .appName(app_name) \
            .getOrCreate()
        return spark

  4. Initialize SparkSession in Your Modules: Import and use the create_spark_session function in your Python modules to get the SparkSession:

    from spark_config import create_spark_session

    spark = create_spark_session("ModuleName")

    # Now you can use Spark functionality, e.g.:
    df = spark.read.csv("path/to/data.csv")
    df.show()

  5. Run Your Modules: Execute your Python scripts or modules from the command line or within a larger application, and the Spark session will be initialized and used as needed.

jacovangelder
Databricks MVP

Just take over the existing Databricks SparkSession.

from pyspark.sql import SparkSession
spark = SparkSession.getActiveSession()
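
One thing to watch for: getActiveSession() (available from PySpark 3.0) returns None when no session is active, so a module that relies on it may want a guard. A minimal sketch, where the helper name get_spark is made up for illustration:

```python
def get_spark():
    """Return the already-running SparkSession, or fail with a clear message.

    get_spark is a hypothetical helper name, not part of PySpark.
    """
    # Imported lazily so that importing this module does not require Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.getActiveSession()
    if spark is None:
        raise RuntimeError(
            "No active SparkSession found; call this from a notebook or job "
            "where a session has already been started."
        )
    return spark
```

A module can then call get_spark() inside its functions instead of taking the session as a parameter. Whether that is preferable to passing spark explicitly is a design choice.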