Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to create SQL functions using PySpark on a local machine

badari_narayan
New Contributor II

I am trying to create a Spark SQL function in a particular schema, for example:

spark.sql(" CREATE OR REPLACE FUNCTION <spark_catalog>.<schema_name>.<function_name()> RETURNS STRING RETURN <value>")

This works perfectly fine on Databricks using notebooks.

However, I need to use the same statement in a project that runs on my local machine (VS Code), and I am facing issues there. Please advise on how to implement this locally.

6 REPLIES

filipniziol
Contributor III

Hi @badari_narayan ,

In VSCode, install the Databricks extension and then connect it to your existing Databricks workspace and cluster.

This setup allows you to run Databricks notebooks and scripts from your local VSCode, while using the Spark context of the connected Databricks cluster.

Explanation:

Databricks Extension in VSCode: The Databricks extension for VSCode allows you to connect to your Databricks workspace, access notebooks, and run code directly from VSCode.

Use of Spark Context: When connected to a Databricks cluster, the code you execute from VSCode uses the cluster's Spark context. This means computations and data processing are performed on the Databricks cluster, not on your local machine.
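As a rough sketch of what running the original statement from a local script against a connected cluster could look like: the snippet below assumes the separate `databricks-connect` package (version 13 or later) is installed and that authentication and the target cluster are already configured, for example through a Databricks configuration profile. This is one possible route, not necessarily what the extension does internally. The schema name `my_schema`, the function name `get_greeting`, and the helpers `build_create_function_sql`, `create_function`, and `connect` are hypothetical and exist only for illustration.

```python
def build_create_function_sql(catalog: str, schema: str, name: str,
                              return_type: str, body: str) -> str:
    """Build a CREATE OR REPLACE FUNCTION statement for a parameterless SQL function."""
    return (
        f"CREATE OR REPLACE FUNCTION {catalog}.{schema}.{name}() "
        f"RETURNS {return_type} RETURN {body}"
    )


def create_function(spark, catalog: str, schema: str, name: str,
                    return_type: str, body: str) -> str:
    """Run the statement on the given session and return the SQL that was sent."""
    statement = build_create_function_sql(catalog, schema, name, return_type, body)
    spark.sql(statement)
    return statement


def connect():
    """Return a session backed by the remote Databricks cluster.

    Assumption: databricks-connect is installed and credentials plus the
    cluster to use are configured (e.g. a profile or environment variables).
    """
    # Imported lazily so the helpers above work without databricks-connect.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.getOrCreate()
```

With a configured workspace, `create_function(connect(), "spark_catalog", "my_schema", "get_greeting", "STRING", "'hello'")` would send the statement to the cluster, so the function is created on Databricks rather than on the local machine.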

 

badari_narayan
New Contributor II

Hi @filipniziol,

Can you provide any sample code snippets or a tutorial video, so that I can make sure I am not making any mistakes on my side?

filipniziol
Contributor III

Hi @filipniziol ,

Thanks for the response. In this tutorial they run the code by uploading it to Databricks or as workflows, which is suitable for a single file. In my case, I need to run an entire project, and I don't want it uploaded to Databricks; I just need it to run locally in VS Code.

Do you have any ideas for this?

To run locally in VS Code, install PySpark and then start a local Spark session:

 

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .getOrCreate()

 

This setup should let the project run locally without Databricks, as long as you stay within what PySpark itself can do.

filipniziol
Contributor III

Hi @badari_narayan ,

In general, you can run a PySpark project locally, but with limitations.

  1. Create a virtual environment.
  2. Install PySpark in your virtual environment (the same version you have on your cluster).
  3. Since Spark 2.x, you do not even need to set up a separate local Spark cluster; a local-mode session is enough.

 

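from pyspark.sql import SparkSession

# Start (or reuse) a local-mode session that uses all available cores.
spark = SparkSession.builder \
    .appName("LocalSparkApp") \
    .master("local[*]") \
    .getOrCreate()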
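To make step 2 (matching the cluster's version) easier to verify, a small check like the following could be run inside the virtual environment. It is only a sketch: comparing the major and minor numbers is an assumption about what counts as "the same version", and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def major_minor(version_string: str) -> tuple[int, int]:
    """Return the (major, minor) part of a version such as '3.5.1'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def versions_match(local_version: str, cluster_version: str) -> bool:
    """True when both versions share the same major and minor number."""
    return major_minor(local_version) == major_minor(cluster_version)


def check_local_pyspark(cluster_version: str) -> str:
    """Describe whether the installed pyspark matches the cluster's Spark version."""
    try:
        local_version = version("pyspark")
    except PackageNotFoundError:
        return "pyspark is not installed in this environment"
    if versions_match(local_version, cluster_version):
        return f"pyspark {local_version} matches cluster Spark {cluster_version}"
    return f"pyspark {local_version} differs from cluster Spark {cluster_version}"
```

Calling `print(check_local_pyspark("3.5.0"))` reports the result, where `"3.5.0"` stands in for whatever Spark version your cluster actually runs.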

 

Running locally comes with limitations, and you need to set things up properly.
For example, you first need to create your Spark session as shown above (in a Databricks workspace it is already available). You will also be able to run the Spark SQL that ships with PySpark, but not Databricks proprietary SQL.
Locally, you also do not have Unity Catalog configured the way it is in your workspace.
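If the `CREATE OR REPLACE FUNCTION ... RETURN` statement is not accepted by the open-source Spark version you have installed, one possible local substitute is `spark.udf.register`, which exposes a Python function to Spark SQL. This is a sketch under assumptions, not an equivalent of the Databricks feature: the function is temporary, lives only for the session, and is called by its bare name rather than as `<schema_name>.<function_name>`. The names `get_environment`, `register_local_functions`, and `build_local_session` are hypothetical.

```python
def get_environment() -> str:
    """Hypothetical function body: returns a constant string."""
    return "local"


def register_local_functions(spark) -> None:
    """Register Python functions as temporary Spark SQL functions.

    `spark` is an existing SparkSession. The registered functions are
    session-scoped and are not stored in any schema.
    """
    spark.udf.register("get_environment", get_environment, "string")


def build_local_session():
    """Create a local-mode session (assumes pyspark and Java are installed)."""
    # Imported lazily so the helpers above can be used without pyspark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("LocalSparkApp")
        .master("local[*]")
        .getOrCreate()
    )
```

After `spark = build_local_session()` and `register_local_functions(spark)`, a query such as `spark.sql("SELECT get_environment() AS env")` can call the function for the rest of that session.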

The Databricks team recently open-sourced Unity Catalog, so you may want to check the pages below:
https://www.unitycatalog.io/
https://github.com/unitycatalog/unitycatalog

Still, if you do not want to connect to a Databricks workspace, you will need to set up your own local Unity Catalog and then run the code that creates the function against it.
