Databricks

Mr__D · ‎03-23-2023

Hi All,

Could you please suggest the best way to write PySpark code in Databricks?

I don't want to write my code in a Databricks notebook. Instead, I'd like to create Python files (a modular project) in VS Code and call only the primary function from the notebook, with the rest of the logic living in the Python files.

Could you please let me know the best way to achieve it?

Thanks,

Deepak

Anonymous · ‎03-24-2023

@Deepak Bhatt :

Yes, you can write your PySpark code in modular Python files outside of Databricks and then call them from a Databricks notebook. Here are the steps you can follow:

1. Create a Python file in your local development environment (e.g., VS Code) and write your PySpark code in it. You can define a main function in this file which will be called from the Databricks notebook.
2. Save the Python file to a Git repository or a cloud storage service such as Azure Blob Storage or Amazon S3.
3. In the Databricks notebook, you can clone the Git repository or mount the cloud storage service to access the Python file.
4. Import the Python file in your notebook using the Python import statement. For example, if your Python file is named my_pyspark_code.py, you can import it like this:

import my_pyspark_code

5. Call the main function in your Python file from the Databricks notebook. For example, if your main function is named run_spark_job(), you can call it like this:

my_pyspark_code.run_spark_job()

By following these steps, you can write your PySpark code in a modular and maintainable way outside of Databricks, and then easily call it from a Databricks notebook.
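
As a rough illustration of the steps above, here is a minimal sketch of what my_pyspark_code.py might contain. The file name and run_spark_job() come from the answer; the count_even_ids helper, the n parameter, and passing the session in as an argument are assumptions made only for this example, not a prescribed layout.

```python
# my_pyspark_code.py (file and entry-point names follow the answer above;
# count_even_ids and the `n` parameter are hypothetical, for illustration only)


def count_even_ids(spark, n=10):
    """Hypothetical job logic: count the even ids in spark.range(n)."""
    df = spark.range(n)                     # DataFrame with a single `id` column
    return df.filter("id % 2 = 0").count()  # SQL-style filter, then an action


def run_spark_job(spark=None, n=10):
    """Entry point meant to be called from the notebook."""
    if spark is None:
        # Imported lazily so the module can still be loaded where PySpark is absent.
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
    return count_even_ids(spark, n)
```

In a Databricks notebook the `spark` session is already defined, so the call could look like my_pyspark_code.run_spark_job(spark). Passing the session in explicitly also makes the helper functions easier to test outside Databricks.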

Anonymous · ‎03-25-2023

Hi @Deepak Bhatt

Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

ThiagoLDC · ‎12-17-2023

Hi all,

I have a very similar problem. I can write code perfectly well in my GitHub Repos here, but when I try to access it through an import statement I receive the error:

ModuleNotFoundError: No module named 'module_name'

When I check my environment with os.getcwd(), I see that my default path is databricks/driver. However, when I copy the path of my current environment through the files path and use it in os.chdir(), I receive the following error:

FileNotFoundError: [Errno 2] No such file or directory: "/Repos/repo_name"

Is there a quick fix for this? I usually do not have problems like this in VS Code or Jupyter notebooks.

Thanks!
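
One way to narrow down an error like this is to check what Python can actually see before importing. The sketch below is only a diagnostic aid: the diagnose_import helper and its arguments are hypothetical names, not part of any Databricks API. It may also be worth checking whether the repo path needs a /Workspace prefix on your runtime, which is an assumption to verify rather than a confirmed cause.

```python
import importlib.util
import os
import sys


def diagnose_import(module_name, candidate_dir):
    """Report why `import module_name` might fail for a file in candidate_dir."""
    module_file = os.path.join(candidate_dir, module_name + ".py")
    # Normalize sys.path entries so the comparison is not fooled by relative paths.
    search_path = [os.path.abspath(p) for p in sys.path if p]
    return {
        "cwd": os.getcwd(),
        "dir_exists": os.path.isdir(candidate_dir),
        "file_exists": os.path.isfile(module_file),
        "dir_on_sys_path": os.path.abspath(candidate_dir) in search_path,
        # find_spec returns None when no finder can locate a top-level module.
        "importable": importlib.util.find_spec(module_name) is not None,
    }
```

If dir_exists is False, the path itself is wrong for the driver's filesystem. If the file exists but dir_on_sys_path is False, the folder simply has not been added to the import search path yet.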

ThiagoLDC · ‎12-17-2023

I understand the error now. It was quite easy, actually. For me it was just a matter of moving the .py script to the same cluster as the notebook.

Now it's working fine.

ThiagoLDC · ‎12-17-2023

Sorry, I think I got it wrong in the comment above. It worked, but I also had to upload the .py file to DBFS. I'm still looking for a faster way to solve this issue.

shr_ath · ‎01-13-2024

Hi @ThiagoLDC ,
To import a user-defined module, the .py file needs to be either in the same directory as the notebook or in a Repo, from which you can import it.
When importing code from a Repo in the notebook, you can do it like this:

import sys, os
sys.path.append(os.path.abspath('<module-path>'))
from <pyfilename> import <class/function>

For detailed documentation, refer to:
https://docs.databricks.com/en/delta-live-tables/import-workspace-files.html
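
To make the sys.path approach from this reply concrete, here is a small self-contained sketch. It uses a temporary folder as a stand-in for the <module-path> placeholder, and the my_helpers module and greet function are hypothetical names created only for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the repo folder; in Databricks this would be the <module-path> above.
module_dir = tempfile.mkdtemp()

# Write a tiny hypothetical module so the example is self-contained.
with open(os.path.join(module_dir, "my_helpers.py"), "w") as f:
    f.write("def greet(name):\n    return 'Hello, ' + name\n")

# Make the folder importable, avoiding duplicate entries when the cell is re-run.
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

# The file was created after the interpreter started, so refresh the finder caches.
importlib.invalidate_caches()
from my_helpers import greet

print(greet("Databricks"))
```

Guarding the append with a membership check keeps sys.path from growing each time the notebook cell runs.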

Gamlet · ‎01-17-2024

Certainly! To write PySpark code in Databricks while maintaining a modular project in VS Code, organize your PySpark code into Python files in VS Code, with a primary function encapsulating the main logic. Then upload these files to Databricks, create a Databricks notebook, and call the primary function from there. Note that the %run magic command executes another notebook in the current session, so plain .py modules are normally brought in with an import statement instead. Either way, the core logic stays outside the notebook, which improves code organization and reusability.

Best wishes, Zpak

Writing modular code in Databricks
