Writing modular code in Databricks

Mr__D
New Contributor II

Hi All,

Could you please suggest the best way to write PySpark code in Databricks?

I don't want to write my code in a Databricks notebook. Instead, I'd like to create Python files (a modular project) in VS Code and call only the primary function from the notebook; the rest of the logic will live in the Python files.

Could you please let me know the best way to achieve this?

Thanks,

Deepak

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Deepak Bhatt:

Yes, you can write your PySpark code in modular Python files outside of Databricks and then call them from a Databricks notebook. Here are the steps you can follow:

  1. Create a Python file in your local development environment (e.g., VS Code) and write your PySpark code in it. You can define a main function in this file which will be called from the Databricks notebook.
  2. Save the Python file to a Git repository or a cloud storage service such as Azure Blob Storage or Amazon S3.
  3. In the Databricks notebook, you can clone the Git repository or mount the cloud storage service to access the Python file.
  4. Import the Python file in your notebook using the Python import statement. For example, if your Python file is named my_pyspark_code.py, you can import it like this:
import my_pyspark_code

  5. Call the main function in your Python file from the Databricks notebook. For example, if your main function is named run_spark_job(), you can call it like this:

my_pyspark_code.run_spark_job()

By following these steps, you can write your PySpark code in a modular and maintainable way outside of Databricks, and then easily call it from a Databricks notebook.
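The steps above can be sketched end to end. This is a minimal, self-contained illustration: my_pyspark_code and run_spark_job are the example names from the steps, and a temporary directory stands in for the cloned repo so the sketch runs anywhere (in Databricks you would instead have the Repo path on sys.path and pass the notebook's SparkSession).

```python
import os
import sys
import tempfile

# Stand-in for a cloned repo: write the module to a temp directory.
module_source = (
    "def run_spark_job(spark=None):\n"
    "    # In Databricks you would pass the notebook's SparkSession;\n"
    "    # a pure-Python fallback keeps this sketch self-contained.\n"
    "    data = {'a': 1, 'b': 2}\n"
    "    return {k: v * 2 for k, v in data.items()}\n"
)
repo_dir = tempfile.mkdtemp()
with open(os.path.join(repo_dir, "my_pyspark_code.py"), "w") as f:
    f.write(module_source)

# What the notebook cell would do (steps 4 and 5):
sys.path.append(repo_dir)
import my_pyspark_code

result = my_pyspark_code.run_spark_job()
print(result)  # {'a': 2, 'b': 4}
```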


7 REPLIES


Anonymous
Not applicable

Hi @Deepak Bhatt,

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

ThiagoLDC
New Contributor II

Hi all, 

I have a very similar problem. I can write code fine in my GitHub repo in Databricks Repos, but when I try to access it through an import statement I receive the error:

ModuleNotFoundError: No module named 'module_name'

When I check my environment with os.getcwd() I see that my default path is databricks/driver. However, when I copy the path of my repo from the file browser and pass it to os.chdir(), I receive the following error:

FileNotFoundError: [Errno 2] No such file or directory: "/Repos/repo_name"

Is there a quick fix for this? I usually don't have problems like this in VS Code or Jupyter notebooks.
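A quick way to narrow down a ModuleNotFoundError like this is to check where Python is actually looking. This is just a diagnostic sketch (the repo path below is hypothetical, not taken from the thread):

```python
import os
import sys

# Where am I, and where does Python search for modules?
print(os.getcwd())    # on a Databricks driver this is typically /databricks/driver
print(sys.path[:3])   # first entries of the import search path

# If the repo folder is not on sys.path, imports from it fail with
# ModuleNotFoundError even though the file exists. Appending the
# folder (path below is a hypothetical example) fixes the lookup:
repo_path = "/Workspace/Repos/some_user/repo_name"
if repo_path not in sys.path:
    sys.path.append(repo_path)
```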

Thanks!

ThiagoLDC
New Contributor II

I understood the error now. It was quite easy, actually: for me it was just a matter of putting the .py script on the same cluster as the notebook.

Now it's working fine.

ThiagoLDC
New Contributor II

Sorry, I think I actually got it wrong in the comment above. It worked, but I also had to upload the .py file to the DBFS file system. Still looking for a faster way to solve this issue.

shr_ath
New Contributor II

Hi @ThiagoLDC ,
In order to import a user-defined module, the .py file either needs to be in the same directory, or you can place the file in a Repo and import it from there.
In the notebook, you can import the code from the Repo like below:

import sys, os
sys.path.append(os.path.abspath('<module-path>'))
from <pyfilename> import <class/function>

For detailed documentation, refer to
https://docs.databricks.com/en/delta-live-tables/import-workspace-files.html
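Building on that pattern, one thing that helps when iterating on a repo file from a notebook is importlib.reload, so edits to the .py are picked up without reattaching. A self-contained sketch (helpers.py and double are made-up names, and a temp directory stands in for the Repo):

```python
import importlib
import os
import sys
import tempfile

# Stand-in for a Repo folder containing helpers.py.
repo = tempfile.mkdtemp()
with open(os.path.join(repo, "helpers.py"), "w") as f:
    f.write("def double(x):\n    return x * 2\n")

sys.path.append(os.path.abspath(repo))
from helpers import double
print(double(3))  # 6

# After editing the file, reload the module so the notebook sees
# the new code without restarting the interpreter:
with open(os.path.join(repo, "helpers.py"), "w") as f:
    f.write("def double(x):\n    return x + x\n")
import helpers
importlib.reload(helpers)
print(helpers.double(4))  # 8
```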

Gamlet
New Contributor II

Certainly! To write PySpark code in Databricks while maintaining a modular project in VS Code, organize your PySpark code into Python files, with a primary function encapsulating the main logic. Then upload these files to Databricks, create a Databricks notebook, and use the %run magic command (or a regular import, as described above) to execute the primary function from the uploaded Python files. This lets you keep the core logic outside of Databricks notebooks for better code organization and reusability.

Best wishes, Zpak