Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Working with Unity Catalog from VSCode using the Databricks Extension

pernilak
New Contributor III

Hi!

As suggested by Databricks, we work with Databricks from VSCode, using Databricks Asset Bundles for deployment and the VSCode Databricks extension together with Databricks Connect during development.

However, we are running into some limitations (that hopefully can be fixed). One of them concerns working with files from Unity Catalog using native Python.

E.g., using this code:

# my_file points at a file in a Unity Catalog volume (full path below)
with open(my_file, 'r', encoding='utf-8') as f:
    content = f.read()

When running this in the Databricks workspace, the file at the following path is read as expected:

/Volumes/<my catalog>/<my schema>/<my volume path>/<my file>.xsl

However, running it from VSCode, I get:

No such file or directory: /Volumes/<my catalog>/<my schema>/<my volume path>/<my file>.xsl

I know that the extension works so that Spark commands are executed on the attached cluster, while native Python runs on the local machine. However, shouldn't there be a way of forcing this to run on the cluster as well? It makes no sense to run it locally when I am trying to read a volume path.

I know that I can make the entire file "Run as a workflow in Databricks", but I would prefer to be able to run it cell by cell locally. I also know that if I change my code to use Spark commands, e.g. spark.read(...), it would work - but I don't think I should be forced to write my code differently just because I want to develop in VSCode, as suggested by Databricks.
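
For reference, a minimal sketch of the Spark-based version I mean (assuming a spark session is available, e.g. via Databricks Connect, and that the file is plain text; the path placeholders are the same as above):

path = "/Volumes/<my catalog>/<my schema>/<my volume path>/<my file>.xsl"
# spark.read.text() is evaluated on the attached cluster, so the volume path
# resolves there rather than on my local machine.
content = "\n".join(row.value for row in spark.read.text(path).collect())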

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @pernilak, it’s great that you’re using Databricks with Visual Studio Code (VSCode) for your development workflow!

Let’s address the limitations you’ve encountered when working with files from Unity Catalog using native Python.

  • When running Python code in Databricks, there’s a distinction between executing Spark commands (which run on the attached cluster) and native Python code (which runs on your local machine). In your case, reading a volume path using native Python locally doesn’t make sense, as you rightly pointed out.
  • The Databricks extension for VSCode allows you to write and run local Python, R, Scala, and SQL code on a remote Databricks workspace.
  • You can use this extension to interact with Databricks SQL warehouses, run notebooks, and more.
  • However, it doesn’t directly address the issue of reading volume paths using native Python.
  • Databricks Connect enables you to write, run, and debug local Python code on a remote Databricks workspace.
  • By using Databricks Connect, you can execute Python code on the cluster, which should help with reading volume paths.
  • You’ll need to set up Databricks Connect in your local environment and configure it to connect to your workspace and cluster (a local session setup is sketched after this list).
  • Databricks Asset Bundles (bundles) allow you to programmatically define, deploy, and run Databricks jobs, Delta Live Tables pipelines, and MLOps Stacks using CI/CD best practices.
  • While bundles primarily focus on job deployment, they might offer a way to handle your use case more effectively.
  • You mentioned that changing your code to use Spark commands (e.g., spark.read(...)) would work.
  • While it’s not ideal to modify your code just for the development environment, it might be a practical workaround for now.
  • Consider encapsulating the file reading logic in a utility function that abstracts away the differences between local and cluster execution. This way, you can keep your code consistent across both environments (a sketch of such a function follows below).
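
For example, a minimal sketch of creating a Databricks Connect session locally (assuming databricks-connect is installed and authentication is already configured, e.g. through a Databricks configuration profile or environment variables):

from databricks.connect import DatabricksSession

# Picks up the workspace host, cluster, and credentials from the local
# Databricks configuration (~/.databrickscfg or environment variables).
spark = DatabricksSession.builder.getOrCreate()

# DataFrame operations issued through this session run on the remote cluster,
# so /Volumes/... paths are resolved there rather than on your machine.
df = spark.read.text("/Volumes/<my catalog>/<my schema>/<my volume path>/<my file>.xsl")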
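
And a hedged sketch of such a utility function (the name read_text and the fallback logic are illustrative, not a Databricks API):

import os

def read_text(path: str, spark) -> str:
    """Read a text file from the local filesystem if the path exists locally;
    otherwise fall back to Spark, which resolves /Volumes/... paths on the cluster."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    # Read the file on the cluster and reassemble its lines locally.
    return "\n".join(row.value for row in spark.read.text(path).collect())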

rustam
New Contributor II

Thank you for the detailed reply, @Kaniz_Fatma, and for the great question, @pernilak!

I would also like to code and debug in VS Code while all the code in my Jupyter notebooks is executed on a Databricks cluster cell by cell, with access to the data in our Unity Catalog. As described in this Azure Databricks documentation, Databricks Connect runs only "code involving DataFrame operations on the cluster". Therefore, it seems not to address the original request, or am I missing something?

Is it possible to configure a Databricks cluster as a remote Python interpreter, so that all local code run through VS Code is executed on the remote Databricks cluster, just as if I had executed the code from a Databricks notebook?

Thank you very much in advance and best regards, 

Rustam 
