
Accessing UC Volumes using pure python (ML) with databricks-connect

KrzysztofPrzyso
New Contributor III

Hi All,

In my organization, we use Databricks Connect and VS Code for data engineering purposes. This setup is working great, especially for:

- Debugging
- Unit tests
- GitHub Copilot
- Reusable modules and custom libraries

In my view, the developer experience here is significantly better than in notebooks. It's crucial for us that the same code can run on both Databricks Connect/IDE and the Databricks cluster.

Following best practices, we use Unity Catalog for governance and access control, which has been working well, including with UC Volumes to manage access to unstructured data.

In the machine learning, AI, and LLM world, there is a trend towards using pure Python instead of PySpark.

The problem I am facing is accessing files using pure Python (for ML purposes) via Databricks Connect and UC Volumes. This works without issues with PySpark.

Ideally, I would like to avoid:

- Having different code for the cluster and Databricks Connect
- Needing to copy files between Volumes and the local machine (e.g., using the SDK dbutils)

On the cluster/notebook, I can directly open and read files from UC Volumes. I would like to have the same capability in Databricks Connect without requiring additional workarounds.
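
For illustration, something like the sketch below works as-is in a notebook or job on the cluster (the Volume path is just an example):

```python
# On a Databricks cluster, UC Volumes are exposed as regular paths under /Volumes.
with open("/Volumes/catalog/schema/volume/train.csv") as f:
    header = f.readline()
```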

1 REPLY

Louis_Frolio
Databricks Employee

Hey @KrzysztofPrzyso, so sorry this question got lost in the shuffle.

This is a known architectural limitation with Databricks Connect when working with pure Python file operations and UC Volumes. The issue stems from how Databricks Connect executes code differently depending on whether it's PySpark or pure Python.

Understanding the Core Issue

When you use Databricks Connect, PySpark operations are serialized and executed on the remote Databricks cluster, where the `/Volumes/` filesystem paths are directly accessible via FUSE mounting. However, pure Python file I/O operations (`open()`, `os.listdir()`, and other standard-library calls) execute locally on your development machine, where the `/Volumes/` paths simply don't exist as mounted filesystems.

This architectural difference means that while PySpark can seamlessly access UC Volumes through Databricks Connect, pure Python cannot without additional abstraction layers.
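
As a minimal sketch of that split (the Volume path is illustrative):

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Runs remotely on the cluster, where /Volumes is FUSE-mounted -> succeeds.
spark.read.text("/Volumes/catalog/schema/volume/file.txt").show(5)

# Runs locally on the developer machine, where the UC Volume path is not
# mounted -> raises FileNotFoundError.
open("/Volumes/catalog/schema/volume/file.txt")
```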

Current Workarounds

Unfortunately, there isn't a native solution that provides completely identical code execution between Databricks Connect and cluster environments for pure Python file operations. However, you have several options:

1. Create an Abstraction Layer

Build a lightweight wrapper that detects the execution environment and routes file operations accordingly:

```python
import io

def get_file_handle(volume_path, mode="r"):
    try:
        # Try direct access (works on the cluster, where /Volumes is FUSE-mounted)
        return open(volume_path, mode)
    except FileNotFoundError:
        # Fallback for Databricks Connect: read the file via the SDK Files API
        # (read-only sketch; writes would need the corresponding upload call).
        from databricks.sdk import WorkspaceClient

        w = WorkspaceClient()
        data = w.files.download(volume_path).contents.read()
        return io.BytesIO(data) if "b" in mode else io.StringIO(data.decode("utf-8"))
```
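
With that wrapper, the same call works on the cluster and over Databricks Connect (the Volume path below is illustrative):

```python
# Direct open() on the cluster, SDK download when running over Databricks Connect.
with get_file_handle("/Volumes/catalog/schema/volume/params.json") as fh:
    params_text = fh.read()
```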

2. Use PySpark for File Reading

Even in ML workflows, you can use PySpark to read files into memory, then work with pure Python:

```python
from pyspark.sql import SparkSession

# Returns the active session on a cluster; with Databricks Connect, a remote
# session (e.g. DatabricksSession from databricks.connect) serves the same role.
spark = SparkSession.builder.getOrCreate()

# Read file content using Spark; collect() returns a list of Row objects,
# one per line of the file, each with a single `value` column.
file_content = spark.read.text("/Volumes/catalog/schema/volume/file.txt").collect()
```

This approach works consistently across both environments since Databricks Connect properly handles PySpark operations.

3. Use External Volumes with Cloud URIs

If you're using external volumes, you can access the underlying cloud storage directly with the cloud provider SDKs (boto3 for AWS, azure-storage-blob for Azure). While this requires cloud credentials in your local environment, it provides consistent access patterns.
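
As a rough sketch, assuming an AWS-backed external volume (the bucket and prefix below are placeholders for your external location, not values taken from any workspace):

```python
import boto3

# Placeholders for the cloud storage backing the external Volume.
BUCKET = "my-external-location-bucket"
PREFIX = "raw/files"

s3 = boto3.client("s3")  # uses locally configured AWS credentials

# Read the object that the Volume exposes as /Volumes/<catalog>/<schema>/<volume>/file.txt
obj = s3.get_object(Bucket=BUCKET, Key=f"{PREFIX}/file.txt")
content = obj["Body"].read().decode("utf-8")
```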

4. Databricks SDK Files API

The Databricks SDK provides a Files API that can interact with Volumes programmatically, though this introduces the dependency you wanted to avoid.
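
A minimal sketch, assuming the databricks-sdk Files API and an illustrative Volume path:

```python
import io
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reuses your existing Databricks authentication

# Download a Volume file into memory over the REST API (path is illustrative).
resp = w.files.download("/Volumes/catalog/schema/volume/file.txt")
data = resp.contents.read()

# Write bytes back to the Volume.
w.files.upload("/Volumes/catalog/schema/volume/copy.txt", io.BytesIO(data), overwrite=True)
```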

The Reality

The fundamental challenge is that Databricks Connect was primarily designed for Spark-based workloads. Pure Python file I/O executes wherever the Python process lives, which with Databricks Connect is your local machine rather than the cluster, and that conflicts with remote filesystem access. Until Databricks provides a client-side filesystem driver or extends Databricks Connect to transparently proxy file operations, you'll need some form of abstraction or workaround.

For production ML workflows, many teams accept option #1 (abstraction layer) or #2 (using PySpark for initial file reading) as pragmatic solutions that maintain most of the developer experience benefits while ensuring code portability between local development and cluster execution.


Hope this helps, Louis.