Hey @KrzysztofPrzyso, so sorry this question got lost in the shuffle.
This is a known architectural limitation with Databricks Connect when working with pure Python file operations and UC Volumes. The issue stems from how Databricks Connect executes code differently depending on whether it's PySpark or pure Python.
Understanding the Core Issue
When you use Databricks Connect, PySpark operations are serialized and executed on the remote Databricks cluster where the `/Volumes/` filesystem paths are directly accessible via FUSE mounting. However, pure Python file I/O operations (like `open()`, `os.listdir()`, or standard library functions) execute locally on your development machine, where the `/Volumes/` paths simply don't exist as mounted filesystems.
This architectural difference means that while PySpark can seamlessly access UC Volumes through Databricks Connect, pure Python cannot without additional abstraction layers.
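To make the asymmetry concrete, here is a minimal sketch of what happens when the same (hypothetical) volume path is touched both ways from a local machine running Databricks Connect:
```python
from databricks.connect import DatabricksSession

# Hypothetical volume path, used only for illustration
path = "/Volumes/my_catalog/my_schema/my_volume/data.txt"

# PySpark call: serialized and executed on the remote cluster, where /Volumes is mounted
spark = DatabricksSession.builder.getOrCreate()
print(spark.read.text(path).count())

# Pure Python call: executed on the local machine, where /Volumes does not exist,
# so under Databricks Connect this raises FileNotFoundError
with open(path) as f:
    print(f.read())
```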
Current Workarounds
Unfortunately, there isn't a native solution that provides completely identical code execution between Databricks Connect and cluster environments for pure Python file operations. However, you have several options:
1. Create an Abstraction Layer
Build a lightweight wrapper that detects the execution environment and routes file operations accordingly:
```python
def get_file_handle(volume_path, mode='rb'):
    try:
        # Direct access works on the cluster, where /Volumes is FUSE-mounted
        return open(volume_path, mode)
    except FileNotFoundError:
        # Fallback for Databricks Connect: fetch the file through the SDK Files API
        from databricks.sdk import WorkspaceClient
        w = WorkspaceClient()
        # download() returns a response whose .contents is a binary stream,
        # so callers should treat the handle as binary in both branches
        return w.files.download(volume_path).contents
```
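Used from calling code, the wrapper keeps the call site identical in both environments (the path below is hypothetical, and the fallback branch always yields a binary stream):
```python
# Works on the cluster via open() and under Databricks Connect via the SDK fallback
handle = get_file_handle("/Volumes/my_catalog/my_schema/my_volume/config.json", mode="rb")
payload = handle.read()
```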
2. Use PySpark for File Reading
Even in ML workflows, you can use PySpark to read files into memory, then work with pure Python:
```python
# Use DatabricksSession under Databricks Connect; fall back to SparkSession on a cluster
try:
    from databricks.connect import DatabricksSession
    spark = DatabricksSession.builder.getOrCreate()
except ImportError:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

# Read file content using Spark; each returned Row holds one line in its `value` field
file_content = spark.read.text("/Volumes/catalog/schema/volume/file.txt").collect()
```
This approach works consistently across both environments since Databricks Connect properly handles PySpark operations.
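The same idea extends beyond text files. As a rough sketch, Spark's binaryFile source can pull the raw bytes of a (hypothetical) artifact into local memory for use with any pure Python library:
```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# binaryFile yields one row per file with path, modificationTime, length and content columns
row = (
    spark.read.format("binaryFile")
    .load("/Volumes/my_catalog/my_schema/my_volume/model.pkl")
    .collect()[0]
)
raw_bytes = row.content  # plain Python bytes, usable locally like any in-memory buffer
```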
3. Use External Volumes with Cloud URIs
If you're using external volumes, you can access the underlying cloud storage directly using cloud provider SDKs (boto3 for AWS, azure-storage-blob for Azure). While this requires cloud credentials in your local environment, it provides consistent access patterns.
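For example, on AWS a minimal boto3 sketch might look like the following; the bucket and key are hypothetical and would come from the external volume's storage location:
```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key backing the external volume
obj = s3.get_object(Bucket="my-company-bucket", Key="volumes/my_volume/file.txt")
content = obj["Body"].read().decode("utf-8")
```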
4. Databricks SDK Files API
The Databricks SDK provides a Files API that can interact with Volumes programmatically, though this introduces the dependency you wanted to avoid.
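As a rough sketch (the path is hypothetical, and the exact Files API surface depends on your databricks-sdk version), a download/upload round trip looks like this:
```python
import io
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
volume_path = "/Volumes/my_catalog/my_schema/my_volume/data.csv"  # hypothetical

# Download: the response exposes the file as a binary stream via .contents
data = w.files.download(volume_path).contents.read()

# Upload: write the bytes back to the volume, overwriting the existing file
w.files.upload(volume_path, io.BytesIO(data), overwrite=True)
```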
The Reality
The fundamental challenge is that Databricks Connect was primarily designed for Spark-based workloads. Pure Python file I/O executes locally by design, which conflicts with remote filesystem access. Until Databricks provides a client-side filesystem driver or extends Databricks Connect to transparently proxy file operations, you'll need some form of abstraction or workaround.
For production ML workflows, many teams accept option #1 (abstraction layer) or #2 (using PySpark for initial file reading) as pragmatic solutions that maintain most of the developer experience benefits while ensuring code portability between local development and cluster execution.
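If you go the abstraction-layer route, the environment check itself can be as small as the sketch below; it relies on the DATABRICKS_RUNTIME_VERSION environment variable that Databricks Runtime sets on cluster nodes, which is a common but not officially guaranteed signal:
```python
import os

def running_on_databricks_cluster() -> bool:
    # Set by Databricks Runtime on cluster nodes; absent on a local
    # machine that talks to the cluster through Databricks Connect
    return "DATABRICKS_RUNTIME_VERSION" in os.environ
```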
Hope this helps, Louis.