Hi All,
In my organization, we use Databricks Connect and VS Code for data engineering purposes. This setup is working great, especially for:
- Debugging
- Unit tests
- GitHub Copilot
- Reusable modules and custom libraries
In my view, the developer experience here is significantly better than in notebooks. It's crucial for us that the same code can run both in the IDE via Databricks Connect and on a Databricks cluster.
Following best practices, we use Unity Catalog for governance and access control, which has been working well, including with UC Volumes to manage access to unstructured data.
In the machine learning, AI, and LLM world, there is a trend towards using pure Python instead of PySpark.
The problem I am facing is reading files from UC Volumes with pure Python (for ML purposes) when running through Databricks Connect; the same access works without issues in PySpark, as the sketch below shows.
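To make the gap concrete, here is a minimal sketch of what I mean, assuming Databricks Connect v13+ and a hypothetical volume path `/Volumes/main/default/raw` (the catalog, schema, volume, and file names are placeholders):

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# PySpark: the read executes on the remote cluster, where the UC Volume
# path resolves, so this works fine under Databricks Connect.
spark.read.text("/Volumes/main/default/raw/sample.txt").show()

# Pure Python: open() runs on the local machine, where /Volumes/... is not
# mounted, so this raises FileNotFoundError under Databricks Connect.
with open("/Volumes/main/default/raw/sample.txt") as f:
    print(f.read())
```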
Ideally, I would like to avoid:
- Having different code for the cluster and Databricks Connect
- Needing to copy files between Volumes and the local machine (e.g., using the SDK or dbutils; see the sketch after this list)
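For illustration, this is the kind of copy-first workaround I would like to avoid, sketched with the Databricks SDK Files API (the paths are hypothetical):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Download the Volume file to local disk first (the step we'd like to skip).
resp = w.files.download("/Volumes/main/default/raw/sample.txt")
with open("/tmp/sample.txt", "wb") as local_copy:
    local_copy.write(resp.contents.read())

# Only after the copy can plain Python I/O read the data locally.
with open("/tmp/sample.txt") as f:
    print(f.read())
```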
On the cluster/notebook, I can open and read files from UC Volumes directly. I would like the same capability with Databricks Connect, without additional workarounds.
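For reference, this is what already works in a notebook on the cluster, because UC Volumes are FUSE-mounted there (path again hypothetical):

```python
# Runs as-is on a Databricks cluster: plain Python I/O against the Volume.
with open("/Volumes/main/default/raw/sample.txt") as f:
    data = f.read()
```

This is exactly the behavior I would like to have locally through Databricks Connect.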