@adhi_databricks: I want to add my perspective on pure local development (without Databricks Connect).
I wanted to set up a local development environment without connecting to a Databricks workspace or cloud storage, and develop PySpark code in VS Code using local Spark and GenAI. After local development, I wanted to deploy the code to a Databricks workspace, for Data Engineering only.
We faced the following challenges, whether or not we used GenAI.
1) Notebook Architecture
- We import notebooks using %run. Linting support in notebooks is minimal to none, so we hit syntax errors (and sometimes indentation issues) only at runtime. These should have been caught during the formatting and linting phase.
- Preparing coverage reports, running code quality checks with tools like Sonar, and executing test cases are all challenging.
- Linting, testing, code quality analysis and coverage checks must be done before the code is pushed to the Databricks workspace in the test or prod environment.
2) Strong dependencies on cloud storage, Delta and the metastore (Hive or UC)
- It is difficult or impossible to mock cloud storage, Delta or the metastore.
- We are unable to test all parts of the code because of these dependencies.
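To illustrate the kind of check that could catch the syntax and indentation errors mentioned under 1) before anything reaches a workspace, here is a minimal sketch. It assumes the code is kept as plain .py files in the repository; the function name `find_syntax_errors` is hypothetical, and a real setup would more likely rely on a proper linter and formatter.

```python
from pathlib import Path


def find_syntax_errors(root: str) -> list[str]:
    """Return one message per .py file under `root` that fails to compile."""
    problems = []
    for path in sorted(Path(root).rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        try:
            # compile() builds bytecode without running the file.
            # IndentationError is a subclass of SyntaxError, so both are caught.
            compile(source, str(path), "exec")
        except SyntaxError as exc:
            problems.append(f"{path}:{exc.lineno}: {exc.msg}")
    return problems
```

Such a function could be called from a pre-push hook or a CI step, failing the build whenever the returned list is not empty.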
We were able to overcome some of these issues using the following approach:
1) Create a pure Python package containing the framework or reusable code, i.e. code that does not change often. It is formatted, linted, tested and built into a wheel. This common package has its own lifecycle, and developers install it locally in a Python venv.
2) Develop the features/changes planned for the current release/sprint. These will eventually become a wheel, but for now they can live as separate modules in VS Code that import what they need from the common package. Code formatting, linting and unit testing can be automated for these changes, with GenAI help.
3) Dependency handling: a Delta Lake/Hive abstraction implemented using Docker. This is time-consuming, but possible. The Docker container can start when developers open VS Code, so that the dependencies are ready. It is not a smooth setup; there are issues with it.
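The idea of keeping feature code behind an abstraction, so that it can be unit tested without a real metastore, can be sketched roughly as below. This is only an illustration in plain Python with no Spark dependency, not the actual framework: `TableStore`, `InMemoryTableStore`, `publish_active_customers` and the table names are all hypothetical, and a real implementation would presumably pass Spark DataFrames rather than lists of dicts.

```python
from typing import Protocol

Row = dict  # simplified stand-in for a table row


class TableStore(Protocol):
    """Hypothetical interface that hides Delta/Hive/UC from feature code."""

    def read(self, table: str) -> list[Row]: ...

    def write(self, table: str, rows: list[Row]) -> None: ...


class InMemoryTableStore:
    """Test double: keeps tables in a dict instead of a metastore."""

    def __init__(self, tables=None):
        self.tables = dict(tables or {})

    def read(self, table: str) -> list[Row]:
        return list(self.tables[table])

    def write(self, table: str, rows: list[Row]) -> None:
        self.tables[table] = list(rows)


def publish_active_customers(store: TableStore) -> int:
    """Feature logic: copy active customers to a target table, return the count."""
    active = [row for row in store.read("customers") if row["active"]]
    store.write("active_customers", active)
    return len(active)
```

In a unit test, the feature function receives an `InMemoryTableStore` seeded with a few rows. In a deployed job, it would receive an implementation of the same interface backed by Spark and the real metastore.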
Hope this helps.