How are people developing python libraries for use within a team on databricks?
09-24-2021 03:11 PM
In the past, before Databricks, I would pull commonly used functions and features out of notebooks and save them in a Python library that the whole team would work on and develop. This allowed for good code reuse and helped maintain best practices within the team.
I did this with judicious use of `pip install -e .` and `%autoreload`, which let me work on a notebook and the library it depends on at the same time.
Is there a way to do this kind of development with Databricks? How do other people develop this kind of library for use with Databricks? Are people just doing mostly copy-paste development between different notebooks?
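For reference, the workflow I'm describing looks roughly like this in a plain Jupyter/IPython setup; the package and module names below are just placeholders, not real code:

```python
# Notebook cell sketch of the old (pre-Databricks) workflow.
# "team_datalib" / "cleaning" are placeholder names for the shared library.

# One-time, from a terminal in the library repo:
#   pip install -e .    # editable install: source edits take effect without reinstalling

# In the notebook, reload edited modules automatically before each cell runs:
%load_ext autoreload
%autoreload 2

from team_datalib.cleaning import standardise_columns  # placeholder import

# raw_df is whatever DataFrame you're working with in the notebook
df = standardise_columns(raw_df)  # picks up library edits without restarting the kernel
```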
Labels: Best Practices, Python, Python Libraries
09-27-2021 12:28 AM
This is a very interesting question.
We put reusable code into libraries using databricks-connect. So we develop the libraries in an IDE, package them into a library, and attach them to clusters (with Git in the mix, of course).
However, in my opinion this is a suboptimal approach, as a lot of code still resides in notebooks, so we end up with a mix of notebook code and packaged library code.
Also, databricks-connect lags behind the latest Databricks releases, which is a pity.
I would really like to have one single environment that combines the advantages of notebooks with the advantages of an IDE, like Microsoft does with VS Code.
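For anyone wondering what "package them into a library" means in practice, a minimal setup.py is all it takes; the sketch below is just an illustration, and the project and package names are placeholders rather than our actual code:

```python
# setup.py -- minimal packaging sketch; "team-datalib" / "team_datalib" are placeholder names
from setuptools import setup, find_packages

setup(
    name="team-datalib",
    version="0.1.0",
    packages=find_packages(exclude=["tests", "tests.*"]),
    # keep runtime dependencies aligned with the Databricks runtime you target
    install_requires=["pandas"],
)
```

From there we build a wheel (for example with `python setup.py bdist_wheel` or `python -m build`) and attach the resulting .whl to the cluster through the libraries UI or the Databricks CLI.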
09-27-2021 12:53 AM
The way we do this is to package as much reusable code as possible into a common library and then test it to within an inch of its life with unit tests (I tend to use unittest for the lower barrier to entry, but use whichever framework works best for you). This includes putting any user-defined functions or Spark API functions through unit tests with Spark running locally. We then have build pipelines in Azure DevOps (though this works with GitHub Actions as well) that lint, test, build, and then deploy the library to Databricks, where it can be pulled in by the notebooks. Ideally the notebooks are left to just read in and write out DataFrames, with the bulk of the work done in the libraries.
That's assuming use of notebooks, and not submitting whole jobs as jar/wheel files.
I've done a blog post on unit testing PySpark libraries which is online for anyone to read.
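To give a rough idea of what those local Spark unit tests look like, here is a small sketch using unittest; the transform under test is a made-up stand-in for a real library function:

```python
# test_transforms.py -- sketch of unit testing a PySpark function locally with unittest.
# "add_ingest_date" stands in for a function that would live in the shared library.
import unittest

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_ingest_date(df):
    """Example transform: append an ingest_date column."""
    return df.withColumn("ingest_date", F.current_date())


class AddIngestDateTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Local Spark session, no cluster required
        cls.spark = (
            SparkSession.builder
            .master("local[2]")
            .appName("library-unit-tests")
            .getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_adds_ingest_date_column(self):
        df = self.spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
        result = add_ingest_date(df)
        self.assertIn("ingest_date", result.columns)
        self.assertEqual(result.count(), 2)


if __name__ == "__main__":
    unittest.main()
```

The same pattern covers UDFs: call or register them inside the test and assert on the collected output.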
For linting I normally pull in the following libraries:
- flake8
- pep8-naming
- flake8-docstrings
- flake8-bandit
- flake8-eradicate
Between them these lint the code, enforce naming conventions, check that docstrings have been written properly, check for common security issues, and flag commented-out code. I get pretty brutal with the builds: if any tests fail, linting fails, or coverage drops below 90%, then the entire build fails.
I also wrote one on auto-generating documentation and publishing to an Azure Static Web App (with authentication) so anyone on the internal team can use it.
Hope that all helps
09-27-2021 01:43 AM
How do you manage your local Spark instance (version management, reinstalls, etc.)?
09-27-2021 04:24 AM
Easy: I don't have a central local Spark instance. Instead I work in Python virtual environments and install the version of pyspark that matches the Databricks runtime I'm building against. That gives me an environment-controlled version of Spark. I also pull in other packages to match the runtime. For example, for a library I'm currently building the requirements look like this:
# Project dependencies
pyspark==3.0.1
pandas==1.0.1
pyarrow==1.0.1
numpy==1.18.1
pytz==2019.3
pyodbc==4.0.31
These match the versions that come pre-installed with Databricks Runtime 7.3 LTS. If I need to target a newer runtime, I just update the project dependencies to match it.
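If it helps, a quick sanity check is to compare what's actually installed in the virtual environment against those pins; this little script is purely illustrative and not part of any real project:

```python
# check_env.py -- illustrative only: verify the virtual environment matches the
# pinned Databricks Runtime 7.3 LTS versions listed above.
import pkg_resources

PINNED = {
    "pyspark": "3.0.1",
    "pandas": "1.0.1",
    "pyarrow": "1.0.1",
    "numpy": "1.18.1",
    "pytz": "2019.3",
    "pyodbc": "4.0.31",
}

for package, expected in PINNED.items():
    installed = pkg_resources.get_distribution(package).version
    status = "OK" if installed == expected else f"MISMATCH (expected {expected})"
    print(f"{package}=={installed}  {status}")
```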
09-27-2021 04:29 AM
I see. I might try this out and maybe dockerize it, although it would be nice if Databricks provided Docker images of all their supported runtime versions.
09-27-2021 04:34 AM
I'd probably just avoid Docker entirely; I've run into loads of issues doing things that way. It's honestly easier to use venv, virtualenv, conda, etc. to create a Python virtual environment and run from there. Then you're not adding a few hundred MB for the OS base image as well.
I put a demo library up online as part of a talk I did last year; you're more than welcome to use it as a reference for this method if you'd like.
11-11-2022 06:08 AM
Hi, I could use some more suggestions on how to manage reusable code in Databricks. Thanks