How are people developing python libraries for use within a team on databricks?

spott_submittable
New Contributor II

In the past, before Databricks, I would pull commonly used functions and features out of notebooks and save them in a Python library that the whole team would work on and develop. This allowed for good code reuse and helped maintain best practices within the team.

I did this with judicious use of `pip install -e .` and `%autoreload` in the past, allowing me to simultaneously work on a notebook and the library that the notebook depends on.
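
For context, the workflow I'm describing looked roughly like this (the library path is just a placeholder):

# editable install of the shared library (path is a placeholder)
pip install -e ./team_lib

# at the top of a notebook, pick up edits to the library automatically
%load_ext autoreload
%autoreload 2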

Is there a way to do this kind of development with Databricks? How do other people develop this kind of library for use with Databricks? Are people just doing mostly copy-paste development between different notebooks?

9 REPLIES

Kaniz
Community Manager

Hi @spott_submittable! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise I will follow up with my team and get back to you soon. Thanks.

-werners-
Esteemed Contributor III

This is a very interesting question.

We put reusable code into libraries using databricks-connect: we develop the libraries in an IDE, package them, and attach them to clusters (with Git in the mix, of course).
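
As a rough sketch, packaging the shared code might look like this (the package name, version, and src/ layout are placeholders, not our actual setup):

# setup.py -- minimal sketch for building shared team code into a wheel
# (name, version, and layout are placeholders)
from setuptools import setup, find_packages

setup(
    name="team-shared-lib",
    version="0.1.0",
    package_dir={"": "src"},
    packages=find_packages(where="src"),
    install_requires=[],  # pin runtime deps to match the target Databricks runtime
)

Building it with `python setup.py bdist_wheel` (or `python -m build`) gives a wheel that can be uploaded and attached to a cluster as a library.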

However, this is, in my opinion, a suboptimal approach, because a lot of code still resides in notebooks, so we end up with a mix of notebook code and packaged library code.

Also, databricks-connect does not keep up with the latest Databricks releases, which is a pity.

I would really like to have a single environment that combines the advantages of notebooks with the advantages of an IDE, like Microsoft does with VS Code.

dazfuller
Contributor III

The way we do this is to package as much reusable code as possible into a common library and then test it to within an inch of its life with unit tests (I tend to use unittest for the lower barrier to entry, but use whichever framework works best for you). This includes putting any user-defined functions or Spark API functions through unit tests with Spark running locally. We then have build pipelines in Azure DevOps (though this works with GitHub Actions as well) that lint, test, build, and then deploy the library to Databricks, where it can be pulled in by the notebooks. Ideally the notebooks are left to just read in and write out data frames, with the bulk of the work done in the libraries.

That's assuming use of notebooks, and not submitting whole jobs as jar/wheel files.

I've done a blog post on unit testing PySpark libraries which is online for anyone to read.
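
To give a flavour of what such a test looks like, here's a minimal sketch using unittest and a local SparkSession (the module `team_lib.transforms` and the function `add_full_name` are hypothetical):

# test_transforms.py -- sketch of a unit test for a shared PySpark function
import unittest

from pyspark.sql import SparkSession


class TransformTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # local[*] session: no cluster needed, runs fine on a CI agent
        cls.spark = SparkSession.builder.master("local[*]").appName("tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_add_full_name(self):
        # hypothetical function from the team library
        from team_lib.transforms import add_full_name

        df = self.spark.createDataFrame([("Ada", "Lovelace")], ["first", "last"])
        result = add_full_name(df).collect()[0]
        self.assertEqual(result["full_name"], "Ada Lovelace")


if __name__ == "__main__":
    unittest.main()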

For linting I normally pull in the following libraries:

flake8

pep8-naming

flake8-docstrings

flake8-bandit

flake8-eradicate

These lint the code, enforce naming conventions, check that docstrings have been written properly, check for common security issues, and identify commented-out code. I get pretty brutal with the builds: if any test fails, linting fails, or coverage drops below 90%, then the entire build fails.
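
As a sketch, the pipeline's checks boil down to something like this (paths are placeholders; the flake8 plugins above are picked up automatically once installed):

flake8 src/ tests/
coverage run -m unittest discover -s tests
coverage report --fail-under=90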

I also wrote one on auto-generating documentation and publishing to an Azure Static Web App (with authentication) so anyone on the internal team can use it.

Hope that all helps

-werners-
Esteemed Contributor III

How do you manage your local Spark instance (version management, reinstalls, etc.)?

Easy: I don't have a local central Spark instance. Instead, I work in Python virtual environments and install the version of PySpark that matches the Databricks runtime I'm building against, which gives me an environment-controlled version of Spark. I also pull in other items as per the runtime. So, for example, for a current library I'm building, the requirements look like this:

# Project dependencies
pyspark==3.0.1
pandas==1.0.1
pyarrow==1.0.1
numpy==1.18.1
pytz==2019.3
pyodbc==4.0.31

And these match the versions which are pre-installed as part of Databricks Runtime 7.3 LTS. If I need to target a new version of the runtime then I just update the project dependencies as per that runtime.
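
As a sketch, that works out to one virtual environment per target runtime (names are placeholders):

# create and activate an environment for the runtime being targeted
python -m venv .venv-dbr73
source .venv-dbr73/bin/activate
pip install -r requirements.txt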

-werners-
Esteemed Contributor III

I see. I might try this out and maybe dockerize it, although it would be nice if Databricks provided Docker images of all their supported runtime versions.

I'd probably just avoid Docker entirely; I've run into loads of issues doing things that way. It's honestly easier to use venv, virtualenv, conda, etc. to create a Python virtual environment and run from there. Then you're also not adding a few hundred MB for the OS base image.

I put a demo library up online as part of a talk I did last year; you're more than welcome to use it as a reference for this method if you'd like:

https://github.com/dazfuller/demo-pyspark-lib

Kaniz
Community Manager

Hi @Andrew Spott, just a friendly follow-up. Do you still need help, or did the above responses help you find a solution? Please let us know.

Hi, I could use some more suggestions on how to manage reusable code in Databricks. Thanks.
