As a data scientist developing ML models in Python on Databricks, you likely use notebooks for your training experiments. The ML code you jot down in those notebooks can end up cluttered with unnecessary elements, which gets in the way of turning it into a production pipeline. So it becomes essential to tidy up your notebook(s). You can then either construct automated workflows with these notebooks or refactor them into modularized Python code, ensuring that an automated process can reproduce the training run and execute it regularly on new data.
The Databricks documentation has some high-level tips for this phase of work (see here); in this article, we expand on the “Refactor code” advice therein.
Here are our top tips for tidying up your notebook code:
As a companion to this article, we spoke to one of our most valued partners, Gavi Regunath, about this subject. You can watch the recording of that discussion on the Advancing Analytics YouTube channel.
Let’s start with an easy one: in your notebook, select the menu option “Edit > Format notebook”. This uses Black to format your code. Using a standard formatter makes code easier to read, helps align it with code style best practices, and removes unnecessary differences in code formatting.
Important: Before this step, think about whether you have hard-coded any cloud resource secrets or passwords in your code. Once these are committed, it is difficult or impossible to remove them from your VCS, and you will have to rotate the keys or change any impacted credentials immediately. To avoid this, never hard-code these values, even in early development: use Databricks Secrets instead!
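As a minimal sketch (the scope and key names below are placeholders), reading a credential from a Databricks secret scope at runtime looks like this:

```python
# Fetch the credential from a Databricks secret scope instead of hard-coding it.
storage_key = dbutils.secrets.get(scope="ml-team-secrets", key="storage-account-key")
```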
If you do not yet use version control for your notebook, or have only used the built-in (legacy) Git versioning on Databricks, make sure now that your notebook is checked into your organization’s Git repository on GitHub, GitLab, Azure DevOps, or similar. An extensive guide on Getting started with version control in Databricks is coming up soon in this MLOps Gym series.
When committing to a Git repository, be aware that your command comments and notebook outputs will not be saved. While there is a feature for saving outputs when using IPython notebooks (*.ipynb files), we will not use it in this context.
You have a few options for storing notebooks in a VCS, each with its own benefits and downsides. The table below compares them, considering whether outputs are stored, how easy the files are to diff, and whether they are rendered as notebooks (as opposed to raw source) on GitHub.
| | Databricks Notebook (*.py, “source”) | IPython notebook without outputs | IPython notebook with outputs |
|---|---|---|---|
| Outputs | no | no | yes |
| Diff | yes | yes | difficult with outputs |
| GitHub rendering | raw | yes | yes |
Our recommendation is to use IPython notebooks without outputs.
While the Databricks web IDE is a powerful development tool, you may accomplish certain refactoring tasks more easily in an IDE such as VSCode. The easiest way to do this is to check out the same repository in VSCode, edit your code there (without executing it), and make sure to push and pull when you switch contexts. Alternatively, the Databricks VSCode integration also allows execution against Databricks clusters.
If an IDE plugin, such as a linter, does not work directly with IPython notebooks, you can either (temporarily) convert your notebook to the Databricks notebook (“source”) format, or export your IPython notebook as a source file.
It is common to have the impression that your notebook “works”, only to realize that some cells rely on your short-term memory to run them in a particular order, skip some of them, or set parameters by hand. For your notebook to later be scheduled successfully (or used by another human), it is essential that it can run “hands off”, simply by choosing the command “Clear state and run all”.
Note that this is not entirely identical to running all cells manually in order – besides being easier to launch and less error-prone, it can catch errors when a cell’s output is not immediately available for the next cell (e.g. when creating a cloud resource through a long-running operation).
Therefore, our advice is to run your notebook at least once with “Clear state and run all” (under the “Run > Clear” menu option). Check for any errors, and ideally also review warnings and confirm that the outputs are still as you expected.
A few ways to programmatically output the Databricks Runtime (DBR) version are:
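Both of the following work in a Databricks notebook (a minimal sketch; `spark` is the notebook’s built-in SparkSession, and the exact values returned depend on your cluster):

```python
import os

# The runtime version is exposed as an environment variable on the driver ...
print(os.environ.get("DATABRICKS_RUNTIME_VERSION"))

# ... or can be read from the cluster's Spark configuration.
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
```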
You can obtain a comprehensive inventory of your installed Python libraries by capturing the %pip freeze output. Here's an example:
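(The sketch below is illustrative; the actual package list depends on your cluster, DBR, and any extra installs.)

```python
# Option 1 - in its own notebook cell:
# %pip freeze

# Option 2 - from Python, e.g. to keep a snapshot of the environment:
import subprocess, sys

frozen = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    capture_output=True, text=True, check=True,
).stdout
print(frozen)
```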
However, the issue here is that most of these libraries and versions are included by default in the Databricks Runtime (DBR) version you chose. So it is more practical to isolate just the deviations from the DBR libraries and versions as a single `%pip install` with exact version pinning, which will later become the source of your `requirements.txt` file:
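A sketch of such a notebook-scoped install (the library names and versions below are purely illustrative):

```python
# Only the libraries/versions that differ from, or add to, the DBR defaults --
# pinned exactly. These lines will later become your requirements.txt.
%pip install mlflow==2.9.2 scikit-learn==1.3.2 imbalanced-learn==0.12.0
```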
Note that in the Notebook UI, you can isolate these notebook-installed library versions by choosing the “Python Libraries” button in the right sidebar and filtering by “Type: Notebook”. Libraries of type “Runtime” are the ones included with the DBR, and “Type: Cluster” are cluster-installed libraries – you will want to copy these latter ones into your requirements.txt as well.
The last step is to create a requirements.txt file, in the same directory as your notebook, that collects all your library upgrades and additions on top of the DBR libraries, and to replace any cluster and notebook library installs with a single notebook call:
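Assuming requirements.txt sits next to the notebook in your repo (on recent DBRs the working directory defaults to the notebook’s folder; otherwise use the full workspace path), the single install cell can look like this:

```python
# One install cell on top of the DBR, sourced from the checked-in requirements file.
%pip install -r requirements.txt
```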
In an exploratory notebook, it sometimes makes sense to import libraries directly before their first use, to keep individual cells self-sufficient. However, since we are preparing to package our code as a Python module later, the imports need to move to the top, just after any pip installs. This also helps avoid some namespace conflicts.
You can use isort to automatically sort your imports into the canonical order. See the section on using IDEs.
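If you prefer to stay in Python rather than the command line or an IDE plugin, isort also exposes a small API (a sketch; the filename is illustrative):

```python
import isort

# Sort the imports of an exported notebook/module file in place.
isort.file("train_model.py")
```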
Understand which inputs your notebook consumes and which outputs it writes – for example, reading from Delta tables and registering MLflow models. These will also be the key input/output interfaces of your packaged Python code.
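As an illustration, the key interfaces of a training notebook often boil down to a handful of calls like the ones below (the table name, model name, and “label” column are made-up assumptions):

```python
import mlflow
from sklearn.linear_model import LogisticRegression

# Input: a feature table read from the catalog (name is illustrative).
features_df = spark.read.table("ml_catalog.churn.training_features")
pdf = features_df.toPandas()

# Train a simple model (assumes a "label" column plus numeric feature columns).
model = LogisticRegression(max_iter=1000).fit(pdf.drop("label", axis=1), pdf["label"])

# Output: the trained model logged and registered in MLflow.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn_classifier",
    )
```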
During interactive execution, you probably generated data outputs using statements like display(), show(), or print(). As you prepare your code for automated execution, these become runtime and code maintenance overheads, so you should remove them. Note that because Spark is evaluated lazily, not asking for a DataFrame to be displayed can make a substantial runtime difference – although if you need to write it out to a table, that will of course still trigger the relevant computations.
For informative messages that are still deemed useful, you can use Python’s logging module – but don’t move expensive operations into logging calls either, as they will still have to be computed.
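A minimal sketch of this pattern (the logger name and messages are illustrative):

```python
import logging

logger = logging.getLogger("training_pipeline")
logger.setLevel(logging.INFO)

# Cheap, structured messages instead of print()/display();
# avoid expensive expressions such as df.count() inside log calls.
logger.info("Loaded training data from %s", "ml_catalog.churn.training_features")
logger.info("Model training finished")
```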
Now that you have removed the display output, it is less visible whether your code actually generates the outputs you expect. If your data output is reasonably deterministic, write some assert statements to check it. These can be about the number of columns, rows, etc. – not necessarily exact values.
These assert statements will come in handy later when you write unit tests against your modularized code.
Just make sure that the assert statements themselves do not take up substantial time, or wrap them in an “if DEBUG” flag.
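For instance, a lightweight check on the notebook’s output DataFrame could look like this (the column names and the DEBUG flag are illustrative):

```python
DEBUG = True  # turn off for scheduled runs if the checks become expensive

def check_outputs(predictions_df) -> None:
    """Lightweight sanity checks on the notebook's output DataFrame."""
    if not DEBUG:
        return
    expected_columns = {"customer_id", "churn_probability"}
    assert expected_columns.issubset(set(predictions_df.columns)), "missing output columns"
    assert predictions_df.count() > 0, "prediction output is empty"
```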
Now that you have assert statements checking your outputs and have removed pure print statements, you can more easily spot which parts of your code are no-ops from the perspective of those outputs and can be removed.
Many linters and type checkers, such as Pylance and Pylint, can highlight code that has no effect or is unreachable. To take advantage of these, you might need to open your notebook in an IDE (see the section on using IDEs).
One advantage of using notebook formats over Python source files is the ability to quickly review the running time of each cell. If certain cells take several minutes to execute, you have probably already noticed them during your work. In such cases, it's worth refactoring those cells, for example by avoiding loops and breaking the task into smaller, more manageable steps. This not only improves the efficiency of your code but also enhances its readability and maintainability.
Remove any line or cell magics (except for %pip install if necessary), as they will not work in modularized Python code.
Some examples:
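For illustration, here are a few common magics and plain-Python replacements you might swap in (paths and commands are made up; adapt them to your own code):

```python
# %sh ls /tmp          ->  run shell commands via subprocess
import subprocess
print(subprocess.run(["ls", "/tmp"], capture_output=True, text=True).stdout)

# %fs ls dbfs:/tmp     ->  call dbutils directly
files = dbutils.fs.ls("dbfs:/tmp")

# %sql SELECT ...      ->  use spark.sql()
df = spark.sql("SELECT current_date() AS today")

# %run ./helpers       ->  import the shared code as a module instead
# from helpers import load_features
```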
Remove any hard-coded variables and, if needed, pull them up to just below the imports as (capitalized) constant parameters.
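A small before/after illustration (the names and values are made up):

```python
# Before: values scattered and hard-coded throughout the notebook
# model = RandomForestClassifier(n_estimators=200, max_depth=8)

# After: constants declared once, just below the imports
N_ESTIMATORS = 200
MAX_DEPTH = 8
TRAINING_TABLE = "ml_catalog.churn.training_features"
```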
While the full modularization of our code can wait until later, if you have repetitive code, it is good to consolidate it by defining functions.
You can use pylint to automatically find repetitive code (its “duplicate-code” check).
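One way to run just that check, assuming pylint is installed (you would more typically invoke it from a terminal or CI, and the file paths here are illustrative):

```python
from pylint.lint import Run

# Enable only the duplicate-code checker (R0801); exit=False keeps pylint
# from terminating the Python interpreter when it finishes.
Run(
    ["--disable=all", "--enable=duplicate-code", "train_model.py", "evaluate_model.py"],
    exit=False,
)
```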
With the above steps, you have come much closer to the ideal of a single end-to-end notebook that can be executed automatically (parametrized, if necessary) an arbitrary number of times to build your model. Once you have tidied up your notebook code, you can convert it directly into a job (workflow) using the Databricks UI, API, SDK, CLI, or Terraform provider.
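As a sketch of the SDK route (the job name, notebook path, and cluster ID below are placeholders; the UI, CLI, or Terraform achieve the same thing):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Schedule the tidied-up notebook as a single-task job.
job = w.jobs.create(
    name="churn-model-training",
    tasks=[
        jobs.Task(
            task_key="train",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/ml/churn/train_model"),
            existing_cluster_id="1234-567890-abcde123",
        )
    ],
)
print(f"Created job {job.job_id}")
```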
In an upcoming article, we'll explore packaging a Python module and utilizing the pre-commit tool.
Next blog in this series: MLOps Gym - Advanced MLflow Guide for LLMs (Evaluate)