As a data scientist developing ML models in Python on Databricks, you likely use notebooks for your training experiments. The ML code you jot down in those notebooks can end up cluttered with unnecessary elements, which gets in the way of turning it into a production pipeline. So it becomes essential to tidy up your notebook(s). You can then either construct automated workflows with these notebooks or refactor them into modularized Python code, ensuring that an automated process can reproduce the training run and execute it regularly on new data.
The Databricks documentation has some high-level tips for this phase of work (see here); in this article, we expand on the “Refactor code” advice therein.
Here are our top tips for tidying up your notebook code:
As a companion to this article, we spoke to one of our most valued partners, Gavi Regunath, about this subject. You can watch the recording of that discussion on the Advancing Analytics YouTube channel.
Let’s start with an easy one: in your notebook, select the menu option “Edit > Format notebook”. This uses Black to format your code. Using a standard formatter makes code easier to read, helps align it with code style best practices, and removes unnecessary differences in code formatting.
Important: Before this step, think about whether you have hard-coded any cloud resource secrets or passwords in your code. Once these are committed, it is difficult or impossible to remove them from your VCS, and you will have to rotate the keys or change any impacted credentials immediately. To avoid this, never hard-code these values, even in early development: use Databricks Secrets instead!
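As a minimal sketch (the scope and key names below are placeholders), reading a credential from a Databricks secret scope at runtime looks like this:

```python
# Fetch the credential from a Databricks secret scope instead of hard-coding it.
storage_key = dbutils.secrets.get(scope="ml-team-secrets", key="storage-account-key")
```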
If you do not yet use version control for your notebook, or have only used the built-in (legacy) Git versioning on Databricks, make sure now that your notebook is checked into your organization’s Git repository on GitHub, GitLab, Azure DevOps, or similar. An extensive guide on Getting started with version control in Databricks is coming up soon in this MLOps Gym series.
When committing to a Git repository, be aware that your command comments and notebook outputs will not be saved. While there is a feature for saving outputs when using IPython notebooks (*.ipynb files), we will not use it in this context.
You have a few options for storing notebooks in a VCS, each with its own benefits and downsides. The table below compares them, considering whether outputs are stored, how easy the files are to diff, and whether they are rendered as notebooks (as opposed to raw source) on GitHub.
| | Databricks Notebook (*.py, “source”) | IPython notebook without outputs | IPython notebook with outputs |
|---|---|---|---|
| Outputs | no | no | yes |
| Diff | yes | yes | difficult with outputs |
| GitHub rendering | raw | yes | yes |
Our recommendation is to use IPython notebooks without outputs.
While the Databricks web IDE is a powerful development tool, you may accomplish certain refactoring tasks more easily in an IDE such as VSCode. The easiest way to do this is to check out the same repository in VSCode, edit your code there (without executing it), and make sure to push and pull when you switch contexts. Alternatively, the Databricks VSCode integration also allows execution against Databricks clusters.
If an IDE plugin, such as a linter, does not work directly with IPython notebooks, you can either (temporarily) convert your notebook to the Databricks notebook (“source”) format, or export your IPython notebook as a source file.
It is common to have the impression that your notebook “works”, only to realize that some cells rely on your short-term memory to run them in a particular order, skip some of them, or set parameters by hand. For your notebook to later be scheduled successfully (or used by another human), it is essential that it can run “hands off”, simply by choosing the command “Clear state and run all”.
Note that this is not entirely identical to running all cells manually in order – besides being easier to launch and less error-prone, it can catch errors when a cell’s output is not immediately available for the next cell (e.g. when creating a cloud resource through a long-running operation).
Therefore, our advice is to run your notebook at least once with “Clear state and run all” (under the “Run > Clear” menu option). Check for any errors, and ideally also review warnings and confirm that the outputs are still as you expected.
A few ways to programmatically output the Databricks Runtime (DBR) version are:
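Both of the following work in a Databricks notebook (a minimal sketch; `spark` is the notebook’s built-in SparkSession, and the exact values returned depend on your cluster):

```python
import os

# The runtime version is exposed as an environment variable on the driver ...
print(os.environ.get("DATABRICKS_RUNTIME_VERSION"))

# ... or can be read from the cluster's Spark configuration.
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
```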
You can obtain a comprehensive inventory of your installed Python libraries by capturing the %pip freeze output. Here's an example:
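(The sketch below is illustrative; the actual package list depends on your cluster, DBR, and any extra installs.)

```python
# Option 1 - in its own notebook cell:
# %pip freeze

# Option 2 - from Python, e.g. to keep a snapshot of the environment:
import subprocess, sys

frozen = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    capture_output=True, text=True, check=True,
).stdout
print(frozen)
```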
However, the issue here is that most of these libraries and versions are included by default in the Databricks Runtime (DBR) version you chose. So it is more practical to isolate just the deviations from the DBR libraries and versions as a single `%pip install` with exact version pinning, which will later become the source of your `requirements.txt` file:
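A sketch of such a notebook-scoped install (the library names and versions below are purely illustrative):

```python
# Only the libraries/versions that differ from, or add to, the DBR defaults --
# pinned exactly. These lines will later become your requirements.txt.
%pip install mlflow==2.9.2 scikit-learn==1.3.2 imbalanced-learn==0.12.0
```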
Note that in the Notebook UI, you can isolate these notebook-installed library versions by choosing the “Python Libraries” button in the right sidebar and filtering by “Type: Notebook”. Libraries of type “Runtime” are the ones included with the DBR, and “Type: Cluster” are cluster-installed libraries – you will want to copy these latter ones into your requirements.txt as well.
The last step is to create a requirements.txt file, in the same directory as your notebook, that collects all your library upgrades and additions on top of the DBR libraries, and to replace any cluster and notebook library installs with a single notebook call:
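Assuming requirements.txt sits next to the notebook in your repo (on recent DBRs the working directory defaults to the notebook’s folder; otherwise use the full workspace path), the single install cell can look like this:

```python
# One install cell on top of the DBR, sourced from the checked-in requirements file.
%pip install -r requirements.txt
```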
In an exploratory notebook, it sometimes makes sense to import libraries directly before their first use, to keep individual cells self-sufficient. However, since we are preparing to package our code as a Python module later, the imports need to move to the top, just after any pip installs. This also helps avoid some namespace conflicts.
You can use isort to automatically sort your imports into the canonical order. See the section on using IDEs.
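If you prefer to stay in Python rather than the command line or an IDE plugin, isort also exposes a small API (a sketch; the filename is illustrative):

```python
import isort

# Sort the imports of an exported notebook/module file in place.
isort.file("train_model.py")
```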
Understand which inputs your notebook consumes and which outputs it writes – for example, reading from Delta tables and registering MLflow models. These will also be the key input/output interfaces of your packaged Python code.
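As an illustration, the key interfaces of a training notebook often boil down to a handful of calls like the ones below (the table name, model name, and “label” column are made-up assumptions):

```python
import mlflow
from sklearn.linear_model import LogisticRegression

# Input: a feature table read from the catalog (name is illustrative).
features_df = spark.read.table("ml_catalog.churn.training_features")
pdf = features_df.toPandas()

# Train a simple model (assumes a "label" column plus numeric feature columns).
model = LogisticRegression(max_iter=1000).fit(pdf.drop("label", axis=1), pdf["label"])

# Output: the trained model logged and registered in MLflow.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn_classifier",
    )
```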
During interactive execution, you probably generated data outputs using statements like display(), show(), or print(). As you prepare your code for automated execution, these become runtime and code maintenance overheads, so you should remove them. Note that because Spark is evaluated lazily, not asking for a DataFrame to be displayed can make a substantial runtime difference – although if you need to write it out to a table, that will of course still trigger the relevant computations.
For informative messages that are still deemed useful, you can use Python’s logging module – but don’t move expensive operations into logging calls either, as they will still have to be computed.
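A minimal sketch of this pattern (the logger name and messages are illustrative):

```python
import logging

logger = logging.getLogger("training_pipeline")
logger.setLevel(logging.INFO)

# Cheap, structured messages instead of print()/display();
# avoid expensive expressions such as df.count() inside log calls.
logger.info("Loaded training data from %s", "ml_catalog.churn.training_features")
logger.info("Model training finished")
```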
Now that you have removed the display output, it is less visible whether your code actually generates the outputs you expect. If your data output is reasonably deterministic, write some assert statements to check it. These can be about the number of columns, rows, etc. – not necessarily exact values.
These assert statements will come in handy later when you write unit tests against your modularized code.
Just make sure that the assert statements themselves do not take up substantial time, or wrap them in an “if DEBUG” flag.
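For instance, a lightweight check on the notebook’s output DataFrame could look like this (the column names and the DEBUG flag are illustrative):

```python
DEBUG = True  # turn off for scheduled runs if the checks become expensive

def check_outputs(predictions_df) -> None:
    """Lightweight sanity checks on the notebook's output DataFrame."""
    if not DEBUG:
        return
    expected_columns = {"customer_id", "churn_probability"}
    assert expected_columns.issubset(set(predictions_df.columns)), "missing output columns"
    assert predictions_df.count() > 0, "prediction output is empty"
```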
Now that you have assert statements checking your outputs and have removed pure print statements, you can more easily spot which parts of your code are no-ops from the perspective of those outputs and can be removed.
Many linters and type checkers, such as Pylance and Pylint, can highlight code that has no effect or is unreachable. To take advantage of these, you might need to open your notebook in an IDE (see the section on using IDEs).
One advantage of using notebook formats over Python source files is the ability to quickly review the running time of each cell. If certain cells take several minutes to execute, you have probably already noticed them during your work. In such cases, it's worth refactoring those cells, for example by avoiding loops and breaking the task into smaller, more manageable steps. This not only improves the efficiency of your code but also enhances its readability and maintainability.
Remove any line or cell magics (except for %pip install if necessary), as they will not work in modularized Python code.
Some examples:
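For illustration, here are a few common magics and plain-Python replacements you might swap in (paths and commands are made up; adapt them to your own code):

```python
# %sh ls /tmp          ->  run shell commands via subprocess
import subprocess
print(subprocess.run(["ls", "/tmp"], capture_output=True, text=True).stdout)

# %fs ls dbfs:/tmp     ->  call dbutils directly
files = dbutils.fs.ls("dbfs:/tmp")

# %sql SELECT ...      ->  use spark.sql()
df = spark.sql("SELECT current_date() AS today")

# %run ./helpers       ->  import the shared code as a module instead
# from helpers import load_features
```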
Remove any hard-coded variables and, if needed, pull them up to just below the imports as (capitalized) constant parameters.
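A small before/after illustration (the names and values are made up):

```python
# Before: values scattered and hard-coded throughout the notebook
# model = RandomForestClassifier(n_estimators=200, max_depth=8)

# After: constants declared once, just below the imports
N_ESTIMATORS = 200
MAX_DEPTH = 8
TRAINING_TABLE = "ml_catalog.churn.training_features"
```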
While the full modularization of our code can wait until later, if you have repetitive code, it is good to consolidate it by defining functions.
You can use pylint to automatically find repetitive code (its “duplicate-code” check).
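One way to run just that check, assuming pylint is installed (you would more typically invoke it from a terminal or CI, and the file paths here are illustrative):

```python
from pylint.lint import Run

# Enable only the duplicate-code checker (R0801); exit=False keeps pylint
# from terminating the Python interpreter when it finishes.
Run(
    ["--disable=all", "--enable=duplicate-code", "train_model.py", "evaluate_model.py"],
    exit=False,
)
```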
With the above steps, you have come much closer to the ideal of a single end-to-end notebook that can be executed automatically (parametrized, if necessary) an arbitrary number of times to build your model. Once you have tidied up your notebook code, you can convert it directly into a job (workflow) using the Databricks UI, API, SDK, CLI, or Terraform provider.
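As a sketch of the SDK route (the job name, notebook path, and cluster ID below are placeholders; the UI, CLI, or Terraform achieve the same thing):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Schedule the tidied-up notebook as a single-task job.
job = w.jobs.create(
    name="churn-model-training",
    tasks=[
        jobs.Task(
            task_key="train",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/ml/churn/train_model"),
            existing_cluster_id="1234-567890-abcde123",
        )
    ],
)
print(f"Created job {job.job_id}")
```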
In an upcoming article, we'll explore packaging a Python module and utilizing the pre-commit tool.
Next blog in this series: MLOps Gym - Advanced MLflow Guide for LLMs (Evaluate)