
pkgutil walk_packages stopped working in DBR 17.2

Maxrb
New Contributor

Hi,

After moving from Databricks Runtime 17.1 to 17.2, my pkgutil.walk_packages suddenly doesn't identify any packages within my repository anymore.

This is my example code:

import pkgutil
import os

packages = pkgutil.walk_packages([os.getcwd()])
print(list(packages))

Previously it found all my packages, but since the update to 17.2 it doesn't work anymore.

7 REPLIES

Louis_Frolio
Databricks Employee

Hello @Maxrb, I did some digging on my end and have some suggestions and hints to help you troubleshoot this further.

What you're running into lines up with a few runtime-specific behaviors that changed around Databricks Runtime 17.x, and together they explain why package discovery suddenly went quiet after the move to 17.2.

What likely changed

First, the current working directory on Databricks is the directory of the running notebook or script, not necessarily your repo root. If your packages live somewhere else, say at the repo root or under a src folder, then pkgutil.walk_packages([os.getcwd()]) will simply never see them. It's scanning the wrong place.

Second, when you're importing Python code from workspace files or Git folders that live outside the notebook's directory, you need to be explicit about sys.path. The root of a Git folder is automatically added, but subdirectories are not. And if you're working with workspace files, the path you append must include the /Workspace/... prefix. If Python can't see the directory, pkgutil won't either.

Finally, across the 17.x line there were changes to Python import hooks that tightened up how workspace paths are handled. A related issue showed up in 17.3 with wheel tasks, but even in 17.2 the behavior is more strict and predictable. Code that implicitly relied on os.getcwd() pointing at the repo root can now fail if the notebook lives in a subfolder.
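One quick way to see what the import machinery is actually configured with is to print it. This is just a standard-library diagnostic sketch, nothing Databricks-specific assumed:

import sys

# Show which directories Python will import from, and which path hooks
# are registered (on DBR 17.x you may see a workspace-specific hook here).
for entry in sys.path:
    print("path:", entry)
for hook in sys.path_hooks:
    print("hook:", hook)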

Quick sanity checks

Before changing anything, it's worth confirming what Python thinks is going on:

Print the working directory and its contents:

print(os.getcwd())

print(os.listdir(os.getcwd()))

This tells you immediately whether you're scanning a directory that actually contains your packages.

Also double-check that your packages include an __init__.py. pkgutil.walk_packages only discovers classic packages; it won't enumerate PEP 420 namespace packages.
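If you want to check that quickly, here is a small sketch (standard library only, scanning the same directory as your example) that flags candidate folders missing an __init__.py:

import os

# Flag directories under the scan root that lack an __init__.py; these
# "look" like packages but walk_packages will not report them.
scan_root = os.getcwd()
for name in sorted(os.listdir(scan_root)):
    pkg_dir = os.path.join(scan_root, name)
    if os.path.isdir(pkg_dir) and not os.path.isfile(os.path.join(pkg_dir, "__init__.py")):
        print(f"{name}: no __init__.py, walk_packages will skip it")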

Recommended fixes

Which fix you choose really depends on where your code lives.

Option 1: Point pkgutil directly at your repo code (my preferred approach)

If your packages live under something like /Workspace/Repos/<user>/<repo>/src, be explicit. Add that directory to sys.path and walk it directly:

import os
import sys
import pkgutil

repo_root = "/Workspace/Repos/<user>/<repo>"
src_dir = os.path.join(repo_root, "src")  # or repo_root if you don't use src/

if src_dir not in sys.path:
    sys.path.append(src_dir)

packages = list(pkgutil.walk_packages([src_dir]))
print(packages)

This removes all ambiguity about what you're scanning and what Python can import.

Option 2: Let sys.path do the work

If your notebook lives at the Git folder root (not nested), that root is already on sys.path. In that case you can just let pkgutil walk everything Python already knows about:

import pkgutil

packages = list(pkgutil.walk_packages())
print(packages)

This only works if your layout is clean and flat, but when it applies, it's the simplest solution.

Option 3: Compute the repo root from the notebook location

If your notebook is nested a few levels down, compute the repo root relative to the working directory and add it:

import os
import sys
import pkgutil

cwd = os.getcwd()
repo_root = os.path.dirname(os.path.dirname(cwd))  # adjust depth as needed

if repo_root not in sys.path:
    sys.path.append(repo_root)

packages = list(pkgutil.walk_packages([repo_root]))
print(packages)

Why os.getcwd() started betraying you

In 17.x, Databricks is much more consistent about setting CWD to the notebook's directory. If your code used to run from a location that happened to be the repo root, and now runs from a subfolder, then walk_packages([os.getcwd()]) will return nothing because it's doing exactly what you asked: scanning the wrong directory.

That behavior lines up with the documented CWD semantics and the newer guidance around workspace files and Git folders. Nothing is "broken" so much as more strictly defined.

Hope these tips get you over the finish line.

Cheers, Lou.

Thanks for the detailed answer @Louis_Frolio,

Unfortunately, none of this is working. I have a notebook in my repo root, I checked sys.path and the cwd, and I tried all the options you mentioned, but it still doesn't work in DBR 17.2+.

Simply put, I see all the folders in listdir, but somehow it doesn't pick up any packages.

Do you not experience the same with local packages?

 

Cheers,

Max

Louis_Frolio
Databricks Employee

Hmmm, I have not personally experienced this. I dug a little deeper into our internal docs and leveraged some internal tools to put together another approach for you. Please give this a try and let me know.

You're running into a subtle but very real behavior change in Databricks Runtime 17.2, and it shows up most clearly when using pkgutil.walk_packages() with the current working directory.

This isn't your code suddenly "breaking." It's the interaction between Python's import system and how DBR 17.2 (now on Python 3.12) treats discovery paths.

Let's walk through it.

The root cause

pkgutil.walk_packages() doesn't just crawl a filesystem path. It expects that path to behave like a real Python import location:

- The directory must contain proper packages (__init__.py)

- And just as importantly, the directory must be reachable through Python's import machinery

In DBR 17.2, relying on os.getcwd() alone is no longer sufficient. Even if the files are there, Python won't reliably discover them unless that directory is also present on sys.path. Earlier runtimes were more forgiving; Python 3.12 is not.

That's why walk_packages() suddenly appears to return nothing.

The most reliable fix

Option 1: Explicitly add the directory to sys.path

This aligns your filesystem view with Pythonโ€™s import system and works consistently:

import pkgutil
import os
import sys

cwd = os.getcwd()
if cwd not in sys.path:
    sys.path.insert(0, cwd)

packages = pkgutil.walk_packages([cwd])
print(list(packages))

This is the safest pattern and the one I recommend in most Databricks notebooks.

A cleaner alternative for repos

Option 2: Use an absolute workspace path

If your code lives in a repo or workspace folder, be explicit about where packages live instead of relying on the notebook's working directory:

import pkgutil
import os

repo_path = os.path.abspath("/Workspace/path/to/your/repo")
packages = pkgutil.walk_packages([repo_path])
print(list(packages))

This avoids ambiguity entirely and plays nicely with Git folders and workspace imports.

One more thing to double-check

Make sure your package structure is real Python, not just folders that "look" like packages.

Every directory you expect to be discovered must include an __init__.py. Python 3.12 is noticeably stricter here, and DBR 17.2 surfaces that reality.
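If you want to verify the difference, here is a small sketch; mypkg is a hypothetical package name, so substitute one of yours. It compares whether Python can import the package at all with whether pkgutil will actually enumerate it from the working directory:

import importlib.util
import os
import pkgutil

pkg_name = "mypkg"  # hypothetical name, replace with one of your packages

spec = importlib.util.find_spec(pkg_name)
print("importable:", spec is not None)
# A classic package has an __init__.py as its origin; a namespace package does not.
print("classic package:", spec is not None and spec.origin is not None and spec.origin.endswith("__init__.py"))
print("discovered by pkgutil:", any(m.name == pkg_name for m in pkgutil.iter_modules([os.getcwd()])))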

Why this showed up in 17.2

DBR 17.2 includes a Python upgrade to 3.12.x, along with internal changes to import handling. pkgutil.walk_packages() has always required paths to be importable, but earlier runtimes were more lenient when the current working directory happened to work by accident.

In short:

What used to work implicitly now needs to be explicit.

That's not a regression; it's Python behaving the way it has always been documented.

Regards, Louis.

Hi @Louis_Frolio,

Unfortunately, whatever I do, adding all the paths etc. and trying all your solutions, it simply doesn't work. When I run pkgutil on, for instance, the pyspark.sql package's __path__, it works fine. It looks like it doesn't find anything inside the workspace, while in DBR <17.2 all of these things were working. I don't see any files being discovered whatsoever; it just returns an empty list.

I'm a bit lost as to what could be happening here. I tried it inside a repo and in a normal workspace folder, but no matter what I try it always returns an empty list when the "package" is inside my workspace.

Maxrb
New Contributor

I did a bit of a deep dive into the source code of pkgutil.walk_packages, and I noticed this happening:

import os
import sys

# Simplified excerpt of pkgutil.get_importer: the finder for a path entry
# is looked up in sys.path_importer_cache (populated via sys.path_hooks).
def get_importer(path_item):
    path_item = os.fsdecode(path_item)
    try:
        importer = sys.path_importer_cache[path_item]
    except KeyError:
        importer = None  # the full implementation falls back to sys.path_hooks here
    return importer

For a given path like `/Workspace/Repos/<user>/<repo>`, in DBR <17.2 this returns a normal FileFinder object; on >=17.2 it returns <dbruntime.workspace_import_machinery._WorkspacePathEntryFinder object at 0x.....>.

Looking further, this means it will never find any files and thus won't work for imports within the repo.
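A minimal way to see this on both runtimes (the repo path is the same placeholder as above):

import importlib.machinery
import pkgutil

path = "/Workspace/Repos/<user>/<repo>"  # same placeholder path as above

finder = pkgutil.get_importer(path)
print(type(finder))                                        # FileFinder on <17.2, _WorkspacePathEntryFinder on >=17.2
print(isinstance(finder, importlib.machinery.FileFinder))  # pkgutil has a dedicated module iterator only for FileFinder
print(hasattr(finder, "iter_modules"))                     # any other finder is enumerated only if it exposes iter_modules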

 

Louis_Frolio
Databricks Employee

Hey @Maxrb,

Just thinking out loud here, but this might be worth experimenting with.

You could try using a Unity Catalog Volume as a lightweight package repository. Volumes can act as a secure, governed home for Python wheels (and JARs), and Databricks explicitly supports installing libraries directly from volume paths onto clusters, notebooks, and jobs. In fact, for UC-enabled workspaces, volumes are the recommended pattern for this exact use case.
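A minimal sketch of that pattern, assuming you have already built a wheel and uploaded it to a volume (the catalog, schema, volume, and wheel name below are placeholders):

%pip install /Volumes/<catalog>/<schema>/<volume>/my_package-0.1.0-py3-none-any.whl

After installing, you would typically restart Python in the notebook (dbutils.library.restartPython()) so the freshly installed package is importable.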

Just a thought.

Cheers, Lou.

 Hey @Louis_Frolio,

Thanks for thinking along. The whole idea is that this package is not installed as a JAR, wheel, or anything else; it's a living module in my repository. For production I don't think I will have this issue, since I install my repo as a wheel file using Databricks Asset Bundles and I expect it to still be discovered using pkgutil, but currently it's breaking when developing in Databricks. Note that locally in VS Code using Databricks Connect everything still works fine.
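As a rough check of that expectation, something like the following should keep working against the installed wheel, since site-packages is handled by the standard FileFinder (my_pkg is a hypothetical name for the wheel's top-level package):

import pkgutil

import my_pkg  # hypothetical: the top-level package installed by the wheel

# Installed wheels land under site-packages, where the standard FileFinder
# applies, so walking the package's __path__ should behave as before.
print(list(pkgutil.walk_packages(my_pkg.__path__, prefix=my_pkg.__name__ + ".")))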

I checked all the updates in DBR 17.2 and I couldn't find anything specifically related to this.

I don't have the capacity to investigate this any further right now, but I doubt that the current behaviour is correct.

but again, thanks for thinking along!

 

Cheers,

Max

 

 
