
The case of the phantom files!

POB756
New Contributor II

I developed a notebook that imports *.py files as modules.  On my cluster, the logic in the notebook works fine: my custom modules get loaded and the code executes.

Using an Azure DevOps (AzDO) pipeline, I deploy the notebook and supporting files to a separate workspace that acts as a test environment.  When the notebook runs there, it fails on a module import statement with Errno 95 (Operation not supported).

I've opened the files in the test environment and confirmed that they are all physically present.  From Python in the notebook, I verify that the folders and files exist using os.path.exists(...).
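
For reference, the kind of check I'm doing looks roughly like this (the workspace path and module name are placeholders, not our real ones):

```python
import os
import sys

# Placeholder project folder -- the real one is a path in our workspace.
project_dir = "/Workspace/Users/someone@example.com/my_project"

# These checks come back True in the test workspace...
print(os.path.exists(project_dir))
print(os.path.exists(os.path.join(project_dir, "utils.py")))

# ...and dumping the CWD and import path shows where Python actually looks.
print(os.getcwd())
print(sys.path)
```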

For Workspace B (test), I'm using the exact same cluster setup as in Workspace A, where I developed.

Based on conversations with two LLMs, both claim that the workspace file system is virtualized.  Ergo, there are no guarantees that Python can access local *.py files.  Even if os.path.exists() returns True, the file may not really be there.  So in Workspace A (development), the file exists and the notebook runs.  In Workspace B (test), it may or may not, thanks to the magic of "back planes", synchronization, etc.

Both LLMs advise that DBFS is the best way to reliably store Python modules.

Databricks documentation and the CLI lead you to believe that the workspace can support different file types -- and it can.  That's how I deploy to test, but the files aren't really there.  So while I can open the folders in the workspace and see the *.py files there, they either aren't really there or are inaccessible to Python in that workspace.

According to ChatGPT 4.1, even if my team adopted asset bundles, there are no guarantees that a local Python file will really be there.

I've been doing software dev and ETLs for a long time and am frankly amazed at the learning curve involved.  "Well, it's a file system but not really a file system and you may/may not get stuff pushed correctly using our tools; or they are pushed correctly but don't work reliably because of the back plane, etc."

Some of the documentation indicates DBFS is no longer recommended.  Workspaces are not practical because, according to the LLMs, they are primarily meant to support notebooks.  Python is not reliable because it can tell you a file and folder exist, and even list them, yet they may or may not actually be there because of the complexity of a distributed file system, etc.

If the problem is a permissions issue in Workspace B, the error message doesn't indicate this at all.
Wheel files are an option, but really onerous for us since we are not using Unity Catalog at the moment.  

My team is really struggling with DB as a whole.  I'm finding it is an anti-pattern for SDLC activities, but am looking for a different perspective.  Please do not reply with "use VS Code with Databricks plugin."   This is not an option for us and the guidance for other scenarios is very confusing (don't use DBFS, WSFS is not reliable, etc.)

3 REPLIES

-werners-
Esteemed Contributor III

I totally understand your struggle.
Can you tell me what your current development way of working is?
Do you use Git folders? Asset bundles (you mention them)? What does the DevOps pipeline do? Etc.

POB756
New Contributor II

We can't use asset bundles because it would entail a significant refactoring/rewrite.

We use notebooks.  The notebooks are stored in Git and then deployed via Azure DevOps pipelines to workspaces for different lifecycle stages (test, stg, etc.).  The pipelines use the Databricks CLI to push items to the workspace file system.  The deployment process works okay and is similar to what we would do if using asset bundles.

One of the things I work on and struggle with in Databricks is refactoring code into unit-testable code.  My background is in software dev, and we want unit tests.

Combining notebooks with ETL concerns is, IMO, a recipe for disaster.  My project ends up with anti-patterns because of the way Databricks works.  Complex ETL code should never be embedded directly in a notebook (procedural code), but the older versions pushed teams in that direction, unfortunately.

-werners-
Esteemed Contributor III

Got it.
My background is not SWE; I have always been a 'data guy', but I definitely appreciate a proper dev workflow (CI/CD, Git integration, tests).
When we started using Databricks, 7 or 8 years ago, we went for notebooks, as this got us up to speed quickly and we could deliver very fast.
I have nothing against notebooks; the code is executed like any other code.
BUT, as you also mention, they do not promote good code practice (modularity mainly) and testing.
That is why I decided to drastically change our way of working, which was almost exactly how you are doing it now.
It became clear that when external consultants joined our team, we had to step up and become way stricter in the way code was promoted to prod.
So I looked into asset bundles and CI/CD pipelines, and that is what we are starting to use now.
We still use notebooks, but only like a 'main' program that executes/calls functions/methods and writes the data.
All the rest sits in .py files and config files.
We use pytest for unit tests (see the sketch after this paragraph).
Classes that are used in many places are packaged into a wheel (by the asset bundles!).
On each commit to Git, unit tests run (locally!); after each successful PR merge, we can run integration tests and end-to-end tests (on a Databricks cluster).
The latter two are not yet fully in practice due to a lack of time at the moment, but we will definitely get there.
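
To make that concrete, here is a toy example of such a test (the function and file names are made up, not our real code). In practice the function sits in its own .py module and is imported by both the notebook and the test; it is inlined here so the sketch stands alone:

```python
# test_transformations.py -- runs locally with `pytest`, no cluster needed.
# In our setup, normalize_country_code would live in its own module
# (e.g. transformations.py) and be imported here and in the notebook.

def normalize_country_code(raw: str) -> str:
    """Trim and upper-case a country code; empty input becomes 'UNKNOWN'."""
    cleaned = (raw or "").strip().upper()
    return cleaned if cleaned else "UNKNOWN"


def test_trims_and_uppercases():
    assert normalize_country_code("  us ") == "US"


def test_empty_input_becomes_unknown():
    assert normalize_country_code("") == "UNKNOWN"
```

The notebook itself then only calls functions like this and writes the data, which keeps the logic testable outside Databricks.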

So, do not despair using notebooks. You can still modularize. 

But be aware that if you import .py files, Databricks looks only in the current working directory (CWD), so you will either have to append to the Python path or package them in wheels.
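
Appending to the path is just a couple of lines before the import; roughly like this (the folder and module names are placeholders):

```python
import sys

# Placeholder folder that holds the shared .py modules -- use your own path.
module_dir = "/Workspace/Users/someone@example.com/my_project/src"

# Make the folder importable before the import statement runs.
if module_dir not in sys.path:
    sys.path.append(module_dir)

from transformations import normalize_country_code  # placeholder module/function
```
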
For testing, Databricks published some guidelines on how to test with notebooks:
https://docs.databricks.com/aws/en/notebooks/test-notebooks
Frankly, I don't like it.  I don't want to run unit tests on a Databricks cluster (it's not necessary).  But it might help you.

A reason why you might look into DAB (Databricks Asset Bundles) is to programmatically create jobs.

Then there is also serverless compute, available since a short while ago.  That can be interesting for short-running tests.
I do not use it because there is no way to set environment vars on the serverless cluster (yet).

So, about those 'missing' files: I never had any issue with that.  We put notebooks etc. in the workspace and data in UC.
Works without a problem.
DBFS however is indeed not recommended anymore.

That being said:
I am far from happy with how Databricks is making decent code deployment harder and harder.  It should be simple and solid so an engineer can focus on writing code.
Heck, we even have to stop using Scala because it is clear that there is no future for that language in Databricks.  New features are released for Python/SQL first, and maybe Scala some time later.
I am still a fan of the platform, though; I just wish they gave engineers some attention.

So hopefully this helps a bit.

 
