Real Lessons in Databricks Schema, Streaming, and Unity Catalog

Brahmareddy — Wed, 26 Mar 2025 00:33:16 GMT

Hey Databricks community,

I wanted to take a moment to share some things I’ve learned while working with Databricks in real projects—especially around schema management, Unity Catalog, Autoloader, and streaming jobs. These are the kinds of small details that aren’t always obvious at first, but once you learn them, they save a ton of time and frustration. If you’ve run into any of these, you’re not alone!

When Moving Code with Asset Bundles Breaks Your Python Imports

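Ever deployed a notebook using Databricks Asset Bundles (DAB) and suddenly your imports stopped working? I hit this when importing a local Python module with from my_package.hello import hello_world. Everything worked fine from my Git repo, but the import failed after deployment. The likely reason: a Git folder puts the repo root on sys.path for you, while the deployed bundle files sit under a different workspace path that isn't added automatically.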

Fix:

Just add the root path back to sys.path inside your notebook:

import sys
sys.path.append('/Workspace/dev/my_bundle/files')  # Adjust based on your project path

That little line saves hours of debugging.
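If the notebook cell gets re-run a lot, the same path ends up on sys.path many times. Here is a minimal sketch of a guard around the line above. The helper name add_to_sys_path is my own (not a Databricks API), and the path is the same example path, so adjust it to your bundle target.

```python
import sys
from pathlib import Path


def add_to_sys_path(root: str) -> bool:
    """Append root to sys.path once; return True only if it was added."""
    # Normalize so a trailing slash does not create a second entry.
    normalized = str(Path(root))
    if normalized in sys.path:
        return False
    sys.path.append(normalized)
    return True


# Same example path as above; adjust based on your project path.
add_to_sys_path("/Workspace/dev/my_bundle/files")
```

Calling it a second time with the same path returns False and leaves sys.path unchanged, so it is safe to keep at the top of a notebook.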

Unity Catalog & External Tables: What’s Actually “External”?

If you created a catalog or schema with an ADLS path and thought that meant your tables are "external", you're not alone. Turns out, Unity Catalog treats a table as managed whenever it's written to the catalog's or schema's default (managed) storage location, even if that location is in ADLS.

Tip:

If you want a true external table, register a separate External Location, then create your table with a LOCATION that points outside the managed area.

CREATE TABLE my_catalog.my_schema.my_table (
  name STRING
)
USING DELTA
LOCATION 'abfss://my-container@my-storage.dfs.core.windows.net/custom-path/'
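Before running a statement like that, it can help to sanity-check that the LOCATION really sits outside the managed area. This is only a rough sketch in plain Python: the function name is_inside and the managed root path are made up for illustration, and Unity Catalog does its own validation, so treat this as a pre-flight check and nothing more.

```python
from urllib.parse import urlsplit


def is_inside(location: str, root: str) -> bool:
    """Return True if location is root itself or a sub-path of root."""
    loc, base = urlsplit(location), urlsplit(root)
    # Different scheme, container, or storage account means not inside.
    if (loc.scheme, loc.netloc) != (base.scheme, base.netloc):
        return False
    # Compare whole path segments so "/managed-x" is not treated as
    # being inside "/managed".
    loc_parts = [p for p in loc.path.split("/") if p]
    base_parts = [p for p in base.path.split("/") if p]
    return loc_parts[: len(base_parts)] == base_parts


# Hypothetical managed root for the schema; substitute your own.
managed_root = "abfss://my-container@my-storage.dfs.core.windows.net/managed/"
table_location = "abfss://my-container@my-storage.dfs.core.windows.net/custom-path/"

points_outside = not is_inside(table_location, managed_root)
```

With the example values, points_outside is True, which is what you want for an external table.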

Autoloader & Path Changes: How to Avoid Reprocessing Everything

I ran into a situation where I had to change the S3 bucket my Autoloader pipeline was reading from. Even though the files were the same (just copied over), Autoloader saw them as new files and wanted to process them all again.
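Why does that happen? Here is a toy model in plain Python, assuming a file's identity is its full path (which matches the behavior I saw). It is not Autoloader's actual implementation, and the bucket name old-bucket is made up for the example.

```python
def unseen_files(listing, seen):
    """Return paths not processed before, and remember them as processed."""
    new = [path for path in listing if path not in seen]
    seen.update(new)
    return new


seen = set()  # stands in for the state kept with the checkpoint

old_files = ["s3://old-bucket/path/a.json", "s3://old-bucket/path/b.json"]
first_run = unseen_files(old_files, seen)

# Same file contents, copied to a new bucket: every path is different.
copied_files = [p.replace("s3://old-bucket", "s3://new-bucket") for p in old_files]
second_run = unseen_files(copied_files, seen)
```

In this model second_run contains both copied files, because none of the new paths were ever recorded, even though the contents are identical.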

Solution:

Set cloudFiles.includeExistingFiles = false to skip already-existing files in the new path.

spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.includeExistingFiles", "false") \
    .load("s3://new-bucket/path/")

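Also, keep the checkpoint location the same to retain Autoloader’s state. One caveat: the Databricks documentation says cloudFiles.includeExistingFiles is evaluated only when a stream starts for the first time, so try the change on a test stream before relying on it with an existing checkpoint.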
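The snippet above only shows the read side, and the checkpoint is actually set on the write side. Below is a sketch of how the two could fit together. The function names, the trigger choice, and reusing the checkpoint path as cloudFiles.schemaLocation are my assumptions, not something from the original pipeline; spark is the session Databricks provides in a notebook.

```python
def autoloader_options(file_format, schema_location, include_existing):
    """Build the cloudFiles options as a plain dict of strings."""
    return {
        "cloudFiles.format": file_format,
        # Needed when no explicit schema is supplied to the reader.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.includeExistingFiles": str(include_existing).lower(),
    }


def start_ingest(spark, source_path, checkpoint_path, target_table):
    """Read from source_path with cloudFiles and write to a table."""
    options = autoloader_options("json", checkpoint_path, include_existing=False)
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        stream.writeStream
        # The checkpoint lives on the writer, not the reader.
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

In a notebook this would be called as start_ingest(spark, "s3://new-bucket/path/", checkpoint_path, target_table), with your own checkpoint path and table name.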

Materialized Views: Great, but Not Always Incremental

I tried building an incremental Materialized View, filtering by a timestamp from another table. It failed silently and fell back to full refresh. After digging, I found out Materialized Views only work incrementally when the query is fully deterministic and the input is a Delta table. Using streaming inputs or dynamic filters? That breaks it.

Better Option:

Use Delta Live Tables (DLT) for true incremental streaming with more flexibility.

Final Thoughts

These little things—like understanding how Autoloader tracks files, how Unity Catalog handles table paths, or how to structure your Python imports—can save you hours or days. Hopefully, these tips help someone else hit fewer bumps on their Databricks journey.

Got questions or something to share? Drop a comment or message. Let’s keep learning from each other.

Regards,

Brahma
