Real Lessons in Databricks Schema, Streaming, and Unity Catalog

Brahmareddy

Hey Databricks community,

I wanted to take a moment to share some things I've learned while working with Databricks in real projects, especially around schema management, Unity Catalog, Autoloader, and streaming jobs. These are the kinds of small details that aren't always obvious at first, but once you learn them, they save a ton of time and frustration. If you've run into any of these, you're not alone!


When Moving Code with Asset Bundles Breaks Your Python Imports

Ever deployed a notebook using Databricks Asset Bundles (DAB) and found that your imports suddenly stopped working? I hit that when importing a local Python module with from my_package.hello import hello_world. Everything worked fine from my Git repo, but failed after deployment.

Fix:

Just add the root path back to sys.path inside your notebook:

 

 
import sys
sys.path.append('/Workspace/dev/my_bundle/files')  # Adjust based on your project path

That little line saves hours of debugging.
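
In context, the whole pattern in the deployed notebook looks roughly like this; the bundle path and the my_package module are just the placeholders from the example above, so adjust them to your own project:

import sys

# Put the bundle's deployed files directory back on the import path
# before importing anything local (placeholder path, depends on your bundle).
sys.path.append('/Workspace/dev/my_bundle/files')

# The local import from the original repo now resolves again.
from my_package.hello import hello_world

hello_world()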


Unity Catalog & External Tables: What's Actually "External"?

If you created a catalog or schema with an ADLS path and thought that meant your tables are "external", you're not alone. It turns out Unity Catalog treats tables as managed if they're written to the catalog or schema's default storage path, even when that path is in ADLS.

Tip:

If you want a true external table, register a separate External Location, then create your table with a LOCATION that points outside the managed area.

CREATE TABLE my_catalog.my_schema.my_table (
  name STRING
)
USING DELTA
LOCATION 'abfss://my-container@my-storage.dfs.core.windows.net/custom-path/'
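
For the LOCATION above to be usable, the External Location covering that path has to exist first. Here's a minimal sketch, assuming a storage credential named my_storage_credential is already set up with access to the container; the location name and URL are placeholders:

spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS my_ext_location
  URL 'abfss://my-container@my-storage.dfs.core.windows.net/custom-path/'
  WITH (STORAGE CREDENTIAL my_storage_credential)
""")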

Autoloader & Path Changes: How to Avoid Reprocessing Everything

I ran into a situation where I had to change the S3 bucket my Autoloader pipeline was reading from. Even though the files were the same (just copied over), Autoloader saw them as new files and wanted to process them all again.

Solution:

Set cloudFiles.includeExistingFiles to false so Autoloader skips the files that already exist in the new path and only processes files that arrive after the stream starts.

 

 
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.includeExistingFiles", "false") \
  .load("s3://new-bucket/path/")

Also, keep the checkpoint location the same to retain Autoloader's state.
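
For reference, a rough sketch of the full read-and-write with a fixed checkpoint path is below. The bucket, schema location, checkpoint path, and table name are placeholders; the key point is that checkpointLocation stays the same before and after the bucket change:

# Read from the new bucket but keep the existing checkpoint so state is retained.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.includeExistingFiles", "false")
    .option("cloudFiles.schemaLocation", "s3://my-checkpoints/autoloader-schema/")  # placeholder
    .load("s3://new-bucket/path/")
    .writeStream
    .option("checkpointLocation", "s3://my-checkpoints/autoloader-checkpoint/")  # keep this unchanged
    .toTable("my_catalog.my_schema.events"))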


Materialized Views: Great, but Not Always Incremental

I tried building an incremental Materialized View, filtering by a timestamp from another table. It failed silently and fell back to a full refresh. After digging, I found that Materialized Views only refresh incrementally when the query is fully deterministic and the input is a Delta table. Streaming inputs or dynamic filters break that.

Better Option:

Use Delta Live Tables (DLT) for true incremental streaming with more flexibility.
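
If you haven't tried DLT yet, a minimal pipeline looks roughly like this. The source path, column, and table names are made up for illustration, and this code runs inside a DLT pipeline (where the dlt module is available), not in a plain notebook:

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested incrementally with Autoloader")
def raw_events():
    # Autoloader source; DLT manages the checkpoint and schema tracking for you.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://new-bucket/path/"))

@dlt.table(comment="Filtered table built incrementally from raw_events")
def recent_events():
    # Placeholder filter column and cutoff, just to show a downstream streaming table.
    return dlt.read_stream("raw_events").where(F.col("event_time") >= "2024-01-01")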


Final Thoughts

These little things, like understanding how Autoloader tracks files, how Unity Catalog handles table paths, or how to structure your Python imports, can save you hours or days. Hopefully, these tips help someone else hit fewer bumps on their Databricks journey.

Got questions or something to share? Drop a comment or message. Let's keep learning from each other.

Regards,

Brahma

