<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Real Lessons in Databricks Schema, Streaming, and Unity Catalog in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/real-lessons-in-databricks-schema-streaming-and-unity-catalog/m-p/113628#M399</link>
    <description>&lt;P&gt;Hey Databricks community,&lt;/P&gt;&lt;P&gt;I wanted to take a moment to share some things I’ve learned while working with Databricks in real projects, especially around schema management, Unity Catalog, Auto Loader, and streaming jobs. These are the kinds of small details that aren’t always obvious at first, but once you learn them, they save a ton of time and frustration. If you’ve run into any of these, you’re not alone!&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;When Moving Code with Asset Bundles Breaks Your Python Imports&lt;/H2&gt;&lt;P&gt;Ever deployed a notebook using Databricks Asset Bundles (DAB) and suddenly your imports stopped working? I had that issue when importing a local Python module with a statement like from my_package.hello import hello_world. Everything worked fine from my Git repo, but the import failed after deployment.&lt;/P&gt;&lt;H3&gt;Fix:&lt;/H3&gt;&lt;P&gt;Just add the root path back to sys.path inside your notebook, before the import runs:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;import sys&lt;/DIV&gt;&lt;DIV&gt;sys.path.append('/Workspace/dev/my_bundle/files')  # Adjust based on your project path&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;That little line saves hours of debugging.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Unity Catalog &amp;amp; External Tables: What’s Actually “External”?&lt;/H2&gt;&lt;P class=""&gt;If you created a catalog or schema with an ADLS path and thought that meant your tables are "external", you're not alone. 
It turns out Unity Catalog treats tables as &lt;STRONG&gt;managed&lt;/STRONG&gt; if they're written to the catalog's or schema's default storage path, even if that path is in ADLS.&lt;/P&gt;&lt;H3&gt;Tip:&lt;/H3&gt;&lt;P class=""&gt;If you want a &lt;STRONG&gt;true external table&lt;/STRONG&gt;, register a separate &lt;STRONG&gt;External Location&lt;/STRONG&gt;, then create your table with a LOCATION that points outside the managed area.&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV&gt;CREATE TABLE my_catalog.my_schema.my_table (&lt;/DIV&gt;&lt;DIV&gt;name STRING&lt;/DIV&gt;&lt;DIV&gt;)&lt;/DIV&gt;&lt;DIV&gt;USING DELTA&lt;/DIV&gt;&lt;DIV&gt;LOCATION 'abfss://my-container@my-storage.dfs.core.windows.net/custom-path/';&lt;/DIV&gt;&lt;/DIV&gt;&lt;HR /&gt;&lt;H2&gt;Auto Loader &amp;amp; Path Changes: How to Avoid Reprocessing Everything&lt;/H2&gt;&lt;P class=""&gt;I ran into a situation where I had to change the S3 bucket my &lt;STRONG&gt;Auto Loader&lt;/STRONG&gt; pipeline was reading from. Even though the files were the same (just copied over), Auto Loader saw them as new files, since their paths had changed, and wanted to process them all again.&lt;/P&gt;&lt;H3&gt;Solution:&lt;/H3&gt;&lt;P&gt;Set cloudFiles.includeExistingFiles = false so the stream skips files that already exist in the new path and only picks up files that arrive afterwards.&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV&gt;spark.readStream.format("cloudFiles") \&lt;/DIV&gt;&lt;DIV&gt;.option("cloudFiles.format", "json") \&lt;/DIV&gt;&lt;DIV&gt;.option("cloudFiles.includeExistingFiles", "false") \&lt;/DIV&gt;&lt;DIV&gt;.load("s3://new-bucket/path/")&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Also, remember that Auto Loader’s state lives in the &lt;STRONG&gt;checkpoint location&lt;/STRONG&gt;, so be deliberate about it: cloudFiles.includeExistingFiles is only evaluated the first time a stream starts, and changing it on a stream that resumes from an existing checkpoint has no effect.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Materialized Views: Great, but Not Always Incremental&lt;/H2&gt;&lt;P class=""&gt;I tried building an incremental Materialized View, filtering by a timestamp from another table. It didn't raise an error; it silently fell back to a full refresh. 
After digging, I found out Materialized Views only refresh incrementally when the query is &lt;STRONG&gt;fully deterministic&lt;/STRONG&gt; and &lt;STRONG&gt;the input is a Delta table&lt;/STRONG&gt;. Using streaming inputs or dynamic filters? That breaks it.&lt;/P&gt;&lt;H3&gt;Better Option:&lt;/H3&gt;&lt;P class=""&gt;Use &lt;STRONG&gt;Delta Live Tables (DLT)&lt;/STRONG&gt; for true incremental streaming with more flexibility.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Final Thoughts&lt;/H2&gt;&lt;P class=""&gt;These little things, like understanding how Auto Loader tracks files, how Unity Catalog handles table paths, or how to structure your Python imports, can save you hours or days. Hopefully, these tips help someone else hit fewer bumps on their Databricks journey.&lt;/P&gt;&lt;P class=""&gt;Got questions or something to share? Drop a comment or message. Let’s keep learning from each other.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Regards,&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Brahma&lt;/STRONG&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 26 Mar 2025 00:33:16 GMT</pubDate>
    <dc:creator>Brahmareddy</dc:creator>
    <dc:date>2025-03-26T00:33:16Z</dc:date>
    <item>
      <title>Real Lessons in Databricks Schema, Streaming, and Unity Catalog</title>
      <link>https://community.databricks.com/t5/community-articles/real-lessons-in-databricks-schema-streaming-and-unity-catalog/m-p/113628#M399</link>
      <description>&lt;P&gt;Hey Databricks community,&lt;/P&gt;&lt;P&gt;I wanted to take a moment to share some things I’ve learned while working with Databricks in real projects, especially around schema management, Unity Catalog, Auto Loader, and streaming jobs. These are the kinds of small details that aren’t always obvious at first, but once you learn them, they save a ton of time and frustration. If you’ve run into any of these, you’re not alone!&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;When Moving Code with Asset Bundles Breaks Your Python Imports&lt;/H2&gt;&lt;P&gt;Ever deployed a notebook using Databricks Asset Bundles (DAB) and suddenly your imports stopped working? I had that issue when importing a local Python module with a statement like from my_package.hello import hello_world. Everything worked fine from my Git repo, but the import failed after deployment.&lt;/P&gt;&lt;H3&gt;Fix:&lt;/H3&gt;&lt;P&gt;Just add the root path back to sys.path inside your notebook, before the import runs:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;import sys&lt;/DIV&gt;&lt;DIV&gt;sys.path.append('/Workspace/dev/my_bundle/files')  # Adjust based on your project path&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;That little line saves hours of debugging.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Unity Catalog &amp;amp; External Tables: What’s Actually “External”?&lt;/H2&gt;&lt;P class=""&gt;If you created a catalog or schema with an ADLS path and thought that meant your tables are "external", you're not alone. 
It turns out Unity Catalog treats tables as &lt;STRONG&gt;managed&lt;/STRONG&gt; if they're written to the catalog's or schema's default storage path, even if that path is in ADLS.&lt;/P&gt;&lt;H3&gt;Tip:&lt;/H3&gt;&lt;P class=""&gt;If you want a &lt;STRONG&gt;true external table&lt;/STRONG&gt;, register a separate &lt;STRONG&gt;External Location&lt;/STRONG&gt;, then create your table with a LOCATION that points outside the managed area.&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV&gt;CREATE TABLE my_catalog.my_schema.my_table (&lt;/DIV&gt;&lt;DIV&gt;name STRING&lt;/DIV&gt;&lt;DIV&gt;)&lt;/DIV&gt;&lt;DIV&gt;USING DELTA&lt;/DIV&gt;&lt;DIV&gt;LOCATION 'abfss://my-container@my-storage.dfs.core.windows.net/custom-path/';&lt;/DIV&gt;&lt;/DIV&gt;&lt;HR /&gt;&lt;H2&gt;Auto Loader &amp;amp; Path Changes: How to Avoid Reprocessing Everything&lt;/H2&gt;&lt;P class=""&gt;I ran into a situation where I had to change the S3 bucket my &lt;STRONG&gt;Auto Loader&lt;/STRONG&gt; pipeline was reading from. Even though the files were the same (just copied over), Auto Loader saw them as new files, since their paths had changed, and wanted to process them all again.&lt;/P&gt;&lt;H3&gt;Solution:&lt;/H3&gt;&lt;P&gt;Set cloudFiles.includeExistingFiles = false so the stream skips files that already exist in the new path and only picks up files that arrive afterwards.&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV&gt;spark.readStream.format("cloudFiles") \&lt;/DIV&gt;&lt;DIV&gt;.option("cloudFiles.format", "json") \&lt;/DIV&gt;&lt;DIV&gt;.option("cloudFiles.includeExistingFiles", "false") \&lt;/DIV&gt;&lt;DIV&gt;.load("s3://new-bucket/path/")&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Also, remember that Auto Loader’s state lives in the &lt;STRONG&gt;checkpoint location&lt;/STRONG&gt;, so be deliberate about it: cloudFiles.includeExistingFiles is only evaluated the first time a stream starts, and changing it on a stream that resumes from an existing checkpoint has no effect.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Materialized Views: Great, but Not Always Incremental&lt;/H2&gt;&lt;P class=""&gt;I tried building an incremental Materialized View, filtering by a timestamp from another table. It didn't raise an error; it silently fell back to a full refresh. 
After digging, I found out Materialized Views only refresh incrementally when the query is &lt;STRONG&gt;fully deterministic&lt;/STRONG&gt; and &lt;STRONG&gt;the input is a Delta table&lt;/STRONG&gt;. Using streaming inputs or dynamic filters? That breaks it.&lt;/P&gt;&lt;H3&gt;Better Option:&lt;/H3&gt;&lt;P class=""&gt;Use &lt;STRONG&gt;Delta Live Tables (DLT)&lt;/STRONG&gt; for true incremental streaming with more flexibility.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Final Thoughts&lt;/H2&gt;&lt;P class=""&gt;These little things, like understanding how Auto Loader tracks files, how Unity Catalog handles table paths, or how to structure your Python imports, can save you hours or days. Hopefully, these tips help someone else hit fewer bumps on their Databricks journey.&lt;/P&gt;&lt;P class=""&gt;Got questions or something to share? Drop a comment or message. Let’s keep learning from each other.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Regards,&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Brahma&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2025 00:33:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/real-lessons-in-databricks-schema-streaming-and-unity-catalog/m-p/113628#M399</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2025-03-26T00:33:16Z</dc:date>
    </item>
  </channel>
</rss>

