<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Best Practices for Managing Schema Changes and Metadata Lineage in Delta Tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-for-managing-schema-changes-and-metadata-lineage/m-p/105098#M41992</link>
    <description>&lt;P&gt;Hello Databricks Community,&lt;/P&gt;&lt;P&gt;We are working with Airflow DAGs to trigger Databricks jobs that use Delta tables for performing upsert operations via a MERGE statement. The job was initially designed to perform a merge upsert with predefined Delta tables. However, we recently introduced a new column to the schema, and this has caused issues as the previous Delta tables don't have this new column, resulting in upsert failures.&lt;/P&gt;&lt;P&gt;While one solution we considered is to perform a schema overwrite just before the job run to add this new column with a NULL default, we are concerned that this approach would introduce unnecessary schema changes for every future job runs—even when the new column is already present. This would potentially lead to performance overhead and also lack clarity on when schema changes were actually necessary. Another approach would be, to change the code after the first job run to remove the modification since the metadata is established in delta with first run and the future job runs would be fine but we are missing the lineage of changes which might have happened.&lt;/P&gt;&lt;P&gt;Given this scenario, we believe this issue is akin to handling migrations in software engineering and managing table creation or modification in a robust way. We are looking for best practices or recommendations for:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Tracking Schema Changes&lt;/STRONG&gt;: How can we manage schema evolution and ensure metadata lineage tracking for Delta tables in a consistent way?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Efficient Schema Updates&lt;/STRONG&gt;: What are the best practices for modifying the schema without unnecessarily repeating operations? How can we avoid repeated schema additions if they are not needed anymore?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Table Creation and Management&lt;/STRONG&gt;: Should we treat this as part of a database migration effort and ensure proper versioning of the Delta table schema? If so, what tools or strategies are recommended to track these changes effectively?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Any advice, strategies, or examples from the community would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
    <pubDate>Fri, 10 Jan 2025 02:24:20 GMT</pubDate>
    <dc:creator>geckopher</dc:creator>
    <dc:date>2025-01-10T02:24:20Z</dc:date>
    <item>
      <title>Best Practices for Managing Schema Changes and Metadata Lineage in Delta Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-managing-schema-changes-and-metadata-lineage/m-p/105098#M41992</link>
      <description>&lt;P&gt;Hello Databricks Community,&lt;/P&gt;&lt;P&gt;We are working with Airflow DAGs to trigger Databricks jobs that use Delta tables for performing upsert operations via a MERGE statement. The job was initially designed to perform a merge upsert with predefined Delta tables. However, we recently introduced a new column to the schema, and this has caused issues as the previous Delta tables don't have this new column, resulting in upsert failures.&lt;/P&gt;&lt;P&gt;While one solution we considered is to perform a schema overwrite just before the job run to add this new column with a NULL default, we are concerned that this approach would introduce unnecessary schema changes for every future job runs—even when the new column is already present. This would potentially lead to performance overhead and also lack clarity on when schema changes were actually necessary. Another approach would be, to change the code after the first job run to remove the modification since the metadata is established in delta with first run and the future job runs would be fine but we are missing the lineage of changes which might have happened.&lt;/P&gt;&lt;P&gt;Given this scenario, we believe this issue is akin to handling migrations in software engineering and managing table creation or modification in a robust way. We are looking for best practices or recommendations for:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Tracking Schema Changes&lt;/STRONG&gt;: How can we manage schema evolution and ensure metadata lineage tracking for Delta tables in a consistent way?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Efficient Schema Updates&lt;/STRONG&gt;: What are the best practices for modifying the schema without unnecessarily repeating operations? How can we avoid repeated schema additions if they are not needed anymore?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Table Creation and Management&lt;/STRONG&gt;: Should we treat this as part of a database migration effort and ensure proper versioning of the Delta table schema? If so, what tools or strategies are recommended to track these changes effectively?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Any advice, strategies, or examples from the community would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jan 2025 02:24:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-managing-schema-changes-and-metadata-lineage/m-p/105098#M41992</guid>
      <dc:creator>geckopher</dc:creator>
      <dc:date>2025-01-10T02:24:20Z</dc:date>
    </item>
    <item>
      <title>Re: Best Practices for Managing Schema Changes and Metadata Lineage in Delta Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-managing-schema-changes-and-metadata-lineage/m-p/105126#M42000</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/142125"&gt;@geckopher&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;SPAN&gt;To address your concerns about managing schema evolution, tracking metadata lineage, and efficiently updating schemas in Delta tables, here are some best practices and strategies:&lt;/SPAN&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;H3&gt;Tracking Schema Changes&lt;/H3&gt;&lt;P&gt;&lt;STRONG&gt;Delta Table History&lt;/STRONG&gt;: Utilize Delta Lake's built-in history feature to track changes to your Delta tables. You can use the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;DESCRIBE HISTORY&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;command to view the history of operations performed on a Delta table.&lt;/P&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;DESCRIBE HISTORY &amp;lt;delta.`/path/to/delta-table`&amp;gt;&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&lt;SPAN&gt;Efficient Schema Updates&lt;/SPAN&gt;&lt;/H3&gt;&lt;DIV class=""&gt;&lt;P&gt;&lt;STRONG&gt;1. Schema Evolution&lt;/STRONG&gt;: Use Delta Lak's schema evolution capabilities to automatically handle schema changes. When performing a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class=""&gt;MERGE&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;operation, you can use the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;mergeSchema&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;option to allow Delta Lake to auto merge the schema changes.&lt;/P&gt;&lt;P&gt;&lt;FONT face="terminal,monaco"&gt;&lt;STRONG&gt;Note:&lt;/STRONG&gt;&amp;nbsp;&lt;SPAN&gt;The&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;overwriteSchema&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;option is used when you want to completely overwrite the schema of the target table with the schema of the source data.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;deltaTable
    .alias("t")
    .merge(
        sourceDF.alias("s"),
        "s.key = t.key"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .option("mergeSchema", "true")
    .execute()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&lt;P&gt;&lt;STRONG&gt;2. Conditional Schema Updates&lt;/STRONG&gt;: Before performing a schema update, check if the new column already exists. This can be done programmatically by inspecting the schema of the Delta table.&lt;/P&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
schema = deltaTable.toDF().schema

if "new_column" not in schema.fieldNames():
    # Perform schema update
    deltaTable.updateExpr(
        condition="true",
        set={"new_column": "NULL"}
    )&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;H3&gt;&lt;SPAN&gt;Example Code&lt;/SPAN&gt;&lt;/H3&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaManagement").getOrCreate()

# Path to Delta table
delta_table_path = "/path/to/delta-table"

# Check if the new column exists
deltaTable = DeltaTable.forPath(spark, delta_table_path)
schema = deltaTable.toDF().schema

if "new_column" not in schema.fieldNames():
    # Apply schema changes
    spark.sql(f"""
    ALTER TABLE delta.`{delta_table_path}`
    ADD COLUMNS (new_column STRING)
    """)

# Perform the MERGE operation
sourceDF = spark.read.format("delta").load("/path/to/source-data")
(
    deltaTable.alias("t")
    .merge(sourceDF.alias("s"), "s.key = t.key")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .option("mergeSchema", "true")
    .execute()
)

# Track changes
history = spark.sql(f"DESCRIBE HISTORY delta.`{delta_table_path}`")
display(history)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;By following these best practices and strategies, you can manage schema evolution, track metadata lineage, and efficiently update schemas in Delta tables without unnecessary repetition or performance overhead.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;Refer this link for more understanding&amp;nbsp;&lt;A href="https://delta.io/blog/2023-02-08-delta-lake-schema-evolution/" target="_blank" rel="noopener"&gt;Delta Lake Schema Evolution | Delta Lake&lt;/A&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;Regards,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;Hai Prasad&lt;/SPAN&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 10 Jan 2025 08:14:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-managing-schema-changes-and-metadata-lineage/m-p/105126#M42000</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2025-01-10T08:14:56Z</dc:date>
    </item>
  </channel>
</rss>

