Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Schema Update Issue in DLT

Fatimah-Tariq
New Contributor III

I have a pipeline in Databricks with this flow:
SQL Server (source) -> Staging (Parquet) -> Bronze (DLT) -> Silver (DLT) -> Gold (DLT)

The pipeline had been up and running smoothly for months, but recently there was a schema update at my source level: one of the primary keys of one of my tables was changed from int to bigint.

After the schema change happened at the source, it was reflected in my pipeline too: if I look at the table's schema in every layer, it shows bigint. So I believe mergeSchema handled the schema evolution and the column's data type was updated from int to bigint. But it caused very strange behavior. The data type was updated, but when the pipeline ran the next time after the update, it wiped all of the previous data for this column (millions of records). The pipeline is now running and fetching new data correctly (I assume because the data type now matches the source, i.e. bigint), but all of my previous values for this column are now null in the table. And since this column is part of a primary key, it messed up my data very badly, because millions of records are effectively lost.
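As a diagnostic starting point, here is a minimal sketch of how one might confirm which pipeline update rewrote the column and whether the pre-change values are still recoverable via time travel; the catalog, schema, table, column names, and version number below are all placeholders:

```python
# Minimal diagnostic sketch, assuming the bronze table is a Delta table
# reachable by name; all identifiers below are placeholders.
# DESCRIBE HISTORY shows which pipeline update rewrote the data.
spark.sql("DESCRIBE HISTORY my_catalog.my_schema.bronze_table") \
    .select("version", "timestamp", "operation") \
    .show(truncate=False)

# Time travel to a version before the schema change (version number assumed)
# to check whether the pre-change key values are still recoverable.
old_df = spark.read.option("versionAsOf", 5).table("my_catalog.my_schema.bronze_table")
old_df.selectExpr("count(*) AS total_rows", "count(pk_col) AS non_null_keys").show()
```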

What exactly could be the reason behind this missing data, and how can I fix it? I need to read the lost records back from my source to make the data correct again.
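Assuming the lost rows still exist in SQL Server and the staging layer is plain Parquet, a one-off backfill could re-land them so the pipeline repairs itself downstream; the JDBC URL, secret scope, table name, and path below are all placeholders:

```python
# One-off backfill sketch, assuming the lost records still exist in SQL Server
# and that bronze upserts by key on the next incremental run. All connection
# details, names, and paths are placeholders.
jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"

missing = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .load()
)

# Re-land the affected rows in staging; the next pipeline run should pick
# them up. If bronze does not dedupe by key, a one-off repair MERGE would
# be needed instead of a plain append.
missing.write.mode("append").parquet("/mnt/staging/my_table/backfill/")
```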

As a side note, the pipeline is incremental and runs once daily.

4 REPLIES

Fatimah-Tariq
New Contributor III

Hi @Alberto_Umana, sorry to bother you. You helped me last time with my DLT problem, so I was wondering if you could help me with this one too, if you have any idea.
Thank you!

Alberto_Umana
Databricks Employee

Hi @Fatimah-Tariq,

Thanks for your question. 

When a schema change involves the data type of a column, and that column is part of a primary key, Delta Lake's schema evolution mechanism can handle it by updating the schema. However, this change can sometimes inadvertently cause data inconsistencies, such as previous records being set to null when the old data cannot be cast cleanly to the new data type.
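For illustration only (this is an assumption, not a confirmed root-cause fix), explicitly casting the key in the bronze definition pins the type, so a future source-side widening does not depend on implicit evolution; the table name, staging path, and column are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical bronze definition: casting the key to BIGINT at ingest pins
# the type, so a later source-side widening does not force an implicit
# schema rewrite. Names and the staging path are placeholders.
@dlt.table(name="bronze_my_table")
def bronze_my_table():
    return (
        spark.read.parquet("/mnt/staging/my_table/")
        .withColumn("pk_col", col("pk_col").cast("bigint"))
    )
```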

One option is to use MERGE INTO together with Delta's auto-optimize and auto-compaction features to ensure smooth transitions between schema versions.
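A sketch of what that could look like in practice, assuming the silver table name and an `updates` source view; the auto-optimize table properties and the autoMerge setting are real Delta options:

```python
# Enable auto-optimize on the target (real Delta table properties) and let
# MERGE evolve the schema explicitly. Table names and the 'updates' view
# are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.silver_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Allow MERGE to evolve the target schema (int -> bigint) instead of
# relying on an implicit rewrite.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

spark.sql("""
    MERGE INTO my_catalog.my_schema.silver_table AS t
    USING updates AS s
    ON t.pk_col = s.pk_col
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```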

I will check internally what the best option to address this would be, or whether an ALTER back to the previous data type is the way to go.

Fatimah-Tariq
New Contributor III

Thank you so much for responding. I had to bother you because the issue is high priority, since it messed up the data.
I'll be waiting for your answer. 

Fatimah-Tariq
New Contributor III

Hi @Alberto_Umana, is there any word on how to fix my data and bring all the records back into the pipeline's tables?
