Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Schema Update Issue in DLT

Fatimah-Tariq
New Contributor III

I have a pipeline in Databricks with this flow:
SQL Server (source) -> Staging (Parquet) -> Bronze (DLT) -> Silver (DLT) -> Gold (DLT)

The pipeline had been up and running smoothly for months, but recently there was a schema update at my source level: one of the primary keys of one of my tables was changed from int to bigint.

After the schema change happened at the source, it was reflected in my pipeline too: if I look at the table's schema in every layer, it shows bigint. So I believe mergeSchema handled the schema evolution and the column's data type was updated from int to bigint. But it caused very strange behavior. The data type was updated, but when the pipeline ran the next time after the update, it wiped all of the previous data for this column (millions of records). The pipeline is now running and fetching new data correctly (I assume because the data type now matches the source, i.e. bigint), but all of my previous values for this column are now null in the table. And since this column is part of a primary key, it messed up my data very badly, because millions of records are effectively lost.
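As a diagnostic starting point, here is a minimal sketch of how one might confirm which pipeline update rewrote the column and whether the pre-change values are still recoverable via time travel; the catalog, schema, table, column names, and version number below are all placeholders:

```python
# Minimal diagnostic sketch, assuming the bronze table is a Delta table
# reachable by name; all identifiers below are placeholders.
# DESCRIBE HISTORY shows which pipeline update rewrote the data.
spark.sql("DESCRIBE HISTORY my_catalog.my_schema.bronze_table") \
    .select("version", "timestamp", "operation") \
    .show(truncate=False)

# Time travel to a version before the schema change (version number assumed)
# to check whether the pre-change key values are still recoverable.
old_df = spark.read.option("versionAsOf", 5).table("my_catalog.my_schema.bronze_table")
old_df.selectExpr("count(*) AS total_rows", "count(pk_col) AS non_null_keys").show()
```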

What exactly could be the reason behind this missing data, and how can I fix it? I need to read the lost records back from my source to make the data correct again.
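Assuming the lost rows still exist in SQL Server and the staging layer is plain Parquet, a one-off backfill could re-land them so the pipeline repairs itself downstream; the JDBC URL, secret scope, table name, and path below are all placeholders:

```python
# One-off backfill sketch, assuming the lost records still exist in SQL Server
# and that bronze upserts by key on the next incremental run. All connection
# details, names, and paths are placeholders.
jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"

missing = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .load()
)

# Re-land the affected rows in staging; the next pipeline run should pick
# them up. If bronze does not dedupe by key, a one-off repair MERGE would
# be needed instead of a plain append.
missing.write.mode("append").parquet("/mnt/staging/my_table/backfill/")
```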

As a side note, the pipeline is incremental and runs once daily.

4 REPLIES

Fatimah-Tariq
New Contributor III

Hi @Alberto_Umana, sorry to bother you. You helped me last time with my DLT problem, so I was wondering if you could help me with this one too, if you have any idea.
Thank you!

Alberto_Umana
Databricks Employee

Hi @Fatimah-Tariq,

Thanks for your question. 

When a schema change involves the data type of a column, and that column is part of a primary key, Delta Lake's schema evolution mechanism can handle it by updating the schema. However, this change can sometimes inadvertently cause data inconsistencies, such as previous records being set to null when the old data cannot be cast cleanly to the new data type.
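For illustration only (this is an assumption, not a confirmed root-cause fix), explicitly casting the key in the bronze definition pins the type, so a future source-side widening does not depend on implicit evolution; the table name, staging path, and column are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical bronze definition: casting the key to BIGINT at ingest pins
# the type, so a later source-side widening does not force an implicit
# schema rewrite. Names and the staging path are placeholders.
@dlt.table(name="bronze_my_table")
def bronze_my_table():
    return (
        spark.read.parquet("/mnt/staging/my_table/")
        .withColumn("pk_col", col("pk_col").cast("bigint"))
    )
```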

One option is to use MERGE INTO together with Delta's auto-optimize and auto-compaction features to ensure smooth transitions between schema versions.
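A sketch of what that could look like in practice, assuming the silver table name and an `updates` source view; the auto-optimize table properties and the autoMerge setting are real Delta options:

```python
# Enable auto-optimize on the target (real Delta table properties) and let
# MERGE evolve the schema explicitly. Table names and the 'updates' view
# are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.silver_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Allow MERGE to evolve the target schema (int -> bigint) instead of
# relying on an implicit rewrite.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

spark.sql("""
    MERGE INTO my_catalog.my_schema.silver_table AS t
    USING updates AS s
    ON t.pk_col = s.pk_col
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```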

I will check internally what the best option to address this would be, or whether an ALTER back to the previous data type is the way to go.

Fatimah-Tariq
New Contributor III

Thank you so much for responding. I had to bother you because the issue is high priority, since it messed up the data.
I'll be waiting for your answer. 

Fatimah-Tariq
New Contributor III

Hi @Alberto_Umana, is there any word on how to fix my data and bring all the records back into the pipeline's tables?
