Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Data validation with DataFrame writes in append mode

RevanthV
New Contributor III

Hi Team,

Recently I came across a situation where I had to write a huge dataset, and the job took 6 hours to complete. Later, when I checked the target data, I saw that 20% of the total records were written incorrectly or corrupted, because the source data itself was corrupted.

Because of this, I had to rewrite the entire dataset in two steps:

Create a DataFrame, filter the valid rows, and write the data.

This took even more time than the original write (8 hours). I just wanted to know if there is something that could validate the records (based on a condition) and throw an error if the data is incorrect before writing it in append mode, so that we can save a lot of time (12 hours in this case) and compute as well.
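To illustrate what I mean, here is a minimal sketch of the kind of pre-write validation I'm after; the validity rule, the source DataFrame df, and the table name are all placeholders:

```python
from pyspark.sql import functions as F

# df is assumed to be the source DataFrame built earlier in the job.
# Placeholder rule -- replace with whatever condition defines a valid record.
valid = F.col("id").isNotNull() & (F.col("amount") >= 0)

# Fail fast: count the violations before anything lands in the target table.
bad_count = df.filter(~valid).count()
if bad_count > 0:
    raise ValueError(f"Found {bad_count} corrupted rows; aborting the append.")

df.write.format("delta").mode("append").saveAsTable("target_table")
```

The obvious drawback of doing this by hand is that the count() forces an extra pass over the source data, which is exactly the overhead I'd like to avoid.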

There is an option called replaceWhere that does something similar, but it isn't applicable in append mode, right?
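For comparison, my understanding is that replaceWhere only works together with overwrite mode, roughly like this (the predicate and table name are again placeholders):

```python
# replaceWhere atomically replaces only the rows matching the predicate,
# but it requires mode("overwrite"), so it cannot be combined with append.
(df.write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", "event_date >= '2024-01-01'")
   .saveAsTable("target_table"))
```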

Any idea how we can get around this?

1 REPLY

K_Anudeep
Databricks Employee

Hi @RevanthV,

We currently don't have such an option, and I think the only way is to use a filter to drop the unwanted data before writing. There is no built-in guard for cases where you don't know the source data in advance and are performing continuous appends, or where backfill jobs could accidentally write incorrect data.
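As an illustration, the filter approach can be done in a single pass over the source by splitting it into valid rows and quarantined rows; the condition, the source DataFrame df, and the table names here are only placeholders:

```python
from pyspark.sql import functions as F

# Placeholder condition; df is the source DataFrame to be appended.
valid = F.col("id").isNotNull() & (F.col("amount") >= 0)

# Append only the rows that pass the rule ...
df.filter(valid).write.format("delta").mode("append").saveAsTable("target_table")

# ... and keep the rejects in a quarantine table for later inspection.
df.filter(~valid).write.format("delta").mode("append").saveAsTable("quarantine_table")
```

Quarantining the rejects instead of silently dropping them also makes it easier to inspect the corrupted records afterwards.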

There is already a bug raised for this same issue in OSS Delta Lake: https://github.com/delta-io/delta/issues/5764.

I have a proposal, will try working on this, and will get back to you soon.

Anudeep
