Data validation with DataFrame writes in append mode

RevanthV
New Contributor III

Hi Team,

Recently I came across a situation where I had to write a huge dataset, and the job took 6 hours to complete. When I later checked the target data, I saw that 20% of the total records had been written incorrectly or were corrupted, because the source data itself was corrupted.

Because of this, I had to rewrite the entire dataset again in two steps:

Create a DataFrame and filter the valid rows, then write the data.

This took even more time than the original write (8 hours). I just wanted to know if there is something that could validate the records (based on a condition) and throw an error if the data is incorrect, before writing the data in append mode, so that we can save a lot of time (12 hours in this case) and compute as well.
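To make the ask concrete, here is a rough sketch of the kind of pre-write check I mean (the paths, column names, and validity condition are just placeholders for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder source path and columns, just for illustration
df = spark.read.format("parquet").load("/mnt/source/orders")

# Placeholder validity condition; whatever defines a good record
valid_condition = F.col("order_id").isNotNull() & (F.col("amount") >= 0)

# Fail fast: check for bad rows BEFORE spending hours on the write
bad_count = df.filter(~valid_condition).count()
if bad_count > 0:
    raise ValueError(f"Aborting append: {bad_count} records failed validation")

df.write.format("delta").mode("append").save("/mnt/target/orders")
```

The problem with doing this by hand is that the count() adds a full extra pass over the source data, which is why a built-in "validate and commit" option would save so much time.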

There is an option called replaceWhere which does something similar, but it is not applicable for append, right?
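From what I understand, it is used like this (the path and predicate are placeholders), and it only works with overwrite:

```python
# replaceWhere atomically replaces only the rows matching the predicate,
# but it requires overwrite mode, so it does not help for plain appends
(df.filter("event_date >= '2024-01-01'")          # df as defined above
   .write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", "event_date >= '2024-01-01'")
   .save("/mnt/target/events"))
```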

Any idea on how we can get through this?

K_Anudeep
Databricks Employee

Hi @RevanthV ,

We currently don't have such an option, and I think the only way is to use a filter to drop the unwanted data before writing. However, if you don't know the source data in advance and are performing continuous appends, or have backfill jobs that could accidentally write incorrect data, then there is no built-in option to guard against that.
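In the meantime, a common stopgap is to split each batch and quarantine the bad rows instead of silently dropping them. A rough sketch, where the condition and the paths are placeholders:

```python
import pyspark.sql.functions as F

# df is the incoming batch; placeholder condition for a "good" record
valid_condition = F.col("order_id").isNotNull() & (F.col("amount") >= 0)

valid_df = df.filter(valid_condition)
invalid_df = df.filter(~valid_condition)

# Append only the clean rows to the target table
valid_df.write.format("delta").mode("append").save("/mnt/target/orders")

# Keep the rejects in a quarantine table for inspection or backfill
invalid_df.write.format("delta").mode("append").save("/mnt/quarantine/orders")
```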

There is a bug already raised for the same issue in OSS Delta: https://github.com/delta-io/delta/issues/5764

I have a proposal and will try working on this and get back soon.

Anudeep

Sanjeeb2024
Contributor III

Hi @RevanthV - You can create a data validation module: before writing the data to the target, validate it as part of your pipeline, and then define a process for handling the bad data. You can also explore the Great Expectations library or Soda. However, if you add lots of validations, they will also increase your ETL pipeline's execution time.
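A minimal sketch of such a module, evaluating all rules in a single aggregation pass to keep the extra runtime low (the rule names and conditions are only examples):

```python
import pyspark.sql.functions as F

# df is the batch to validate; each rule is (name, Column that is True for good rows)
rules = [
    ("order_id_not_null", F.col("order_id").isNotNull()),
    ("amount_non_negative", F.col("amount") >= 0),
]

# One aggregation pass counts the violations for every rule at once
agg_exprs = [
    F.sum(F.when(~cond, 1).otherwise(0)).alias(name) for name, cond in rules
]
violations = df.agg(*agg_exprs).first().asDict()

failed = {name: n for name, n in violations.items() if n and n > 0}
if failed:
    raise ValueError(f"Validation failed (rule -> bad row count): {failed}")
```

Great Expectations and Soda wrap this kind of check with richer reporting, but the trade-off is the same: every rule adds work to the job.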

 

Sanjeeb Mohapatra

RevanthV
New Contributor III

Hey @K_Anudeep, thanks a lot for tagging me in the GitHub issue. This "validate and commit" feature is exactly what I want, and I see you have already raised a PR for it with a new option. I will try it out and check whether it satisfies my use case.