Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Feedback on the data quality and consistency checks in Spark

Coders
New Contributor II

I'm seeking validation from experts regarding the data quality and consistency checks we're implementing as part of a data migration using Spark and Databricks.

Our migration involves transferring data from Azure Data Lake to a different data lake. As part of this process, we've opted to conduct data consistency checks on every row and column once the data has been successfully written to the target data lake.

Here's our current approach:

  1. Upon completion of the migration, we retrieve row counts from both the source and target data frames and compare them.
  2. We perform a schema check, comparing the source and target data frame schemas.
  3. We generate hashes by combining all columns for both the source and target data frames, and then compare the hash IDs.
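
For context, here is a rough PySpark sketch of what these three checks look like on our side. The paths, the Delta format, and the column handling are illustrative rather than our exact code:

```python
from pyspark.sql import functions as F

# Illustrative paths -- replace with the real source and target locations.
source_df = spark.read.format("delta").load("/mnt/source/table")
target_df = spark.read.format("delta").load("/mnt/target/table")

# 1. Row count comparison
counts_match = source_df.count() == target_df.count()

# 2. Schema comparison (column names, types, nullability)
schemas_match = source_df.schema == target_df.schema

# 3. Row-level hash comparison: hash all columns of each row, then diff the two sides.
def row_hashes(df):
    # Cast everything to string and replace nulls with a sentinel so hashing is deterministic.
    cols = [F.coalesce(F.col(c).cast("string"), F.lit("<null>")) for c in sorted(df.columns)]
    return df.select(F.sha2(F.concat_ws("||", *cols), 256).alias("row_hash"))

# With equal row counts, an empty one-way multiset difference implies the sides match.
hash_diff = row_hashes(source_df).exceptAll(row_hashes(target_df))
hashes_match = hash_diff.limit(1).count() == 0

print(counts_match, schemas_match, hashes_match)
```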

If any of these steps fail, we use Delta table versioning to roll back to the previous version of the data.
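
The rollback relies on Delta time travel; roughly like this, where the path and version number are placeholders:

```python
# Find the last pre-migration version in the target table's history.
spark.sql("DESCRIBE HISTORY delta.`/mnt/target/table`").show(truncate=False)

# Roll back the target to that version (12 is only a placeholder).
spark.sql("RESTORE TABLE delta.`/mnt/target/table` TO VERSION AS OF 12")
```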

I'm personally inclined to believe that only the first step, counting rows, is necessary. The other two steps feel unnecessary because we aren't modifying the data during migration; we're simply copying it. I believe these additional steps consume resources and may impact performance.

I'd greatly appreciate expert opinions on the validity of our approach and any guidance you can offer. Thank you.

5 REPLIES

Kaniz_Fatma
Community Manager

Hi @Coders

  • Focusing on row counts is valid, but the other steps serve important purposes too.
  • While copying data seems straightforward, subtle issues can arise (e.g., encoding mismatches, null handling).
  • Consider optimizing hash generation to reduce resource overhead (see the sketch after this list).
  • Regularly monitor performance impact and adjust as needed.
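
For example, one way to reduce the hash overhead is to use the cheaper non-cryptographic xxhash64 function and aggregate the per-row hashes into a single fingerprint per side, so only two values are compared. A minimal sketch, assuming source_df and target_df are already loaded; note this is a probabilistic check, so rare collisions are possible:

```python
from pyspark.sql import functions as F

def table_fingerprint(df):
    # Cast every column to string and replace nulls with a sentinel so they hash consistently.
    cols = [F.coalesce(F.col(c).cast("string"), F.lit("<null>")) for c in sorted(df.columns)]
    # xxhash64 is much cheaper than sha2; summing makes the result independent of row order.
    return df.select(F.sum(F.xxhash64(*cols)).alias("fp")).first()["fp"]

if table_fingerprint(source_df) != table_fingerprint(target_df):
    raise ValueError("Row-level fingerprints differ between source and target")
```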

Good luck with your migration!

Coders
New Contributor II

Thank you for the response. I still don't understand why those issues can arise when we aren't doing any kind of ETL while copying the data.

MichTalebzadeh
Valued Contributor

Hi,

What does "we are copying" mean in this context? I assume that both source and target are relational tables. It is not just rows being copied; the DDL has to be validated between source and target as well. Some fields can be problematic, for example mapping date, datetime, and timestamp columns. My assumption is that you are using Spark for the ETL as well. You will need a complete audit of what you are doing, assuming you are working in a regulated environment where the auditors will want it. The acid test will be to schedule a suite of batch jobs (you will need help from the user community) to run at source and target respectively and produce identical sets of results for the different jobs. It is really an iterative process, best practised first in SIT/UAT environments. Also bear in mind that the SLAs in your target environment must still be met.
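
As an illustration of the DDL point, a quick field-by-field type comparison (assuming the two sides are already loaded as source_df and target_df) will surface the problematic date/datetime/timestamp mappings:

```python
# Build {column name: data type} maps for both sides and report any differences.
src_types = {f.name: f.dataType.simpleString() for f in source_df.schema.fields}
tgt_types = {f.name: f.dataType.simpleString() for f in target_df.schema.fields}

for col in sorted(set(src_types) | set(tgt_types)):
    if src_types.get(col) != tgt_types.get(col):
        print(f"Mismatch on '{col}': source={src_types.get(col)}, target={tgt_types.get(col)}")
```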

HTH


Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one thousand expert opinions" (Wernher von Braun).

Coders
New Contributor II

Hi,

Thank you for the response. When we say we are copying, it's a data migration from one data lake to another; we're not performing any DDL or DML queries using Spark SQL on top of it.

It's a straightforward merge from one data lake to the other using a merge condition.

joarobles
New Contributor III

Hi @Coders, I'd also consider some profiling checks on column stats and distributions, just to be sure everything is consistent after the migration (see the sketch below).
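
A lightweight way to do that in PySpark, assuming source_df and target_df are already loaded and the columns are simple scalar types:

```python
from pyspark.sql import functions as F

def profile(df):
    # Per-column non-null count, distinct count, min and max, collected in a single pass.
    aggs = []
    for c in df.columns:
        aggs += [
            F.count(F.col(c)).alias(f"{c}_non_null"),
            F.countDistinct(F.col(c)).alias(f"{c}_distinct"),
            F.min(F.col(c)).alias(f"{c}_min"),
            F.max(F.col(c)).alias(f"{c}_max"),
        ]
    return df.agg(*aggs).first().asDict()

src_profile, tgt_profile = profile(source_df), profile(target_df)
diffs = {k: (src_profile[k], tgt_profile[k]) for k in src_profile if src_profile[k] != tgt_profile[k]}
print(diffs or "Column profiles match")
```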

Afterwards, you should consider the best practice of implementing data quality validations on the data lake itself. For example, take a look at Rudol Data Quality, which has a native Databricks integration with no-code validations.

Have a high-quality week!
