Databricks Community

Coders · ‎03-07-2024

I'm seeking validation from experts regarding the data quality and consistency checks we're implementing as part of a data migration using Spark and Databricks.

Our migration involves transferring data from Azure Data Lake to a different data lake. As part of this process, we've opted to conduct data consistency checks on every row and column once the data has been successfully written to the target data lake.

Here's our current approach:

Upon completion of migration, we retrieve row counts from both the source and target data frames and compare them.
We perform a schema check on the data frames, comparing between the source and target.
We generate hashes by combining all columns for both the source and target data frames, and then compare the hash IDs.

If any of these steps fail, we use Delta table versioning to roll back to the previous version of the data.

I'm personally inclined to believe that only the first step, counting rows, is necessary. The other two steps feel unnecessary because we aren't modifying the data during migration; we're simply copying it. I believe these additional steps consume resources and may impact performance.

I'd greatly appreciate expert opinions on the validity of our approach and any guidance you can offer. Thank you.

Coders · ‎03-08-2024

Thank you for the response. But still I don’t understand why those things can happen when we are not doing any kind of ETL while copying the data.

MichTalebzadeh · ‎03-08-2024

Hi,

What is the definition of "we are copying" in this context? I assume that both source and target are relational tables. It is not just rows being copied, the DDL has to be validated. between source and target. Some fields can be problematic. For example mapping date, datetime and timestamp columns. My assumption is that you are using Spark for ETL as well. You will need a complete audit of what you are doing, assuming that you are working in a regulated environment and auditors will want it.as well. The acid tests will be that you schedule a suite of batch jobs (you need help from the user community) to run at source and target respectively providing identical set of results for different jobs. It is really an iterative process best practiced in SIT/UAT environments first. Also you will need to bear in mind that SLAs in your target environment is also met.

HTH

HTH

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".

Coders · ‎03-08-2024

Hi,

Thank you for the response. When we say we are copying, it's a data migration from one data lake to another. Not performing any kind of DDL or DML queries using spark SQL on top it.

It's a straightforward merge from one data lake to another using the condition.

joarobles · ‎07-25-2024

Hi @Coders, I'd also consider some profiling checks for column stats and distribution just to be sure everything is consistent after the migration.

Afterwards, you should consider the best-practice of implementing some data quality validations on the data lake. For example, take a look at Rudol Data Quality that has native Databricks integration with no-code validations.

Have a high-quality week!