Feedback on the data quality and consistency checks in Spark
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
03-07-2024 11:19 AM
I'm seeking validation from experts regarding the data quality and consistency checks we're implementing as part of a data migration using Spark and Databricks.
Our migration involves transferring data from Azure Data Lake to a different data lake. As part of this process, we've opted to conduct data consistency checks on every row and column once the data has been successfully written to the target data lake.
Here's our current approach:
- Upon completion of migration, we retrieve row counts from both the source and target data frames and compare them.
- We perform a schema check on the data frames, comparing between the source and target.
- We generate hashes by combining all columns for both the source and target data frames, and then compare the hash IDs.
If any of these steps fail, we use Delta table versioning to roll back to the previous version of the data.
I'm personally inclined to believe that only the first step, counting rows, is necessary. The other two steps feel unnecessary because we aren't modifying the data during migration; we're simply copying it. I believe these additional steps consume resources and may impact performance.
I'd greatly appreciate expert opinions on the validity of our approach and any guidance you can offer. Thank you.
- Labels:
-
Delta Lake
-
Spark
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
03-08-2024 04:45 AM
Thank you for the response. But still I don’t understand why those things can happen when we are not doing any kind of ETL while copying the data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
03-08-2024 06:56 AM
Hi,
What is the definition of "we are copying" in this context? I assume that both source and target are relational tables. It is not just rows being copied, the DDL has to be validated. between source and target. Some fields can be problematic. For example mapping date, datetime and timestamp columns. My assumption is that you are using Spark for ETL as well. You will need a complete audit of what you are doing, assuming that you are working in a regulated environment and auditors will want it.as well. The acid tests will be that you schedule a suite of batch jobs (you need help from the user community) to run at source and target respectively providing identical set of results for different jobs. It is really an iterative process best practiced in SIT/UAT environments first. Also you will need to bear in mind that SLAs in your target environment is also met.
HTH
HTH
London
United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh
Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
03-08-2024 08:58 AM
Hi,
Thank you for the response. When we say we are copying, it's a data migration from one data lake to another. Not performing any kind of DDL or DML queries using spark SQL on top it.
It's a straightforward merge from one data lake to another using the condition.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-25-2024 08:50 AM
Hi @Coders, I'd also consider some profiling checks for column stats and distribution just to be sure everything is consistent after the migration.
Afterwards, you should consider the best-practice of implementing some data quality validations on the data lake. For example, take a look at Rudol Data Quality that has native Databricks integration with no-code validations.
Have a high-quality week!

