03-07-2024 11:19 AM
I'm seeking validation from experts regarding the data quality and consistency checks we're implementing as part of a data migration using Spark and Databricks.
Our migration involves transferring data from Azure Data Lake to a different data lake. As part of this process, we've opted to conduct data consistency checks on every row and column once the data has been successfully written to the target data lake.
Here's our current approach:
If any of these steps fail, we use Delta table versioning to roll back to the previous version of the data.
I'm personally inclined to believe that only the first step, counting rows, is necessary. The other two steps feel unnecessary because we aren't modifying the data during migration; we're simply copying it. I believe these additional steps consume resources and may impact performance.
I'd greatly appreciate expert opinions on the validity of our approach and any guidance you can offer. Thank you.
03-08-2024 04:02 AM
Hi @Coders,
Good luck with your migration!
03-08-2024 04:45 AM
Thank you for the response. But still I don’t understand why those things can happen when we are not doing any kind of ETL while copying the data.
03-08-2024 06:56 AM
Hi,
What is the definition of "we are copying" in this context? I assume that both source and target are relational tables. It is not just rows being copied, the DDL has to be validated. between source and target. Some fields can be problematic. For example mapping date, datetime and timestamp columns. My assumption is that you are using Spark for ETL as well. You will need a complete audit of what you are doing, assuming that you are working in a regulated environment and auditors will want it.as well. The acid tests will be that you schedule a suite of batch jobs (you need help from the user community) to run at source and target respectively providing identical set of results for different jobs. It is really an iterative process best practiced in SIT/UAT environments first. Also you will need to bear in mind that SLAs in your target environment is also met.
HTH
HTH
03-08-2024 08:58 AM
Hi,
Thank you for the response. When we say we are copying, it's a data migration from one data lake to another. Not performing any kind of DDL or DML queries using spark SQL on top it.
It's a straightforward merge from one data lake to another using the condition.
07-25-2024 08:50 AM
Hi @Coders, I'd also consider some profiling checks for column stats and distribution just to be sure everything is consistent after the migration.
Afterwards, you should consider the best-practice of implementing some data quality validations on the data lake. For example, take a look at Rudol Data Quality that has native Databricks integration with no-code validations.
Have a high-quality week!
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group