I'm seeking expert validation of the data quality and consistency checks we're implementing as part of a data migration using Spark and Databricks.
We're migrating data from Azure Data Lake to a different data lake. As part of this process, we've opted to run data consistency checks on every row and column once the data has been written to the target data lake.
Here's our current approach:
- Once the migration completes, we retrieve row counts from both the source and target DataFrames and compare them.
- We compare the schemas of the source and target DataFrames.
- We generate a per-row hash by combining all columns for both the source and target DataFrames, and then compare the resulting hashes (see the sketch after this list).
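
For reference, here is a minimal sketch of the three checks. It assumes `source_df` and `target_df` are the DataFrames read back from the source and target data lakes (hypothetical names), and the hashing approach (SHA-256 over all columns cast to string) is illustrative rather than our exact implementation:

```python
from pyspark.sql import functions as F

# 1. Row count comparison
counts_match = source_df.count() == target_df.count()

# 2. Schema comparison (field names, types, and nullability)
schemas_match = source_df.schema == target_df.schema

# 3. Row-level hash comparison: hash every column of each row, then look for
#    hashes present in the source but missing from the target.
def with_row_hash(df):
    # Cast all columns to string and concatenate with a delimiter before hashing,
    # to reduce accidental collisions between adjacent column values.
    return df.withColumn(
        "row_hash",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns]), 256),
    )

missing_rows = (
    with_row_hash(source_df).select("row_hash")
    .exceptAll(with_row_hash(target_df).select("row_hash"))
)
hashes_match = missing_rows.limit(1).count() == 0
```

One design note on this style of hashing: `concat_ws` skips null values, so a delimiter-based hash can collide for rows that differ only in which column is null.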
If any of these checks fails, we use Delta table versioning to roll back the target table to its previous version.
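
For completeness, a sketch of the rollback step, assuming the target is a Delta table registered as `target_table` (a hypothetical name) and that the pre-migration state is still available as an earlier version within the table's retention period:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")

# Inspect the table history to find the last version written before the migration.
target.history().select("version", "timestamp", "operation").show(truncate=False)

# Restore to a known-good version (the version number here is illustrative).
target.restoreToVersion(3)
```

The SQL equivalent is `RESTORE TABLE target_table TO VERSION AS OF 3`.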
I'm personally inclined to believe that only the first step, the row count, is necessary. The other two checks feel unnecessary because we aren't transforming the data during the migration; we're simply copying it. I also suspect these additional steps consume cluster resources and may hurt performance.
I'd greatly appreciate expert opinions on the validity of our approach and any guidance you can offer. Thank you.