ponnapv
Databricks Employee

Introduction

Enterprise Data Warehouse (EDW) migration use cases are complex in nature. Data architects and data engineers spend the majority of their time analyzing the source data, identifying ingestion patterns, and defining the transformation logic during the design phase of the migration. Even after ingesting the source data and converting the workloads to the modern data platform, migration projects can still slow down during the data validation phase, so it's important to understand the pitfalls and have a mitigation plan.

Data validation is an important phase for the success of the project, and we shall discuss how to conquer this last mile of migration!

 

Migration Strategy

A migration project starts with the discovery phase, in which the source system is analyzed using data warehouse profilers that provide a summary of the total inventory along with its complexity. At this stage, the migration strategy is defined, picking one of the following patterns:

  • Lift & Shift 
  • Modernize & Improve
  • Replatform

[Figure: Migration strategy patterns]

Along with it, other details are defined for each step of the migration process, including:

  • Data Ingestion & Modeling pattern for migration
  • Orchestration and workload conversion approach 
  • Workspace setup
  • Data Validation strategy
  • User Acceptance Criteria and Test Strategy
  • Cutover plan

Key Takeaway - It’s important to recognize the value of having a clear strategy for data validation in the life cycle of a migration project and to define the validation approach as early as possible.

 

Migration Steps

An EDW migration flows through multiple steps, starting from ingestion and transformation and moving on to querying data, visualization, and serving. Here is a reference architecture depicting how different components of a modern data platform align with the migration steps:

[Figure: Reference architecture mapping modern data platform components to the migration steps]

Migration steps include:

  • Ingestion - Bring the source data into Databricks using native ingestion support such as Lakeflow Connect, Auto Loader, COPY INTO, or Arcion, or pick a partner data ingestion/ETL tool from Partner Connect (see the Auto Loader sketch after this list)
  • Transformation - Data transformation workloads can be migrated to the Databricks platform as Workflows, which support creating a Delta Live Tables pipeline or a Job. Workflows can be used for both ingestion and transformation of data. The aggregated data is used to extract features and build models, and also for data analysis using visualization tools
  • Visualization - The cleansed and aggregated data is available for data analysis. Databricks provides SQL Warehouse (a compute resource) for query execution and AI/BI Dashboards for easy visualization, and it also supports integration with partner BI tools
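For illustration, the ingestion step might use Auto Loader along the following lines. This is a minimal sketch, not the prescribed approach for any given project; the volume paths and table name are placeholders.

# Minimal Auto Loader sketch: incrementally load landed source files into a Bronze table
# (paths and table names below are illustrative placeholders)
(spark.readStream
    .format("cloudFiles")                         # Auto Loader source
    .option("cloudFiles.format", "csv")           # format of the landed source files
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
    .load("/Volumes/main/landing/orders")         # landing location of the extracted EDW data
    .writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
    .trigger(availableNow=True)                   # process all available files, then stop
    .toTable("main.bronze.orders"))               # Bronze table validated later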

Key Takeaway - The ingested and transformed data, along with the visualization dashboards, have to be independently validated to make sure every step of the migration matches and aligns with the source system.

 

Data Validation Approach

Data validation on a migration project happens at 3 stages:

[Figure: The three stages of data validation]

  1. Raw/Bronze layer data: Data validation starts with the first step of migration, which is data ingestion. As the source data gets routed to the Databricks platform, it's important to validate data in its raw form. The first layer of the Medallion architecture holds the data in its raw form, which can be compared against the source data.
  2. Transformed/Gold layer data: The data transformation happens across two layers of the Medallion architecture: data cleansing in the Silver layer and aggregation in the Gold layer, where data is saved in a format that is easy to analyze using BI reporting systems. This is the most important stage of validation, and a very detailed row/column-level comparison is recommended for the Gold layer data.
  3. BI Reports: The final phase of migration is to compare the BI reports pointing to the source system against those pointing to the migrated data. The BI system should go through minimal changes, just repointing to the migrated data source while keeping all the queries in the dashboards/widgets intact. This is often just a re-confirmation of the transformed data.

Key Takeaway - The depth of validation varies from customer to customer and also depends on the level of maturity of the data platform. A good balance across all stages can be found by prioritizing row/column-level validation for the Gold layer data, simplifying the Bronze layer validation by defining KPIs and matching row counts (for example, total number of orders, daily inventory, or parts count), and keeping the BI report validation to minimal/basic checks.
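As a simple example of the Bronze-layer check described above, a row-count comparison can be scripted in a notebook. This is a hedged sketch: the table names are placeholders, and the source count could equally come from a federated or JDBC read (see the tools section below).

# Minimal row-count check between the source and the Bronze layer
# ("edw_source.sales.orders" and "main.bronze.orders" are placeholder names)
src_count = spark.table("edw_source.sales.orders").count()    # source table, e.g. via Lakehouse Federation
bronze_count = spark.table("main.bronze.orders").count()      # ingested Bronze table

if src_count == bronze_count:
    print(f"Row counts match: {src_count}")
else:
    print(f"Row count mismatch: source={src_count}, bronze={bronze_count}")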

 

Data Validation Tools

Data Validation can be performed using one of the following tools: 

  1. Lakehouse Federation - Federation supports several cloud-hosted data sources that can be easily connected and accessed from Databricks using simple queries, just like accessing any Delta table. Refer to the example with 3 steps (sketched after this list):
    1. Create a database connection 
    2. Create a foreign catalog using Lakehouse federation
    3. Execute the query
  2. JDBC/ODBC Connection - Query the source database using JDBC and compare the results with the data on Databricks. Refer to the example code (in the Reference section below) that follows these steps:
    1. Define the database connection to the source database to read the data
    2. Read the source data into a dataframe
    3. Read the migrated data into a dataframe
    4. Compare the two dataframes using DataComPy
    5. Generate a report
  3. Accelerator Frameworks - There are multiple accelerator frameworks that help in validating the data on Databricks against the source data. These accelerators are tested and hardened to provide consistent results. Customers have a choice of accelerators from the options below:
    1. Use the Databricks accelerator - Remorph
    2. Use a 3rd party vendor accelerator - BladeBridge, LeapLogic, etc.
    3. Build a custom data validation tool (using the details provided in options 1 & 2)

One or a combination of the above tools can be used to validate the data at the first two stages of the validation approach.
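As an illustration of option 1, the three Lakehouse Federation steps can be run from a notebook roughly as follows. This is a sketch that assumes a PostgreSQL source; the connection, catalog, database, and table names are placeholders, and in practice the credentials would come from a secret scope rather than being hard-coded.

# Step 1 - Create a connection to the source database (PostgreSQL assumed here; values are placeholders)
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS edw_source_conn TYPE postgresql
  OPTIONS (host 'hostname', port '5432', user 'username', password 'password')
""")

# Step 2 - Create a foreign catalog on top of the connection
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS edw_source
  USING CONNECTION edw_source_conn
  OPTIONS (database 'sales_db')
""")

# Step 3 - Query the federated source just like a Delta table and compare it with the migrated data
src_cnt = spark.table("edw_source.public.orders").count()
tgt_cnt = spark.table("main.gold.orders").count()
print(f"source={src_cnt}, migrated={tgt_cnt}")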

Key Takeaway - An accelerator always provides the nudge required to speed up the process and reduce the total time taken to complete validations. I have noticed more than a 30% productivity gain and a big boost in customer confidence when we use an accelerator for validation. It is recommended to identify and customize accelerators that can automate and simplify the validation process.

 

Challenges with the Last-mile of Migration

As with any phase of migration, data validation has its own challenges, but they are often ignored until they block the whole process. Understanding some of these challenges ahead of time and setting a mitigation plan will keep the migration process on the path to success.

 

Challenges and mitigation plans:

  • Challenge: Identifying the level of validation required based on the project complexity
    Mitigation: Socialize the different levels of the validation approach with the stakeholders and scope out what is required for the project during the initial phase of the migration
  • Challenge: Identifying the right tools and defining the validation approach ahead of time
    Mitigation: Accelerators can bring in the required support to enhance the validation process and reduce the total time needed to complete the validation successfully. Identify the accelerator as part of the PoC and have the required customizations ready
  • Challenge: Multiple execution cycles of validating the same set of data
    Mitigation: Automate the validation to simplify reruns (a sketch follows below)
  • Challenge: The back and forth between the multiple teams involved in validation makes it hard to keep the project on track
    Mitigation: Keep all the SMEs engaged and aligned for signoff
  • Challenge: Evolving the validation process
    Mitigation: Keep a log of all the exceptions captured during the validation process and feed them back into the automation process
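The last two mitigations lend themselves to automation. As a rough sketch (the catalog, schema, and table names are placeholders), a small driver can loop over a config of source/target table pairs and append the results to an audit table, which keeps reruns cheap and leaves a log of exceptions to fold back into the process.

# Rough sketch of an automated, rerunnable validation pass
# (all catalog/schema/table names are placeholders)
from datetime import datetime
from pyspark.sql import Row

validation_config = [
    {"source": "edw_source.public.orders",    "target": "main.gold.orders"},
    {"source": "edw_source.public.customers", "target": "main.gold.customers"},
]

results = []
for pair in validation_config:
    src_cnt = spark.table(pair["source"]).count()
    tgt_cnt = spark.table(pair["target"]).count()
    results.append(Row(
        source_table=pair["source"],
        target_table=pair["target"],
        source_count=src_cnt,
        target_count=tgt_cnt,
        status="MATCH" if src_cnt == tgt_cnt else "MISMATCH",
        run_ts=datetime.now(),
    ))

# Append the outcome of every run to an audit table so exceptions stay logged across reruns
spark.createDataFrame(results).write.mode("append").saveAsTable("main.audit.validation_results")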

 

Conclusion

Data validation sounds like a simple step in the migration process and is oftentimes overlooked in the design phase, which can lead to a potential slowdown of the overall migration. Understanding the challenges and defining a validation approach early in the project lifecycle helps in navigating this phase easily and thus decreases the time taken to complete a successful migration.


 

Reference

Data Validation Steps

import datacompy

# Step 1 & 2 - Define the JDBC connection to the source database and read the source data into a dataframe
# (hostname, credentials, and table names below are placeholders)
df_source = spark.read \
   .format("jdbc") \
   .option("url", "jdbc:oracle:thin:@//hostname:port/service") \
   .option("dbtable", "schema.table") \
   .option("user", "username") \
   .option("password", "password") \
   .option("driver", "oracle.jdbc.driver.OracleDriver") \
   .load()
#display(df_source)

# Step 3 - Read the migrated data into a dataframe
df_delta = spark.read.table("catalog.schema.table")
# or, for a path-based table: spark.read.format("delta").load("/path/to/delta/table")
#display(df_delta)

# Step 4 - Compare the two dataframes
# datacompy.Compare expects pandas dataframes, so convert the Spark dataframes first
# (suitable when the compared data fits in driver memory)
compare = datacompy.Compare(
   df_source.toPandas(),
   df_delta.toPandas(),
   join_columns='col1',  # You can also specify a list of columns
   abs_tol=0.0001,
   rel_tol=0,
   df1_name='source',
   df2_name='delta')
#OR
#compare = datacompy.Compare(df_source.toPandas(), df_delta.toPandas(), join_columns=['col1', 'col2'])

# Step 5 - Generate the report
print(compare.report())

 

Contributors