Introduction
Enterprise Data Warehouse (EDW) migrations are complex by nature. Data architects and data engineers spend the majority of the design phase analyzing the source data, identifying ingestion patterns, and defining the transformation logic. Yet even after the source data is ingested and the workloads are converted to the modern data platform, a migration project can still slow down during the data validation phase, so it's important to understand the pitfalls and have a mitigation plan.
Data validation is critical to the success of the project, and this post discusses how to conquer this last mile of migration!
Migration Strategy
A migration project starts with a discovery phase in which the source system is analyzed using data warehouse profilers that summarize the total inventory along with its complexity. At this stage the migration strategy is defined, picking one of the following patterns:
- Lift & Shift
- Modernize & Improve
- Replatform
Alongside this, other details for each step of the migration process are defined, including:
- Data Ingestion & Modeling pattern for migration
- Orchestration and workload conversion approach
- Workspace setup
- Data Validation strategy
- User Acceptance Criteria and Test Strategy
- Cutover plan
Key Takeaway - It’s important to recognize the value of a clear data validation strategy in the life cycle of a migration project and to define the validation approach as early as possible.
Migration Steps
An EDW migration flows through multiple steps, starting from ingestion and transformation through to querying, visualization, and serving. A reference architecture would depict how the different components of a modern data platform align with these migration steps.
Migration steps include:
- Ingestion - Bring the source data into Databricks using native ingestion options such as Lakeflow Connect, Auto Loader, COPY INTO, or Arcion, or pick a partner data ingestion/ETL tool from Partner Connect (see the Auto Loader sketch after this list)
- Transformation - Data transformation workloads can be migrated to the Databricks platform as Workflows, which support creating a Delta Live Tables pipeline or a Job. Workflows can be used for both ingestion and transformation of data. The aggregated data is used to extract features and build models, and also for data analysis using visualization tools
- Visualization - The cleansed and aggregated data is available for analysis. Databricks provides SQL Warehouses (compute resources) for query execution and AI/BI Dashboards for easy visualization, and it also integrates with partner BI tools
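As a minimal sketch of the ingestion step, the snippet below uses Auto Loader to incrementally load files landed from the source EDW into a Bronze Delta table. The storage paths, file format, and table names are illustrative assumptions, not part of the original migration:

```python
# Minimal Auto Loader sketch: incrementally ingest CSV extracts landed in
# cloud storage into a Bronze Delta table. Paths and names are illustrative.
(spark.readStream
    .format("cloudFiles")                  # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
    .option("header", "true")
    .load("/Volumes/main/landing/orders")  # hypothetical landing path
    .writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
    .trigger(availableNow=True)            # process all pending files, then stop
    .toTable("main.bronze.orders"))
```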
Key Takeaway - The ingested and transformed data, along with the visualization dashboards, have to be independently validated to make sure every step of the migration matches and aligns with the source system.
Data Validation Approach
Data validation on a migration project happens at three stages:
- Raw/Bronze layer data: Data validation starts with the first step of migration, that is, data ingestion. As the source data is routed to the Databricks platform, it's important to validate the data in its raw form. The first layer of the Medallion architecture holds the data in raw form, which can be compared against the source data.
- Transformed/Gold layer data: Data transformation happens across two layers of the Medallion architecture: data cleansing in the Silver layer and aggregation in the Gold layer, where data is saved in a format that is easy to analyze with BI reporting systems. This is the most important stage of validation. A very detailed row/column-level comparison is recommended for the Gold layer data (see the sketch after this list).
- BI Reports: The final phase of migration is to compare BI reports pointing at the source system with those pointing at the migrated data. The BI system should go through minimal changes, ideally only repointing to the migrated data source while keeping all the queries in the dashboards/widgets intact. This is often just a re-confirmation of the transformed data.
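A minimal sketch of a row-level Gold layer comparison, assuming the source table is reachable from Databricks (for example through Lakehouse Federation, discussed below) and both tables share the same schema; the table names are illustrative:

```python
# Row-level diff between the source Gold table and the migrated one.
# exceptAll preserves duplicates, so it surfaces exact row-level mismatches.
source = spark.table("edw_source.sales.daily_revenue")   # hypothetical source table
target = spark.table("main.gold.daily_revenue")          # hypothetical migrated table

print("Rows missing in target:   ", source.exceptAll(target).count())
print("Unexpected rows in target:", target.exceptAll(source).count())
```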
Key Takeaway - The depth of validation varies from customer to customer and also depends on the maturity of the data platform. A good balance across the stages is to prioritize row/column-level validation for the Gold layer, simplify Bronze layer validation by defining KPIs and matching row counts (total number of orders, daily inventory, parts count, etc.), and keep BI report validation to minimal/basic checks.
Data Validation Tools
Data Validation can be performed using one of the following tools:
- Lakehouse Federation - Federation supports several cloud-hosted data sources that can be easily connected and accessed from Databricks using simple queries, just like accessing any Delta table. Refer to the sketch after this list, which follows 3 steps:
- Create a database connection
- Create a foreign catalog using Lakehouse federation
- Execute the query
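A minimal sketch of these three steps, assuming a MySQL source; the host, credentials, and catalog/table names are placeholders (in practice the credentials would come from a secret scope):

```python
# Step 1: Create a connection to the source database (placeholder host/credentials).
spark.sql("""
CREATE CONNECTION IF NOT EXISTS edw_conn TYPE mysql
OPTIONS (host 'edw.example.com', port '3306', user 'validator', password '<redacted>')
""")

# Step 2: Create a foreign catalog that exposes the source database in Unity Catalog.
spark.sql("""
CREATE FOREIGN CATALOG IF NOT EXISTS edw_source USING CONNECTION edw_conn
OPTIONS (database 'sales')
""")

# Step 3: Query the federated table like any Delta table, e.g. compare row counts.
spark.sql("""
SELECT 'source' AS side, COUNT(*) AS row_count FROM edw_source.sales.orders
UNION ALL
SELECT 'migrated', COUNT(*) FROM main.bronze.orders
""").show()
```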
- JDBC/ODBC Connection - Query the source database using JDBC and compare the results with the data on Databricks. Refer to the example code in the Reference section below, which follows these steps:
- Define the database connection to the source database to read the data
- Read the source data into a dataframe
- Read the migrated data into a dataframe
- Compare the two dataframes using DataComPy
- Generate a report
- Accelerator Frameworks - Multiple accelerator frameworks help validate the data on Databricks against the source data. These accelerators are tested and hardened to provide consistent results. Customers can choose an accelerator from the options below:
- Use Databricks Accelerator - Remorph
- Use a 3rd party vendor Accelerator - BladeBridge, LeapLogic, etc
- Build a custom data validation tool (using the details provided in options 1 & 2)
One of the above tools, or a combination of them, can be used to validate the data at the first two stages of the validation approach.
Key Takeaway - An accelerator provides the nudge needed to speed up the process and reduce the total time taken to complete validations. I have seen more than a 30% productivity gain and a big boost in customer confidence when an accelerator is used for validation. It is recommended to identify and customize accelerators that can automate and simplify the validation process.
Challenges with the Last-mile of Migration
As with any phase of migration, data validation has its own challenges, but they are often ignored until they block the whole process. Understanding these challenges ahead of time and setting up a mitigation plan keeps the migration on the path to success.
| Challenges | Mitigation Plan |
| --- | --- |
| Identifying the level of validation required based on project complexity | Socialize the different levels of validation with the stakeholders and scope out what is required for the project during the initial phase of migration |
| Identifying the right tools and defining the validation approach ahead of time | Accelerators bring the support needed to enhance the validation process and reduce the total time to complete validation successfully; identify the accelerator as part of the PoC and have the required customizations ready |
| Multiple execution cycles validating the same set of data | Automate validation to simplify reruns |
| Back and forth between the multiple teams involved in validation, which makes it hard to keep the project on track | Keep all the SMEs aligned and accountable for signoff |
| Evolving the validation process | Keep a log of all the exceptions captured during validation and integrate it back into the automation process |
Conclusion
Data validation sounds like a simple step in the migration process and is often overlooked in the design phase, which can lead to a slowdown of the overall migration. Understanding the challenges and defining a validation approach early in the project lifecycle helps in navigating this phase smoothly and decreases the time taken to complete a successful migration.
Reference
Data Validation Steps
The snippet below is a minimal sketch of the JDBC + DataComPy comparison described above; the Teradata JDBC URL, secret scope, table names, and join key are illustrative placeholders.
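```python
import datacompy

# Step 1: Define the JDBC connection to the source database (placeholder
# Teradata URL; credentials come from a hypothetical secret scope).
jdbc_url = "jdbc:teradata://edw.example.com/DATABASE=sales"

# Step 2: Read the source table into a DataFrame over JDBC.
source_df = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "sales.orders")
    .option("user", dbutils.secrets.get("edw", "user"))
    .option("password", dbutils.secrets.get("edw", "password"))
    .load())

# Step 3: Read the migrated Delta table into a DataFrame.
migrated_df = spark.table("main.gold.orders")

# Step 4: Compare the two DataFrames with DataComPy. For illustration both
# are converted to pandas; DataComPy also ships Spark-native comparators
# for tables that do not fit in memory.
compare = datacompy.Compare(
    source_df.toPandas(),
    migrated_df.toPandas(),
    join_columns=["order_id"],   # hypothetical business key
    df1_name="source",
    df2_name="databricks",
)

# Step 5: Generate a human-readable comparison report.
print(compare.report())
```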