cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Community Articles
Dive into a collaborative space where members like YOU can exchange knowledge, tips, and best practices. Join the conversation today and unlock a wealth of collective wisdom to enhance your experience and drive success.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Building an Incremental Customer Data Migration Workflow in Databricks

naveen0808
New Contributor

Building an Incremental Customer Data Migration Workflow in Databricks

By Naveen Ayalla

Introduction

In many enterprise environments, customer data is spread across legacy systems that were originally designed for operational processing rather than modern analytics. When this data needs to support reporting, pattern discovery, and AI-assisted exploration, moving it into a scalable lakehouse platform becomes an important step.

While incremental loading is a common pattern, the main challenge in this use case was applying it in a controlled way for customer and address data that needed both reliable refreshes and easier downstream exploration. The migration was not only about moving records into Databricks, but also about preparing the data for trusted analysis after it landed in the lakehouse.

In one of my recent projects, I worked on migrating customer and address data from SAP HANA into Databricks. The goal was not just to move data from one system to another. The objective was to build a workflow that could support incremental refreshes, improve data quality, maintain governed access, and make the data easier to explore after migration.

This article shares the approach I followed, the design considerations behind it, and how Databricks helped connect the engineering and analytics parts of the workflow.

Why Incremental Loading Was Needed

The source tables contained a large volume of customer and address records. Since these records continued to change over time, a full reload for every refresh was not practical.

A full load can be useful during an initial migration, but it becomes inefficient when the same large dataset has to be processed repeatedly. It increases processing time, consumes more compute, and makes refresh cycles harder to manage.

To avoid this, I used an incremental loading approach in PySpark. The logic was based on timestamp fields available in the source data. During each run, the pipeline identified records that were newly created or updated after the last successful load and processed only those changes.

This helped reduce unnecessary processing and made the workflow more suitable for ongoing use.

High-Level Migration Flow

At a high level, the workflow followed these steps:

  1. Read customer and address data from SAP HANA.

  2. Identify new and changed records using timestamp-based filters.

  3. Apply duplicate checks before loading the final tables.

  4. Write the processed data into Delta tables in Databricks.

  5. Use the Delta tables for reporting, analysis, and AI-assisted exploration.

The focus was to keep the pipeline simple, reliable, and easy to monitor. Since the data involved customer-related information, governance and controlled access were also part of the design from the beginning.

Using PySpark for Change Detection

PySpark was useful because it allowed the pipeline to handle large datasets while keeping the transformation logic flexible.

For the incremental load, I maintained the latest processed timestamp from the previous successful run. On the next execution, the pipeline compared that value against the timestamp column in the source table. Any record with a newer timestamp was treated as part of the current incremental batch.

The basic logic looked like this:

source_updated_timestamp > last_processed_timestamp

This pattern helped avoid repeated processing of unchanged records. It also made the pipeline easier to scale because each run focused only on the records that actually needed to be refreshed.

In practice, incremental loading also requires attention to edge cases. Late-arriving records, null timestamp values, duplicate updates, and failed runs all need to be considered. A timestamp-based approach works well, but it has to be implemented carefully so the pipeline remains reliable over time.

Handling Duplicate Records During Ingestion

Customer and address datasets can contain duplicate or repeated records, especially when they come from long-running operational systems. Before writing the data into the final Delta tables, I added duplicate checks during the ingestion process.

The duplicate checks helped identify repeated records based on key columns and business-relevant combinations. This improved the quality of the final tables and reduced the risk of duplicate records affecting downstream reports or analysis.

The goal was not to solve every possible data quality issue in one step. Instead, the focus was to prevent obvious duplicate records from flowing into the curated layer. Even basic validation at this stage can improve trust in the data and reduce cleanup work later.

Writing the Data to Delta Tables

After the incremental data was processed and validated, it was written into Delta tables in Databricks.

Delta Lake provided a reliable foundation for the migrated data. It supported the update and refresh patterns needed for incremental processing and made the data easier to consume for downstream analytics.

Once the data was available as Delta tables, it could be used for SQL analysis, dashboards, and AI-assisted exploration. This helped move the project beyond migration and made the data more useful for analysis.

Exploring the Data with Databricks Genie

After the customer and address tables were available in Databricks, I connected them to a Genie workspace for exploration.

This changed the way the data could be analyzed. Instead of manually writing every SQL query or creating every dashboard from scratch, users could ask questions in natural language and quickly generate useful views of the data.

For example, the data could be explored by asking questions about customer distribution, address groupings, record counts, or relationships between customer and address records.

This was useful because customer and address data often becomes more meaningful when viewed together. A customer record may provide profile-level information, while address data can add location and grouping context. Looking at both together can reveal patterns that are not always clear when each table is analyzed separately.

Pattern Discovery Across Related Data

One of the more interesting parts of the project was exploring relationships across multiple tables. In many cases, the most useful insight did not come from a single table. It came from understanding how customer records connected to address records and how those records were grouped.

Using Databricks, I was able to explore these relationships more quickly and test different ways of visualizing the data. Hierarchical views were useful for understanding how records were organized across different levels.

The AI-assisted layer helped speed up this exploration. It did not replace validation, modeling, or engineering work. Instead, it made it easier to ask follow-up questions and identify areas that deserved deeper analysis.

What Worked Well

A few design choices made the workflow more effective.

First, incremental loading reduced the amount of data processed during each run. This improved efficiency and made the refresh process easier to manage.

Second, duplicate checks during ingestion improved the quality of the final tables. This helped reduce avoidable issues before the data reached the reporting layer.

Third, storing the data in Delta tables made it easier to support downstream use cases. The data was no longer limited to the source system and could be explored more flexibly in Databricks.

Finally, using Genie made the data more accessible. Users could move from a business question to an initial visualization much faster than with a fully manual dashboard-building process.

Lessons Learned

This project reinforced a few practical lessons.

Incremental logic should be designed carefully. It is important to think about how the pipeline will behave after many runs, not only during the first load.

Data quality checks should be handled as close to ingestion as possible. Even basic duplicate checks can improve trust in the final dataset.

AI-assisted analytics works best when the underlying data is already clean, structured, and governed. The AI layer can help with exploration, but it depends heavily on the quality of the data engineering work beneath it.

Most importantly, migration should not be treated only as the movement of data from one system to another. A good migration should also improve how the data can be refreshed, governed, analyzed, and used after it lands in the target platform.

Conclusion

Migrating customer and address data from SAP HANA to Databricks gave us a more scalable and flexible foundation for analysis. By using PySpark-based incremental loading, duplicate checks, Delta tables, and Databricks Genie, the workflow supported both efficient data engineering and faster exploration.

For me, the biggest takeaway was that data engineering and AI-assisted analytics work best together. The engineering layer makes the data reliable, controlled, and scalable. The AI layer makes it easier to ask questions, explore relationships, and identify useful patterns.

For this use case, Databricks helped connect the engineering work with the analysis work in a practical way. The data was easier to refresh, easier to govern, and easier to explore once it was available in the lakehouse.

#Databricks 

1 REPLY 1

Brahmareddy
Esteemed Contributor II

Great write-up, Naveen. Very practical and clear.

I really like how you focused not just on migration, but on building a reliable incremental workflow with proper duplicate handling and governance. Thatโ€™s where real value comes from.

Also, connecting Delta tables with Genie for exploration is a strong touch. Solid engineering first, then AI-assisted analytics on top. Well done