Migrating several hundred SQL Server tables to Databricks Delta Lake for a Lakebridge Proof of Concept (PoC) is usually handled with custom pipelines, especially when you want to filter on a date/time column and migrate only the last two years of data. Official documentation for this exact scenario is indeed sparse, and a custom implementation is common when you need selective data loading and operational cost transparency. Below is a structured approach to the project and its cost considerations, along with key points about Lakebridge.
Key Points on Lakebridge and Data Migration
Lakebridge is Databricks' open-source migration toolkit for moving warehouse workloads, including SQL Server, onto the Databricks platform; it focuses on migration assessment, code conversion, and reconciliation rather than on bulk data movement itself. Out-of-the-box wizards may not be available for this exact scenario, so the best practices generally include:
- Building a custom ETL pipeline: Use Spark (via Databricks) or another ETL orchestration tool to extract data from SQL Server, filter on the date/time column, and write directly to Delta Lake tables on Databricks.
- Incremental data loads: For large tables and ongoing updates, adopt CDC (Change Data Capture) or Delta Live Tables for efficient synchronization.
- Filtering data: Use pushdown queries (e.g., `SELECT ... WHERE date_column >= DATEADD(year, -2, GETDATE())`) so that only the required two-year slice leaves SQL Server; see the sketch after this list.
- Schema evolution: Delta Lake supports schema evolution, so minor changes in table structure are handled gracefully, but evaluate for any incompatible changes.
- Automation: Consider scripting the migration with Python/PySpark notebooks, Databricks Workflows/Jobs, or orchestration tools such as Azure Data Factory or Apache Airflow.
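As a concrete illustration of the pushdown filtering above, here is a minimal PySpark sketch for a Databricks notebook (where `spark` and `dbutils` are predefined). The hostname, database, secret scope, table, and column names are placeholders for your own environment, not values taken from the PoC:

```python
# Read only the last two years of one SQL Server table via a JDBC pushdown
# query, then land it as a Delta table. All identifiers below are placeholders.

jdbc_url = "jdbc:sqlserver://<sqlserver-host>:1433;databaseName=<source_db>"

# The WHERE clause executes on SQL Server, so only ~2 years of rows cross the network.
pushdown_query = """
    (SELECT *
     FROM dbo.sales_orders
     WHERE order_date >= DATEADD(year, -2, GETDATE())) AS src
"""

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", pushdown_query)
    .option("user", dbutils.secrets.get("migration", "sql_user"))        # assumes a 'migration' secret scope
    .option("password", dbutils.secrets.get("migration", "sql_password"))
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Write the filtered data as a managed Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("bronze.sales_orders")
```

For very large tables, the standard Spark JDBC options `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` can be added so the read is parallelized across the cluster instead of running as a single query.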
Sample Pipeline Steps
- Connect to SQL Server from Databricks: Use the JDBC connector with credentials stored securely (e.g., in Databricks secrets) over a private, secured network path between the on-prem environment and your workspace.
- Extract data: Use SQL queries that filter on the required date column, e.g., `WHERE event_date >= DATEADD(year, -2, GETDATE())`.
- Transform & load: Map data types where needed, handle nulls, and ingest the results as Delta tables.
- Validation: Run post-migration checks for row counts, completeness, and data quality.
- Automation: Script the process for multiple tables and, if needed, parallelize for efficiency; a sketch of such a loop follows this list.
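Building on the single-table sketch above, the same pattern can be wrapped in a small driver loop with a basic row-count check. The table list, date columns, target schema, and the fixed two-year cutoff are illustrative assumptions, not part of Lakebridge or any Databricks API:

```python
from datetime import date, timedelta

# Fixed cutoff so the extract and the validation read use the same window
# (GETDATE() would move between runs). All table/column names are placeholders.
cutoff = (date.today() - timedelta(days=2 * 365)).isoformat()

tables = [
    # (source table, date column used for the two-year filter)
    ("dbo.sales_orders", "order_date"),
    ("dbo.shipments", "shipped_at"),
    ("dbo.invoices", "invoice_date"),
]

def migrate_table(source_table: str, date_column: str) -> None:
    pushdown = f"""
        (SELECT * FROM {source_table}
         WHERE {date_column} >= '{cutoff}') AS src
    """
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)  # defined in the earlier sketch
        .option("dbtable", pushdown)
        .option("user", dbutils.secrets.get("migration", "sql_user"))
        .option("password", dbutils.secrets.get("migration", "sql_password"))
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .load()
    )

    target = "bronze." + source_table.split(".")[-1]
    df.write.format("delta").mode("overwrite").saveAsTable(target)

    # Basic completeness check: filtered source rows vs. rows landed in Delta.
    source_count, target_count = df.count(), spark.table(target).count()
    if source_count != target_count:
        raise ValueError(f"{source_table}: expected {source_count} rows, got {target_count}")

for source_table, date_column in tables:
    migrate_table(source_table, date_column)
```

For several hundred tables, the serial loop can be replaced with a parameterized Databricks Job that fans out over the table list, or with a thread pool, so multiple extracts run in parallel against SQL Server within whatever load the source system can tolerate.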
Cost Considerations
Costs will depend on the following:
- Databricks cluster usage: Charges apply for compute resources (DBUs and VM time), which scale with the number of tables, data size, and transformations applied.
- Data egress/network costs: If data is transferred from on-prem SQL Server to Databricks in the cloud, outbound data transfer charges are incurred in your source environment.
- Storage costs: Storing Delta Lake tables in cloud storage (e.g., Azure Data Lake Storage or Amazon S3) carries separate costs.
- Lakebridge tooling: Lakebridge itself is open source, so it typically adds no licensing fee; confirm any charges for partner services or supporting infrastructure with your vendor or cloud provider.
High-Level Cost Components
| Cost Area | Description |
|---|---|
| Databricks Compute | Time spent running Databricks clusters; charged by DBU/hour and instance type |
| Data Out/Egress | Charges for copying data from SQL Server (on-prem/cloud) to Databricks in the cloud |
| Cloud Storage | Fees for storing migrated Delta tables on S3/ADLS/Blob Storage |
| Lakebridge Tooling | Lakebridge is open source; account for any partner services or supporting infrastructure used alongside it (if applicable) |
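To turn these components into a rough PoC estimate, a simple back-of-the-envelope model like the one below can help; every number in it is a placeholder to be replaced with your own cluster sizing and the published rates for your cloud, region, and Databricks tier:

```python
# Back-of-the-envelope PoC cost model. Every value is a placeholder --
# substitute your own estimates and the published rates for your cloud/region.

cluster_hours = 40           # estimated total hours of migration cluster runtime
dbus_per_hour = 8            # DBUs consumed per cluster hour (instance type/count dependent)
dbu_price = 0.0              # $/DBU for your Databricks SKU
vm_price_per_hour = 0.0      # $/hour for the underlying VMs
data_gb = 500                # GB actually transferred (the two-year slice, not the full tables)
egress_price_per_gb = 0.0    # $/GB outbound from the source environment
storage_price_per_gb = 0.0   # $/GB-month for cloud object storage

compute_cost = cluster_hours * (dbus_per_hour * dbu_price + vm_price_per_hour)
egress_cost = data_gb * egress_price_per_gb
storage_cost_per_month = data_gb * storage_price_per_gb

print(f"One-time migration: ~${compute_cost + egress_cost:,.2f}; "
      f"ongoing storage: ~${storage_cost_per_month:,.2f}/month")
```

Note that Delta tables are stored as compressed Parquet under the hood, so the stored footprint is often noticeably smaller than the raw volume transferred from SQL Server.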