Migrating several hundred SQL Server tables to Databricks Delta Lake for a Lakebridge Proof of Concept (PoC) is usually handled with custom pipelines, especially when you want to filter on a date/time column and migrate only the last two years of data. Official documentation for this exact scenario is indeed sparse, and a custom implementation is common when you need selective data loading and operational cost transparency. Below is a structured approach to the project and its cost considerations, along with key points about Lakebridge.
Key Points on Lakebridge and Data Migration
Lakebridge is Databricks' open-source migration toolkit for moving warehouse workloads, including SQL Server, onto the Databricks platform; it focuses on migration assessment, code conversion, and reconciliation rather than on bulk data movement itself. Out-of-the-box wizards may not be available for this exact scenario, so the best practices generally include:
- Building a custom ETL pipeline: Use Spark (via Databricks) or another ETL orchestration tool to extract data from SQL Server, filter on the date/time column, and write directly to Delta Lake tables on Databricks.
- Incremental data loads: For large tables and ongoing updates, adopt CDC (Change Data Capture) or Delta Live Tables for efficient synchronization.
- Filtering data: Use pushdown queries (e.g., `SELECT ... WHERE date_column >= DATEADD(year, -2, GETDATE())`) so that only the required two-year slice leaves SQL Server; see the sketch after this list.
- Schema evolution: Delta Lake supports schema evolution, so minor changes in table structure are handled gracefully, but evaluate for any incompatible changes.
- Automation: Consider scripting the migration with Python/PySpark notebooks, Databricks Workflows/Jobs, or orchestration tools such as Azure Data Factory or Apache Airflow.
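As a concrete illustration of the pushdown filtering above, here is a minimal PySpark sketch for a Databricks notebook (where `spark` and `dbutils` are predefined). The hostname, database, secret scope, table, and column names are placeholders for your own environment, not values taken from the PoC:

```python
# Read only the last two years of one SQL Server table via a JDBC pushdown
# query, then land it as a Delta table. All identifiers below are placeholders.

jdbc_url = "jdbc:sqlserver://<sqlserver-host>:1433;databaseName=<source_db>"

# The WHERE clause executes on SQL Server, so only ~2 years of rows cross the network.
pushdown_query = """
    (SELECT *
     FROM dbo.sales_orders
     WHERE order_date >= DATEADD(year, -2, GETDATE())) AS src
"""

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", pushdown_query)
    .option("user", dbutils.secrets.get("migration", "sql_user"))        # assumes a 'migration' secret scope
    .option("password", dbutils.secrets.get("migration", "sql_password"))
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Write the filtered data as a managed Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("bronze.sales_orders")
```

For very large tables, the standard Spark JDBC options `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` can be added so the read is parallelized across the cluster instead of running as a single query.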
Sample Pipeline Steps
- Connect to SQL Server from Databricks: Use the JDBC connector with credentials stored securely (e.g., in Databricks secrets) over a private, secured network path between the on-prem environment and your workspace.
- Extract data: Use SQL queries that filter on the required date column, e.g., `WHERE event_date >= DATEADD(year, -2, GETDATE())`.
- Transform & load: Map data types where needed, handle nulls, and ingest the results as Delta tables.
- Validation: Run post-migration checks for row counts, completeness, and data quality.
- Automation: Script the process for multiple tables and, if needed, parallelize for efficiency; a sketch of such a loop follows this list.
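Building on the single-table sketch above, the same pattern can be wrapped in a small driver loop with a basic row-count check. The table list, date columns, target schema, and the fixed two-year cutoff are illustrative assumptions, not part of Lakebridge or any Databricks API:

```python
from datetime import date, timedelta

# Fixed cutoff so the extract and the validation read use the same window
# (GETDATE() would move between runs). All table/column names are placeholders.
cutoff = (date.today() - timedelta(days=2 * 365)).isoformat()

tables = [
    # (source table, date column used for the two-year filter)
    ("dbo.sales_orders", "order_date"),
    ("dbo.shipments", "shipped_at"),
    ("dbo.invoices", "invoice_date"),
]

def migrate_table(source_table: str, date_column: str) -> None:
    pushdown = f"""
        (SELECT * FROM {source_table}
         WHERE {date_column} >= '{cutoff}') AS src
    """
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)  # defined in the earlier sketch
        .option("dbtable", pushdown)
        .option("user", dbutils.secrets.get("migration", "sql_user"))
        .option("password", dbutils.secrets.get("migration", "sql_password"))
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .load()
    )

    target = "bronze." + source_table.split(".")[-1]
    df.write.format("delta").mode("overwrite").saveAsTable(target)

    # Basic completeness check: filtered source rows vs. rows landed in Delta.
    source_count, target_count = df.count(), spark.table(target).count()
    if source_count != target_count:
        raise ValueError(f"{source_table}: expected {source_count} rows, got {target_count}")

for source_table, date_column in tables:
    migrate_table(source_table, date_column)
```

For several hundred tables, the serial loop can be replaced with a parameterized Databricks Job that fans out over the table list, or with a thread pool, so multiple extracts run in parallel against SQL Server within whatever load the source system can tolerate.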
Cost Considerations
Costs will depend on the following:
- Databricks cluster usage: Charges apply for compute resources (DBUs and VM time), which scale with the number of tables, data size, and transformations applied.
- Data egress/network costs: If data is transferred from on-prem SQL Server to Databricks in the cloud, outbound data transfer charges are incurred in your source environment.
- Storage costs: Storing Delta Lake tables in cloud storage (e.g., Azure Data Lake Storage or Amazon S3) carries separate costs.
- Lakebridge tooling: Lakebridge itself is open source, so it typically adds no licensing fee; confirm any charges for partner services or supporting infrastructure with your vendor or cloud provider.
High-Level Cost Components
| Cost Area | Description |
|---|---|
| Databricks Compute | Time spent running Databricks clusters; charged by DBU/hour and instance type |
| Data Out/Egress | Charges for copying data from SQL Server (on-prem/cloud) to Databricks in the cloud |
| Cloud Storage | Fees for storing migrated Delta tables on S3/ADLS/Blob Storage |
| Lakebridge Tooling | Lakebridge is open source; account for any partner services or supporting infrastructure used alongside it (if applicable) |
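To turn these components into a rough PoC estimate, a simple back-of-the-envelope model like the one below can help; every number in it is a placeholder to be replaced with your own cluster sizing and the published rates for your cloud, region, and Databricks tier:

```python
# Back-of-the-envelope PoC cost model. Every value is a placeholder --
# substitute your own estimates and the published rates for your cloud/region.

cluster_hours = 40           # estimated total hours of migration cluster runtime
dbus_per_hour = 8            # DBUs consumed per cluster hour (instance type/count dependent)
dbu_price = 0.0              # $/DBU for your Databricks SKU
vm_price_per_hour = 0.0      # $/hour for the underlying VMs
data_gb = 500                # GB actually transferred (the two-year slice, not the full tables)
egress_price_per_gb = 0.0    # $/GB outbound from the source environment
storage_price_per_gb = 0.0   # $/GB-month for cloud object storage

compute_cost = cluster_hours * (dbus_per_hour * dbu_price + vm_price_per_hour)
egress_cost = data_gb * egress_price_per_gb
storage_cost_per_month = data_gb * storage_price_per_gb

print(f"One-time migration: ~${compute_cost + egress_cost:,.2f}; "
      f"ongoing storage: ~${storage_cost_per_month:,.2f}/month")
```

Note that Delta tables are stored as compressed Parquet under the hood, so the stored footprint is often noticeably smaller than the raw volume transferred from SQL Server.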