Move large SQL data into Databricks
07-22-2025 01:25 AM
Hello,
I have a large on-prem SQL Server database (~15 TB) that heavily uses the sql_variant datatype. I would like to move it into a Databricks Bronze layer and keep it synchronized as close to 'live' as possible.
What could be the solution?
It seems like a very basic scenario for Databricks, but somehow I couldn't find any example or explanation.
I tried two approaches, and neither worked:
SQL CDC -> ADF Pipeline -> Blob Storage -> Databricks
- it seems unnecessarily complex and fragile
- I couldn't create a Databricks DLT pipeline that is initialized from a table snapshot and then kept updated by the CDC exports (a rough sketch of what I attempted is below, after the second approach)
Lakeflow Connect
- does not support sql_variant
- changing the SQL schema (to eliminate/replace/convert sql_variant) is not an option for many reasons (size, performance, downtime)
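Roughly what I attempted for the first approach (table names, key columns and paths are simplified placeholders): Auto Loader picks up the CDC export files that the ADF pipeline lands in blob storage, and apply_changes merges them into a Bronze table. The part I couldn't solve is seeding that table from the initial full snapshot before the CDC feed takes over.

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical landing path where the ADF pipeline drops the CDC exports
CDC_PATH = "abfss://cdc@mystorage.dfs.core.windows.net/orders/"

@dlt.view
def orders_cdc_feed():
    # Auto Loader incrementally reads new CDC export files (Parquet assumed)
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(CDC_PATH)
    )

# Streaming target table in the Bronze layer
dlt.create_streaming_table("orders_bronze")

# Merge the CDC feed into the target; metadata columns follow SQL Server CDC naming
dlt.apply_changes(
    target="orders_bronze",
    source="orders_cdc_feed",
    keys=["order_id"],                              # placeholder primary key
    sequence_by=F.col("__$start_lsn"),              # CDC ordering column
    apply_as_deletes=F.expr("`__$operation` = 1"),  # 1 = delete in SQL Server CDC
    except_column_list=["__$start_lsn", "__$operation"],
)
```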
08-19-2025 04:32 PM
Hi @zychoo ,
I would consider a near-real-time pipeline into Databricks Bronze, something like:
- A log-based CDC tool (Qlik / Debezium / HVR) captures changes from SQL Server.
- The tool serializes sql_variant values to JSON, or to a string plus type metadata.
- It writes to S3/Blob storage in a Delta-friendly format (JSON or Parquet).
- Databricks Auto Loader streams the files into Bronze tables.
- The Silver layer casts the values back using the type metadata if needed.
This avoids snapshots and fragile CDC merge logic in ADF. A rough sketch of the Auto Loader and Silver steps is below.
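A minimal sketch of the last two steps, assuming the CDC tool lands Parquet files in which each sql_variant column has already been split into a value string plus a type column (attr_value / attr_type, the paths and table names are all placeholder names):

```python
from pyspark.sql import functions as F

LANDING_PATH = "abfss://cdc@mystorage.dfs.core.windows.net/orders/"   # hypothetical
CHECKPOINT   = "abfss://chk@mystorage.dfs.core.windows.net/orders/"   # hypothetical

# Bronze: Auto Loader incrementally ingests new CDC files into a Delta table
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", CHECKPOINT + "schema/")
    .load(LANDING_PATH)
    .writeStream
    .option("checkpointLocation", CHECKPOINT)
    .trigger(availableNow=True)
    .toTable("bronze.orders_cdc")
)

# Silver: cast the serialized sql_variant value back using the type metadata
silver_df = (
    spark.read.table("bronze.orders_cdc")
    .withColumn(
        "attr_typed",
        F.when(F.col("attr_type") == "int", F.col("attr_value").cast("int"))
         .when(F.col("attr_type") == "decimal", F.col("attr_value").cast("decimal(38,10)"))
         .when(F.col("attr_type") == "datetime", F.to_timestamp("attr_value"))
         .otherwise(F.col("attr_value"))   # fall back to the raw string
    )
)
silver_df.write.mode("overwrite").saveAsTable("silver.orders")
```

For tighter latency you can replace trigger(availableNow=True) with a continuous trigger such as processingTime="1 minute" and run the stream as an always-on job.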
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa