Move large SQL data into Databricks

zychoo
New Contributor

Hello, 

I have a large on-prem SQL Server database (~15 TB). It heavily uses the sql_variant datatype. I would like to move it into a Databricks Bronze layer and keep it synchronized as close to 'live' as possible.

What could be the solution?
It seems like a very basic scenario for Databricks, but somehow I couldn't find any example or explanation.

I tried two approaches; neither worked:

SQL CDC -> ADF Pipeline -> Blob Storage -> Databricks
- it seems unnecessarily complex and fragile
- I couldn't create a DLT pipeline that is initialized from a table 'snapshot' and then kept updated by the CDC exports (roughly the shape shown in the sketch below)
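For context, this is a simplified sketch of the shape I was aiming for, applying the SQL Server CDC export files with DLT. All paths, table names, and key columns are placeholders. The part I couldn't work out is how to seed the target from an initial full snapshot of the table and only then continue from the CDC exports:

```python
import dlt
from pyspark.sql.functions import col, expr

# CDC export files that the ADF pipeline copies from SQL Server CDC into Blob Storage
@dlt.view
def orders_cdc_feed():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("abfss://landing@mystorage.dfs.core.windows.net/cdc/dbo_orders/")
    )

dlt.create_streaming_table("bronze_orders")

# Apply SQL Server CDC changes; in CDC change tables __$operation = 1 means delete
dlt.apply_changes(
    target="bronze_orders",
    source="orders_cdc_feed",
    keys=["order_id"],
    sequence_by=col("__$start_lsn"),
    apply_as_deletes=expr("__$operation = 1"),
    stored_as_scd_type=1,
)
```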

Lakeflow Connect
- does not support sql_variant
- changing the SQL schema (to eliminate/replace/convert sql_variant) is not an option for several reasons (size, performance, downtime)

WiliamRosa
Contributor III

Hi @zychoo ,

I would consider a near-real-time CDC pipeline into Databricks Bronze, something like:

- A log-based CDC tool (Qlik Replicate / Debezium / HVR) captures changes from SQL Server.
- The tool serializes sql_variant values to JSON, or to a string plus type metadata.
- It writes to S3/Blob in a Delta-friendly format (JSON or Parquet).
- Databricks Auto Loader streams the files into Bronze tables.
- The Silver layer casts values back using the type metadata if needed (see the sketch below).

This avoids snapshots + fragile CDC merge logic in ADF.
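To make the last two steps concrete, here is a minimal sketch, assuming the CDC tool lands JSON files where each sql_variant column arrives as a raw string plus a type tag (here amount_raw / amount_type). All paths, catalog/table names, and columns are placeholders, not a definitive implementation:

```python
from pyspark.sql import functions as F

# Bronze: Auto Loader incrementally ingests the CDC tool's JSON output from Blob/S3
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
    .load("abfss://landing@mystorage.dfs.core.windows.net/cdc/orders/")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
    .trigger(availableNow=True)  # or processingTime="1 minute" for near-real-time
    .toTable("main.bronze.orders_cdc")
)

# Silver: cast the serialized sql_variant value back using its type tag.
# Each branch yields NULL when the tag doesn't match, so the typed columns coexist.
silver_stream = (
    spark.readStream.table("main.bronze.orders_cdc")
    .withColumn("amount_int",
        F.when(F.col("amount_type") == "bigint", F.col("amount_raw").cast("bigint")))
    .withColumn("amount_dec",
        F.when(F.col("amount_type") == "decimal", F.col("amount_raw").cast("decimal(38,10)")))
    .withColumn("amount_ts",
        F.when(F.col("amount_type").isin("datetime", "datetime2"),
               F.to_timestamp("amount_raw")))
)

(
    silver_stream.writeStream
    .option("checkpointLocation", "/Volumes/main/silver/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("main.silver.orders")
)
```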

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa