Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Move large SQL data into Databricks

zychoo
New Contributor

Hello, 

I have a large on-prem SQL Server database (~15 TB) that heavily uses the sql_variant datatype. I would like to move it into a Databricks Bronze layer and keep it synchronized as close to 'live' as possible.

What could be the solution? 
It seems like a very basic Databricks scenario, but somehow I couldn't find any example or explanation.

I tried two approaches, neither worked:

SQL CDC -> ADF Pipeline -> Blob Storage -> Databricks
- it seems unnecessarily complex and fragile
- I couldn't create a Databricks DLT pipeline that starts from a table 'snapshot' and is then kept updated by the CDC exports

Lakeflow Connect
- does not support sql_variant
- changing the SQL schema (to eliminate/replace/convert sql_variant) is not an option, for several reasons (size, performance, downtime)

1 REPLY

WiliamRosa
New Contributor II

Hi @zychoo ,

I would consider a "near-real-time" ingestion pipeline into the Databricks Bronze layer, something like:

- A log-based CDC tool (Qlik / Debezium / HVR) captures changes from SQL Server.
- The tool serializes each sql_variant value to JSON, or to a string plus type metadata.
- It writes to S3/Blob storage in a Delta-friendly format (JSON or Parquet).
- Databricks Auto Loader streams the files into Bronze tables (see the first sketch after this list).
- The Silver layer casts values back using the type metadata if needed (second sketch below).
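
Roughly, the Auto Loader step could look like this minimal Python sketch, assuming the CDC tool drops JSON files in a cloud landing path; the paths and the bronze.sqlserver_cdc_raw table name are placeholders I made up, not anything from your environment:

# Minimal Auto Loader sketch for the Bronze step.
# Paths and table name below are illustrative placeholders (assumptions).
from pyspark.sql import functions as F

landing_path = "abfss://landing@<storage-account>.dfs.core.windows.net/cdc/"            # assumption
checkpoint_path = "abfss://landing@<storage-account>.dfs.core.windows.net/_chk/bronze"  # assumption

bronze_stream = (
    spark.readStream                                   # `spark` is the notebook's SparkSession
        .format("cloudFiles")                          # Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(landing_path)
        .withColumn("_ingested_at", F.current_timestamp())
)

(bronze_stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="1 minute")                # near-real-time micro-batches
    .toTable("bronze.sqlserver_cdc_raw"))              # assumption: Bronze table name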

This avoids snapshots + fragile CDC merge logic in ADF.
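
For the Silver step, here is a hedged sketch of the cast-back, assuming the CDC tool wrote each sql_variant as a raw string plus its SQL Server base type; the variant_value / variant_base_type column names and the table names are my assumptions, not a standard layout:

# Rough sketch of the Silver cast-back (column and table names are assumptions).
from pyspark.sql import functions as F

bronze = spark.table("bronze.sqlserver_cdc_raw")       # assumption: Bronze table from the sketch above

silver = (
    bronze
    # One typed projection per base type; each is NULL when the row's base type differs.
    .withColumn("value_int", F.when(F.col("variant_base_type") == "int",
                                    F.col("variant_value").cast("int")))
    .withColumn("value_ts", F.when(F.col("variant_base_type") == "datetime",
                                   F.to_timestamp("variant_value")))
    .withColumn("value_dec", F.when(F.col("variant_base_type") == "decimal",
                                    F.col("variant_value").cast("decimal(38,18)")))
)

silver.write.mode("overwrite").saveAsTable("silver.sqlserver_cdc_typed")  # assumption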

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa
