Importing Azure SQL data into Databricks

BearInTheWoods
New Contributor III

Hi,

I am looking at building a data warehouse using Databricks. Most of the data will come from Azure SQL, and we now have Azure SQL CDC enabled to capture changes. I would also like to import the data without paying for additional connectors like Fivetran.

1. Would it be reasonable to create one notebook / Delta Live Tables pipeline per source table? (I sketch a possible single-pipeline loop after the code below.)

2. The first time the pipeline runs there will be no tables or data in Databricks, so I am guessing I need a quick check to see whether the table already exists and, if not, pull the entire table from Azure SQL. I was thinking of something like this (although it doesn't seem to work):

db_name = "AdventureWorks"
table_name = 'SalesLT_Customer'
 
tables_collection = spark.catalog.listTables(db_name)
table_names_in_db = [table.name for table in tables_collection]
table_exists = table_name in table_names_in_db
 
if not table_exists:    
    @dlt.table(
    name=f"SalesLT_Customer",
    comment=f"Original data for SalesLT.Customer"
    )
    def SalesLT_Customer():
        df = spark.read.format("jdbc") \
            .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database") \
            .option("username", "x") \
            .option("password", "x") \
            .option("dbtable", "SalesLt.Customer") \
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
            .load()
        return (df)
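
Or, instead of one notebook per table (question 1), maybe a single pipeline could define all the tables by looping over a list with a factory function? A rough sketch of what I have in mind; the table list and credentials are placeholders:

source_tables = ["SalesLT.Customer", "SalesLT.Product"]  # placeholder list

def make_snapshot_table(src):
    # Factory function so each generated table captures its own source name.
    target = src.replace(".", "_")

    @dlt.table(name=target, comment=f"Original data for {src}")
    def snapshot():
        return (
            spark.read.format("jdbc")
            .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database")
            .option("user", "x")
            .option("password", "x")
            .option("dbtable", src)
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
            .load()
        )

for t in source_tables:
    make_snapshot_table(t)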

For subsequent runs, I would take the data from the CDC tables in Azure SQL. Maybe something like this?

@dlt.table(
    name="CDC_SalesLT_Customer_CT",
    comment="Original CDC data for SalesLT_Customer_CT",
)
def CDC_SalesLT_Customer_CT():
    # Pull the change table that SQL Server CDC populates.
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database")
        .option("user", "x")
        .option("password", "x")
        .option("dbtable", "cdc.SalesLT_Customer_CT")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .load()
    )
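
One gap in the sketch above: it re-reads the entire change table on every run. Perhaps a filter could be pushed down over JDBC so only rows past the last processed LSN come back? Purely hypothetical; the LSN literal here is a placeholder that would really need to be persisted somewhere, e.g. a small state table:

last_lsn = "0x00000000000000000001"  # placeholder; would be loaded from persisted state

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database")
    .option("user", "x")
    .option("password", "x")
    # "query" pushes the WHERE clause down to SQL Server; __$start_lsn is the
    # CDC metadata column that orders changes.
    .option("query", f"SELECT * FROM cdc.SalesLT_Customer_CT WHERE __$start_lsn > {last_lsn}")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)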

Does that seem reasonable so far?


ravinchi
New Contributor III

@BearInTheWoods Hi! Were you able to create DLT tables using the CDC feature from sources like SQL tables? I'm in a similar situation. You need to leverage the apply_changes() function and create_streaming_live_table(), but that requires an intermediate table, which I'm trying to avoid.
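
For reference, this is roughly the shape of that pattern as I understand it. Only a sketch: the intermediate table name and the key column are assumptions, and the source must be streaming-readable, which is exactly why the intermediate table is hard to avoid.

import dlt
from pyspark.sql.functions import col, expr

# Target table that apply_changes keeps up to date.
dlt.create_streaming_live_table("saleslt_customer")

dlt.apply_changes(
    target="saleslt_customer",
    source="cdc_saleslt_customer_ct",        # assumed intermediate table holding raw CDC rows
    keys=["CustomerID"],                     # assumed primary key
    sequence_by=col("__$start_lsn"),         # CDC ordering column
    apply_as_deletes=expr("`__$operation` = 1"),  # __$operation 1 = delete in SQL Server CDC
    except_column_list=[
        "__$start_lsn", "__$end_lsn", "__$seqval",
        "__$operation", "__$update_mask",
    ],
)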
