Hi,
I am looking at building a data warehouse using Databricks. Most of the data will come from Azure SQL, and we now have Azure SQL CDC enabled to capture changes. I would also like to import this data without paying for additional connectors such as Fivetran.
1. Would it be reasonable to create one notebook / Delta Live Tables pipeline per source table?
2. The first time the Delta Live Tables pipeline runs there will be no tables or data in Databricks, so I am guessing I need a quick check to see whether the table already exists and, if not, pull the entire table from Azure SQL. I was thinking of something like this (although it doesn't seem to work):
db_name = "AdventureWorks"
table_name = 'SalesLT_Customer'
tables_collection = spark.catalog.listTables(db_name)
table_names_in_db = [table.name for table in tables_collection]
table_exists = table_name in table_names_in_db
if not table_exists:
@dlt.table(
name=f"SalesLT_Customer",
comment=f"Original data for SalesLT.Customer"
)
def SalesLT_Customer():
df = spark.read.format("jdbc") \
.option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database") \
.option("username", "x") \
.option("password", "x") \
.option("dbtable", "SalesLt.Customer") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
return (df)
For subsequent runs, I would take the data from the CDC tables in Azure SQL. Maybe something like this?
@dlt.table(
    name="CDC_SalesLT_Customer_CT",
    comment="Original CDC data for cdc.SalesLT_Customer_CT"
)
def CDC_SalesLT_Customer_CT():
    # Read the change rows captured by Azure SQL CDC over JDBC
    df = spark.read.format("jdbc") \
        .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database") \
        .option("user", "x") \
        .option("password", "x") \
        .option("dbtable", "cdc.SalesLT_Customer_CT") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load()
    return df
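Since both reads use the same connection details, I was also wondering whether it makes sense to factor the JDBC read into a small shared helper that each table definition calls with just the source table name. A rough sketch of what I mean (the read_from_azure_sql name is just a placeholder I made up, and it assumes spark and dlt are available as in the blocks above):

# Placeholder helper: wraps the common JDBC options so each DLT table
# definition only has to supply the source table name.
def read_from_azure_sql(dbtable):
    return spark.read.format("jdbc") \
        .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database") \
        .option("user", "x") \
        .option("password", "x") \
        .option("dbtable", dbtable) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load()

@dlt.table(
    name="CDC_SalesLT_Customer_CT",
    comment="Original CDC data for cdc.SalesLT_Customer_CT"
)
def CDC_SalesLT_Customer_CT():
    # Same CDC read as above, expressed via the shared helper
    return read_from_azure_sql("cdc.SalesLT_Customer_CT")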
Does that seem reasonable so far?