Hi,
I am looking at building a data warehouse using Databricks. Most of the data will come from Azure SQL, and we now have Azure SQL CDC enabled to capture changes. I would also like to import this data without paying for additional connectors such as Fivetran.
1. Would it be reasonable to create one notebook / Delta Live Tables pipeline per source table?
2. The first time the Delta Live Tables pipeline runs there will be no tables or data in Databricks, so I am guessing I need a quick check to see whether the table already exists and, if not, pull the entire table from Azure SQL. I was thinking of something like this (although it doesn't seem to work):
db_name = "AdventureWorks"
table_name = 'SalesLT_Customer'
tables_collection = spark.catalog.listTables(db_name)
table_names_in_db = [table.name for table in tables_collection]
table_exists = table_name in table_names_in_db
if not table_exists:
@dlt.table(
name=f"SalesLT_Customer",
comment=f"Original data for SalesLT.Customer"
)
def SalesLT_Customer():
df = spark.read.format("jdbc") \
.option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database") \
.option("username", "x") \
.option("password", "x") \
.option("dbtable", "SalesLt.Customer") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
return (df)
For subsequent runs, I would take the data from the CDC tables in Azure SQL. Maybe something like this?
@dlt.table(
    name="CDC_SalesLT_Customer_CT",
    comment="Original CDC data for cdc.SalesLT_Customer_CT"
)
def CDC_SalesLT_Customer_CT():
    # Read the change rows captured by Azure SQL CDC over JDBC
    df = spark.read.format("jdbc") \
        .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database") \
        .option("user", "x") \
        .option("password", "x") \
        .option("dbtable", "cdc.SalesLT_Customer_CT") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load()
    return df
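Since both reads use the same connection details, I was also wondering whether it makes sense to factor the JDBC read into a small shared helper that each table definition calls with just the source table name. A rough sketch of what I mean (the read_from_azure_sql name is just a placeholder I made up, and it assumes spark and dlt are available as in the blocks above):

# Placeholder helper: wraps the common JDBC options so each DLT table
# definition only has to supply the source table name.
def read_from_azure_sql(dbtable):
    return spark.read.format("jdbc") \
        .option("url", "jdbc:sqlserver://sql.database.windows.net;databaseName=database") \
        .option("user", "x") \
        .option("password", "x") \
        .option("dbtable", dbtable) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load()

@dlt.table(
    name="CDC_SalesLT_Customer_CT",
    comment="Original CDC data for cdc.SalesLT_Customer_CT"
)
def CDC_SalesLT_Customer_CT():
    # Same CDC read as above, expressed via the shared helper
    return read_from_azure_sql("cdc.SalesLT_Customer_CT")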
Does that seem reasonable so far?