Databricks Community

Khalil · ‎07-13-2023

Hello,

I have some data which are lying into Snowflake, so I want to apply CDC on them using delta live table but I am having some issues.

Here is what I am trying to do:

@dlt.view()
def table1():
   return spark.read.format("snowflake").options(**options).option('query', query).load()

def.create_streaming_table(target)
dlt.apply_changes(
source = 'table1'
target = 'target'
....
)

The same code run well if I am reading a delta table but if its snowflake am having the following error

'org.apache.spark.sql.AalysisException: Source data for the APPLY CHANGES target 'XXXXX' must be a streaming query'

Is there a solution or a workaround you can help me with?

-werners- · ‎07-17-2023

The CDC for delta live works fine for delta tables, as you have noticed. However it is not a full blown CDC implementation/software.

If you want to capture changes in Snowflake, you will have to implement some CDC method on Snowflake itself, and read those changes into Databricks.

There are several approaches to this, like using Snowflake Streams
or a commercial CDC software.

Depending on your scenario, you will also have to put some event queue between Databricks and Snowflake (like Kafka or Pulsar or ...).

Khalil · ‎07-23-2023

Ok I got the point and thank you for your respond.

So here is how my data is organised

I have 2 tables in Snowflake
- table1 : weekly table containing all the good data
- table2 : table contains only 1 week of logs for the changes that happened in the first data (updates, deletes, ...)

I should be working with the table1, but as it grows fast and I can't always load it into databricks anytime in a materialised table, the idea were