
Incremental ingestion of Snowflake data with Delta Live Table (CDC)

Khalil
Contributor

Hello,

I have some data sitting in Snowflake, and I want to apply CDC to it using Delta Live Tables, but I am running into an issue.

Here is what I am trying to do:

import dlt

@dlt.view()
def table1():
    # Batch read from Snowflake -- this is not a streaming source
    return spark.read.format("snowflake").options(**options).option('query', query).load()

dlt.create_streaming_table('target')

dlt.apply_changes(
    source = 'table1',
    target = 'target',
    ....
)

The same code runs fine when I read from a Delta table, but with Snowflake I get the following error:

'org.apache.spark.sql.AnalysisException: Source data for the APPLY CHANGES target 'XXXXX' must be a streaming query'
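For reference, the Delta variant that works looks like this, since a Delta table can be read as a stream (the table name is just a placeholder):

import dlt

@dlt.view()
def table1():
    # Reading a Delta table as a stream satisfies the requirement that
    # the APPLY CHANGES source be a streaming query.
    return spark.readStream.table("my_delta_source")   # placeholder name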

Is there a solution or a workaround you can help me with?

5 REPLIES

-werners-
Esteemed Contributor III

CDC with Delta Live Tables works fine for Delta tables, as you have noticed. However, it is not a full-blown CDC implementation.

If you want to capture changes in Snowflake, you will have to implement some CDC method on Snowflake itself and read those changes into Databricks.

There are several approaches to this, like using Snowflake Streams or commercial CDC software.

Depending on your scenario, you will also have to put an event queue between Snowflake and Databricks (like Kafka or Pulsar or ...).
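To illustrate, here is a rough sketch of what the Databricks side could look like if the Snowflake changes were pushed to a Kafka topic first. Everything below (brokers, topic, event schema, key and sequencing columns) is an assumption for the example, not something tied to your setup:

import dlt
from pyspark.sql.functions import col, expr, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema of the change events pushed from Snowflake
change_schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("op", StringType()),            # e.g. INSERT / UPDATE / DELETE
    StructField("changed_at", TimestampType()),
])

@dlt.view()
def snowflake_changes():
    # Streaming read from the event queue sitting between Snowflake and Databricks
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
        .option("subscribe", "snowflake-cdc")               # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), change_schema).alias("c"))
        .select("c.*")
    )

dlt.create_streaming_table("target")

dlt.apply_changes(
    source = "snowflake_changes",
    target = "target",
    keys = ["id"],
    sequence_by = col("changed_at"),
    apply_as_deletes = expr("op = 'DELETE'"),
)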

OK, I got the point, and thank you for your response.

So here is how my data is organised:

  • I have 2 tables in Snowflake:
    • table1: a weekly table containing all the good data
    • table2: a table containing only one week of logs of the changes that happened in table1 (updates, deletes, ...)

I should be working with table1, but since it grows fast I can't keep loading the whole of it into Databricks as a materialised table, so the idea was:

  • to load table1 into Databricks once
  • to use table2 every week to update table1 by applying CDC with DLT

What do you think would be the best approach in this case if we are working with DLT?
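Concretely, something like this is what I have in mind, assuming a separate batch job lands each weekly table2 extract in a Delta table that DLT can read as a stream (all table, key and CDC column names below are just examples):

import dlt
from pyspark.sql.functions import col, expr

@dlt.view()
def table2_changes():
    # Assumption: a batch job appends each weekly table2 extract to the
    # Delta table "landing.table2_changes", so DLT can stream the new rows.
    return spark.readStream.table("landing.table2_changes")

# Streaming target kept up to date from the weekly change log; the initial
# table1 snapshot would be loaded separately (e.g. as a one-off backfill).
dlt.create_streaming_table("table1")

dlt.apply_changes(
    source = "table2_changes",
    target = "table1",
    keys = ["id"],                             # assumed business key
    sequence_by = col("changed_at"),           # assumed ordering column
    apply_as_deletes = expr("op = 'DELETE'"),  # assumed operation column
)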

In the end, I followed the steps from this blog, and everything works fine.

https://www.databricks.com/blog/2022/04/25/simplifying-change-data-capture-with-databricks-delta-liv...

I just adapted it to my case, where the sources are tables rather than flat files.

Happy reading!

Hi @Khalil,
Can you share whether you worked with Unity Catalog or HMS?

Hi @data-engineer-d ,

I am using HMS, but at the same time I am currently experimenting with UC, as we are planning to adopt it for better data management.
