Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

CDC with Snapshot - next_snapshot_and_version() function

Pw76
New Contributor II

I am trying to use create_auto_cdc_from_snapshot_flow (formerly apply_changes_from_snapshot()); see https://docs.databricks.com/aws/en/dlt/cdc#cdc-from-snapshot.

I am attempting to apply SCD type 2 changes using historic snapshot data.

In the first couple of steps of my process I need to parse JSON, then explode and flatten it. In storage, I am basically getting a single JSON file every day that represents an entire table from my source system.
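To make the parse/explode/flatten step concrete, here is a minimal pure-Python sketch of what the daily file looks like and what the flattening produces (all field names here are made up for illustration; in the pipeline itself this would be done with PySpark's JSON parsing and explode):

```python
import json

# Hypothetical daily file: one JSON document representing an entire
# source table, with a nested "attrs" object per record.
raw = '''
{
  "extracted_at": "2024-06-01",
  "records": [
    {"id": 1, "name": "a", "attrs": {"city": "NYC", "tier": "gold"}},
    {"id": 2, "name": "b", "attrs": {"city": "LA",  "tier": "silver"}}
  ]
}
'''

def explode_and_flatten(doc: dict) -> list[dict]:
    """Explode the records array into rows and flatten nested attrs."""
    rows = []
    for rec in doc["records"]:
        flat = {k: v for k, v in rec.items() if k != "attrs"}
        for k, v in rec.get("attrs", {}).items():
            flat[f"attrs_{k}"] = v
        flat["extracted_at"] = doc["extracted_at"]
        rows.append(flat)
    return rows

rows = explode_and_flatten(json.loads(raw))
print(rows[0])
# {'id': 1, 'name': 'a', 'attrs_city': 'NYC', 'attrs_tier': 'gold', 'extracted_at': '2024-06-01'}
```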

I am running into issues trying to get the CDC process to work as described in the documentation.

1. While trying to implement the "next_snapshot_and_version" function, I am running into issues referencing previous tables (or views) in my medallion structure that are needed to build the first element of the function's output, i.e. the DataFrame representing the snapshot. I get the following error whenever I try to reference any of the other tables I have created in my process:

[REFERENCE_DLT_DATASET_OUTSIDE_QUERY_DEFINITION] Referencing DLT dataset [table I am trying to access in fully qualified form] outside the dataset query definition (i.e., @dlt.table annotation) is not supported. Please read it instead inside the dataset query definition.

I've only been able to find a few examples of implementations of the "next_snapshot_and_version" function, but in each case they build the DataFrame from storage-level data rather than from other DLT tables/views. I am not sure whether this is a limitation of the approach, i.e. that I simply cannot reference a DLT table inside my implementation of the required "next_snapshot_and_version" function. I do, however, need to parse/explode/flatten the JSON data first, so going back to the raw storage data is not an option for me.
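For readers unfamiliar with the contract: per the docs, the function takes the latest processed snapshot version (or None on the first run) and returns either a (snapshot DataFrame, version) pair or None to stop processing. The version-advancing part of that logic can be isolated as plain Python, independent of Spark (the yyyymmdd version scheme below is just an illustrative assumption):

```python
from typing import Optional

def next_version(available_versions: list[int],
                 latest_processed: Optional[int]) -> Optional[int]:
    """Return the smallest available snapshot version newer than the
    last one processed, or None when nothing is left to process."""
    candidates = sorted(v for v in available_versions
                        if latest_processed is None or v > latest_processed)
    return candidates[0] if candidates else None

# Daily snapshots encoded as yyyymmdd integers (made-up values).
versions = [20240601, 20240602, 20240603]
print(next_version(versions, None))       # 20240601 (first run)
print(next_version(versions, 20240602))   # 20240603
print(next_version(versions, 20240603))   # None -> processing stops
```

Inside the real "next_snapshot_and_version" you would then load the DataFrame for the chosen version and return (df, version); the error above suggests that this load cannot go through datasets defined in the same pipeline.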

Does anyone have any suggestions or examples that could help, or know whether my approach is incorrect? Thank you.


3 REPLIES

nkarwa
New Contributor II

@Pw76 - I was wondering if you found a solution? I have a similar use case: I want to create an archive/snapshot table using DLT from a non-streaming source (a materialized view) plus a one-time backfill table. I thought create_auto_cdc_from_snapshot_flow could help here, but no luck.

Pw76
New Contributor II

So far I have not found a solution to this. I was, however, able to prove that the approach works by keeping the DLT tables I access to define my snapshot out of the same pipeline. The snapshot and the next_snapshot_and_version function I had to write worked and produced the target table I wanted; it only seems to choke when I try to put it all together in one pipeline.

Also, I've found that unless you need your target table to recognize that records are no longer present in your source snapshot, you can just use regular AUTO CDC (without snapshot). It is much easier, and you don't need to implement the problematic next_snapshot_and_version function.
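To illustrate that distinction: detecting that a record has disappeared requires comparing consecutive full snapshots, which is what the snapshot flavor does for you and plain AUTO CDC does not. A toy pure-Python sketch of that diff (keys and values are made up):

```python
def diff_snapshots(prev: dict, curr: dict) -> dict:
    """Classify keyed rows between two full snapshots into the
    change events an SCD type 2 target would need to record."""
    return {
        "inserted": sorted(k for k in curr if k not in prev),
        "deleted":  sorted(k for k in prev if k not in curr),
        "updated":  sorted(k for k in curr
                           if k in prev and prev[k] != curr[k]),
    }

prev = {1: "gold", 2: "silver", 3: "bronze"}
curr = {1: "gold", 2: "gold", 4: "silver"}
print(diff_snapshots(prev, curr))
# {'inserted': [4], 'deleted': [3], 'updated': [2]}
```

The "deleted" bucket is the part you lose with regular AUTO CDC, since a feed of changed rows never mentions rows that simply stopped appearing.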

nkarwa
New Contributor II

@Pw76 - but AUTO CDC (similar to apply_changes) needs a streaming source, and I have a DLT materialized view that gets flushed and refilled.
