Change data feed from target tables of APPLY CHANGES

afk
New Contributor III

Up until yesterday I was (sort of) able to read changes from target tables of apply changes operations (either through table_changes() or using readChangeFeed). I say sort of because the meta columns (_change_type, _commit_version, _commit_timestamp) were missing from readChangeFeed, but since there were no deletes on my source table I could read upserts just fine. I'm using a pattern similar to this message on Stack Overflow: Databricks Delta Live Tables - Apply Changes from delta table
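Roughly, the pattern looks like this (table names are placeholders; this is a sketch of my setup rather than exact code):

    # Read the change feed of the APPLY CHANGES target from the next stage
    changes = (
        spark.readStream
        .option("readChangeFeed", "true")
        .table("catalog.schema.silver_table")  # placeholder: APPLY CHANGES target
    )

    # Equivalent read via the table_changes table-valued function
    changes_df = spark.sql(
        "SELECT * FROM table_changes('catalog.schema.silver_table', 1)"
    )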

As of today I'm getting 

 

AnalysisException: [STREAMING_TABLE_OPERATION_NOT_ALLOWED.APPLY_CHANGES_UNSUPPORTED_OPERATION] The operation SELECT CHANGE DATA FEED is not allowed: The operation is not supported on APPLY CHANGES Streaming Tables.

 

I guess I should have taken the fact that the meta columns were missing as an indication that I was doing something wrong. How do you read changes from APPLY CHANGES Streaming Tables?

4 REPLIES

Kaniz
Community Manager

Hi @afk, It seems you've been navigating the intricacies of Databricks Delta Live Tables and Change Data Capture (CDC). Let's unravel this together!

 

Change Data Capture (CDC):

  • CDC is a process that identifies and captures incremental changes (inserts, updates, and deletes) in databases, for example tracking customer, order, or product status for near-real-time data applications.
  • By capturing CDC events, Databricks users can re-materialize the source table as a Delta Table in a Lakehouse and run their analysis on top of it, while combining data with external systems.

Delta Live Tables (DLT):

  • DLT simplifies CDC by allowing users to ingest CDC data seamlessly using SQL and Python.
  • Earlier CDC solutions with delta tables used the MERGE INTO operation, which required manual ordering of data to avoid failure when multiple rows of the source dataset matched while attempting to update the same rows of the target Delta table.
  • To handle out-of-order data, an extra step was needed to preprocess the source table using a foreachBatch implementation to eliminate the possibility of multiple matches, retaining only the latest change for each key (see the sketch after this list).
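For reference, that older MERGE INTO / foreachBatch pattern looked roughly like this. This is a minimal sketch with placeholder table, key, and ordering column names, not a drop-in implementation:

    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame
    from pyspark.sql.functions import col, row_number
    from pyspark.sql.window import Window

    def upsert_to_target(microbatch_df: DataFrame, batch_id: int) -> None:
        # Keep only the latest change per key so MERGE never sees
        # multiple source rows matching the same target row.
        latest = (
            microbatch_df
            .withColumn(
                "_rn",
                row_number().over(
                    Window.partitionBy("id").orderBy(col("sequence_col").desc())
                ),
            )
            .filter("_rn = 1")
            .drop("_rn")
        )

        # "spark" is the active session (implicit in Databricks notebooks).
        target = DeltaTable.forName(spark, "silver_table")  # placeholder name
        (
            target.alias("t")
            .merge(latest.alias("s"), "t.id = s.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    (
        spark.readStream.table("bronze_table")  # placeholder source
        .writeStream
        .foreachBatch(upsert_to_target)
        .option("checkpointLocation", "/tmp/checkpoints/silver_table")  # placeholder path
        .start()
    )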

APPLY CHANGES INTO:

The new APPLY CHANGES INTO operation in DLT pipelines automatically and seamlessly handles out-of-order data without manual intervention.

It simplifies your architecture, making it more efficient and scalable.

You can propagate changes to downstream target tables using the APPLY CHANGES INTO statement.

Before executing the query, ensure that the target streaming table exists to hold the most up-to-date data.

Here's an example of creating a target streaming table:

    -- Create the target streaming table that APPLY CHANGES INTO will maintain (DLT SQL)
    CREATE OR REFRESH STREAMING TABLE target_streaming_table;
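In a Python pipeline the equivalent is roughly the following. This is a minimal sketch; the source view, key, and sequencing column are placeholder names:

    import dlt
    from pyspark.sql.functions import col

    # Declare the target streaming table, then apply CDC events into it.
    dlt.create_streaming_table("target_streaming_table")

    dlt.apply_changes(
        target="target_streaming_table",
        source="cdc_source_view",      # placeholder: a view of CDC events
        keys=["id"],                   # placeholder key column
        sequence_by=col("event_ts"),   # placeholder ordering column
        stored_as_scd_type=1,
    )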

Reading Changes:

  • To get change data while reading the table, set the option readChangeFeed to true.
  • The startingVersion and startingTimestamp options are optional. If neither is provided, the stream returns the latest snapshot of the table at the time of streaming as INSERT rows, and future changes as change data (see the sketch below).
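For example, a minimal sketch of such a read, assuming change data feed is enabled on the table (names are placeholders):

    # Stream changes starting from a specific version; omit startingVersion
    # to get the current snapshot as INSERTs plus all future changes.
    changes = (
        spark.readStream
        .option("readChangeFeed", "true")
        .option("startingVersion", 1)
        .table("catalog.schema.some_table")  # placeholder table name
    )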

Important Note:

  • The target of the APPLY CHANGES INTO query cannot be used as a source for a streaming table; to read from that target, use a materialized view instead.

Keep exploring the fascinating world of data pipelines, and may your streams flow smoothly! 🌊🚀

afk
New Contributor III

Thanks a lot @Kaniz for providing more context for my question. Yes, I've been using CDC in a DLT pipeline and was reading changes to a table from one stage (say, bronze) and applying those to the next stage (say, silver) using APPLY CHANGES INTO. So the target of an APPLY CHANGES INTO operation becomes the source for a change data feed (readChangeFeed) for the next stage. Your post quotes the documentation (which I was also familiar with) in pointing out that "The target of the APPLY CHANGES INTO query cannot be used as a source for a streaming table" so I was using a materialized view to read from that target as recommended.

The error I've been getting since Friday (see my first post, it's a new error, the pipeline was working before) seems to indicate I can't read change data using CDC from a target table of an APPLY CHANGES INTO operation anymore. Am I reading this right?

If so, is an APPLY CHANGES INTO supposed to be the last operation in every pipeline (applied only to my gold tables, essentially)? And if that's the case, what's best practice with propagating changes across stages using DLT?

 

Kaniz
Community Manager

Hi @afk, Thank you for providing context about your DLT pipeline and your challenges with CDC. Let's dive into the details.

 

CDC and APPLY CHANGES INTO:

  • You're correct that the target of an APPLY CHANGES INTO operation cannot be used as a source for a streaming table. This limitation is essential to consider when designing your pipeline.
  • To work around this, youโ€™ve been using a materialized view to read from the target, which aligns with the recommended approach.

Error Since Friday:

  • The new error you're encountering indicates that you can't read change data using CDC from a target table of an APPLY CHANGES INTO operation. Since the pipeline worked before, this behaviour is most likely the result of a recent platform update that now enforces the documented restriction at read time.
  • To pinpoint the issue, review the specific error message and any relevant pipeline logs, and double-check the configuration for conflicting operations.

Best Practices for Propagating Changes:

  • While APPLY CHANGES INTO doesn't necessarily need to be the last operation in every pipeline, it's often applied to gold tables (final, trusted data) rather than intermediate stages.
  • Here are some best practices for propagating changes across stages using DLT:
    • Bronze to Silver: Use CDC to capture changes in the bronze layer and apply them to the silver layer. Materialized views or other mechanisms can help bridge the gap between the target and the next stage (see the sketch after this list).
    • Silver to Gold: Apply changes from the silver layer to the gold layer using APPLY CHANGES INTO. Gold tables represent the authoritative, refined data.
    • Considerations:
      • Order of Operations: Ensure that the order of operations aligns with your data flow. For example, apply transformations and business logic before reaching the gold layer.
      • Data Consistency: Maintain consistency across stages. Explore alternative approaches like materialized views or intermediate tables if you encounter limitations.
      • Monitoring and Debugging: Regularly monitor your pipeline, review logs, and address any issues promptly.
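For the bronze-to-silver hand-off, a minimal Python sketch of the materialized-view approach mentioned above (dataset names are placeholders):

    import dlt

    # A materialized view over the APPLY CHANGES target. Reading it
    # completely with dlt.read (rather than streaming with
    # dlt.read_stream) avoids the restriction on using APPLY CHANGES
    # targets as streaming sources.
    @dlt.table(name="silver_snapshot")
    def silver_snapshot():
        return dlt.read("silver_table")  # placeholder: APPLY CHANGES target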

Remember that DLT pipelines can be complex, and the right design depends on your use case.

 

Feel free to share more details about the error message or any other challenges you're facing, and we can explore further solutions! 🚀

afk
New Contributor III

Thanks @Kaniz for your response. I'm still stuck. How do I "apply [changes] to the silver layer" in a DLT pipeline without using APPLY CHANGES INTO (which I'm reserving for my gold layer)? The only way I see is to revert to using MERGE INTO and foreachBatch.
