cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

DLT CDC/SCD - Taking the latest ID per day

dm7
New Contributor II

Hi I'm creating a DLT pipeline which uses DLT CDC to implement SCD Type 1 to take the latest record using a datetime column which works with no issues:

@dlt.view
def users():
  return spark.readStream.table("source_table")

dlt.create_streaming_table("target_table")

dlt.apply_changes(
  target = "target_table",
  source = "source_table",
  keys = ["Id"],
  sequence_by = col("PublishDateTime"),
  stored_as_scd_type = 1
)

This gives me the following result:

SOURCE

Id DateTime
123100424 1717
123100424 1710
164100424 1704
167100424 1619

TARGET

Id DateTime
123100424 1717
164100424 1704
167100424 1619

Essentially taking the latest record using ID and DateTime fields.

My question now is, how do I edit this code to take the LATEST record PER DAY. Please see below the example using the same table:

SOURCE

Id DateTime Date
123100424 1717100424
123100424 1710100424
123110424 1717110424
164100424 1704100424
164110424 1728110424
165120424 1447120424
165120424 1316120424

TARGET

Id DateTime Date
123100424 1717100424
123110424 1717110424
164100424 1704100424
164110424 1728110424
165120424 1447120424

As you can see, the target table takes the latest Id using datetime, but for EACH DAY not just the latest period.

I'm aware SCD Type 1 does not capture history so may not be the right option here but SCD Type 2 does but unsure of how to implement this. Would be eternally grateful for any advice here thanks

0 REPLIES 0

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now