
dlt.apply_changes_from_snapshot() is not picking up historical data

ImranA
New Contributor III

Hi,
I am having an issue when I load the DLT bronze table for the first time: the load goes through multiple parquet files in a loop (the loop just supplies the snapshot and version sequentially to "dlt.apply_changes_from_snapshot()"), but only the latest parquet file ends up stored in the table.

However, I want the table to keep the records as their values change across the parquet files during the first load, as SCD type 2.

The interesting thing is that after the first load, if any new parquet file comes in, it works fine: it stores the new record that has different values for the same 'ID' and puts an end date on the old record for that 'ID'.
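(For reference, by "works fine" I mean the standard SCD type 2 output, where DLT adds __START_AT and __END_AT columns; with illustrative values it looks roughly like this:

ID | value | __START_AT          | __END_AT
1  | A     | 2024-01-01 00:00:00 | 2024-01-02 00:00:00
1  | B     | 2024-01-02 00:00:00 | null

The __START_AT/__END_AT values come from the version I pass in, which is a formatted datetime.)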
Has anyone experienced this and what could be the cause?

I am using the code below:

 

import dlt

# Create the bronze streaming table that will hold the SCD type 2 history.
dlt.create_streaming_table(
    name=table_name,
    table_properties={"quality": "bronze", "delta.enableChangeDataFeed": "true"}
)

def apply_changes(snapshot, version, keys, table_name):
    # Apply a single snapshot DataFrame with its sortable version to the target.
    dlt.apply_changes_from_snapshot(
        target=table_name,
        snapshot_and_version=(snapshot, version),
        keys=keys,
        stored_as_scd_type=2,
        track_history_except_column_list=["_metadata", "extract_datetime", "formatted_datetime"]
    )

if data_point:
    for dp in data_point:
        nsav = next_snapshot_and_version(dp)
        snapshot = nsav[0]  # the DataFrame containing the snapshot
        version = nsav[1]   # the sortable version (one date at a time)
        print(version)      # the version is a formatted datetime
        apply_changes(snapshot, version, keys, table_name)
else:
    print("No data available in data_point. Skipping processing.")

 

And the loop (for dp in data_point) does provide me with all the snapshots and datetimes, so I am not missing any snapshots or dates.
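For comparison, the Databricks docs also show snapshot_and_version accepting a function instead of a (snapshot, version) tuple; DLT then calls that function itself, passing the last version it processed, until the function returns None. A rough, untested sketch of that pattern (get_next_data_point is a made-up placeholder for however the next parquet file gets located):

import dlt

def next_snapshot_and_version(latest_snapshot_version):
    # DLT calls this with the last processed version (None on the first call).
    # Return (snapshot DataFrame, sortable version), or None when there are
    # no more snapshots to process.
    next_dp = get_next_data_point(latest_snapshot_version)  # hypothetical helper
    if next_dp is None:
        return None
    return (spark.read.parquet(next_dp.path), next_dp.formatted_datetime)

dlt.apply_changes_from_snapshot(
    target=table_name,
    snapshot_and_version=next_snapshot_and_version,  # a function, not a tuple
    keys=keys,
    stored_as_scd_type=2,
    track_history_except_column_list=["_metadata", "extract_datetime", "formatted_datetime"]
)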

2 REPLIES

ImranA
New Contributor III

By "first load" I mean a Full Refresh of the DLT pipeline.

ImranA
New Contributor III

@gchandra can you help me with the above?

In summary: when I loop through the snapshots during a full refresh, the API doesn't build a history in the table of how the data is changing; it only keeps the latest records from the most recent parquet file.
