<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance optimization on auto_cdc_flow in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-optimization-on-auto-cdc-flow/m-p/156567#M54450</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/210475"&gt;@yit337&lt;/a&gt;!&lt;/P&gt;&lt;P&gt;This is expected for AUTO CDC with SCD2. It is not doing a simple append, because it must upsert incoming CDC rows into the target based on the declared keys, and for SCD2 it also maintains historical rows with __START_AT and __END_AT.&lt;/P&gt;&lt;P&gt;So Databricks has to find the matching existing target records before it can close the old version and insert the new one.&lt;/P&gt;&lt;P&gt;Liquid clustering only helps if the generated plan can prune files using a selective predicate. Clustering on factory_code helps queries like WHERE factory_code = 'X', but if every micro-batch contains many or all of the 60 factory_code values, or if the AUTO CDC generated merge/join plan does not produce a selective target-side predicate, you can still see 0% file pruning.&lt;/P&gt;&lt;P&gt;DPP/DFP can also be skipped by the optimizer when the cost of building the pruning filter is not expected to pay off.&lt;/P&gt;&lt;P&gt;To improve this, make sure both columns are in the CDC keys. If the real business key is (factory_code, pk), define it like this:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.functions import col, expr

dp.create_auto_cdc_flow(
    target="gold_fact",
    source="silver_cdf_view",
    keys=["factory_code", "pk"],
    sequence_by=col("_commit_version"),
    stored_as_scd_type=2,
    apply_as_deletes=expr("_change_type = 'delete'")
)&lt;/PRE&gt;&lt;P&gt;If factory_code is only a normal column and not part of keys, clustering by it will not help the target lookup much.&lt;/P&gt;&lt;P&gt;Also try to reduce the CDC source before AUTO CDC, and do not pass unnecessary CDF rows or columns into the flow. For example, you can exclude update_preimage rows unless you need them:&lt;/P&gt;&lt;PRE&gt;@dp.view
def silver_cdf_view():
    return (
        spark.readStream
            .option("readChangeFeed", "true")
            .table("silver_table")
            .where("_change_type IN ('insert', 'update_postimage', 'delete')")
            .select(
                "factory_code",
                "pk",
                "col1",
                "col2",
                "_change_type",
                "_commit_version",
                "_commit_timestamp"
            )
    )&lt;/PRE&gt;&lt;P&gt;materializeSourceTimeMs is source-side time ("time taken to materialize source or determine it is not needed"), so a complex CDF source view can be a big part of the cost too.&lt;/P&gt;&lt;P&gt;Clustering only by factory_code may not be enough in your case, because I am assuming it is too low cardinality. Since pk is the actual lookup key, test one of these in a lower environment:&lt;/P&gt;&lt;PRE&gt;@dp.table(cluster_by=["factory_code", "pk"])
def ...&lt;/PRE&gt;&lt;P&gt;or:&lt;/P&gt;&lt;PRE&gt;@dp.table(cluster_by_auto=True)
def ...&lt;/PRE&gt;&lt;P&gt;I am still thinking about these points, so please test and share your findings &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 11 May 2026 11:06:40 GMT</pubDate>
    <dc:creator>amirabedhiafi</dc:creator>
    <dc:date>2026-05-11T11:06:40Z</dc:date>
    <item>
      <title>Performance optimization on auto_cdc_flow</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-optimization-on-auto-cdc-flow/m-p/156562#M54448</link>
      <description>&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;I've got a fact streaming table, which is updated by SCD2 records from the CDF of a silver table. The join is on pk&amp;nbsp; (hash key generated from dimensions business keys) and factory_code (60 unique values). On each incremental processing, it reads all the existing data from the gold model.&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;STRONG&gt;Why? And how to improve this?&lt;/STRONG&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;I have already enabled liquid clustering on&amp;nbsp;factory_code. It doesn't make sense to set liquid clustering on my hash keys cause these are uniformly distributed through files.&lt;/DIV&gt;&lt;DIV class=""&gt;Based on the Query History, most time is spend on '&lt;SPAN&gt;Time taken to materialize source (or determine it's not needed)'.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;Percentage of file and bytes prunned are both 0%.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 11 May 2026 10:40:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-optimization-on-auto-cdc-flow/m-p/156562#M54448</guid>
      <dc:creator>yit337</dc:creator>
      <dc:date>2026-05-11T10:40:46Z</dc:date>
    </item>
    <item>
      <title>Re: Performance optimization on auto_cdc_flow</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-optimization-on-auto-cdc-flow/m-p/156567#M54450</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/210475"&gt;@yit337&lt;/a&gt;!&lt;/P&gt;&lt;P&gt;This is expected for AUTO CDC with SCD2. It is not doing a simple append, because it must upsert incoming CDC rows into the target based on the declared keys, and for SCD2 it also maintains historical rows with __START_AT and __END_AT.&lt;/P&gt;&lt;P&gt;So Databricks has to find the matching existing target records before it can close the old version and insert the new one.&lt;/P&gt;&lt;P&gt;Liquid clustering only helps if the generated plan can prune files using a selective predicate. Clustering on factory_code helps queries like WHERE factory_code = 'X', but if every micro-batch contains many or all of the 60 factory_code values, or if the AUTO CDC generated merge/join plan does not produce a selective target-side predicate, you can still see 0% file pruning.&lt;/P&gt;&lt;P&gt;DPP/DFP can also be skipped by the optimizer when the cost of building the pruning filter is not expected to pay off.&lt;/P&gt;&lt;P&gt;To improve this, make sure both columns are in the CDC keys. If the real business key is (factory_code, pk), define it like this:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.functions import col, expr

dp.create_auto_cdc_flow(
    target="gold_fact",
    source="silver_cdf_view",
    keys=["factory_code", "pk"],
    sequence_by=col("_commit_version"),
    stored_as_scd_type=2,
    apply_as_deletes=expr("_change_type = 'delete'")
)&lt;/PRE&gt;&lt;P&gt;If factory_code is only a normal column and not part of keys, clustering by it will not help the target lookup much.&lt;/P&gt;&lt;P&gt;Also try to reduce the CDC source before AUTO CDC, and do not pass unnecessary CDF rows or columns into the flow. For example, you can exclude update_preimage rows unless you need them:&lt;/P&gt;&lt;PRE&gt;@dp.view
def silver_cdf_view():
    return (
        spark.readStream
            .option("readChangeFeed", "true")
            .table("silver_table")
            .where("_change_type IN ('insert', 'update_postimage', 'delete')")
            .select(
                "factory_code",
                "pk",
                "col1",
                "col2",
                "_change_type",
                "_commit_version",
                "_commit_timestamp"
            )
    )&lt;/PRE&gt;&lt;P&gt;materializeSourceTimeMs is source-side time ("time taken to materialize source or determine it is not needed"), so a complex CDF source view can be a big part of the cost too.&lt;/P&gt;&lt;P&gt;Clustering only by factory_code may not be enough in your case, because I am assuming it is too low cardinality. Since pk is the actual lookup key, test one of these in a lower environment:&lt;/P&gt;&lt;PRE&gt;@dp.table(cluster_by=["factory_code", "pk"])
def ...&lt;/PRE&gt;&lt;P&gt;or:&lt;/P&gt;&lt;PRE&gt;@dp.table(cluster_by_auto=True)
def ...&lt;/PRE&gt;&lt;P&gt;I am still thinking about these points, so please test and share your findings &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 May 2026 11:06:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-optimization-on-auto-cdc-flow/m-p/156567#M54450</guid>
      <dc:creator>amirabedhiafi</dc:creator>
      <dc:date>2026-05-11T11:06:40Z</dc:date>
    </item>
    <item>
      <title>Re: Performance optimization on auto_cdc_flow</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-optimization-on-auto-cdc-flow/m-p/156587#M54452</link>
      <description>&lt;P&gt;I have already tried all of these. Nothing works&lt;/P&gt;</description>
      <pubDate>Mon, 11 May 2026 12:33:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-optimization-on-auto-cdc-flow/m-p/156587#M54452</guid>
      <dc:creator>yit337</dc:creator>
      <dc:date>2026-05-11T12:33:07Z</dc:date>
    </item>
  </channel>
</rss>