<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Predicate pushdown query in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106166#M42408</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/139168"&gt;@kasiviss42&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This might sound like a rhetorical question, but let’s delve into the complexity of joins and filters and examine how generating a list of 2lakh values affects it.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Let's assume we have fact table with 1 billion record and dimension table with 2 lakh records&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Fact Table = 1 Billion&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Dimension Table = 2 lakh&lt;/SPAN&gt;&lt;/P&gt;&lt;H4&gt;Join - Complexity Considerations:&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Data Volume:&lt;/STRONG&gt; Joining 1 billion rows with 200,000 unique IDs can be computationally expensive.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Shuffling:&lt;/STRONG&gt; The join operation may require significant data shuffling across the cluster nodes, which can be time-consuming and resource intensive.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Indexing:&lt;/STRONG&gt; If the join columns (&lt;EM&gt;&lt;STRONG&gt;ID columns&lt;/STRONG&gt;&lt;/EM&gt;) are indexed, the performance can be improved. Without indexes, the join operation will be slower.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Result Size:&lt;/STRONG&gt; The result set includes columns from both tables, which can increase the amount of data processed and transferred.&lt;/LI&gt;&lt;LI&gt;Reference example, below join query on id column, which is taking&lt;SPAN&gt; average 0.45 to 0.60 seconds. This may vary based on data volume and number of join conditions.&lt;/SPAN&gt;&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_1-1737199034992.png" style="width: 880px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14210i7F02BED1B0AF114C/image-dimensions/880x442?v=v2" width="880" height="442" role="button" title="hariprasad_1-1737199034992.png" alt="hariprasad_1-1737199034992.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;Filter -&amp;nbsp;Complexity Considerations:&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Filter Operation:&lt;/STRONG&gt; This query applies a filter to the fact table, which is generally less complex than a join operation.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Scanning:&lt;/STRONG&gt; The performance of the filter operation depends on the size of the fact table. If the table is large, scanning it can still be time-consuming.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Indexing:&lt;/STRONG&gt; If the &lt;EM&gt;&lt;STRONG&gt;ID column&lt;/STRONG&gt;&lt;/EM&gt; is indexed, the filter operation can be significantly faster. Without an index, the query will perform a full table scan.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Result Size:&lt;/STRONG&gt; The query selects all columns (*), which can increase the amount of data processed and transferred, but it is still likely to be less than the join operation.&lt;/LI&gt;&lt;LI&gt;Reference Example:&lt;UL&gt;&lt;LI&gt;Below shows the filter based on sub-query from dimension table, which is taking average 0.35 to 0.50 seconds.&amp;nbsp;This may vary based on data volume and indexing.&amp;nbsp;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_3-1737199129693.png" style="width: 859px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14212i3D38D17B4053B551/image-dimensions/859x218?v=v2" width="859" height="218" role="button" title="hariprasad_3-1737199129693.png" alt="hariprasad_3-1737199129693.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;Below shows the filter based on values listed manually, which is taking around 0.21 to 0.35 seconds. This&amp;nbsp;may vary based on data volume and indexing. &lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_5-1737200470951.png" style="width: 1116px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14214i383AF04985199B07/image-dimensions/1116x282?v=v2" width="1116" height="282" role="button" title="hariprasad_5-1737200470951.png" alt="hariprasad_5-1737200470951.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now let's talk about generating 2lakh value using Python or Scala and passing value to select query as filter.&lt;/P&gt;&lt;P&gt;Let's consider roughly the time taken by different steps in pyspark, read data &amp;gt;&amp;gt; collect from dataframe &amp;gt;&amp;gt; loop through values &amp;gt;&amp;gt; join values and pass to filter. This would take additional time read and prepare values for filter. Below code is taking average 0.57 to 0.80 seconds based on number of values from dimension table.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_4-1737200250275.png" style="width: 921px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14213i5D85F27E0170F88E/image-dimensions/921x360?v=v2" width="921" height="360" role="button" title="hariprasad_4-1737200250275.png" alt="hariprasad_4-1737200250275.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this helps.&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;Hari Prasad&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sat, 18 Jan 2025 11:41:45 GMT</pubDate>
    <dc:creator>hari-prasad</dc:creator>
    <dc:date>2025-01-18T11:41:45Z</dc:date>
    <item>
      <title>Predicate pushdown query</title>
      <link>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106151#M42404</link>
      <description>&lt;P&gt;Does predicate pushdown works when we provide a filter on a dataframe reading a delta table with 2 lakh values&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;i.e&lt;/P&gt;&lt;P&gt;filter condition:column is in(list)&lt;/P&gt;&lt;P&gt;list contains 2lakh elements&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;i need to get n number of columns from a table&amp;nbsp;&lt;/P&gt;&lt;P&gt;i am currently using join here&lt;/P&gt;&lt;P&gt;2lakh records join 1 billion records&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;thinking if i add the filter before join on the 1 billion table will it give performance improvement?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 18 Jan 2025 04:03:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106151#M42404</guid>
      <dc:creator>kasiviss42</dc:creator>
      <dc:date>2025-01-18T04:03:37Z</dc:date>
    </item>
    <item>
      <title>Re: Predicate pushdown query</title>
      <link>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106166#M42408</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/139168"&gt;@kasiviss42&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This might sound like a rhetorical question, but let’s delve into the complexity of joins and filters and examine how generating a list of 2lakh values affects it.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Let's assume we have fact table with 1 billion record and dimension table with 2 lakh records&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Fact Table = 1 Billion&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Dimension Table = 2 lakh&lt;/SPAN&gt;&lt;/P&gt;&lt;H4&gt;Join - Complexity Considerations:&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Data Volume:&lt;/STRONG&gt; Joining 1 billion rows with 200,000 unique IDs can be computationally expensive.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Shuffling:&lt;/STRONG&gt; The join operation may require significant data shuffling across the cluster nodes, which can be time-consuming and resource intensive.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Indexing:&lt;/STRONG&gt; If the join columns (&lt;EM&gt;&lt;STRONG&gt;ID columns&lt;/STRONG&gt;&lt;/EM&gt;) are indexed, the performance can be improved. Without indexes, the join operation will be slower.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Result Size:&lt;/STRONG&gt; The result set includes columns from both tables, which can increase the amount of data processed and transferred.&lt;/LI&gt;&lt;LI&gt;Reference example, below join query on id column, which is taking&lt;SPAN&gt; average 0.45 to 0.60 seconds. This may vary based on data volume and number of join conditions.&lt;/SPAN&gt;&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_1-1737199034992.png" style="width: 880px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14210i7F02BED1B0AF114C/image-dimensions/880x442?v=v2" width="880" height="442" role="button" title="hariprasad_1-1737199034992.png" alt="hariprasad_1-1737199034992.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;Filter -&amp;nbsp;Complexity Considerations:&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Filter Operation:&lt;/STRONG&gt; This query applies a filter to the fact table, which is generally less complex than a join operation.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Scanning:&lt;/STRONG&gt; The performance of the filter operation depends on the size of the fact table. If the table is large, scanning it can still be time-consuming.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Indexing:&lt;/STRONG&gt; If the &lt;EM&gt;&lt;STRONG&gt;ID column&lt;/STRONG&gt;&lt;/EM&gt; is indexed, the filter operation can be significantly faster. Without an index, the query will perform a full table scan.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Result Size:&lt;/STRONG&gt; The query selects all columns (*), which can increase the amount of data processed and transferred, but it is still likely to be less than the join operation.&lt;/LI&gt;&lt;LI&gt;Reference Example:&lt;UL&gt;&lt;LI&gt;Below shows the filter based on sub-query from dimension table, which is taking average 0.35 to 0.50 seconds.&amp;nbsp;This may vary based on data volume and indexing.&amp;nbsp;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_3-1737199129693.png" style="width: 859px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14212i3D38D17B4053B551/image-dimensions/859x218?v=v2" width="859" height="218" role="button" title="hariprasad_3-1737199129693.png" alt="hariprasad_3-1737199129693.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;Below shows the filter based on values listed manually, which is taking around 0.21 to 0.35 seconds. This&amp;nbsp;may vary based on data volume and indexing. &lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_5-1737200470951.png" style="width: 1116px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14214i383AF04985199B07/image-dimensions/1116x282?v=v2" width="1116" height="282" role="button" title="hariprasad_5-1737200470951.png" alt="hariprasad_5-1737200470951.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now let's talk about generating 2lakh value using Python or Scala and passing value to select query as filter.&lt;/P&gt;&lt;P&gt;Let's consider roughly the time taken by different steps in pyspark, read data &amp;gt;&amp;gt; collect from dataframe &amp;gt;&amp;gt; loop through values &amp;gt;&amp;gt; join values and pass to filter. This would take additional time read and prepare values for filter. Below code is taking average 0.57 to 0.80 seconds based on number of values from dimension table.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="hariprasad_4-1737200250275.png" style="width: 921px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14213i5D85F27E0170F88E/image-dimensions/921x360?v=v2" width="921" height="360" role="button" title="hariprasad_4-1737200250275.png" alt="hariprasad_4-1737200250275.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this helps.&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;Hari Prasad&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 18 Jan 2025 11:41:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106166#M42408</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2025-01-18T11:41:45Z</dc:date>
    </item>
    <item>
      <title>Re: Predicate pushdown query</title>
      <link>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106167#M42409</link>
      <description>&lt;P&gt;Thanks for the detailed analysis.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My scenario is like same base dataframe with 2 lakh unique records goes through the entire notebook traversing through 10-15 joins with tables around 1-5 billion records each.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;so it might take time intially to get the list variable populated. But i can reuse the same list in 10-15 joins only if this really improves the performance.&lt;/P&gt;</description>
      <pubDate>Sat, 18 Jan 2025 11:56:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106167#M42409</guid>
      <dc:creator>kasiviss42</dc:creator>
      <dc:date>2025-01-18T11:56:34Z</dc:date>
    </item>
    <item>
      <title>Re: Predicate pushdown query</title>
      <link>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106168#M42410</link>
      <description>&lt;P&gt;I recommend using a &lt;STRONG&gt;Join&lt;/STRONG&gt; or &lt;STRONG&gt;Filter with a subquery&lt;/STRONG&gt; for a safer approach. Manual looping can increase runtime, especially as the number of records in your dimension table grows.&lt;/P&gt;&lt;P&gt;&lt;FONT color="#008000"&gt;If you are performing repetitive joins across multiple blocks or stages and your dimension table consistently contains around 2lakh records, consider using a broadcast join. This involves making the dimension table a broadcast DataFrame, which can then be used across multiple joins.&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 18 Jan 2025 12:33:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/predicate-pushdown-query/m-p/106168#M42410</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2025-01-18T12:33:34Z</dc:date>
    </item>
  </channel>
</rss>

