<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Filter not using partition in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66066#M33005</link>
    <description>&lt;P&gt;I have the following code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)&lt;/LI-CODE&gt;&lt;P&gt;I noticed that the checkpoint has 3600 files, and neither collecting the unique stations nor filtering on the IdStation column uses any information from the repartitionByRange. I also tried partitioning, but it did not avoid the full scan of all 3600 files. Any ideas?&lt;/P&gt;</description>
    <pubDate>Thu, 11 Apr 2024 11:46:16 GMT</pubDate>
    <dc:creator>jenshumrich</dc:creator>
    <dc:date>2024-04-11T11:46:16Z</dc:date>
    <item>
      <title>Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66066#M33005</link>
      <description>&lt;P&gt;I have the following code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)&lt;/LI-CODE&gt;&lt;P&gt;I noticed that the checkpoint has 3600 files, and neither collecting the unique stations nor filtering on the IdStation column uses any information from the repartitionByRange. I also tried partitioning, but it did not avoid the full scan of all 3600 files. Any ideas?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Apr 2024 11:46:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66066#M33005</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-04-11T11:46:16Z</dc:date>
    </item>
    <item>
      <title>Re: Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66631#M33176</link>
      <description>&lt;P&gt;Please check the physical query plan: add the .explain() API to your existing call and look for any filter push-down happening in your query.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Apr 2024 21:56:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66631#M33176</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2024-04-18T21:56:56Z</dc:date>
    </item>
    <item>
      <title>Re: Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67325#M33320</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Thanks a lot for your response. It seems the filter is not pushed down, no?&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;station_df.explain()
== Physical Plan ==
*(1) Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
+- *(1) Scan ExistingRDD[Date#2718,WindSpeed#2675,Tower_Acceleration#2676,Density#2677,IdStation#2678,WindShear#2684,Upflow#2691,Control_Mode#2699,Tw_Frequency#2708]&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 25 Apr 2024 14:40:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67325#M33320</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-04-25T14:40:01Z</dc:date>
    </item>
    <item>
      <title>Re: Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67759#M33436</link>
      <description>&lt;P&gt;It seems like there is a filter being applied, according to this:&lt;/P&gt;
&lt;PRE class="lia-code-sample  language-python"&gt;&lt;CODE&gt;Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I would like to share the following notebook, which covers this topic in detail, in case you would like to check it out:&amp;nbsp;&lt;A href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html" target="_blank"&gt;https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2024 00:10:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67759#M33436</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2024-05-01T00:10:11Z</dc:date>
    </item>
  </channel>
</rss>