<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark aws s3 folder partition pruning doesn't work in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/122739#M46853</link>
    <description>&lt;P&gt;Hi, we tried basePath and it doesn't work. We suspect the root cause is that we only store the data files in S3, with no underlying partition metadata, so Spark cannot infer the exact path from the given filter and ends up scanning the entire folder. We saw different behavior when we stored and read a Hudi table on S3 (a Hudi table has partition metadata): those reads were fast.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Tue, 24 Jun 2025 20:46:48 GMT</pubDate>
    <dc:creator>fostermink</dc:creator>
    <dc:date>2025-06-24T20:46:48Z</dc:date>
    <item>
      <title>Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118724#M45695</link>
      <description>&lt;P&gt;Hi, I have a Spark job running on AWS EMR that reads from an S3 path: some-bucket/some-path/region=na/days=1&lt;/P&gt;&lt;P&gt;For the read, I pass the root path:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;DataFrame df = &lt;SPAN&gt;sparkSession&lt;/SPAN&gt;.read().option(&lt;SPAN&gt;"mergeSchema"&lt;/SPAN&gt;, &lt;SPAN&gt;true&lt;/SPAN&gt;).parquet("some-bucket/some-path/");&lt;/PRE&gt;&lt;P&gt;and then I apply filters on df where region=na and days=1.&lt;BR /&gt;Shouldn't Spark do the partition pruning automatically and read only some-bucket/some-path/region=na/days=1?&lt;/P&gt;&lt;P&gt;In my case, I see the Spark job reading the entire some-bucket/some-path. Why does this happen?&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 09 May 2025 23:49:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118724#M45695</guid>
      <dc:creator>fostermink</dc:creator>
      <dc:date>2025-05-09T23:49:54Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118725#M45696</link>
      <description>&lt;P&gt;Some of my configuration:&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;DIV class=""&gt;spark.sql.hive.convertMetastoreParquet&lt;/DIV&gt;&lt;/TD&gt;&lt;TD&gt;&lt;DIV class=""&gt;false&lt;/DIV&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;DIV class=""&gt;spark.sql.sources.partitionOverwriteMode&lt;/DIV&gt;&lt;/TD&gt;&lt;TD&gt;&lt;DIV class=""&gt;dynamic&lt;/DIV&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Fri, 09 May 2025 23:51:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118725#M45696</guid>
      <dc:creator>fostermink</dc:creator>
      <dc:date>2025-05-09T23:51:48Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118730#M45698</link>
      <description>&lt;P&gt;In your case, Spark likely isn't pruning partitions because partition discovery is not taking effect: when reading directly from paths (without a metastore table), Spark has to recognize the key=value directory layout as partitions before it can prune on those columns.&lt;/P&gt;&lt;P&gt;Solutions:&lt;BR /&gt;&lt;STRONG&gt;Option 1: Use basePath with Partition Discovery&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;DataFrame df = sparkSession.read()&lt;BR /&gt;.option("mergeSchema", true)&lt;BR /&gt;.option("basePath", "s3://some-bucket/some-path/")&lt;BR /&gt;.parquet("s3://some-bucket/some-path/region=na/days=1/");&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Option 2: Enable Partition Discovery (Recommended)&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;DataFrame df = sparkSession.read()&lt;BR /&gt;.option("mergeSchema", true)&lt;BR /&gt;.option("recursiveFileLookup", "false")&lt;BR /&gt;.option("partitionOverwriteMode", "dynamic")&lt;BR /&gt;.parquet("s3://some-bucket/some-path/")&lt;BR /&gt;.filter("region = 'na' AND days = 1");&lt;/P&gt;&lt;P&gt;// Or, more explicitly:&lt;BR /&gt;DataFrame df = sparkSession.read()&lt;BR /&gt;.option("mergeSchema", true)&lt;BR /&gt;.option("basePath", "s3://some-bucket/some-path/")&lt;BR /&gt;.parquet("s3://some-bucket/some-path/")&lt;BR /&gt;.filter("region = 'na' AND days = 1");&lt;/P&gt;</description>
      <pubDate>Sat, 10 May 2025 03:23:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118730#M45698</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-05-10T03:23:33Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118731#M45699</link>
      <description>&lt;P&gt;Hi, for option 2, why would setting recursiveFileLookup to false enable partition discovery? From what I read, the default value of `recursiveFileLookup` is false, so I think my case is already option 2 and the only missing part is basePath?&lt;/P&gt;</description>
      <pubDate>Sat, 10 May 2025 03:41:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118731#M45699</guid>
      <dc:creator>fostermink</dc:creator>
      <dc:date>2025-05-10T03:41:02Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118762#M45706</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/163972"&gt;@fostermink&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You're correct that recursiveFileLookup defaults to false, so setting it explicitly doesn't change the behavior. I should have been more precise in my explanation.&lt;BR /&gt;What's really happening is that when you read from a path without specifying partition information, Spark needs to identify the directory structure as partitions rather than as plain subdirectories.&lt;/P&gt;&lt;P&gt;The most important part is indeed the basePath option:&lt;/P&gt;&lt;P&gt;DataFrame df = sparkSession.read()&lt;BR /&gt;.option("mergeSchema", true)&lt;BR /&gt;.option("basePath", "s3://some-bucket/some-path/")&lt;BR /&gt;.parquet("s3://some-bucket/some-path/")&lt;BR /&gt;.filter("region = 'na' AND days = 1");&lt;/P&gt;&lt;P&gt;The basePath tells Spark:&lt;/P&gt;&lt;P&gt;-- This is the root directory for the dataset&lt;BR /&gt;-- Any directory structure below it that follows the key=value pattern should be interpreted as partitions&lt;BR /&gt;-- When filters are applied on these partition columns, use them for partition pruning&lt;/P&gt;&lt;P&gt;Without the basePath option, Spark might not correctly recognize the partition structure, especially if the schema doesn't explicitly define these columns as partitions.&lt;BR /&gt;Additionally, these configs can help with partition pruning:&lt;/P&gt;&lt;P&gt;spark.sql.parquet.filterPushdown true&lt;BR /&gt;spark.sql.optimizer.dynamicPartitionPruning.enabled true&lt;/P&gt;</description>
      <pubDate>Sat, 10 May 2025 15:38:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/118762#M45706</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-05-10T15:38:10Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/122739#M46853</link>
      <description>&lt;P&gt;Hi, we tried basePath and it doesn't work. We suspect the root cause is that we only store the data files in S3, with no underlying partition metadata, so Spark cannot infer the exact path from the given filter and ends up scanning the entire folder. We saw different behavior when we stored and read a Hudi table on S3 (a Hudi table has partition metadata): those reads were fast.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jun 2025 20:46:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/122739#M46853</guid>
      <dc:creator>fostermink</dc:creator>
      <dc:date>2025-06-24T20:46:48Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/122756#M46859</link>
      <description>&lt;P&gt;okay, thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 04:38:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/122756#M46859</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-06-25T04:38:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/146941#M52737</link>
      <description>&lt;P&gt;I know this is an old post, but for people who are looking for answers like I was, below is what worked for me.&lt;BR /&gt;&lt;BR /&gt;Option 1 is crude, but it works. You can also use the glob wildcard '*' to include a list of folders, e.g. region=*. If you need to satisfy multiple conditions, pass the paths as comma-separated arguments.&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;DataFrame df = sparkSession.read()&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;.option("mergeSchema", true)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;.option("basePath", "s3://some-bucket/some-path/")&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;.parquet("s3://some-bucket/some-path/region=na/days=1/", "s3://some-bucket/some-path/region=us-east/days=1*");&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 05 Feb 2026 21:46:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/146941#M52737</guid>
      <dc:creator>Bheemana</dc:creator>
      <dc:date>2026-02-05T21:46:55Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/146943#M52738</link>
      <description>&lt;P&gt;Does df.printSchema() show region and days as partition columns at the end? If not, partition discovery isn’t working.&lt;BR /&gt;Can you remove mergeSchema or provide an explicit schema? With mergeSchema, Spark must read the Parquet footers of all files under the base path to merge column definitions before planning the scan. This happens before partition pruning, so you’ll see a read/list of the full tree even if the final scan prunes most files.&lt;/P&gt;</description>
      <pubDate>Thu, 05 Feb 2026 21:58:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/146943#M52738</guid>
      <dc:creator>pradeep_singh</dc:creator>
      <dc:date>2026-02-05T21:58:22Z</dc:date>
    </item>
    <item>
      <title>Re: Spark aws s3 folder partition pruning doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/147162#M52749</link>
      <description>&lt;P&gt;You can create a table in the catalog and use it for pruning.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Feb 2026 17:13:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-aws-s3-folder-partition-pruning-doesn-t-work/m-p/147162#M52749</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-02-06T17:13:51Z</dc:date>
    </item>
  </channel>
</rss>

