Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark AWS S3 folder partition pruning doesn't work

fostermink
New Contributor

 

Hi, I have a use case where my Spark job runs on AWS EMR and reads from an S3 path: some-bucket/some-path/region=na/days=1

During the read, I do:

Dataset<Row> df = sparkSession.read().option("mergeSchema", true).parquet("some-bucket/some-path/");

and then I apply filters on df where region = 'na' and days = 1.
Shouldn't Spark do the partition pruning automatically and then read only the path some-bucket/some-path/region=na/days=1?

In my case, I see the Spark job reading the entire some-bucket/some-path. Why does this happen?

4 REPLIES

fostermink
New Contributor

Some of my configuration:

spark.sql.hive.convertMetastoreParquet false
spark.sql.sources.partitionOverwriteMode dynamic
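
For reference, a minimal sketch of how these settings might be applied when the session is created (the app name is a placeholder; the two config calls mirror the settings above):

import org.apache.spark.sql.SparkSession;

// Sketch only: applying the configuration listed above when building the session.
SparkSession sparkSession = SparkSession.builder()
        .appName("partition-pruning-example")   // placeholder name
        .config("spark.sql.hive.convertMetastoreParquet", "false")
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .getOrCreate();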

LRALVA
Honored Contributor

In your case, Spark isn't automatically pruning partitions because:

Missing Partition Discovery: For Spark to perform partition pruning when reading directly from paths (without a metastore table), you need to explicitly tell it about the partition structure.

Solutions
Option 1: Use basePath with Partition Discovery

Dataset<Row> df = sparkSession.read()
        .option("mergeSchema", true)
        .option("basePath", "s3://some-bucket/some-path/")
        .parquet("s3://some-bucket/some-path/region=na/days=1/");


Option 2: Enable Partition Discovery (Recommended)

Dataset<Row> df = sparkSession.read()
        .option("mergeSchema", true)
        .option("recursiveFileLookup", "false")
        .option("partitionOverwriteMode", "dynamic")
        .parquet("s3://some-bucket/some-path/")
        .filter("region = 'na' AND days = 1");

// Or more explicitly:
Dataset<Row> df = sparkSession.read()
        .option("mergeSchema", true)
        .option("basePath", "s3://some-bucket/some-path/")
        .parquet("s3://some-bucket/some-path/")
        .filter("region = 'na' AND days = 1");

 


fostermink
New Contributor

Hi, for option 2, why would setting recursiveFileLookup to false enable partition discovery? From what I read, the default value of `recursiveFileLookup` is false, so I think my case is already option 2; the only missing part is basePath?

LRALVA
Honored Contributor

@fostermink 

You're correct that recursiveFileLookup defaults to false, so explicitly setting it doesn't actually change the behavior from the default. I should have been more precise in my explanation.
What's really happening is that when you read from a path without specifying partition information, Spark needs to properly identify the directory structure as partitions rather than just subdirectories.

The most important part is indeed the basePath option:

Dataset<Row> df = sparkSession.read()
        .option("mergeSchema", true)
        .option("basePath", "s3://some-bucket/some-path/")
        .parquet("s3://some-bucket/some-path/")
        .filter("region = 'na' AND days = 1");

The basePath tells Spark:

- This is the root directory for the dataset
- Any directory structure below it that follows the key=value pattern should be interpreted as partitions
- When filters are applied on these partition columns, use them for partition pruning

Without the basePath option, Spark might not correctly recognize the partition structure, especially if the schema doesn't explicitly define these columns as partitions.
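
As a quick illustration of the first two points, here is a sketch (reusing the example layout from this thread and assuming an existing sparkSession) that contrasts reading a single partition directory with and without basePath:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Without basePath: Spark treats the leaf directory itself as the dataset root,
// so region and days are not inferred as columns.
Dataset<Row> leafOnly = sparkSession.read()
        .parquet("s3://some-bucket/some-path/region=na/days=1/");
leafOnly.printSchema();   // only the columns stored inside the parquet files

// With basePath: the key=value directories between basePath and the files
// become partition columns in the schema.
Dataset<Row> withBase = sparkSession.read()
        .option("basePath", "s3://some-bucket/some-path/")
        .parquet("s3://some-bucket/some-path/region=na/days=1/");
withBase.printSchema();   // includes region and days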
Additionally, to fully enable partition pruning, these configs can help:

spark.sql.parquet.filterPushdown true
spark.sql.optimizer.dynamicPartitionPruning.enabled true
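
On recent Spark versions both of these default to true, but they can be set explicitly at runtime (a sketch on an existing session):

// Sketch: setting the configs above on an existing SparkSession.
sparkSession.conf().set("spark.sql.parquet.filterPushdown", "true");
sparkSession.conf().set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true");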

 

 
