@fostermink
You're correct that recursiveFileLookup defaults to false, so explicitly setting it doesn't actually change the behavior from the default. I should have been more precise in my explanation.
What's really happening is that when you read from a path without specifying partition information, Spark needs to properly identify the directory structure as partitions rather than just subdirectories.
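To make that concrete, the layout I have in mind (the exact partition names are an assumption based on your region/days filter) is something like:

s3://some-bucket/some-path/region=na/days=1/part-00000.parquet
s3://some-bucket/some-path/region=na/days=2/part-00000.parquet
s3://some-bucket/some-path/region=eu/days=1/part-00000.parquet

Each key=value directory level becomes a partition column once Spark knows where the dataset root is.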
The most important part is indeed the basePath option:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// basePath marks the dataset root, so the region=/days= directories are read as partition columns
Dataset<Row> df = sparkSession.read()
    .option("mergeSchema", true)
    .option("basePath", "s3://some-bucket/some-path/")
    .parquet("s3://some-bucket/some-path/")
    .filter("region = 'na' AND days = 1");
The basePath tells Spark:
- This is the root directory of the dataset
- Any directory below it that follows the key=value pattern should be interpreted as a partition, and its key becomes a partition column
- When filters are applied on those partition columns, they can be used for partition pruning, so only the matching directories are read
Without basePath, Spark infers the dataset root from the path(s) you pass to parquet(). When you read the root itself that usually works, but if you point the reader at a specific partition subdirectory, region and days drop out of the inferred schema and can no longer be used for pruning; the sketch below shows the difference.
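Continuing from the snippet above (the region=na subpath is just a hypothetical example of loading one partition subtree directly):

// Read only the region=na subtree; because basePath still points at the dataset
// root, "region" and "days" stay visible as partition columns and can be filtered.
Dataset<Row> naOnly = sparkSession.read()
    .option("basePath", "s3://some-bucket/some-path/")
    .parquet("s3://some-bucket/some-path/region=na/")
    .filter("days = 1");

Without the basePath line, the region column would simply not exist on naOnly.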
Additionally, to fully enable partition pruning, these configs can help:
spark.sql.parquet.filterPushdown true
spark.sql.optimizer.dynamicPartitionPruning.enabled true
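Both of these default to true on recent Spark 3.x releases, so they are usually already on; as a sketch, here is how you could set them explicitly in Java, either on the builder or on an existing session (the app name is just a placeholder):

import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession.builder()
    .appName("partition-pruning-example")  // placeholder name
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate();

// Or on an already-running session:
sparkSession.conf().set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true");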
LR