Hello @lecarusin ,
You can absolutely make Databricks read only the dates you care about. The trick is to constrain the input paths (so Spark lists only those folders) instead of reading the whole directory.
Build the exact S3 prefixes for your date range and give Spark a list of paths. The company part can stay a wildcard (*) so it covers all 29 companies.
Code:
from datetime import date, timedelta

bucket = "s3://your-bucket"
root = f"{bucket}/bronze/sap/holding"

start = date(2025, 9, 1)
end = date(2025, 12, 30)

def day_paths(start_d, end_d):
    """Return one globbed path per day in the inclusive date range."""
    cur = start_d
    paths = []
    while cur <= end_d:
        # company wildcard stays in place, so all 29 companies are covered
        paths.append(f"{root}/*/jdt1/{cur:%Y/%m/%d}/*.json")
        cur += timedelta(days=1)
    return paths

paths = day_paths(start, end)

# spark.read.json accepts a list of paths, so only these folders are listed
df = (
    spark.read
    .json(paths)
)

# then apply your transformations
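The per-day paths are what you want here because your range cuts across month boundaries (Sep 1 to Dec 30). As a side note, if a range ever covers whole months only, Hadoop-style glob alternation can express it in a single pattern; a minimal sketch under that assumption (the month list below is illustrative):
Code:
# Sketch only: {09,10,11} is Hadoop glob alternation, so this one pattern
# matches September through November for every company folder.
df_months = spark.read.json(f"{root}/*/jdt1/2025/{{09,10,11}}/*/*.json")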
Spark will only list the folders you passed in (e.g., .../2025/09/01/ … 2025/12/30/). It never scans other dates, so there's no unnecessary I/O and no need to filter after the read.
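If you want to confirm exactly which files a given run picked up, one quick check (just a sketch, not required for the load itself) is to look at the distinct input file names on the DataFrame you just read:
Code:
from pyspark.sql.functions import input_file_name

# distinct source paths that contributed rows to df
df.select(input_file_name().alias("source_file")).distinct().show(truncate=False)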
Please do let me know if you have any further questions.
Anudeep