<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Help regarding a python notebook and s3 file structure in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138638#M50984</link>
    <description>&lt;P&gt;I forgot to mention an important detail: the S3 bucket is currently mounted as a catalog, so my current script uses the path:&lt;BR /&gt;&lt;CODE&gt;file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"&lt;/CODE&gt;&lt;BR /&gt;After "jdt1" the same year/month/day/file.json structure follows.&lt;BR /&gt;&lt;BR /&gt;What I need is to read only the last 90 days of that, because data can still be added to those files; after reading those 90 days, my script runs a merge so that only new data is appended. In another case I have, it keeps the most recent record using an "update_date" field.&lt;/P&gt;</description>
    <pubDate>Tue, 11 Nov 2025 18:10:16 GMT</pubDate>
    <dc:creator>lecarusin</dc:creator>
    <dc:date>2025-11-11T18:10:16Z</dc:date>
    <item>
      <title>Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138465#M50933</link>
      <description>&lt;P&gt;Hello all, I am new to this forum, so please forgive me if I am posting in the wrong place (I'd appreciate it if a mod moves the post or tells me where to post).&lt;/P&gt;&lt;P&gt;I am looking for help optimizing a Python notebook. A version of it currently runs in AWS Glue and contains logic for handling data with the following structure:&lt;BR /&gt;bucket/bronze/sap/holding/{company_name}*/jdt1/{year}/{month}/{day}/file.json&lt;BR /&gt;*29 companies in total&lt;/P&gt;&lt;P&gt;The problem I have: in Glue I can set up the incremental load so that I filter to the latest 90 days, and only those dates are searched. In what I managed to do in Databricks, however, it always reads all the files and only then filters the DataFrame generated from them. I want to know how to make it read only the latest 90 days, for example:&lt;BR /&gt;- Start: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/09/01/file.json&lt;BR /&gt;- End: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/12/30/file.json&lt;/P&gt;&lt;P&gt;This would be done for all existing companies. Can anyone show me how to build logic that reads only the files for those dates instead of everything? Thanks&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2025 19:25:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138465#M50933</guid>
      <dc:creator>lecarusin</dc:creator>
      <dc:date>2025-11-10T19:25:51Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138496#M50945</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/197276"&gt;@lecarusin&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;You can absolutely make Databricks &lt;EM&gt;only&lt;/EM&gt; read the dates you care about. The trick is to constrain the input paths (so Spark lists only those folders) instead of reading the whole directory.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Build the exact S3 prefixes for your date range and give Spark a list of paths. The company part can stay a wildcard (&lt;CODE&gt;*&lt;/CODE&gt;) so it covers all 29 companies.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Code&lt;/U&gt;:&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from datetime import date, timedelta

bucket = "s3://your-bucket"
root   = f"{bucket}/bronze/sap/holding"
start  = date(2025, 9, 1)
end    = date(2025, 12, 30)

def day_paths(start_d, end_d):
    cur = start_d
    paths = []
    while cur &amp;lt;= end_d:
        # company wildcard stays in place
        paths.append(f"{root}/*/jdt1/{cur:%Y/%m/%d}/*.json")
        cur += timedelta(days=1)
    return paths

paths = day_paths(start, end)

df = (
    spark.read
         .json(paths)
)

# then apply your transformations
&lt;/LI-CODE&gt;
&lt;P&gt;Spark will only list the folders you passed in (e.g., &lt;CODE&gt;.../2025/09/01/&lt;/CODE&gt; … &lt;CODE&gt;2025/12/30/&lt;/CODE&gt;). It never scans other dates, so there’s no unnecessary I/O and no need to filter after the read.&lt;/P&gt;
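&lt;P&gt;One caveat: &lt;CODE&gt;spark.read.json&lt;/CODE&gt; can raise a path-not-found error if any glob in the list matches no files (e.g. a future date or a day with no data), so for a rolling "last 90 days" load it can help to build the window relative to today rather than from fixed start/end dates. A minimal sketch of that path-building (the bucket name here is a placeholder):&lt;/P&gt;

```python
from datetime import date, timedelta

def trailing_globs(root, days=90, today=None):
    # Build one glob per day of the trailing window, newest first.
    # `root` is a placeholder prefix; the "*" covers every company folder.
    today = today or date.today()
    return [
        f"{root}/*/jdt1/{today - timedelta(days=i):%Y/%m/%d}/*.json"
        for i in range(days)
    ]

globs = trailing_globs("s3://your-bucket/bronze/sap/holding",
                       days=3, today=date(2025, 9, 3))
# globs[0] covers 2025/09/03, globs[2] covers 2025/09/01
```

&lt;P&gt;You can then pass the resulting list straight to &lt;CODE&gt;spark.read.json(globs)&lt;/CODE&gt; exactly as above.&lt;/P&gt;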
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please do let me know if you have any further questions.&lt;/P&gt;
</description>
      <pubDate>Tue, 11 Nov 2025 02:09:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138496#M50945</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-11-11T02:09:34Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138552#M50953</link>
      <description>&lt;P&gt;I am not sure I fully understand how your data pipeline is set up, but have you considered incremental data loading, e.g. something like the "COPY INTO" method, which would read only your incremental load, and then applying a 90-day filter on top of that? I am also new to Databricks, but this looks like something you should be able to do during your data ingestion step.&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 10:33:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138552#M50953</guid>
      <dc:creator>arunpalanoor</dc:creator>
      <dc:date>2025-11-11T10:33:20Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138637#M50983</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/60098"&gt;@K_Anudeep&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thanks for the answer; it seems I forgot to mention an important detail. The S3 bucket is currently mounted as a catalog, so my current script uses the path:&lt;BR /&gt;&lt;CODE&gt;file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"&lt;/CODE&gt;&lt;BR /&gt;After "jdt1" the same year/month/day/file.json structure follows.&lt;/P&gt;&lt;P&gt;How can I do this when the files live under a Unity Catalog volume?&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 18:05:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138637#M50983</guid>
      <dc:creator>lecarusin</dc:creator>
      <dc:date>2025-11-11T18:05:00Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138638#M50984</link>
      <description>&lt;P&gt;I forgot to mention an important detail: the S3 bucket is currently mounted as a catalog, so my current script uses the path:&lt;BR /&gt;&lt;CODE&gt;file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"&lt;/CODE&gt;&lt;BR /&gt;After "jdt1" the same year/month/day/file.json structure follows.&lt;BR /&gt;&lt;BR /&gt;What I need is to read only the last 90 days of that, because data can still be added to those files; after reading those 90 days, my script runs a merge so that only new data is appended. In another case I have, it keeps the most recent record using an "update_date" field.&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 18:10:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138638#M50984</guid>
      <dc:creator>lecarusin</dc:creator>
      <dc:date>2025-11-11T18:10:16Z</dc:date>
    </item>
  </channel>
</rss>

