a week ago
Hello all, I am new to this forum, so please forgive me if I am posting in the wrong location (I'd appreciate it if a mod moved the post or told me where to post).
I am looking for help optimizing some Python code I have. I have a version of this Python notebook that currently runs in AWS Glue, and its logic handles data with the following structure:
bucket/bronze/sap/holding/{company_name}*/jdt1/{year}/{month}/{day}/file.json
*29 companies in total
The problem I have is that in Glue I can set it up so that, as an incremental load, I filter by the latest 90 days and only those dates are searched. With what I managed to do in Databricks, however, it always reads all the files and then filters the DataFrame generated from them. I want to know how I can make it read only the latest 90 days, for example:
- Start: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/09/01/file.json
- End: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/12/30/file.json
This would be done for all the companies that exist. Can anyone tell me how to build logic that reads only the files for those dates instead of reading everything? Thanks
a week ago
Hello @lecarusin ,
You can absolutely make Databricks only read the dates you care about. The trick is to constrain the input paths (so Spark lists only those folders) instead of reading the whole directory.
Build the exact S3 prefixes for your date range and give Spark a list of paths. The company part can stay a wildcard (*) so it covers all 29 companies.
Code:
from datetime import date, timedelta

bucket = "s3://your-bucket"
root = f"{bucket}/bronze/sap/holding"

start = date(2025, 9, 1)
end = date(2025, 12, 30)

def day_paths(start_d, end_d):
    cur = start_d
    paths = []
    while cur <= end_d:
        # company wildcard stays in place
        paths.append(f"{root}/*/jdt1/{cur:%Y/%m/%d}/*.json")
        cur += timedelta(days=1)
    return paths

paths = day_paths(start, end)

df = (
    spark.read
    .json(paths)
)

# then apply your transformations
Spark will only list the folders you passed in (e.g., .../2025/09/01/ … 2025/12/30/). It never scans other dates, so there’s no unnecessary I/O and no need to filter after the read.
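If you want a rolling "latest 90 days" rather than fixed dates, you can derive the range from today's date and reuse the same helper. A small sketch (adjust the window size as needed):
from datetime import date, timedelta

# rolling window: today and the 90 days before it
end = date.today()
start = end - timedelta(days=90)

paths = day_paths(start, end)
df = spark.read.json(paths)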
Please do let me know if you have any further questions.
a week ago
Hello @K_Anudeep
Thanks for the answer; it seems I forgot to mention an important thing. The S3 bucket is currently set up as a catalog, so the route in my current script is:
file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"
And after "jdt1" it comes the same logic of year/month/day/file.json.
How can I do it if the routes, the place where the files are, is in a unity catalog?
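To make the question concrete, this is roughly the shape of what I am trying to end up with (just a sketch; company_names is a placeholder for the 29 company folders, and I am assuming glob patterns behave the same on volume paths as on S3 URIs):
from datetime import date, timedelta

volume_root = "/Volumes/bronze_external_apis/sap/data_sap/holding"
company_names = ["company_a", "company_b"]  # placeholder for the 29 real company folders

# rolling 90-day window
end = date.today()
start = end - timedelta(days=90)

def day_paths(start_d, end_d):
    paths = []
    cur = start_d
    while cur <= end_d:
        for company in company_names:
            paths.append(f"{volume_root}/{company}/jdt1/{cur:%Y/%m/%d}/*.json")
        cur += timedelta(days=1)
    return paths

df = spark.read.json(day_paths(start, end))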
a week ago
I am not sure I fully understand how your data pipeline is set up, but have you considered incremental data loading, say with something similar to the "COPY INTO" method, which would only read your incremental load, and then applying a 90-day filter on top of that? I am also new to Databricks, but it looks like something you should be able to do during your data ingestion step.
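Roughly something like this, for example (just a sketch, untested; the table and path names are made up, the target table would need to exist first, and I am assuming your JSON has a date field such as an "update_date" column to filter on):
# COPY INTO only ingests files it has not loaded before (idempotent),
# so re-running it picks up just the new files
spark.sql("""
    COPY INTO bronze.sap.jdt1_raw
    FROM 's3://your-bucket/bronze/sap/holding/'
    FILEFORMAT = JSON
    PATTERN = '*/jdt1/*/*/*/*.json'
""")

# then keep only the last 90 days for the downstream logic
recent_df = spark.sql("""
    SELECT *
    FROM bronze.sap.jdt1_raw
    WHERE update_date >= date_sub(current_date(), 90)
""")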
a week ago
I forgot to mention an important thing. The S3 bucket is currently set up as a catalog, so the route in my current script is:
file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"
And after "jdt1" it comes the same logic of year/month/day/file.json.
What I need is to read only the last 90 days of that, because data could be added to those files, so my script after getting the data of those 90 days, makes a merge so it only appends new data. Or in the case of another case I have, gets the most recent one using an "update_date" field
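For reference, the merge step follows roughly this pattern (a simplified sketch; the real table name and join key are different):
from delta.tables import DeltaTable

# df = the DataFrame with the last 90 days of JSON data read above
target = DeltaTable.forName(spark, "silver.sap.jdt1")  # placeholder table name

(
    target.alias("t")
    .merge(df.alias("s"), "t.doc_entry = s.doc_entry")  # placeholder join key
    # in the other case, keep only the most recent version per key
    .whenMatchedUpdateAll(condition="s.update_date > t.update_date")
    # append rows that do not exist yet
    .whenNotMatchedInsertAll()
    .execute()
)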