Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Help regarding a python notebook and s3 file structure

lecarusin
New Contributor

Hello all, I am new to this forum, so please forgive me if I am posting in the wrong location (I'd appreciate it if the post is moved by mods or I am told where to post).

I am looking for help optimizing some Python code I have. I have a version of this Python notebook that currently runs in AWS Glue, and it has logic that helps me deal with the data, which has the following structure:
bucket/bronze/sap/holding/{company_name}*/jdt1/{year}/{month}/{day}/file.json
*29 companies in total

The problem I have is that in Glue I can set it up so that, as an incremental load, I filter by the latest 90 days and only those dates are searched. With what I managed to do in Databricks, however, it always reads all the files and then filters the dataframe generated from them. I want to know how I can make it read only the latest 90 days, for example:
- Start: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/09/01/file.json
- End: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/12/30/file.json

This would be done for all the companies that exist. Can anyone tell me how to build logic that reads only the files for those dates instead of the whole thing? Thanks

4 REPLIES

K_Anudeep
Databricks Employee

Hello @lecarusin ,

You can absolutely make Databricks read only the dates you care about. The trick is to constrain the input paths (so Spark lists only those folders) instead of reading the whole directory.

Build the exact S3 prefixes for your date range and give Spark a list of paths. The company part can stay a wildcard (*) so it covers all 29 companies.

Code:

from datetime import date, timedelta

bucket = "s3://your-bucket"
root   = f"{bucket}/bronze/sap/holding"
start  = date(2025, 9, 1)
end    = date(2025, 12, 30)

def day_paths(start_d, end_d):
    cur = start_d
    paths = []
    while cur <= end_d:
        # company wildcard stays in place
        paths.append(f"{root}/*/jdt1/{cur:%Y/%m/%d}/*.json")
        cur += timedelta(days=1)
    return paths

paths = day_paths(start, end)

df = (
    spark.read
         .json(paths)
)

# then apply your transformations

Spark will only list the folders you passed in (e.g., .../2025/09/01/ ... /2025/12/30/). It never scans other dates, so there's no unnecessary I/O and no need to filter after the read.
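As a quick sanity check (a hypothetical snippet, assuming the df built above), you can list which files Spark actually read and confirm they all fall inside the date range:

# DataFrame.inputFiles() returns the files backing the DataFrame;
# every path should sit under one of the 2025/09/01 .. 2025/12/30 folders.
for path in sorted(df.inputFiles()):
    print(path)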


Please do let me know if you have any further questions.


Anudeep

Hello @K_Anudeep 

Thanks for the answer. It seems I forgot to mention an important thing: the S3 storage is currently exposed as a catalog (a Unity Catalog volume), so my current script uses this route:
file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"
And after "jdt1" comes the same year/month/day/file.json structure.

How can I do it if the routes, i.e. the place where the files live, are in a Unity Catalog volume?
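A minimal sketch of how the same per-day approach could be adapted to the Unity Catalog volume path mentioned above; the volume root is taken from the question, the date range is just an example, and it assumes glob patterns work on the volume path the same way they do on S3 URIs:

Code:

from datetime import date, timedelta

# Unity Catalog volume root, taken from the path in the question
root  = "/Volumes/bronze_external_apis/sap/data_sap/holding"
start = date(2025, 9, 1)
end   = date(2025, 12, 30)

def day_paths(start_d, end_d):
    cur, paths = start_d, []
    while cur <= end_d:
        # wildcard over all company folders, one glob per day
        paths.append(f"{root}/*/jdt1/{cur:%Y/%m/%d}/*.json")
        cur += timedelta(days=1)
    return paths

df = spark.read.json(day_paths(start, end))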

arunpalanoor
New Contributor II

I am not sure if I fully understand how your data pipeline is set up, but have you considered incremental data loading, for example using something similar to the COPY INTO method, which would only read your incremental load, and then applying a 90-day filter on top of that? I am also new to Databricks, but this looks like something you should be able to do during your data ingestion step.
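For reference, a rough sketch of the COPY INTO idea, assuming an existing Delta target table (bronze.sap_jdt1_raw is a placeholder name) and the volume path from this thread; COPY INTO keeps track of files it has already ingested, so reruns only pick up new files, and the 90-day filter can then be applied to the table:

Code:

# Hypothetical incremental ingest with COPY INTO; the target table name
# is an assumption, the source path comes from the thread.
spark.sql("""
    COPY INTO bronze.sap_jdt1_raw
    FROM '/Volumes/bronze_external_apis/sap/data_sap/holding/'
    FILEFORMAT = JSON
    PATTERN = '*/jdt1/*/*/*/*.json'
    COPY_OPTIONS ('mergeSchema' = 'true')
""")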

I forgot to mention an important thing: the S3 storage is currently exposed as a catalog (a Unity Catalog volume), so my current script uses this route:
file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"
And after "jdt1" comes the same year/month/day/file.json structure.

What I need is to read only the last 90 days of that, because data could still be added to those files. After getting the data for those 90 days, my script does a merge so that it only appends new data, or, in another case I have, keeps the most recent record using an "update_date" field.
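A rough sketch of that merge step, assuming a Delta target table and a business key column; the table name (bronze.sap_jdt1_silver), the key (doc_key), and the df holding the last 90 days are placeholders for illustration:

Code:

from delta.tables import DeltaTable

# df is the dataframe with the last 90 days read from the volume;
# table and column names below are assumptions, not the real schema.
target = DeltaTable.forName(spark, "bronze.sap_jdt1_silver")

(
    target.alias("t")
    .merge(df.alias("s"), "t.doc_key = s.doc_key")
    # keep only the most recent version of an existing record
    .whenMatchedUpdateAll(condition="s.update_date > t.update_date")
    # append records that are not in the target yet
    .whenNotMatchedInsertAll()
    .execute()
)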