<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Help regarding a python notebook and s3 file structure in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138638#M50984</link>
    <description>&lt;P&gt;I forgot to mention an important detail: the S3 bucket is currently mounted as a catalog, so my current script uses the path:&lt;BR /&gt;&lt;CODE&gt;file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"&lt;/CODE&gt;&lt;BR /&gt;After "jdt1" the same year/month/day/file.json structure follows.&lt;BR /&gt;&lt;BR /&gt;What I need is to read only the last 90 days of that, because data can still be added to those files; after reading those 90 days, my script runs a merge so that only new data is appended. In another case I have, it keeps the most recent record using an "update_date" field.&lt;/P&gt;</description>
    <pubDate>Tue, 11 Nov 2025 18:10:16 GMT</pubDate>
    <dc:creator>lecarusin</dc:creator>
    <dc:date>2025-11-11T18:10:16Z</dc:date>
    <item>
      <title>Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138465#M50933</link>
      <description>&lt;P&gt;Hello all, I am new to this forum, so please forgive me if I am posting in the wrong place (I'd appreciate it if a mod moves the post or tells me where to post).&lt;/P&gt;&lt;P&gt;I am looking for help optimizing a Python notebook. A version of it currently runs in AWS Glue and contains logic for handling data with the following structure:&lt;BR /&gt;bucket/bronze/sap/holding/{company_name}*/jdt1/{year}/{month}/{day}/file.json&lt;BR /&gt;*29 companies in total&lt;/P&gt;&lt;P&gt;The problem I have: in Glue I can set up the incremental load so that I filter to the latest 90 days, and only those dates are searched. In what I managed to do in Databricks, however, it always reads all the files and only then filters the DataFrame generated from them. I want to know how to make it read only the latest 90 days, for example:&lt;BR /&gt;- Start: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/09/01/file.json&lt;BR /&gt;- End: bucket/bronze/sap/holding/{company_name}*/jdt1/2025/12/30/file.json&lt;/P&gt;&lt;P&gt;This would be done for all existing companies. Can anyone show me how to build logic that reads only the files for those dates instead of everything? Thanks&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2025 19:25:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138465#M50933</guid>
      <dc:creator>lecarusin</dc:creator>
      <dc:date>2025-11-10T19:25:51Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138496#M50945</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/197276"&gt;@lecarusin&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;You can absolutely make Databricks &lt;EM&gt;only&lt;/EM&gt; read the dates you care about. The trick is to constrain the input paths (so Spark lists only those folders) instead of reading the whole directory.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Build the exact S3 prefixes for your date range and give Spark a list of paths. The company part can stay a wildcard (&lt;CODE&gt;*&lt;/CODE&gt;) so it covers all 29 companies.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Code&lt;/U&gt;:&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from datetime import date, timedelta

bucket = "s3://your-bucket"
root   = f"{bucket}/bronze/sap/holding"
start  = date(2025, 9, 1)
end    = date(2025, 12, 30)

def day_paths(start_d, end_d):
    cur = start_d
    paths = []
    while cur &amp;lt;= end_d:
        # company wildcard stays in place
        paths.append(f"{root}/*/jdt1/{cur:%Y/%m/%d}/*.json")
        cur += timedelta(days=1)
    return paths

paths = day_paths(start, end)

df = (
    spark.read
         .json(paths)
)

# then apply your transformations
&lt;/LI-CODE&gt;
&lt;P&gt;Spark will only list the folders you passed in (e.g., &lt;CODE&gt;.../2025/09/01/&lt;/CODE&gt; … &lt;CODE&gt;2025/12/30/&lt;/CODE&gt;). It never scans other dates, so there’s no unnecessary I/O and no need to filter after the read.&lt;/P&gt;
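&lt;P&gt;One caveat: &lt;CODE&gt;spark.read.json&lt;/CODE&gt; can raise a path-not-found error if any glob in the list matches no files (e.g. a future date or a day with no data), so for a rolling "last 90 days" load it can help to build the window relative to today rather than from fixed start/end dates. A minimal sketch of that path-building (the bucket name here is a placeholder):&lt;/P&gt;

```python
from datetime import date, timedelta

def trailing_globs(root, days=90, today=None):
    # Build one glob per day of the trailing window, newest first.
    # `root` is a placeholder prefix; the "*" covers every company folder.
    today = today or date.today()
    return [
        f"{root}/*/jdt1/{today - timedelta(days=i):%Y/%m/%d}/*.json"
        for i in range(days)
    ]

globs = trailing_globs("s3://your-bucket/bronze/sap/holding",
                       days=3, today=date(2025, 9, 3))
# globs[0] covers 2025/09/03, globs[2] covers 2025/09/01
```

&lt;P&gt;You can then pass the resulting list straight to &lt;CODE&gt;spark.read.json(globs)&lt;/CODE&gt; exactly as above.&lt;/P&gt;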
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please do let me know if you have any further questions.&lt;/P&gt;
</description>
      <pubDate>Tue, 11 Nov 2025 02:09:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138496#M50945</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-11-11T02:09:34Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138552#M50953</link>
      <description>&lt;P&gt;I am not sure I fully understand how your data pipeline is set up, but have you considered incremental data loading, e.g. something like the "COPY INTO" method, which would read only your incremental load, and then applying a 90-day filter on top of that? I am also new to Databricks, but this looks like something you should be able to do during your data ingestion step.&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 10:33:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138552#M50953</guid>
      <dc:creator>arunpalanoor</dc:creator>
      <dc:date>2025-11-11T10:33:20Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138637#M50983</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/60098"&gt;@K_Anudeep&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thanks for the answer; it seems I forgot to mention an important detail. The S3 bucket is currently mounted as a catalog, so my current script uses the path:&lt;BR /&gt;&lt;CODE&gt;file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"&lt;/CODE&gt;&lt;BR /&gt;After "jdt1" the same year/month/day/file.json structure follows.&lt;/P&gt;&lt;P&gt;How can I do this when the files live under a Unity Catalog volume?&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 18:05:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138637#M50983</guid>
      <dc:creator>lecarusin</dc:creator>
      <dc:date>2025-11-11T18:05:00Z</dc:date>
    </item>
    <item>
      <title>Re: Help regarding a python notebook and s3 file structure</title>
      <link>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138638#M50984</link>
      <description>&lt;P&gt;I forgot to mention an important detail: the S3 bucket is currently mounted as a catalog, so my current script uses the path:&lt;BR /&gt;&lt;CODE&gt;file_path = f"/Volumes/bronze_external_apis/sap/data_sap/holding/{company_name_lower}/jdt1/"&lt;/CODE&gt;&lt;BR /&gt;After "jdt1" the same year/month/day/file.json structure follows.&lt;BR /&gt;&lt;BR /&gt;What I need is to read only the last 90 days of that, because data can still be added to those files; after reading those 90 days, my script runs a merge so that only new data is appended. In another case I have, it keeps the most recent record using an "update_date" field.&lt;/P&gt;</description>
      <pubDate>Tue, 11 Nov 2025 18:10:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/help-regarding-a-python-notebook-and-s3-file-structure/m-p/138638#M50984</guid>
      <dc:creator>lecarusin</dc:creator>
      <dc:date>2025-11-11T18:10:16Z</dc:date>
    </item>
  </channel>
</rss>

