<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99052#M39895</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/132713"&gt;@ashap551&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I would vote for the modular approach, which lets you reuse code and write unit tests in a simpler manner. For me, notebooks are only "clients" of these shared modules. You can take a look at the official documentation, where they follow a similar approach:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/notebooks/best-practices.html#step-3-move-code-into-a-shared-module" target="_blank" rel="noopener"&gt;Software engineering best practices for notebooks | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 17 Nov 2024 17:54:18 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2024-11-17T17:54:18Z</dc:date>
    <item>
      <title>Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99037#M39894</link>
      <description>&lt;P&gt;I’m curious about&amp;nbsp;Data Engineering best practices for a large-scale data engineering project using Databricks to build a Lakehouse architecture (Bronze -&amp;gt; Silver -&amp;gt; Gold layers).&lt;/P&gt;&lt;P&gt;I’m currently comparing two approaches to writing the code for this solution and want to be sure which, if either, is considered the better approach:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;“Scripted” approach:&lt;/LI&gt;&lt;/OL&gt;&lt;UL&gt;&lt;LI&gt;Each notebook contains all operations, including common ones&lt;/LI&gt;&lt;LI&gt;Minimal use of functions, no classes&lt;/LI&gt;&lt;LI&gt;All code written out in each notebook for easy debugging&lt;/LI&gt;&lt;LI&gt;Table attributes declared as standalone variables using common naming conventions, but with no encapsulation (e.g., table1_silver_tablename, table1_processed_df, table2_silver_tablename, table2_processed_df… etc.)&lt;/LI&gt;&lt;/UL&gt;&lt;OL&gt;&lt;LI&gt;“Modular” approach:&lt;/LI&gt;&lt;/OL&gt;&lt;UL&gt;&lt;LI&gt;Common operations (e.g., environment setup, incremental reads, standard transformations, schema checks, Delta merges) stored in a shared codebase&lt;/LI&gt;&lt;LI&gt;Use of classes to encapsulate table attributes and operations&lt;/LI&gt;&lt;LI&gt;Custom transformations specific to each source kept separate&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Both approaches handle the same tasks, including:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Environment variable management&lt;/LI&gt;&lt;LI&gt;Incremental source reading&lt;/LI&gt;&lt;LI&gt;Standard transformations (e.g., file name parsing, deduplication)&lt;/LI&gt;&lt;LI&gt;Schema validation&lt;/LI&gt;&lt;LI&gt;Delta merging with insert/update date management&lt;/LI&gt;&lt;LI&gt;Checkpointing and metadata management&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;However, “Modular” puts these operations in a separate module (or notebook) that the primary notebooks call via an import, a magic command, or a dbutils function, whereas “Scripted” rewrites them in each notebook so that, for simplicity, everything stays self-contained.&lt;/P&gt;&lt;P&gt;Question:&lt;BR /&gt;&lt;STRONG&gt;What is the industry best practice for Data Engineering on large-scale projects in Databricks: a scripted approach for simplicity, or a modular approach for long-term sustainability? &amp;nbsp;Is there a clear favorite?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Please provide references to established best practices or official documentation, if such exists. Thank you!&lt;/P&gt;</description>
      <pubDate>Sun, 17 Nov 2024 10:01:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99037#M39894</guid>
      <dc:creator>ashap551</dc:creator>
      <dc:date>2024-11-17T10:01:07Z</dc:date>
    </item>
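    <!-- Illustrative sketch of the "Modular" approach described in the post above, assuming
         PySpark on Databricks with Delta Lake. The module name, class name, column names,
         and paths are hypothetical placeholders, not taken from the thread.

    # shared/table_pipeline.py
    from pyspark.sql import DataFrame, SparkSession, functions as F
    from delta.tables import DeltaTable

    class BronzeToSilverTable:
        """Encapsulates one table's attributes plus the common operations
        (incremental read, standard transforms, Delta merge) shared by all sources."""

        def __init__(self, spark: SparkSession, source_path: str,
                     target_table: str, key_cols: list):
            self.spark = spark
            self.source_path = source_path
            self.target_table = target_table
            self.key_cols = key_cols

        def read_incremental(self, last_loaded_ts: str) -> DataFrame:
            # Incremental read: only rows newer than the stored watermark.
            return (self.spark.read.format("delta")
                    .load(self.source_path)
                    .where(F.col("ingest_ts") > F.lit(last_loaded_ts)))

        def standard_transforms(self, df: DataFrame) -> DataFrame:
            # Shared transforms: capture the source file name and deduplicate on the keys.
            return (df.withColumn("source_file", F.input_file_name())
                      .dropDuplicates(self.key_cols))

        def merge_to_silver(self, df: DataFrame) -> None:
            # Delta merge with insert/update date management.
            df = df.withColumn("updated_at", F.current_timestamp())
            target = DeltaTable.forName(self.spark, self.target_table)
            cond = " AND ".join(f"t.{c} = s.{c}" for c in self.key_cols)
            (target.alias("t").merge(df.alias("s"), cond)
                   .whenMatchedUpdateAll()
                   .whenNotMatchedInsertAll()
                   .execute())
    -->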
    <item>
      <title>Re: Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99052#M39895</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/132713"&gt;@ashap551&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I would vote for the modular approach, which lets you reuse code and write unit tests in a simpler manner. For me, notebooks are only "clients" of these shared modules. You can take a look at the official documentation, where they follow a similar approach:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/notebooks/best-practices.html#step-3-move-code-into-a-shared-module" target="_blank" rel="noopener"&gt;Software engineering best practices for notebooks | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Nov 2024 17:54:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99052#M39895</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-11-17T17:54:18Z</dc:date>
    </item>
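    <!-- Illustrative sketch of the reply above: notebooks stay thin "clients" of the shared
         module, and the same module can be unit tested outside any notebook. The import
         path, table names, and watermark value are hypothetical and reuse the
         BronzeToSilverTable sketch shown earlier in this feed.

    # Databricks notebook source: notebooks/load_orders_silver
    from shared.table_pipeline import BronzeToSilverTable

    pipeline = BronzeToSilverTable(
        spark,                                   # `spark` is provided by the notebook runtime
        source_path="/Volumes/bronze/orders",
        target_table="main.silver.orders",
        key_cols=["order_id"],
    )
    df = pipeline.standard_transforms(pipeline.read_incremental("2024-11-01 00:00:00"))
    pipeline.merge_to_silver(df)

    # tests/test_table_pipeline.py: plain pytest against a local SparkSession,
    # which is what makes the modular layout easier to unit test.
    from pyspark.sql import SparkSession
    from shared.table_pipeline import BronzeToSilverTable

    def test_standard_transforms_deduplicates():
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        pipeline = BronzeToSilverTable(spark, "unused", "unused", key_cols=["id"])
        df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
        assert pipeline.standard_transforms(df).count() == 2
    -->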
    <item>
      <title>Re: Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99053#M39896</link>
      <description>&lt;P&gt;Thank you szymon_dybczak. &amp;nbsp;I agree that it is software engineering best practice, and the documentation substantiates it.&lt;/P&gt;&lt;P&gt;I’m just wondering whether newer data engineering practices are starting to move away from functional and modular styles, and whether there is a movement toward self-contained notebooks. &amp;nbsp;A few of my colleagues find it very difficult to follow a modular coding style and strongly prefer to code in place in a single script / single notebook. &amp;nbsp;More traditional data engineers, who are used to modular code, tend not to like rewriting code and prefer it the way you recommend here. &amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I wasn’t sure if you had come across the same pattern. &amp;nbsp;&lt;/P&gt;&lt;P&gt;Just trying to keep up with industry trends myself!&lt;/P&gt;</description>
      <pubDate>Sun, 17 Nov 2024 18:48:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99053#M39896</guid>
      <dc:creator>ashap551</dc:creator>
      <dc:date>2024-11-17T18:48:28Z</dc:date>
    </item>
  </channel>
</rss>

