<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DLT notebook dynamic declaration in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105147#M42012</link>
    <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/134575"&gt;@eballinger&lt;/a&gt;&amp;nbsp;, thank you for your question. To better assist you, could you clarify a few details?&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Are you seeing delays in specific stages (e.g., metadata fetching, schema validation, or table setup)?&lt;/LI&gt;
&lt;LI&gt;Could you provide more details on how the dynamic declaration is implemented (e.g., looping structure or table metadata source)?&lt;/LI&gt;
&lt;LI&gt;Have you profiled the pipeline to identify which part of the initialization is taking longer?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The increased runtime with dynamic declarations is likely due to the overhead of processing each table dynamically, compared to the static approach where these computations are predefined. To address this:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Batch Processing: Process tables in smaller batches instead of handling all 300+ tables in one go.&lt;/LI&gt;
&lt;LI&gt;Parallel Execution: Explore parallel or asynchronous processing for table declarations.&lt;/LI&gt;
&lt;LI&gt;Metadata Optimization: Cache reusable metadata (e.g., schemas or paths) to minimize repeated operations in the loop.&lt;/LI&gt;
&lt;LI&gt;External Configuration: Use a configuration file (e.g., JSON or YAML) for table definitions to simplify and speed up initialization.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;If these don’t resolve the issue, let me know more specifics, and I’ll provide further suggestions!&lt;/P&gt;</description>
    <pubDate>Fri, 10 Jan 2025 10:32:47 GMT</pubDate>
    <dc:creator>VZLA</dc:creator>
    <dc:date>2025-01-10T10:32:47Z</dc:date>
    <item>
      <title>DLT notebook dynamic declaration</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105075#M41985</link>
      <description>&lt;P&gt;Hi Guys,&lt;/P&gt;&lt;P&gt;We have a DLT pipeline that reads data from landing to raw (CSV files into tables) for approximately 80 tables.&lt;/P&gt;&lt;P&gt;In our first attempt at this we declared each table separately in a Python notebook, with one @dlt.table declared per cell.&lt;/P&gt;&lt;P&gt;Then, when another database came along with 300 tables, we looked for a better solution and found a way to declare the DLT tables dynamically using a loop and a table holding the table names we want to declare. This works well, and there is no longer any repetition of code. However, I discovered a trade-off that I hope we can get around.&lt;/P&gt;&lt;P&gt;Using the first method, where the tables are statically declared, the INITIALIZATION and SETTING UP TABLES stages take only 4 minutes together. But when we use the dynamic declaration method, those same stages now take 30 to 35 minutes.&lt;/P&gt;&lt;P&gt;Has anyone else who uses DLT dynamic table declaration encountered this big jump in run time? A 30-minute increase seems excessive to me just to produce the DLT declarations.&lt;/P&gt;&lt;P&gt;Thanks for any suggestions or help.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jan 2025 20:25:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105075#M41985</guid>
      <dc:creator>eballinger</dc:creator>
      <dc:date>2025-01-09T20:25:53Z</dc:date>
    </item>
    <item>
      <title>Re: DLT notebook dynamic declaration</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105139#M42008</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/134575"&gt;@eballinger&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;1. Table Metadata Initialization Overhead&lt;BR /&gt;When tables are declared dynamically, the loop may add overhead by re-initializing metadata or creating resources redundantly.&lt;/P&gt;
&lt;P&gt;Suggestions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Ensure that the loop processes only the essential metadata and avoids redundant operations.&lt;/LI&gt;
&lt;LI&gt;Cache metadata where applicable, to avoid fetching the same information multiple times.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;2. Parallel Execution&lt;BR /&gt;In the static approach, each table is declared independently, which likely allows for better parallel execution. A dynamic loop may serialize operations or block parallelism.&lt;/P&gt;
&lt;P&gt;Suggestions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Use multi-threading or asynchronous execution to declare tables in parallel, if the DLT framework supports parallel processing.&lt;/LI&gt;
&lt;LI&gt;Group tables logically and batch their creation.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;3. Code Complexity in the Loop&lt;BR /&gt;The logic inside the dynamic loop could be more computationally expensive than expected, for example because of repeated operations, large data manipulations, or complex branching.&lt;/P&gt;
&lt;P&gt;Suggestions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Profile your loop logic to identify bottlenecks (e.g., using Python's cProfile or timeit modules).&lt;/LI&gt;
&lt;LI&gt;Simplify the logic by removing unnecessary operations.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;4. Validation and Dependency Checks&lt;BR /&gt;DLT may validate each dynamically declared table and check its dependencies, which could take longer than with static declarations, where such checks may be cached.&lt;/P&gt;
&lt;P&gt;Suggestions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Check whether DLT provides configuration options to limit validation overhead for dynamic declarations.&lt;/LI&gt;
&lt;LI&gt;If you’re not using all tables, filter the table list to avoid unnecessary declarations.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;5. Metadata Table Lookup&lt;BR /&gt;If your dynamic approach depends on querying a metadata table or schema, inefficient queries may cause delays.&lt;/P&gt;
&lt;P&gt;Suggestions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Optimize the queries that fetch table metadata (e.g., indexes, caching).&lt;/LI&gt;
&lt;LI&gt;Pre-fetch the metadata into memory if feasible and iterate over it locally.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;6. Initialization in DLT&lt;BR /&gt;DLT may optimize differently for static and dynamic methods, for example by precomputing dependencies for static declarations.&lt;/P&gt;
&lt;P&gt;Suggestions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Review the DLT documentation for optimizations that apply to dynamic declarations.&lt;/LI&gt;
&lt;LI&gt;Check whether there are recommended practices for declaring large numbers of tables dynamically.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;7. Use Partitioned Runs&lt;BR /&gt;If tables can be grouped into logical partitions, split the pipeline initialization into smaller chunks. For instance, initialize 50 tables at a time in separate runs and monitor performance.&lt;/P&gt;
&lt;P&gt;Example Optimized Dynamic Declaration:&lt;/P&gt;
&lt;PRE&gt;
import dlt

# Pre-fetch metadata for all tables once, outside the loop
table_metadata = get_table_metadata()  # Replace with your function to fetch metadata


def declare_table(meta):
    # A factory function gives each table its own `meta`, so the loop
    # variable is not shared between the generated table functions.
    @dlt.table(
        name=meta["name"],
        schema=meta["schema"],
    )
    def load_table():
        return read_csv(meta["path"])  # Replace with your data loading logic


# Declare one DLT table per metadata entry
for meta in table_metadata:
    declare_table(meta)
&lt;/PRE&gt;
&lt;P&gt;This keeps overhead low by pre-fetching the metadata once and declaring only the tables you need.&lt;/P&gt;
&lt;P&gt;Next Steps&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Profile your dynamic logic to identify bottlenecks.&lt;/LI&gt;
&lt;LI&gt;Implement parallelism where possible.&lt;/LI&gt;
&lt;LI&gt;Optimize metadata fetching and DLT initialization configurations.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Best Regards&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jan 2025 10:04:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105139#M42008</guid>
      <dc:creator>glori923</dc:creator>
      <dc:date>2025-01-10T10:04:58Z</dc:date>
    </item>
    <item>
      <title>Re: DLT notebook dynamic declaration</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105147#M42012</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/134575"&gt;@eballinger&lt;/a&gt;&amp;nbsp;, thank you for your question. To better assist you, could you clarify a few details?&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Are you seeing delays in specific stages (e.g., metadata fetching, schema validation, or table setup)?&lt;/LI&gt;
&lt;LI&gt;Could you provide more details on how the dynamic declaration is implemented (e.g., looping structure or table metadata source)?&lt;/LI&gt;
&lt;LI&gt;Have you profiled the pipeline to identify which part of the initialization is taking longer?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The increased runtime with dynamic declarations is likely due to the overhead of processing each table dynamically, compared to the static approach where these computations are predefined. To address this:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Batch Processing: Process tables in smaller batches instead of handling all 300+ tables in one go.&lt;/LI&gt;
&lt;LI&gt;Parallel Execution: Explore parallel or asynchronous processing for table declarations.&lt;/LI&gt;
&lt;LI&gt;Metadata Optimization: Cache reusable metadata (e.g., schemas or paths) to minimize repeated operations in the loop.&lt;/LI&gt;
&lt;LI&gt;External Configuration: Use a configuration file (e.g., JSON or YAML) for table definitions to simplify and speed up initialization.&lt;/LI&gt;
&lt;/UL&gt;
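&lt;P&gt;As a minimal sketch of the metadata optimization point above, assuming the table metadata has already been fetched once into a list of plain dictionaries: the function name group_columns_by_table, the row keys (table_name, col_name, type_fun, column_id) and the sample rows are hypothetical and only for illustration, not part of any DLT API. The idea is to pay for the lookup once and let the declaration loop read from an in-memory dictionary.&lt;/P&gt;

```python
from collections import defaultdict


def group_columns_by_table(rows):
    """Group pre-fetched metadata rows into {table_name: [(col_name, type_fun), ...]}.

    `rows` is assumed to be an iterable of dicts with the hypothetical keys
    table_name, col_name, type_fun and column_id.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["table_name"]].append(row)

    columns = {}
    for table_name, table_rows in grouped.items():
        # Keep the columns in their declared order.
        ordered = sorted(table_rows, key=lambda r: r["column_id"])
        columns[table_name] = [(r["col_name"], r["type_fun"]) for r in ordered]
    return columns


# Hypothetical sample rows standing in for a single fetch of the metadata source.
sample_rows = [
    {"table_name": "orders", "col_name": "amount", "type_fun": "DecimalType", "column_id": 2},
    {"table_name": "orders", "col_name": "order_id", "type_fun": "StringType", "column_id": 1},
    {"table_name": "customers", "col_name": "customer_id", "type_fun": "StringType", "column_id": 1},
]

columns_by_table = group_columns_by_table(sample_rows)
```

&lt;P&gt;Inside the loop, each table declaration could then read columns_by_table[name] instead of querying the metadata source again.&lt;/P&gt;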
&lt;P&gt;If these don’t resolve the issue, let me know more specifics, and I’ll provide further suggestions!&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jan 2025 10:32:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105147#M42012</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2025-01-10T10:32:47Z</dc:date>
    </item>
    <item>
      <title>Re: DLT notebook dynamic declaration</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105209#M42046</link>
      <description>&lt;P&gt;Thanks for the excellent suggestions, VZLA. Here are the event logs and a copy of the dynamic code, just FYI:&lt;/P&gt;
&lt;P&gt;Event Logs:&lt;/P&gt;
&lt;P&gt;2015-01-08 15:18:17 EST User xxx started an update&lt;BR /&gt;2015-01-08 15:18:17 EST Update 15d539 started by API call&lt;BR /&gt;2015-01-08 15:18:18 EST Update 15d539 is INITIALIZING&lt;BR /&gt;2015-01-08 15:38:31 EST Update 15d539 is SETTING_UP_TABLES&lt;BR /&gt;2015-01-08 15:48:45 EST Flow &amp;lt;table_name&amp;gt; is defined as APPEND&lt;BR /&gt;...&lt;BR /&gt;2015-01-08 15:49:59 EST Update 15d539 is COMPLETED&lt;/P&gt;
&lt;P&gt;---------------------------------------------------&lt;/P&gt;
&lt;PRE&gt;
import dlt
import pyspark.sql.types
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, StructField, StructType

# table_schema, table_list and readCSV are defined elsewhere in the notebook (not shown here)

def create_table(name):
    @dlt.table(name=name,
        comment="Raw Data Tables",
        table_properties={
            "quality": "bronze"
        })
    def t():
        # Define schema for the table based on the table schema reference
        schema = StructType([StructField(col_name, getattr(pyspark.sql.types, type_fun)(), True) for (col_name, type_fun) in table_schema.where((col('table_name')==name)|(col('table_name')=='all_tables')).orderBy(col('column_id')).select('col_name','type_fun').collect()])

        # Adjust DecimalType fields with precision and scale
        #for field in schema:
        #    ## Check on the decimal columns and update them. Some are set as precision of 0 but that wouldn't make any sense, so we're defaulting those ones to 25 to be safe
        #    if isinstance(field.dataType, DecimalType):
        #        precision_scale = table_schema.where((col('table_name') == name) &amp;amp; (col('col_name') == field.name)).select('data_precision', 'data_scale').collect()[0]
        #        if precision_scale['data_precision'] == '0':
        #            field.dataType = DecimalType(precision=int("25"), scale=int(precision_scale['data_scale']))
        #        else:
        #            field.dataType = DecimalType(precision=int(precision_scale['data_precision']), scale=int(precision_scale['data_scale']))

        # Read CSV data into the table
        return readCSV(schema, name)

# Loop through each table name and create the corresponding DLT table
for table_name in table_list:
    create_table(table_name)
&lt;/PRE&gt;
&lt;P&gt;-----------------------------------------------&lt;/P&gt;
&lt;P&gt;I just found the issue though. It's the new code added above (the "Adjust DecimalType" section) that is causing the delay, and it is not part of the original static declaration code, so I wasn't exactly comparing apples to apples. When I comment out that part, this dynamic code runs in 45 seconds. I will dig further into that code now and see if I can improve that part (which is the true bottleneck now).&lt;/P&gt;
&lt;P&gt;Thanks again&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jan 2025 16:20:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105209#M42046</guid>
      <dc:creator>eballinger</dc:creator>
      <dc:date>2025-01-10T16:20:05Z</dc:date>
    </item>
    <item>
      <title>Re: DLT notebook dynamic declaration</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105368#M42097</link>
      <description>&lt;P&gt;Good catch and glad to hear you've identified the source of delay!&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jan 2025 10:09:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-notebook-dynamic-declaration/m-p/105368#M42097</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2025-01-13T10:09:23Z</dc:date>
    </item>
  </channel>
</rss>

