<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96197#M3746</link>
    <description>&lt;P&gt;Thank you for your reply! I have one more question.&lt;/P&gt;&lt;P&gt;Since write_table performs a MERGE when writing, does this mean it scans the entire landing feature table for updates during each write? For large tables, that could become a slow and resource-intensive process.&lt;/P&gt;&lt;P&gt;Besides leveraging daily partitioning, what strategies would you recommend to optimize this?&lt;/P&gt;</description>
    <pubDate>Fri, 25 Oct 2024 17:32:59 GMT</pubDate>
    <dc:creator>zed</dc:creator>
    <dc:date>2024-10-25T17:32:59Z</dc:date>
    <item>
      <title>Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96047#M3742</link>
      <description>&lt;P&gt;Hello Databricks Community,&lt;/P&gt;&lt;P&gt;I am currently using the Feature Engineering client and have a few questions about best practices for writing to Feature Store tables.&lt;/P&gt;&lt;P&gt;I am considering not using the write_table method directly from the Feature Engineering client. Instead, I am thinking of writing &lt;STRONG&gt;daily&lt;/STRONG&gt; partitions to the Delta table by using the INSERT OVERWRITE statement with a PARTITION clause.&lt;/P&gt;&lt;P&gt;Before I proceed, I want to understand:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;What are the potential consequences of not using the write_table function for Feature Store tables in this scenario? Specifically, could skipping write_table silently change the behaviour of the Feature Store tables (e.g. data not being properly catalogued, or other out-of-the-box Feature Store functionality being lost)?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is INSERT OVERWRITE a bad practice for managing daily partition updates in a Feature Store table?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;On the one hand, I understand that using INSERT OVERWRITE may lead to data loss. Furthermore, write_table can help identify non-idempotent pipelines: if a given daily run generated a set of records and, during a backfill, one of those records was neither UPDATED nor INSERTED, that record must have been written by an earlier daily run, which suggests an issue with the pipeline.&lt;/P&gt;&lt;P&gt;On the other hand, I may want to update the transformation code that generates a given partition and then OVERWRITE the data for a set of partitions; INSERT OVERWRITE solves that with ease by simply backfilling.&lt;/P&gt;&lt;P&gt;Would write_table be more suitable for ensuring that records are consistently inserted or updated during re-runs, for preventing data loss, and for identifying idempotency issues in backfill scenarios?&lt;/P&gt;&lt;P&gt;Any advice on how best to handle this scenario would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks in advance for your insights.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Oct 2024 20:03:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96047#M3742</guid>
      <dc:creator>zed</dc:creator>
      <dc:date>2024-10-24T20:03:04Z</dc:date>
    </item>
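The trade-off the question describes can be made concrete with a minimal, pure-Python sketch (plain dicts stand in for the Delta table; none of this is the Feature Engineering client API): INSERT OVERWRITE replaces the whole target partition, while a keyed MERGE-style upsert touches only the keys present in the incoming batch, so a backfill that regenerates only some keys silently drops the rest under overwrite but preserves them under merge.

```python
# Toy feature table: {(date, entity_id): feature_value}
table = {
    ("2024-10-24", "a"): 1.0,
    ("2024-10-24", "b"): 2.0,
}

def insert_overwrite(table, date, new_rows):
    """Replace the whole partition for `date`, like INSERT OVERWRITE ... PARTITION."""
    # Drop every existing row in the partition, then insert the new ones.
    kept = {k: v for k, v in table.items() if k[0] != date}
    kept.update({(date, eid): val for eid, val in new_rows.items()})
    return kept

def merge_upsert(table, date, new_rows):
    """Upsert rows by primary key, like a MERGE on (date, entity_id)."""
    merged = dict(table)
    merged.update({(date, eid): val for eid, val in new_rows.items()})
    return merged

# A backfill that regenerates only entity "a" for 2024-10-24:
backfill = {"a": 1.5}

after_overwrite = insert_overwrite(table, "2024-10-24", backfill)
after_merge = merge_upsert(table, "2024-10-24", backfill)

print(("2024-10-24", "b") in after_overwrite)  # False: row "b" was silently dropped
print(("2024-10-24", "b") in after_merge)      # True: the merge left it in place
```

The same asymmetry is what makes INSERT OVERWRITE convenient for deliberately rewriting a partition after a code change, and dangerous for routine incremental writes.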
    <item>
      <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96062#M3743</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/129200"&gt;@zed&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;How are you doing? As per my understanding, you should consider using the &lt;STRONG&gt;write_table&lt;/STRONG&gt; method from the Feature Engineering client to ensure that all &lt;STRONG&gt;Feature Store functionality&lt;/STRONG&gt; is properly leveraged, such as cataloging, lineage tracking, and handling updates. By directly using &lt;STRONG&gt;INSERT OVERWRITE&lt;/STRONG&gt;, you might miss out on these key features, potentially leading to &lt;STRONG&gt;silent issues&lt;/STRONG&gt; like missing metadata or inconsistent tracking of changes in the Feature Store. While &lt;STRONG&gt;INSERT OVERWRITE&lt;/STRONG&gt; might seem convenient for partition management and backfilling, it introduces the risk of &lt;STRONG&gt;data loss&lt;/STRONG&gt; if not handled carefully. The &lt;STRONG&gt;write_table&lt;/STRONG&gt; function is designed to handle &lt;STRONG&gt;idempotency&lt;/STRONG&gt; and to ensure that records are either updated or inserted consistently, reducing the risk of issues during re-runs or backfills. In cases where you need to overwrite specific partitions, &lt;STRONG&gt;write_table&lt;/STRONG&gt; still offers the advantage of maintaining data consistency while enabling better tracking.&lt;/P&gt;&lt;P&gt;Give it a try and let me know.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 03:18:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96062#M3743</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2024-10-25T03:18:24Z</dc:date>
    </item>
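The idempotency point in the reply can be illustrated with a similar pure-Python sketch (a hypothetical helper, not the actual write_table implementation): re-running a keyed upsert with the same input leaves the table unchanged, and reporting which keys were inserted, updated, or left untouched is exactly the signal the original question wanted for spotting records that came from a different run.

```python
def merge_with_report(table, new_rows):
    """Upsert by primary key, as a stand-in for write_table's MERGE behaviour,
    reporting which keys were inserted, updated, or left untouched."""
    inserted = {k for k in new_rows if k not in table}
    updated = {k for k in new_rows if k in table}
    untouched = {k for k in table if k not in new_rows}
    merged = dict(table)
    merged.update(new_rows)
    return merged, inserted, updated, untouched

daily_run = {("2024-10-24", "a"): 1.0, ("2024-10-24", "b"): 2.0}

# First run inserts everything; a re-run of the same pipeline only updates.
once, ins1, upd1, _ = merge_with_report({}, daily_run)
twice, ins2, upd2, untouched2 = merge_with_report(once, daily_run)

print(once == twice)           # True: an idempotent write, the re-run changes nothing
# During a backfill, a key that lands in `untouched` was neither inserted nor
# updated, so it must have come from a different run: the pipeline smell
# described in the original question.
print(ins1 == set(daily_run))  # True
print(upd2 == set(daily_run))  # True
```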
    <item>
      <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96197#M3746</link>
      <description>&lt;P&gt;Thank you for your reply! I have one more question.&lt;/P&gt;&lt;P&gt;Since write_table performs a MERGE when writing, does this mean it scans the entire landing feature table for updates during each write? For large tables, that could become a slow and resource-intensive process.&lt;/P&gt;&lt;P&gt;Besides leveraging daily partitioning, what strategies would you recommend to optimize this?&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 17:32:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96197#M3746</guid>
      <dc:creator>zed</dc:creator>
      <dc:date>2024-10-25T17:32:59Z</dc:date>
    </item>
    <item>
      <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96202#M3747</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/129200"&gt;@zed&lt;/a&gt;, maybe you can consider &lt;STRONG&gt;using partition pruning&lt;/STRONG&gt; to limit the scope of the merge operation when calling write_table. Partitioning your feature table by daily or monthly increments will help &lt;STRONG&gt;reduce the data scanned&lt;/STRONG&gt; during each merge, as only the relevant partitions are processed. Additionally, &lt;STRONG&gt;indexing frequently used columns&lt;/STRONG&gt; or leveraging &lt;STRONG&gt;Z-order clustering&lt;/STRONG&gt; on relevant columns can further optimize read performance. You may also want to &lt;STRONG&gt;batch updates&lt;/STRONG&gt; or focus only on new and changed records to reduce the load on large tables. Finally, monitor &lt;STRONG&gt;job performance metrics&lt;/STRONG&gt; to track bottlenecks and ensure resources are used efficiently during each write operation.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 20:02:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96202#M3747</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2024-10-25T20:02:21Z</dc:date>
    </item>
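The pruning advice can be sketched in pure Python as well (the rows-scanned counts are illustrative only; real Delta MERGE statistics depend on file layout and data skipping): restricting the merge to the partitions present in the incoming batch means only those partitions' rows are read, not the whole table.

```python
# Toy partitioned table: {date: {entity_id: value}}
table = {
    "2024-10-23": {"a": 1.0, "b": 2.0},
    "2024-10-24": {"a": 3.0, "b": 4.0},
    "2024-10-25": {"a": 5.0},
}

def merge_full_scan(table, batch):
    """Merge without a partition predicate: every partition is scanned."""
    scanned = sum(len(rows) for rows in table.values())
    for (date, eid), val in batch.items():
        table.setdefault(date, {})[eid] = val
    return scanned

def merge_pruned(table, batch):
    """Merge constrained to the partitions present in the batch (partition pruning)."""
    touched = {date for (date, _eid) in batch}
    scanned = sum(len(table.get(date, {})) for date in touched)
    for (date, eid), val in batch.items():
        table.setdefault(date, {})[eid] = val
    return scanned

batch = {("2024-10-25", "b"): 6.0}  # a daily increment touching one partition

full = merge_full_scan({d: dict(r) for d, r in table.items()}, batch)
pruned = merge_pruned({d: dict(r) for d, r in table.items()}, batch)
print(full, pruned)  # 5 1: pruning reads one partition instead of all three
```

The end state of the table is identical either way; only the amount of data read differs, which is why daily partitioning plus a partition predicate is the first lever for large merge-based writes.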
  </channel>
</rss>