<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96197#M3746</link>
    <description>&lt;P&gt;Thank you for your reply! I have one more question.&lt;/P&gt;&lt;P&gt;Since write_table performs a MERGE when writing, does this mean it scans the entire landing feature table for updates during each write? For large tables, that could become a slow and resource-intensive process.&lt;/P&gt;&lt;P&gt;Besides leveraging daily partitioning, what strategies would you recommend to optimize this?&lt;/P&gt;</description>
    <pubDate>Fri, 25 Oct 2024 17:32:59 GMT</pubDate>
    <dc:creator>zed</dc:creator>
    <dc:date>2024-10-25T17:32:59Z</dc:date>
    <item>
      <title>Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96047#M3742</link>
      <description>&lt;P&gt;Hello Databricks Community,&lt;/P&gt;&lt;P&gt;I am currently using the Feature Engineering client and have a few questions about best practices for writing to Feature Store tables.&lt;/P&gt;&lt;P&gt;I am considering not using the write_table method directly from the Feature Engineering client. Instead, I am thinking of writing &lt;STRONG&gt;daily&lt;/STRONG&gt; partitions to the Delta table by using the INSERT OVERWRITE statement with a PARTITION clause.&lt;/P&gt;&lt;P&gt;Before I proceed, I want to understand:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;What are the potential consequences of not using the write_table function for Feature Store tables in this scenario? Specifically, could skipping write_table silently change the behaviour of the Feature Store tables (e.g. data not being properly catalogued, or other out-of-the-box Feature Store functionality being lost)?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is INSERT OVERWRITE a bad practice for managing daily partition updates in a Feature Store table?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;On the one hand, I understand that using INSERT OVERWRITE may lead to data loss. Furthermore, write_table can help identify non-idempotent pipelines: if a given daily run generated a set of records and, during a backfill, one of those records was neither UPDATED nor INSERTED, that record must have been written by an earlier daily run, which suggests an issue with the pipeline.&lt;/P&gt;&lt;P&gt;On the other hand, I may want to update the transformation code that generates a given partition and then OVERWRITE the data for a set of partitions; INSERT OVERWRITE solves that with ease by simply backfilling.&lt;/P&gt;&lt;P&gt;Would write_table be more suitable for ensuring that records are consistently inserted or updated during re-runs, for preventing data loss, and for identifying idempotency issues in backfill scenarios?&lt;/P&gt;&lt;P&gt;Any advice on how best to handle this scenario would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks in advance for your insights.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Oct 2024 20:03:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96047#M3742</guid>
      <dc:creator>zed</dc:creator>
      <dc:date>2024-10-24T20:03:04Z</dc:date>
    </item>
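The trade-off the question describes can be made concrete with a minimal, pure-Python sketch (plain dicts stand in for the Delta table; none of this is the Feature Engineering client API): INSERT OVERWRITE replaces the whole target partition, while a keyed MERGE-style upsert touches only the keys present in the incoming batch, so a backfill that regenerates only some keys silently drops the rest under overwrite but preserves them under merge.

```python
# Toy feature table: {(date, entity_id): feature_value}
table = {
    ("2024-10-24", "a"): 1.0,
    ("2024-10-24", "b"): 2.0,
}

def insert_overwrite(table, date, new_rows):
    """Replace the whole partition for `date`, like INSERT OVERWRITE ... PARTITION."""
    # Drop every existing row in the partition, then insert the new ones.
    kept = {k: v for k, v in table.items() if k[0] != date}
    kept.update({(date, eid): val for eid, val in new_rows.items()})
    return kept

def merge_upsert(table, date, new_rows):
    """Upsert rows by primary key, like a MERGE on (date, entity_id)."""
    merged = dict(table)
    merged.update({(date, eid): val for eid, val in new_rows.items()})
    return merged

# A backfill that regenerates only entity "a" for 2024-10-24:
backfill = {"a": 1.5}

after_overwrite = insert_overwrite(table, "2024-10-24", backfill)
after_merge = merge_upsert(table, "2024-10-24", backfill)

print(("2024-10-24", "b") in after_overwrite)  # False: row "b" was silently dropped
print(("2024-10-24", "b") in after_merge)      # True: the merge left it in place
```

The same asymmetry is what makes INSERT OVERWRITE convenient for deliberately rewriting a partition after a code change, and dangerous for routine incremental writes.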
    <item>
      <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96062#M3743</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/129200"&gt;@zed&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;How are you doing? As per my understanding, you should consider using the &lt;STRONG&gt;write_table&lt;/STRONG&gt; method from the Feature Engineering client to ensure that all &lt;STRONG&gt;Feature Store functionality&lt;/STRONG&gt; is properly leveraged, such as cataloging, lineage tracking, and handling updates. By directly using &lt;STRONG&gt;INSERT OVERWRITE&lt;/STRONG&gt;, you might miss out on these key features, potentially leading to &lt;STRONG&gt;silent issues&lt;/STRONG&gt; like missing metadata or inconsistent tracking of changes in the Feature Store. While &lt;STRONG&gt;INSERT OVERWRITE&lt;/STRONG&gt; might seem convenient for partition management and backfilling, it introduces the risk of &lt;STRONG&gt;data loss&lt;/STRONG&gt; if not handled carefully. The &lt;STRONG&gt;write_table&lt;/STRONG&gt; function is designed to handle &lt;STRONG&gt;idempotency&lt;/STRONG&gt; and to ensure that records are either updated or inserted consistently, reducing the risk of issues during re-runs or backfills. In cases where you need to overwrite specific partitions, &lt;STRONG&gt;write_table&lt;/STRONG&gt; still offers the advantage of maintaining data consistency while enabling better tracking.&lt;/P&gt;&lt;P&gt;Give it a try and let me know.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 03:18:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96062#M3743</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2024-10-25T03:18:24Z</dc:date>
    </item>
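The idempotency point in the reply can be illustrated with a similar pure-Python sketch (a hypothetical helper, not the actual write_table implementation): re-running a keyed upsert with the same input leaves the table unchanged, and reporting which keys were inserted, updated, or left untouched is exactly the signal the original question wanted for spotting records that came from a different run.

```python
def merge_with_report(table, new_rows):
    """Upsert by primary key, as a stand-in for write_table's MERGE behaviour,
    reporting which keys were inserted, updated, or left untouched."""
    inserted = {k for k in new_rows if k not in table}
    updated = {k for k in new_rows if k in table}
    untouched = {k for k in table if k not in new_rows}
    merged = dict(table)
    merged.update(new_rows)
    return merged, inserted, updated, untouched

daily_run = {("2024-10-24", "a"): 1.0, ("2024-10-24", "b"): 2.0}

# First run inserts everything; a re-run of the same pipeline only updates.
once, ins1, upd1, _ = merge_with_report({}, daily_run)
twice, ins2, upd2, untouched2 = merge_with_report(once, daily_run)

print(once == twice)           # True: an idempotent write, the re-run changes nothing
# During a backfill, a key that lands in `untouched` was neither inserted nor
# updated, so it must have come from a different run: the pipeline smell
# described in the original question.
print(ins1 == set(daily_run))  # True
print(upd2 == set(daily_run))  # True
```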
    <item>
      <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96197#M3746</link>
      <description>&lt;P&gt;Thank you for your reply! I have one more question.&lt;/P&gt;&lt;P&gt;Since write_table performs a MERGE when writing, does this mean it scans the entire landing feature table for updates during each write? For large tables, that could become a slow and resource-intensive process.&lt;/P&gt;&lt;P&gt;Besides leveraging daily partitioning, what strategies would you recommend to optimize this?&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 17:32:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96197#M3746</guid>
      <dc:creator>zed</dc:creator>
      <dc:date>2024-10-25T17:32:59Z</dc:date>
    </item>
    <item>
      <title>Re: Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE</title>
      <link>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96202#M3747</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/129200"&gt;@zed&lt;/a&gt;, maybe you can consider &lt;STRONG&gt;using partition pruning&lt;/STRONG&gt; to limit the scope of the merge operation when calling write_table. Partitioning your feature table by daily or monthly increments will help &lt;STRONG&gt;reduce the data scanned&lt;/STRONG&gt; during each merge, as only the relevant partitions are processed. Additionally, &lt;STRONG&gt;indexing frequently used columns&lt;/STRONG&gt; or leveraging &lt;STRONG&gt;Z-order clustering&lt;/STRONG&gt; on relevant columns can further optimize read performance. You may also want to &lt;STRONG&gt;batch updates&lt;/STRONG&gt; or focus only on new and changed records to reduce the load on large tables. Finally, monitor &lt;STRONG&gt;job performance metrics&lt;/STRONG&gt; to track bottlenecks and ensure resources are used efficiently during each write operation.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 20:02:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/consequences-of-not-using-write-table-with-feature-engineering/m-p/96202#M3747</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2024-10-25T20:02:21Z</dc:date>
    </item>
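The pruning advice can be sketched in pure Python as well (the rows-scanned counts are illustrative only; real Delta MERGE statistics depend on file layout and data skipping): restricting the merge to the partitions present in the incoming batch means only those partitions' rows are read, not the whole table.

```python
# Toy partitioned table: {date: {entity_id: value}}
table = {
    "2024-10-23": {"a": 1.0, "b": 2.0},
    "2024-10-24": {"a": 3.0, "b": 4.0},
    "2024-10-25": {"a": 5.0},
}

def merge_full_scan(table, batch):
    """Merge without a partition predicate: every partition is scanned."""
    scanned = sum(len(rows) for rows in table.values())
    for (date, eid), val in batch.items():
        table.setdefault(date, {})[eid] = val
    return scanned

def merge_pruned(table, batch):
    """Merge constrained to the partitions present in the batch (partition pruning)."""
    touched = {date for (date, _eid) in batch}
    scanned = sum(len(table.get(date, {})) for date in touched)
    for (date, eid), val in batch.items():
        table.setdefault(date, {})[eid] = val
    return scanned

batch = {("2024-10-25", "b"): 6.0}  # a daily increment touching one partition

full = merge_full_scan({d: dict(r) for d, r in table.items()}, batch)
pruned = merge_pruned({d: dict(r) for d, r in table.items()}, batch)
print(full, pruned)  # 5 1: pruning reads one partition instead of all three
```

The end state of the table is identical either way; only the amount of data read differs, which is why daily partitioning plus a partition predicate is the first lever for large merge-based writes.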
  </channel>
</rss>