Consequences of Not Using write_table with Feature Engineering Client and INSERT OVERWRITE

zed
Visitor

Hello Databricks Community,

I am currently using the Feature Engineering client and have a few questions about best practices for writing to Feature Store Tables.

Instead of using the write_table method from the Feature Engineering client, I'm considering writing daily partitions to the underlying Delta table directly, using an INSERT OVERWRITE statement with a PARTITION clause.
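For concreteness, here is a minimal sketch of what I have in mind, assuming a Databricks notebook (where spark is predefined) and a feature table partitioned by a ds date column. All table, schema, and column names below are just placeholders:

    # Sketch only: "raw.events" and "ml.features.user_daily_features" are
    # hypothetical names; the feature table is assumed partitioned by `ds`.
    from pyspark.sql import functions as F

    run_date = "2024-06-01"

    # Compute the day's features with ordinary Spark transformations.
    daily_features = (
        spark.table("raw.events")
        .where(F.col("event_date") == run_date)
        .groupBy("user_id")
        .agg(F.count("*").alias("event_count"))
    )
    daily_features.createOrReplaceTempView("daily_features")

    # Replace exactly one daily partition, bypassing write_table entirely.
    spark.sql(f"""
        INSERT OVERWRITE ml.features.user_daily_features
        PARTITION (ds = '{run_date}')
        SELECT user_id, event_count
        FROM daily_features
    """)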

Before I proceed, I want to understand:

  1. What are the potential consequences of not using the write_table function for Feature Store tables in this scenario? Specifically, what Feature Store behaviour could silently break if I bypass write_table? (e.g. data not being properly catalogued, or loss of other out-of-the-box Feature Store functionality)

  2. Is INSERT OVERWRITE bad practice for managing daily partition updates in a Feature Store table?

On the one hand, I understand that INSERT OVERWRITE may lead to data loss, whereas write_table can help surface non-idempotent pipelines. For example, suppose a daily run generates a set of records, and during a backfill one of those records is neither UPDATED nor INSERTED by the merge: that record was only ever written by the prior daily run, which suggests there is an issue with that pipeline's idempotency.
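To illustrate the write_table path as I understand it, here is a sketch assuming a Unity Catalog feature table whose primary keys are (user_id, ds); the table name is again a placeholder:

    from databricks.feature_engineering import FeatureEngineeringClient

    fe = FeatureEngineeringClient()

    # Upsert the day's records based on the table's primary keys: rows with
    # matching keys are updated, new keys are inserted. Rows from a prior run
    # that the backfill no longer produces are left untouched, which is the
    # stale-record signal of a non-idempotent pipeline described above.
    fe.write_table(
        name="ml.features.user_daily_features",
        df=daily_features,
        mode="merge",
    )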

On the other hand, I may want to change the transformation code that generates a given partition and then OVERWRITE the data for a set of partitions; INSERT OVERWRITE handles that backfill with ease.
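For that kind of multi-partition backfill, I believe Delta's replaceWhere option would also work. A rough sketch, where backfilled_features is a hypothetical DataFrame covering exactly the backfill range:

    # Overwrite all partitions in the date range in one atomic write; rows
    # outside the predicate are untouched. By default, every row written
    # must match the replaceWhere predicate.
    (
        backfilled_features.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", "ds >= '2024-06-01' AND ds <= '2024-06-07'")
        .saveAsTable("ml.features.user_daily_features")
    )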

Would write_table be more suitable for ensuring that records are consistently inserted or updated during re-runs, preventing data loss, and identifying idempotency issues in backfill scenarios?

Any advice on how to best handle this scenario would be greatly appreciated!

Thanks in advance for your insights.
