Databricks Community

Phani1 · ‎04-26-2024

We have to generate over 70 intermediate tables. Should we use temporary tables or dataframes, or should we create delta tables and truncate and reload? Having too many temporary tables could lead to memory problems. In this situation, what is the most effective approach when one intermediate table relies on another?

NandiniN · ‎04-30-2024

Hi Phani1,

The answer depends on the use case, so if possible I would suggest working with your Solution Architect on this, or sharing more details here so we can give better guidance.

By that I mean I would first want to understand whether you really need 70 intermediate tables, or whether there could be a design where a categorical column distinguishes the rows within one larger table instead of spreading them across multiple tables.
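As a rough illustration of the categorical-column idea, the sketch below builds one query that stacks several same-schema SELECTs and tags each row with its source. The helper name `tagged_union_sql`, the `raw.*` table names, and the `channel` column are all hypothetical, and the labels are assumed to be simple strings without quotes.

```python
def tagged_union_sql(sources, category_col="category"):
    """Build a single query that stacks several same-schema SELECTs.

    sources      : dict of category label -> SELECT text (hypothetical queries)
    category_col : name of the column recording which source a row came from
    """
    parts = [
        f"SELECT '{label}' AS {category_col}, src.* FROM ({select_sql}) AS src"
        for label, select_sql in sources.items()
    ]
    return "\nUNION ALL\n".join(parts)


# Hypothetical example: three per-channel intermediates become one result.
query = tagged_union_sql(
    {
        "web": "SELECT order_id, amount FROM raw.web_orders",
        "store": "SELECT order_id, amount FROM raw.store_orders",
        "partner": "SELECT order_id, amount FROM raw.partner_orders",
    },
    category_col="channel",
)
print(query)
# In a notebook this text could be passed to spark.sql(query); downstream
# steps would then filter on the channel column instead of reading
# three separate tables.
```

This only works when the sources share a schema. If they do not, separate tables may still be the simpler choice.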

You asked, "or should we create delta tables and truncate and reload?" I take this to mean that you don't need the earlier snapshots of the data and that the tables would be used just for this transaction. Persisting to a Delta table is useful if non-deterministic functions are involved: the stored result stays fixed, whereas a lazily evaluated DataFrame can be recomputed with different values each time it is read, which leads to unpredictable results. Delta tables are also a good option for debugging, since you can peek into the intermediate results.
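For tables that depend on one another, a minimal sketch of the reload approach is to declare each step with its upstream tables and let a topological sort decide the load order. Everything named here (`STEPS`, `reload_statements`, the `stg.*` and `raw.*` tables) is hypothetical, and the `INSERT OVERWRITE` statements assume the target Delta tables already exist.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: table name -> (upstream intermediate tables, SELECT text).
# They are deliberately listed out of order to show that the sort fixes it.
STEPS = {
    "stg.orders_daily": (
        ("stg.orders_sampled",),
        "SELECT order_date, count(*) AS n FROM stg.orders_sampled GROUP BY order_date",
    ),
    "stg.orders_sampled": (
        ("stg.orders_clean",),
        # rand() is non-deterministic, so the sample is persisted once and
        # every downstream step reads the same rows.
        "SELECT * FROM stg.orders_clean WHERE rand() < 0.1",
    ),
    "stg.orders_clean": (
        (),
        "SELECT * FROM raw.orders WHERE order_id IS NOT NULL",
    ),
}


def reload_statements(steps):
    """Return INSERT OVERWRITE statements ordered so upstream tables load first."""
    graph = {name: set(deps) for name, (deps, _select) in steps.items()}
    ordered = TopologicalSorter(graph).static_order()
    return [
        f"INSERT OVERWRITE TABLE {name} {steps[name][1]}"
        for name in ordered
        if name in steps  # skip dependencies that are not intermediate steps
    ]


statements = reload_statements(STEPS)
for stmt in statements:
    print(stmt)
# In a notebook, each statement would be executed in turn with spark.sql(stmt).
```

Because each step is written before the next one reads it, any intermediate table can be queried afterwards to inspect what a step produced.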

Whether to use DataFrames or temporary tables depends on the size of these tables and how much resource (and cost) you want to allocate to your compute. If they are light and can be kept in memory, this would be the faster approach.
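For the in-memory route, here is a small sketch of chaining temporary views. A temp view is only a name for a query plan, so it is `cache()` that actually asks Spark to hold data in memory. The helper name `chain_temp_views` and the view names in the usage comment are hypothetical, and `spark` is assumed to be the SparkSession that the notebook already provides.

```python
def chain_temp_views(spark, steps, cache=()):
    """Register each step as a temp view so that later steps can select from it.

    spark : an existing SparkSession (assumed to be supplied by the notebook)
    steps : list of (view_name, select_sql) pairs in dependency order
    cache : names of views that are small and reused often enough to cache
    """
    views = {}
    for name, select_sql in steps:
        df = spark.sql(select_sql)
        if name in cache:
            # Marks the DataFrame for caching; the data is materialized
            # on the first action that touches it.
            df = df.cache()
        df.createOrReplaceTempView(name)
        views[name] = df
    return views


# Hypothetical usage in a notebook:
# views = chain_temp_views(
#     spark,
#     [
#         ("orders_clean", "SELECT * FROM raw.orders WHERE order_id IS NOT NULL"),
#         ("orders_daily", "SELECT order_date, count(*) AS n FROM orders_clean GROUP BY order_date"),
#     ],
#     cache={"orders_clean"},
# )
# views["orders_clean"].unpersist()  # release the cached data when finished
```

Caching only the few views that are reused, and unpersisting them afterwards, keeps the memory footprint smaller than caching all the intermediates.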

Once again, though, I would emphasize that it would be better for the account owners to get a better understanding of the data and then suggest the most optimized approach for your use case.

Thanks!