Databricks Community

Phani1 · 04-26-2024

Hi Team,

We have to generate over 70 intermediate tables. Should we use temporary tables or dataframes, or should we create delta tables and truncate and reload? Having too many temporary tables could lead to memory problems. In this situation, what is the most effective approach when one intermediate table relies on another?

Regards,

Janga

Walter_C · 04-27-2024

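Temporary tables (temporary views in Spark) or DataFrames are a good fit when the data is needed only for the duration of a single session. Both are evaluated lazily, so they take up memory mainly when they are cached. As you mentioned, caching too many of them could lead to memory problems, while leaving them uncached means each downstream step recomputes the chain it depends on.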

On the other hand, Delta tables are the better option when you need to persist the data across multiple sessions or jobs. Delta tables also provide ACID transactions, scalable metadata handling, and unified streaming and batch data processing. However, creating Delta tables and then truncating and reloading them could be more time-consuming and resource-intensive, because every intermediate result is written to storage.
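One way to combine the two options when tables depend on each other is to persist only the results that are reused and keep simple pass-through steps as temporary views. The sketch below is a rough illustration, not a recommended standard: the table names, the SELECT statements, and the rule "use a view only when exactly one downstream step reads the result" are all assumptions made for the example. It only orders the steps and builds the SQL text, so it runs without a cluster.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: name -> (upstream intermediate tables, SELECT text).
# Only intermediate tables are listed as dependencies; raw sources appear in the SQL only.
STEPS = {
    "stg_orders": ((), "SELECT * FROM raw.orders"),
    "stg_customers": ((), "SELECT * FROM raw.customers"),
    "orders_enriched": (
        ("stg_orders", "stg_customers"),
        "SELECT o.*, c.region FROM stg_orders o "
        "JOIN stg_customers c ON o.customer_id = c.id",
    ),
    "daily_revenue": (
        ("orders_enriched",),
        "SELECT order_date, SUM(amount) AS revenue "
        "FROM orders_enriched GROUP BY order_date",
    ),
    "region_revenue": (
        ("orders_enriched",),
        "SELECT region, SUM(amount) AS revenue "
        "FROM orders_enriched GROUP BY region",
    ),
}


def plan(steps):
    """Return (name, statement) pairs in an order that respects dependencies."""
    # Count how many downstream steps read each intermediate result.
    consumers = {name: 0 for name in steps}
    for deps, _ in steps.values():
        for dep in deps:
            consumers[dep] += 1

    # TopologicalSorter takes a mapping of node -> predecessors.
    graph = {name: deps for name, (deps, _) in steps.items()}
    statements = []
    for name in TopologicalSorter(graph).static_order():
        select = steps[name][1]
        if consumers[name] == 1:
            # Assumed rule: a result read by exactly one step stays a lazy temp view.
            ddl = f"CREATE OR REPLACE TEMP VIEW {name} AS {select}"
        else:
            # Reused results and final outputs are materialized as tables.
            ddl = f"CREATE OR REPLACE TABLE {name} AS {select}"
        statements.append((name, ddl))
    return statements


for name, ddl in plan(STEPS):
    print(ddl)
```

In a notebook, each generated statement could be passed to spark.sql(...) in the order returned. CREATE OR REPLACE TABLE is used here in place of a separate truncate and reload; whether that suits your pipeline depends on how the tables are consumed.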

In terms of memory management, Databricks' Spark deployment has a specific memory layout with distinct memory zones for storage, execution, and user heap. Spark attempts to dynamically grow and shrink these regions based on usage and certain limits. For large-memory instances, Databricks enables off-heap memory and sets the size of the off-heap zone to 75% of the usable container memory for the instance, leaving the remaining 25% for heap memory.
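To make the 75/25 split concrete, here is a small worked calculation. The 64 GB figure is a made-up example of usable container memory, which is smaller than the instance's nominal RAM; the function name is also just for illustration.

```python
def split_container_memory(usable_gb, off_heap_fraction=0.75):
    """Split usable container memory into (off-heap, heap) sizes in GB."""
    off_heap_gb = usable_gb * off_heap_fraction
    heap_gb = usable_gb - off_heap_gb
    return off_heap_gb, heap_gb


# Hypothetical large-memory instance with 64 GB of usable container memory.
off_heap_gb, heap_gb = split_container_memory(64)
print(f"off-heap: {off_heap_gb} GB, heap: {heap_gb} GB")
# prints "off-heap: 48.0 GB, heap: 16.0 GB"
```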