Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

OPTIMIZE command on heavily nested table OOM error

cmathieu
New Contributor III

I'm trying to run the OPTIMIZE command on a table with fewer than 2,000 rows, but it is causing an out-of-memory error. The problem seems to come from the fact that it is a heavily nested staging table sitting between a JSON file and a flattened table. The table is used as both a streaming sink and a source.

Any clue? 

1 ACCEPTED SOLUTION


Brahmareddy
Honored Contributor III

Hi cmathieu,

How are you doing today? As I understand it, this sounds like one of those cases where the data volume is small but the complexity of the schema is what's causing the trouble. If your table has deeply nested JSON structures, especially with arrays or structs, the OPTIMIZE command can use a lot of memory to analyze and reorganize files, even when the row count is low. Since this table is used as both a streaming sink and a source, the metadata and execution plan can also grow large, especially if there is a lot of schema evolution or micro-batch state tracking involved.

One option is to avoid running OPTIMIZE on such staging tables altogether, especially if file size isn't a problem. If you do need it, try running OPTIMIZE with a filter on a small partition or on recent data and see whether that reduces memory pressure. You can also temporarily bump up spark.sql.shuffle.partitions and spark.executor.memoryOverhead, but for lightweight, nested streaming tables the best move is often to skip heavy operations like OPTIMIZE unless they are really needed.

Let me know if you want help testing with smaller partitions or tuning memory!
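If it helps, here is a minimal sketch of what the filtered OPTIMIZE and the config change could look like in a Databricks notebook. The table name staging.nested_events and the ingest_date partition column are placeholders for your staging table, so adjust them to match your schema; also note that the WHERE clause of OPTIMIZE only accepts predicates on partition columns.

# Spread the shuffle work across more, smaller tasks so each one holds less
# of the nested data in memory at once (this can be changed at runtime).
spark.conf.set("spark.sql.shuffle.partitions", "400")

# spark.executor.memoryOverhead cannot be changed at runtime; if you want to
# raise it, set it in the cluster's Spark config before the job starts.

# Compact only a recent slice instead of the whole table.
# staging.nested_events and ingest_date are hypothetical names for illustration.
spark.sql("""
    OPTIMIZE staging.nested_events
    WHERE ingest_date >= current_date() - INTERVAL 7 DAYS
""")

If even the filtered run hits memory limits, skipping OPTIMIZE on this staging table, as suggested above, is probably the simpler path.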

Regards,

Brahma

