Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

OPTIMIZE command on heavily nested table OOM error

cmathieu
New Contributor III

I'm trying to run the OPTIMIZE command on a table with fewer than 2,000 rows, but it is causing an out-of-memory error. The problem seems to come from the fact that it is a heavily nested staging table sitting between a JSON file and a flattened table. The table is used as both a streaming sink and a source.

Any clue? 

1 ACCEPTED SOLUTION


Brahmareddy
Honored Contributor III

Hi cmathieu,

How are you doing today? As I understand it, this sounds like one of those cases where the data volume is small but the complexity of the schema is what's causing the trouble. If your table has deeply nested JSON structures, especially with arrays or structs, the OPTIMIZE command can use a lot of memory to analyze and reorganize files, even when the row count is low. Since this table is used as both a streaming sink and a source, the metadata and execution plan can also grow large, especially if there is a lot of schema evolution or micro-batch state tracking involved.

One option is to avoid running OPTIMIZE on such staging tables altogether, especially if file size isn't a problem. If you do need it, try running OPTIMIZE with a filter on a small partition or on recent data and see whether that reduces memory pressure. You can also temporarily bump up spark.sql.shuffle.partitions and spark.executor.memoryOverhead, but for lightweight, nested streaming tables the best move is often to skip heavy operations like OPTIMIZE unless they are really needed.

Let me know if you want help testing with smaller partitions or tuning memory!
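If it helps, here is a minimal sketch of what the filtered OPTIMIZE and the config change could look like in a Databricks notebook. The table name staging.nested_events and the ingest_date partition column are placeholders for your staging table, so adjust them to match your schema; also note that the WHERE clause of OPTIMIZE only accepts predicates on partition columns.

# Spread the shuffle work across more, smaller tasks so each one holds less
# of the nested data in memory at once (this can be changed at runtime).
spark.conf.set("spark.sql.shuffle.partitions", "400")

# spark.executor.memoryOverhead cannot be changed at runtime; if you want to
# raise it, set it in the cluster's Spark config before the job starts.

# Compact only a recent slice instead of the whole table.
# staging.nested_events and ingest_date are hypothetical names for illustration.
spark.sql("""
    OPTIMIZE staging.nested_events
    WHERE ingest_date >= current_date() - INTERVAL 7 DAYS
""")

If even the filtered run hits memory limits, skipping OPTIMIZE on this staging table, as suggested above, is probably the simpler path.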

Regards,

Brahma

