Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Serverless Compute - Spark - Jobs failing with Max iterations (1000) reached for batch Resolution

Ramana
Valued Contributor

Hello Community,

We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we have faced several challenges, and this is one of them.

When we execute our existing jobs on Serverless Compute, jobs that deal with a small amount of data or a small number of stages work great. But when we use Serverless Compute to process a large amount of data with a large number of intermediate transformations, the job fails with the following error:

Exception: (java.lang.RuntimeException) Max iterations (1000) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value

This error indicates that the query plan required more than the default 1000 iterations to resolve, likely due to deeply nested logic or complex transformations in our code. However, in serverless environments, the spark.sql.analyzer.maxIterations configuration is not accessible or overridable, as it is not exposed via Spark Connect.
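
For reference, the error message's own suggestion can be applied on Classic Compute, where Spark SQL configs are settable per session; a minimal sketch (the value 2000 is just an example):

    # Works on Classic Compute; on Serverless this config is not exposed via Spark Connect.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.analyzer.maxIterations", "2000")  # example value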

Has anyone faced a similar issue?

Any suggestion or recommendation is greatly appreciated.

Screenshots: (two error screenshots attached to the original post)

#ServerlessCompute

#DataEngineering

#ClassicCompute-to-ServerlessCompute-Migration

#Migration

Thanks
Ramana
5 REPLIES

K_Anudeep
Databricks Employee

Hello @Ramana ,

The above error occurs when the Spark SQL analyser is unable to resolve a query within the fixed maximum number of rule-application iterations (default 1000) of its internal "Resolution" batch of logical planning. This typically happens with particularly complex queries, especially those that involve:

  • Excessively deep or "chained" query plans, often produced by repeatedly applying DataFrame transformations, such as many chained .withColumn() calls (see the sketch after this list)
  • Highly nested views or subqueries, especially those involving multiple self-joins or recursive structures
  • Generated query plans with conflicting or redundant attributes
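
As an illustration of the first point, a hypothetical snippet (not your actual job) showing how the plan deepens:

    # Each withColumn() wraps the previous plan in another projection that the
    # analyser must resolve, so long chains produce very deep logical plans.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)

    for i in range(200):                      # 200 chained transformations
        df = df.withColumn(f"col_{i}", F.col("id") * i)

    df.explain(extended=True)                 # the analysed plan is already very deep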

In Databricks Serverless and some managed environments, most Spark SQL configs, including spark.sql.analyzer.maxIterations, cannot be changed. Thus, increasing this setting is not a viable workaround on Serverless.

So the only option is to reduce the complexity of the logical plan the analyser generates. That can be done by breaking the query into smaller steps and materialising each one before proceeding to the next transformation, for example as sketched below.
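
A minimal sketch of that approach, with hypothetical table names (on Serverless, writing each stage to a Delta table stands in for caching, which is not available there):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Stage 1: first block of transformations, materialised to a staging table.
    stage1 = (
        spark.table("catalog.schema.source_table")
             .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    )
    stage1.write.mode("overwrite").saveAsTable("catalog.schema.stage1_tmp")

    # Stage 2: read the materialised result back, so the analyser starts from a
    # fresh, shallow plan instead of the full chain of earlier transformations.
    stage2 = (
        spark.table("catalog.schema.stage1_tmp")
             .withColumn("name", F.regexp_replace("name", r"\s+", " "))
    )
    stage2.write.mode("overwrite").saveAsTable("catalog.schema.final_table")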

Please let me know if you have any further questions.

Ramana
Valued Contributor

Thanks for sharing your thoughts.

The jobs that are actually failing run a simple SELECT * FROM source_table WHERE where_clause with some level of JDBC partitioning, followed by simple DataFrame transformations such as casting data types, applying some regex, etc. (maybe 10-15 different transformations). These jobs are not at all complex, but they still fail when we process a large amount of data. I don't think a logical plan varies based on the data; it varies based on the number and type of transformations, not the size of the data (mostly, though there may be exceptions).
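
For context, the shape of these jobs is roughly the following (a sketch with hypothetical JDBC options and column names, not the actual code):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Partitioned JDBC read of a simple SELECT with a WHERE clause.
    df = (
        spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://<host>:5432/<db>")   # placeholder
             .option("dbtable", "(SELECT * FROM source_table WHERE <where_clause>) AS src")
             .option("partitionColumn", "id")
             .option("lowerBound", "1")
             .option("upperBound", "10000000")
             .option("numPartitions", "32")
             .load()
    )

    # 10-15 simple column-level transformations: casts, regex clean-up, etc.
    df = (
        df.withColumn("id", F.col("id").cast("bigint"))
          .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
          .withColumn("name", F.regexp_replace("name", r"[^A-Za-z0-9 ]", ""))
          .withColumn("created_at", F.to_timestamp("created_at"))
    )

    df.write.mode("overwrite").saveAsTable("catalog.schema.target_table")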

Persisting data at every intermediate step is not a good idea, especially with Serverless, because it has no support for caching/persisting and only limited temp-view creation.

I can optimize complex queries, but there is no scope to optimize a simple SELECT and simple DataFrame transformations.

The main goal of Serverless is to reduce this kind of burden on our processes by applying these techniques dynamically, which is the reason Databricks does not allow us to set any custom configurations. If we need to do all of this ourselves, then Databricks should allow us to set these configurations.

Thanks
Ramana

K_Anudeep
Databricks Employee

Hello @Ramana ,

You're right that data volume doesn't change the logical plan, but your pattern (for example, SELECT * from a wide table plus 10-15 column transforms) can still exceed the analyzer's fixed iteration cap on Serverless: each * expansion and each chained withColumn/cast/regex adds more alias-resolution work, resulting in a huge stack of projections that pushes the analyzer past the limit.
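
As an illustration of that mechanism (hypothetical column names; whether collapsing the chain helps in your specific job is an assumption on my part, not a guarantee), expressing the transforms as a single select keeps the projection stack shallow, because the analyzer resolves one projection instead of one per withColumn:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("catalog.schema.source_table")   # placeholder table

    # Deep plan: one nested projection per withColumn call.
    deep = (
        df.withColumn("id", F.col("id").cast("bigint"))
          .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
          .withColumn("name", F.regexp_replace("name", r"\s+", " "))
    )

    # Shallow plan: the same transforms in a single projection.
    shallow = df.select(
        F.col("id").cast("bigint").alias("id"),
        F.col("amount").cast("decimal(18,2)").alias("amount"),
        F.regexp_replace("name", r"\s+", " ").alias("name"),
        *[c for c in df.columns if c not in {"id", "amount", "name"}],
    )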

I would suggest tracing which rules are being applied, and why they exceed the default iteration limit, by setting spark.sql.planChangeLog.level to INFO, and then simplifying the code as required.
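
A minimal sketch of what that looks like, assuming a cluster where this config can be set (the table name is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Log rule applications during analysis/optimisation to the driver logs.
    spark.conf.set("spark.sql.planChangeLog.level", "INFO")

    # Triggering planning on the problematic query now logs each rule that
    # changes the plan, which shows which Resolution-batch rules keep firing.
    spark.table("catalog.schema.source_table").explain(extended=True)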

Ramana
Valued Contributor

Thank you.

But the same job with a LIMIT clause works great. I don't think it is related to the logical plan, but I will look into your suggestion to trace the issue down.

If serverless doesn't work for these basic transformations, it will be tough to utilize for complex jobs (like dynamic code generation jobs), which is what I am trying to convey here.

Migration from Classic to Serverless is not straightforward, and it appears that most Classic Compute jobs would need to be rewritten to run on Serverless.

Thanks
Ramana

Ramana
Valued Contributor

@K_Anudeep FYI: Serverless Compute doesn't support spark.sql.planChangeLog.level.

If we try to set it, the job fails with the error [CONFIG_NOT_AVAILABLE] Configuration spark.sql.planChangeLog.level is not available. SQLSTATE: 42K0I.

Classic Compute supports it (as expected). I am trying to capture the statistics on Classic, but so far I don't see anything suspicious there. Since I set the node type and the min and max workers, Classic can accommodate the load. However, with Serverless I have no idea how to view these stats, because there is no Spark UI available for Serverless Jobs (I know Serverless has the Query History option, but I am not sure how it replaces the Spark UI).

I feel like switching from Classic to Serverless is an architectural change rather than a simple migration or Spark version upgrade.

I will share the Classic stats soon.

That being said, I don't think Serverless is a fit for any of my company's Spark jobs, at least for now; this may change in the future.

Thanks
Ramana
