Data Engineering

Spark Driver Out of Memory Issue

chandan_a_v
Valued Contributor

Hi,

I am executing a simple job in Databricks and I am getting the error below. I increased the driver size, but I still faced the same issue.

Spark config:

from pyspark.sql import SparkSession

spark_session = (
    SparkSession.builder
    .appName("Demand Forecasting")
    .config("spark.yarn.executor.memoryOverhead", 2048)
    .getOrCreate()
)

Driver and worker node type: r5.2xlarge

10 worker nodes.

Error Log:

Caused by: org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296.
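For reference, 4294967296 bytes is 4 GB. A quick way to see which limits are in play is to read the relevant configs from the notebook; this is a minimal sketch that assumes the Databricks-provided `spark` session:

# Check the limits mentioned in the error (assumes the notebook-provided `spark` session).
print(spark.conf.get("spark.driver.maxResultSize", "not set"))   # driver-side size cap hit by the broadcast
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))    # auto-broadcast cutoff used by the optimizer

Note that spark.driver.maxResultSize is not a runtime SQL conf, so raising it is normally done in the cluster's Spark config (Advanced options) rather than from a running notebook.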

6 REPLIES

-werners-
Esteemed Contributor III

Looking at the error message, you are trying to broadcast a large table. Remove the broadcast statement on the large table and you should be fine.
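A minimal sketch of that suggestion, with placeholder DataFrames (df_large, df_small) and join key standing in for the real ones:

from pyspark.sql import functions as F

# Placeholder inputs; in the real job these come from the forecasting tables.
df_large = spark.range(10_000_000).withColumn("demand", F.rand())
df_small = spark.range(100_000)

# Instead of df_large.join(F.broadcast(df_small), "id"), drop the hint and let the
# optimizer choose the join strategy, or disable automatic broadcasting outright:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
joined = df_large.join(df_small, "id")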

Hi @Werner Stinckens,

I am getting the above issue while writing a Spark DataFrame as a Parquet file to AWS S3. I am not actually doing any broadcast join.

Thanks,

Chandan

Hubert-Dudek
Esteemed Contributor III

In my opinion, on Databricks you don't need to create the SparkSession yourself (spark_session = SparkSession.builder.appName("Demand Forecasting").config("spark.yarn.executor.memoryOverhead", 2048).getOrCreate()); the rest is as @Werner Stinckens said.
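In other words (a sketch, assuming a Databricks notebook where `spark` is already injected): the builder call just hands back the existing session, and a setting like spark.yarn.executor.memoryOverhead would belong in the cluster's Spark config if it is needed at all:

from pyspark.sql import SparkSession

# On Databricks the notebook already has a live session bound to `spark`;
# getOrCreate() returns that same session instead of starting a second driver.
existing = SparkSession.builder.appName("Demand Forecasting").getOrCreate()
print(existing.version)
print(existing is spark)  # typically True in a notebook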

chandan_a_v
Valued Contributor

I am getting the above issue while writing a Spark DataFrame as a Parquet file to AWS S3. I am not actually doing any broadcast join.
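A minimal sketch of the kind of write being described, with a hypothetical bucket path and a stand-in DataFrame; note there is no broadcast() call anywhere in it:

# Stand-in for the forecast output; the real DataFrame comes from the upstream pipeline.
forecast_df = spark.range(1_000).withColumnRenamed("id", "sku")

# Plain Parquet write to S3 (hypothetical bucket/prefix), no explicit broadcast involved.
forecast_df.write.mode("overwrite").parquet("s3://my-bucket/demand_forecasting/output/")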

-werners-
Esteemed Contributor III

As Hubert mentioned, you should not create a Spark session on Databricks; it is provided.

The fact that you do not broadcast manually makes me think Spark uses a broadcast join.

There is a KB about issues with that:

https://kb.databricks.com/sql/bchashjoin-exceeds-bcjointhreshold-oom.html

Can you check if it is applicable?
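One way to check is to print the physical plan of the DataFrame right before the write; this sketch assumes the DataFrame being written is the forecast_df from the sketch above. The error only surfaces at the write because that is when the lazy plan, including any upstream join, actually executes:

# If the plan contains BroadcastExchange / BroadcastHashJoin, the optimizer chose to
# broadcast one side of an upstream join, which is the situation the KB describes.
# If so, lowering or disabling spark.sql.autoBroadcastJoinThreshold (see the earlier
# sketch) avoids the oversized broadcast.
forecast_df.explain(mode="formatted")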

Hi @Kaniz Fatma,

Switching the runtime version to 10.4 fixed the issue for me.

Thanks,

Chandan
