05-05-2022 11:23 PM
Hi,
I am executing a simple job in Databricks and I am getting the error below. I increased the driver size, but I still face the same issue.
Spark config:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Demand Forecasting").config("spark.yarn.executor.memoryOverhead", 2048).getOrCreate()
Driver and worker node type: r5.2xlarge
10 worker nodes.
Error Log:
Caused by: org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=4294967296.
- Labels:
  - Memory
  - Spark config
  - Spark Driver
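For reference, 4294967296 bytes is 4 GiB, the spark.driver.maxResultSize currently in effect on the cluster. If raising that limit is the chosen route, it has to be in place when the driver starts; a minimal sketch, assuming the same session builder as above (the 8g value is only illustrative, not a recommendation):

```python
from pyspark.sql import SparkSession

# Sketch only: spark.driver.maxResultSize must be set before the driver starts,
# so on Databricks it is usually placed in the cluster's Spark config rather
# than in notebook code. The 8g below is just an example value.
spark_session = (
    SparkSession.builder
    .appName("Demand Forecasting")
    .config("spark.driver.maxResultSize", "8g")
    .getOrCreate()
)
```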
Accepted Solutions
06-02-2022 01:51 AM
Hi @Kaniz Fatma,
Switching the runtime version to 10.4 fixed the issue for me.
Thanks,
Chandan
05-05-2022 11:54 PM
Looking at the error message, you are trying to broadcast a large table. Remove the broadcast statement on the large table and you should be fine.
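If the broadcast comes from an explicit hint, dropping the hint is usually enough. A minimal sketch of the contrast, with made-up table and column names:

```python
from pyspark.sql import functions as F

# Hypothetical tables, for illustration only.
large_df = spark.table("sales")       # large fact table
lookup_df = spark.table("customers")  # smaller dimension table

# Forcing a broadcast of a table that is too large can blow past
# spark.driver.maxResultSize, since the table is first collected to the driver:
# joined = large_df.join(F.broadcast(lookup_df), "customer_id")

# Letting Spark choose the join strategy avoids that:
joined = large_df.join(lookup_df, "customer_id")
```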
05-08-2022 12:05 PM
Hi @Werner Stinckens,
I am getting the above issue while writing a Spark DataFrame as a Parquet file to AWS S3. I am not actually doing any broadcast join.
Thanks,
Chandan
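For what it's worth, the write is usually just the trigger: the whole plan, including any broadcast join Spark chose automatically upstream, only executes when the write runs. A minimal sketch of the kind of write described, with a hypothetical S3 path:

```python
# Illustrative only: df and the S3 path are placeholders.
# The OOM can surface here even though this line contains no join,
# because the write materializes the full upstream plan.
(
    df.write
      .mode("overwrite")
      .parquet("s3://my-bucket/demand-forecasting/output/")
)
```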
05-06-2022 09:04 AM
In my opinion, on Databricks you don't need to specify the SparkSession yourself (spark_session = SparkSession.builder.appName("Demand Forecasting").config("spark.yarn.executor.memoryOverhead", 2048).getOrCreate()), as it is already provided; for the rest, it is as @Werner Stinckens said.
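A minimal sketch of what that looks like in a Databricks notebook, where `spark` already exists (the table name below is hypothetical):

```python
# On Databricks the SparkSession is already available as `spark`;
# there is no need to build one, and spark.yarn.* options have no effect there.
df = spark.table("demand.forecast_input")  # hypothetical table name

# Runtime SQL options can still be tuned if needed, e.g. (example value only):
spark.conf.set("spark.sql.shuffle.partitions", "400")
```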
05-11-2022 06:25 AM
As Hubert mentioned: you should not create a Spark session on Databricks, as it is already provided.
The fact that you do not broadcast manually makes me think Spark is applying a broadcast join automatically.
There is a KB article about issues with that:
https://kb.databricks.com/sql/bchashjoin-exceeds-bcjointhreshold-oom.html
Can you check if it is applicable?
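If the KB applies, one common workaround (not necessarily the only one it lists) is to turn off automatic broadcast joins so Spark falls back to a shuffle-based join; a minimal sketch:

```python
# -1 disables automatic broadcast joins for this session;
# explicit broadcast() hints are not affected by this threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```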