01-14-2022 12:16 AM
Hi All,
I am facing a GC metadata issue while performing distributed computing on Spark. The GC log shows repeated Metadata GC Threshold collections:
2022-01-13T22:02:28.467+0000: [GC (Metadata GC Threshold) [PSYoungGen: 458969K->18934K(594944K)] 458969K->18958K(1954816K), 0.0144028 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2022-01-13T22:02:28.482+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 18934K->0K(594944K)] [ParOldGen: 24K->17853K(823296K)] 18958K->17853K(1418240K), [Metaspace: 20891K->20891K(1067008K)], 0.0201195 secs] [Times: user=0.14 sys=0.01, real=0.02 secs]
2022-01-13T22:02:29.459+0000: [GC (Metadata GC Threshold) [PSYoungGen: 432690K->84984K(594944K)] 450544K->105009K(1418240K), 0.0226140 secs] [Times: user=0.17 sys=0.05, real=0.03 secs]
2022-01-13T22:02:29.481+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 84984K->0K(594944K)] [ParOldGen: 20025K->91630K(1360384K)] 105009K->91630K(1955328K), [Metaspace: 34943K->34943K(1079296K)], 0.0307833 secs] [Times: user=0.13 sys=0.07, real=0.03 secs]
Cluster config:
Nodes: r5.4xlarge (128 GB, 16 cores)
8 worker nodes
Spark config:
spark_home_set("/databricks/spark")
config <- spark_config()
config$spark.sql.shuffle.partitions = 480
config$spark.executor.cores = 5
config$spark.executor.memory = "30G"
config$spark.rpc.message.maxSize = 1945   # max RPC message size, in MiB
config$spark.executor.instances = 24
config$spark.driver.memory = "30G"
config$spark.sql.execution.arrow.sparkr.enabled = TRUE   # Arrow-based serialization for SparkR
config$spark.driver.maxResultSize = 0     # 0 = unlimited result size on the driver
options(sparklyr.sanitize.column.names.verbose = TRUE)
options(sparklyr.verbose = TRUE)
options(sparklyr.na.omit.verbose = TRUE)
options(sparklyr.na.action.verbose = TRUE)
options(java.parameters = "-Xmx8000m")
sc <- spark_connect(method = "databricks", master = "yarn-client", config = config, spark_home = "/databricks/spark")
Please let me know how to fix this issue. I have tried different approaches, but I get the same error every time.
Thanks,
Chandan
01-14-2022 12:18 AM
Hi @Kaniz Fatma,
If you have any idea regarding this, please let me know.
Thanks,
Chandan
01-14-2022 05:18 AM
Can you try running a test with a maximally simplified spark_connect (so just method and spark_home)?
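For reference, a minimal sketch of what that simplified connection could look like; the spark_home path is reused from the original post, and everything else is left at defaults:

library(sparklyr)
# Maximally simplified connection: only method and spark_home, no custom tuning.
sc <- spark_connect(method = "databricks", spark_home = "/databricks/spark")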
Additionally, please check the following:
01-14-2022 07:59 AM
Hi @Hubert Dudek,
Thanks for the reply. I am running R code. I tried the approach you mentioned and got the same issue.
02-23-2022 05:04 PM
Hi @Chandan Angadi,
Are you getting any other error or warning messages, for example in your log4j or stderr logs?
I would also recommend running your code with the default values, i.e. without these settings:
config <- spark_config()
config$spark.sql.shuffle.partitions = 480
config$spark.executor.cores = 5
config$spark.executor.memory = "30G"
config$spark.rpc.message.maxSize = 1945
config$spark.executor.instances = 24
config$spark.driver.memory = "30G"
config$spark.sql.execution.arrow.sparkr.enabled = TRUE
config$spark.driver.maxResultSize = 0
This is just to narrow down whether the message happens with all the default values or not. Some of these Spark configs are not needed on Databricks unless you want to fine-tune your job. In this case we first need to make sure your job runs fine, to have a reference point.
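As a rough sketch of that baseline run (assuming spark_log() is usable with this connection method), it could look like this:

library(sparklyr)
# Connect with the default values only, as a reference point.
sc <- spark_connect(method = "databricks")
# Look for any other errors or warnings beyond the GC messages.
spark_log(sc, n = 200)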
04-05-2022 04:53 PM
Hi @Chandan Angadi,
Just a friendly follow-up. Are you still affected by this error message? Please let us know if we can help.
04-30-2022 11:20 AM
Hi @Jose Gonzalez,
Yes, the issue got resolved with the following spark config.
conf <- spark_config()
conf$sparklyr.apply.packages <- FALSE   # disable sparklyr's default package distribution for spark_apply()
sc <- spark_connect(method = "databricks", config = conf)
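For context, a hedged sketch of how this fix might be used end to end. The mtcars data and the derived column are purely illustrative, since the actual workload isn't shown in the thread; as far as I understand, sparklyr.apply.packages = FALSE means spark_apply() will not bundle and ship the local R package library to the executors, so the applied function below sticks to base R.

library(sparklyr)

conf <- spark_config()
conf$sparklyr.apply.packages <- FALSE   # skip distributing the local R package library

sc <- spark_connect(method = "databricks", config = conf)

# Illustrative workload: copy a small data set and run a base-R transformation on the workers.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
result <- spark_apply(mtcars_tbl, function(df) {
  df$power_to_weight <- df$hp / df$wt
  df
})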
04-28-2022 09:46 AM
Hey @Chandan Angadi,
Hope you are doing great!
Just checking in. Were you able to resolve your issue? If yes, would you like to mark an answer as best? It would be really helpful for the other members.
We'd love to hear from you.
04-30-2022 11:18 AM
Hi @Vartika Nain,
Sorry for the late reply, to you and to the others as well; I had some health issues, so I couldn't reply earlier.
Yes, the issue got resolved with the following spark config.
conf <- spark_config()
conf$sparklyr.apply.packages <- FALSE
sc <- spark_connect(method = "databricks", config = conf)
05-02-2022 06:05 AM
Hi @Chandan Angadi,
Hope you are doing well now.
Thanks for getting back to us and sending in your solution. Would you like to mark an answer as best?
Thanks!