01-14-2022 12:16 AM
Hi All,
I am facing a GC metadata issue while performing distributed computing on Spark.
2022-01-13T22:02:28.467+0000: [GC (Metadata GC Threshold) [PSYoungGen: 458969K->18934K(594944K)] 458969K->18958K(1954816K), 0.0144028 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
2022-01-13T22:02:28.482+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 18934K->0K(594944K)] [ParOldGen: 24K->17853K(823296K)] 18958K->17853K(1418240K), [Metaspace: 20891K->20891K(1067008K)], 0.0201195 secs] [Times: user=0.14 sys=0.01, real=0.02 secs]
2022-01-13T22:02:29.459+0000: [GC (Metadata GC Threshold) [PSYoungGen: 432690K->84984K(594944K)] 450544K->105009K(1418240K), 0.0226140 secs] [Times: user=0.17 sys=0.05, real=0.03 secs]
2022-01-13T22:02:29.481+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 84984K->0K(594944K)] [ParOldGen: 20025K->91630K(1360384K)] 105009K->91630K(1955328K), [Metaspace: 34943K->34943K(1079296K)], 0.0307833 secs] [Times: user=0.13 sys=0.07, real=0.03 secs]
Cluster config:
Nodes - r5.4xlarge (128 GB, 16 cores)
8 Worker nodes
Spark config:
spark_home_set("/databricks/spark")
config <- spark_config()
config$spark.sql.shuffle.partitions = 480
config$spark.executor.cores = 5
config$spark.executor.memory = "30G"
config$spark.rpc.message.maxSize = 1945
config$spark.executor.instances = 24
config$spark.driver.memory = "30G"
config$spark.sql.execution.arrow.sparkr.enabled = TRUE
config$spark.driver.maxResultSize = 0
options(sparklyr.sanitize.column.names.verbose = TRUE)
options(sparklyr.verbose = TRUE)
options(sparklyr.na.omit.verbose = TRUE)
options(sparklyr.na.action.verbose = TRUE)
options(java.parameters = "-Xmx8000m")
sc <- spark_connect(method = "databricks", master = "yarn-client", config = config, spark_home = "/databricks/spark")
Please let me know how to fix this issue. I have tried different approaches but get the same error every time.
Thanks,
Chandan
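For reference, the "Metadata GC Threshold" cause in the log above means the JVM collected because Metaspace hit its current high-water mark while classes were still being loaded. A minimal sketch of one common mitigation, raising the initial Metaspace size (the 256m value is an illustrative assumption, not a tested recommendation for this workload):
library(sparklyr)
config <- spark_config()
# Raise the initial Metaspace size so fewer "Metadata GC Threshold" collections
# fire while classes are loaded on the driver and executors (256m is an assumed, illustrative value).
config$spark.driver.extraJavaOptions = "-XX:MetaspaceSize=256m"
config$spark.executor.extraJavaOptions = "-XX:MetaspaceSize=256m"
# Note: on Databricks the driver JVM is already running when a notebook attaches,
# so in practice these options usually go in the cluster's Spark config instead.
sc <- spark_connect(method = "databricks", config = config, spark_home = "/databricks/spark")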
01-14-2022 12:18 AM
Hi @Kaniz Fatma ,
If you have any idea regarding this, please let me know.
Thanks,
Chandan
01-14-2022 05:18 AM
Can you try running a test with a maximally simplified spark_connect (just method and spark_home)?
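A minimal connection for that test might look like this (a sketch, assuming the same Databricks spark_home path as in the post above):
library(sparklyr)
# Maximally simplified connection: only method and spark_home, everything else at defaults.
sc <- spark_connect(method = "databricks", spark_home = "/databricks/spark")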
Additionally, please check the following:
01-14-2022 07:59 AM
Hi @Hubert Dudek ,
Thanks for the reply. I am running R code. I tried the approach you mentioned and got the same issue.
02-23-2022 05:04 PM
Hi @Chandan Angadi ,
Are you getting any other error or warning messages, for example in your log4j or stderr logs?
I would also recommend running your code with the default values, without these settings:
config <- spark_config()
config$spark.sql.shuffle.partitions = 480
config$spark.executor.cores = 5
config$spark.executor.memory = "30G"
config$spark.rpc.message.maxSize = 1945
config$spark.executor.instances = 24
config$spark.driver.memory = "30G"
config$spark.sql.execution.arrow.sparkr.enabled = TRUE
config$spark.driver.maxResultSize = 0
This is just to narrow down whether the message happens with all the default values or not. Some of these Spark configs are not needed in Databricks unless you want to fine-tune your job. In this case we need to make sure your job runs fine with the defaults, to have a reference point.
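A sketch of that baseline run, with nothing overridden so executor sizing, shuffle partitions, and Arrow settings all stay at the cluster defaults:
conf <- spark_config()  # no overrides applied
sc <- spark_connect(method = "databricks", config = conf)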
04-05-2022 04:53 PM
Hi @Chandan Angadi,
Just a friendly follow-up. Are you still affected by this error message? Please let us know if we can help.
04-30-2022 11:20 AM
Hi @Jose Gonzalez ,
Yes, the issue got resolved with the following spark config.
conf = spark_config()
conf$sparklyr.apply.packages <- FALSE
sc <- spark_connect(method = "databricks", config = conf)
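For context: sparklyr.apply.packages sets the default for the packages argument of spark_apply(), so FALSE stops sparklyr from bundling the local R library and shipping it to the executors, relying instead on the packages already installed on the Databricks workers. A small sketch of exercising the connection above (the table and the doubling function are illustrative assumptions, not from this thread):
# Illustrative only: a tiny Spark DataFrame and an R closure run on the executors.
sdf <- sdf_len(sc, 100)
result <- spark_apply(sdf, function(df) {
  df$doubled <- df$id * 2  # plain R code executed per partition on the workers
  df
})
collect(result)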
04-28-2022 09:46 AM
Hey @Chandan Angadi
Hope you are doing great!
Just checking in. Were you able to resolve your issue? If yes, would you like to mark an answer as best? It would be really helpful for the other members.
We'd love to hear from you.
04-30-2022 11:18 AM
Hi @Vartika Nain ,
Sorry for the late reply to you and to the others as well; I had some health issues, so I couldn't reply earlier.
Yes, the issue got resolved with the following spark config.
conf = spark_config()
conf$sparklyr.apply.packages <- FALSE
sc <- spark_connect(method = "databricks", config = conf)
05-02-2022 06:05 AM
Hi @Chandan Angadi
Hope you are doing well now.
Thanks for getting back to us and sending in your solution. Would you like to mark an answer as best?
Thanks!