โ10-11-2023 12:40 AM
Hi, Im trying to find out what is bottleneck on cluster when running loading process.
Scenario: Loading CDC changes from sql server to Raw zone and merge changes into Bronze zone and then merge Bronze to Silver. All is orchestrated in data factory as a loop per table. Parallel run in data factory is set to 5. So max 5 tables and notebooks are running simultaneously.
Strange is that when process starts, first 25 minutes cluster is not performing merges and first 5 notebooks take about 25 minutes. Then, cluster is autoscaled from 2 to 6 workers and the rest 45 notebooks finish in 10 minutes.
I tried to set 6 workers to cluster and start it with startup job 10 minutes before the load, but the result was the same.
I can see there is peak of memory use on driver from the beginning, workers memory or cpu dont peak. So I believe that bottleneck is driver trying to spread work across the workers. Am I correct?
I attached screenshots from ganglia. First is state 10 minutes after start, second is state after 30 minutes and third is after autoscaling 35 minutes after start.
Thanks.
โ10-12-2023 06:06 AM
Hi, The load looks okay. If you see the maximum memory is free and not used (fresh lime colour and the colour index). Could you please send the driver logs to check on this further?
โ10-13-2023 03:24 AM
Logs are too long and txt is not supported. But here are some details:
stderr:
Unexpected internal error while setting REPL context in `pre_command_execute`: An error occurred while calling o375.getLocalProperty. Trace:
py4j.security.Py4JSecurityException: Method public java.lang.String org.apache.spark.api.java.JavaSparkContext.getLocalProperty(java.lang.String) is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext
at py4j.security.WhitelistingPy4JSecurityManager.checkCall(WhitelistingPy4JSecurityManager.java:473)
at py4j.Gateway.invoke(Gateway.java:305)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
stdout:
2023-10-10 07:01:53,801 metastoreChecker WARN RollingFileAppender 'publicFile.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:01:53,803 metastoreChecker WARN RollingFileAppender 'privateFile.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:01:53,803 metastoreChecker WARN RollingFileAppender 'com.databricks.UsageLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:01:53,804 metastoreChecker WARN RollingFileAppender 'com.databricks.ProductLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:01:53,805 metastoreChecker WARN RollingFileAppender 'com.databricks.LineageLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:01:53,805 metastoreChecker WARN RollingFileAppender 'com.databricks.MetricsLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:01:53,806 metastoreChecker WARN RollingFileAppender 'dltExecution.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10T07:04:47.535+0000: [GC (Allocation Failure) [PSYoungGen: 1785344K->82418K(1963520K)] 2550598K->847680K(6237184K), 0.0343345 secs] [Times: user=0.10 sys=0.00, real=0.04 secs]
2023-10-10 07:06:54,852 metastoreChecker WARN RollingFileAppender 'publicFile.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:06:54,853 metastoreChecker WARN RollingFileAppender 'privateFile.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:06:54,854 metastoreChecker WARN RollingFileAppender 'com.databricks.UsageLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:06:54,854 metastoreChecker WARN RollingFileAppender 'com.databricks.ProductLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:06:54,855 metastoreChecker WARN RollingFileAppender 'com.databricks.LineageLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:06:54,855 metastoreChecker WARN RollingFileAppender 'com.databricks.MetricsLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:06:54,856 metastoreChecker WARN RollingFileAppender 'dltExecution.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:11:55,928 metastoreChecker WARN RollingFileAppender 'publicFile.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:11:55,929 metastoreChecker WARN RollingFileAppender 'privateFile.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:11:55,930 metastoreChecker WARN RollingFileAppender 'com.databricks.UsageLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:11:55,931 metastoreChecker WARN RollingFileAppender 'com.databricks.ProductLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:11:55,931 metastoreChecker WARN RollingFileAppender 'com.databricks.LineageLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:11:55,932 metastoreChecker WARN RollingFileAppender 'com.databricks.MetricsLogging.appender': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10 07:11:55,932 metastoreChecker WARN RollingFileAppender 'dltExecution.rolling': The bufferSize is set to 8192 but bufferedIO is not true
2023-10-10T07:13:27.207+0000: [GC (Allocation Failure) [PSYoungGen: 1869298K->89868K(1959936K)] 2634560K->855139K(6233600K), 0.0350628 secs] [Times: user=0.11 sys=0.01, real=0.04 secs]
stdout:
CONF_DRIVER_JAVA_OPTS = -Djava.io.tmpdir=/local_disk0/tmp -XX:-OmitStackTraceInFastThrow -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:-UseContainerSupport -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -XX:+PrintGCDetails -verbose:gc -Xss4m -Djava.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl -Djavax.net.ssl.sessionCacheSize=10000 -Dscala.reflect.runtime.disable.typetag.cache=true -Dcom.google.cloud.spark.bigquery.repackaged.io.netty.tryReflectionSetAccessible=true -Dlog4j2.formatMsgNoLookups=true -Ddatabricks.serviceName=driver-1 -Xms6260m -Xmx6260m -Dspark.ui.port=40001
CONF_EXECUTOR_JAVA_OPTS = -Dspark.executor.extraJavaOptions="-Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing -XX:PerMethodRecompilationCutoff=-1 -XX:PerBytecodeRecompilationCutoff=-1 -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:-UseContainerSupport -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -XX:+PrintGCDetails -verbose:gc -Xss4m -Djava.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl -Djavax.net.ssl.sessionCacheSize=10000 -Dscala.reflect.runtime.disable.typetag.cache=true -Dcom.google.cloud.spark.bigquery.repackaged.io.netty.tryReflectionSetAccessible=true -Dlog4j2.formatMsgNoLookups=true -Ddatabricks.serviceName=spark-executor-1" -Dspark.executor.memory=1821m
JAVA_OPTS = -Djava.io.tmpdir=/local_disk0/tmp -XX:-OmitStackTraceInFastThrow -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:-UseContainerSupport -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -XX:+PrintGCDetails -verbose:gc -Xss4m -Djava.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl -Djavax.net.ssl.sessionCacheSize=10000 -Dscala.reflect.runtime.disable.typetag.cache=true -Dcom.google.cloud.spark.bigquery.repackaged.io.netty.tryReflectionSetAccessible=true -Dlog4j2.formatMsgNoLookups=true -Ddatabricks.serviceName=driver-1 -Xms6260m -Xmx6260m -Dspark.ui.port=40001 -Dspark.executor.extraJavaOptions="-Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing -XX:PerMethodRecompilationCutoff=-1 -XX:PerBytecodeRecompilationCutoff=-1 -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:-UseContainerSupport -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -XX:+PrintGCDetails -verbose:gc -Xss4m -Djava.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl -Djavax.net.ssl.sessionCacheSize=10000 -Dscala.reflect.runtime.disable.typetag.cache=true -Dcom.google.cloud.spark.bigquery.repackaged.io.netty.tryReflectionSetAccessible=true -Dlog4j2.formatMsgNoLookups=true -Ddatabricks.serviceName=spark-executor-1" -Dspark.executor.memory=1821m -Dspark.executor.extraClassPath=/databricks/hadoop-safety-jars/*:/databricks/spark/dbconf/log4j/executor:/databricks/spark/dbconf/jets3t/:/databricks/spark/dbconf/hadoop:/databricks/hive/conf:/databricks/jars/*
Using Spark master spark://10.250.21.11:7077
Tue Oct 10 05:56:07 UTC 2023
Starting driver
Driver command is /usr/lib/jvm/zulu8-ca-amd64/jre//bin/java -Djava.io.tmpdir=/local_disk0/tmp -XX:-OmitStackTraceInFastThrow -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:-UseContainerSupport -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -XX:+PrintGCDetails -verbose:gc -Xss4m -Djava.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl -Djavax.net.ssl.sessionCacheSize=10000 -Dscala.reflect.runtime.disable.typetag.cache=true -Dcom.google.cloud.spark.bigquery.repackaged.io.netty.tryReflectionSetAccessible=true -Dlog4j2.formatMsgNoLookups=true -Ddatabricks.serviceName=driver-1 -Xms6260m -Xmx6260m -Dspark.ui.port=40001 -Dspark.executor.extraJavaOptions="-Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing -XX:PerMethodRecompilationCutoff=-1 -XX:PerBytecodeRecompilationCutoff=-1 -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:-UseContainerSupport -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -XX:+PrintGCDetails -verbose:gc -Xss4m -Djava.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl -Djavax.net.ssl.sessionCacheSize=10000 -Dscala.reflect.runtime.disable.typetag.cache=true -Dcom.google.cloud.spark.bigquery.repackaged.io.netty.tryReflectionSetAccessible=true -Dlog4j2.formatMsgNoLookups=true -Ddatabricks.serviceName=spark-executor-1" -Dspark.executor.memory=1821m -Dspark.executor.extraClassPath=/databricks/hadoop-safety-jars/*:/databricks/spark/dbconf/log4j/executor:/databricks/spark/dbconf/jets3t/:/databricks/spark/dbconf/hadoop:/databricks/hive/conf:/databricks/jars/* -cp /databricks/hadoop-safety-jars/*:/databricks/spark/dbconf/jets3t/:/databricks/spark/dbconf/log4j/driver:/databricks/hive/conf:/databricks/spark/dbconf/hadoop:/databricks/jars/* com.databricks.backend.daemon.driver.DriverDaemon &
Working directory is /databricks/driver
[Global flags]
intx ActiveProcessorCount = -1 {product}
uintx AdaptiveSizeDecrementScaleFactor = 4 {product}
uintx AdaptiveSizeMajorGCDecayTimeScale = 10 {product}
uintx AdaptiveSizePausePolicy = 0 {product}
uintx AdaptiveSizePolicyCollectionCostMargin = 50 {product}
uintx AdaptiveSizePolicyInitializingSteps = 20 {product}
uintx AdaptiveSizePolicyOutputInterval = 0 {product}
uintx AdaptiveSizePolicyWeight = 10 {product}
uintx AdaptiveSizeThroughPutPolicy = 0 {product}
uintx AdaptiveTimeWeight = 25 {product}
bool AdjustConcurrency = false {product}
bool AggressiveHeap = false {product}
bool AggressiveOpts = false {product}
intx AliasLevel = 3 {C2 product}
bool AlignVector = true {C2 product}
intx AllocateInstancePrefetchLines = 1 {product}
intx AllocatePrefetchDistance = 256 {product}
intx AllocatePrefetchInstr = 3 {product}
intx AllocatePrefetchLines = 3 {product}
intx AllocatePrefetchStepSize = 64 {product}
intx AllocatePrefetchStyle = 1 {product}
bool AllowJNIEnvProxy = false {product}
bool AllowNonVirtualCalls = false {product}
bool AllowParallelDefineClass = false {product}
bool AllowUserSignalHandlers = false {product}
bool AlwaysActAsServerClassMachine = false {product}
bool AlwaysCompileLoopMethods = false {product}
bool AlwaysLockClassLoader = false {product}
โ10-15-2023 09:40 PM
stdout and stderr looks okay, do you have the log4j to share? You can make a doc out of it and share the doc here.
โ10-15-2023 11:57 PM
โ10-16-2023 09:43 PM
Hi,
Looks like there are a few connection errors and it is not able to connect to the default hive metastore.
org.datanucleus.excepฦons.NucleusExcepฦon: Error creaฦng transacฦonal connecฦon factory
Caused by: com.zaxxer.hikari.pool.HikariPool$PoolIniฦalizaฦonExcepฦon: Failed to iniฦalize pool: Couldnot connect to address=(host=consolidated-switzerlandnorth-prod-metastore-0.mysql.database.azure.com)(port=3306)(type=master) : Socket fail to connect to host:consolidated-switzerlandnorth-prod-metastore-0.mysql.database.azure.com, port:3306. consolidated-switzerlandnorth-prod-metastore-0.mysql.database.azure.com
I think there is a network config (firewall rules or routing rules) which is blocking it. Please check. Thanks!
โ10-17-2023 12:55 AM
Thanks @Debayan for your help.
Question is why is cluster trying to connect to default hive metastore when we use unity catalog only and we never set up hive metastore? Do we need to set up also Hive metastore?
See attached docs.
โ
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโt want to miss the chance to attend and share knowledge.
If there isnโt a group near you, start one and help create a community that brings people together.
Request a New Group