I'm having trouble with a parent job that triggers multiple parallel runs of a child job in batches (e.g., 10 parallel runs per batch).
Occasionally, some of the parallel child runs crash a few minutes in, either during or immediately after cluster initialization. The crashed runs terminate with a 'Cancelled' result status.
Seemingly relevant excerpt from the log4j output:
Caused by: java.sql.SQLNonTransientConnectionException: Too many connections
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:175)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:110)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1107)
at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:502)
at org.mariadb.jdbc.MariaDbConnection.newConnection(MariaDbConnection.java:155)
at org.mariadb.jdbc.Driver.connect(Driver.java:86)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
... 116 more
Caused by: java.sql.SQLException: Too many connections
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.authentication(AbstractConnectProtocol.java:856)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:777)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:451)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1103)
... 123 more
22/01/14 21:24:42 WARN PythonDriverWrapper: setupRepl:ReplId-409cf-88936-53fe7-8: at the end, the status is Error(ReplId-409cf-88936-53fe7-8,org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient)
22/01/14 21:24:42 INFO DriverCorral$: Cleaning the wrapper ReplId-409cf-88936-53fe7-8 (currently in status Stopped(ReplId-409cf-88936-53fe7-8))
22/01/14 21:24:42 INFO DriverCorral$: sending shutdown signal for REPL ReplId-409cf-88936-53fe7-8
22/01/14 21:24:42 WARN PythonDriverWrapper: Repl ReplId-409cf-88936-53fe7-8 is already shutting down: Stopped(ReplId-409cf-88936-53fe7-8)
22/01/14 21:24:42 INFO DriverCorral$: sending the interrupt signal for REPL ReplId-409cf-88936-53fe7-8
22/01/14 21:24:42 INFO DriverCorral$: waiting for localThread to stop for REPL ReplId-409cf-88936-53fe7-8
22/01/14 21:24:42 INFO DriverCorral$: ReplId-409cf-88936-53fe7-8 successfully discarded
Full log4j-active output attached.
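
For reference, here is a minimal sketch of the batching pattern described above, with a retry added around each child run. It assumes the "Too many connections" error is transient and caused by many child clusters connecting to the metastore at the same moment, which is not confirmed. `run_child` is a hypothetical callable standing in for however the parent actually launches a child run; `run_with_retry`, `run_in_batches`, and all parameter names are illustrative, not part of any Databricks API.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(run_child, arg, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call run_child(arg), retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_child(arg)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Randomized delay so retries from one batch do not all reconnect at once.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))


def run_in_batches(run_child, args, batch_size=10, **retry_kwargs):
    """Run run_child over args, with at most batch_size calls in parallel per batch."""
    results = []
    for start in range(0, len(args), batch_size):
        batch = args[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            futures = [
                pool.submit(run_with_retry, run_child, a, **retry_kwargs)
                for a in batch
            ]
            # Collect results in submission order; a failed run re-raises here.
            results.extend(f.result() for f in futures)
    return results
```

A call such as `run_in_batches(run_child, params, batch_size=10, max_attempts=3, base_delay=30.0)` would retry each failed child up to three times. This only masks the symptom if the underlying limit is the metastore database's connection cap, so reducing `batch_size` may be the more direct test.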