Hi!
I get the error below when a cluster job starts up and tries to install a Python .whl file. (The wheel is hosted on an Azure Artifacts feed, though this looks more like a problem reading from disk/network storage.) The failure is seemingly random and intermittent, and from the error message it is clearly a networking/timeout problem.
I see the log below mentions Retry(total=4 ... Is it possible to increase/modify this, or perhaps add some exponential backoff?
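For context, if I were running pip myself I'd reach for the standard pip knobs below to bump the retry count and timeout (these are ordinary pip options; I'm not sure whether the Databricks cluster-wide library installer would pick up environment variables or an /etc/pip.conf written from an init script, which is really what I'm asking):

# per-invocation flags (pip's default is --retries 5, which matches the Retry(total=4 ... in the log)
pip install 'my.company.library==1.0.0' --retries 10 --timeout 60

# or the equivalent via environment variables, e.g. exported from a cluster init script
export PIP_RETRIES=10
export PIP_DEFAULT_TIMEOUT=60

# or a global pip.conf written by an init script
cat > /etc/pip.conf <<'EOF'
[global]
retries = 10
timeout = 60
EOF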
Thanks!
Alex
Library installation attempted on the driver node of cluster xxxxxxxx and failed. Please refer to the following error message or contact Databricks support. Error code: FAULT_OTHER, error message: org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install 'my.company.library==1.0.0' --disable-pip-version-check) exited with code 1. WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f872e82bf40>, 'Connectio ...