Cluster library installation fails

jgen17
New Contributor II

Hello everyone,

I get a weird error when installing additional libraries in my cluster.

I have a predefined Databricks cluster (Standard_L8s_v2) as a compute instance, and I run pipelines on it from Azure Data Factory (ADF). The pipeline consists of several tasks, each of which runs Python code.

I install my Python code as a prebuilt wheel. Additionally, I need to add four more libraries under Task > Settings > Additional Libraries and install them with pip. This step is necessary so that PyTorch (one of the four libraries) is installed with GPU support, since the wheel's libraries and dependencies are defined with Poetry.
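
For reference, a task's full library set then looks roughly like the sketch below, written in the shape of a Databricks Jobs API libraries list. The wheel path and the pinned versions are the ones that appear in the Log4j output further down; the variable name is just for illustration.

# Rough sketch of one task's library list in Jobs API form; the wheel
# path and pinned versions are taken from the Log4j output below.
task_libraries = [
    {"whl": "dbfs:/mnt/cddm-DEV/application/MIR_task/cddm-0.x-py3-none-any.whl"},
    {"pypi": {"package": "torch==1.13.1"}},
    {"pypi": {"package": "lightning==2.0.2"}},
    {"pypi": {"package": "pytorch-lightning==2.0.2"}},
    {"pypi": {"package": "sentence-transformers==2.2.2"}},
]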

But the library installation fails regularly, and not always for the same task: it might fail for Task1 one day and for Task2 the next. Sometimes all installations succeed and sometimes all of them fail.

Here's the error message:

run failed with error message Library installation failed for library due to user error for pypi { package: "sentence-transformers==2.2.2" } Error messages: Library installation failed after PENDING for 10 minutes since cluster entered RUNNING state. Error Code: CHAUFFEUR_RPC_SERVER_UNAVAILABLE. Library request cannot reach driver node on cluster 0511-114900-l5r08j93. This could be caused by network connectivity to the driver node being temporarily down. If this doesn't self correct in a while, please check your network settings or contact Databricks Support.

What I suspected is the cluster configuration: "Terminate after 10 minutes of inactivity." My assumption was that the cluster is not considered active while the libraries from the additional libraries section are being installed, so it gets terminated in the meantime. Does that make sense?

I increased the timeout to 20 and then 30 minutes, but it still sometimes fails. It seems to work more stably at 40 minutes, though I haven't really validated that. It also fails more often when the pipeline is started by an automatic trigger than when I start it manually (I don't understand why).
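
As a sanity check, the current autotermination value can be read back via the Clusters API; here is a minimal sketch in Python (DATABRICKS_HOST and DATABRICKS_TOKEN are placeholder environment variables, and the cluster ID is the one from the error message):

import os
import requests

# Placeholder workspace URL and personal access token.
HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Fetch the cluster spec; the cluster ID is taken from the error message.
resp = requests.get(
    f"{HOST}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": "0511-114900-l5r08j93"},
)
print(resp.json().get("autotermination_minutes"))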

Does anyone have an idea why the library installation fails? Let me know if you need further context!

Thanks for your help. Really appreciated!

4 Replies

Kaniz
Community Manager

Hi @jgen17, did you check the Databricks logs in Azure Log Analytics for more information on the error?

jgen17
New Contributor II

Hi @Kaniz ,

Thanks for your response. Below are the last 5 minutes of Log4j output while the libraries are being installed. Is that what you mean? It fails almost exactly 10 minutes after the Log4j output starts. So it might be that the cluster is not in an active/running state while the libraries are being installed, and is therefore shut down?

23/10/19 12:10:11 INFO PoolingHiveClient: Hive metastore connection pool implementation is HikariCP
23/10/19 12:10:11 INFO LocalHiveClientsPool: Create Hive Metastore client pool of size 1
23/10/19 12:10:11 INFO DriverCorral: DBFS health check ok
23/10/19 12:10:12 INFO HiveClientImpl: Warehouse location for Hive client (version 0.13.1) is dbfs:/user/hive/warehouse
23/10/19 12:10:12 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
23/10/19 12:10:12 INFO ObjectStore: ObjectStore, initialize called
23/10/19 12:10:12 INFO Persistence: Property datanucleus.fixedDatastore unknown - will be ignored
23/10/19 12:10:12 INFO Persistence: Property datanucleus.connectionPool.idleTimeout unknown - will be ignored
23/10/19 12:10:12 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
23/10/19 12:10:12 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
23/10/19 12:10:12 INFO HikariDataSource: HikariPool-1 - Started.
23/10/19 12:10:13 INFO HikariDataSource: HikariPool-2 - Started.
23/10/19 12:10:13 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
23/10/19 12:10:16 INFO ObjectStore: Initialized ObjectStore
23/10/19 12:10:16 INFO HiveMetaStore: Added admin role in metastore
23/10/19 12:10:16 INFO HiveMetaStore: Added public role in metastore
23/10/19 12:10:16 INFO HiveMetaStore: No user is added in admin role, since config is empty
23/10/19 12:10:16 INFO HiveMetaStore: 0: get_database: default
23/10/19 12:10:16 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
23/10/19 12:10:17 INFO HiveMetaStore: 0: get_database: default
23/10/19 12:10:17 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
23/10/19 12:10:17 INFO DriverCorral: Metastore health check ok
23/10/19 12:10:43 INFO SharedDriverContext: Successfully attached library dbfs:/mnt/cddm-DEV/application/MIR_task/cddm-0.x-py3-none-any.whl to Spark
23/10/19 12:10:43 INFO LibraryState: [Thread 132] Successfully attached library dbfs:/mnt/cddm-DEV/application/MIR_task/cddm-0.x-py3-none-any.whl
23/10/19 12:10:43 INFO SharedDriverContext: [Thread 132] attachLibrariesToSpark PythonPyPiPkgId(torch,Some(1.13.1),None,List())
23/10/19 12:10:43 INFO SharedDriverContext: Attaching Python lib: python-pypi;torch;;1.13.1; to clusterwide nfs path
23/10/19 12:10:43 INFO Utils: resolved command to be run: List(bash, /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh, /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip, install, torch==1.13.1, --disable-pip-version-check)
23/10/19 12:13:57 INFO SharedDriverContext: Successfully attached library python-pypi;torch;;1.13.1; to Spark
23/10/19 12:13:57 INFO LibraryState: [Thread 132] Successfully attached library python-pypi;torch;;1.13.1;
23/10/19 12:13:57 INFO SharedDriverContext: [Thread 132] attachLibrariesToSpark PythonPyPiPkgId(lightning,Some(2.0.2),None,List())
23/10/19 12:13:57 INFO SharedDriverContext: Attaching Python lib: python-pypi;lightning;;2.0.2; to clusterwide nfs path
23/10/19 12:13:57 INFO Utils: resolved command to be run: List(bash, /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh, /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip, install, lightning==2.0.2, --disable-pip-version-check)
23/10/19 12:14:56 INFO DataSourceFactory$: DataSource Jdbc URL: jdbc:mariadb://consolidated-westeuropec2-prod-metastore-3.mysql.database.azure.com:3306/organization257243788442763?useSSL=true&sslMode=VERIFY_CA&disableSslHostnameVerification=true&trustServerCertificate=false&serverSslCert=/databricks/common/mysql-ssl-ca-cert.crt
23/10/19 12:14:56 INFO HikariDataSource: metastore-monitor - Starting...
23/10/19 12:14:56 INFO HikariDataSource: metastore-monitor - Start completed.
23/10/19 12:14:56 INFO HikariDataSource: metastore-monitor - Shutdown initiated...
23/10/19 12:14:56 INFO HikariDataSource: metastore-monitor - Shutdown completed.
23/10/19 12:14:56 INFO MetastoreMonitor: Metastore healthcheck successful (connection duration = 194 milliseconds)
23/10/19 12:15:09 INFO SharedDriverContext: Successfully attached library python-pypi;lightning;;2.0.2; to Spark
23/10/19 12:15:09 INFO LibraryState: [Thread 132] Successfully attached library python-pypi;lightning;;2.0.2;
23/10/19 12:15:09 INFO SharedDriverContext: [Thread 132] attachLibrariesToSpark PythonPyPiPkgId(pytorch-lightning,Some(2.0.2),None,List())
23/10/19 12:15:09 INFO SharedDriverContext: Attaching Python lib: python-pypi;pytorch-lightning;;2.0.2; to clusterwide nfs path
23/10/19 12:15:09 INFO Utils: resolved command to be run: List(bash, /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh, /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip, install, pytorch-lightning==2.0.2, --disable-pip-version-check)
23/10/19 12:15:11 INFO DriverCorral: DBFS health check ok
23/10/19 12:15:11 INFO HiveMetaStore: 0: get_database: default
23/10/19 12:15:11 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
23/10/19 12:15:11 INFO DriverCorral: Metastore health check ok
23/10/19 12:15:15 INFO SharedDriverContext: Successfully attached library python-pypi;pytorch-lightning;;2.0.2; to Spark
23/10/19 12:15:15 INFO LibraryState: [Thread 132] Successfully attached library python-pypi;pytorch-lightning;;2.0.2;
23/10/19 12:15:15 INFO SharedDriverContext: [Thread 132] attachLibrariesToSpark PythonPyPiPkgId(sentence-transformers,Some(2.2.2),None,List())
23/10/19 12:15:15 INFO SharedDriverContext: Attaching Python lib: python-pypi;sentence-transformers;;2.2.2; to clusterwide nfs path
23/10/19 12:15:15 INFO Utils: resolved command to be run: List(bash, /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh, /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip, install, sentence-transformers==2.2.2, --disable-pip-version-check)
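
Looking at the timestamps, torch alone takes over three minutes to install (12:10:43 to 12:13:57), and the whole sequence is still running at 12:15:15, so the 10-minute window seems tight. If it helps with debugging, the per-library install state can also be polled via the Libraries API while this runs; a minimal sketch (same placeholder DATABRICKS_HOST/DATABRICKS_TOKEN environment variables as above, cluster ID from the error message):

import os
import time
import requests

HOST = os.environ["DATABRICKS_HOST"]    # placeholder workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]  # placeholder access token

# Poll per-library status (e.g. PENDING / INSTALLING / INSTALLED / FAILED)
# for the cluster named in the error message.
for _ in range(20):
    resp = requests.get(
        f"{HOST}/api/2.0/libraries/cluster-status",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"cluster_id": "0511-114900-l5r08j93"},
    )
    for lib in resp.json().get("library_statuses", []):
        print(lib["library"], lib["status"])
    time.sleep(30)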

Kaniz
Community Manager

Hi @jgen17, could you please share your cluster details?

 

jgen17
New Contributor II

Sure @Kaniz, here are the details:

Summary

1 Driver: 64 GB Memory, 8 Cores
Runtime: 11.3.x-scala2.12
Standard_L8s_v2
2 DBU/h

Databricks Runtime Version: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)

Node type: Standard_L8s_v2

Terminate after 10 minutes of inactivity.
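
In API terms, bumping the autotermination window on this cluster would look roughly like the sketch below (same placeholder environment variables as in the snippets above; note that the Clusters Edit endpoint expects the full cluster spec, so a real call would also carry over the remaining fields of the existing configuration):

import os
import requests

HOST = os.environ["DATABRICKS_HOST"]    # placeholder workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]  # placeholder access token

payload = {
    "cluster_id": "0511-114900-l5r08j93",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_L8s_v2",
    "autotermination_minutes": 40,
    # A real edit call must also include the rest of the existing
    # cluster spec (name, worker configuration, etc.).
}
requests.post(
    f"{HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)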

 
Is that helpful?