Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Network error on subsequent runs using serverless compute in DLT

ChrisLawford_n1
Contributor

Hello,

When running a DLT pipeline on serverless compute, our notebook first installs some Python wheels onto the cluster. We have noticed during development, when the pipeline is run many times in a short space of time, that the first run succeeds but a second run fails with the following error:

Processing /Workspace/Shared/libraries/CompanyName/common_library-3.1.1rc1-py3-none-any.whl
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f3d9bb12570>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/build/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f3d9b6e71a0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/build/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f3d9b51cb30>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/build/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f3d9b51cd40>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/build/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f3d9b51cf20>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/build/
INFO: pip is looking at multiple versions of common-library to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement build==1.2.1 (from common-library) (from versions: none)
ERROR: No matching distribution found for build==1.2.1

Looking at the cluster logs, I can see the successful library installs from the prior run, so I am lost as to how the cluster can lose its network connection on the second run.

1 REPLY

mark_ott
Databricks Employee

The error you’re seeing (“Network is unreachable” repeated during pip installs) on a serverless Delta Live Tables (DLT) pipeline, especially after a first successful run, appears to affect pipelines that are run repeatedly on serverless compute in rapid succession. Here’s a detailed analysis:

Likely Causes

  • Network Policy Reset or Resource Recycling: Serverless Databricks clusters are managed by Databricks and aggressively recycle resources between runs to optimize cost, so each pipeline execution can start in a fresh network environment. In some cases, egress (outbound) connections are not immediately or correctly re-established after recycling, leading to intermittent “network unreachable” errors during pip installs.

  • Temporary IP Blocking or NAT Exhaustion: When pipelines run frequently, reports in the Databricks and PyPI communities suggest that IP addresses from cloud-managed pools can be subject to temporary blocks, rate limits, or network-stack exhaustion, especially when connections are repeatedly opened and closed across rapid cluster lifecycles.

  • Cached Environments and Init Scripts: After the first run, the cluster environment may be cached and a previously downloaded wheel may still exist, but the network connectivity pip needs to resolve the wheel’s dependencies is not re-established. Note what the log above shows: pip processes your local wheel successfully, then fails while trying to fetch its build==1.2.1 dependency from PyPI (/simple/build/).

  • Library/Dependency Handling in DLT: DLT serverless clusters install dependencies anew on each run. Because these environments are not persistent, any custom setup (including .whl files not hosted on DBFS/S3/ADLS, or direct pip installs against PyPI) can hit transient network-access issues or install race conditions. The official guidance is to pre-install libraries via workspace library management and DBFS.
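One quick way to tell a missing-egress problem apart from a pip-specific one is a lightweight TCP probe at the top of the notebook, before any install runs. This is a generic sketch, not a Databricks API; the host and port you probe (e.g. pypi.org:443) are up to you:

```python
import socket

def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # Errno 101 ("Network is unreachable") surfaces here as OSError,
        # so an egress-less environment returns False instead of hanging.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Logging `can_reach("pypi.org")` on each run would show whether the second run really starts without outbound connectivity, which is useful evidence for a support ticket.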

What You Can Try

  • Cluster Pool and Egress Policy: Review your network egress settings for serverless and ensure that the necessary outbound connections (to PyPI, your artifact server, etc.) are not restricted or rate-limited. On Azure or AWS, verify that egress policies allow repeated rapid outbound traffic, and consider working with your cloud admin to allowlist the PyPI endpoints.

  • Staggered Runs: Avoid running pipelines back-to-back in very quick succession. Allow for the managed cluster pool to recycle and fully reinitialize. This can avoid network stack exhaustion.

  • Use Workspace Libraries/DBFS for Wheels: Store your .whl files on DBFS (Databricks File System) and reference them as workspace libraries at the start of your pipeline, rather than running pip installs inside notebook code on every run. This may reduce the dependency on live network connectivity at run time.

  • Init Script Management: Double-check any cluster init scripts and library-install commands for idempotency and robustness. A misconfigured script can fail intermittently after the first run.

  • Contact Databricks Support: If the issue persists, document the timing and frequency of the failures and contact Databricks support, since there are acknowledged issues with network reliability for serverless pipeline compute in edge cases.
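Several of the points above (hosting the wheel yourself, keeping installs idempotent, reducing the dependence on PyPI) can be combined in notebook code. The sketch below is a defensive pattern, not the official DLT mechanism; the wheel path is hypothetical, and it assumes any transitive dependencies are staged as wheels alongside it:

```python
import importlib.util
import subprocess
import sys

def ensure_installed(module_name: str, wheel_path: str) -> None:
    """Install a local wheel only if its module is not already importable."""
    # Skip pip entirely on a warm rerun: no network traffic at all.
    if importlib.util.find_spec(module_name) is not None:
        return
    # --no-index stops pip from consulting PyPI, and --no-deps stops it from
    # resolving transitive requirements (like the build==1.2.1 lookup in the
    # error above). Dependencies must then be installed the same way from
    # wheels staged next to this one.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--no-index", "--no-deps", wheel_path,
    ])
```

With this pattern the failing network call simply never happens on reruns, which sidesteps the recycling window entirely.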

Key Best Practices

  • Prefer using DBFS/S3 to host custom wheels and install from those URIs, not from workspace or ephemeral paths.

  • Avoid installing packages with pip in user code on serverless clusters; leverage cluster-level library configuration whenever possible.

  • Ensure your organization’s outbound firewall does not rate-limit or temporarily block IPs due to frequent cluster recycling.
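If an install must go over the network, a bounded retry with exponential backoff around the install step can ride out the window in which egress is still being re-established after recycling. A minimal sketch; the exception type should match whatever your install step actually raises (a subprocess-based pip call raises CalledProcessError, for example):

```python
import time

def with_backoff(fn, attempts: int = 4, base_delay: float = 5.0):
    """Call fn(), retrying on OSError with exponentially growing sleeps."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            # Sleep base_delay, 2*base_delay, 4*base_delay, ... between tries.
            time.sleep(base_delay * 2 ** attempt)
```

This is a milder form of the “staggered runs” advice above, applied inside a single run instead of between runs.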

If you consistently see “Network is unreachable” after the first pipeline run, it is likely a side effect of Databricks’ serverless cluster recycling, network egress policies, or rapid-fire resource re-allocation. Such errors are not typically seen on classic clusters, which maintain more stable environments between runs.