
Init Script to Install ODBC Driver Causes Cluster Crash (JVM Thread Dump)

robertayoung520
New Contributor

Hello Databricks Community,

I am facing a critical issue where our cluster fails to start when using an init script designed to install the Databricks ODBC driver. I'm hoping someone can shed light on what might be happening within our specific environment.

The Goal:

My objective is to connect from a Python notebook on an all-purpose compute cluster in Workspace A to a remote Databricks SQL Warehouse in Workspace B.

  • Source: Python notebook on an All-Purpose Cluster (Databricks on Azure).
  • Target: Remote Databricks SQL Warehouse (adb-....azuredatabricks.net).

Summary of My Debugging Journey:

  1. databricks-sql-connector Fails: My first attempt, using the standard databricks-sql-connector library, hangs and eventually times out, suggesting a network firewall is blocking the connection.

  2. pyodbc Fails (Initially): I then switched to the pyodbc library, on the assumption that it would be more robust in complex network environments. This failed with the error: [unixODBC][Driver Manager]Can't open lib 'Databricks' : file not found. This indicated that the driver manager tools or the driver itself were not correctly installed or registered on the cluster OS.

  3. Init Script Causes Cluster Failure: To solve the driver issue, I created an all-purpose compute cluster and attached the following init script, which should install all necessary components:

    #!/bin/bash
    set -e
    sudo apt-get update
    sudo apt-get install -y unixodbc unixodbc-dev
    sudo dpkg -i "/Workspace/Users/my.user@email.com/path/to/databricksodbc_amd64.deb"
    sudo apt-get install -f
     

    When I start the cluster with this init script, the cluster startup fails completely. The UI shows the error: Init script failure: ... failed: Script exit status is non-zero.

The Critical Finding (The Main Problem):

When I inspect the logs for the failed init script, the stderr log file does not contain any shell errors (like "permission denied" or "command not found").

Instead, the stderr log contains only a full Java Thread Dump, starting with lines like: "LDBasedSafeFlagClient" id=17 state=WAITING... "process reaper" id=15 state=RUNNABLE...

This seems to indicate that the apt-get or dpkg commands are so incompatible with our environment that they are causing a fatal crash of the core Databricks JVM services on the cluster node.

My Specific Questions:

  1. Has anyone ever seen this behavior where an init script using apt-get or dpkg causes a hard JVM crash instead of just a shell error?
  2. Is this thread dump a known symptom of a security-hardened Databricks environment (e.g., using a custom unmodifiable OS image, or a security agent like SentinelOne/CrowdStrike) that actively terminates processes that try to modify the system?
  3. Given this evidence, is it correct to conclude that my environment is fundamentally "locked down," and that installing any system-level software via init scripts is impossible by design?

It feels like I've hit a wall where the platform's security is preventing the necessary setup for an outbound ODBC connection. Any insight would be hugely appreciated.

Thank you

2 REPLIES

sameer_yasser
New Contributor II

Hey, I've hit almost this exact wall before and that Java thread dump in stderr is a very specific symptom — let me share what I've learned.

On your question about the JVM crash behavior

Yes, this is a known (if underdocumented) failure mode. What you're seeing isn't really a "crash" in the traditional sense — it's the Databricks cluster health monitor detecting that a child process spawned during init is hanging or consuming resources unexpectedly, and it's dumping JVM state as part of its diagnostic shutdown sequence. The misleading part is that it surfaces in stderr of your init script log, making it look like your bash script caused it. In reality, the script likely triggered a process that got intercepted.

On the security hardening question

Almost certainly yes. If your workspace is deployed behind a VNet with custom DNS, NSGs, or has an EDR agent running (SentinelOne and CrowdStrike are both known to do this), apt-get and dpkg can get killed mid-execution because they attempt to modify /etc, /lib, or run post-install hooks that the security agent flags. The frustrating thing is the exit code gets swallowed or misattributed. You're not imagining it — the environment is locked down, and that's by design.

What actually works in this scenario

Rather than fighting the init script path, I'd step back and challenge the ODBC requirement entirely. If your goal is Python notebook → remote Databricks SQL Warehouse, you have options that don't require any OS-level driver installation:

  1. databricks-sql-connector timing out — this is almost certainly a network issue, not a library issue. The connector uses port 443 to the warehouse's HTTP path, same as your browser. If it's hanging, your cluster's egress is being blocked at the NSG or firewall level for that specific destination. Check whether adb-<your-target-workspace>.azuredatabricks.net on port 443 is allowed outbound from your cluster's subnet. This is fixable without touching init scripts.
  2. If you must use ODBC — instead of installing via dpkg in an init script, try pre-building a custom Docker container image with the ODBC driver baked in, and use Databricks Container Services to run your cluster from that image. This sidesteps the init script execution entirely and is generally tolerated even in hardened environments because you're not modifying the OS at runtime.
  3. Spark remote execution — depending on what you're actually doing with the SQL Warehouse, spark.read with a JDBC URL pointed at the warehouse's built-in JDBC endpoint might be the cleanest path, again with no driver install needed.
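
To make the egress check in option 1 concrete, here is a minimal sketch you could run from a notebook cell on the source cluster. The function name `can_reach` is my own, not part of any Databricks library, and it only tests whether a TCP connection opens. A success does not prove that a TLS-inspecting proxy or the warehouse itself will accept the request, so treat it as a first triage step only.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    `can_reach` is a hypothetical helper name used for illustration.
    """
    try:
        # create_connection resolves the hostname and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers DNS resolution failures, refused connections, and timeouts.
        return False
```

Calling `can_reach(...)` with the target workspace hostname (the adb-....azuredatabricks.net value from the original post) should return quickly either way. A `False` result from the cluster combined with a `True` result from a machine outside the VNet would point at the NSG or firewall rather than at the connector library.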

Bottom line

Your diagnosis is correct. The environment is locked down. But the right fix is probably unblocking the network path for databricks-sql-connector rather than trying to win a fight against the security layer with init scripts. Work with your infra/networking team to confirm outbound 443 is allowed to the target workspace hostname from your cluster subnet — that's usually the single thing standing between you and a working connection.
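
Once outbound 443 is confirmed open, a connection with databricks-sql-connector could look roughly like the sketch below. The environment variable names and the two helper functions are assumptions for illustration; in practice you would more likely pull the token from a secret scope. `sql.connect` with `server_hostname`, `http_path`, and `access_token` is the connector's documented entry point.

```python
import os


def warehouse_settings(env=os.environ) -> dict:
    """Collect connection settings from environment variables.

    The variable names below are assumptions; use whatever your team has set up.
    """
    return {
        "server_hostname": env["DATABRICKS_SERVER_HOSTNAME"],
        "http_path": env["DATABRICKS_HTTP_PATH"],
        "access_token": env["DATABRICKS_TOKEN"],
    }


def run_query(query: str, settings: dict) -> list:
    """Run one query against the remote SQL Warehouse and return all rows."""
    # Imported here so warehouse_settings() works even without the package installed.
    from databricks import sql  # pip install databricks-sql-connector

    # Both the connection and the cursor are context managers and close on exit.
    with sql.connect(**settings) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()
```

Something like `run_query("SELECT 1", warehouse_settings())` makes a cheap smoke test: if it returns a row, neither ODBC nor an init script is needed for this use case.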

Hope this helps narrow it down.

aleksandra_ch
Databricks Employee

Hi @robertayoung520,

Could you elaborate on why you need to connect to a remote DBSQL warehouse in another Databricks workspace?

If it's to query data in that other workspace, there are better ways (sorted from most to least recommended):

  • If the Databricks workspaces share the same Unity Catalog metastore, you can manage cross-workspace queries using standard Unity Catalog queries and data governance tools. Note that in Unity Catalog, all catalogs are accessible by default from any workspace attached to the same metastore.
  • If you want read-only access to data in a Databricks workspace attached to a different Unity Catalog metastore, whether in your Databricks account or not, Delta Sharing is a better choice.
  • Finally, you can use Lakehouse Federation to query data in another Databricks workspace.
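
For the first option, a cross-workspace read is just a query against Unity Catalog's three-level namespace (catalog.schema.table). A minimal sketch, assuming the shared-metastore case; the helper name and the catalog, schema, and table names are hypothetical:

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a backtick-quoted three-level Unity Catalog table name.

    A backtick inside an identifier is escaped by doubling it, as in Spark SQL.
    """
    parts = (catalog, schema, table)
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


# In a notebook, where `spark` is already defined, the read would then be
# (names are hypothetical):
# df = spark.table(qualified_name("main", "sales", "orders"))
```

No driver, init script, or outbound connection to the other workspace is involved here, as long as your cluster and user have the required Unity Catalog privileges on that table.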

Hope it helps.

Best regards,