
JDBC / SSH-tunnel to connect to PostgreSQL not working on multi-node clusters

laurenskuiper97
New Contributor

Hi everybody,

I'm trying to set up a connection between Databricks notebooks and an external PostgreSQL database through an SSH tunnel. On a single-node cluster this works perfectly fine. However, when the same code is run on a multi-node cluster, it fails. I assume this is because of the localhost reference in the JDBC URL, which the executors on a multi-node cluster resolve to themselves rather than to the driver. Substituting the driver's hostname into that part of the URL with something like socket.gethostname() does not work either, as explained here: https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh...

Does anyone know how to make this work on multi-node clusters as well?

 

# Set up SSH tunnel (private_key, database, username, password and driver
# are defined earlier in the notebook)
import paramiko
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("data.host", 22),
    ssh_username="databricks",
    ssh_pkey=paramiko.RSAKey.from_private_key(private_key),
    remote_bind_address=("127.0.0.1", 5432),
    local_bind_address=("localhost", 5432)
) as tunnel:
    print(tunnel.is_active)
    jdbc_url = f"jdbc:postgresql://localhost:{tunnel.local_bind_port}/{database}"
    print(jdbc_url)
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("user", username) \
        .option("password", password) \
        .option("driver", driver) \
        .option("dbtable", "(SELECT * FROM schema_test.table_test) AS tmp") \
        .load()
    display(df)
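
A quick way to see the problem, as a rough diagnostic sketch: JDBC read tasks run on the executors, and each executor resolves localhost to itself, where no tunnel exists. On a multi-node cluster the snippet below should print different hostnames for the driver and the executors (sc is the SparkContext that Databricks notebooks provide):

import socket

# Hostname of the driver, where the SSH tunnel was opened
print("driver:", socket.gethostname())

# Each task runs on an executor and reports that executor's hostname;
# "localhost" in the JDBC URL is resolved the same way on each executor.
hosts = sc.parallelize(range(8), 8).map(lambda _: socket.gethostname()).collect()
print("executors:", set(hosts))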

 

1 REPLY

-werners-
Esteemed Contributor III

I doubt it is possible.
The driver runs the program and sends tasks to the executors. But since creating the SSH tunnel is not a Spark task, it won't be established on any executor.
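
A possible workaround in line with this reply (a minimal sketch, not a tested solution): keep both the tunnel and the database read on the driver by fetching the rows with psycopg2, then distribute them with spark.createDataFrame. This assumes psycopg2 and sshtunnel are installed on the cluster, reuses the private_key, database, username and password variables from the question, and only suits result sets small enough to fit in driver memory.

import paramiko
import psycopg2
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("data.host", 22),
    ssh_username="databricks",
    ssh_pkey=paramiko.RSAKey.from_private_key(private_key),
    remote_bind_address=("127.0.0.1", 5432),
) as tunnel:
    # The connection is opened from the driver, where the tunnel's local end exists
    conn = psycopg2.connect(
        host="127.0.0.1",
        port=tunnel.local_bind_port,
        dbname=database,
        user=username,
        password=password,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM schema_test.table_test")
            columns = [desc[0] for desc in cur.description]
            rows = cur.fetchall()
    finally:
        conn.close()

# The rows are already local to the driver, so the executors never need the tunnel
df = spark.createDataFrame(rows, schema=columns)
display(df)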
