Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

JDBC / SSH-tunnel to connect to PostgreSQL not working on multi-node clusters

laurenskuiper97
New Contributor

Hi everybody,

I'm trying to set up a connection between Databricks notebooks and an external PostgreSQL database through an SSH tunnel. On a single-node cluster this works perfectly fine, but on a multi-node cluster the code below fails. I assume it is because of the localhost reference in the JDBC URL, which the executors on a multi-node cluster resolve to themselves rather than to the driver. Replacing that part of the URL with the driver's hostname via something like socket.gethostname() does not work either, as explained here: https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh... 

Does anyone know how to make this work on multi-node clusters as well?

 

# Set up SSH tunnel from the driver to the PostgreSQL host
from sshtunnel import SSHTunnelForwarder
import paramiko

with SSHTunnelForwarder(
    ("data.host", 22),
    ssh_username="databricks",
    ssh_pkey=paramiko.RSAKey.from_private_key(private_key),
    # PostgreSQL listens on localhost:5432 on the remote host
    remote_bind_address=("127.0.0.1", 5432),
    # the tunnel listens on localhost:5432 on the Spark driver
    local_bind_address=("localhost", 5432)
) as tunnel:
    print(tunnel.is_active)
    # "localhost" here only resolves to the tunnel on the driver node
    jdbc_url = f"jdbc:postgresql://localhost:{tunnel.local_bind_port}/{database}"
    print(jdbc_url)
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("user", username) \
        .option("password", password) \
        .option("driver", driver) \
        .option("dbtable", "(SELECT * FROM schema_test.table_test) AS tmp") \
        .load()
    display(df)

 

1 REPLY

-werners-
Esteemed Contributor III

I doubt it is possible.
The driver runs the program and sends tasks to the executors. But since creating the SSH tunnel is not a Spark task, I don't think it will be established on any executor.
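
If the table is small enough to be read on the driver alone, one possible workaround is to pull the data through the tunnel with a plain Python client on the driver and only then hand it to Spark. This is a minimal sketch, assuming psycopg2, sshtunnel, paramiko and pandas are installed on the cluster, that private_key, database, username and password are defined as in your snippet, and that the result fits in driver memory:

# Sketch: read through the tunnel on the driver only, then convert to a Spark DataFrame.
import paramiko
import pandas as pd
import psycopg2
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("data.host", 22),
    ssh_username="databricks",
    ssh_pkey=paramiko.RSAKey.from_private_key(private_key),
    remote_bind_address=("127.0.0.1", 5432),
) as tunnel:
    # The connection is opened on the driver, so localhost is valid here
    conn = psycopg2.connect(
        host="127.0.0.1",
        port=tunnel.local_bind_port,
        dbname=database,
        user=username,
        password=password,
    )
    pdf = pd.read_sql("SELECT * FROM schema_test.table_test", conn)
    conn.close()

# pdf is already materialized in driver memory, so the tunnel can be closed
df = spark.createDataFrame(pdf)
display(df)

This keeps all database traffic on the driver, consistent with the explanation above, at the cost of not parallelising the read across executors.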
