Hi everybody,
I'm trying to set up a connection between Databricks notebooks and an external PostgreSQL database through an SSH tunnel. On a single-node cluster this works perfectly fine. However, when the same code is run on a multi-node cluster, it fails. I assume this is because the localhost reference in the JDBC URL is resolved by the executors on their own nodes, while the tunnel only exists on the driver. Substituting the driver's hostname into that part of the URL with something like socket.gethostname() does not work either, as was explained here: https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh...
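For reference, the hostname-based variant I tried looks roughly like this (the database name is just a placeholder; the point is that the URL embeds the driver's hostname, which the executors apparently still can't reach the tunnel through):

```python
import socket

# Placeholder database name for illustration
database = "testdb"

# socket.gethostname() resolves to the driver node's hostname,
# but the tunnel's local bind is not reachable from the executors
driver_host = socket.gethostname()
jdbc_url = f"jdbc:postgresql://{driver_host}:5432/{database}"
print(jdbc_url)
```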
Does anyone know how to make this work on multi-node clusters as well?
import paramiko
from sshtunnel import SSHTunnelForwarder

# Set up SSH tunnel from the driver to the PostgreSQL host
with SSHTunnelForwarder(
    ("data.host", 22),
    ssh_username="databricks",
    ssh_pkey=paramiko.RSAKey.from_private_key(private_key),
    remote_bind_address=("127.0.0.1", 5432),
    local_bind_address=("localhost", 5432)
) as tunnel:
    print(tunnel.is_active)
    jdbc_url = f"jdbc:postgresql://localhost:{tunnel.local_bind_port}/{database}"
    print(jdbc_url)
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("user", username) \
        .option("password", password) \
        .option("driver", driver) \
        .option("dbtable", "(SELECT * FROM schema_test.table_test) AS tmp") \
        .load()
    display(df)