Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trying to connect SFTP directly in Databricks

Shetty_1338
New Contributor

Hi,
As a proof of concept, I have created an ADLS storage account with SFTP enabled, and created a local user and an SSH private key.

Now I am trying to use this SFTP connection directly in Databricks to create a table or DataFrame.

Below is the code snippet I have used and the Maven library:

com.springml:spark-sftp_2.11:1.1.5
 
 
code 1: [screenshot: Shetty_1338_0-1743625413680.png]

code 2: [screenshot: Shetty_1338_1-1743625438294.png]

I am getting the below error for both syntaxes. I am not sure where the code or values need to be changed.

[error screenshot: Shetty_1338_2-1743625482972.png]

 


Can anyone help me resolve this?
1 REPLY

BigRoux
Databricks Employee

Based on the error message and your setup, there are several potential issues with your SFTP connection in Databricks. Let's address them:

Connection Issues Analysis
The error message indicates a connection problem to your SFTP server. This could be due to:
1. Network connectivity issues between Databricks and your SFTP server
2. Incorrect credentials or configuration parameters
3. Library compatibility problems with your Databricks runtime

Solutions to Try

1. Check Library Compatibility
Ensure you're using a compatible version of the spark-sftp library for your Databricks runtime:

```scala
// For Spark 3.x with Scala 2.12 (common in newer Databricks runtimes)
com.springml:spark-sftp_2.12:1.1.5
```

If you're using an older Databricks runtime with Spark 2.x:

```scala
com.springml:spark-sftp_2.11:1.1.5
```

2. Fix Connection Parameters
Try this revised code with explicit parameters:

```scala
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "your-sftp-host.com")
  .option("port", "22")                       // Default SFTP port
  .option("username", "your-username")
  .option("pem", "/path/to/your/private/key") // Or use the "password" option
  .option("fileType", "csv")                  // Specify your file type
  .option("delimiter", ",")                   // For CSV files
  .option("header", "true")                   // If your file has headers
  .load("/remote/path/to/your/file.csv")
```

3. Network Configuration
Make sure Databricks can reach your SFTP server:
1. Check if your SFTP server allows connections from Databricks IP ranges
2. Verify the port (typically 22) is open for SFTP connections
3. Run these diagnostic commands in a Databricks notebook to check connectivity:

```shell
%sh
# Check DNS resolution
dig +short your-sftp-host.com

# Check port connectivity
nc -vz your-sftp-host.com 22
```

4. Alternative Approach Using Paramiko
If the spark-sftp connector continues to cause issues, try using the Paramiko library:

```python
%pip install paramiko

import paramiko

# Set up connection parameters
hostname = "your-sftp-host.com"
username = "your-username"
key_path = "/dbfs/path/to/your/private/key"  # Or use password authentication

# Create SSH client
ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Connect with private key
key = paramiko.RSAKey.from_private_key_file(key_path)
ssh_client.connect(hostname=hostname, username=username, pkey=key)

# Create SFTP session
sftp_client = ssh_client.open_sftp()

# Download file to driver-local storage
local_path = "/tmp/downloaded_file.csv"
remote_path = "/remote/path/to/file.csv"
sftp_client.get(remote_path, local_path)

# Close connections
sftp_client.close()
ssh_client.close()

# Read the downloaded file into a DataFrame; the "file:" prefix tells
# Spark to read from the driver's local filesystem rather than DBFS
df = spark.read.csv(f"file:{local_path}", header=True)
```

5. Check for Scala Version Mismatch
The error `java.lang.NoClassDefFoundError: scala/Product$class` suggests a Scala version mismatch. Make sure the library version matches your Databricks runtime's Scala version (likely 2.12 for newer runtimes).

Remember to whitelist the Databricks IP ranges on your SFTP server, not just your local machine's IP, as the actual connection will be made from Databricks compute nodes.
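To avoid the mismatch described in step 5, one option is to derive the artifact coordinate from the runtime's Scala version instead of hard-coding it. A minimal sketch (the helper name `sftp_artifact_for` is just for illustration, not part of any library):

```python
def sftp_artifact_for(scala_version: str) -> str:
    """Map a full Scala version string (e.g. "2.12.15") to the
    matching spark-sftp Maven coordinate."""
    # The Maven artifact is suffixed with the Scala *binary* version,
    # i.e. the first two components of the full version string.
    binary_version = ".".join(scala_version.split(".")[:2])
    return f"com.springml:spark-sftp_{binary_version}:1.1.5"

# In a Databricks Python notebook, the runtime's Scala version can be
# read through the JVM gateway, e.g.:
#   scala_ver = spark.sparkContext._jvm.scala.util.Properties.versionNumberString()
#   print(sftp_artifact_for(scala_ver))
print(sftp_artifact_for("2.12.15"))  # com.springml:spark-sftp_2.12:1.1.5
```

Install whichever coordinate this yields as the cluster library, rather than guessing between the `_2.11` and `_2.12` builds.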