Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trying to connect SFTP directly in Databricks

Shetty_1338
New Contributor

Hi,
As a proof of concept, I have created an ADLS storage account with SFTP enabled, and created a local user and an SSH private key.

Now I am trying to use this SFTP connection directly in Databricks to create a table or DataFrame.

Below is the code snippet I have used and the Maven library:

com.springml:spark-sftp_2.11:1.1.5
 
 
code 1: [screenshot: Shetty_1338_0-1743625413680.png]

code 2: [screenshot: Shetty_1338_1-1743625438294.png]

I am getting the below error for both syntaxes. I am not sure where the code or values need to be changed.

[error screenshot: Shetty_1338_2-1743625482972.png]

 


Can anyone help me resolve this?
1 REPLY

BigRoux
Databricks Employee

Based on the error message and your setup, there are several potential issues with your SFTP connection in Databricks. Let's address them:

Connection Issues Analysis
The error message indicates a connection problem to your SFTP server. This could be due to:
1. Network connectivity issues between Databricks and your SFTP server
2. Incorrect credentials or configuration parameters
3. Library compatibility problems with your Databricks runtime

Solutions to Try

1. Check Library Compatibility
Ensure you're using a compatible version of the spark-sftp library for your Databricks runtime:

```scala
// For Spark 3.x with Scala 2.12 (common in newer Databricks runtimes)
com.springml:spark-sftp_2.12:1.1.5
```

If you're using an older Databricks runtime with Spark 2.x:

```scala
com.springml:spark-sftp_2.11:1.1.5
```

2. Fix Connection Parameters
Try this revised code with explicit parameters:

```scala
val df = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "your-sftp-host.com")
  .option("port", "22")                       // Default SFTP port
  .option("username", "your-username")
  .option("pem", "/path/to/your/private/key") // Or use the "password" option
  .option("fileType", "csv")                  // Specify your file type
  .option("delimiter", ",")                   // For CSV files
  .option("header", "true")                   // If your file has headers
  .load("/remote/path/to/your/file.csv")
```

3. Network Configuration
Make sure Databricks can reach your SFTP server:
1. Check if your SFTP server allows connections from Databricks IP ranges
2. Verify the port (typically 22) is open for SFTP connections
3. Run these diagnostic commands in a Databricks notebook to check connectivity:

```shell
%sh
# Check DNS resolution
dig +short your-sftp-host.com

# Check port connectivity
nc -vz your-sftp-host.com 22
```

4. Alternative Approach Using Paramiko
If the spark-sftp connector continues to cause issues, try using the Paramiko library:

```python
%pip install paramiko

import paramiko

# Set up connection parameters
hostname = "your-sftp-host.com"
username = "your-username"
key_path = "/dbfs/path/to/your/private/key"  # Or use password authentication

# Create SSH client
ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Connect with private key
key = paramiko.RSAKey.from_private_key_file(key_path)
ssh_client.connect(hostname=hostname, username=username, pkey=key)

# Create SFTP session
sftp_client = ssh_client.open_sftp()

# Download file to driver-local storage
local_path = "/tmp/downloaded_file.csv"
remote_path = "/remote/path/to/file.csv"
sftp_client.get(remote_path, local_path)

# Close connections
sftp_client.close()
ssh_client.close()

# Read the downloaded file into a DataFrame; the "file:" prefix tells
# Spark to read from the driver's local filesystem rather than DBFS
df = spark.read.csv(f"file:{local_path}", header=True)
```

5. Check for Scala Version Mismatch
The error `java.lang.NoClassDefFoundError: scala/Product$class` suggests a Scala version mismatch. Make sure the library version matches your Databricks runtime's Scala version (likely 2.12 for newer runtimes).

Remember to whitelist the Databricks IP ranges on your SFTP server, not just your local machine's IP, as the actual connection will be made from Databricks compute nodes.
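To avoid the mismatch described in step 5, one option is to derive the artifact coordinate from the runtime's Scala version instead of hard-coding it. A minimal sketch (the helper name `sftp_artifact_for` is just for illustration, not part of any library):

```python
def sftp_artifact_for(scala_version: str) -> str:
    """Map a full Scala version string (e.g. "2.12.15") to the
    matching spark-sftp Maven coordinate."""
    # The Maven artifact is suffixed with the Scala *binary* version,
    # i.e. the first two components of the full version string.
    binary_version = ".".join(scala_version.split(".")[:2])
    return f"com.springml:spark-sftp_{binary_version}:1.1.5"

# In a Databricks Python notebook, the runtime's Scala version can be
# read through the JVM gateway, e.g.:
#   scala_ver = spark.sparkContext._jvm.scala.util.Properties.versionNumberString()
#   print(sftp_artifact_for(scala_ver))
print(sftp_artifact_for("2.12.15"))  # com.springml:spark-sftp_2.12:1.1.5
```

Install whichever coordinate this yields as the cluster library, rather than guessing between the `_2.11` and `_2.12` builds.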