Trying to connect SFTP directly in Databricks
3 weeks ago
Hi,
As a proof of concept, I have created an ADLS storage account, enabled SFTP, and created a local user with an SSH private key.
Now I am trying to connect to this SFTP endpoint directly from Databricks to create a table or DataFrame.
Below is the code snippet I have used and the Maven library -
I am getting the error below for both syntaxes. I am not sure where the code or values need to be changed.
Can anyone help me resolve this?
Labels: Spark
2 weeks ago
```scala
// For Spark 3.x with Scala 2.12 (common in newer Databricks runtimes)
com.springml:spark-sftp_2.12:1.1.5
```
```scala
// For Spark 2.x with Scala 2.11 (older Databricks runtimes)
com.springml:spark-sftp_2.11:1.1.5
```
```scala
val df = spark.read
.format("com.springml.spark.sftp")
.option("host", "your-sftp-host.com")
.option("port", "22") // Default SFTP port
.option("username", "your-username")
.option("pem", "/path/to/your/private/key") // Or use password option
.option("fileType", "csv") // Specify your file type
.option("delimiter", ",") // For CSV files
.option("header", "true") // If your file has headers
.load("/remote/path/to/your/file.csv")
```
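Since the end goal was a table and not just a DataFrame, once the read above succeeds you can persist the result. A minimal sketch, assuming a placeholder table name:

```scala
// Persist the DataFrame loaded over SFTP as a managed table.
// "sftp_poc_table" is a placeholder; use your own schema/table name.
df.write
  .mode("overwrite")
  .saveAsTable("sftp_poc_table")
```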
```python
%sh
# Check DNS resolution
dig +short your-sftp-host.com
# Check port connectivity
nc -vz your-sftp-host.com 22
```
```python
# If paramiko is not already available on the cluster, install it first by
# running `%pip install paramiko` in its own notebook cell.
import paramiko
import os
# Set up connection parameters
hostname = "your-sftp-host.com"
username = "your-username"
key_path = "/dbfs/path/to/your/private/key" # Or use password authentication
# Create SFTP client
ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# Connect with private key
key = paramiko.RSAKey.from_private_key_file(key_path)
ssh_client.connect(hostname=hostname, username=username, pkey=key)
# Create SFTP session
sftp_client = ssh_client.open_sftp()
# Download the file to local storage on the driver node
local_path = "/tmp/downloaded_file.csv"
remote_path = "/remote/path/to/file.csv"
sftp_client.get(remote_path, local_path)

# Close connection
sftp_client.close()
ssh_client.close()

# Copy the driver-local file to DBFS so all worker nodes can read it,
# then load it into a DataFrame
dbfs_path = "dbfs:/tmp/downloaded_file.csv"
dbutils.fs.cp("file:" + local_path, dbfs_path)
df = spark.read.csv(dbfs_path, header=True)
```
Based on the error message and your setup, there are several potential issues with your SFTP connection in Databricks. Let's address them:
Connection Issues Analysis
The error message indicates a connection problem to your SFTP server. This could be due to:
1. Network connectivity issues between Databricks and your SFTP server
2. Incorrect credentials or configuration parameters
3. Library compatibility problems with your Databricks runtime
Solutions to Try
1. Check Library Compatibility
Ensure you're using a compatible version of the spark-sftp library for your Databricks runtime (attach the Maven coordinate to the cluster via Libraries > Install new > Maven). Use the Scala 2.12 coordinate shown above for Spark 3.x runtimes, or the Scala 2.11 coordinate if you're on an older runtime with Spark 2.x.
2. Fix Connection Parameters
Try the revised read shown above, with the host, port, username, key, and file-type options set explicitly.
3. Network Configuration
Make sure Databricks can reach your SFTP server:
1. Check if your SFTP server allows connections from Databricks IP ranges
2. Verify the port (typically 22) is open for SFTP connections
3. Run the diagnostic commands shown above in a notebook cell to check DNS resolution and port connectivity
4. Alternative Approach Using Paramiko
If the spark-sftp connector continues to cause issues, try the Paramiko-based approach shown above: download the file to the driver over SFTP, then read it into a DataFrame with Spark.
5. Check for Scala Version Mismatch
The error `java.lang.NoClassDefFoundError: scala/Product$class` suggests a Scala version mismatch. Make sure the library version matches your Databricks runtime's Scala version (likely 2.12 for newer runtimes).
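If you're not sure which versions your cluster runs, you can print them from a Scala notebook cell and pick the matching artifact (a quick check, nothing spark-sftp specific):

```scala
// Print the cluster's Scala and Spark versions so you can choose
// the matching spark-sftp artifact (_2.11 vs _2.12).
println(s"Scala: ${scala.util.Properties.versionNumberString}")
println(s"Spark: ${spark.version}")
```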
Remember to whitelist the Databricks IP ranges on your SFTP server, not just your local machine's IP, as the actual connection will be made from Databricks compute nodes.
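If you don't know which public IP the cluster egresses from, one rough way to check from the driver (this assumes the cluster has outbound internet access; ifconfig.me is just an example echo service, and workers may egress differently if they are not behind the same NAT) is:

```scala
// Print the public IP the driver node uses for outbound traffic.
// ifconfig.me is an example external "what is my IP" service.
val egressIp = scala.io.Source.fromURL("https://ifconfig.me/ip").mkString.trim
println(s"Driver egress IP: $egressIp")
```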

