Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

SFTP Autoloader

erigaud
Honored Contributor

Hello, 

I don't know if it is possible, but can I ingest files from an SFTP server using Auto Loader? Or do I have to first copy the files to DBFS and then use Auto Loader on that location?

Thank you!


6 REPLIES

BriceBuso
Contributor II

Hello Etienne, 

I think it's not possible. Auto Loader is used to process new files from cloud storage, as stated in the documentation:

(attached screenshot: BriceBuso_0-1689604622879.png)

Moreover, to respect the medallion architecture, it's better to first ingest the files into DBFS and then use Auto Loader on that location; a minimal sketch follows.
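For illustration only, an Auto Loader stream over a hypothetical DBFS landing folder could look roughly like this (all paths and table names are placeholders, and spark is the SparkSession a Databricks notebook provides):

# Auto Loader sketch: pick up new CSV files landed under a hypothetical DBFS folder.
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")                            # format of the landed files
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas")  # where Auto Loader stores the inferred schema
    .load("/mnt/landing/sftp_files/"))

(stream.writeStream
    .option("checkpointLocation", "/mnt/landing/_checkpoints/sftp_files")  # stream progress tracking
    .toTable("bronze.sftp_files"))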

 

erigaud
Honored Contributor

Thank you @BriceBuso!

So what is the recommended best practice for ingesting data from SFTP data sources?

Hey,

There are two possible ways.

The first is to use this Spark library to connect to the SFTP server: https://github.com/springml/spark-sftp
You connect to the SFTP server and load the data into your storage; a rough sketch is shown below.
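For illustration, reading a file straight from SFTP with that connector could look roughly like the snippet below (the library jar has to be installed on the cluster, and the host, credentials, secret scope and paths are all placeholders):

# spark-sftp sketch: pull one file from the SFTP server into a DataFrame.
df = (spark.read
    .format("com.springml.spark.sftp")
    .option("host", "sftp.example.com")                                    # placeholder host
    .option("username", "my_user")                                         # placeholder user
    .option("password", dbutils.secrets.get("my-scope", "sftp-password"))  # hypothetical secret scope/key
    .option("fileType", "csv")
    .option("inferSchema", "true")
    .load("/upload/sample.csv"))

# Land the data in your own storage (placeholder path) for downstream processing.
df.write.format("delta").mode("append").save("/mnt/bronze/sftp_sample")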

The second is to use Azure Data Factory (if you are on Azure; I don't know the equivalent in other cloud providers) to ingest the data into a blob container mounted in your Databricks workspace. When your ADF pipeline finishes, it triggers your Databricks pipeline; the Databricks side is sketched below.
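As one possible shape for that Databricks pipeline, the notebook triggered by ADF could run Auto Loader as a one-shot incremental batch over the landing location. A rough sketch, assuming a runtime recent enough to support the availableNow trigger and with placeholder paths and table names:

# Sketch of a notebook an ADF pipeline could call once the copy into blob storage has finished.
# availableNow=True processes everything that arrived since the last run, then stops.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/adf_landing/_schemas")  # placeholder
    .load("/mnt/adf_landing/incoming/")                                # mounted blob container ADF copies into
    .writeStream
    .option("checkpointLocation", "/mnt/adf_landing/_checkpoints")     # placeholder
    .trigger(availableNow=True)
    .toTable("bronze.adf_ingested"))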

Cheers 🙂

 

Anonymous
Not applicable

Hi @erigaud 

We haven't heard from you since the last response from @BriceBuso, and I was checking back to see if his suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

erigaud
Honored Contributor

His suggestion helped; I accepted it as the answer. Thank you!

 

There is one more option: on Azure you can enable SFTP on Blob Storage, so the storage account itself acts as the SFTP server. On the other end, you can mount or expose that storage in Databricks as a volume and then use Auto Loader, file arrival triggers, etc. on it; a sketch is shown below.
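For illustration, once the SFTP-enabled container is exposed to Databricks, Auto Loader can be pointed straight at that path. The volume, catalog and schema names below are hypothetical, and the write side would be the same as in the earlier sketches:

# Auto Loader over a hypothetical Unity Catalog volume backed by the SFTP-enabled container.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/main/landing/sftp_drop/_schemas")
    .load("/Volumes/main/landing/sftp_drop/incoming/"))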
