09-02-2022 08:20 AM
When I try to convert a notebook into a job, I frequently run into an issue with writing to the local filesystem. For this particular example, I did all my notebook testing with a bytestream for small files. When I tried to run it as a job, I switched to saving the download to disk, but I keep getting a `FileNotFoundError`. An example code snippet is below with the two methods I've tried:
# method 1
def sftp_read(sftp_object, prefix):
    key = f'{prefix}/{sftp_object}'
    if not os.path.exists('/local_disk0/tmp/'):
        os.makedirs('/local_disk0/tmp/')
    sftp.get(sftp_object, f'/local_disk0/tmp/{sftp_object}')
    # do stuff
    os.remove(f'/local_disk0/tmp/{sftp_object}')
# method 2
def sftp_read(sftp_object, prefix):
    key = f'{prefix}/{sftp_object}'
    dbutils.fs.mkdirs('file:/tmp/')
    sftp.get(sftp_object, f'file:/tmp/{sftp_object}')
    # do stuff
    dbutils.fs.rm(f'file:/tmp/{sftp_object}')
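For comparison, the in-memory (bytestream) approach that worked during notebook testing was along these lines (a simplified sketch, not the exact code; the function and variable names here are just illustrative):
import io

def sftp_read_bytes(sftp_object, prefix):
    key = f'{prefix}/{sftp_object}'
    buffer = io.BytesIO()
    sftp.getfo(sftp_object, buffer)  # stream the remote file straight into memory
    buffer.seek(0)
    # do stuff with buffer
Running method 1 as a job produces the traceback below: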
FileNotFoundError Traceback (most recent call last)
<command-3394785040378964> in <cell line: 1>()
3 if dryrun:
4 print(sftp_object)
----> 5 sftp_read(sftp_object, prefix)
<command-3394785040378909> in sftp_read(sftp_object, prefix)
57 if not os.path.exists('/local_disk0/tmp/'):
58 os.makedirs('/local_disk0/tmp/')
---> 59 sftp.get(sftp_object, f'/local_disk0/tmp/{sftp_object}')
60 # do stuff
61 os.remove(f'/local_disk0/tmp/{sftp_object}')
/local_disk0/.ephemeral_nfs/envs/pythonEnv-1e9ce7e1-d7d5-4473-b8d6-dbe59be12302/lib/python3.9/site-packages/paramiko/sftp_client.py in get(self, remotepath, localpath, callback, prefetch)
808 Added the ``prefetch`` keyword argument.
809 """
--> 810 with open(localpath, "wb") as fl:
811 size = self.getfo(remotepath, fl, callback, prefetch)
812 s = os.stat(localpath)
FileNotFoundError: [Errno 2] No such file or directory: '/local_disk0/tmp/path/to/file.ext'
I have referenced the DBFS local files documentation as well: https://docs.databricks.com/files/index.html
Any suggestions, or is there something I should know about jobs running differently from notebooks?
09-07-2022 09:10 AM
I was able to fix it. It was an issue with the nested files on the SFTP server: the remote object paths include subdirectories, so the local destination directory didn't exist yet. I had to ensure that the parent folders were being created as well. Splitting out the local path and filename made it easier to check that the directory existed with os.path.exists() and os.makedirs():
def sftp_read(sftp_object, bucket, prefix):
    key = f'{prefix}/{sftp_object}'
    local_path = '/local_disk0/tmp'
    local_file = f'{local_path}/{os.path.basename(sftp_object)}'
    if not os.path.exists(local_path):
        os.makedirs(local_path)
    sftp.get(sftp_object, local_file)
    # do stuff
    os.remove(local_file)
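(Side note: the exists/makedirs pair above can be collapsed; os.makedirs accepts an exist_ok flag, so this should behave the same:)
os.makedirs(local_path, exist_ok=True)  # creates parent directories and ignores an already-existing one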
All in all, not a Databricks issue, just an issue that surfaced on Databricks.
09-02-2022 01:25 PM
Hi, thanks for reaching out to community.databricks.com.
Could you also mention where you declared sftp_object? Also, how did you set up the connection to the SFTP server? Was it with a password or passwordless?
09-03-2022 11:40 PM
The SFTP connection is made with a password:
import paramiko

def connect_to_sftp(host: str, port: (str, int), username: str, password: str) -> paramiko.sftp_client.SFTPClient:
    sftp, transport = None, None
    try:
        transport = paramiko.Transport(host, port)
        transport.connect(username=username, password=password)
    except Exception as e:
        print(e)
        if transport is not None:
            transport.close()
    try:
        sftp = paramiko.SFTPClient.from_transport(transport)
        return sftp
    except Exception as e:
        print(e)
        if sftp is not None:
            sftp.close()
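For completeness, it gets called roughly like this (placeholder host and credentials for illustration only; real values come from elsewhere and obviously shouldn't be hard-coded):
# illustrative call with placeholder values; real credentials should come from a secret store
sftp = connect_to_sftp('sftp.example.com', 22, 'some_user', 'some_password')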
sftp_object is the name of the remote object:
for file in sftp.listdir(sftp_dir):
    sftp_object = f'{sftp_dir}/{file}'
    if dryrun:
        print(sftp_object)
    sftp_read(sftp_object, prefix)