
PySpark fails when the source zip package is built without directory entries

pen
New Contributor II

If I pass spark.submit.pyFiles a package built with zipfile using the following code:

    import zipfile, os
    def make_zip(source_dir, output_filename):
        with zipfile.ZipFile(output_filename, 'w') as zipf:
            pre_len = len(os.path.dirname(source_dir))
            count = 0
            for parent, dirnames, filenames in os.walk(source_dir):
                # Without the two commented-out lines below (the directory entry), the job fails:
                # if count == 0:
                #     zipf.write(parent, parent[pre_len:].strip(os.path.sep))
                for filename in filenames:
                    pathfile = os.path.join(parent, filename)
                    arcname = pathfile[pre_len:].strip(os.path.sep)  # path
                    zipf.write(pathfile, arcname)
                else:
                    count = 0
            print(zipf.infolist())

The job then fails with an error saying the source could not be found.

When I add the directory paths to the zip package, it runs successfully.
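For reference, a minimal sketch of the variant that writes an explicit entry for every directory, which is the behavior the post reports as working. The function name `make_zip_with_dirs` is hypothetical, and the sketch assumes `output_filename` lies outside `source_dir` so the archive does not try to include itself. It only shows how the directory entries end up in the archive; it does not explain why Spark needs them.

```python
import os
import zipfile


def make_zip_with_dirs(source_dir, output_filename):
    """Zip source_dir and write an explicit entry for every directory."""
    source_dir = os.path.abspath(source_dir)
    # Archive names are relative to the parent of source_dir,
    # so the top-level folder name is kept inside the zip.
    base = os.path.dirname(source_dir)
    with zipfile.ZipFile(output_filename, "w", zipfile.ZIP_DEFLATED) as zipf:
        for parent, dirnames, filenames in os.walk(source_dir):
            # Directory entry; zipfile stores it with a trailing "/".
            zipf.write(parent, os.path.relpath(parent, base))
            for filename in filenames:
                pathfile = os.path.join(parent, filename)
                zipf.write(pathfile, os.path.relpath(pathfile, base))
    return output_filename
```

Listing the result with `zipfile.ZipFile(output_filename).namelist()` should show the directory names (ending in `/`) alongside the file names.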

2 REPLIES

Debayan
Esteemed Contributor III

Hi @pen poon, could you please refer to https://docs.databricks.com/external-data/zip-files.html and let us know if this helps?

Hubert-Dudek
Esteemed Contributor III

I checked, and your code is OK. When you set source_dir and output_filename, please remember to start each path with /dbfs.

If you work on the Community Edition, you may run into problems accessing the underlying filesystem.
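The /dbfs advice can be sketched as a small path helper. The function `to_local_dbfs_path` is hypothetical, not part of any Databricks API. It assumes DBFS is exposed to local file APIs such as zipfile and os.walk under the /dbfs mount, which may not be available on the Community Edition.

```python
def to_local_dbfs_path(path):
    """Map 'dbfs:/x' or '/x' to '/dbfs/x' for use with local file APIs."""
    # Drop the URI scheme if present: 'dbfs:/tmp/a' -> '/tmp/a'.
    if path.startswith("dbfs:/"):
        path = path[len("dbfs:"):]
    # Leave paths that already point at the /dbfs mount untouched.
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return path
    return "/dbfs/" + path.lstrip("/")
```

Both arguments of make_zip would go through the helper, for example `make_zip(to_local_dbfs_path(src), to_local_dbfs_path(dst))`, where `src` and `dst` are placeholder names.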
