PySpark fails when the source zip package is built without directory entries.

pen
New Contributor II

I pass a package to spark.submit.pyFiles that was zipped with zipfile using the code below.

    import os
    import zipfile


    def make_zip(source_dir, output_filename):
        """Zip source_dir so archive names start at its top-level folder."""
        with zipfile.ZipFile(output_filename, 'w') as zipf:
            # Length of the parent path, stripped from every archive name.
            pre_len = len(os.path.dirname(source_dir))
            for parent, dirnames, filenames in os.walk(source_dir):
                # Leaving out this directory entry triggers the error:
                # zipf.write(parent, parent[pre_len:].strip(os.path.sep))
                for filename in filenames:
                    pathfile = os.path.join(parent, filename)
                    arcname = pathfile[pre_len:].strip(os.path.sep)  # path inside the archive
                    zipf.write(pathfile, arcname)
            print(zipf.infolist())

With this package, the job fails with an error saying the source could not be found.

When I add the directory paths to the zip package as well, it runs successfully.
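For reference, here is a minimal sketch of the variant that works for me, with an explicit entry written for every directory. The name make_zip_with_dirs is only illustrative, and I am assuming source_dir is the top-level package folder.

```python
import os
import zipfile


def make_zip_with_dirs(source_dir, output_filename):
    """Zip source_dir, writing an explicit entry for every directory."""
    source_dir = os.path.abspath(source_dir)
    # Archive names are made relative to the parent of source_dir,
    # so they start with the package folder itself.
    base = os.path.dirname(source_dir)
    with zipfile.ZipFile(output_filename, "w") as zipf:
        for parent, dirnames, filenames in os.walk(source_dir):
            # ZipFile.write() on a directory stores an entry ending in "/".
            zipf.write(parent, os.path.relpath(parent, base))
            for filename in filenames:
                pathfile = os.path.join(parent, filename)
                zipf.write(pathfile, os.path.relpath(pathfile, base))
        return zipf.namelist()
```

The returned name list makes it easy to check that entries such as the package folder itself (with a trailing slash) are present before submitting the archive.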

Debayan
Databricks Employee

Hi @pen poon, could you please refer to https://docs.databricks.com/external-data/zip-files.html and let us know if this helps?

Hubert-Dudek
Databricks MVP

I checked, and your code is OK. When you set source_dir and output_filename, please remember to start each path with /dbfs.

If you work on the Community Edition, you may have problems accessing the underlying filesystem.
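As a rough illustration of the /dbfs prefix mentioned above, the sketch below turns a dbfs:/ URI into the local path that plain Python file APIs such as zipfile expect. The helper name to_local_dbfs_path and the example paths are hypothetical, and it assumes DBFS is mounted at /dbfs on the driver.

```python
def to_local_dbfs_path(path):
    """Map 'dbfs:/some/path' to '/dbfs/some/path'; leave other paths unchanged."""
    prefix = "dbfs:/"
    if path.startswith(prefix):
        # Assumption: DBFS is exposed on the local filesystem under /dbfs.
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


# Hypothetical example paths, not taken from the question.
source_dir = to_local_dbfs_path("dbfs:/tmp/my_package")
output_filename = to_local_dbfs_path("dbfs:/tmp/my_package.zip")
```

The converted values can then be passed to make_zip as source_dir and output_filename.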


My blog: https://databrickster.medium.com/