PySpark fails when I pack the source zip package without directory entries.
10-17-2022 03:30 AM
I pass a package to spark.submit.pyFiles that was built with zipfile using the code below.
    import zipfile, os

    def make_zip(source_dir, output_filename):
        with zipfile.ZipFile(output_filename, 'w') as zipf:
            pre_len = len(os.path.dirname(source_dir))
            count = 0
            for parent, dirnames, filenames in os.walk(source_dir):
                # Skipping the next two lines (the directory entry) causes the error.
                # if count == 0:
                #     zipf.write(parent, parent[pre_len:].strip(os.path.sep))
                for filename in filenames:
                    pathfile = os.path.join(parent, filename)
                    arcname = pathfile[pre_len:].strip(os.path.sep)  # path inside the zip
                    zipf.write(pathfile, arcname)
                else:
                    count = 0
            print(zipf.infolist())

It fails with a could not find source error.
When I add the directory path to the zip package, it runs successfully.
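For reference, here is a minimal sketch of the variant that includes the directory entries, which is the change described above as making the job run. The function name `make_zip_with_dirs` is hypothetical, and the sketch assumes the output file is written outside `source_dir`. It only shows how the entries get into the archive; it does not explain why Spark needs them.

```python
import os
import zipfile


def make_zip_with_dirs(source_dir, output_filename):
    """Zip source_dir and write an explicit entry for every directory."""
    source_dir = os.path.abspath(source_dir)
    base = os.path.dirname(source_dir)  # arcnames are relative to the parent
    with zipfile.ZipFile(output_filename, "w", zipfile.ZIP_DEFLATED) as zipf:
        for parent, dirnames, filenames in os.walk(source_dir):
            # Directory entry; zipfile stores it with a trailing "/".
            zipf.write(parent, os.path.relpath(parent, base))
            for filename in filenames:
                pathfile = os.path.join(parent, filename)
                zipf.write(pathfile, os.path.relpath(pathfile, base))
        return zipf.namelist()
```

The returned name list can be printed to confirm that entries such as the top-level package directory are present before the archive is passed to spark.submit.pyFiles.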
Labels: Pyspark, Source Error
10-17-2022 11:37 PM
Hi @pen poon, could you please refer to https://docs.databricks.com/external-data/zip-files.html and let us know if this helps?
10-20-2022 10:15 AM
I checked, and your code is OK. When you set source_dir and output_filename, remember to start each path with /dbfs.
If you work on the Community Edition, you may run into problems accessing the underlying filesystem.
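A small sketch of the /dbfs prefix advice, in case it helps. The helper name `to_local_dbfs_path` is hypothetical, and the sketch assumes that a path given as `dbfs:/...` or as a bare absolute path is meant to live on DBFS, so it is mapped to the `/dbfs/...` form that local file APIs such as zipfile and os.walk can open.

```python
import posixpath


def to_local_dbfs_path(path):
    """Map a DBFS-style path to its /dbfs/... form for local file APIs."""
    if path == "/dbfs" or path.startswith("/dbfs/"):
        # Already in the local /dbfs form; only normalize it.
        return posixpath.normpath(path)
    if path.startswith("dbfs:"):
        # Drop the URI scheme, e.g. "dbfs:/tmp/src" -> "/tmp/src".
        path = path[len("dbfs:"):]
    # Assumption: any remaining path is meant to be relative to the DBFS root.
    return posixpath.normpath("/dbfs/" + path.lstrip("/"))
```

Both arguments of make_zip could be passed through this helper before the call, for example a source directory given as dbfs:/tmp/src and an output file given as dbfs:/tmp/out.zip.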
My blog: https://databrickster.medium.com/