How to upload large files to Databricks, and how to unzip them successfully?
01-31-2023 09:02 PM
I have two JSON files, one ~3 GB and one ~5 GB. I can't upload them to Databricks Community Edition because they exceed the maximum upload size (~2 GB).
If I zip them, I can upload them, but I can't figure out how to unzip the files into a readable format. Currently the import preview shows only unreadable characters.
I'm relatively new to Databricks and am just using it for a SQL certification, so I'd like to import the JSON into a queryable table.
Thanks.
01-31-2023 09:08 PM
@Sage Olson Instead of uploading to Databricks directly, you can put your data in any cloud provider's storage and then read the files from there using Databricks. It is safe.
01-31-2023 09:19 PM
Hi, you can create a notebook on a Databricks cluster and unzip the files using Linux commands in the notebook. Please refer to: https://docs.databricks.com/notebooks/notebooks-code.html
Also, when entering the command, run the notebook in Python mode and start the notebook cell with %sh, so that the cell's contents are run as shell commands and the file is unzipped.
For unzipping, you can refer to: https://docs.databricks.com/files/unzip-files.html and https://community.databricks.com/s/question/0D58Y00009az9bGSAQ/unzip-files .
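Before extracting, it can help to confirm that the uploaded file really is a zip archive and to see what it contains. Below is a minimal sketch for a Python cell that uses only the standard library. The function name `describe_zip` is hypothetical, and the path in the usage comment is an assumed example, not necessarily where your upload landed.

```python
import zipfile


def describe_zip(zip_path):
    """Return (name, uncompressed_size_in_bytes) for each file in the archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a zip archive")
    with zipfile.ZipFile(zip_path, "r") as z:
        # Skip directory entries; report only real files.
        return [(info.filename, info.file_size)
                for info in z.infolist()
                if not info.is_dir()]


# Example usage (assumed path; replace with the path copied from the UI):
# for name, size in describe_zip("/dbfs/tmp/tmp.zip"):
#     print(name, size)
```

The reported sizes are the uncompressed sizes, so this also shows how much space the extracted files will need.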
01-31-2023 09:45 PM
Thanks for your kind response. I've already found the article on shell commands and the unzipping information. However, I just don't have the Python background yet to set this up with only the documentation to go off of.
I understand that I need to set up the %sh command at the beginning, but I don't understand what to do with the "import" block of code. Where is that data being put? I can follow the notebook setup template after I can locate where the unzipped data is going via that import/unzip command.
02-01-2023 01:40 AM
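After uploading the zip, copy its path from the UI and unzip it with something similar to:

    import os
    import shutil
    import zipfile

    zip_file = "/dbfs/tmp/tmp.zip"  # path to the uploaded zip, copied from the UI
    target_dir = "/dbfs/tmp/"       # the extracted files are written here

    with zipfile.ZipFile(zip_file, "r") as z:
        for filename in z.namelist():
            if filename.endswith("/"):
                continue  # skip directory entries inside the archive
            extracted_file = os.path.join(target_dir, filename)
            # Create any subdirectories the archive member needs
            os.makedirs(os.path.dirname(extracted_file), exist_ok=True)
            with z.open(filename) as f, open(extracted_file, "wb") as output:
                # Copy in chunks so a multi-GB file is never held in memory at once
                shutil.copyfileobj(f, output)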
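To confirm that the extraction produced readable text rather than compressed bytes, you can look at the first few hundred characters of an extracted file. This is a minimal sketch: `preview_text` is a hypothetical helper, and the file name in the usage comment is an assumption, since it depends on what the archive contains.

```python
def preview_text(path, n_chars=500, encoding="utf-8"):
    """Return the first n_chars characters of a file.

    Reads a bounded amount, so it is safe even if a multi-GB JSON
    file is stored as one very long line.
    """
    # errors="replace" keeps the preview from failing on undecodable bytes.
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read(n_chars)


# Example usage (assumed file name; use the name found in your archive):
# print(preview_text("/dbfs/tmp/data.json"))
```

If the output starts with `{` or `[`, the file is most likely plain JSON. If it starts with `PK` followed by unreadable characters, the file is still a zip archive.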
04-08-2023 09:04 PM
Hi @Sage Olson
Hope everything is going great.
Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.
Cheers!