How to upload large files to Databricks, and how to unzip files successfully?

Sagacious
New Contributor II

I have two JSON files, one ~3 GB and one ~5 GB. I am unable to upload them to Databricks Community Edition because they exceed the maximum allowed upload size (~2 GB).

If I zip them, I am able to upload them, but I am also having trouble figuring out how to unzip the files into a readable format; currently the import preview shows only unreadable characters.

I'm relatively new to Databricks, just using it for a SQL certification, so I'd like to import the JSON into a queryable table.

Thanks.

Aviral-Bhardwaj
Esteemed Contributor III

@Sage Olson Instead of uploading directly to Databricks, you can put your data in any cloud provider's storage and then read the files from there with Databricks. It is safe.

Debayan
Esteemed Contributor III

Hi, you can create a notebook on a Databricks cluster and unzip the files using Linux commands in the notebook. Please refer to: https://docs.databricks.com/notebooks/notebooks-code.html

Also, when entering the command, keep the notebook in Python mode and start the notebook cell with %sh, so that the cell's contents are picked up as shell commands and unzip the file.

For unzipping, you can refer to: https://docs.databricks.com/files/unzip-files.html and https://community.databricks.com/s/question/0D58Y00009az9bGSAQ/unzip-files.

Sagacious
New Contributor II

Thanks for your kind response. I've already found the article on shell commands and the unzipping information; however, I don't yet have the Python background to set this up with only the documentation to go on.

I understand that I need to set up the %sh command at the beginning, but I don't understand what to do with the "import" block of code. Where is that data being put? I can follow the notebook setup template after I can locate where the unzipped data is going via that import/unzip command.

Hubert-Dudek
Esteemed Contributor III

After uploading the zip, copy its path from the UI and unzip it with something similar to:

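import os
import shutil
import zipfile

zip_file = "/dbfs/tmp/tmp.zip"  # path to the uploaded zip, copied from the UI
target_dir = "/dbfs/tmp/"       # the extracted files are written here

with zipfile.ZipFile(zip_file, "r") as z:
    for filename in z.namelist():
        if filename.endswith("/"):
            continue  # skip directory entries inside the archive
        extracted_file = os.path.join(target_dir, filename)
        # Create parent folders for members stored in subdirectories
        os.makedirs(os.path.dirname(extracted_file), exist_ok=True)
        # Copy in chunks; f.read() would load a multi-GB file into memory at once
        with z.open(filename) as f, open(extracted_file, "wb") as output:
            shutil.copyfileobj(f, output)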
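
Once the files are extracted under /dbfs/tmp/, the remaining step toward a queryable table could look roughly like the sketch below. It is only an illustration under some assumptions: the helper names `to_spark_path` and `load_json_as_table`, the file name `data.json`, and the table name `my_json_table` are made up for this example, and `spark` is the session object that Databricks notebooks normally provide. The one thing to watch is that local file APIs see the file as /dbfs/..., while Spark readers expect the dbfs:/... form of the same location.

```python
def to_spark_path(local_path):
    """Map a local '/dbfs/...' path to the 'dbfs:/...' form used by Spark readers."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError(f"expected a path under {prefix}, got {local_path!r}")
    return "dbfs:/" + local_path[len(prefix):]


def load_json_as_table(spark, local_path, table_name, multiline=False):
    """Read a JSON file with Spark and save it as a table (hypothetical helper).

    Set multiline=True if the file is one large JSON document rather than
    one JSON object per line, which is what Spark assumes by default.
    """
    reader = spark.read.option("multiLine", multiline)
    df = reader.json(to_spark_path(local_path))
    # Overwrite so that re-running the cell does not fail on an existing table
    df.write.mode("overwrite").saveAsTable(table_name)
    return df


# In a notebook cell (file and table names are placeholders):
# load_json_as_table(spark, "/dbfs/tmp/data.json", "my_json_table")
# Afterwards the table can be queried from a SQL cell:
# SELECT * FROM my_json_table LIMIT 10
```

If the read returns a single column of corrupt records, that is usually a sign the file is a multi-line JSON document, in which case passing multiline=True is worth trying.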

Anonymous
Not applicable

Hi @Sage Olson​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!
