Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to upload large files to Databricks, and how to unzip them successfully?

Sagacious
New Contributor II

I have two JSON files, one ~3 GB and one ~5 GB. I am unable to upload them to Databricks Community Edition because they exceed the maximum allowed upload size (~2 GB).

If I zip them, I am able to upload them, but I can't figure out how to unzip the files into a readable format. Currently the import preview shows only unreadable characters.

I'm relatively new to Databricks and am just using it for a SQL certification, so I'd like to import the JSON into a queryable table.

Thanks.

5 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

@Sage Olson Instead of uploading to Databricks directly, you can put your data in any cloud provider's storage and then read the files from there using Databricks. It is safe.

AviralBhardwaj

Debayan
Databricks Employee

Hi, you can create a notebook attached to a Databricks cluster and unzip the files using Linux commands in the notebook. Please refer to: https://docs.databricks.com/notebooks/notebooks-code.html

Also, when entering the command, set the notebook language to Python and start the cell with %sh, so that the cell's contents are run as shell commands and unzip the file.

For unzipping, you can refer to https://docs.databricks.com/files/unzip-files.html and https://community.databricks.com/s/question/0D58Y00009az9bGSAQ/unzip-files .
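
Before unzipping, it can help to confirm that the uploaded file really is a zip archive and to see what it contains. The preview shows unreadable characters because it is displaying the compressed bytes, not the JSON. Below is a minimal sketch using only Python's standard zipfile module; describe_zip is a helper name made up for this example, and the path in the usage note is a placeholder rather than anything Databricks guarantees.

```python
import zipfile


def describe_zip(zip_path):
    """Return (name, compressed_bytes, uncompressed_bytes) for each archive member.

    `describe_zip` is a hypothetical helper for illustration only.
    """
    # is_zipfile checks the file's structure, not just its extension
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a zip archive")
    with zipfile.ZipFile(zip_path, "r") as z:
        return [(info.filename, info.compress_size, info.file_size)
                for info in z.infolist()]
```

In a notebook cell you might call describe_zip("/dbfs/tmp/tmp.zip"), replacing the placeholder path with the one copied from the upload UI. The uncompressed sizes tell you how much space the extracted JSON will need.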

Sagacious
New Contributor II

Thanks for your kind response. I've already found the article on shell commands and the unzipping information, but I don't yet have the Python background to set this up with just the documentation to go on.

I understand that I need to set up the %sh command at the beginning, but I don't understand what to do with the "import" block of code. Where is that data being put? I can follow the notebook setup template after I can locate where the unzipped data is going via that import/unzip command.

Hubert-Dudek
Esteemed Contributor III

After uploading the zip, copy its path from the UI and unzip it with something similar to:

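import os
import shutil
import zipfile

zip_file = "/dbfs/tmp/tmp.zip"  # path copied from the upload UI
target_dir = "/dbfs/tmp/"       # extracted files are written here

with zipfile.ZipFile(zip_file, "r") as z:
    for info in z.infolist():
        if info.is_dir():
            continue  # skip folder entries inside the archive
        extracted_file = os.path.join(target_dir, info.filename)
        # Create parent folders for members stored in subfolders
        os.makedirs(os.path.dirname(extracted_file), exist_ok=True)
        # Copy in chunks so a multi-GB file is never held fully in memory
        with z.open(info) as src, open(extracted_file, "wb") as dst:
            shutil.copyfileobj(src, dst)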
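
To see where the unzipped data ended up, you can list the target folder and look at the first few characters of a file. This is only a sketch: list_extracted and peek are helper names made up for this example, and it assumes the /dbfs/tmp/ folder from the snippet above is readable from the notebook.

```python
import os


def list_extracted(directory):
    """Return sorted (path, size_in_bytes) pairs for files directly in `directory`."""
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # ignore subfolders
            results.append((path, os.path.getsize(path)))
    return results


def peek(path, num_bytes=200):
    """Return the first bytes of a file as text, to check it is readable JSON."""
    with open(path, "rb") as f:
        # errors="replace" avoids a crash if the cut falls inside a multi-byte character
        return f.read(num_bytes).decode("utf-8", errors="replace")
```

For example, list_extracted("/dbfs/tmp/") should show the extracted JSON files next to the original zip, and peek on one of those paths should start with { or [ rather than unreadable characters.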

Anonymous
Not applicable

Hi @Sage Olson​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!
