Databricks

Sagacious · ‎01-31-2023

I have two JSON files, one ~3 GB and one ~5 GB. I can't upload them to Databricks Community Edition because they exceed the maximum upload size (~2 GB).

If I zip them I can upload them, but I'm having trouble figuring out how to unzip the files into a readable format. Currently the import preview shows only unreadable characters.

I'm relatively new to Databricks and am only using it for a SQL certification, so I'd like to import the JSON into a queryable table.

Thanks.

Aviral-Bhardwaj · ‎01-31-2023

@Sage Olson Instead of uploading the files directly to Databricks, you can put the data in any cloud provider's storage and then read the files from there with Databricks. It is safe.

Debayan · ‎01-31-2023

Hi, you can create a notebook on a Databricks cluster and unzip the files using Linux commands in the notebook. Please refer to: https://docs.databricks.com/notebooks/notebooks-code.html

Also, when entering the command, run the notebook in Python mode and start the notebook cell with %sh, so that the cell's contents are picked up as shell commands and the file is unzipped.

For unzipping, you can refer to https://docs.databricks.com/files/unzip-files.html and https://community.databricks.com/s/question/0D58Y00009az9bGSAQ/unzip-files .

Sagacious · ‎01-31-2023

Thanks for your kind response. I've already found the article on shell commands and the unzipping information, but I don't yet have the Python background to set this up with only the documentation to go on.

I understand that I need to set up the %sh command at the beginning, but I don't understand what to do with the "import" block of code. Where is that data being put? I can follow the notebook setup template after I can locate where the unzipped data is going via that import/unzip command.

Hubert-Dudek · ‎02-01-2023

After uploading the zip, copy its path from the UI and unzip it with something similar to:

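import os
import shutil
import zipfile

zip_file = "/dbfs/tmp/tmp.zip"  # path copied from the UI
target_dir = "/dbfs/tmp/"       # extracted files are written here

with zipfile.ZipFile(zip_file, "r") as z:
    for info in z.infolist():
        if info.is_dir():
            continue  # directory entries carry no data
        extracted_file = os.path.join(target_dir, info.filename)
        os.makedirs(os.path.dirname(extracted_file), exist_ok=True)
        # Copy in chunks; f.read() would load a multi-GB file into memory at once
        with z.open(info) as f, open(extracted_file, "wb") as output:
            shutil.copyfileobj(f, output)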
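
To see where the unzipped data ends up, the same pattern can be tried on a small throwaway zip first. This is only a minimal sketch: `extract_zip` and `to_spark_path` are hypothetical helper names made up for illustration, and the demo uses a temporary directory instead of /dbfs/tmp/ so it runs anywhere. The second helper assumes the usual convention that a file at /dbfs/... on the driver is addressed as dbfs:/... when read with Spark.

```python
import os
import shutil
import tempfile
import zipfile


def extract_zip(zip_path, target_dir):
    """Extract every file in zip_path into target_dir and return the written paths."""
    written = []
    with zipfile.ZipFile(zip_path, "r") as z:
        for info in z.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            out_path = os.path.join(target_dir, info.filename)
            os.makedirs(os.path.dirname(out_path), exist_ok=True)
            # Copy in chunks so a large member is never held in memory at once.
            with z.open(info) as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(out_path)
    return written


def to_spark_path(local_path):
    """Map a /dbfs/... local path to the dbfs:/... form used with Spark readers."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError("expected a path under /dbfs/")
    return "dbfs:/" + local_path[len(prefix):]


# Demo on a throwaway zip (hypothetical file names, not the real upload).
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "tmp.zip")
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("data.json", '{"id": 1}\n')

extracted = extract_zip(demo_zip, demo_dir)
print(extracted)  # the extracted file sits inside demo_dir, next to the zip
print(to_spark_path("/dbfs/tmp/data.json"))
```

On Databricks, the returned path could then be loaded into a table with something like spark.read.json(to_spark_path(path)).write.saveAsTable("my_table"), where my_table is a placeholder name. Note that spark.read.json expects one JSON object per line by default; a file holding a single large JSON document would need .option("multiLine", True).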

Anonymous · ‎04-08-2023

Hi @Sage Olson

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.

Cheers!

How to upload large files to Databricks? and how to unzip files successfully?
