Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Unable to unzip files recursively and copy into a different folder

traillog
New Contributor

I am trying to unzip files recursively from one folder (the source folder) and copy all of the unzipped files into a destination folder using Databricks (PySpark). The destination path is still empty after I run the code below. I looked for solutions online, but most of them handle only a single zip file.

import os
import zipfile
from pyspark.dbutils import DBUtils

local_file_path = dbutils.fs.ls('abfss://storage@container.dfs.core.windows.net/zipped_folder')
dest_file_path = dbutils.fs.ls('abfss://storage@container.dfs.core.windows.net/unzipped_folder')

def unzip_file(file_path, extract_path):
    # Copy the file to local file system
    local_path = "/tmp/" + os.path.basename(file_path)
    dbutils.fs.cp(file_path, "file:" + local_path, recurse=True)

    # Unzip the file on local file system
    with zipfile.ZipFile(local_path, 'r') as zip_ref:
        zip_ref.extractall("/tmp")

    # Create the destination folder if it doesn't exist
    dbutils.fs.mkdirs(extract_path)

    # Copy the unzipped files to the destination folder
    for root, dirs, files in os.walk("/tmp"):
        for filename in files:
            if filename != os.path.basename(file_path):
                local_file = os.path.join(root, filename)
                dest_file = extract_path + "/" + os.path.relpath(local_file, "/tmp")
                try:
                    dbutils.fs.cp("file:" + local_file, dest_file, recurse=True)  # Remove 'overwrite=True'
                except Exception as e:
                    print(f"Error copying file: {e}")

    # Remove the temporary files
    dbutils.fs.rm(local_path, recurse=True)

# Call the unzip_file function with the appropriate arguments
if dest_file_path:
    unzip_file(str(local_file_path[0].path), str(dest_file_path[0].path))

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @traillog

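  • The destination stays empty because unzip_file() is never called. dbutils.fs.ls() on an empty folder returns an empty list, so the check "if dest_file_path:" is False for as long as unzipped_folder has nothing in it. Pass the destination path as a plain string instead of listing it.
  • Your code also passes only local_file_path[0], so at most one entry would be processed. dbutils.fs.ls() is not recursive: it returns a flat list of FileInfo objects for one folder, so you need to descend into subfolders yourself. os.walk() only traverses the driver’s local file system, not abfss:// paths.
  • zipfile.ZipFile() cannot open abfss:// paths directly, so copying each archive to the driver’s local disk first, as you already do, is the right approach. Extract into a dedicated temporary folder rather than /tmp itself; otherwise os.walk("/tmp") also picks up unrelated files.
  • Use dbutils.fs.cp() with the file: prefix to copy between the local disk and cloud storage. It takes a source, a destination, and an optional recurse flag. It has no overwrite argument, so leaving overwrite=True out is correct.
  • dbutils.fs.rm(local_path) without the file: prefix is resolved against DBFS rather than the driver’s local disk, so the temporary archive is never removed. Cleaning up with shutil.rmtree() avoids that.
  • Here’s an updated version of your code:
  • import os
    import shutil
    import tempfile
    import zipfile
    
    def list_zip_files(folder):
        # dbutils.fs.ls is not recursive, so descend into subfolders manually
        for info in dbutils.fs.ls(folder):
            if info.isDir():
                yield from list_zip_files(info.path)
            elif info.path.lower().endswith('.zip'):
                yield info.path
    
    def unzip_file(file_path, extract_path):
        # zipfile needs a local file, so stage the archive on the driver first
        work_dir = tempfile.mkdtemp()
        try:
            local_zip = os.path.join(work_dir, os.path.basename(file_path))
            dbutils.fs.cp(file_path, "file:" + local_zip)
            extracted = os.path.join(work_dir, "extracted")
            with zipfile.ZipFile(local_zip, 'r') as zip_ref:
                zip_ref.extractall(extracted)
            # Copy each extracted file to the destination, keeping its relative path
            for root, dirs, files in os.walk(extracted):
                for filename in files:
                    local_file = os.path.join(root, filename)
                    rel_path = os.path.relpath(local_file, extracted)
                    dbutils.fs.cp("file:" + local_file, extract_path + "/" + rel_path)
        finally:
            shutil.rmtree(work_dir, ignore_errors=True)
    
    # Specify your source and destination paths
    source_folder = 'abfss://storage@container.dfs.core.windows.net/zipped_folder'
    destination_folder = 'abfss://storage@container.dfs.core.windows.net/unzipped_folder'
    
    # Recursively process files in the source folder
    for zip_path in list_zip_files(source_folder):
        unzip_file(zip_path, destination_folder)
    
    print("Unzipping completed.")
    
  • Remember to replace source_folder and destination_folder with your actual paths. This version walks subfolders, stages each archive on the driver before extracting it, and copies the extracted files to the destination with their relative paths preserved. It assumes a cluster where dbutils can read and write the driver’s local disk through file: paths. Let me know if you have any questions or need further assistance!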
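If you want to check the recursive unzip logic before running anything against cloud storage, here is a minimal local sketch. It assumes plain local folders instead of abfss:// paths, so os.walk() stands in for a recursive dbutils.fs.ls() traversal. The helper names find_zip_files and unzip_all are hypothetical and are not part of any Databricks API.

```python
import os
import tempfile
import zipfile


def find_zip_files(folder):
    """Recursively collect paths of .zip files under a local folder."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(".zip"):
                found.append(os.path.join(root, name))
    return sorted(found)


def unzip_all(source_folder, destination_folder):
    """Extract every archive under source_folder into destination_folder.

    Returns the sorted names of the extracted files (directory entries skipped).
    """
    os.makedirs(destination_folder, exist_ok=True)
    extracted = []
    for zip_path in find_zip_files(source_folder):
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(destination_folder)
            extracted.extend(n for n in zip_ref.namelist() if not n.endswith("/"))
    return sorted(extracted)


if __name__ == "__main__":
    # Build a throwaway tree: one archive at the top level, one in a subfolder
    with tempfile.TemporaryDirectory() as base:
        src = os.path.join(base, "zipped_folder")
        dst = os.path.join(base, "unzipped_folder")
        os.makedirs(os.path.join(src, "nested"))
        with zipfile.ZipFile(os.path.join(src, "a.zip"), "w") as z:
            z.writestr("a.txt", "first")
        with zipfile.ZipFile(os.path.join(src, "nested", "b.zip"), "w") as z:
            z.writestr("sub/b.txt", "second")
        print(unzip_all(src, dst))
```

Note that the destination folder does not have to exist or contain anything beforehand; the function creates it, which sidesteps the empty-listing problem in the original code. If two archives contain a file with the same name, the later one replaces the earlier one in this sketch.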