<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Get metadata of files present in a zip in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83377#M36919</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is the code which I am using&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def register_udf():
    def extract_file_metadata_from_zip(binary_content):
        metadata_list = []
        with io.BytesIO(binary_content) as bio:
            with zipfile.ZipFile(bio, "r") as zip_ref:
                for file_info in zip_ref.infolist():
                    file_name = file_info.filename
                    modification_time = datetime.datetime(*file_info.date_time)
                    metadata_list.append((file_name, modification_time))
        return metadata_list
    meta_schema = ArrayType(
        StructType(
            [
                StructField("file_name", StringType(), True),
                StructField("modification_time", TimestampType(), True),
            ]
        )
    )
    extract_metadata_udf = udf(extract_file_metadata_from_zip, meta_schema)
    return extract_metadata_udf
    
    
def get_last_modification_times(zip_file_path, expected_date, extract_metadata_udf):
    try:
        zip_file_df = (
            spark.read.format("binaryFile")
            .option("pathGlobFilter", "*.zip")
            .load(zip_file_path)
        )
        extracted_metadata_df = zip_file_df.withColumn(
            "file_metadata", extract_metadata_udf(col("content"))
        )
        exploded_metadata_df = extracted_metadata_df.select(
            explode("file_metadata").alias("metadata")
        )
        return exploded_metadata_df 
    except Exception as e:
        print("An error occurred: ", str(e))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 19 Aug 2024 02:23:23 GMT</pubDate>
    <dc:creator>seeker</dc:creator>
    <dc:date>2024-08-19T02:23:23Z</dc:date>
    <item>
      <title>Get metadata of files present in a zip</title>
      <link>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83376#M36918</link>
      <description>&lt;P&gt;I have a .zip file present on an ADLS path which contains multiple files of different formats. I want to get metadata of the files like file name, modification time present in it without unzipping it. I have a code which works for smaller zip but runs into memory issues for large zip files leading to job failures. Is there a way to handle this within pyspark itself?&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2024 02:05:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83376#M36918</guid>
      <dc:creator>seeker</dc:creator>
      <dc:date>2024-08-19T02:05:51Z</dc:date>
    </item>
    <item>
      <title>Re: Get metadata of files present in a zip</title>
      <link>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83377#M36919</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is the code which I am using&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def register_udf():
    def extract_file_metadata_from_zip(binary_content):
        metadata_list = []
        with io.BytesIO(binary_content) as bio:
            with zipfile.ZipFile(bio, "r") as zip_ref:
                for file_info in zip_ref.infolist():
                    file_name = file_info.filename
                    modification_time = datetime.datetime(*file_info.date_time)
                    metadata_list.append((file_name, modification_time))
        return metadata_list
    meta_schema = ArrayType(
        StructType(
            [
                StructField("file_name", StringType(), True),
                StructField("modification_time", TimestampType(), True),
            ]
        )
    )
    extract_metadata_udf = udf(extract_file_metadata_from_zip, meta_schema)
    return extract_metadata_udf
    
    
def get_last_modification_times(zip_file_path, expected_date, extract_metadata_udf):
    try:
        zip_file_df = (
            spark.read.format("binaryFile")
            .option("pathGlobFilter", "*.zip")
            .load(zip_file_path)
        )
        extracted_metadata_df = zip_file_df.withColumn(
            "file_metadata", extract_metadata_udf(col("content"))
        )
        exploded_metadata_df = extracted_metadata_df.select(
            explode("file_metadata").alias("metadata")
        )
        return exploded_metadata_df 
    except Exception as e:
        print("An error occurred: ", str(e))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2024 02:23:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83377#M36919</guid>
      <dc:creator>seeker</dc:creator>
      <dc:date>2024-08-19T02:23:23Z</dc:date>
    </item>
    <item>
      <title>Re: Get metadata of files present in a zip</title>
      <link>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83451#M36936</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116518"&gt;@seeker&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I'm afraid there is no easy way to do that in pyspark.&amp;nbsp;Spark supports the following compression formats:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;bzip2&lt;/LI&gt;&lt;LI&gt;deflate&lt;/LI&gt;&lt;LI&gt;snappy&lt;/LI&gt;&lt;LI&gt;lz4&lt;/LI&gt;&lt;LI&gt;gzip&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Thus, there is no native support for the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;zip&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;format. And your solution will be slow, because you are using UDF which means it will apply this function on every row &lt;span class="lia-unicode-emoji" title=":confused_face:"&gt;😕&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2024 13:34:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83451#M36936</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-08-19T13:34:33Z</dc:date>
    </item>
    <item>
      <title>Re: Get metadata of files present in a zip</title>
      <link>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83589#M36969</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116518"&gt;@seeker&lt;/a&gt;, Thanks for reaching out!&lt;/P&gt;
&lt;P&gt;Please review the responses and let us know which best addresses your question. Your feedback is valuable to us and the community.&lt;/P&gt;
&lt;P&gt;If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries.&lt;/P&gt;
&lt;P&gt;We appreciate your participation and are here if you need further assistance!&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2024 11:43:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83589#M36969</guid>
      <dc:creator>Retired_mod</dc:creator>
      <dc:date>2024-08-20T11:43:20Z</dc:date>
    </item>
    <item>
      <title>Re: Get metadata of files present in a zip</title>
      <link>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83671#M36980</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116518"&gt;@seeker&lt;/a&gt;,&amp;nbsp;There are only 2 ways I can think of to do it:&lt;/P&gt;
&lt;UL class="p-rich_text_list p-rich_text_list__bullet" data-stringify-type="unordered-list" data-indent="0" data-border="0"&gt;
&lt;LI data-stringify-indent="0" data-stringify-border="0"&gt;Write a UDF.&lt;/LI&gt;
&lt;LI data-stringify-indent="0" data-stringify-border="0"&gt;Write customized MapReduce logic instead of using Spark SQL.&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="p-rich_text_section"&gt;But they are kind of the same. So I would say UDF is a good solution.&lt;/DIV&gt;</description>
      <pubDate>Tue, 20 Aug 2024 18:55:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-metadata-of-files-present-in-a-zip/m-p/83671#M36980</guid>
      <dc:creator>Retired_mod</dc:creator>
      <dc:date>2024-08-20T18:55:36Z</dc:date>
    </item>
  </channel>
</rss>

