<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Volume Limitations in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/volume-limitations/m-p/92138#M38367</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;since you have so many files, I have a suggestion: do not use Spark to read them all in at once, as it will slow down greatly.&lt;/P&gt;&lt;P&gt;Instead, use boto3 for the file listing, distribute the list across the cluster, and use boto3 again to fetch the files and compact them into Parquet, Iceberg, Delta, or ORC.&lt;BR /&gt;AWS has (had?) example code (somewhere on GitHub) doing this with EMR and the Java API of Spark.&lt;/P&gt;</description>
    <pubDate>Fri, 27 Sep 2024 19:13:59 GMT</pubDate>
    <dc:creator>de-qrosh</dc:creator>
    <dc:date>2024-09-27T19:13:59Z</dc:date>
    <item>
      <title>Volume Limitations</title>
      <link>https://community.databricks.com/t5/data-engineering/volume-limitations/m-p/63935#M32408</link>
      <description>&lt;P&gt;I have a use case to create a table from JSON files. There are 36 million files in the upstream S3 bucket, and I created a volume on top of it, so the volume has 36M files. I'm trying to form a DataFrame by reading this volume with the Spark line below:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;spark.read.json(volume_path)&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;But I'm unable to form it because of the following error:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The spark driver has stopped unexpectedly and is restarting.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I also tried just listing the volume, which failed as well with &lt;STRONG&gt;overhead limit 1000&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;I want to know the limitations of Databricks volumes, such as:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Is there any limit to the number of files a volume can contain?&lt;/LI&gt;&lt;LI&gt;What is a good number of files per volume so that processing is smoother?&lt;/LI&gt;&lt;LI&gt;Is there a better alternative for processing these 36M files in the same volume?&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Mon, 18 Mar 2024 03:26:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/volume-limitations/m-p/63935#M32408</guid>
      <dc:creator>Sampath_Kumar</dc:creator>
      <dc:date>2024-03-18T03:26:49Z</dc:date>
    </item>
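    <!--
      A minimal PySpark reconstruction of the read and listing attempts described in the post above.
      The volume path is a hypothetical placeholder, and both calls are expected to struggle at this
      file count, exactly as the post reports.

        # Hedged sketch reconstructing the attempts described in the question above.
        # The volume path is a hypothetical placeholder, not taken from the thread.
        volume_path = "/Volumes/my_catalog/my_schema/my_volume/json_landing"

        # Reading all ~36M JSON files at once forces Spark to list and schema-infer
        # across every file; the post reports this crashing the driver.
        df = spark.read.json(volume_path)

        # Listing the volume directly also fails at this scale, per the post
        # ("overhead limit 1000").
        files = dbutils.fs.ls(volume_path)
    -->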
    <item>
      <title>Re: Volume Limitations</title>
      <link>https://community.databricks.com/t5/data-engineering/volume-limitations/m-p/64043#M32443</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thanks for your prompt response.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;I referred to the volume information in the documentation.&lt;/LI&gt;&lt;LI&gt;Recommendations:&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Number of files:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;While I understand that fewer, larger files tend to perform better than many small files, we've received a fixed set of files from the upstream team.&lt;/LI&gt;&lt;LI&gt;These files are located in an S3 location designated for data processing.&lt;/LI&gt;&lt;LI&gt;My task now is to process this data and redistribute it for better performance.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Processing Efficiency:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;In order to proceed with processing the data, I need to read it and form a DataFrame.&lt;/LI&gt;&lt;LI&gt;However, I'm encountering difficulties in reading the data efficiently, particularly due to its format and size.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;I've opted to use an external volume to leverage the same storage for this purpose.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Alternatives:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;The data is in the form of JSON files and is stored in an S3 bucket, not in ADLS Gen2.&lt;/LI&gt;&lt;LI&gt;Given these constraints, I'm exploring alternative approaches to efficiently read and process the data.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Use Case:&lt;/STRONG&gt;&lt;OL&gt;&lt;LI&gt;To provide context, there are approximately 36 million JSON files in an S3 bucket that require processing.&lt;/LI&gt;&lt;LI&gt;The objective is to create a Delta table in the silver layer, which involves changing the file format and shuffling the data correctly.&lt;/LI&gt;&lt;LI&gt;First, to make the data accessible in Databricks, I've created an external volume on top of the folder containing all 36 million files.&lt;/LI&gt;&lt;LI&gt;Now I'm trying to create that silver-layer Delta table.&lt;/LI&gt;&lt;LI&gt;To do this, I'm using the Spark line below:&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;spark.read.json(volume_path)&lt;/STRONG&gt;, where I'm encountering an error, as mentioned in yesterday's question.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;I'm seeking advice on whether there are alternative methods to read these files from the volume and create a DataFrame.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Your insights and guidance on this matter would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Mar 2024 02:25:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/volume-limitations/m-p/64043#M32443</guid>
      <dc:creator>Sampath_Kumar</dc:creator>
      <dc:date>2024-03-19T02:25:56Z</dc:date>
    </item>
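    <!--
      A minimal sketch of the objective described in the post above, assuming a Databricks notebook:
      read the JSON files from the external volume and write them out as a Delta table in the silver
      layer. The volume path and table name are hypothetical placeholders; at 36M files the read
      itself is the bottleneck discussed in this thread.

        # Hedged sketch of the stated goal: JSON in an external volume to a silver Delta table.
        # All names below are hypothetical placeholders.
        volume_path = "/Volumes/my_catalog/my_schema/my_volume/json_landing"

        raw_df = (
            spark.read
            .option("recursiveFileLookup", "true")  # pick up nested folders, if any
            .json(volume_path)
        )

        (
            raw_df.write
            .format("delta")
            .mode("overwrite")
            .saveAsTable("my_catalog.silver.events")  # hypothetical silver table
        )
    -->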
    <item>
      <title>Re: Volume Limitations</title>
      <link>https://community.databricks.com/t5/data-engineering/volume-limitations/m-p/92138#M38367</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;since you have so many files, I have a suggestion: do not use Spark to read them all in at once, as it will slow down greatly.&lt;/P&gt;&lt;P&gt;Instead, use boto3 for the file listing, distribute the list across the cluster, and use boto3 again to fetch the files and compact them into Parquet, Iceberg, Delta, or ORC.&lt;BR /&gt;AWS has (had?) example code (somewhere on GitHub) doing this with EMR and the Java API of Spark.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Sep 2024 19:13:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/volume-limitations/m-p/92138#M38367</guid>
      <dc:creator>de-qrosh</dc:creator>
      <dc:date>2024-09-27T19:13:59Z</dc:date>
    </item>
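    <!--
      A minimal sketch of the approach suggested in the post above, assuming a Databricks cluster with
      S3 access: list the keys with boto3, distribute the key list across the cluster, fetch each file
      with boto3 on the executors, and compact the result into a columnar table. The bucket, prefix,
      partition count, and table name are hypothetical placeholders.

        # Hedged sketch of the boto3 listing, distributed fetch, and compaction approach.
        import boto3

        BUCKET = "my-upstream-bucket"   # hypothetical
        PREFIX = "json_landing/"        # hypothetical

        # 1. List the object keys with boto3 on the driver (paginated, no Spark listing).
        s3 = boto3.client("s3")
        keys = []
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))

        # 2. Distribute the key list across the cluster.
        keys_rdd = spark.sparkContext.parallelize(keys, numSlices=2000)

        # 3. Fetch the raw JSON text with boto3 on the executors.
        def fetch(keys_in_partition):
            client = boto3.client("s3")  # one client per partition
            for key in keys_in_partition:
                body = client.get_object(Bucket=BUCKET, Key=key)["Body"].read()
                yield body.decode("utf-8")

        json_rdd = keys_rdd.mapPartitions(fetch)

        # 4. Parse into a DataFrame and compact; Parquet, Iceberg, Delta, or ORC all work here.
        df = spark.read.json(json_rdd)
        df.write.format("delta").mode("overwrite").saveAsTable("my_catalog.silver.events")
    -->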
  </channel>
</rss>

