Volume Limitations

Sampath_Kumar
New Contributor II

I have a use case to create a table from JSON files. There are 36 million files in the upstream S3 bucket, and I just created a volume on top of it, so the volume contains 36M files. I'm trying to form a DataFrame by reading this volume with the Spark line below:

spark.read.json(volume_path)

But I'm unable to form it because of the following error:

 The spark driver has stopped unexpectedly and is restarting.

I also tried simply listing out this volume, which failed too with an overhead limit of 1000.
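
The listing attempt was essentially the following (the path below is a placeholder for my real volume path):

# Illustrative only: listing the volume contents from a notebook
volume_path = "/Volumes/my_catalog/my_schema/json_volume/"  # placeholder
files = dbutils.fs.ls(volume_path)
print(len(files))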

I want to know the limitations of Databricks volumes, such as:

  1. Is there a limit on the number of files a volume can contain?
  2. What is a good number of files per volume so that processing runs smoothly?
  3. Is there a better alternative for processing these 36M files in the same volume?
2 REPLIES

Hi @Retired_mod,

Thanks for your prompt response.

  1. I referred to the volume information in the documentation.
  2. Recommendations:
    • Number of files:
      • While I understand that fewer, larger files tend to perform better than many small files, the number of files we receive is fixed by the upstream team.
      • These files are located in an S3 location designated for data processing.
      • My task now is to process this data and redistribute it for better performance.
    • Processing Efficiency:
      • To process the data, I need to read it and form a DataFrame.
      • However, I'm having difficulty reading the data efficiently, particularly because of its format and size.
    • I've opted to use an external volume to leverage the same storage for this purpose.
  3. Alternatives:
    • The data is in the form of JSON files and is stored in an S3 bucket, not in ADLS Gen2.
    • Given these constraints, I'm exploring alternative approaches to efficiently read and process the data.
  4. Use Case:
    1. To provide context, there are approximately 36 million JSON files in an S3 bucket that require processing.
    2. The objective is to create a delta table in the silver layer, which involves changing the file format and shuffling the data accurately.
    3. First, to make the data accessible in Databricks, I've created an external volume on top of the folder containing all 36 million files.
    4. Now I'm trying to create the Delta table in the silver layer, which changes the file format and shuffles the data correctly.
    5. To do the previous step, I'm using the Spark line below (roughly as sketched after this list):
      1. spark.read.json(volume_path), where I'm encountering the error mentioned in yesterday's question.
    6. I'm seeking advice on whether there are alternative methods to read these files from the volume and create a DataFrame.
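
For reference, the end-to-end step I'm attempting looks roughly like the sketch below; the catalog, schema, table name, and the JSON schema are placeholders, not my actual ones.

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder path to the external volume that sits on top of the S3 folder
volume_path = "/Volumes/my_catalog/my_schema/json_volume/"

# Supplying an explicit schema avoids a schema-inference pass over all 36M files
json_schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
])

df = (spark.read
      .schema(json_schema)
      .json(volume_path))

# Target: a Delta table in the silver layer, rewritten into fewer, larger files
(df.repartition(400)                 # illustrative partition count
   .write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("my_catalog.silver.events"))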

Your insights and guidance on this matter would be greatly appreciated.

Hi,

Since you have so many files, my suggestion is not to use Spark to read them all in at once, as that will slow things down greatly.

Instead, use boto3 for the file listing, distribute that list across the cluster, and again use boto3 to fetch the files and compact them into Parquet, Iceberg, Delta, or ORC.
AWS has (or had?) example code somewhere on GitHub doing this with EMR and the Java API of Spark.
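
A rough sketch of what I have in mind, in PySpark (the bucket, prefix, partition counts, and target table name are placeholders, and error handling is left out):

import boto3
from pyspark.sql import Row

bucket = "my-upstream-bucket"   # placeholder
prefix = "json-landing/"        # placeholder

# 1. List the keys once with boto3 (paginated) instead of letting Spark list 36M files
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])

# 2. Distribute the key list across the cluster
keys_rdd = spark.sparkContext.parallelize(keys, numSlices=2000)

# 3. Fetch each file on the executors with boto3
def fetch(partition):
    client = boto3.client("s3")   # one client per partition
    for key in partition:
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        yield Row(value=body.decode("utf-8"))

raw_df = spark.createDataFrame(keys_rdd.mapPartitions(fetch))

# 4. Compact into larger files as Delta (or Parquet/ORC/Iceberg)
(raw_df.repartition(400)
       .write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("my_catalog.silver.events_raw"))

Once everything sits in fewer, larger files, you can parse the raw JSON column with from_json and continue from there.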
