I have a use case where I need to create a table from JSON files. There are 36 million files in the upstream S3 bucket, and I created a volume on top of that bucket, so the volume contains 36M files. I'm trying to build a DataFrame by reading this volume with the following Spark line:
spark.read.json(volume_path)
But I'm unable to create the DataFrame; it fails with the following error:
The spark driver has stopped unexpectedly and is restarting.
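For reference, the full attempt looks roughly like this (a sketch; the catalog/schema/volume names are placeholders, not my real ones):

```python
# Sketch of what I'm running; the path components below are placeholders for
# my actual Unity Catalog volume, which sits on top of the S3 bucket.
volume_path = "/Volumes/<catalog>/<schema>/<volume>/"

# This is the call during which the driver stops unexpectedly and restarts.
df = spark.read.json(volume_path)
```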
I also tried simply listing the files in this volume, which failed as well with an "overhead limit 1000" error.
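The listing attempt was essentially the following (again a sketch, with the same placeholder path; I used dbutils.fs.ls from a notebook):

```python
# Sketch of the listing attempt on the same volume path; this is the call
# that failed with the overhead limit error mentioned above.
files = dbutils.fs.ls(volume_path)
print(len(files))
```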
I'd like to understand the limitations of Databricks volumes, specifically:
- Is there a limit on the number of files a volume can contain?
- What is a reasonable number of files per volume so that processing runs smoothly?
- Is there a better alternative for processing these 36M files in the same volume?