Volume Limitations

Sampath_Kumar
New Contributor II

I have a use case to create a table from JSON files. There are 36 million files in the upstream S3 bucket, and I created a volume on top of it, so the volume contains 36M files. I'm trying to build a DataFrame by reading this volume with the Spark line below:

spark.read.json(volume_path)

But I'm unable to build the DataFrame due to the following error:

 The spark driver has stopped unexpectedly and is restarting.

I also tried simply listing the contents of this volume, which also failed with an overhead limit (1000) error.
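
The listing attempt was along these lines (an approximation of what I ran; volume_path is the same path used above):

# Listing the volume contents also fails before returning at this file count
files = dbutils.fs.ls(volume_path)
print(len(files))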

I want to know the limitations of Databricks volumes, specifically:

  1. Is there any limit on the number of files a volume can contain?
  2. How many files should a volume ideally hold so that processing runs smoothly?
  3. Is there a better alternative for processing these 36M files in the same volume?

Kaniz
Community Manager

Hi @Sampath_Kumar, let's delve into the limitations and best practices related to Databricks volumes.

  1. Volume Limitations:

    • Managed Volumes: These are Unity Catalog-governed storage volumes created within the default storage location of the containing schema. They allow the creation of governed storage for working with files without the overhead of external locations and storage credentials.
    • External Volumes: These are Unity Catalog-governed storage volumes registered against a directory within an external location (e.g., S3, ADLS, etc.).
    • Volume Naming and Reference:
      • A volume name is an identifier that can be qualified with a catalog and schema name in SQL commands.
      • The path to access files in volumes uses the following format:
        • /Volumes/<catalog_identifier>/<schema_identifier>/<volume_identifier>/<path>/<file_name>
        • Alternatively, you can use dbfs:/ scheme, like:
          • dbfs:/Volumes/<catalog_identifier>/<schema_identifier>/<volume_identifier>/<path>/<file_name>
      • Note that Databricks normalizes the identifiers to lowercase.
    • File Format Support:
      • Volumes can store and access files in any format, including structured, semi-structured, and unstructured data.
  2. Recommended Practices:

    • Number of Files:
      • While there isn’t a strict limit on the number of files a volume can contain, consider the following:
        • Smaller Files: Having a large number of very small files can impact performance due to overhead (metadata management, listing, etc.).
        • Larger Files: It’s generally better to have fewer, larger files rather than many tiny files.
      • Aim for files that are large enough to avoid listing and metadata overhead, yet small enough to keep reads well parallelized.
    • Processing Efficiency:
      • If you’re dealing with 36 million files, consider partitioning your data to optimize query performance.
      • Use appropriate file formats (e.g., Parquet, ORC) that allow predicate pushdown and column pruning.
      • Leverage Databricks Delta Lake for transactional capabilities and performance enhancements (see the sketch after this list).
    • External vs. Managed Volumes:
      • Managed volumes are more convenient for governance, but external volumes allow you to work with data in existing storage locations.
      • Choose based on your use case and requirements.
  3. Alternatives:

    • Databricks Delta Lake:
      • If your data is structured, consider using Delta Lake. It provides ACID transactions, schema evolution, and performance optimizations.
      • Delta Lake can handle large-scale data efficiently.
    • Data Lake Storage Gen2 (ADLS Gen2):
      • If your data is already in ADLS Gen2, you can directly query it without creating a volume.
      • Use Databricks to read from ADLS Gen2 paths directly.
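
To make the last two points concrete, here is a minimal sketch of one way to go from the JSON files in your volume to a Delta table. It assumes you already know the structure of the JSON documents: supplying an explicit schema lets Spark skip schema inference, which would otherwise have to sample the 36 million files and is a likely contributor to the driver failure you are seeing. The catalog, schema, volume, field, and table names below are placeholders, and spark is the SparkSession already available in a Databricks notebook.

from pyspark.sql import types as T

# Placeholder path to the folder inside your external volume
volume_path = "/Volumes/<catalog_identifier>/<schema_identifier>/<volume_identifier>/<path>"

# Declare the JSON structure up front instead of letting Spark infer it
# (replace these fields with the actual structure of your documents)
schema = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("event_time", T.TimestampType()),
    T.StructField("payload", T.StringType()),
])

# Read the JSON files without schema inference
df = spark.read.schema(schema).json(volume_path)

# Compact millions of small JSON files into a modest number of larger Delta files
(
    df.repartition(512)  # tune to your cluster and data size
      .write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("<catalog>.<schema>.<silver_table>")  # placeholder table name
)

Even with an explicit schema, Spark still has to enumerate all 36 million objects, so expect the initial listing to take time; once the data has been converted to Delta, subsequent reads no longer pay that cost.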

Remember that Databricks volumes are a powerful way to organize and govern your data, but thoughtful design and optimization are crucial for efficient processing. Experiment with different approaches to find what works best for your specific use case! 🚀🔍

For more details, you can refer to the official Databricks documentation on volumes.

 

Sampath_Kumar
New Contributor II

Hi @Kaniz,

Thanks for your prompt response.

  1. I referred to the volume information in the documentation.
  2. Recommendations:
    • Number of files:
      • While I understand that fewer, larger files tend to perform better than many small files, the number of files we receive is determined by the upstream team.
      • These files are located in an S3 location designated for data processing.
      • My task now is to process this data and redistribute it for better performance.
    • Processing Efficiency:
      • In order to proceed with processing the data, I need to read it and form a data frame.
      • However, I'm encountering difficulties in reading the data efficiently, particularly due to its format and size.
    • I've opted to use an external volume to leverage the same storage for this purpose.
  3. Alternatives:
    • The data is in the form of JSON files and is stored in an S3 bucket, not in ADLS Gen2.
    • Given these constraints, I'm exploring alternative approaches to efficiently read and process the data.
  4. Use Case:
    1. To provide context, there are approximately 36 million JSON files in an S3 bucket that require processing.
    2. The objective is to create a Delta table in the silver layer, which involves changing the file format and redistributing the data correctly.
    3. First, to make the data accessible in Databricks, I created an external volume on top of the folder containing all 36 million files.
    4. Now I'm trying to create that Delta table in the silver layer, converting the file format and repartitioning the data.
    5. To do this, I'm using the Spark line below, which fails with the error mentioned in yesterday's question:
      1. spark.read.json(volume_path)
    6. I'm seeking advice on whether there are alternative methods to read these files from the volume and create a data frame (the sketch below shows the full flow I'm attempting).
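
For reference, this is roughly that flow; the repartition count and the target table name are placeholders, and volume_path points at the external volume folder:

# Step that currently fails: build a DataFrame over the 36M JSON files in the volume
df = spark.read.json(volume_path)

# Intended next step: redistribute the data and write it out as a silver Delta table
(
    df.repartition(512)  # placeholder partition count
      .write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("<catalog>.silver.<table>")  # placeholder table name
)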

Your insights and guidance on this matter would be greatly appreciated.
