Data Engineering

What config do we use to set row groups for Delta tables on Databricks?

dlaxminaresh
New Contributor

I have tried multiple ways to set the row group size for Delta tables in a Databricks notebook, but it isn't working, whereas I am able to set it properly using plain Spark.
I tried:

1. val blockSize = 1024 * 1024 * 60
   spark.sparkContext.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
   spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", blockSize)


2. df.repartition(1).write.option("parquet.block.size", blockSize).format("delta").mode("overwrite").save("<path>")

The same configs work fine with plain Parquet.
df size = 600 MB
block size = 60 MB

NumRowGroups should be 10.
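
For reference, here is a rough sketch (not part of the original post) of how the row-group count of the written data files can be checked from a Databricks notebook. It assumes the table was saved under the same "<path>" placeholder used above; parquet-hadoop ships with Spark, so no extra dependency should be needed:

// Sketch: count the row groups in each Parquet data file under the Delta table path.
// "<path>" is the same placeholder as in the post; adjust before running.
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val deltaPath = "<path>"
val conf = spark.sparkContext.hadoopConfiguration
val fs = new Path(deltaPath).getFileSystem(conf)

fs.listStatus(new Path(deltaPath))
  .filter(_.getPath.getName.endsWith(".parquet"))
  .foreach { status =>
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(status.getPath, conf))
    try println(s"${status.getPath.getName}: ${reader.getRowGroups.size()} row groups")
    finally reader.close()
  }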

1 REPLY

Kaniz
Community Manager

Hi @dlaxminaresh, setting the row-group size for Delta tables in Databricks can be a bit tricky, but let's explore some options to achieve this.

First, let’s address the approaches you’ve tried:

  1. Setting Block Sizes:

    • You’ve attempted to set the block size using the following configurations:
      val blockSize = 1024 * 1024 * 60
      spark.sparkContext.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
      spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", blockSize)
      
    • While this approach works for regular Parquet files, it might not carry over to Delta tables. Delta data files are still Parquet underneath, but on Databricks the Delta writer manages file layout itself and does not necessarily honor session-level Hadoop settings.
    • dfs.blocksize controls the filesystem block size, and parquet.block.size is the Parquet writer's row-group size hint; neither gives you a reliable handle on how Databricks sizes the files (and hence the row groups) of a Delta table.
  2. Repartitioning and Writing to Delta:

    • You’ve also tried repartitioning your DataFrame and writing it to a Delta table:
      df.repartition(1).write.option("parquet.block.size", blockSize).format("delta").mode("overwrite").save("<path>")
      
    • This approach works for regular Parquet files, but again it doesn't give you direct control over the row-group size within a Delta table (see the comparison sketch right after this list).
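
For contrast, here is a quick sketch (not from the original thread) of the equivalent plain Parquet write that the poster reports does honor the setting; "<parquet-path>" is a hypothetical output location:

      // Same option applied to a plain Parquet write, where the poster reports
      // that the 60 MB row-group size is respected. "<parquet-path>" is a placeholder.
      val blockSize = 1024 * 1024 * 60
      df.repartition(1)
        .write
        .option("parquet.block.size", blockSize)
        .mode("overwrite")
        .parquet("<parquet-path>")

Running the row-group check shown under the question against both outputs makes the difference easy to see: the plain Parquet file should show the ~10 row groups the poster expects, while the Delta write may not.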

Now, let’s explore some alternative ways to set the row group size for Delta tables:

  1. Delta Target File Size:

    • Delta tables let you specify a target file size through the delta.targetFileSize table property, which indirectly affects the row-group size because Delta tries to keep each data file around that size.
    • Example (set the property on the table, then rewrite or OPTIMIZE so new files pick it up):
      spark.sql("ALTER TABLE my_delta_table SET TBLPROPERTIES ('delta.targetFileSize' = '60mb')")
      
    • Adjust the delta.targetFileSize value to your desired file size, e.g. '60mb' to line up with the 60 MB row-group target from your example; a fuller sketch follows after this list.
  2. Auto-Merging Schema:

    • To efficiently manage schema evolution, enable auto-merging of schema changes:
      spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
      
    • This ensures that schema changes are automatically merged when writing to the Delta table.
  3. Optimal Row Group Size:

    • There’s no one-size-fits-all answer for the optimal row group size. It depends on your specific use case, query patterns, and data distribution.
    • Consider experimenting with different delta.targetFileSize values to find the right balance between row group size and query performance.
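
Putting the file-size option together, here is a minimal sketch. It assumes df is the 600 MB DataFrame from the question and my_delta_table is a hypothetical table name; the session-default config name is an assumption based on the documented spark.databricks.delta.properties.defaults.<property> pattern, and the 60mb value mirrors the question's target:

      // Option A (assumption: the properties.defaults pattern applies to targetFileSize):
      // set a session-level default so tables created afterwards inherit the property.
      spark.conf.set("spark.databricks.delta.properties.defaults.targetFileSize", "60mb")
      df.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("my_delta_table")   // hypothetical table name

      // Option B: set the property on an existing table, then compact so that
      // newly written files respect the new target size.
      spark.sql("ALTER TABLE my_delta_table SET TBLPROPERTIES ('delta.targetFileSize' = '60mb')")
      spark.sql("OPTIMIZE my_delta_table")

Keep in mind that delta.targetFileSize governs the size of the data files Delta writes, not the row-group layout inside a single file; with a 60 MB target you would typically end up with several ~60 MB files of one row group each rather than one 600 MB file containing 10 row groups.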

Remember that Delta tables provide additional features beyond regular Parquet files, such as ACID transactions and time travel. If you're working with large datasets, it's essential to optimize both storage and query performance. Feel free to adjust the parameters based on your requirements. 😊🚀