Max Columns for Delta table

User16783853906
Contributor III

Is there an upper limit or a recommended maximum number of columns for a Delta table?

1 ACCEPTED SOLUTION

User16783853906
Contributor III

Original answer posted by @Gray Gwizdz

This was a fun question to try and find the answer to! Thank you for that 🙂

I reviewed some of the most recent issues reported against Delta Lake and found a similar case where a user ran into performance problems with 1000 columns (https://github.com/delta-io/delta/issues/479). However, there is a pending pull request in which they tested with 4000 columns and saw much better performance (https://github.com/delta-io/delta/pull/584).

I also checked internally and saw another approach that I would recommend. One team was seeing slow write performance on a very wide table. Instead of defining thousands of columns, the architect packed most of the features into a single array-typed column, which improved write performance significantly: the intermediate state held the feature fields as a list of tuples, List[(key, value)], and the final output in the feature store kept them as Map[key, aggregated_value].
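A minimal PySpark sketch of that key/value layout, assuming spark is an existing SparkSession (for example in a Databricks notebook); the entity, feature, and table names here are purely illustrative, not from the original post:

from pyspark.sql import functions as F

# Intermediate state: one row per (entity, feature) pair instead of one column per feature.
long_df = spark.createDataFrame(
    [("user_1", "clicks_7d", 12.0),
     ("user_1", "spend_7d", 34.5),
     ("user_2", "clicks_7d", 3.0)],
    schema="entity_id string, feature_key string, feature_value double",
)

# Final output: one row per entity with a single Map[key, aggregated_value] column,
# built by collecting (key, value) structs and turning them into a map.
features_df = (
    long_df
    .groupBy("entity_id")
    .agg(F.map_from_entries(
            F.collect_list(F.struct("feature_key", "feature_value"))
         ).alias("features"))
)

# Write to Delta; the schema stays narrow no matter how many distinct features exist.
features_df.write.format("delta").mode("overwrite").saveAsTable("feature_store.user_features")

# Individual features are still easy to read back via map access.
spark.table("feature_store.user_features") \
     .select("entity_id", F.col("features")["clicks_7d"].alias("clicks_7d")) \
     .show()

If several raw values need to be combined per key, you would aggregate on (entity_id, feature_key) first and only then build the map.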

Perhaps worth mentioning: Delta Lake collects statistics for the first 32 columns of a table by default, so data skipping and query planning for columns beyond the first 32 will likely not be as effective as for the first 32 columns. https://docs.databricks.com/delta/optimizations/file-mgmt.html#data-skipping
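If a wide table really does need statistics on more columns, the delta.dataSkippingNumIndexedCols table property controls how many leading columns get statistics. The snippet below is only a sketch: the table name is reused from the example above, the value 64 is arbitrary, and raising it adds write overhead, so keeping frequently filtered columns among the leading ones is usually the cheaper fix.

# delta.dataSkippingNumIndexedCols sets how many leading columns Delta
# collects statistics for (default 32). Table name and value are illustrative.
spark.sql("""
    ALTER TABLE feature_store.user_features
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '64')
""")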


2 REPLIES

User16826994223
Honored Contributor III

There is no hard limit on the number of columns, but a single record should not exceed 20 MB.

