Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Max Columns for Delta table

User16783853906
Contributor III

Is there an upper limit or a recommended maximum number of columns for a Delta table?

1 ACCEPTED SOLUTION


User16783853906
Contributor III

Original answer posted by @Gray Gwizdz

This was a fun question to try and find the answer to! Thank you for that 🙂

I reviewed some of the most recent issues/bugs reported against Delta Lake and found a similar case where a user ran into performance problems with 1,000 columns (https://github.com/delta-io/delta/issues/479). However, there is a pending pull request (https://github.com/delta-io/delta/pull/584) in which they tested with 4,000 columns and saw much better performance.

I also reviewed internally and saw another approach that I would recommend here. That team was seeing slow write performance on a very wide table. Instead of defining thousands of columns, the architect packed most of the features into a single ArrayType column, which improved write performance significantly. They defined the intermediate state with the feature fields as a list of tuples, List[(key, value)], and the final output in the feature store as Map[key, aggregated_value] (see the sketch below).
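A minimal PySpark sketch of that pattern, under assumed names (user_id, feature_key, feature_value, and the target table feature_store.user_features are all hypothetical): the intermediate state holds one (key, value) pair per row, and the final table stores the aggregated features in a single MapType column rather than thousands of top-level columns.

```python
# Sketch of the "features as a map" pattern described above (all names hypothetical).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Intermediate state: one row per (entity, feature) pair rather than one column per feature.
intermediate = spark.createDataFrame(
    [("u1", "clicks_7d", 12.0), ("u1", "spend_7d", 40.5), ("u2", "clicks_7d", 3.0)],
    ["user_id", "feature_key", "feature_value"],
)

# Final output: collapse the pairs into a single MapType column per entity.
features = (
    intermediate.groupBy("user_id")
    .agg(
        F.map_from_entries(
            F.collect_list(F.struct("feature_key", "feature_value"))
        ).alias("features")
    )
)

# A narrow Delta table with one map column instead of thousands of top-level columns.
features.write.format("delta").mode("overwrite").saveAsTable("feature_store.user_features")
```

Individual features can then be read back with element_at(features, 'clicks_7d') or features["clicks_7d"] in a select.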

Perhaps worth mentioning: Delta Lake collects statistics for the first 32 columns of the table by default, so data skipping and query planning for columns outside the first 32 will likely not be as effective as for the first 32. https://docs.databricks.com/delta/optimizations/file-mgmt.html#data-skipping
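If the columns you filter on most often fall outside the first 32, one option is to raise the delta.dataSkippingNumIndexedCols table property (a sketch; the table name is hypothetical, and collecting statistics on more columns adds write-time overhead). Another is simply to keep the most frequently filtered columns among the first 32 in the schema.

```python
# Sketch: raise the number of leading columns Delta collects file-level statistics for.
# Table name is hypothetical; more indexed columns means more work at write time.
spark.sql("""
    ALTER TABLE feature_store.user_features
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '64')
""")
```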


2 REPLIES

User16826994223
Honored Contributor III

There is no limit on the number of columns, but a single record should not be larger than 20 MB.
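One rough way to sanity-check whether rows are approaching that size (an approximation via JSON serialization of a hypothetical DataFrame df, not the exact internal record size):

```python
import pyspark.sql.functions as F

# Approximate per-record size by serializing each row to JSON and measuring its length.
# This is only a rough proxy for the actual stored record size.
row_sizes = df.select(F.length(F.to_json(F.struct("*"))).alias("approx_bytes"))
row_sizes.agg(F.max("approx_bytes").alias("max_approx_record_bytes")).show()
```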

