06-08-2021 03:25 PM
Is there an upper limit or recommended maximum number of columns for a Delta table?
- Labels: Delta
Accepted Solutions
06-23-2021 02:35 PM
Original answer posted by @Gray Gwizdz
This was a fun question to try and find the answer to! Thank you for that 🙂
I reviewed some of the most recent issues/bugs reported with Delta Lake and found a similar case where a user ran into performance problems with 1000 columns (https://github.com/delta-io/delta/issues/479); however, there is a pending pull request where they tested with 4000 columns and saw much better performance (https://github.com/delta-io/delta/pull/584).
I also reviewed internally and saw another approach that I would recommend. That person was experiencing slow write performance on a very wide table. Instead of defining thousands of columns, the architect used a single array/map-typed column to hold most of the features, which improved write performance significantly: the intermediate state kept the feature fields as a list of tuples, List[(key, value)], and the final output in the feature store was a Map[key, aggregated_value]. A sketch of this pattern is below.
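A minimal PySpark sketch of that pattern, assuming made-up column names, values, and an example path (entity_id, feature_key, /tmp/feature_store are all hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Intermediate state: one row per (entity, feature) pair, i.e. List[(key, value)].
intermediate = spark.createDataFrame(
    [("user_1", "clicks", 3.0), ("user_1", "clicks", 2.0), ("user_2", "purchases", 1.0)],
    schema="entity_id STRING, feature_key STRING, feature_value DOUBLE",
)

# Final output: aggregate per key, then collapse into one Map[key, aggregated_value] column.
final = (
    intermediate
    .groupBy("entity_id", "feature_key")
    .agg(F.sum("feature_value").alias("agg_value"))
    .groupBy("entity_id")
    .agg(
        F.map_from_entries(
            F.collect_list(F.struct("feature_key", "agg_value"))
        ).alias("features")
    )
)

# One narrow Delta table instead of thousands of feature columns.
final.write.format("delta").mode("overwrite").save("/tmp/feature_store")
```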
Perhaps worth mentioning: Delta Lake collects statistics for the first 32 columns of the table by default, so data skipping and query planning on columns beyond the first 32 will likely not be as effective. See https://docs.databricks.com/delta/optimizations/file-mgmt.html#data-skipping. The number of indexed columns can be adjusted, as shown below.
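If the columns you filter on most often fall outside the first 32, one option is to raise the number of statistics-indexed columns via the documented delta.dataSkippingNumIndexedCols table property (at the cost of extra statistics collection on write). A sketch, where the table name and the value 64 are just examples:

```python
# delta.dataSkippingNumIndexedCols controls how many leading columns
# Delta collects min/max/null-count statistics for (default 32).
spark.sql("""
    ALTER TABLE my_wide_table
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '64')
""")
```

Alternatively, reordering the schema so frequently filtered columns come first keeps them inside the default statistics window.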
06-23-2021 01:13 AM
There is no hard limit on the number of columns, but a single record should not exceed 20 MB.