Max Columns for Delta table

User16783853906
Contributor III

Is there an upper limit or a recommended maximum number of columns for a Delta table?

1 ACCEPTED SOLUTION

User16783853906
Contributor III

Original answer posted by @Gray Gwizdz

This was a fun question to try and find the answer to! Thank you for that 🙂

I reviewed some of the most recent issues reported against Delta Lake and found a similar case where a user ran into performance problems with 1000 columns (https://github.com/delta-io/delta/issues/479). However, there is a pending pull request in which they tested with 4000 columns and saw much better performance (https://github.com/delta-io/delta/pull/584).

I also checked internally and saw another approach that I would recommend. One team was seeing slow write performance on a very wide table. Instead of defining thousands of columns, the architect packed most of the features into a single array-typed column, which improved write performance significantly: the intermediate state held the feature fields as a list of tuples, List[(key, value)], and the final output in the feature store kept them as Map[key, aggregated_value].
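A minimal PySpark sketch of that key/value layout, assuming spark is an existing SparkSession (for example in a Databricks notebook); the entity, feature, and table names here are purely illustrative, not from the original post:

from pyspark.sql import functions as F

# Intermediate state: one row per (entity, feature) pair instead of one column per feature.
long_df = spark.createDataFrame(
    [("user_1", "clicks_7d", 12.0),
     ("user_1", "spend_7d", 34.5),
     ("user_2", "clicks_7d", 3.0)],
    schema="entity_id string, feature_key string, feature_value double",
)

# Final output: one row per entity with a single Map[key, aggregated_value] column,
# built by collecting (key, value) structs and turning them into a map.
features_df = (
    long_df
    .groupBy("entity_id")
    .agg(F.map_from_entries(
            F.collect_list(F.struct("feature_key", "feature_value"))
         ).alias("features"))
)

# Write to Delta; the schema stays narrow no matter how many distinct features exist.
features_df.write.format("delta").mode("overwrite").saveAsTable("feature_store.user_features")

# Individual features are still easy to read back via map access.
spark.table("feature_store.user_features") \
     .select("entity_id", F.col("features")["clicks_7d"].alias("clicks_7d")) \
     .show()

If several raw values need to be combined per key, you would aggregate on (entity_id, feature_key) first and only then build the map.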

Perhaps worth mentioning: Delta Lake collects statistics for the first 32 columns of a table by default, so data skipping and query planning for columns beyond the first 32 will likely not be as effective as for the first 32 columns. https://docs.databricks.com/delta/optimizations/file-mgmt.html#data-skipping
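If a wide table really does need statistics on more columns, the delta.dataSkippingNumIndexedCols table property controls how many leading columns get statistics. The snippet below is only a sketch: the table name is reused from the example above, the value 64 is arbitrary, and raising it adds write overhead, so keeping frequently filtered columns among the leading ones is usually the cheaper fix.

# delta.dataSkippingNumIndexedCols sets how many leading columns Delta
# collects statistics for (default 32). Table name and value are illustrative.
spark.sql("""
    ALTER TABLE feature_store.user_features
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '64')
""")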


2 REPLIES

User16826994223
Honored Contributor III

There is no hard limit on the number of columns, but a single record should not exceed 20 MB.

