Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Getting "Data too long for column 'session_data'" creating a CACHE table

unj1m
New Contributor III

Hi, I'm trying to leverage CACHE TABLE to create temporary tables that are cleaned up at the end of the session.

In creating one of these, I'm getting "Data too long for column 'session_data'". The query I'm using isn't referencing a session_data column.

My guess is that Databricks is tracking cache tables in a session_data column somewhere.  Can anyone verify, or offer another explanation? 🙂

1 REPLY

stbjelcevic
Databricks Employee

Thanks for sharing the details—this is a common point of confusion with caching versus temporary objects in Databricks.

What’s likely happening

  • The error message “Data too long for column 'session_data'” is emitted by the metastore/metadata persistence layer, not by your SQL statement itself. It typically indicates that a serialized payload (for example, session or query metadata) exceeded the backing store’s column size limit, which often shows up with external Hive metastores (MySQL) when a value exceeds roughly 4000 characters for the default type.

  • Others have reported seeing exactly this message when using CACHE TABLE for intermediate results, even though their queries didn’t reference a column named session_data. That aligns with the error originating in internal tracking rather than your user tables/views.

  • Databricks SQL uses multiple caching layers (UI cache, remote result cache for serverless, Spark cache). These can store session or result metadata outside your schema, which is why you might see an internal column name like session_data in the error even if it isn’t in your query.

Clarifying CACHE TABLE vs. “temporary tables”

  • CACHE TABLE caches the data of a table or view in Spark’s storage (memory/disk) to speed up reuse; it does not create a new table, and it isn’t a metastore object tied to session cleanup. It’s best used when you will perform multiple actions over the same dataset within the lifetime of the session/warehouse.

  • If your goal is a session-scoped, automatically cleaned-up object, use a temporary view rather than trying to create a “temporary table.” Spark/Databricks supports session-scoped views, not session-scoped tables.
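
To make the distinction concrete, here is a minimal SQL sketch; the table and view names (sales.orders, recent_orders) are placeholders, not objects from your workspace:

    -- CACHE TABLE caches the data of an existing table or view in Spark storage;
    -- it does not create a new object, and nothing is registered for session cleanup.
    CACHE TABLE sales.orders;

    -- A temporary view is a session-scoped object: it is visible only to the current
    -- session and is dropped automatically when the session ends.
    CREATE TEMPORARY VIEW recent_orders AS
    SELECT *
    FROM sales.orders
    WHERE order_date >= '2024-01-01';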

Practical ways to avoid the error and achieve your intent

  • Prefer CREATE TEMP VIEW for session-scoped semantics and optionally couple it with CACHE TABLE to speed up repeated access during the same session. This keeps the “temporary” concern separate from the “cache” concern and avoids metastore writes that may trigger large metadata payloads (see the sketch after this list).

  • Avoid caching extremely large or very wide intermediate results if you suspect the error is triggered by oversized serialized session payloads. If you need reuse plus durability, materialize to a regular Delta table in a scratch schema and drop it at the end instead of relying on cache alone.

  • If you’re using an external Hive metastore (for example, MySQL) and you consistently see “Data too long for column,” confirm whether your metastore schema uses column types that cap around 4000 characters and consider adjusting the column type (for example, TEXT/MEDIUMTEXT in MySQL) or migrating to Unity Catalog to avoid this class of constraints.
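
Putting the first two suggestions together, here is a rough sketch; the schema and object names (scratch, staged_totals, raw.orders) are hypothetical placeholders:

    -- Pattern 1: session-scoped view, optionally cached for repeated access.
    CREATE TEMPORARY VIEW staged_totals AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw.orders
    GROUP BY customer_id;

    CACHE TABLE staged_totals;  -- speeds up repeated reads within this session

    -- Pattern 2: a durable intermediate result in a scratch schema, dropped explicitly
    -- at the end of the job or session.
    CREATE OR REPLACE TABLE scratch.staged_totals AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw.orders
    GROUP BY customer_id;

    -- ...queries that reuse scratch.staged_totals go here...

    DROP TABLE IF EXISTS scratch.staged_totals;

Pattern 1 keeps everything session-scoped and cleaned up automatically; Pattern 2 survives the session but requires the explicit DROP at the end.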

Why you saw “session_data”

  • The column name is from an internal store that tracks session or cached-result state, not your user schema. The message surfaces because that internal write exceeded the allowed size in its backing store.
