Yes, it's possible to manage storage costs in Databricks and Unity Catalog by targeting specific tables for different storage classes, but Unity Catalog does add complexity since it abstracts the direct S3 (or ADLS/GCS) object paths from you. Here’s a comprehensive approach to address the scenario you described:
Direct S3 Path Workarounds
- Unity Catalog Abstraction: Unity Catalog manages S3 locations internally for managed tables, so you have no direct control over, or visibility into, per-table prefixes when writing a fine-grained AWS S3 Lifecycle Policy. This makes table-type-specific lifecycle management tricky if you rely solely on S3 object-level policies.
- External (Unmanaged) Tables: If you register external tables (using CREATE TABLE ... LOCATION 's3://...'), you can give each type of table its own S3 path or prefix. You can then safely apply AWS S3 lifecycle policies to the prefix used by the "heavyweight" tables.
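For illustration, here is a minimal sketch of registering an external table under a dedicated prefix from a Databricks notebook (Python). The catalog, schema, table, and bucket/prefix names are placeholders, and it assumes a Unity Catalog external location and storage credential already cover that S3 path:

```python
# Minimal sketch: register an external Delta table under a dedicated
# "heavy" prefix so S3 lifecycle rules can target it precisely.
# All names and paths below are placeholders for your own environment.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.analytics.events_raw (
        event_id STRING,
        payload  STRING,
        event_ts TIMESTAMP
    )
    USING DELTA
    LOCATION 's3://my-bucket/heavy-tables/events_raw/'
""")
```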
Possible Solutions in Unity Catalog
1. Partitioning by External Table Location
- For new large tables, create them as external tables under a dedicated S3 prefix.
- Apply a lifecycle policy targeting that specific prefix.
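As a sketch of the lifecycle rule itself, the boto3 call below transitions objects under that prefix to a cheaper storage class after 30 days. The bucket, prefix, and rule ID are placeholders; note that put_bucket_lifecycle_configuration replaces the bucket's existing lifecycle rules, and that archive-tier classes (e.g. Glacier) can make live Delta data unreadable to queries, so STANDARD_IA or Intelligent-Tiering is usually the safer target:

```python
import boto3

# Transition everything under the heavy-tables/ prefix to Infrequent Access
# after 30 days. This call REPLACES the bucket's existing lifecycle config,
# so merge in any rules you already rely on.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "heavy-tables-to-standard-ia",
                "Filter": {"Prefix": "heavy-tables/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"}
                ],
            }
        ]
    },
)
```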
2. Table Tagging and Automation
- Use Unity Catalog table tags (or table properties) to annotate these table types.
- Write an automated Databricks job (using Python/Scala/Shell notebooks or Databricks Workflows) that:
  - Enumerates tables with the matching tag or property.
  - Resolves their actual storage URIs (possible through some API calls or logs, but not officially guaranteed; future Unity Catalog features may make this easier).
  - Moves their underlying data to the desired S3 prefix or storage class, then updates the table metadata.
- This approach is advanced and requires caution to avoid data consistency issues; a rough sketch of the first two steps follows below.
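As a non-authoritative sketch of the enumeration and resolution steps (Databricks notebook, Python): it assumes a hypothetical tag storage_tier = cold and relies on the system.information_schema.table_tags view and DESCRIBE DETAIL, whose availability and columns may vary by Databricks release, so verify them in your workspace first:

```python
# Enumerate tables carrying a hypothetical "storage_tier" = "cold" tag and
# resolve their underlying storage paths. Verify these system views/commands
# exist in your workspace before relying on this.
tagged = spark.sql("""
    SELECT catalog_name, schema_name, table_name
    FROM system.information_schema.table_tags
    WHERE tag_name = 'storage_tier' AND tag_value = 'cold'
""").collect()

for row in tagged:
    full_name = f"{row.catalog_name}.{row.schema_name}.{row.table_name}"
    # DESCRIBE DETAIL exposes the underlying storage location for Delta tables.
    location = spark.sql(f"DESCRIBE DETAIL {full_name}").collect()[0].location
    print(full_name, location)
    # Moving the data and re-pointing the table is the risky part mentioned
    # above -- automate it only with careful validation and backups.
```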
3. Contacting Databricks Support
- Databricks is aware of the need for finer-grained storage policies in Unity Catalog-managed environments.
- There may be preview features or recommended best practices that are not yet widely documented.
- Request guidance (and file a feature request if needed) for granular storage lifecycle control per table type.
Key Points and Limitations
- No Native Table-Level Policy: Unity Catalog currently lacks a built-in way to apply different S3 lifecycle rules per table type for managed tables.
- External Table Best Practice: If you need this flexibility now, use external tables with clearly separated S3 paths.
- Manual Table Management: For existing managed tables, it's not possible today to retroactively assign different storage classes without moving them out of Unity Catalog management and re-registering them as external tables (a rough sketch of that migration follows below).
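For completeness, a rough sketch of that migration for a single table, assuming Delta DEEP CLONE to an external location is available on your runtime (names and the S3 path are placeholders; validate the clone and permissions before dropping or renaming the original):

```python
# Copy a managed table's data to an external location, creating an external
# table you can then point consumers at. Placeholder names/paths throughout.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.analytics.events_raw_ext
    DEEP CLONE main.analytics.events_raw
    LOCATION 's3://my-bucket/heavy-tables/events_raw/'
""")
```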
In summary:
For table-level storage class control in Unity Catalog, use external tables and distinct S3 prefixes for the heavy-storage table types. Then, apply storage class or lifecycle policies in S3 to those prefixes. For fully managed tables, current Unity Catalog abstractions prevent this, so consider feature requests or consult Databricks for roadmap or workaround options.