Thanks for sharing this. It looks useful, especially for iterative notebook development where the expensive part is not just reading source files but recomputing a complex intermediate DataFrame after several joins or transformations.
I can see the value compared with the normal Spark cache or the Databricks disk cache: Spark cache is tied to the cluster and session, while disk cache mainly accelerates reads of remote data files and does not persist arbitrary intermediate query results as reusable DataFrames. Your approach of explicitly materializing the DataFrame to a persistent location or table could help a lot for EDA and repeated debugging loops. The Databricks documentation also describes disk cache as local, per-node caching of remote data files, not as a general cache for arbitrary subquery results: https://docs.databricks.com/aws/en/optimizations/disk-cache
How does the library decide that an existing cached DataFrame is still valid? For example, if the source Delta table changes or the notebook logic changes slightly, is the cache key based on the logical plan, a user-provided key, the source table versions, or the parameters?
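To make the question concrete, here is a minimal sketch of one possible invalidation scheme. It is not how your library works (I have not seen its internals); `build_cache_key` and its arguments are hypothetical names, and I am assuming the caller already knows the current Delta version of each source table (for example from `DESCRIBE HISTORY`).

```python
import hashlib
import json


def build_cache_key(user_key, source_versions, params):
    """Return a deterministic hex digest identifying one cached DataFrame.

    user_key:        hypothetical label chosen by the notebook author
    source_versions: dict of table name -> Delta table version (assumed known)
    params:          dict of parameters that influence the result
    """
    payload = {
        "user_key": user_key,
        "source_versions": source_versions,
        "params": params,
    }
    # sort_keys makes the serialization independent of dict insertion order,
    # so the same inputs always hash to the same key.
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# A new version of a source table produces a different key,
# so the old cached result would no longer be picked up.
key_v12 = build_cache_key("orders_enriched", {"main.sales.orders": 12}, {"region": "EU"})
key_v13 = build_cache_key("orders_enriched", {"main.sales.orders": 13}, {"region": "EU"})
print(key_v12 != key_v13)  # True
```

A key like this catches source and parameter changes but not edits to the notebook logic itself, which is why I am curious whether the logical plan is part of the key.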
Since Databricks now recommends against DBFS root and DBFS mounts for most Unity Catalog-enabled workspaces, it would be good to document the recommended storage location clearly, for example UC managed tables, external locations or volumes rather than legacy DBFS root.
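As an illustration of what the docs (or the library itself) could warn about, below is a rough sketch of a path check. `classify_storage_path` and its return labels are names I made up, and the prefix rules are a simple heuristic based on common path conventions (`/Volumes/...` for Unity Catalog volumes, `dbfs:/mnt/...` for mounts), not an official Databricks API.

```python
def classify_storage_path(path: str) -> str:
    """Heuristically label a cache location by its path prefix (illustrative only)."""
    p = path.strip()
    lowered = p.lower()

    # Direct cloud URIs, typically governed through UC external locations.
    if lowered.startswith(("s3://", "s3a://", "abfss://", "gs://")):
        return "cloud_uri"

    # Strip the dbfs:/ scheme or the /dbfs FUSE prefix, remembering we saw it.
    on_dbfs = False
    if lowered.startswith("dbfs:"):
        p = p[len("dbfs:"):]
        on_dbfs = True
    elif lowered.startswith("/dbfs/"):
        p = p[len("/dbfs"):]
        on_dbfs = True

    if p.startswith("/Volumes/"):
        return "uc_volume"
    if on_dbfs and p.startswith("/mnt/"):
        return "dbfs_mount"
    if on_dbfs:
        return "dbfs_root"
    return "unknown"


for candidate in ["/Volumes/main/sandbox/df_cache/orders", "dbfs:/mnt/lake/cache", "dbfs:/tmp/cache"]:
    print(candidate, "->", classify_storage_path(candidate))
```

Something like this could emit a warning when the configured location resolves to `dbfs_mount` or `dbfs_root`.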
If this answer resolves your question, could you please mark it as "Accept as Solution"? It will help other users quickly find the correct fix.
Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP