New remote (dbfs) caching python library

nito — Mon, 05 May 2025 07:05:49 GMT

I had trouble getting much speedup from either the Spark cache or the Databricks disk cache, and I think fast caching is essential when developing PySpark code iteratively in notebooks. So I developed a handy caching library for this, which has recently been open sourced: https://github.com/schibsted/dbfs-spark-cache. It adds support for remote caching through an explicit method on the PySpark DataFrame, something that was previously only supported for the SQL UI cache. Proper use of remote DBFS caching also seems to avoid the slow queries and poor worker utilization that you often get after complex queries with multiple joins.

I'd be interested to know if others in the Databricks community will find this useful.

Re: New remote (dbfs) caching python library

amirabedhiafi — Sun, 07 Jun 2026 13:59:37 GMT

Thanks for sharing this. It looks useful, especially for iterative notebook development where the expensive part is not just reading source files but recomputing a complex intermediate DataFrame after several joins or transformations.

I can see the value compared with the normal Spark cache or the Databricks disk cache: Spark cache is tied to the cluster and session, while disk cache mainly accelerates reads of remote data files and does not persist arbitrary intermediate query results as reusable DataFrames. Your approach of explicitly materializing the DataFrame to a persistent location or table could help a lot for EDA and repeated debugging loops. The Databricks documentation also describes disk cache as node-local caching of remote data files, not as a general cache for arbitrary subquery results: https://docs.databricks.com/aws/en/optimizations/disk-cache
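
To make it concrete what I mean by explicit materialization, here is a minimal sketch. It is not the library's API: `materialize` is a hypothetical helper, and the table name in the usage note is made up. It only assumes the standard PySpark calls `DataFrame.write.format(...).mode(...).saveAsTable(...)`, `spark.read.table(...)` and `spark.catalog.tableExists(...)`.

```python
def materialize(spark, df, table_name, reuse=True):
    """Write df to a Delta table once and return a DataFrame read from that table.

    Hypothetical helper for illustration, not part of dbfs-spark-cache.
    """
    if reuse and spark.catalog.tableExists(table_name):
        # A stored result already exists under this name: read it instead of
        # recomputing the joins. Nothing here checks whether it is stale.
        return spark.read.table(table_name)

    # Persist the computed result so it survives cluster restarts.
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

    # Reading it back means downstream queries start from a plain table scan
    # rather than from the original multi-join plan.
    return spark.read.table(table_name)
```

In a notebook this would be used roughly as `joined = materialize(spark, joined, "main.scratch.joined_orders")`, where the three-level name is only an example. Note that reusing by table name alone says nothing about whether the stored result still matches the sources.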

How does the library decide that an existing cached DataFrame is still valid? For example, if the source Delta table changes or if the notebook logic changes slightly, is the cache key based on the logical plan, user provided key, source table versions or params?
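
As an illustration of the kind of key I have in mind (I don't know what the library actually does), here is a sketch that combines the three candidates into one hash. `cache_key` and its arguments are hypothetical names. The plan text and the source table versions are passed in as plain values, since how to obtain them is exactly the open question; for Delta sources the version could, for example, come from `DESCRIBE HISTORY`.

```python
import hashlib
import json


def cache_key(plan_text, source_versions, params=None):
    """Derive a deterministic key from query logic, source versions and params.

    Hypothetical helper for illustration, not part of dbfs-spark-cache.
    plan_text:       string form of the query plan (assumed to be stable)
    source_versions: dict mapping table name to its current version number
    params:          optional dict of user parameters that affect the result
    """
    payload = json.dumps(
        {"plan": plan_text, "sources": source_versions, "params": params or {}},
        sort_keys=True,  # dict ordering must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key like this, a new commit to any source table or any change in the plan text gives a different key, so a stale cached result would simply never be looked up. The weak point is the plan text: if it contains unstable details such as generated expression IDs, the key would change on every run.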

Since Databricks now recommends against DBFS root and DBFS mounts for most Unity Catalog-enabled workspaces, it would be good to document the recommended storage location clearly, for example UC managed tables, external locations or volumes rather than legacy DBFS root.
