Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

New remote (dbfs) caching python library

nito
New Contributor II

I had trouble getting much speedup at all from the Spark cache or the Databricks disk cache, which I think is essential when developing PySpark code iteratively in notebooks. So I developed a handy caching library for this, which has recently been open sourced; see https://github.com/schibsted/dbfs-spark-cache . It adds support for remote caching through an explicit method on the PySpark DataFrame, which previously was only supported for the SQL UI cache. Proper use of remote DBFS caching also seems to avoid the slow queries and poor worker utilization that you often get after complex queries with multiple joins.

I'd be interested to know if others in the Databricks community will find this useful.

1 REPLY

amirabedhiafi
Contributor

Thanks for sharing this. It looks useful, especially for iterative notebook development where the expensive part is not just reading source files but recomputing a complex intermediate DataFrame after several joins or transformations.

I can see the value compared with the normal Spark cache or the Databricks disk cache: the Spark cache is cluster/session dependent, while the disk cache mainly accelerates reads of remote data files and does not really persist arbitrary intermediate query results as reusable DataFrames. Your approach of explicitly materializing the DataFrame to a persistent location/table could help a lot for EDA and repeated debugging loops. The Databricks documentation also describes the disk cache as local, per-node caching of remote data files, not as a general cache for arbitrary subquery results: https://docs.databricks.com/aws/en/optimizations/disk-cache
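
To make the "explicit materialization" idea concrete, here is a minimal sketch of the general pattern as I understand it. It is not the library's API: `cached_table` and the table name are hypothetical names I made up for illustration, and it only uses the standard PySpark calls `spark.catalog.tableExists`, `DataFrame.write.mode(...).saveAsTable(...)` and `spark.table(...)`.

```python
def cached_table(spark, df, table_name):
    """Return a DataFrame backed by a persisted table, writing it on first use.

    Hypothetical helper for illustration only (not part of dbfs-spark-cache).
    `table_name` is a caller-chosen name such as "main.scratch.joined_orders".
    """
    if not spark.catalog.tableExists(table_name):
        # First call: run the expensive plan once and persist the result.
        df.write.mode("overwrite").saveAsTable(table_name)
    # Read back from the table so later actions do not re-run the original plan.
    return spark.table(table_name)
```

A notebook cell would then call something like `joined = cached_table(spark, expensive_join_df, "main.scratch.joined_orders")` and keep working on `joined`. Note that this sketch never invalidates the table, so stale results are possible if the inputs change.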

How does the library decide that an existing cached DataFrame is still valid? For example, if the source Delta table changes or if the notebook logic changes slightly, is the cache key based on the logical plan, user provided key, source table versions or params?
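
To illustrate what I mean by a composite key, here is a rough sketch of one possible scheme. I am not claiming this is what the library does: `cache_key`, `plan_text`, `source_versions` and `params` are all hypothetical names, and it assumes you can obtain a stable text rendering of the query plan and the current version of each source Delta table (for example from `DESCRIBE HISTORY`).

```python
import hashlib
import json


def cache_key(plan_text, source_versions, params=None):
    """Build a deterministic key from the plan, source table versions and params.

    Hypothetical sketch: any change to the plan text, to a source table
    version, or to a parameter value produces a different key, so a stale
    cache entry is simply never looked up again.
    """
    payload = json.dumps(
        {"plan": plan_text, "sources": source_versions, "params": params or {}},
        sort_keys=True,  # make the key independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key like this, a new commit to a source table or a small change in the notebook logic would lead to a cache miss instead of silently returning old data, at the cost of leaving orphaned cache entries that need cleaning up.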

Since Databricks now recommends against DBFS root and DBFS mounts for most Unity Catalog-enabled workspaces, it would be good to document the recommended storage location clearly, for example UC managed tables, external locations or volumes rather than legacy DBFS root.

If this answer resolves your question, could you please mark it as “Accept as Solution”? It will help other users quickly find the correct fix.

Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP