Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Stop Cache in free edition

Hritik_Moon
New Contributor III

Hello,

I am using Databricks Free Edition. Is there a way to turn off IO caching?

I am trying to learn optimization and can't see any difference in query run time with caching enabled.

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Esteemed Contributor III

Hi @Hritik_Moon ,

I guess you cannot. To disable the disk cache you need the ability to run the following command:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

But serverless compute does not support setting most Spark properties for notebooks or jobs. The following are the only properties you can configure:

[screenshot: list of Spark properties configurable on serverless compute]

So, if you want a proper environment to learn Apache Spark optimization, use an OSS Apache Spark Docker container as an alternative.


6 REPLIES


Thanks, I have no prior experience with Docker or with getting Spark set up, but I guess YouTube will help 😁.

szymon_dybczak
Esteemed Contributor III

Yep, it's really simple to set up. As an added benefit, you will have full control over your environment 😄 Here's a YouTube video that shows how to set it up:

How to Run a Spark Cluster with Multiple Workers Locally Using Docker

Thanks, I will be back later with additional questions 😊.

szymon_dybczak
Esteemed Contributor III

Sure, one suggestion though. If your next question is related to caching, ask it here; but if it's something completely unrelated to this topic, please start a new thread.
Usually, all questions and answers should relate to the given thread. That way it's much easier for others to find what they're looking for. Also, if someone's answer solved your issue or helped you, try to mark that answer as the solution for the thread.

Prajapathy_NKR
New Contributor II

@Hritik_Moon 

1. Check whether your data is cached; you can see this in the Spark UI > Storage tab.

2. If it is not cached, add an action after you cache, e.g. df.count(). Data is cached on the first action encountered after the cache call. Then check the Spark UI again.

3. With only one action you won't see any difference. With multiple actions, the transformations leading up to your cached DataFrame get skipped on later jobs; you can see these skipped stages in the DAG.
