3 weeks ago
Hello,
I am using Databricks Free Edition. Is there a way to turn off IO caching?
I am trying to learn optimization and can't see any difference in query run time with caching enabled.
3 weeks ago
Hi @Hritik_Moon ,
I guess you cannot. To disable the disk cache you need to be able to run the following command:
spark.conf.set("spark.databricks.io.cache.enabled", "false")
But serverless compute does not support setting most Spark properties for notebooks or jobs, and this is not one of the few properties you can configure.
So, if you want a proper environment in which to learn Apache Spark optimization, use an OSS Apache Spark Docker container as an alternative.
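For reference, a minimal docker-compose sketch for a local master + worker setup, based on the Bitnami Spark image (image tag, memory, and core settings here are illustrative assumptions, adjust to taste):

```yaml
# docker-compose.yml - minimal local Spark cluster (sketch)
services:
  spark:
    image: docker.io/bitnami/spark:3.5
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # master RPC port for workers
  spark-worker:
    image: docker.io/bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
```

Run `docker compose up` and the master UI should be reachable at localhost:8080; scale workers with `docker compose up --scale spark-worker=2`.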
3 weeks ago
Thanks, I have no prior experience with Docker or with getting Spark running, but I guess YouTube will help 🙂.
3 weeks ago
Yep, it's really simple to set up. As an added benefit, you will have full control over your environment 🙂 Here is a YouTube video that shows how to set it up:
How to Run a Spark Cluster with Multiple Workers Locally Using Docker
3 weeks ago
Thanks, I will be back later with additional questions 🙂.
3 weeks ago - last edited 3 weeks ago
Sure, one suggestion though: if your next question is related to caching, ask it here. But if it's something completely unrelated to this topic, please start a new thread.
Usually, all questions and answers should relate to the given thread; that way it's much easier for others to find what they're looking for. Also, if someone's answer solved your issue or helped you, please mark that answer as the solution for the thread.
3 weeks ago
1. Check whether your data is actually cached; you can see this in the Spark UI > Storage tab.
2. If it is not cached, add an action after you call cache(), e.g. df.count(). Data is materialized into the cache by the first action that runs after it. Then check the Spark UI again.
3. If you have only one action, you won't see any difference. But with multiple actions, the transformations upstream of your cached DataFrame are skipped on subsequent actions. You can see these skipped stages in the DAG visualization.