Hello Databricks Community🙂!
I am excited to share my first blog 🚀 post with you all. This is a short, beginner-friendly introduction to the concept of shared variables in Apache Spark. I hope it helps those who are new to Spark understand why shared variables matter and how to use them effectively.
When you pass a function like filter() to Spark, it is executed on the worker nodes in the cluster. The function can read variables defined outside of it, but any changes it makes to those variables are not sent back to the driver program. This is because each task running on a worker node operates on its own copy of the variables, and these copies are never synchronized with the driver.
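Here is a minimal PySpark sketch of that pitfall (the variable and function names are my own, just for illustration): the driver's counter stays at 0 because each task increments its own copy.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClosureDemo").getOrCreate()
sc = spark.sparkContext

counter = 0  # defined in the driver program

def count_element(x):
    global counter
    counter += 1  # updates the task's own copy, not the driver's variable

sc.parallelize([1, 2, 3, 4, 5]).foreach(count_element)

print(counter)  # still 0 -- the workers' updates never reach the driver
```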
Spark provides two kinds of shared variables to work around this drawback: accumulators, which let tasks on the workers add to a value that the driver can then read back, and broadcast variables, which efficiently ship read-only data to every worker.
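A quick sketch of both, again with illustrative names of my own; the accumulator makes the counting example above actually work, and the broadcast variable shares a lookup dictionary with all tasks:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SharedVariablesDemo").getOrCreate()
sc = spark.sparkContext

# Accumulator: tasks add to it, Spark merges the updates for the driver.
counter = sc.accumulator(0)

def count_element(x):
    counter.add(1)  # safe to update from tasks

sc.parallelize([1, 2, 3, 4, 5]).foreach(count_element)
print(counter.value)  # 5 -- the merged result is visible on the driver

# Broadcast variable: read-only data shipped once per executor,
# instead of being serialized with every task's closure.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

result = sc.parallelize(["a", "b", "c"]).map(lambda k: lookup.value[k]).collect()
print(result)  # [1, 2, 3]
```

Note the asymmetry: accumulators carry values from the workers back to the driver, while broadcast variables carry data from the driver out to the workers.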
🤝 Let's connect, engage, and grow together! I'm eager to hear your thoughts, experiences, and perspectives, and I look forward to your feedback and to discussions with the community.
Thank you for your support!