Why do we use shared variables in Spark?

Yogic24 — Fri, 28 Jun 2024 12:21:02 GMT

Hello Databricks Community🙂!

I am excited to share my first blog 🚀 post with you all. This is a small and basic introduction to the concept of shared variables in Apache Spark. I hope this post will help those who are new to Spark understand why shared variables are important and how to use them effectively.

When you pass a function like filter() to Spark, it's executed on the worker nodes in the cluster. This function can indeed access variables defined outside of it, but the changes made to those variables are not reflected back to the driver program automatically. This is because each task running on a worker node operates on its own copy of the variables, and these copies are not automatically synchronized with the variables in the driver program.
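To make the copy behaviour concrete, here is a rough sketch in plain Python. It is not Spark itself: it only imitates the idea that a variable is serialized and each task works on its own deserialized copy. The names `counter`, `run_task` and `partitions` are made up for this illustration.

```python
import pickle

# A variable that lives in the "driver" program.
counter = {"matches": 0}

def run_task(partition, state_bytes):
    # Each simulated task deserializes its own private copy of the variable.
    state = pickle.loads(state_bytes)
    for x in partition:
        if x % 2 == 0:
            state["matches"] += 1  # updates the copy only
    return state["matches"]

# Simulate shipping the variable to two tasks, one per partition.
shipped = pickle.dumps(counter)
partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]
per_task_counts = [run_task(p, shipped) for p in partitions]

print(per_task_counts)     # each task counted matches in its own copy
print(counter["matches"])  # the driver's variable was never updated
```

Every task sees the value the variable had when it was shipped, but nothing it writes travels back, which is the gap shared variables are meant to cover.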

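Accumulators and broadcast variables are Spark's two kinds of shared variables, and they address this limitation in different directions. An accumulator lets tasks on the workers add to a value that the driver can read back once an action has finished, so updated values do reach the driver program. A broadcast variable goes the other way: it sends a read-only value from the driver to each executor once, instead of shipping a copy with every task. It does not return updates to the driver.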
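As a minimal PySpark sketch of the two shared variables working together: the function below broadcasts a lookup set to the workers (read-only) and uses an accumulator to count, back on the driver, how many words were not in that set. The function name `count_unknown_words` and its arguments are hypothetical, and it assumes you already have a `SparkContext` to pass in as `sc`.

```python
def count_unknown_words(sc, words, known_words):
    """Count words missing from known_words; sc is an existing SparkContext."""
    # Broadcast: a read-only value sent from the driver to the executors.
    known = sc.broadcast(set(known_words))
    # Accumulator: tasks can only add to it; the driver reads the total.
    unknown = sc.accumulator(0)

    def check(word):
        if word not in known.value:
            unknown.add(1)

    # foreach is an action, so the tasks actually run here.
    sc.parallelize(words).foreach(check)
    return unknown.value
```

In an environment where a `SparkContext` named `sc` is available, a call could look like `count_unknown_words(sc, ["spark", "flink", "hive"], ["spark"])`. Updating the accumulator inside an action such as `foreach`, rather than inside a transformation such as `map`, avoids double counting if a task is re-executed.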

🤝 Let's connect, engage, and grow together! I'm eager to hear your thoughts, experiences, and perspectives. I look forward to your feedback and to discussing this with the community.

Thank you for your support!

Re: Why do we use shared variables in Spark?

RishabhTiwari07 — Mon, 01 Jul 2024 17:34:35 GMT

Hi @Yogic24 ,

Welcome to the Databricks Community! Thank you for sharing your first blog post with us. I am sure it will help our community members. Thank you for your contribution and support!

Thanks,
Rishabh
