Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Using a cluster of type SINGLE_USER to run parallel python tasks in one job

oye
New Contributor II

Hi, 

I have set up a job with multiple Spark Python tasks running in parallel. I have configured only one job cluster: single node, data security mode SINGLE_USER, Databricks Runtime 14.3.x-scala2.12.

These parallel Spark Python tasks share some similar variable names, but they are not technically global variables; everything is defined inside one main function per file.

Will the Python tasks somehow share these variables since they run on the same cluster? Can this ever happen on a Databricks cluster?

1 ACCEPTED SOLUTION

Accepted Solutions

Coffee77
Contributor III

In my case, we have some jobs configured in a similar way and no issues so far. We are in fact leveraging global temp views at the cluster level to improve performance 🙂


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData


4 REPLIES

Coffee77
Contributor III

Not sure I understand completely, but if you are running parallel tasks, each executed in its own notebook with the same variable names, the answer is no. The scope of those variables is the Spark session (or notebook), not the cluster.
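A minimal plain-Python sketch of why this is safe (not Databricks-specific; the task names are made up): each task's variables live inside its own main function, so identical names never collide.

```python
# Two "tasks" that each define a variable named `threshold`
# inside their own main function.

def task_a_main():
    threshold = 10          # local to this task's main()
    return threshold * 2

def task_b_main():
    threshold = 99          # same name, completely separate variable
    return threshold + 1

print(task_a_main())  # 20
print(task_b_main())  # 100
# Neither call sees or mutates the other's `threshold`.
```

On a job cluster the isolation is stronger still, since each Spark Python task runs as its own Python process rather than just its own function scope.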

To share "data" at the cluster level you can use cluster-scoped environment variables, global temp views, Databricks secrets for confidential data, or even shared files.

 


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

oye
New Contributor II

Hi thanks for replying!

In my case, I am running parallel tasks of type Spark Python task in a Lakeflow job. This is a screenshot of the setup:

(attached screenshot: oye_0-1763974540051.png)

Aside from the fact that the tasks will share the same resources and thus might run slower, I wonder if there could be any other problem with cluster sharing.

But going by what you said, there should not be any problem with my setup.


Raman_Unifeye
Contributor III

@oye - Each variable's scope is local to the individual task, and variables do not interfere across tasks even if the underlying cluster is the same. In fact, the need normally arises the other way round, when you have to share a variable across tasks; for that, use the solutions mentioned by @Coffee77 - a global temp view or cluster-scoped environment variables.
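For the environment-variable route, a minimal sketch (the variable name `SHARED_CONFIG_PATH` and the default path are hypothetical; you would define the variable under the cluster's Advanced options, after which every task on that cluster can read it):

```python
import os

# Read a cluster-scoped environment variable from any task on the cluster.
# Falls back to a default when the variable is not set (e.g. local testing).
config_path = os.environ.get("SHARED_CONFIG_PATH", "/dbfs/tmp/defaults.json")
print(config_path)
```

Note this only shares static configuration fixed at cluster start; for data produced at runtime, a global temp view or a shared file is the better fit.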


RG #Driving Business Outcomes with Data Intelligence