
Databricks Python function achieving Parallelism

sathya08
New Contributor II

Hello everyone,

I have a very basic question regarding Databricks Spark parallelism.

I have a Python function called inside a for loop, so I believe it is running sequentially.

The Databricks cluster has Photon enabled and runs Databricks Runtime 15.x. Does that mean the driver will run this in parallel even though it is in a for loop, or do I need to introduce something to make the function run in parallel?

I need your help understanding the above, and if I do need to introduce parallelism explicitly, how do I do it?

Also, how do I achieve it based on the total executor cores in the cluster? (I read that executor cores are responsible for parallelism.)

Please correct me if my understanding is wrong.

Thanks

Sathya

2 REPLIES

saurabh18cs
Contributor II

In Spark, the level of parallelism is determined by the number of partitions and the number of executor cores. Each task runs on a single core, so having more executor cores allows more tasks to run in parallel.
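As a quick illustration (a minimal sketch using standard PySpark APIs; the numbers are only examples), you can check how many tasks the cluster can run at once and control the partition count:

# Default parallelism is roughly the total number of executor cores in the cluster
print(spark.sparkContext.defaultParallelism)

# An RDD with 8 partitions allows up to 8 tasks to run at the same time
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
print(rdd.getNumPartitions())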

A plain Python for loop runs sequentially on the driver; Photon speeds up Spark query execution but does not parallelize driver-side Python code. To achieve parallelism, you need to introduce it explicitly, for example with Spark's map operation on an RDD, with DataFrame transformations, or with Python's concurrent.futures module.

1) If your function can be applied to the elements of an RDD or DataFrame, you can use Spark's map operation to run it in parallel across the cluster.

# Example data distributed as an RDD across the cluster
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Define your function
def my_function(x):
    return x * x

# Apply the function in parallel; each partition is processed by a separate task
result_rdd = rdd.map(my_function)

# Collect the results back to the driver
results = result_rdd.collect()
print(results)

2) If you need to run a Python function in parallel and it doesn't fit well with Spark's map operation, you can use the concurrent.futures module. The threads run on the driver, but any Spark calls made inside the function are still executed on the cluster.

import concurrent.futures

# Define your function
def my_function(x):
    return x * x

# Example data
data = [1, 2, 3, 4, 5]

# Use ThreadPoolExecutor (or ProcessPoolExecutor for CPU-bound work)
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(my_function, data))

print(results)

sathya08
New Contributor II

Thank you for your reply. My code is not writing to any target; it is actually running OPTIMIZE and VACUUM on all the tables in a catalog. Currently the for loop takes one table at a time and performs the actions sequentially.

Can this be parallelized using the concurrent.futures module, or are there other ways of doing it?
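Something like the rough sketch below is what I have in mind. The catalog and schema names are placeholders, and I am assuming a ThreadPoolExecutor fits here because each call mostly waits on Spark:

import concurrent.futures

# Placeholder catalog/schema; list the Delta tables to maintain
tables = [row.tableName for row in spark.sql("SHOW TABLES IN my_catalog.my_schema").collect()]

def maintain(table_name):
    # Each call submits Spark jobs, so the heavy work still runs on the executors
    spark.sql(f"OPTIMIZE my_catalog.my_schema.{table_name}")
    spark.sql(f"VACUUM my_catalog.my_schema.{table_name}")
    return table_name

# A few threads let several tables be maintained concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for finished in executor.map(maintain, tables):
        print(f"done: {finished}")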

Thanks

Sathya