How to ensure PySpark UDF execution is distributed across worker nodes
10-09-2024 03:05 AM
Hi,
I have the following Databricks notebook code defined:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
pyspark_dataframe = create_pyspark_dataframe(some_input_data)
MyUDF = udf(myfunc, StringType())
pyspark_dataframe = pyspark_dataframe.withColumn('UDFOutput', MyUDF(input_data_columns))
output_strings = [x["UDFOutput"] for x in pyspark_dataframe.select("UDFOutput").collect()]
I'm running this notebook on a cluster with multiple worker nodes. How can I ensure that the UDF execution is distributed evenly across the worker nodes?
Kind regards,
Pim
Labels:
- Spark
1 REPLY
11-01-2024 06:59 AM
@pjv Can you please try the following? You'll basically want to have more than a single partition, since Spark parallelizes UDF execution at the partition level (each partition becomes one task that can run on a different executor):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Initialize Spark session (if not already done)
spark = SparkSession.builder.appName("AppName").getOrCreate()
# Create a PySpark DataFrame from your input data
pyspark_dataframe = create_pyspark_dataframe(some_input_data)
# Repartition the DataFrame to ensure even distribution across worker nodes
num_partitions = 4 # Adjust based on your cluster size
pyspark_dataframe = pyspark_dataframe.repartition(num_partitions)
# Define your UDF
MyUDF = udf(myfunc, StringType())
# Apply the UDF to the DataFrame
pyspark_dataframe = pyspark_dataframe.withColumn('UDFOutput', MyUDF(input_data_columns))
# Collect the results
output_strings = [x["UDFOutput"] for x in pyspark_dataframe.select("UDFOutput").collect()]
# Confirm the distribution of the UDF execution in the Spark UI (Stages tab).
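As a sanity check, here is a minimal sketch of how you could verify the distribution before running the UDF. It assumes nothing about your data (spark.range is used as a stand-in for the hypothetical create_pyspark_dataframe helper above): it tags each row with its partition ID via spark_partition_id() and counts rows per partition. Roughly equal counts mean the UDF work is split into evenly sized tasks that Spark can schedule across executors:
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id
spark = SparkSession.builder.appName("AppName").getOrCreate()
# Stand-in for your real input data
pyspark_dataframe = spark.range(1000)
# A common starting point for num_partitions is the cluster's default
# parallelism (roughly the total number of cores across all executors)
num_partitions = spark.sparkContext.defaultParallelism
pyspark_dataframe = pyspark_dataframe.repartition(num_partitions)
# Count rows per partition; each partition becomes one task
(pyspark_dataframe
 .withColumn("partition_id", spark_partition_id())
 .groupBy("partition_id")
 .count()
 .orderBy("partition_id")
 .show())
You can also open the Spark UI's Stages tab while the UDF stage is running to confirm that its tasks are executing on multiple worker nodes.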

