Hi,
I have the following Databricks notebook code defined:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

pyspark_dataframe = create_pyspark_dataframe(some input data)
MyUDF = udf(myfunc, StringType())
pyspark_dataframe = pyspark_dataframe.withColumn('UDFOutput', MyUDF(input data columns))
output_strings = [x["UDFOutput"] for x in pyspark_dataframe.select("UDFOutput").collect()]
I'm running this notebook on a cluster with multiple worker nodes. How can I ensure that the UDF execution is distributed evenly across the worker nodes?
Kind regards,
Pim