Hubert-Dudek
Databricks MVP

That UDF code will run on the driver, so it's better not to use it for such a big dataset. What you need is a vectorized pandas UDF: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
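
To give a rough idea of the shape, here is a minimal sketch of a Series-to-Series pandas UDF. I don't know your exact logic, so the temperature conversion, the function names, and the column names `temp_f` / `temp_c` are all made up for illustration; swap in your own transformation. It assumes Spark 3.0+ (type-hint style pandas UDFs) with PyArrow installed.

```python
import pandas as pd


def fahrenheit_to_celsius(temps: pd.Series) -> pd.Series:
    # Hypothetical transformation: works on a whole batch (a pandas Series)
    # at once instead of being called once per row.
    return (temps - 32) * 5.0 / 9.0


def add_celsius_column(df, source_col="temp_f", target_col="temp_c"):
    """Apply the batch function to a Spark DataFrame as a pandas UDF.

    Column names are placeholders, not taken from the original question.
    """
    # Imported here so the batch function above can be tried with pandas alone.
    from pyspark.sql.functions import pandas_udf

    # The Series -> Series type hints tell Spark this is a scalar pandas UDF.
    to_celsius = pandas_udf(fahrenheit_to_celsius, returnType="double")
    return df.withColumn(target_col, to_celsius(df[source_col]))
```

You would then call something like `add_celsius_column(your_df)` on your Spark DataFrame. Keeping the batch logic in a plain pandas function also lets you check it on a small `pd.Series` before running it on the full dataset.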


My blog: https://databrickster.medium.com/
