How to process a large delta table with UDF ?

Constantine — Thu, 24 Mar 2022 15:39:56 GMT

I have a delta table with about 300 billion rows. Now I am performing some operations on a column using UDF and creating another column

My code is something like this

def my_udf(data):
    return pass
       
 
udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table """)
data = data.withColumn('new_column', udf_func(data.value))

The issue now is this take a long amount of time as Spark will process all 300 billion rows and then write the output. Is there a way where we can do some Mirco batching and write output of those regularly to the output delta table

Re: How to process a large delta table with UDF ?

Hubert-Dudek — Thu, 24 Mar 2022 15:52:10 GMT

That udf code will run on driver so better not use it for such a big dataset. What you need is vectorized pandas udf https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html

topic Re: How to process a large delta table with UDF ? in Data Engineering

How to process a large delta table with UDF ?

Re: How to process a large delta table with UDF ?