How to process a large delta table with UDF ?

Constantine · ‎03-24-2022

I have a delta table with about 300 billion rows. Now I am performing some operations on a column using UDF and creating another column

My code is something like this

def my_udf(data):
    return pass
       
 
udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table """)
data = data.withColumn('new_column', udf_func(data.value))

The issue now is this take a long amount of time as Spark will process all 300 billion rows and then write the output. Is there a way where we can do some Mirco batching and write output of those regularly to the output delta table