Update set in foreachBatch
10-28-2024 03:15 AM - edited 10-28-2024 03:15 AM
I need to track the codes of records ingested in a foreachBatch function and pass them as a task value, so downstream tasks can take action based on this output. What would be the best approach to achieve that? I currently have the following solution, but sometimes it just doesn't fill the set, and I can see that the task value "codes" is empty...
codes = set()

def foreach_func(df, batch_id):
    codes.update({code.ColCode for code in df.select("ColCode").distinct().collect()})
    # Additional logic of inserting df data into tables
    ...

(
    input_df.writeStream
    .trigger(availableNow=True)
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_location)
    .option("badRecordsPath", errors_path)
    .foreachBatch(foreach_func)
    .start()
    .awaitTermination()
)

dbutils.jobs.taskValues.set(key="codes", value=list(codes))
10-28-2024 06:16 AM
I found out it is related to the Shared cluster access mode. When I use Single User mode, it all works fine. Furthermore, using an Accumulator does not help either...
11-28-2024 03:10 PM
@skarpeck does your input df contain any filters? The empty codes variable could be caused by empty micro-batches.
Please check numInputRows in your query's Stream Monitoring Metrics, and verify whether there are input rows for the batch IDs you're observing that lead to no data in codes.
Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
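The check above can be scripted: a PySpark streaming query exposes `query.recentProgress`, a list of progress dicts that include `batchId` and `numInputRows`. A small sketch (the helper name is illustrative) that flags the micro-batches which ingested nothing:

```python
def empty_batches(progress_entries):
    # Each entry mirrors one element of query.recentProgress;
    # numInputRows == 0 means the micro-batch saw no input data.
    return [p["batchId"] for p in progress_entries if p.get("numInputRows", 0) == 0]
```

For example, after `query = input_df.writeStream...start()` and `query.awaitTermination()`, `empty_batches(query.recentProgress)` lists the batch IDs that could not have contributed any codes.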
12-11-2024 02:17 AM
Another approach is to persist the collected codes in a Delta table and then read from this table in downstream tasks.
Make sure to add ample logging and counts.
Checkpointing would also help if you suspect the counts in the set do not match what you see under the task value key "codes".
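A sketch of that Delta-table approach inside foreachBatch, under some assumptions: the table name `ingested_codes` is hypothetical, and `spark` is the session that Databricks notebooks provide globally. Because each micro-batch writes the table itself, the codes survive regardless of cluster access mode:

```python
def make_code_rows(codes, batch_id):
    # Pure helper: shape the distinct codes into (code, batch_id) records.
    return [{"code": c, "batch_id": batch_id} for c in sorted(codes)]

def foreach_func(df, batch_id):
    codes = {row.ColCode for row in df.select("ColCode").distinct().collect()}
    if codes:  # skip empty micro-batches
        (spark.createDataFrame(make_code_rows(codes, batch_id))  # `spark`: notebook session
            .write.format("delta").mode("append")
            .saveAsTable("ingested_codes"))                      # hypothetical table name
    # ... existing logic of inserting df data into tables ...
```

Downstream tasks can then read the codes back with spark.read.table("ingested_codes"), filtering on batch_id or a run identifier if several runs share the table.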

