Hey guys, after some successful data preprocessing without any errors, i have a final dataframe shape with the shape of ~ (200M, 150). the cluster i am using has sufficient ram + cpus + autoscaling, all metrics look fine after the job was done.
The problem that i am facing is that only approx half of my data is written to the delta table. is there any thoughts on as to why this is happening? I'd really appreciate some guidance here! here is my code snippet:
repartition_df = df.repartition(<num_cores*2; or 1>).persist() # i also tried without using df persist but no luck either, none of them works
repartition_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(output_table)
chunked_df.unpersist()