@Chetan Kardekar:
Hubert's answer is not relevant in this case. It might be correct if someone is working in a data warehousing scenario with primary keys and doing everything in SQL, but programmatic API code doesn't usually use MERGE statements.
I have seen an open pull request for this functionality and a lot of people are asking for the same thing, but I'm not sure why the engineering team is hesitant to implement it. I have written some hacky code to do it: the returned partition predicate can be fed directly into the replaceWhere option of df.write.format("delta").mode("overwrite").option("replaceWhere", writepar).save().
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType


def _get_partitions(spark: SparkSession, df, partition_col_list):
    """
    Build a replaceWhere predicate covering every partition present in the input dataframe,
    based on the partition columns.
    :param spark: The spark session.
    :param df: input dataframe.
    :param partition_col_list: the list of partition columns in the input dataframe.
    :return: predicate string of the form "(col1='v1' AND col2='v2') OR (...)"
    """
    # Cast date/timestamp partition columns to string so their values render cleanly in the predicate.
    for name, dtype in df.dtypes:
        if dtype in ('date', 'timestamp'):
            df = df.withColumn(name, F.col(name).cast(StringType()))
    # One Row per distinct combination of partition values present in the dataframe.
    partitions = df.select(partition_col_list).distinct().collect()
    # Each partition becomes "(col1='v1' AND col2='v2')"; the partitions are OR-ed together.
    clauses = [
        "(" + " AND ".join("{}='{}'".format(col, row[col]) for col in partition_col_list) + ")"
        for row in partitions
    ]
    write_par = " OR ".join(clauses)
    print("write_par: {}".format(write_par))
    return write_par
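
For reference, a minimal usage sketch. The dataframe name, partition columns and table path below are made up for illustration; only the partitions actually present in updates_df get overwritten:

# Hypothetical example: updates_df, partition columns and the save path are placeholders.
partition_cols = ["year", "month"]
writepar = _get_partitions(spark, updates_df, partition_cols)

(updates_df.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", writepar)
    .save("/mnt/delta/my_table"))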