Is there any way to overwrite a partition in a Delta table without specifying each and every partition in replaceWhere? For non-dated partitions, this is really a mess with Delta tables.

Maverick1
Valued Contributor II

Most of my DE teams don't want to adopt Delta because of these glitches.

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

@Saurabh Verma, Delta is ACID and has a commit log, schema, etc.; that's why it is different. So when you delete all the data related to a partition, it stays there, as the table still keeps its history. INSERT OVERWRITE can overwrite data only in a specified partition (but you need to specify it), as in the statement below. The best way is to use MERGE when you have a unique id and the partition columns in the ON clause; see the sketch after the example.

INSERT OVERWRITE TABLE part PARTITION (p1 = 'part1') VALUES(3)
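
For the MERGE route, a minimal sketch using the Python DeltaTable API (the table name events, the id column, and the updates_df source dataframe are assumptions, not from this thread):

from delta.tables import DeltaTable

# Hypothetical target table "events", partitioned by p1 and keyed by id.
target = DeltaTable.forName(spark, "events")
(target.alias("t")
    .merge(updates_df.alias("s"),
           "t.id = s.id AND t.p1 = s.p1")  # unique id plus the partition column in ON
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())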

Anonymous
Not applicable

Hi @Saurabh Verma, following up: did you get a chance to check @Hubert Dudek's previous comments?

Maverick1
Valued Contributor II

@Chetan Kardekar:

Hubert's answer is not relevant in this case. It might be correct for data warehousing scenarios with primary keys and an all-SQL workflow, but programmatic API code doesn't usually use MERGE statements.

I have seen an open pull request for this functionality, and a lot of people are asking for the same thing, but I am not sure why the engineering team is hesitant to implement it. I have written some hacky code to perform the relevant operation: the predicate it returns can be fed directly into df.write.format("delta").mode("overwrite").option("replaceWhere", write_par).save().

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType


def _get_partitions(spark: SparkSession, df, partition_col_list):
    """
    Build a replaceWhere predicate covering every partition present in the
    input dataframe, based on the partition columns.
    :param spark: The spark session (unused here, kept for the caller's signature).
    :param df: input dataframe.
    :param partition_col_list: the list of partition columns in the input dataframe.
    :return: predicate string of the form "(c1 = 'v1' AND ...) OR (c1 = 'v2' AND ...)"
    """
    # Cast date/timestamp partition columns to string so the predicate
    # compares against plain string literals.
    for name, dtype in df.dtypes:
        if dtype in ('date', 'timestamp'):
            df = df.withColumn(name, F.col(name).cast(StringType()))

    # One Row per distinct combination of partition values.
    rows = df.select(partition_col_list).distinct().collect()

    # Build "(col1 = 'v1' AND col2 = 'v2')" per partition and OR them together,
    # instead of string-mangling the Row repr.
    clauses = [
        "(" + " AND ".join("{} = '{}'".format(c, row[c]) for c in partition_col_list) + ")"
        for row in rows
    ]
    write_par = " OR ".join(clauses)
    print("write_par: {}".format(write_par))
    return write_par
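
A usage sketch for the function above (the dataframe new_df, the partition columns, and the table path are hypothetical):

# Overwrite only the partitions present in new_df, leaving the rest untouched.
write_par = _get_partitions(spark, new_df, ["country", "region"])
(new_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", write_par)
    .save("/mnt/delta/events"))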
