Data Engineering
'replaceWhere' clause in spark.write for a partitioned table

TheDataEngineer
New Contributor

Hi, I want to be clear about the 'replaceWhere' clause in spark.write.

Here is the scenario:
I would like to change a column value for a few existing records.
The table is already partitioned on the "PickupMonth" column.

Here is an example without 'replaceWhere':


from pyspark.sql.functions import lit
from pyspark.sql.types import LongType

spark.read \
.format("delta") \
.load(destination_path) \
.where("PickupMonth == '12' and PaymentType == '3'") \
.withColumn("PaymentType", lit(4).cast(LongType())) \
.write \
.format("delta") \
.mode("overwrite") \
.save(destination_path)


In the above code we read only the records matching "PickupMonth == '12' and PaymentType == '3'",
set their PaymentType to 4, and write back to the same table. Does the above code work?

If we do not explicitly add ".option("replaceWhere", "PickupMonth = '12'")" during the write,
shouldn't it write just the "PickupMonth == '12'" partition automatically, since the table is partitioned?

Why should we instead write it as follows, with ".option("replaceWhere", "PickupMonth = '12'")":

spark.read \
.format("delta") \
.load(destination_path) \
.where("PickupMonth == '12' and PaymentType == '3' ") \
.withColumn("PaymentType", lit(4).cast(LongType())) \
.write \
.format("delta") \
.option("replaceWhere", "PickupMonth = '12'") \
.mode("overwrite") \
.save(destination_path)

In other words, what is the difference between including ".option("replaceWhere", "PickupMonth = '12'")" and leaving it out?

Thank you in advance.

1 REPLY

Kaniz
Community Manager

Hi @TheDataEngineer! Let’s dive into the details of the replaceWhere clause in Delta Lake.

The replaceWhere option is a powerful feature in Delta Lake that allows you to overwrite a subset of a table during write operations. Specifically, it lets you replace partitions that match a certain condition, rather than the entire table. Let’s break down your scenario and explore how it works:

  1. Scenario:

    • You have an existing Delta table partitioned by the “PickupMonth” column.
    • You want to change a column value for a subset of records (where “PickupMonth” is ‘12’ and “PaymentType” is ‘3’).
    • You’re using the overwrite mode to replace the existing data.
  2. Code Without replaceWhere:

    spark.read \
        .format("delta") \
        .load(destination_path) \
        .where("PickupMonth == '12' and PaymentType == '3' ") \
        .withColumn("PaymentType", lit(4).cast(LongType())) \
        .write \
        .format("delta") \
        .mode("overwrite") \
        .save(destination_path)
    
    • In this code, you’re reading the subset of records where “PickupMonth” is ‘12’ and “PaymentType” is ‘3’.
    • You change the “PaymentType” value to 4 and then overwrite the entire table.
  3. Why Use replaceWhere?:

    • When you don’t explicitly specify replaceWhere, the overwrite mode replaces the entire table, not just the filtered partitions.
    • In your case, it would overwrite all partitions, including those where “PickupMonth” is not ‘12’.
    • If you want to be more selective and only replace the relevant partitions, you should use replaceWhere.
  4. Code With replaceWhere:

    spark.read \
        .format("delta") \
        .load(destination_path) \
        .where("PickupMonth == '12' and PaymentType == '3' ") \
        .withColumn("PaymentType", lit(4).cast(LongType())) \
        .write \
        .format("delta") \
        .option("replaceWhere", "PickupMonth = '12'") \
        .mode("overwrite") \
        .save(destination_path)
    
    • By adding .option("replaceWhere", "PickupMonth = '12'"), you specify that only partitions where “PickupMonth” is ‘12’ should be replaced.
    • Other partitions remain untouched.
  5. Difference:

    • Without replaceWhere: Overwrites the entire table, including all partitions.
    • With replaceWhere: Selectively overwrites only the partitions that match the specified condition (“PickupMonth” is ‘12’).
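The two behaviors in points 4 and 5 can be sketched without a Spark cluster. The following is a plain-Python toy model (rows as dicts, no Delta Lake involved) that only illustrates the overwrite semantics, not the actual Delta implementation:

```python
# Toy model of a partitioned Delta table: one dict per row.
table = [
    {"PickupMonth": "11", "PaymentType": 3},
    {"PickupMonth": "12", "PaymentType": 3},
    {"PickupMonth": "12", "PaymentType": 1},
]

# The DataFrame being written: the '12'/'3' rows with PaymentType set to 4.
new_rows = [{"PickupMonth": "12", "PaymentType": 4}]

# mode("overwrite") WITHOUT replaceWhere: the whole table is replaced
# by whatever is written, regardless of partitioning.
plain_overwrite = list(new_rows)

# mode("overwrite") WITH replaceWhere="PickupMonth = '12'": only rows
# matching the predicate are removed; the new rows take their place.
untouched = [r for r in table if r["PickupMonth"] != "12"]
replace_where = untouched + new_rows

print(plain_overwrite)  # the '11' partition is gone
print(replace_where)    # the '11' partition survives
```

The key point the toy model captures: partitioning alone does not make mode("overwrite") selective; only the replaceWhere predicate does.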

In summary, if you want to be more precise and avoid unnecessary overwrites, use replaceWhere to target specific partitions. Otherwise, without it, the entire table will be replaced during the write operation.
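One caveat worth flagging about the exact code above: replaceWhere replaces all rows matching the predicate with the incoming DataFrame. Because the read was filtered on both PickupMonth and PaymentType, any rows in the '12' partition whose PaymentType is not '3' would be dropped by the write. A safer pattern is to read back the whole partition and transform only the rows you mean to change; here is a plain-Python sketch of that pattern (rows as dicts, not actual Spark code):

```python
# Toy model: the '12' partition holds rows with several payment types.
table = [
    {"PickupMonth": "12", "PaymentType": 3},
    {"PickupMonth": "12", "PaymentType": 1},
]

# Rewrite the WHOLE partition, changing only PaymentType 3 -> 4.
# Rows with other payment types pass through unchanged.
rewritten = [
    {**row, "PaymentType": 4} if row["PaymentType"] == 3 else row
    for row in table
    if row["PickupMonth"] == "12"
]

print(rewritten)  # both rows survive; only the type-3 row was changed
```

In PySpark terms this corresponds roughly to filtering the read on PickupMonth only and applying the change conditionally, e.g. .withColumn("PaymentType", when(col("PaymentType") == 3, lit(4)).otherwise(col("PaymentType"))), before writing with the same replaceWhere option.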

Feel free to adjust your code based on your specific requirements! 😊

 