cancel
Showing results for 
Search instead for 
Did you mean: 
Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.
cancel
Showing results for 
Search instead for 
Did you mean: 

.overwriteschema + writestream

lakime
New Contributor III

HelloI have issue with overwriting schema while using writestream - I do not receive any error - however schema remain unchanged

Below example

df_abc = spark.readstream

   .format("cloudFiles")

   .option("cloudFiles.format", "parquet") 

   .option"cloudFiles.schemaLocation", chklocat )

   .load(deltatbl)

df_abc = df_abc.withColumn("columna", col("columna").cast("timestamp"))

write = df_abc.writestream

   .outputMode("append")

   .option("checkpointLocation",chklocat)

   .trigger(availableNow=True)

   .option("overwriteSchema", "true")

   .toTable(dbname + "." + tblname)

5 REPLIES 5

Kaniz
Community Manager
Community Manager

Hi @lakime, It seems you’re encountering an issue with schema overwriting while using writestream in PySpark. 

 

Let’s troubleshoot this together!

 

Boolean Value for overwriteSchema:

  • The overwriteSchema option expects a string value, not a boolean. You should set it as "True" (with quotes) instead of True.

Schema Migration:

  • If you still face issues, consider enabling schema migration using DataFrameWriter or DataStreamWriter.
  • Set the .option("mergeSchema", "true") to enable schema migration.
  • Additionally, ensure that the session configuration spark.databricks.delta.schema.autoMerge.enabled is set to "true".

Hopefully, this helps resolve the schema overwriting issue! 🚀

lakime
New Contributor III

Hey Kaniz, not sure do I follow

- overwriteSchema option was set up as you have written

- session configuration is set-up correctly

I have also tried several ways configuration including set up of "mergeSchema", "true" but still doesn't work

Kaniz
Community Manager
Community Manager

Hi @lakime, You're encountering schema overwriting issues while using writestream in Databricks. 

 

Let's troubleshoot this together!

 

Correct Option Placement:

  • The overwriteSchema option should be set in the write operation, not the read operation.
  • Make sure to set it when saving the data to the Delta table, not during reading.

Avoid Writing Data Twice:

  • In your example, you're writing the data twice: once as a "normal" directory and then as a managed table.
  • If you want to create an unmanaged table in a custom location, add the path option to the third variant.
  • You can omit the dbfs:/ prefix since it's the default schema.

Consider mergeSchema Option:

  • If your schema changes only add columns or are minor, you can use mergeSchema instead of overwriteSchema.
  • mergeSchema allows you to merge new columns into the existing schema without causing issues.
  • Adjust your code accordingly based on the nature of your schema changes.

Check for Table ACLs:

  • If you have Table ACLs enabled, schema changes might require different permissions.
  • Ensure that your Delta table's necessary permissions (e.g., MODIFY and OWN) are correctly set.

    Remember to apply these adjustments to your code; hopefully, it will resolve the issue. If you encounter any further challenges, feel free to ask! 🚀🔍🔒

lakime
New Contributor III

That did not solve the problem

 

Kaniz
Community Manager
Community Manager

Here are a few things to check and try:

 

1. Schema Mismatch: Make sure there isn’t a schema mismatch between your input data (df_abc) and the target table where you’re writing the data. If there’s a mismatch, the schema won’t be overwritten as expected. You can restart the stream to resolve schema mismatches.

2. Checkpoint Location: Verify that the checkpoint location (chklocat) is correctly set. The checkpoint location is essential for maintaining the state of the streaming query. If it’s not set correctly, it might impact schema overwriting.

3. Explicitly Specify Schema: Instead of relying on schema inference, explicitly define the schema for your df_abc using .schema(your_schema). Ensure that the specified schema matches the expected output schema.

4. Trigger Mode: The trigger mode you’ve set (availableNow=True) indicates that the query should run as soon as possible. Consider using other trigger modes (e.g., processingTime, once, or continuous) based on your use case.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!