11-20-2023 03:57 AM
Hello, I have an issue with overwriting the schema while using writeStream. I do not receive any error, but the schema remains unchanged.
Example below:
from pyspark.sql.functions import col

df_abc = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", chklocat)
    .load(deltatbl))

df_abc = df_abc.withColumn("columna", col("columna").cast("timestamp"))

write = (df_abc.writeStream
    .outputMode("append")
    .option("checkpointLocation", chklocat)
    .option("overwriteSchema", "true")
    .trigger(availableNow=True)
    .toTable(dbname + "." + tblname))
11-20-2023 04:58 AM
Hi @PiotrU, it seems you're encountering an issue with schema overwriting while using writeStream in PySpark.
Let’s troubleshoot this together!
Boolean Value for overwriteSchema: make sure the option is passed as the string "true" on the writer, exactly as .option("overwriteSchema", "true").
Schema Migration: for Delta tables, schema changes during streaming writes also depend on the session configuration; try enabling automatic schema migration with spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true").
Hopefully, this helps resolve the schema overwriting issue! 🚀
11-20-2023 11:52 AM
Hey Kaniz, not sure do I follow
- overwriteSchema option was set up as you have written
- session configuration is set-up correctly
I have also tried several ways configuration including set up of "mergeSchema", "true" but still doesn't work
11-20-2023 09:03 PM
Hi @PiotrU, you're encountering schema overwriting issues while using writeStream in Databricks.
Let's troubleshoot this together!
Correct Option Placement: make sure all writer options are set on the writeStream builder before the terminal call (toTable).
Avoid Writing Data Twice: don't combine a streaming write with a separate batch write to the same table, as the two can conflict over the schema.
Consider mergeSchema Option: try adding .option("mergeSchema", "true") to the streaming writer to allow schema evolution.
Check for Table ACLs: verify that your user or service principal has the permissions required to modify the table's schema.
11-23-2023 12:36 AM
That did not solve the problem
11-29-2023 07:19 AM
Here are a few things to check and try:
1. Schema Mismatch: Make sure there isn’t a schema mismatch between your input data (df_abc) and the target table where you’re writing the data. If there’s a mismatch, the schema won’t be overwritten as expected. You can restart the stream to resolve schema mismatches.
2. Checkpoint Location: Verify that the checkpoint location (chklocat) is correctly set. The checkpoint location is essential for maintaining the state of the streaming query. If it’s not set correctly, it might impact schema overwriting.
3. Explicitly Specify Schema: Instead of relying on schema inference, explicitly define the schema for your df_abc using .schema(your_schema). Ensure that the specified schema matches the expected output schema.
4. Trigger Mode: The trigger mode you’ve set (availableNow=True) indicates that the query should run as soon as possible. Consider using other trigger modes (e.g., processingTime, once, or continuous) based on your use case.