Databricks Community

lakime · ‎11-20-2023

HelloI have issue with overwriting schema while using writestream - I do not receive any error - however schema remain unchanged

Below example

df_abc = spark.readstream

.format("cloudFiles")

.option("cloudFiles.format", "parquet")

.option"cloudFiles.schemaLocation", chklocat )

.load(deltatbl)

df_abc = df_abc.withColumn("columna", col("columna").cast("timestamp"))

write = df_abc.writestream

.outputMode("append")

.option("checkpointLocation",chklocat)

.trigger(availableNow=True)

.option("overwriteSchema", "true")

.toTable(dbname + "." + tblname)

Kaniz_Fatma · ‎11-20-2023

Hi @lakime, It seems you’re encountering an issue with schema overwriting while using writestream in PySpark.

Let’s troubleshoot this together!

Boolean Value for overwriteSchema:

The overwriteSchema option expects a string value, not a boolean. You should set it as "True" (with quotes) instead of True.

Schema Migration:

If you still face issues, consider enabling schema migration using DataFrameWriter or DataStreamWriter.
Set the .option("mergeSchema", "true") to enable schema migration.
Additionally, ensure that the session configuration spark.databricks.delta.schema.autoMerge.enabled is set to "true".

Hopefully, this helps resolve the schema overwriting issue! 🚀

lakime · ‎11-20-2023

Hey Kaniz, not sure do I follow

- overwriteSchema option was set up as you have written

- session configuration is set-up correctly

I have also tried several ways configuration including set up of "mergeSchema", "true" but still doesn't work

Kaniz_Fatma · ‎11-20-2023

Hi @lakime, You're encountering schema overwriting issues while using writestream in Databricks.

Let's troubleshoot this together!

Correct Option Placement:

The overwriteSchema option should be set in the write operation, not the read operation.
Make sure to set it when saving the data to the Delta table, not during reading.

Avoid Writing Data Twice:

In your example, you're writing the data twice: once as a "normal" directory and then as a managed table.
If you want to create an unmanaged table in a custom location, add the path option to the third variant.
You can omit the dbfs:/ prefix since it's the default schema.

Consider mergeSchema Option:

If your schema changes only add columns or are minor, you can use mergeSchema instead of overwriteSchema.
mergeSchema allows you to merge new columns into the existing schema without causing issues.
Adjust your code accordingly based on the nature of your schema changes.

Check for Table ACLs:

If you have Table ACLs enabled, schema changes might require different permissions.
Ensure that your Delta table's necessary permissions (e.g., MODIFY and OWN) are correctly set.

Remember to apply these adjustments to your code; hopefully, it will resolve the issue. If you encounter any further challenges, feel free to ask! 🚀🔍🔒

lakime · ‎11-23-2023

That did not solve the problem

Kaniz_Fatma · ‎11-29-2023

Here are a few things to check and try:

1. Schema Mismatch: Make sure there isn’t a schema mismatch between your input data (df_abc) and the target table where you’re writing the data. If there’s a mismatch, the schema won’t be overwritten as expected. You can restart the stream to resolve schema mismatches.

2. Checkpoint Location: Verify that the checkpoint location (chklocat) is correctly set. The checkpoint location is essential for maintaining the state of the streaming query. If it’s not set correctly, it might impact schema overwriting.

3. Explicitly Specify Schema: Instead of relying on schema inference, explicitly define the schema for your df_abc using .schema(your_schema). Ensure that the specified schema matches the expected output schema.

4. Trigger Mode: The trigger mode you’ve set (availableNow=True) indicates that the query should run as soon as possible. Consider using other trigger modes (e.g., processingTime, once, or continuous) based on your use case.

Databricks Community

.overwriteschema + writestream

🔔 ALERT: Act Now to Protect Your Community Account; Secure Your Details Before It's Too Late!

Databricks Learning Festival (Virtual): 10 July - 24 July 2024

Data + AI Summit 2024: An Executive Summary for Data Leaders

Big Data Is Back and Is More Important Than AI