Databricks Community

ajithgaade · ‎07-01-2024

Hi,

My Databricks Auto Loader job, written in PySpark, did not merge/update the schema even with retries configured.

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("cloudFiles.includeExistingFiles", "false")
    .load(source_path)
    .writeStream
    .queryName("write stream query")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .foreachBatch(batch_operation)
    .option("mergeSchema", True)
    .start()
    .awaitTermination()
)

Error:

Error while reading file s3://path
Caused by: UnknownFieldException: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_FILE] Encountered unknown fields during parsing: filed type, which can be fixed by an automatic retry: false

I tried running it a couple of times, with retry = 2 set on both the job and the task.

Can you please help?

Witold · ‎07-01-2024

What happens if you enable rescue mode?

.option('cloudFiles.schemaEvolutionMode', 'rescue')

ajithgaade · ‎07-02-2024

@Witold

The rescue option won't evolve the schema.
https://docs.databricks.com/en/ingestion/auto-loader/schema.html#:~:text=evolve%20data%20types.-,res...

My requirement is that the schema should evolve automatically.
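
For context, `cloudFiles.schemaEvolutionMode` accepts `addNewColumns`, `rescue`, `failOnNewColumns`, and `none`. Only `addNewColumns` adds new columns to the tracked schema, and it does so by failing the stream once so that a restart picks them up. Below is a minimal sketch of building the reader options with the mode set explicitly. The helper name `autoloader_options` and its defaults are hypothetical and not from the original post.

```python
# Documented values of cloudFiles.schemaEvolutionMode.
EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def autoloader_options(schema_location, file_format="parquet",
                       evolution_mode="addNewColumns"):
    """Build the option dict for spark.readStream.format("cloudFiles").

    Hypothetical helper: it only assembles and validates options,
    it does not touch Spark itself.
    """
    if evolution_mode not in EVOLUTION_MODES:
        raise ValueError(f"unknown schemaEvolutionMode: {evolution_mode!r}")
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
        "cloudFiles.includeExistingFiles": "false",
    }


# Usage on a Databricks cluster (not executed here; spark and source_path
# are assumed to exist in the notebook):
# opts = autoloader_options(checkpoint_path)
# df = spark.readStream.format("cloudFiles").options(**opts).load(source_path)
```

Setting the mode explicitly makes it obvious in the job code which behavior is expected, rather than relying on the default.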

Giri-Patcham · ‎07-01-2024

Hi @ajithgaade ,

If you are using a merge statement inside the foreachBatch function batch_operation, then you have to use DBR 15.2 or above to evolve the schema.
https://docs.databricks.com/en/delta/update-schema.html#automatic-schema-evolution-for-delta-lake-me...

ajithgaade · ‎07-02-2024

Hi @Giri-Patcham
The batch operation doesn't have a merge statement. I am dropping the tables and recreating them. I tried clearing the checkpoint location many times and tried different options. I also tried DBR 15.3, with no luck.

Giri-Patcham · ‎07-02-2024

@ajithgaade can you share the sample code inside the batch function ?

ajithgaade · ‎07-02-2024

Below is the sample one

from pyspark.sql.functions import col, length, lit, when

def batch_operation(df, batch_id):
    logger.info(f"batch_id: {batch_id}")
    src_cnt = 0
    filtr_cnt = 0

    df = df.withColumn('load_ctl_key', lit(aws_lck))
    src_columns = df.columns

    # checksum text column is generated and added to the DataFrame.
    if condition:
        df2 = create_chk_sum_txt(df=df, cols=col_info, exc_cols=exclude_columns)
    else:
        df2 = df

    # Records that fail validation
    df3 = df2.filter(
        col("id").isNull() | (length(col("id")) > 50) | (length(col("cd")) > 10) |
        (col("5yr").isNotNull() & col("5yr").cast("int").isNull())
    )
    # columns=df3.columns
    logger.info("df3 completed")

    df4 = df3.withColumn(
        "error_reason",
        when(col("id").isNull(), "id is Null")
        .when(length(col("id")) > 50, "id exceeds column size in base table")
        .otherwise("Miscellaneous Error"),
    )

    # Filter valid records to be loaded to Base Table
    df5 = df2.exceptAll(df3).withColumn("error_reason", lit(""))
    ins_cnt = df5.count()
    filtr_cnt = src_cnt - ins_cnt
    logger.info('src_cnt: {}'.format(src_cnt))
    logger.info('ins_cnt: {}'.format(ins_cnt))
    logger.info('filtr_cnt: {}'.format(filtr_cnt))

    select_df = df5.select(*col_info)
    logger.info("select_df completed")
    err_df = df4.select(*col_info)
    logger.info("err_df completed")

    logger.info("Final Schema before writing to UC:")
    select_df.printSchema()
    err_df.printSchema()

    # Valid records go to the target table, rejected records to the error table
    (
        select_df.write
        .mode("append")
        .saveAsTable(f"catalog.schema.{trgt_tbl}")
    )
    (
        err_df.write
        .mode("append")
        .saveAsTable(f"catalog.schema.{err_tbl}")
    )

    if int(load_ctl_key) != -1:
        print("Pushing metrics to UC")
        payload = {
            "load_ctl_key": load_ctl_key,
            "job_id": job_id,
            "batch_id": batch_id,
            "table": trgt_tbl,
            "src_cnt": src_cnt,
            "filter_cnt": filtr_cnt,
            "ins_cnt": ins_cnt
        }
        print("Payload: {}".format(payload))
        payload_df = spark.read.json(sc.parallelize([payload]))
        display(payload_df)
        payload_df.write.mode("append").saveAsTable(f"catalog.schema.audit")

Giri-Patcham · ‎07-02-2024

@ajithgaade

can you try setting this conf

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

mtajmouati · ‎07-02-2024

Hello,

Try this :

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Auto Loader Schema Evolution") \
    .getOrCreate()

# Source and checkpoint paths
source_path = "s3://path"
checkpoint_path = "/path/to/checkpoint"

# Define batch processing function
def batch_operation(df, epoch_id):
    # Perform your batch operations here
    # For example, write to Delta table with schema merge
    df.write \
      .format("delta") \
      .mode("append") \
      .option("mergeSchema", "true") \
      .save("/path/to/delta/table")

# Read stream with schema evolution
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.schemaLocation", checkpoint_path) \
    .option("cloudFiles.includeExistingFiles", "false") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .load(source_path)

# Write stream with schema merge
query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(availableNow=True) \
    .foreachBatch(batch_operation) \
    .option("mergeSchema", "true") \
    .start()

query.awaitTermination()

Also try setting a retry policy on the job task:

{
  "tasks": [
    {
      "task_key": "example-task",
      "notebook_task": {
        "notebook_path": "/path/to/your/notebook"
      },
      "max_retries": 2,
      "min_retry_interval_millis": 60000,
      "retry_on_timeout": true
    }
  ]
}
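
As a complement to the job-level retry settings above, here is a rough sketch of retrying inside the notebook itself. In `addNewColumns` mode Auto Loader records the new columns in the schema location and then stops the stream, so it is the restart that picks up the evolved schema. The function `run_with_schema_retry`, the `start_stream` callable, and matching the error by the `UNKNOWN_FIELD_EXCEPTION` text in the message are all assumptions made for illustration, not a Databricks API.

```python
import time


def run_with_schema_retry(start_stream, max_retries=2, wait_seconds=0.0):
    """Call start_stream() until it finishes, retrying only schema-change failures.

    start_stream is a hypothetical zero-argument callable that builds the
    stream, starts it, and blocks until it terminates (for example by
    calling awaitTermination()).
    """
    attempt = 0
    while True:
        try:
            return start_stream()
        except Exception as exc:
            # Assumption: a schema-change failure can be recognised by its
            # error class appearing in the exception message.
            is_schema_change = "UNKNOWN_FIELD_EXCEPTION" in str(exc)
            if not is_schema_change or attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(wait_seconds)
```

Any other failure, or a schema failure that persists after `max_retries` restarts, is re-raised so the job still fails visibly.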


Autoloader includeExistingFiles with retry didn't update the schema
