Hi all! I am having the following issue with a couple of PySpark streams.
I have several notebooks, each running an independent file-based structured stream over a Delta bronze table (gzip Parquet files) dumped from Kinesis to S3 by a previous job. Each file contains events in JSON format that need to be aggregated in different ways and then written out to S3 again (just dumped, not appended to any table). Among the events I occasionally get a corrupted event that arrives as a plain string, which I need to filter out of the stream. Let's suppose the corrupted event is the single string "error_event".
At the beginning of each notebook, the first things I do after spark.readStream are:
1. bronze_df.where(f.col("data") != "error_event")
2. apply the expected schema to the data column to parse the JSON events (roughly as in the sketch below)
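Concretely, the relevant part of each notebook looks roughly like this (the S3 path and event_schema here are placeholders, not the real names):

```python
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; the real one matches the JSON event fields
event_schema = StructType([StructField("event_type", StringType())])

# Read the bronze Delta table as a stream (path is a placeholder)
bronze_df = spark.readStream.format("delta").load("s3://bucket/bronze/events")

# 1. drop the corrupted string events before parsing
filtered_df = bronze_df.where(f.col("data") != "error_event")

# 2. apply the expected schema to the JSON payload
parsed_df = filtered_df.withColumn("data", f.from_json(f.col("data"), event_schema))
```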
For some reason I haven't been able to figure out yet, some of the streams fail when I switch the cluster from a Photon runtime to standard, returning the following error, even though they all use the same function to filter the error events:
Error details:
Caused by: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION] Malformed records are detected in record parsing: [null,null,null,null,null,null,null,null,null,null,null,null,null].
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'error_event': was expecting (JSON String, Number (or 'NaN'/'INF'/'+INF'), Array, Object or token 'null', 'true' or 'false') at [Source: (InputStreamReader)
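For what it's worth, I can mimic the same parse failure on a plain DataFrame like this (the schema is a placeholder, and I'm assuming FAILFAST-style parsing here, since the default PERMISSIVE mode would just turn the bad record into a row of nulls like the one in the error message):

```python
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; the real one has ~13 fields (the error shows 13 nulls)
event_schema = StructType([StructField("event_type", StringType())])

df = spark.createDataFrame([('{"event_type": "ok"}',), ("error_event",)], ["data"])

# With FAILFAST this raises the same JsonParseException as the stream;
# with the default PERMISSIVE mode the bad row would parse to nulls instead
df.select(
    f.from_json(f.col("data"), event_schema, {"mode": "FAILFAST"}).alias("parsed")
).show()
```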
Any ideas about what might be causing this? Thanks in advance!