
Stream failure JsonParseException

patojo94
New Contributor II

Hi all! I am having the following issue with a couple of PySpark streams.

I have several notebooks, each running an independent structured streaming query over a Delta bronze table (gzip parquet files) dumped from Kinesis to S3 in a previous job. Each file contains events in JSON format that need to be aggregated in different ways and then written out to AWS S3 again (just dumped, not appended to any table). Among the events, I sometimes get a corrupted event in plain string format that I need to filter out of the stream. Let's suppose the event is a single string that says "error_event".

At the beginning of each notebook, the first things I do after spark.readStream are (roughly as in the sketch below):

1. bronze_df.where(f.col("data") != "error_event")
2. apply the expected schema to the data column to parse the JSON
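
Roughly, the two steps look like this (the schema and table name here are placeholders for the real ones):

from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; the real one has all the expected event fields.
event_schema = StructType([StructField("event_type", StringType())])

bronze_df = spark.readStream.table("bronze_events")  # placeholder table name

# 1. drop the corrupted string events
clean_df = bronze_df.where(f.col("data") != "error_event")

# 2. parse the JSON payload with the expected schema
parsed_df = clean_df.withColumn("data", f.from_json(f.col("data"), event_schema))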
 
For some reason I haven't been able to figure out yet, some of the streams fail when I change the cluster from Photon to standard, returning the following error, even though they all use the same function to filter the error events:
 

Error details:

Caused by: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION] Malformed records are detected in record parsing: [null,null,null,null,null,null,null,null,null,null,null,null,null].

Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'error_event': was expecting (JSON String, Number (or 'NaN'/'INF'/'+INF'), Array, Object or token 'null', 'true' or 'false') at [Source: (InputStreamReader)

Any idea what might be causing this? Thanks in advance!

 

1 ACCEPTED SOLUTION


Kaniz_Fatma
Community Manager

Hi @patojo94, you're encountering an issue with malformed records in your PySpark streams.

 

Let's explore some potential solutions:

 

Malformed Record Handling:

  • The error message indicates that there are malformed records during parsing. By default, Spark uses the PERMISSIVE mode, which attempts to parse records even if they are malformed. If a record cannot be parsed, it fills in null values.
  • You can explicitly set the parsing mode to DROPMALFORMED to reject malformed records during reading. This way, any records that don't conform to the expected format will be dropped.
  • Here's how you can apply this mode when reading your data:
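
A minimal sketch, assuming the JSON is read directly from files on S3; the schema and path are placeholders. Note that from_json only supports the PERMISSIVE and FAILFAST modes, so DROPMALFORMED applies when reading JSON files with the JSON reader:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema; replace with your real event schema.
event_schema = StructType([
    StructField("event_type", StringType()),
    StructField("event_ts", LongType()),
])

events_df = (
    spark.readStream
         .schema(event_schema)
         .option("mode", "DROPMALFORMED")   # silently drop records that cannot be parsed
         .json("s3://your-bucket/events/")  # placeholder path
)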

Check Data Schema:

  • Ensure that the schema you're applying matches the actual structure of your data. If the schema doesn't match, it could lead to parsing errors.
  • Verify that your schema's column names and data types align with the actual data, for instance with a quick check like the one sketched below.
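
A quick check (assuming the raw JSON sits in a string column called "data" and event_schema is the expected schema; both names are placeholders):

from pyspark.sql import functions as f

sample_df = spark.read.table("bronze_events").limit(20)   # placeholder table name
sample_df.select(
    f.col("data"),
    f.from_json(f.col("data"), event_schema).alias("parsed"),
).show(truncate=False)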

Inspect Malformed Records:

  • To diagnose the issue further, consider examining the malformed records. You can cache or save the parsed results and then filter out the malformed records.
  • For example:
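
A sketch, assuming the raw JSON sits in a string column called "data" and that event_type is a field every well-formed event should carry (both names are placeholders):

from pyspark.sql import functions as f

parsed_df = bronze_df.withColumn("parsed", f.from_json(f.col("data"), event_schema))

# If a field that should always be present comes back null while the raw
# string is not null, the record most likely failed to parse.
bad_records_df = parsed_df.where(
    f.col("data").isNotNull() & f.col("parsed.event_type").isNull()
)

# For a streaming DataFrame, write the suspects to a quarantine location
# (paths below are placeholders) instead of caching them.
(bad_records_df.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://your-bucket/checkpoints/bad_records")
    .start("s3://your-bucket/quarantine/bad_records"))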

Cluster Mode Change:

  • You mentioned that the issue occurs when changing the cluster mode from photon to standard. Verify if there are any differences in configuration or environment between these modes.
  • Check if the standard mode introduces any additional constraints or limitations that affect record parsing.

Data Inspection:

  • Inspect the actual data files (good and malformed records) to identify patterns or anomalies.
  • Look for unexpected characters or formatting issues within the records, for example with a quick scan like the one below.
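
For instance, a quick scan for records that are not JSON objects (assuming well-formed events start with "{" and the raw string column is called "data"; adjust the names to your data):

from pyspark.sql import functions as f

bronze_batch = spark.read.table("bronze_events")   # placeholder table name
bronze_batch.where(~f.col("data").startswith("{")).show(10, truncate=False)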

Remember to adapt these suggestions to your specific use case. Hopefully, one of these approaches will help you resolve the issue. If you need further assistance, feel free to ask! 🚀


2 REPLIES


Thank you sir for answering, that helps a lot. Please mark it as a solution.

