
Stream failure JsonParseException

patojo94
New Contributor II

Hi all! I am having the following issue with a couple of PySpark streams.

I have several notebooks, each running an independent structured streaming query over a Delta bronze table (gzip parquet files) dumped from Kinesis to S3 in a previous job. Each file contains events in JSON format that need to be aggregated in different ways and then written out to AWS S3 again (just dumped, not appended to any table). Among the events, I sometimes get a corrupted event in plain string format that I need to filter out of the stream. Let's suppose the event is a single string that says "error_event".

At the beginning of each notebook, the first things I do after spark.readStream are (roughly as in the sketch below):

1. bronze_df.where(f.col("data") != "error_event")
2. apply the expected schema to the data column to parse the JSON
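
Roughly, the two steps look like this (the schema and table name here are placeholders for the real ones):

from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; the real one has all the expected event fields.
event_schema = StructType([StructField("event_type", StringType())])

bronze_df = spark.readStream.table("bronze_events")  # placeholder table name

# 1. drop the corrupted string events
clean_df = bronze_df.where(f.col("data") != "error_event")

# 2. parse the JSON payload with the expected schema
parsed_df = clean_df.withColumn("data", f.from_json(f.col("data"), event_schema))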
 
For some reason I haven't been able to figure out yet, some of the streams fail when I change the cluster from Photon to standard, returning the following error, even though they all use the same function to filter the error events:
 

Error details:

Caused by: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION] Malformed records are detected in record parsing: [null,null,null,null,null,null,null,null,null,null,null,null,null].

Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'error_event': was expecting (JSON String, Number (or 'NaN'/'INF'/'+INF'), Array, Object or token 'null', 'true' or 'false') at [Source: (InputStreamReader)

Any idea what might be causing this? Thanks in advance!

 

1 ACCEPTED SOLUTION


Kaniz_Fatma
Community Manager

Hi @patojo94, you're encountering an issue with malformed records in your PySpark streams.

 

Let's explore some potential solutions:

 

Malformed Record Handling:

  • The error message indicates that there are malformed records during parsing. By default, Spark uses the PERMISSIVE mode, which attempts to parse records even if they are malformed. If a record cannot be parsed, it fills in null values.
  • You can explicitly set the parsing mode to DROPMALFORMED to reject malformed records during reading. This way, any records that don't conform to the expected format will be dropped.
  • Here's how you can apply this mode when reading your data:
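
A minimal sketch, assuming the JSON is read directly from files on S3; the schema and path are placeholders. Note that from_json only supports the PERMISSIVE and FAILFAST modes, so DROPMALFORMED applies when reading JSON files with the JSON reader:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema; replace with your real event schema.
event_schema = StructType([
    StructField("event_type", StringType()),
    StructField("event_ts", LongType()),
])

events_df = (
    spark.readStream
         .schema(event_schema)
         .option("mode", "DROPMALFORMED")   # silently drop records that cannot be parsed
         .json("s3://your-bucket/events/")  # placeholder path
)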

Check Data Schema:

  • Ensure that the schema you're applying matches the actual structure of your data. If the schema doesn't match, it could lead to parsing errors.
  • Verify that your schema's column names and data types align with the actual data, for instance with a quick check like the one sketched below.
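
A quick check (assuming the raw JSON sits in a string column called "data" and event_schema is the expected schema; both names are placeholders):

from pyspark.sql import functions as f

sample_df = spark.read.table("bronze_events").limit(20)   # placeholder table name
sample_df.select(
    f.col("data"),
    f.from_json(f.col("data"), event_schema).alias("parsed"),
).show(truncate=False)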

Inspect Malformed Records:

  • To diagnose the issue further, consider examining the malformed records. You can cache or save the parsed results and then filter out the malformed records.
  • For example:
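
A sketch, assuming the raw JSON sits in a string column called "data" and that event_type is a field every well-formed event should carry (both names are placeholders):

from pyspark.sql import functions as f

parsed_df = bronze_df.withColumn("parsed", f.from_json(f.col("data"), event_schema))

# If a field that should always be present comes back null while the raw
# string is not null, the record most likely failed to parse.
bad_records_df = parsed_df.where(
    f.col("data").isNotNull() & f.col("parsed.event_type").isNull()
)

# For a streaming DataFrame, write the suspects to a quarantine location
# (paths below are placeholders) instead of caching them.
(bad_records_df.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://your-bucket/checkpoints/bad_records")
    .start("s3://your-bucket/quarantine/bad_records"))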

Cluster Mode Change:

  • You mentioned that the issue occurs when changing the cluster mode from photon to standard. Verify if there are any differences in configuration or environment between these modes.
  • Check if the standard mode introduces any additional constraints or limitations that affect record parsing.

Data Inspection:

  • Inspect the actual data files (good and malformed records) to identify patterns or anomalies.
  • Look for unexpected characters or formatting issues within the records, for example with a quick scan like the one below.
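
For instance, a quick scan for records that are not JSON objects (assuming well-formed events start with "{" and the raw string column is called "data"; adjust the names to your data):

from pyspark.sql import functions as f

bronze_batch = spark.read.table("bronze_events")   # placeholder table name
bronze_batch.where(~f.col("data").startswith("{")).show(10, truncate=False)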

Remember to adapt these suggestions to your specific use case. Hopefully, one of these approaches will help you resolve the issue. If you need further assistance, feel free to ask! 🚀


2 REPLIES


Thank you sir for answering, that helps a lot. Please mark it as a solution.

