2 weeks ago - last edited 2 weeks ago
To complete one of the badges, it was mandatory to finish a Spark Streaming demo practice.
Since no Kafka broker was available for the demo practice, I configured a Confluent Kafka cluster and made several modifications to the Spark script provided by the DBDemos streaming-sessionization demo to make it compatible with that cluster. With this setup, I successfully ingested data from the Kafka topic into the bronze table.
Upon initiating the data load from the bronze table to the silver table, I observed an anomaly: the code executed successfully and the Delta transaction log recorded a 'Streaming_update', yet the table did not contain any data. Further investigation revealed that the from_json function was not parsing the values correctly, even though the schema I was passing to the function matched the one expected in the script.
1. Understand the schema of the value column from the Kafka topic. Using the schema_of_json function, we can dynamically extract the schema from the stringified JSON value.
2. Next, I tried from_json with the extracted json_schema to expand the value column into individual key columns; the key-value pairs were parsed, but the values all came out as null.
I also checked whether the JSON was valid using the Python json library, which parsed it without any error.
3. Finally, I resolved the issue with a few simple string-cleaning steps, listed below (see the sketch after this list):
a) Removing the leading and trailing double quotes from the string value
b) Replacing backslashes with empty strings
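A minimal sketch of the flow described above, assuming a double-encoded payload; the column and field names (value, user_id, event_ts) are placeholders for illustration, not the exact dbdemos schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# A double-encoded payload similar to what the Kafka value column contained:
# the JSON object is itself wrapped in quotes and has escaped inner quotes.
raw = [('"{\\"user_id\\": \\"u1\\", \\"event_ts\\": \\"2024-01-01T10:00:00\\"}"',)]
df = spark.createDataFrame(raw, ["value"])

# Cleaning expression: strip the outer double quotes, then drop the backslashes.
cleaned = F.regexp_replace(F.regexp_replace(F.col("value"), '^"|"$', ''), r'\\', '')

# Step 1: derive the schema dynamically from a cleaned sample with schema_of_json.
sample = df.select(cleaned.alias("value")).first()["value"]
json_schema = df.select(F.schema_of_json(F.lit(sample))).first()[0]

# Steps 2-3: clean the column, then parse it with from_json and expand the keys.
parsed = (df
          .withColumn("value_clean", cleaned)
          .withColumn("json", F.from_json("value_clean", json_schema))
          .select("json.*"))
parsed.show(truncate=False)
```

Without the cleaning step, from_json sees a plain string rather than a JSON object and returns nulls for every field, which matches the symptom described above.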
I'm not sure whether this is expected behavior of the from_json function or whether it should be fixed.
#DataEngineering #StreamingSessionization #dbdemos #SparkStructuredStreaming
Regards,
Hari Prasad
a week ago - last edited a week ago
Hi @hari-prasad ,
We have an ES ticket that mentions that JSON parsing for structs, maps, and arrays was fixed so that when part of a record does not match the schema, the rest of the record can still be parsed correctly instead of returning nulls. This behavior is optional and can be enabled by setting spark.sql.json.enablePartialResults to true. By default, the flag is disabled to preserve the original behavior.
This suggests that the default behavior of from_json might not handle certain discrepancies in the JSON data gracefully, leading to null values. Cleaning the string values by replacing leading and trailing double quotes and backslashes indicates that the input data might not have been in the expected format, which could cause parsing issues.
Therefore, the behavior you encountered might be expected under certain conditions, especially if the input data format does not align perfectly with the expected schema. It may not necessarily be a bug but rather a limitation or characteristic of the default parsing behavior. You can consider enabling the spark.sql.json.enablePartialResults option to see if it improves the parsing behavior in your case.
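A minimal sketch of trying that option, assuming bronze_df and json_schema already exist from the earlier steps; the variable names are placeholders, not the exact dbdemos ones:

```python
from pyspark.sql import functions as F

# Opt in to partial JSON parsing (disabled by default).
spark.conf.set("spark.sql.json.enablePartialResults", "true")

# Re-run the same parse; fields that do match the schema should now survive
# even if part of the record does not.
silver_df = (bronze_df
             .withColumn("json", F.from_json(F.col("value").cast("string"), json_schema))
             .select("json.*"))
```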
Thanks!!
a week ago
Hi @Sidhant07,
Thanks for responding. I will try the Spark config spark.sql.json.enablePartialResults.
On the other hand, the Python json library is able to parse the raw JSON string without any additional configuration.
Regards!
Saturday
@Sidhant07, the Spark config spark.conf.set("spark.sql.json.enablePartialResults", True) is not helping. I assume this is a case that has to be handled by replacing those characters in the string before converting it.
Regards,
Hari Prasad
a week ago
I am not sure if I read the full explanation, but how about this:
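(The original snippet is not shown in the thread; the following is a hypothetical reconstruction of the decode-based suggestion being discussed, with placeholder column names.)

```python
from pyspark.sql import functions as F

# Hypothetical: decode the Kafka value column from binary to a UTF-8 string
# before parsing it as JSON.
decoded_df = kafka_df.withColumn("value_str", F.decode(F.col("value"), "UTF-8"))
```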
Saturday - last edited Saturday
@saurabh18cs, decode won't help, as the value column is not a binary string and is not encoded with UTF-8 or any other encoding. The value is already available as stringified JSON without any encoding.
Regards,
Hari Prasad
Monday
@hari-prasad, thanks. Then I think pre-processing the JSON string with a regexp, as you are already doing, is the right approach.
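For completeness, a minimal sketch of that regexp pre-processing applied inside the bronze-to-silver stream; the table names, checkpoint path, and json_schema are placeholders, not the exact dbdemos values:

```python
from pyspark.sql import functions as F

silver_stream = (spark.readStream.table("bronze")
                 .withColumn("value_clean",
                             F.regexp_replace(
                                 F.regexp_replace(F.col("value").cast("string"), '^"|"$', ''),
                                 r'\\', ''))
                 .withColumn("json", F.from_json("value_clean", json_schema))
                 .select("json.*"))

(silver_stream.writeStream
 .option("checkpointLocation", "/tmp/checkpoints/silver")
 .toTable("silver"))
```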