Spark issue handling data from JSON when a schema DataType mismatch occurs
01-05-2023 06:57 AM
Hi,
I have encountered a problem using Spark when creating a DataFrame from a raw JSON source.
I have defined a schema for my data, and the problem is that when there is a mismatch between one of the column values and its defined schema, Spark not only sets that column to null but also sets all the columns after it to null. I made a minimal example to show the problem, as attached.
Basically, the values of "foo71" and "foo72" for row_2 conflict with the defined schema DecimalType(10,7). Then, in the created DataFrame (df_1), all the columns after "foo7" return null, even though the schema is correct for those columns. What I actually expect is that only "foo71" and "foo72" are null, and the rest of the columns after them keep their correct values.
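Roughly, the relevant part of the schema looks like this (a simplified sketch of what is in the attached notebook; only the columns discussed here, with the non-decimal columns shown as StringType just to keep the example short):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Simplified sketch of the schema; foo71 and foo72 are the DecimalType(10,7)
# columns that row_2 violates. The other foo* columns are StringType here
# only to keep the example short.
MySchema = StructType([
    StructField("foo1", StringType(), True),
    StructField("foo2", StringType(), True),
    StructField("foo3", StringType(), True),
    StructField("foo4", StringType(), True),
    StructField("foo5", StringType(), True),
    StructField("foo6", StringType(), True),
    StructField("foo7", StringType(), True),
    StructField("foo71", DecimalType(10, 7), True),
    StructField("foo72", DecimalType(10, 7), True),
    StructField("foo8", StringType(), True),
])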
I have used the following for this notebook:
1- Databricks Notebook in the databricks data science and engineering workspace
2- My cluster uses Databricks Runtime Version 12.0 (includes Apache Spark 3.3.1, Scala 2.12)
I have a workaround for this problem, but I thought it was worth reporting so it can be fixed, because it can potentially cause data loss in similar situations.
Thank you,
Farzad
04-10-2023 06:11 AM
@Farzad Bonabi :
Thank you for reporting this issue. It seems to be a known bug in Spark when dealing with malformed decimal values. When a decimal value in the input JSON data is not parseable by Spark, it sets not only that column to null but also all subsequent columns to null.
One workaround for this issue is to use the spark.read.json() method instead of spark.createDataFrame() and then select the columns of interest. Here's an example:
from pyspark.sql.functions import col

# Newline-delimited JSON: one object per row, matching the schema (MySchema)
# defined in the original post. row_2 has malformed values in the two
# DecimalType(10,7) columns, foo71 and foo72.
json_data = '''
{"foo1": "1", "foo2": "2", "foo3": "3", "foo4": "4", "foo5": "5", "foo6": "6", "foo7": "7", "foo71": "1.2345678", "foo72": "123.456789", "foo8": "8"}
{"foo1": "10", "foo2": "20", "foo3": "30", "foo4": "40", "foo5": "50", "foo6": "60", "foo7": "70", "foo71": "invalid_value", "foo72": "invalid_value", "foo8": "80"}
'''

# Read through the DataFrameReader instead of spark.createDataFrame(),
# parallelizing one JSON object per record.
df = spark.read.json(sc.parallelize(json_data.strip().split("\n")), schema=MySchema)

# Select the columns of interest.
df_1 = df.select(
    col("foo1"), col("foo2"), col("foo3"), col("foo4"), col("foo5"),
    col("foo6"), col("foo7"), col("foo71"), col("foo72"), col("foo8"),
)
df_1.show()
This should produce the expected output where only the "foo71" and "foo72" columns are null and the rest of the columns have the correct values.
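If you also want to see which raw records failed the schema, one option is Spark's corrupt-record column (a sketch that reuses MySchema and json_data from above; "_corrupt_record" is the default name Spark uses for this column):

from pyspark.sql.types import StructField, StringType

# Extend the schema with a column that captures the raw JSON text of any
# record that does not fully parse under the schema.
schema_with_corrupt = MySchema.add(StructField("_corrupt_record", StringType(), True))

df_debug = (
    spark.read
    .schema(schema_with_corrupt)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json(sc.parallelize(json_data.strip().split("\n")))
)

# Rows that hit a schema mismatch keep their original JSON text here.
df_debug.filter("_corrupt_record IS NOT NULL").show(truncate=False)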
04-20-2023 05:04 AM
Hi @Suteja Kanuri
Thank you for your reply and suggestion. We solved the issue another way, but your suggestion also looks like a good approach.
I mainly wanted to report the bug, but it seems it is already a known issue. 🙂
03-21-2024 02:04 PM - edited 03-21-2024 02:11 PM
Hi Farbo,
We have been facing a similar issue where the schema is missing for most of the rows while reading JSON data, even when using spark.read.json. We saw this after upgrading Spark from 2.4 to 3.3, and we had the same issue on Spark 4 as well.
Could you please suggest the alternative solution that worked for you?
Thanks,
Phani
05-11-2023 02:07 AM
Hi @Suteja Kanuri,
Would you provide a link to the issue tracker for that bug?