@Farzad Bonabi :
Thank you for reporting this issue. This looks like a known bug in Spark when dealing with malformed decimal values: when a decimal value in the input JSON data cannot be parsed, Spark sets not only that column but also all subsequent columns in the row to null.
One workaround for this issue is to use the spark.read.json() method instead of spark.createDataFrame() and then select the columns of interest. Here's an example:
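Since MySchema isn't shown in this thread, the example below uses an assumed definition in which foo71 and foo72 are the decimal columns; adjust it to match your actual schema.

from pyspark.sql.types import DecimalType, StringType, StructField, StructType

# Assumed stand-in for the schema from the original post: foo71/foo72 as decimals,
# everything else as strings. Replace with the real MySchema if it differs.
MySchema = StructType([
    StructField("foo1", StringType(), True),
    StructField("foo2", StringType(), True),
    StructField("foo3", StringType(), True),
    StructField("foo4", StringType(), True),
    StructField("foo5", StringType(), True),
    StructField("foo6", StringType(), True),
    StructField("foo7", StringType(), True),
    StructField("foo71", DecimalType(10, 7), True),
    StructField("foo72", DecimalType(10, 6), True),
    StructField("foo8", StringType(), True),
])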
import json

from pyspark.sql.functions import col

json_data = '''
{
    "row_1": {
        "foo1": "1",
        "foo2": "2",
        "foo3": "3",
        "foo4": "4",
        "foo5": "5",
        "foo6": "6",
        "foo7": "7",
        "foo71": "1.2345678",
        "foo72": "123.456789",
        "foo8": "8"
    },
    "row_2": {
        "foo1": "10",
        "foo2": "20",
        "foo3": "30",
        "foo4": "40",
        "foo5": "50",
        "foo6": "60",
        "foo7": "70",
        "foo71": "invalid_value",
        "foo72": "invalid_value",
        "foo8": "80"
    }
}
'''

# spark.read.json() expects one JSON document per record, so split the
# row_1/row_2 entries into individual JSON strings before parallelizing them.
records = [json.dumps(row) for row in json.loads(json_data).values()]

# MySchema is the schema from the original post (an assumed definition is sketched above).
df = spark.read.json(sc.parallelize(records), schema=MySchema)

df_1 = df.select(
    col("foo1"), col("foo2"), col("foo3"), col("foo4"), col("foo5"),
    col("foo6"), col("foo7"), col("foo71"), col("foo72"), col("foo8"),
)
df_1.show()
This should produce the expected output: only the "foo71" and "foo72" values in the second row are null, while every other column keeps its correct value.
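If it also helps to see which input records failed to parse, the JSON reader can surface them through a corrupt-record column when reading in PERMISSIVE mode. This is only a sketch on top of the assumed schema above; the "_corrupt_record" name is just a convention, and any StringType field passed to columnNameOfCorruptRecord works.

from pyspark.sql.types import StringType, StructField, StructType

# Add a StringType field to the schema to receive the raw text of malformed records.
schema_with_corrupt = StructType(MySchema.fields + [StructField("_corrupt_record", StringType(), True)])

df_debug = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema_with_corrupt)
    .json(sc.parallelize(records))
)

# Rows whose decimals could not be parsed should keep their raw JSON in _corrupt_record.
df_debug.show(truncate=False)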