
Spark issue handling JSON data when a schema DataType mismatch occurs

FarBo
New Contributor III

Hi,

I have encountered a problem in Spark when creating a DataFrame from a raw JSON source.

I have defined a schema for my data, and the problem is that when there is a mismatch between one of the column values and its defined schema, Spark sets not only that column to null but also all the columns after it to null. I made a minimal example to show the problem (attached; the schema is sketched below).

Basically, the values of "foo71" and "foo72" for row_2 conflict with the defined schema, DecimalType(10,7). In the resulting DataFrame (df_1), all the columns after "foo7" come back null, even though the schema is correct for them. What I actually expect is that only "foo71" and "foo72" are null, and the columns after them keep their correct values.
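The schema I defined looks roughly like this (a sketch of the attached example; only "foo71" and "foo72" matter for the bug, and the other columns are kept as plain strings here to keep the sketch short):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Sketch of the schema from the minimal example: foo71/foo72 are the
# DecimalType(10,7) columns that trigger the problem; the rest are strings.
MySchema = StructType([
    StructField("foo1", StringType(), True),
    StructField("foo2", StringType(), True),
    StructField("foo3", StringType(), True),
    StructField("foo4", StringType(), True),
    StructField("foo5", StringType(), True),
    StructField("foo6", StringType(), True),
    StructField("foo7", StringType(), True),
    StructField("foo71", DecimalType(10, 7), True),
    StructField("foo72", DecimalType(10, 7), True),
    StructField("foo8", StringType(), True),
])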

I have used the following for this notebook:

1- A Databricks notebook in the Databricks Data Science & Engineering workspace

2- My cluster uses Databricks Runtime Version 12.0 (includes Apache Spark 3.3.1, Scala 2.12)

I have a workaround for this problem, but I thought it might be worth reporting so it can be fixed, because it can potentially cause data loss in similar situations.

Thank you,

Farzad

4 REPLIES

Anonymous
Not applicable

@Farzad Bonabi:

Thank you for reporting this issue. It seems to be a known bug in Spark when dealing with malformed decimal values. When a decimal value in the input JSON data is not parseable by Spark, it sets not only that column to null but also all subsequent columns to null.

One workaround for this issue is to use the spark.read.json() method instead of spark.createDataFrame() and then select the columns of interest. Here's an example:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Same schema as in the original post: foo71/foo72 are DecimalType(10,7),
# the remaining columns are strings.
MySchema = StructType(
    [StructField(f"foo{i}", StringType(), True) for i in range(1, 8)]
    + [
        StructField("foo71", DecimalType(10, 7), True),
        StructField("foo72", DecimalType(10, 7), True),
        StructField("foo8", StringType(), True),
    ]
)

# One JSON document per record; the second record carries the malformed
# decimal values for foo71/foo72.
json_rows = [
    '{"foo1": "1", "foo2": "2", "foo3": "3", "foo4": "4", "foo5": "5", '
    '"foo6": "6", "foo7": "7", "foo71": "1.2345678", "foo72": "123.456789", "foo8": "8"}',
    '{"foo1": "10", "foo2": "20", "foo3": "30", "foo4": "40", "foo5": "50", '
    '"foo6": "60", "foo7": "70", "foo71": "invalid_value", "foo72": "invalid_value", "foo8": "80"}',
]

df = spark.read.json(sc.parallelize(json_rows), schema=MySchema)
df_1 = df.select(
    col("foo1"), col("foo2"), col("foo3"), col("foo4"), col("foo5"),
    col("foo6"), col("foo7"), col("foo71"), col("foo72"), col("foo8"),
)
df_1.show()

This should produce the expected output, where only the "foo71" and "foo72" columns are null for the malformed row and the rest of the columns have the correct values.
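A related tip, using standard Spark JSON options rather than anything specific to this bug: if you also want to see which input records were affected instead of silently getting nulls, you can add a corrupt-record column. A sketch, reusing MySchema and json_rows from above:

from pyspark.sql.functions import col
from pyspark.sql.types import StructField, StringType

# Extend the schema with a corrupt-record column; in PERMISSIVE mode Spark
# keeps the raw text of any record it could not fully parse in that column.
schema_with_corrupt = MySchema.add(StructField("_corrupt_record", StringType(), True))

df_checked = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json(sc.parallelize(json_rows), schema=schema_with_corrupt)
)

# Rows that failed to parse cleanly against the schema.
df_checked.filter(col("_corrupt_record").isNotNull()).show(truncate=False)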

FarBo
New Contributor III

Hi @Suteja Kanuri,

Thank you for your reply and suggestion. We have solved the issue in another way, but your suggestion also seems like a good approach.

I mainly wanted to report the bug, but it seems it is a known issue. 🙂

Hi FarBo,

We have been facing a similar kind of issue, where the schema is missing for most of the rows when reading JSON data, even with spark.read.json. We first saw this after upgrading Spark from 2.4 to 3.3, and we still see it on Spark 4.

Could you please share the alternative solution that worked for you?

Thanks,

Phani

 

Hi @Suteja Kanuri,

Would you provide a link to the issue tracker for that bug?
