Spark issue handling data from JSON when a schema DataType mismatch occurs
01-05-2023 06:57 AM
Hi,
I have encountered a problem using Spark when creating a DataFrame from a raw JSON source.
I have defined a schema for my data, and the problem is that when there is a mismatch between one of the column values and its defined schema, Spark not only sets that column to null but also sets all the columns after it to null. I made a minimal example to show the problem, as attached.
Basically, the values of "foo71" and "foo72" for row_2 conflict with the defined schema DecimalType(10,7). Then, in the created DataFrame (df_1), all the columns after "foo7" return null, even though the schema is correct for those columns. What I actually expect is that only "foo71" and "foo72" are null, and the rest of the columns after them keep their correct values.
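Roughly, the relevant part of the schema looks like this (a simplified sketch of what is in the attached notebook; only the columns discussed here, with the non-decimal columns shown as StringType just to keep the example short):

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Simplified sketch of the schema; foo71 and foo72 are the DecimalType(10,7)
# columns that row_2 violates. The other foo* columns are StringType here
# only to keep the example short.
MySchema = StructType([
    StructField("foo1", StringType(), True),
    StructField("foo2", StringType(), True),
    StructField("foo3", StringType(), True),
    StructField("foo4", StringType(), True),
    StructField("foo5", StringType(), True),
    StructField("foo6", StringType(), True),
    StructField("foo7", StringType(), True),
    StructField("foo71", DecimalType(10, 7), True),
    StructField("foo72", DecimalType(10, 7), True),
    StructField("foo8", StringType(), True),
])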
I have used the following for this notebook:
1- Databricks Notebook in the databricks data science and engineering workspace
2- My cluster uses Databricks Runtime Version 12.0 (includes Apache Spark 3.3.1, Scala 2.12)
I have a workaround for this problem, but I thought it was worth reporting so it can be fixed, because it can potentially cause data loss in similar situations.
Thank you,
Farzad
04-10-2023 06:11 AM
@Farzad Bonabi :
Thank you for reporting this issue. It seems to be a known bug in Spark when dealing with malformed decimal values. When a decimal value in the input JSON data is not parseable by Spark, it sets not only that column to null but also all subsequent columns to null.
One workaround for this issue is to use the spark.read.json() method instead of spark.createDataFrame() and then select the columns of interest. Here's an example:
from pyspark.sql.functions import col

# Newline-delimited JSON: one object per row, matching the schema (MySchema)
# defined in the original post. row_2 has malformed values in the two
# DecimalType(10,7) columns, foo71 and foo72.
json_data = '''
{"foo1": "1", "foo2": "2", "foo3": "3", "foo4": "4", "foo5": "5", "foo6": "6", "foo7": "7", "foo71": "1.2345678", "foo72": "123.456789", "foo8": "8"}
{"foo1": "10", "foo2": "20", "foo3": "30", "foo4": "40", "foo5": "50", "foo6": "60", "foo7": "70", "foo71": "invalid_value", "foo72": "invalid_value", "foo8": "80"}
'''

# Read through the DataFrameReader instead of spark.createDataFrame(),
# parallelizing one JSON object per record.
df = spark.read.json(sc.parallelize(json_data.strip().split("\n")), schema=MySchema)

# Select the columns of interest.
df_1 = df.select(
    col("foo1"), col("foo2"), col("foo3"), col("foo4"), col("foo5"),
    col("foo6"), col("foo7"), col("foo71"), col("foo72"), col("foo8"),
)
df_1.show()
This should produce the expected output where only the "foo71" and "foo72" columns are null and the rest of the columns have the correct values.
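If you also want to see which raw records failed the schema, one option is Spark's corrupt-record column (a sketch that reuses MySchema and json_data from above; "_corrupt_record" is the default name Spark uses for this column):

from pyspark.sql.types import StructField, StringType

# Extend the schema with a column that captures the raw JSON text of any
# record that does not fully parse under the schema.
schema_with_corrupt = MySchema.add(StructField("_corrupt_record", StringType(), True))

df_debug = (
    spark.read
    .schema(schema_with_corrupt)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json(sc.parallelize(json_data.strip().split("\n")))
)

# Rows that hit a schema mismatch keep their original JSON text here.
df_debug.filter("_corrupt_record IS NOT NULL").show(truncate=False)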
04-20-2023 05:04 AM
Hi @Suteja Kanuri
Thank you for your reply and suggestion. We solved the issue another way, but your suggestion also looks like a good approach.
I mainly wanted to report the bug, but it seems it is already a known issue. 🙂
03-21-2024 02:04 PM - edited 03-21-2024 02:11 PM
Hi Farbo,
We have been facing a similar issue where the schema is missing for most of the rows while reading JSON data, even when using spark.read.json. We saw this after upgrading Spark from 2.4 to 3.3, and we had the same issue on Spark 4 as well.
Could you please suggest the alternative solution that worked for you?
Thanks,
Phani
05-11-2023 02:07 AM
Hi @Suteja Kanuri,
Would you provide a link to the issue tracker for that bug?