<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: BufferHolder Exceeded on Json flattening in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12881#M7636</link>
    <description>&lt;P&gt;Hi , thank you for your response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll give this a try!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 20 Oct 2021 16:02:21 GMT</pubDate>
    <dc:creator>D3nnisd</dc:creator>
    <dc:date>2021-10-20T16:02:21Z</dc:date>
    <item>
      <title>BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12873#M7628</link>
      <description>&lt;P&gt;On Databricks, we use the following Python code to flatten JSON from a REST API:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.read.format("json").option("header", "true").option("multiline", "true").load(SourceFileFolder + sourcetable + "*.json")
df2 = df.select(psf.explode('value').alias('tmp')).select('tmp.*')
df2.write.format("delta").save(DeltaLakeFolder)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;We don't know the schemas, as they change, so the code is kept as generic as possible. However, now that the JSON files have grown above 2.8 GB, I see the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 168 because the size after growing exceeds size limitation 2147483632&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The JSON looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
  "@odata.context": "RANDOMSTRING)",
  "value": [
    {
      "COL1": null,
      "COL2": "VAL2",
      "COL3": "VAL3",
      "COL4": "VAL4",
      "COL5": "VAL5",
      "COL6": "VAL6",
      "COL8": "VAL7",
      "COL9": null
    },
    {
      "COL1": null,
      "COL2": "VAL2",
      "COL3": "VAL3",
      "COL4": "VAL4",
      "COL5": "VAL5",
      "COL6": "VAL6",
      "COL8": "VAL7",
      "COL9": null
    },
    {
      "COL1": null,
      "COL2": "VAL2",
      "COL3": "VAL3",
      "COL4": "VAL4",
      "COL5": "VAL5",
      "COL6": "VAL6",
      "COL8": "VAL7",
      "COL9": null
    }
  ]
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;How can I resolve or work around this?&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;Kind regards,&lt;/P&gt;&lt;P&gt;Dennis&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 14:39:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12873#M7628</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T14:39:35Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12874#M7629</link>
      <description>&lt;P&gt;If you use multiline = true, the file will be read as a whole&lt;/P&gt;&lt;P&gt;(see &lt;A href="https://docs.databricks.com/data/data-sources/read-json.html" alt="https://docs.databricks.com/data/data-sources/read-json.html" target="_blank"&gt;https://docs.databricks.com/data/data-sources/read-json.html&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;So if you can, try single-line mode.&lt;/P&gt;&lt;P&gt;If that is not possible, you can try to split the JSON file on the storage, or use larger workers/driver.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:06:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12874#M7629</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-20T15:06:03Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12875#M7630</link>
      <description>&lt;P&gt;Thank you for your quick response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried single-line mode, but that resulted in: Caused by: java.io.IOException: Too many bytes before newline: 2147483648&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am currently on 28 GB / 4-core instances, with 8 worker nodes and 1 driver. What would increase the buffer in my case?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:18:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12875#M7630</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T15:18:18Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12876#M7631</link>
      <description>&lt;P&gt;I've edited the JSON, as it also contains a single string at the start of the file. I think that's why single-line mode doesn't work.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:26:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12876#M7631</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T15:26:26Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12877#M7632</link>
      <description>&lt;P&gt;If that string is too big, you will run into problems indeed.&lt;/P&gt;&lt;P&gt;Also keep in mind that each line must contain a separate, self-contained valid JSON object.&lt;/P&gt;&lt;P&gt;(&lt;A href="https://spark.apache.org/docs/3.2.0/sql-data-sources-json.html#content" alt="https://spark.apache.org/docs/3.2.0/sql-data-sources-json.html#content" target="_blank"&gt;https://spark.apache.org/docs/3.2.0/sql-data-sources-json.html#content&lt;/A&gt;)&lt;/P&gt;&lt;P&gt;So you might have to do some editing first.&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/35990846/spark-read-json-throwing-java-io-ioexception-too-many-bytes-before-newline" alt="https://stackoverflow.com/questions/35990846/spark-read-json-throwing-java-io-ioexception-too-many-bytes-before-newline" target="_blank"&gt;Here&lt;/A&gt; someone also posted a possible solution to edit the json.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:36:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12877#M7632</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-20T15:36:40Z</dc:date>
    </item>
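The edit werners suggests, one self-contained JSON object per line, can be sketched in plain Python outside Spark. This is an in-memory sketch only: the envelope and field names are taken from the sample in the original post, and a file close to 2 GB would need a streaming parser rather than a single `json.loads`.

```python
import json

# Unwrap the OData-style envelope and emit one JSON object per line
# (JSON Lines), which Spark can then read without multiline=true.
payload = '''{
    "@odata.context": "RANDOMSTRING)",
    "value": [
        {"COL1": null, "COL2": "VAL2"},
        {"COL1": null, "COL2": "VAL2"}
    ]
}'''

records = json.loads(payload)["value"]             # keep only the array
ndjson = "\n".join(json.dumps(r) for r in records)  # one record per line
print(ndjson)
```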
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12878#M7633</link>
      <description>&lt;P&gt;You can also just read it as a text file into an RDD and then read that RDD as JSON. When reading it as a text file, it is easy to set the partitioning, so the input will not be one big object but, for example, 8 partitions:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;rdd = sc.textFile(sourceFile, 8)
df = spark.read.json(rdd)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;If there are problems with the JSON text, you can clean it up with a map:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;rdd = sc.textFile(sourceFile, 8).map(your_cleanup_function)  # e.g. a lambda that fixes each line
df = spark.read.json(rdd)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:56:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12878#M7633</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-20T15:56:59Z</dc:date>
    </item>
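A cleanup function of the kind hinted at above might look like the plain-Python sketch below. It is hypothetical: it assumes each record happens to sit on its own line with a trailing comma left over from the surrounding JSON array, which is not guaranteed for arbitrary input.

```python
import json

def clean_line(line: str) -> str:
    # Hypothetical per-line cleanup, suitable for rdd.map(clean_line):
    # strip surrounding whitespace and the trailing comma that separates
    # array elements, so each remaining line is standalone JSON.
    line = line.strip()
    return line[:-1] if line.endswith(",") else line

print(clean_line('    {"COL1": null, "COL2": "VAL2"},'))
```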
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12879#M7634</link>
      <description>&lt;P&gt;@Dennis D​&amp;nbsp;, what's happening here is that more than 2 GB (2147483648 bytes) is being loaded into a single column value. This is a hard-limit for serialization. &lt;A href="https://kb.databricks.com/sql/cannot-grow-bufferholder-exceeds-size.html?_ga=2.253773384.1725562219.1634572497-1833030568.1621870634" alt="https://kb.databricks.com/sql/cannot-grow-bufferholder-exceeds-size.html?_ga=2.253773384.1725562219.1634572497-1833030568.1621870634" target="_blank"&gt;This KB article &lt;/A&gt;addresses it. The solution would be to find some way to have this loaded into different column values. I'll poke around and see if I can figure out a nice way to code it. Will respond if I find something, otherwise you'll need to take werner's approach and edit the incoming JSON.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:57:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12879#M7634</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T15:57:02Z</dc:date>
    </item>
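The size limitation quoted in the error message sits just below the signed 32-bit maximum, which is why this is a hard limit rather than a tunable config. A quick check of the arithmetic (the constant is copied from the error in the original post):

```python
# The cap reported in the BufferHolder error (2147483632) is 15 bytes
# below Java's Integer.MAX_VALUE, the ceiling for a single byte array
# on the JVM and hence for one serialized column value in Spark.
INT_MAX = 2**31 - 1          # 2147483647
LIMIT = 2147483632           # from the error message
print(INT_MAX - LIMIT)       # 15
```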
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12880#M7635</link>
      <description>&lt;P&gt;Hi , thanks for the response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What i do is first load the full json (with the structure above), and then explode the value part. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Would it be possible to instantly load and explode it into the dataframe so we never reach that limit in a column? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:59:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12880#M7635</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T15:59:52Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12881#M7636</link>
      <description>&lt;P&gt;Hi , thank you for your response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll give this a try!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:02:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12881#M7636</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T16:02:21Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12882#M7637</link>
      <description>&lt;P&gt;With the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;rdd = sc.textFile(SourceFileFolder + sourcetable + "*.json", 8)
df = spark.read.option("header", "true").option("multiline", "true").json(rdd)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I am getting:&lt;/P&gt;&lt;P&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 7.0 failed 4 times, most recent failure: Lost task 2.3 in stage 7.0 (TID 113) (10.139.64.6 executor 8): java.io.IOException: Too many bytes before newline: 2147483648&lt;/P&gt;&lt;P&gt;Am I missing something?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:14:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12882#M7637</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T16:14:43Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12883#M7638</link>
      <description>&lt;P&gt;Hm, you might be able to do something like this using a UDF. That would force the read and explode to happen on one node. But using a UDF would require you to know the output schema. I think the best bet is to set up some sort of incremental JSON parser, SAX-style. I haven't done that before.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:43:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12883#M7638</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T16:43:35Z</dc:date>
    </item>
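An incremental parse of the kind gestured at here can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which decodes one value at a time from a string offset. This toy version assumes the envelope from the original post; for real multi-gigabyte files you would feed chunks from disk, or reach for a dedicated streaming library such as ijson.

```python
import json

def iter_records(buf: str):
    # Decode one object at a time from the "value" array without
    # materializing the whole array as a single Python value.
    dec = json.JSONDecoder()
    i = buf.index("[") + 1                 # step inside the value array
    while True:
        while i < len(buf) and buf[i] in " \t\r\n,":
            i += 1                         # skip separators
        if i >= len(buf) or buf[i] == "]":
            return                         # end of array
        obj, i = dec.raw_decode(buf, i)    # parse exactly one object
        yield obj

sample = '{"@odata.context": "x", "value": [{"COL1": null}, {"COL1": "a"}]}'
records = list(iter_records(sample))
print(len(records))  # 2
```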
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12884#M7639</link>
      <description>&lt;P&gt;I do know the output schema, but there are loads of files, so I'd like to avoid having to create schemas for each of them. Would you be able to provide me with a small sample based on the schema in my original post? I am not familiar with UDFs.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:46:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12884#M7639</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T16:46:16Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12885#M7640</link>
      <description>&lt;P&gt;So for instance:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import Row
from pyspark.sql.types import *
&amp;nbsp;
file = [Row(file="/dbfs/tmp/comm_q_json.json")]
df = spark.createDataFrame(file, "file: string")  # one row per input file path
&amp;nbsp;
def read_process_json(iterator):
  import pandas as pd
  
  def process_from_path(path):
    rawJSON = pd.read_json(path)  # pd.io.json.read_json is the deprecated alias
    return pd.json_normalize(rawJSON['value'])
  
  for pdf in iterator:
    DFseries = pdf["file"].apply(process_from_path).tolist()
    yield pd.concat(DFseries)
    
outSchema = StructType([
  StructField("COL1",StringType(),True),
  StructField("COL2",StringType(),True),
  StructField("COL3",StringType(),True),
  StructField("COL4",StringType(),True),
  StructField("COL5",StringType(),True),
  StructField("COL6",StringType(),True),
  StructField("COL8",StringType(),True),
  StructField("COL9",StringType(),True)
])
&amp;nbsp;
display(df.mapInPandas(read_process_json, schema=outSchema))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here we rely on the pandas API to do the JSON wrangling. I'm not 100% sure it would work, but who knows. You might have to tweak some configs if the results are too large to serialize.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 17:46:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12885#M7640</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T17:46:29Z</dc:date>
    </item>
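The flattening step inside `read_process_json` can be dry-run in plain pandas, outside Spark; the sample dict below mirrors the structure from the original post.

```python
import pandas as pd

# Normalize the "value" array of records into columns, exactly the
# per-file transformation that read_process_json applies above.
raw = {
    "@odata.context": "RANDOMSTRING)",
    "value": [
        {"COL1": None, "COL2": "VAL2", "COL3": "VAL3"},
        {"COL1": None, "COL2": "VAL2", "COL3": "VAL3"},
    ],
}

flat = pd.json_normalize(raw["value"])
print(flat.shape)  # (2, 3)
```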
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12886#M7641</link>
      <description>&lt;P&gt;Also- this is what I used to load the data on disk:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.put("/tmp/comm_q_json.json",
"""
{
"@odata.context": "RANDOMSTRING)",
"value": [
{
"COL1": null,
"COL2": "VAL2",
"COL3": "VAL3",
"COL4": "VAL4",
"COL5": "VAL5",
"COL6": "VAL6",
"COL8": "VAL7",
"COL9": null
},
{
"COL1": null,
"COL2": "VAL2",
"COL3": "VAL3",
"COL4": "VAL4",
"COL5": "VAL5",
"COL6": "VAL6",
"COL8": "VAL7",
"COL9": null
},
{
"COL1": null,
"COL2": "VAL2",
"COL3": "VAL3",
"COL4": "VAL4",
"COL5": "VAL5",
"COL6": "VAL6",
"COL8": "VAL7",
"COL9": null
}
]
}
""", True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 17:48:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12886#M7641</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T17:48:45Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12887#M7642</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried your solution and it works for most tables. Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried another table, but I get:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;ValueError: Unexpected character found when decoding object value&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am not sure how to solve this, as I can't figure out where in the JSON the problem is. Is there a way to fix it automatically?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 Oct 2021 08:19:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12887#M7642</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-21T08:19:41Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12888#M7643</link>
      <description>&lt;P&gt;Nevermind. The JSON was somehow corrupted. I re-extracted and it worked out of the box &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 Oct 2021 10:29:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12888#M7643</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-21T10:29:16Z</dc:date>
    </item>
  </channel>
</rss>

