<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: BufferHolder Exceeded on Json flattening in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12881#M7636</link>
    <description>&lt;P&gt;Hi , thank you for your response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll give this a try!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 20 Oct 2021 16:02:21 GMT</pubDate>
    <dc:creator>D3nnisd</dc:creator>
    <dc:date>2021-10-20T16:02:21Z</dc:date>
    <item>
      <title>BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12873#M7628</link>
      <description>&lt;P&gt;On Databricks, we use the following Python code to flatten JSON from a REST API:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.read.format("json").option("header", "true").option("multiline", "true").load(SourceFileFolder + sourcetable + "*.json")
df2 = df.select(psf.explode('value').alias('tmp')).select('tmp.*')
df2.write.format("delta").save(DeltaLakeFolder)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;We don't know the schemas, as they change, so the code is kept as generic as possible. However, now that the JSON files have grown above 2.8 GB, I see the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 168 because the size after growing exceeds size limitation 2147483632&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The JSON looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
  "@odata.context": "RANDOMSTRING)",
  "value": [
    {
      "COL1": null,
      "COL2": "VAL2",
      "COL3": "VAL3",
      "COL4": "VAL4",
      "COL5": "VAL5",
      "COL6": "VAL6",
      "COL8": "VAL7",
      "COL9": null
    },
    {
      "COL1": null,
      "COL2": "VAL2",
      "COL3": "VAL3",
      "COL4": "VAL4",
      "COL5": "VAL5",
      "COL6": "VAL6",
      "COL8": "VAL7",
      "COL9": null
    },
    {
      "COL1": null,
      "COL2": "VAL2",
      "COL3": "VAL3",
      "COL4": "VAL4",
      "COL5": "VAL5",
      "COL6": "VAL6",
      "COL8": "VAL7",
      "COL9": null
    }
  ]
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;How can I resolve or work around this?&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;Kind regards,&lt;/P&gt;&lt;P&gt;Dennis&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 14:39:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12873#M7628</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T14:39:35Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12874#M7629</link>
      <description>&lt;P&gt;If you use multiline = true, the file will be read as a whole&lt;/P&gt;&lt;P&gt;(see &lt;A href="https://docs.databricks.com/data/data-sources/read-json.html" alt="https://docs.databricks.com/data/data-sources/read-json.html" target="_blank"&gt;https://docs.databricks.com/data/data-sources/read-json.html&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;So if you can, try single-line mode.&lt;/P&gt;&lt;P&gt;If that is not possible, you can try to split the JSON file on the storage, or use larger workers/driver.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:06:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12874#M7629</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-20T15:06:03Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12875#M7630</link>
      <description>&lt;P&gt;Thank you for your quick response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried single-line mode, but that resulted in: Caused by: java.io.IOException: Too many bytes before newline: 2147483648&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am currently on 28 GB / 4-core instances, with 8 worker nodes and 1 driver. What would increase the buffer in my case?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:18:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12875#M7630</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T15:18:18Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12876#M7631</link>
      <description>&lt;P&gt;I've edited the JSON, as it also contains a single string at the start of the file. I think that's why single-line mode doesn't work.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:26:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12876#M7631</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T15:26:26Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12877#M7632</link>
      <description>&lt;P&gt;If that string is too big, you will run into problems indeed.&lt;/P&gt;&lt;P&gt;Also keep in mind that each line must contain a separate, self-contained valid JSON object.&lt;/P&gt;&lt;P&gt;(&lt;A href="https://spark.apache.org/docs/3.2.0/sql-data-sources-json.html#content" alt="https://spark.apache.org/docs/3.2.0/sql-data-sources-json.html#content" target="_blank"&gt;https://spark.apache.org/docs/3.2.0/sql-data-sources-json.html#content&lt;/A&gt;)&lt;/P&gt;&lt;P&gt;So you might have to do some editing first.&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/35990846/spark-read-json-throwing-java-io-ioexception-too-many-bytes-before-newline" alt="https://stackoverflow.com/questions/35990846/spark-read-json-throwing-java-io-ioexception-too-many-bytes-before-newline" target="_blank"&gt;Here&lt;/A&gt; someone also posted a possible solution to edit the json.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:36:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12877#M7632</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-20T15:36:40Z</dc:date>
    </item>
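The edit werners suggests, one self-contained JSON object per line, can be sketched in plain Python outside Spark. This is an in-memory sketch only: the envelope and field names are taken from the sample in the original post, and a file close to 2 GB would need a streaming parser rather than a single `json.loads`.

```python
import json

# Unwrap the OData-style envelope and emit one JSON object per line
# (JSON Lines), which Spark can then read without multiline=true.
payload = '''{
    "@odata.context": "RANDOMSTRING)",
    "value": [
        {"COL1": null, "COL2": "VAL2"},
        {"COL1": null, "COL2": "VAL2"}
    ]
}'''

records = json.loads(payload)["value"]             # keep only the array
ndjson = "\n".join(json.dumps(r) for r in records)  # one record per line
print(ndjson)
```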
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12878#M7633</link>
      <description>&lt;P&gt;You can also just read it as a text file into an RDD and then read that RDD as JSON. When reading it as a text file, it is easy to set the partitioning, so the input will not be one big object but, for example, 8 partitions:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;rdd = sc.textFile(sourceFile, 8)
df = spark.read.json(rdd)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;If there are problems with the JSON text, you can clean it up with a map:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;rdd = sc.textFile(sourceFile, 8).map(your_cleanup_function)  # e.g. a lambda that fixes each line
df = spark.read.json(rdd)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:56:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12878#M7633</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-20T15:56:59Z</dc:date>
    </item>
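A cleanup function of the kind hinted at above might look like the plain-Python sketch below. It is hypothetical: it assumes each record happens to sit on its own line with a trailing comma left over from the surrounding JSON array, which is not guaranteed for arbitrary input.

```python
import json

def clean_line(line: str) -> str:
    # Hypothetical per-line cleanup, suitable for rdd.map(clean_line):
    # strip surrounding whitespace and the trailing comma that separates
    # array elements, so each remaining line is standalone JSON.
    line = line.strip()
    return line[:-1] if line.endswith(",") else line

print(clean_line('    {"COL1": null, "COL2": "VAL2"},'))
```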
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12879#M7634</link>
      <description>&lt;P&gt;@Dennis D​&amp;nbsp;, what's happening here is that more than 2 GB (2147483648 bytes) is being loaded into a single column value. This is a hard-limit for serialization. &lt;A href="https://kb.databricks.com/sql/cannot-grow-bufferholder-exceeds-size.html?_ga=2.253773384.1725562219.1634572497-1833030568.1621870634" alt="https://kb.databricks.com/sql/cannot-grow-bufferholder-exceeds-size.html?_ga=2.253773384.1725562219.1634572497-1833030568.1621870634" target="_blank"&gt;This KB article &lt;/A&gt;addresses it. The solution would be to find some way to have this loaded into different column values. I'll poke around and see if I can figure out a nice way to code it. Will respond if I find something, otherwise you'll need to take werner's approach and edit the incoming JSON.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:57:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12879#M7634</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T15:57:02Z</dc:date>
    </item>
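The size limitation quoted in the error message sits just below the signed 32-bit maximum, which is why this is a hard limit rather than a tunable config. A quick check of the arithmetic (the constant is copied from the error in the original post):

```python
# The cap reported in the BufferHolder error (2147483632) is 15 bytes
# below Java's Integer.MAX_VALUE, the ceiling for a single byte array
# on the JVM and hence for one serialized column value in Spark.
INT_MAX = 2**31 - 1          # 2147483647
LIMIT = 2147483632           # from the error message
print(INT_MAX - LIMIT)       # 15
```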
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12880#M7635</link>
      <description>&lt;P&gt;Hi , thanks for the response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What i do is first load the full json (with the structure above), and then explode the value part. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Would it be possible to instantly load and explode it into the dataframe so we never reach that limit in a column? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 15:59:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12880#M7635</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T15:59:52Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12881#M7636</link>
      <description>&lt;P&gt;Hi , thank you for your response!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll give this a try!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:02:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12881#M7636</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T16:02:21Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12882#M7637</link>
      <description>&lt;P&gt;With the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;rdd = sc.textFile(SourceFileFolder + sourcetable + "*.json", 8)
df = spark.read.option("header", "true").option("multiline", "true").json(rdd)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I am getting:&lt;/P&gt;&lt;P&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 7.0 failed 4 times, most recent failure: Lost task 2.3 in stage 7.0 (TID 113) (10.139.64.6 executor 8): java.io.IOException: Too many bytes before newline: 2147483648&lt;/P&gt;&lt;P&gt;Am I missing something?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:14:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12882#M7637</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T16:14:43Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12883#M7638</link>
      <description>&lt;P&gt;Hm, you might be able to do something like this using a UDF. That would force the read and explode to happen on one node. But using a UDF would require you to know the output schema. I think the best bet is to set up some sort of incremental JSON parser, SAX-style. I haven't done that before.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:43:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12883#M7638</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T16:43:35Z</dc:date>
    </item>
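An incremental parse of the kind gestured at here can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which decodes one value at a time from a string offset. This toy version assumes the envelope from the original post; for real multi-gigabyte files you would feed chunks from disk, or reach for a dedicated streaming library such as ijson.

```python
import json

def iter_records(buf: str):
    # Decode one object at a time from the "value" array without
    # materializing the whole array as a single Python value.
    dec = json.JSONDecoder()
    i = buf.index("[") + 1                 # step inside the value array
    while True:
        while i < len(buf) and buf[i] in " \t\r\n,":
            i += 1                         # skip separators
        if i >= len(buf) or buf[i] == "]":
            return                         # end of array
        obj, i = dec.raw_decode(buf, i)    # parse exactly one object
        yield obj

sample = '{"@odata.context": "x", "value": [{"COL1": null}, {"COL1": "a"}]}'
records = list(iter_records(sample))
print(len(records))  # 2
```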
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12884#M7639</link>
      <description>&lt;P&gt;I do know the output schema, but there are loads of files, so I'd like to avoid having to create schemas for each of them. Would you be able to provide me with a small sample based on the schema in my original post? I am not familiar with UDFs.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 16:46:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12884#M7639</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-20T16:46:16Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12885#M7640</link>
      <description>&lt;P&gt;So for instance:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import Row
from pyspark.sql.types import *
&amp;nbsp;
file = [Row(file="/dbfs/tmp/comm_q_json.json")]
df = spark.createDataFrame(file, "file: string")  # one row per input file path
&amp;nbsp;
def read_process_json(iterator):
  import pandas as pd
  
  def process_from_path(path):
    rawJSON = pd.read_json(path)  # pd.io.json.read_json is the deprecated alias
    return pd.json_normalize(rawJSON['value'])
  
  for pdf in iterator:
    DFseries = pdf["file"].apply(process_from_path).tolist()
    yield pd.concat(DFseries)
    
outSchema = StructType([
  StructField("COL1",StringType(),True),
  StructField("COL2",StringType(),True),
  StructField("COL3",StringType(),True),
  StructField("COL4",StringType(),True),
  StructField("COL5",StringType(),True),
  StructField("COL6",StringType(),True),
  StructField("COL8",StringType(),True),
  StructField("COL9",StringType(),True)
])
&amp;nbsp;
display(df.mapInPandas(read_process_json, schema=outSchema))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here we rely on the pandas API to do the JSON wrangling. I'm not 100% sure it would work, but who knows. You might have to tweak some configs if the results are too large to serialize.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 17:46:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12885#M7640</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T17:46:29Z</dc:date>
    </item>
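The flattening step inside `read_process_json` can be dry-run in plain pandas, outside Spark; the sample dict below mirrors the structure from the original post.

```python
import pandas as pd

# Normalize the "value" array of records into columns, exactly the
# per-file transformation that read_process_json applies above.
raw = {
    "@odata.context": "RANDOMSTRING)",
    "value": [
        {"COL1": None, "COL2": "VAL2", "COL3": "VAL3"},
        {"COL1": None, "COL2": "VAL2", "COL3": "VAL3"},
    ],
}

flat = pd.json_normalize(raw["value"])
print(flat.shape)  # (2, 3)
```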
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12886#M7641</link>
      <description>&lt;P&gt;Also- this is what I used to load the data on disk:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.put("/tmp/comm_q_json.json",
"""
{
"@odata.context": "RANDOMSTRING)",
"value": [
{
"COL1": null,
"COL2": "VAL2",
"COL3": "VAL3",
"COL4": "VAL4",
"COL5": "VAL5",
"COL6": "VAL6",
"COL8": "VAL7",
"COL9": null
},
{
"COL1": null,
"COL2": "VAL2",
"COL3": "VAL3",
"COL4": "VAL4",
"COL5": "VAL5",
"COL6": "VAL6",
"COL8": "VAL7",
"COL9": null
},
{
"COL1": null,
"COL2": "VAL2",
"COL3": "VAL3",
"COL4": "VAL4",
"COL5": "VAL5",
"COL6": "VAL6",
"COL8": "VAL7",
"COL9": null
}
]
}
""", True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 17:48:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12886#M7641</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T17:48:45Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12887#M7642</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried your solution and it works for most tables. Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried another table, but I get:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;ValueError: Unexpected character found when decoding object value&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am not sure how to solve this, as I can't figure out where in the JSON the problem is. Is there a way to fix it automatically?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 Oct 2021 08:19:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12887#M7642</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-21T08:19:41Z</dc:date>
    </item>
    <item>
      <title>Re: BufferHolder Exceeded on Json flattening</title>
      <link>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12888#M7643</link>
      <description>&lt;P&gt;Nevermind. The JSON was somehow corrupted. I re-extracted and it worked out of the box &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 Oct 2021 10:29:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/m-p/12888#M7643</guid>
      <dc:creator>D3nnisd</dc:creator>
      <dc:date>2021-10-21T10:29:16Z</dc:date>
    </item>
  </channel>
</rss>

