Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Photon engine throws error "JSON document exceeded maximum allowed size 400.0 MiB"

kk007
New Contributor III

I am reading an 83 MB JSON file using spark.read.json(storage_path). When I display the data it renders fine, but when I run a count it fails, complaining that the document size exceeds 400 MiB, which cannot be true.
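For reference, a minimal sketch of what I am running (the path is a placeholder):

# storage_path is a placeholder for the actual file location
storage_path = "dbfs:/mnt/raw/events.json"

df = spark.read.json(storage_path)  # default mode: one JSON document per line
display(df)   # rendering a sample of rows works fine
df.count()    # the full scan fails with the Photon JSON reader error below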

Photon JSON reader error: JSON document exceeded maximum allowed size 400.0 MiB. Any single document must fit within this memory budget.

at 0x4fe8a91 <photon>.UnrecoverableError(external/workspace_spark_3_3/photon/io/json/simd-json-util.h:33)
at 0x4fe8974 <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:286)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)
at 0x4fe929f <photon>.NextInput(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:308)
at 0x4ac26de <photon>.OpenFileForReading(external/workspace_spark_3_3/photon/exec-nodes/json-file-scan-node.cc:501)
at 0x4ac1058 <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/json-file-scan-node.cc:402)
at 0x49cc47c <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/grouping-agg-node.cc:92)
at 0x49cc47c <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/shuffle-sink-node.cc:146)
at com.databricks.photon.JniApiImpl.open(Native Method)
at com.databricks.photon.JniApi.open(JniApi.scala)
at com.databricks.photon.JniExecNode.open(JniExecNode.java:64)
at com.databricks.photon.PhotonShuffleMapStageExec.$anonfun$preShuffleRDDInternal$9(PhotonExec.scala:809)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.photon.PhotonExec.timeit(PhotonExec.scala:344)

What can cause this error? Is it a bug? The same read works fine when I switch off the Photon engine.

4 REPLIES

kk007
New Contributor III

It seems to work when I add .option("multiline", True); still, the error message seems misleading for an 83 MB file.
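For anyone hitting the same thing, a sketch of the read that works for me (same placeholder path as above):

# multiline mode parses documents that span multiple lines, instead of
# treating every line as a separate self-contained JSON document
df = spark.read.option("multiline", True).json(storage_path)
df.count()  # completes without the 400 MiB document error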

karthik_p
Esteemed Contributor

@Kamal Kumar can you share the test data you are trying to read? It looks like single-line reading is causing the issue, not multiline. Is the data on a single line very large?

https://docs.databricks.com/external-data/json.html

kk007
New Contributor III

Well, my whole file is only 83 MB, so how can the error say it exceeds 400 MiB?

JSON document exceeded maximum allowed size 400.0 MiB

Anonymous
Not applicable

@Kamal Kumar:

The error message suggests that a single JSON document is exceeding the Photon JSON reader's maximum allowed size of 400 MiB. This could be caused by one or more documents in your file being larger than that limit. It is not a bug, but a limit imposed by the Photon JSON reader. Note that in the default (non-multiline) mode the reader treats each line as one document, so a file whose records are not line-delimited can look like a single huge document to the reader even when the file itself is small.
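One way to check this is to measure the longest line in the file, since in the default mode each line is treated as one document. A minimal sketch, assuming the file is reachable through a /dbfs-style local path (the path is a placeholder):

# Hypothetical diagnostic: find the largest single-line "document" in the file
max_len = 0
with open("/dbfs/mnt/raw/events.json", "rb") as f:  # placeholder path
    for line in f:
        max_len = max(max_len, len(line))
print(f"largest single-line document: {max_len / (1024 * 1024):.1f} MiB")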

To resolve the issue, you could try splitting the JSON file into smaller chunks and processing them separately. You can do this with tools like jq or split on Linux, or with a streaming Python library such as ijson. Alternatively, you could use a different JSON reader that does not have this limitation.
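For example, a sketch of re-chunking with ijson, assuming the file is a single top-level JSON array (the file names and the "item" prefix are assumptions about the data's shape):

import json
import ijson  # streaming JSON parser

# Stream the top-level array and rewrite it as JSON Lines, which Spark
# reads without multiline mode and without holding one huge document.
with open("big.json", "rb") as src, open("big.jsonl", "w") as dst:
    for record in ijson.items(src, "item"):  # "item" = each array element
        # ijson yields Decimal for numbers; default=float keeps them serializable
        dst.write(json.dumps(record, default=float) + "\n")

The resulting big.jsonl can then be read with spark.read.json without the multiline option.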

Another possible cause is memory pressure. If the cluster does not have enough memory to process the large JSON file, you may see this error; in that case, increase the memory allocated to your Spark job or optimize your code to use less memory.
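For reference, the standard Spark memory properties to raise (on Databricks these go in the cluster's Spark config; the values below are illustrative, not recommendations):

spark.executor.memory 16g
spark.driver.memory 16g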
