Photon engine throws error "JSON document exceeded maximum allowed size 400.0 MiB"

kk007
New Contributor III

I am reading an 83 MB JSON file using "spark.read.json(storage_path)". When I display the data it looks fine, but when I run a count it complains about a document size of more than 400 MiB, which is not true.

Photon JSON reader error: JSON document exceeded maximum allowed size 400.0 MiB. Any single document must fit within this memory budget.

at 0x4fe8a91 <photon>.UnrecoverableError(external/workspace_spark_3_3/photon/io/json/simd-json-util.h:33)

at 0x4fe8974 <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:286)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe907c <photon>.TryLoadDocumentsFromStream(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:287)

at 0x4fe929f <photon>.NextInput(external/workspace_spark_3_3/photon/io/json/simd-json-reader.cc:308)

at 0x4ac26de <photon>.OpenFileForReading(external/workspace_spark_3_3/photon/exec-nodes/json-file-scan-node.cc:501)

at 0x4ac1058 <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/json-file-scan-node.cc:402)

at 0x49cc47c <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/grouping-agg-node.cc:92)

at 0x49cc47c <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/shuffle-sink-node.cc:146)

at com.databricks.photon.JniApiImpl.open(Native Method)

at com.databricks.photon.JniApi.open(JniApi.scala)

at com.databricks.photon.JniExecNode.open(JniExecNode.java:64)

at com.databricks.photon.PhotonShuffleMapStageExec.$anonfun$preShuffleRDDInternal$9(PhotonExec.scala:809)

at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

at com.databricks.photon.PhotonExec.timeit(PhotonExec.scala:344)

What can cause this error? Is it a bug? The same read works fine when I switch off the Photon engine.
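For reference, a minimal sketch of the failing pattern described above, assuming storage_path is a placeholder for the actual file location and the cluster has Photon enabled:

```python
# Hedged sketch of the scenario from the question; storage_path is a placeholder.
df = spark.read.json(storage_path)

display(df)   # displaying a sample of the data works fine
df.count()    # the count triggers the Photon "400.0 MiB" error shown above
```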

4 REPLIES

kk007
New Contributor III

It seems to work when I add .option("multiline", True), but the error still seems misleading for an 83 MB file.
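For anyone hitting the same thing, a minimal sketch of that workaround, assuming storage_path points to the same JSON file as in the question:

```python
# With multiline enabled, Spark treats each file as a single JSON document
# instead of expecting one JSON document per line (JSON Lines).
df = spark.read.option("multiline", True).json(storage_path)
df.count()
```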

karthik_p
Esteemed Contributor

@Kamal Kumar​ Can you share the test data you are trying to read? It looks like single-line reading is causing the issue, not multiline. Is the data on a single line very large?

https://docs.databricks.com/external-data/json.html

kk007
New Contributor III

Well, my whole file is only 83 MB, so how can the error say it exceeds 400 MiB?

JSON document exceeded maximum allowed size 400.0 MiB

Anonymous
Not applicable

@Kamal Kumar​ :

The error message suggests that a single JSON document is exceeding the maximum allowed size of 400 MiB. This could be caused by one or more documents in your JSON file being larger than this limit. It is not a bug, but a limitation set by the Photon JSON reader.

To resolve the issue, you could try splitting the JSON file into smaller chunks and processing them separately. You can do this using tools like jq or split in Linux or Python libraries like ijson. Alternatively, you could consider using a different JSON reader that does not have this limitation.
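For example, if the file is one large top-level JSON array, a streaming parser such as ijson can rewrite it as JSON Lines so that each record becomes its own small document. A sketch, with placeholder file names:

```python
import json
import ijson

# Stream the array elements one at a time so the whole file is never held in memory,
# and write each record on its own line (JSON Lines).
with open("large_input.json", "rb") as src, open("records.jsonl", "w") as dst:
    for record in ijson.items(src, "item"):  # "item" = each element of a top-level array
        dst.write(json.dumps(record) + "\n")
```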

Another possible cause of this error could be memory issues. If your system does not have enough memory to process the large JSON file, it may cause the error you are seeing. In this case, you may need to increase the memory allocated to your Spark job or optimize your code to use less memory.
