Data Engineering

Reading empty JSON file in serverless gives error

Dhruv-22
Contributor II

I ran a Databricks notebook that does incremental loads from files in the raw layer to bronze layer tables. Today I encountered a case where the delta file was empty. I tried running it manually on serverless compute and hit an error.

df = spark.read.json(path)
df.display()

-- Output
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only
include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count() and
spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or
save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

I tried caching, but it isn't allowed on serverless compute.
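Roughly what I tried (a minimal sketch; path is the same file as above):

# attempt the workaround from the error message: cache the parsed result first
df_cached = spark.read.json(path).cache()
df_cached.display()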

[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000

I have the following questions:

  1. Why does this issue occur only on serverless compute? I tried all-purpose compute with 15.4 LTS and it created an empty dataframe.
  2. Is there a way to display the dataframe to see what exactly the corrupt record is? I tried collect and select('*', lit('c')), but neither worked (see the sketch after this list).
  3. Is there a way on serverless compute to tolerate empty files?
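For question 2, one thing that might help is reading the file as plain text to look at the raw content without going through the JSON parser (a sketch, not verified on serverless; path is the same file as above):

# read the raw lines as text to see what Spark is actually parsing
raw = spark.read.text(path)
raw.display()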
7 REPLIES

K_Anudeep
Databricks Employee

Hello @Dhruv-22 ,

  • Can you share the schema of the df? Do you have a _corrupt_record column in your dataframe? If yes, where is it coming from, since you said it's an empty file, correct?
  • By design, Spark blocks queries that only reference the internal _corrupt_record column from raw JSON/CSV files and throws an error if it is accessed explicitly. But in your case you aren't doing that, yet it still throws the error, which is why we will need the schema and the explain plan of the dataframe (a quick sketch of how to get both follows below).
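Something like this would be enough (a minimal sketch using the df from your snippet):

df.printSchema()   # shows whether _corrupt_record is part of the inferred schema
df.explain(True)   # prints the parsed, analyzed, optimized and physical plans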

By the way, I created an empty file in serverless and it created an empty df as expected.
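Roughly the repro (a sketch; the path is just an example):

# create a truly empty (0-byte) file and read it back as JSON
dbutils.fs.put("/tmp/empty_test.json", "", True)
spark.read.json("/tmp/empty_test.json").display()   # shows an empty DataFrame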

K_Anudeep_0-1761910255579.png


Anudeep

Hey @K_Anudeep

Here are the details you requested.

  • Yes, there is a _corrupt_record column in the dataframe. It is there because Spark is treating the file as having corrupt records and is therefore generating the _corrupt_record column.

Dhruv22_2-1761913421664.png

  • The error comes when I try to run display, collect or any such command. Here is the explain plan:

Dhruv22_3-1761913478812.png

  • Also, df.count() is 1.

I checked the file size: it is 3 bytes. It doesn't display any characters.

Dhruv22_0-1761913058957.png

But printing the hexdump gives the following:

Dhruv22_1-1761913093525.png

I guess this is what's causing the issue. Can you tell me how to deal with it? It runs fine on the all-purpose cluster, though.
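The same bytes can also be checked from a notebook with Spark's binaryFile reader (a sketch, using the same path as in the original snippet):

# load the file as binary and print its bytes; a BOM-only file shows ef bb bf
raw = spark.read.format("binaryFile").load(path)
content = bytes(raw.select("content").collect()[0]["content"])
print(content.hex(" "))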

K_Anudeep
Databricks Employee

Hey @Dhruv-22 ,

Oh, this totally makes sense now. In that case it is a true corrupt record. You can just set the read option mode to DROPMALFORMED and it should work:

df1 = (spark.read
       .format("json")
       .option("mode", "DROPMALFORMED")  # <- drops malformed records
       .load(base))

df1.display()
Anudeep

Hi @K_Anudeep 

I searched some more. It is not a corrupt record. The three bytes represent a Byte Order Mark (BOM), signaling that the file is UTF-8 encoded; that is a standard thing. Also, the file is generated automatically by a no-code pipeline (Azure Data Factory), so it is difficult to say that there is an issue with the service.
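For reference, those three bytes are exactly the UTF-8 encoding of the BOM code point U+FEFF (a quick check, as a sketch):

# the UTF-8 byte order mark is the encoding of U+FEFF
bom = "\ufeff".encode("utf-8")
print(bom)                      # b'\xef\xbb\xbf'
print(bom == b"\xef\xbb\xbf")   # True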

The difference comes down to Photon. On a serverless cluster, Photon is enabled by default. On an all-purpose cluster the file reads fine, but when I enable Photon on the all-purpose cluster, it fails with the same error as on the serverless cluster.

So the parser used by Photon is different from the one in regular Spark. Could you do some more digging and find out what exactly is causing the error?

Thanks for the help so far.

P.S. - I got the idea of checking Photon from ChatGPT. I tried it and found it to be true.

Adding to my point: suppose the file consists of a valid JSON object along with the byte order mark, like below (the bytes ef, bb and bf represent the byte order mark).

dhruv@AS-MAC-0324 Downloads  % cat zone.json|hexdump -C
00000000  ef bb bf 7b 22 61 22 3a  20 22 61 22 7d 0a        |...{"a": "a"}.|
0000000e

Then a cluster with Photon enabled reads the file and gives this as the output.

Dhruv22_0-1762158303933.png

That is, the cluster reads the file properly. So it is a bug in the Photon-enabled environment that it is unable to read an empty file containing only a byte order mark.
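A rough way to reproduce both cases on a Photon-enabled versus a non-Photon cluster (a sketch; the paths are just examples, and it assumes dbutils.fs.put writes UTF-8, so the BOM comes out as ef bb bf):

# write a BOM-only file and a BOM + valid JSON file
dbutils.fs.put("/tmp/bom_only.json", "\ufeff", True)
dbutils.fs.put("/tmp/bom_plus_json.json", "\ufeff" + '{"a": "a"}\n', True)

spark.read.json("/tmp/bom_plus_json.json").display()   # reads fine with Photon
spark.read.json("/tmp/bom_only.json").display()        # fails with Photon enabled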

K_Anudeep
Databricks Employee

Hey @Dhruv-22 ! Thanks for the info! 

I will need to analyse this internally to pinpoint the exact root cause. I advise that you raise a support case with us so we can have a closer look. You can raise a support case using the link https://help.databricks.com/ and add a comment asking to assign it to me so that I can take a look and provide a detailed analysis and a fix, if any.

Anudeep

Hey @K_Anudeep

I don't have any support subscription.