Reading empty json file in serverless gives error
10-30-2025 10:40 AM
I ran a Databricks notebook to do incremental loads from files in the raw layer into bronze-layer tables. Today I encountered a case where the delta file was empty. I tried running the read manually on serverless compute and got an error.
df = spark.read.json(path)
df.display()
-- Output
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only
include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count() and
spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or
save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
I tried caching, but it isn't allowed in serverless compute
[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000
I have the following questions:
- Why does this issue occur only on serverless compute? I tried all-purpose compute with 15.4 LTS and it created an empty dataframe.
- Is there a way to display the dataframe to see exactly what the corrupt record is? I tried collect and select('*', lit('c')), but neither worked.
- Is there a way to make serverless compute tolerate empty files?
10-31-2025 04:28 AM - edited 10-31-2025 04:31 AM
Hello @Dhruv-22 ,
- Can you share the schema of the df? Do you have a _corrupt_record column in your dataframe? If yes, where is it coming from, given that you said the file is empty?
- By design, Spark blocks queries that reference only the _corrupt_record column from raw JSON/CSV and throws an error if that column is accessed explicitly. You aren't doing that, yet it still throws the error, which is why we will need the schema and the explain plan of the dataframe: df.explain(True).
By the way, I created an empty file on serverless and it produced an empty df as expected.
10-31-2025 05:42 AM
Hey @K_Anudeep
Here are the details you requested.
- Yes, there is a _corrupt_record column in the dataframe. It appears because Spark treats the file as containing corrupt records and therefore generates the _corrupt_record column.
- The error occurs when I run display, collect or any similar command. Here is the explain output.
- Also, df.count() is 1.
I checked the file size; it is 3 bytes. It doesn't display any characters.
But printing the hexdump gives the following.
I guess this is causing the issue. Can you tell me how to deal with it? It runs fine on the all-purpose cluster, though.
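For readers who want to repeat the byte-level check from inside a notebook, here is a minimal sketch in plain Python. The helper names `hex_head` and `starts_with_utf8_bom` are made up for illustration, and the sample bytes are a stand-in for the 3-byte file described here (ef bb bf, as identified later in the thread).

```python
import codecs


def hex_head(data: bytes, limit: int = 16) -> str:
    """Return the first `limit` bytes as space-separated hex pairs."""
    return data[:limit].hex(" ")


def starts_with_utf8_bom(data: bytes) -> bool:
    """True if the data begins with the UTF-8 byte order mark (ef bb bf)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical stand-in for the 3-byte file from this thread.
sample = b"\xef\xbb\xbf"
print(hex_head(sample))              # ef bb bf
print(starts_with_utf8_bom(sample))  # True
```

For a real file, the bytes could come from `open(path, "rb").read(16)`, assuming the path is reachable through ordinary file APIs.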
10-31-2025 08:04 AM
Hey @Dhruv-22 ,
Oh, this totally makes sense now. In that case, it is a true corrupt record. You can just set the read option mode to DROPMALFORMED and it should work:
df1 = (
    spark.read
    .format("json")
    .option("mode", "DROPMALFORMED")  # drop records that fail to parse
    .load(base)  # base: path to the JSON file(s)
)
df1.display()
11-02-2025 03:41 AM - edited 11-02-2025 03:42 AM
Hi @K_Anudeep
I searched some more. It is not a corrupt record. The three bytes are a Byte Order Mark (BOM) signaling that the file is UTF-8 encoded, which is standard. Also, the file is generated automatically by a no-code pipeline (Azure Data Factory), so it is hard to call this an issue with the service.
The difference comes from Photon. On serverless compute, Photon is enabled by default. On the all-purpose cluster the file reads fine. When I enable Photon on the all-purpose cluster, it fails with the same error as on serverless.
So the parser used by Photon behaves differently from the one in regular Spark. Could you do some more digging and find out exactly what is causing the error?
Thanks for the help so far.
P.S. I got the idea about Photon from ChatGPT. I tried it and found it to be true.
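To illustrate the BOM point outside Spark, the sketch below uses only the Python standard library. It shows how a BOM-aware decoder treats the mark; it does not show how Spark or Photon parse the file. The byte strings are illustrative stand-ins, not the actual files from this thread.

```python
import codecs
import json

bom_only = codecs.BOM_UTF8                        # b"\xef\xbb\xbf"
bom_plus_json = codecs.BOM_UTF8 + b'{"a": "a"}\n'

# Plain "utf-8" keeps the BOM as the character U+FEFF.
assert bom_only.decode("utf-8") == "\ufeff"

# "utf-8-sig" strips a leading BOM, so a BOM-only file decodes to no text at all.
assert bom_only.decode("utf-8-sig") == ""

# With the BOM stripped, the remaining text is ordinary JSON.
record = json.loads(bom_plus_json.decode("utf-8-sig"))
print(record)  # {'a': 'a'}
```

In other words, a file holding nothing but the BOM carries no payload, so treating it as empty is a reasonable expectation.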
11-03-2025 12:27 AM
Adding to my point: suppose the file contains a valid JSON record along with the byte order mark, like below (the bytes ef, bb and bf are the byte order mark).
dhruv@AS-MAC-0324 Downloads % cat zone.json | hexdump -C
00000000  ef bb bf 7b 22 61 22 3a  20 22 61 22 7d 0a        |...{"a": "a"}.|
0000000e
Then a cluster with Photon enabled reads the file and gives this as the output.
That is, the cluster reads the file properly. So it is a bug in Photon-enabled environments that they are unable to read an empty file with a byte order mark.
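One possible workaround, sketched here as an assumption rather than a confirmed fix, is to skip files that contain nothing but a BOM and whitespace before handing the paths to Spark. The helper names, the 4 KiB prefix size, and the temporary demo files are all hypothetical.

```python
import codecs
import os
import tempfile

PREFIX_BYTES = 4096  # assumption: a payload, if any, starts within the first 4 KiB


def has_json_payload(raw: bytes) -> bool:
    """True if anything other than a UTF-8 BOM and whitespace is present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return bool(raw.strip())


def files_with_payload(paths):
    """Return the paths whose leading bytes hold more than a BOM and whitespace."""
    kept = []
    for path in paths:
        with open(path, "rb") as fh:
            if has_json_payload(fh.read(PREFIX_BYTES)):
                kept.append(path)
    return kept


# Demo with two temporary files standing in for raw-layer files.
with tempfile.TemporaryDirectory() as tmp:
    bom_only = os.path.join(tmp, "bom_only.json")
    with_record = os.path.join(tmp, "with_record.json")
    with open(bom_only, "wb") as fh:
        fh.write(codecs.BOM_UTF8)
    with open(with_record, "wb") as fh:
        fh.write(codecs.BOM_UTF8 + b'{"a": "a"}\n')

    kept = files_with_payload([bom_only, with_record])
    print([os.path.basename(p) for p in kept])  # ['with_record.json']
```

In a notebook, the surviving list could then be passed to spark.read.json(kept), skipping the read when the list is empty. This assumes the files are reachable through ordinary file APIs; cloud URIs would need a different way to fetch the leading bytes.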
11-04-2025 06:38 AM
Hey @Dhruv-22 ! Thanks for the info!
I will need to analyse this internally to pinpoint the exact root cause. I advise that you raise a support case with us so we can take a closer look. You can raise one at https://help.databricks.com/ and add a comment asking to assign it to me, so that I can look into it and provide a detailed analysis and a fix, if any.
11-06-2025 08:45 PM
hey @K_Anudeep
I don't have any support subscription.