2 weeks ago
I ran a Databricks notebook to do incremental loads from files in the raw layer to bronze layer tables. Today I encountered a case where the delta file was empty. I tried running the load manually on serverless compute and encountered an error.
df = spark.read.json(path)
df.display()
-- Output
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only
include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count() and
spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or
save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
I tried caching, but it isn't allowed on serverless compute.
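Roughly what I tried, following the suggestion in the error message (a sketch; path is the same raw-layer location as above):

# Cache the parsed result, as the Spark error message suggests
df = spark.read.json(path).cache()
df.display()

This fails on serverless with: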
[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000
I have the following questions:
2 weeks ago - last edited 2 weeks ago
Hello @Dhruv-22,
_corrupt_record is a column Spark adds when reading raw JSON/CSV files, and Spark throws that error when the column is accessed explicitly. In your case you aren't doing that, yet it still throws the error. By the way, I created an empty file in serverless and it produced an empty df as expected. To dig further, we will need the schema and the explain plan of the dataframe (df.explain(True)).
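Something like this should print both (a minimal sketch, assuming df is read from the same raw path):

# Show the inferred schema and the extended (logical + physical) query plan
df = spark.read.json(path)
df.printSchema()
df.explain(True)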
2 weeks ago
Hey @K_Anudeep
Here are the details you requested.
I checked the file size: it was 3 bytes, and it doesn't display any characters.
But printing the hexdump gives the following:
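00000000 ef bb bf |...|
00000003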
I guess this is causing the issue. Can you tell me how to deal with it? It runs fine on the all-purpose cluster, though.
2 weeks ago
Hey @Dhruv-22,
Oh, this totally makes sense now. In that case, it is a true corrupt record. You can just add the read option DROPMALFORMED and it should work:
df1 = (spark.read
       .format("json")
       .option("mode", "DROPMALFORMED")  # <- drops malformed records
       .load(base))
df1.display()
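To double-check that the bad content was dropped rather than loaded as garbage, a quick follow-up check (sketch) would be:

# With the malformed record dropped, the file should contribute no rows
print(df1.count())
df1.printSchema()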
2 weeks ago - last edited 2 weeks ago
Hi @K_Anudeep
I searched some more, and it is not a corrupt record. The three bytes are a Byte Order Mark (BOM) signaling that the file is UTF-8 encoded; it is a standard thing. Also, the file is generated automatically by a no-code pipeline (Azure Data Factory), so it is hard to argue that there is an issue with the service.
The difference comes down to Photon. On a serverless cluster, Photon is enabled by default. On an all-purpose cluster the file reads fine, but when I enable Photon on the all-purpose cluster, it fails with the same error as on the serverless cluster.
So the JSON parser Photon uses is different from normal Spark's. Could you do some more digging and find out what exactly is causing the error?
Thanks for the help so far.
P.S. I got the idea about Photon from ChatGPT. I tried it out and found it to be true.
2 weeks ago
Adding to my point: suppose the file consists of valid JSON along with the byte order mark, like below (the bytes ef, bb and bf are the byte order mark).
dhruv@AS-MAC-0324 Downloads % cat zone.json|hexdump -C
00000000 ef bb bf 7b 22 61 22 3a 20 22 61 22 7d 0a |...{"a": "a"}.|
0000000e
Then a cluster with Photon enabled reads the file and gives the expected output: a single column a with one row containing "a".
That is, the cluster reads the file properly. So the bug in a Photon-enabled environment is that it cannot read a file that contains only a byte order mark.
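For now, one way I can work around it in the incremental load is to skip such files before the JSON read. This is just a sketch; it assumes dbutils.fs.head is available on the cluster and that a file holding only a BOM or whitespace should be treated as empty.

# Return True if the file holds nothing but a UTF-8 BOM and/or whitespace.
# dbutils.fs.head returns the first bytes of the file decoded as UTF-8,
# so a BOM-only file comes back as the single character U+FEFF.
def is_effectively_empty(file_path):
    head = dbutils.fs.head(file_path, 16)
    return head.strip("\ufeff \t\r\n") == ""

if not is_effectively_empty(path):
    df = spark.read.json(path)
    df.display()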
2 weeks ago
Hey @Dhruv-22! Thanks for the info!
I will need to analyse this internally to pinpoint the exact root cause. I'd advise raising a support case with us so we can take a closer look. You can raise a support case using this link: https://help.databricks.com/
and add a comment to assign it to me so that I can look into it and provide a detailed analysis and a fix, if any.
a week ago
Hey @K_Anudeep,
I don't have a support subscription.