2 weeks ago
I ran a Databricks notebook to do incremental loads from files in the raw layer to bronze layer tables. Today I encountered a case where the delta file was empty. I tried running the load manually on serverless compute and encountered an error.
df = spark.read.json(path)
df.display()
-- Output
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only
include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count() and
spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or
save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
I tried caching, but it isn't allowed on serverless compute.
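Roughly what I tried, following the suggestion in the error message (a sketch; path is the same raw-layer location as above):

# Cache the parsed result, as the Spark error message suggests
df = spark.read.json(path).cache()
df.display()

This fails on serverless with: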
[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000
I have the following questions:
2 weeks ago - last edited 2 weeks ago
Hello @Dhruv-22,
_corrupt_record is a column Spark adds when reading raw JSON/CSV files, and Spark throws that error when the column is accessed explicitly. In your case you aren't doing that, yet it still throws the error. By the way, I created an empty file in serverless and it produced an empty df as expected. To dig further, we will need the schema and the explain plan of the dataframe (df.explain(True)).
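Something like this should print both (a minimal sketch, assuming df is read from the same raw path):

# Show the inferred schema and the extended (logical + physical) query plan
df = spark.read.json(path)
df.printSchema()
df.explain(True)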
2 weeks ago
Hey @K_Anudeep
Here are the details you requested.
I checked the file size: it was 3 bytes, and it doesn't display any characters.
But printing the hexdump gives the following:
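00000000 ef bb bf |...|
00000003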
I guess this is causing the issue. Can you tell me how to deal with it? It runs fine on the all-purpose cluster, though.
2 weeks ago
Hey @Dhruv-22,
Oh, this totally makes sense now. In that case, it is a true corrupt record. You can just add the read option DROPMALFORMED and it should work:
df1 = (spark.read
       .format("json")
       .option("mode", "DROPMALFORMED")  # <- drops malformed records
       .load(base))
df1.display()
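To double-check that the bad content was dropped rather than loaded as garbage, a quick follow-up check (sketch) would be:

# With the malformed record dropped, the file should contribute no rows
print(df1.count())
df1.printSchema()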
2 weeks ago - last edited 2 weeks ago
Hi @K_Anudeep
I searched some more, and it is not a corrupt record. The three bytes are a Byte Order Mark (BOM) signaling that the file is UTF-8 encoded; it is a standard thing. Also, the file is generated automatically by a no-code pipeline (Azure Data Factory), so it is hard to argue that there is an issue with the service.
The difference comes down to Photon. On a serverless cluster, Photon is enabled by default. On an all-purpose cluster the file reads fine, but when I enable Photon on the all-purpose cluster, it fails with the same error as on the serverless cluster.
So the JSON parser Photon uses is different from normal Spark's. Could you do some more digging and find out what exactly is causing the error?
Thanks for the help so far.
P.S. I got the idea about Photon from ChatGPT. I tried it out and found it to be true.
2 weeks ago
Adding to my point: suppose the file consists of valid JSON along with the byte order mark, like below (the bytes ef, bb and bf are the byte order mark).
dhruv@AS-MAC-0324 Downloads % cat zone.json|hexdump -C
00000000 ef bb bf 7b 22 61 22 3a 20 22 61 22 7d 0a |...{"a": "a"}.|
0000000e
Then a cluster with Photon enabled reads the file and gives the expected output: a single column a with one row containing "a".
That is, the cluster reads the file properly. So the bug in a Photon-enabled environment is that it cannot read a file that contains only a byte order mark.
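For now, one way I can work around it in the incremental load is to skip such files before the JSON read. This is just a sketch; it assumes dbutils.fs.head is available on the cluster and that a file holding only a BOM or whitespace should be treated as empty.

# Return True if the file holds nothing but a UTF-8 BOM and/or whitespace.
# dbutils.fs.head returns the first bytes of the file decoded as UTF-8,
# so a BOM-only file comes back as the single character U+FEFF.
def is_effectively_empty(file_path):
    head = dbutils.fs.head(file_path, 16)
    return head.strip("\ufeff \t\r\n") == ""

if not is_effectively_empty(path):
    df = spark.read.json(path)
    df.display()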
2 weeks ago
Hey @Dhruv-22! Thanks for the info!
I will need to analyse this internally to pinpoint the exact root cause. I'd advise raising a support case with us so we can take a closer look. You can raise a support case using this link: https://help.databricks.com/
and add a comment to assign it to me so that I can look into it and provide a detailed analysis and a fix, if any.
a week ago
Hey @K_Anudeep,
I don't have a support subscription.