I have a pipeline which puts JSON files in a storage location after reading a daily delta load. Today I encountered a case where the file was empty. I tried running the notebook manually on a serverless cluster (environment version 4) and got this error:
df = spark.read.json(path)
df.display()
-- Output
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only
include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or
save the parsed results and then send the same query. For example,
val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
Following the error message's suggestion, I then tried caching the parsed result.
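Roughly what I ran, with `path` standing in for the storage location my pipeline writes to:

```python
# Cache the parsed JSON before querying it, as the error message suggests;
# `path` points at the empty daily file.
df = spark.read.json(path).cache()
df.display()
```

But on serverless compute this just fails with another error: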
[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000
I have three questions:
- Why does this only happen on serverless compute and not on all-purpose compute (I tried it on runtime 15.4)?
- How can I display the DataFrame on serverless compute in this case? I tried collect() and select('*', lit('a')) (see the first snippet after this list), but nothing works.
- Is there any way to avoid this error? On all-purpose compute the same read just produces an empty DataFrame with no columns or rows. (The second snippet after this list is the only workaround idea I have so far.)
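For the second question, this is roughly what I tried; both attempts still fail for me on serverless:

```python
from pyspark.sql.functions import lit

df = spark.read.json(path)

# Attempt 1: pull the rows to the driver instead of displaying them
df.collect()

# Attempt 2: add a dummy literal column so the query no longer references
# only the internal corrupt-record column
df.select('*', lit('a')).display()
```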
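For the third question, the only workaround I have come up with so far is passing an explicit schema so Spark never has to infer one from the empty file. A sketch of the idea, with placeholder column names instead of my real ones; I have not confirmed it behaves the same as all-purpose compute:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema: the column names here are examples, not my real ones.
expected_schema = StructType([
    StructField("id", StringType(), True),
    StructField("payload", StringType(), True),
])

# With an explicit schema there is no inference step, so the
# _corrupt_record-only schema should never arise for the empty file.
df = spark.read.schema(expected_schema).json(path)
df.display()
```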