I have a pipeline which puts JSON files in a storage location after reading a daily delta load. Today I encountered a case where the file was empty. I tried running the notebook manually on a serverless cluster (environment version 4) and got this error:
df = spark.read.json(path)
df.display()
-- Output
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only
include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or
save the parsed results and then send the same query. For example,
val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
Following the error message's suggestion, I then tried caching the parsed result.
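Roughly what I ran, with `path` standing in for the storage location my pipeline writes to:

```python
# Cache the parsed JSON before querying it, as the error message suggests;
# `path` points at the empty daily file.
df = spark.read.json(path).cache()
df.display()
```

But on serverless compute this just fails with another error: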
[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000
I have three questions:
- Why does this only happen on serverless compute and not on all-purpose compute (I tried it on runtime 15.4)?
- How can I display the DataFrame on serverless compute in this case? I tried collect() and select('*', lit('a')) (see the first snippet after this list), but nothing works.
- Is there any way to avoid this error? On all-purpose compute the same read just produces an empty DataFrame with no columns or rows. (The second snippet after this list is the only workaround idea I have so far.)
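For the second question, this is roughly what I tried; both attempts still fail for me on serverless:

```python
from pyspark.sql.functions import lit

df = spark.read.json(path)

# Attempt 1: pull the rows to the driver instead of displaying them
df.collect()

# Attempt 2: add a dummy literal column so the query no longer references
# only the internal corrupt-record column
df.select('*', lit('a')).display()
```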
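For the third question, the only workaround I have come up with so far is passing an explicit schema so Spark never has to infer one from the empty file. A sketch of the idea, with placeholder column names instead of my real ones; I have not confirmed it behaves the same as all-purpose compute:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema: the column names here are examples, not my real ones.
expected_schema = StructType([
    StructField("id", StringType(), True),
    StructField("payload", StringType(), True),
])

# With an explicit schema there is no inference step, so the
# _corrupt_record-only schema should never arise for the empty file.
df = spark.read.schema(expected_schema).json(path)
df.display()
```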