Data Engineering
Reading empty json file in serverless gives error

Dhruv-22
Contributor II

I have a pipeline that writes JSON files to a storage location after reading a daily delta load. Today the file was empty. I tried running the notebook manually on a serverless cluster (environment version 4) and encountered this error:

df = spark.read.json(path)
df.display()

-- Output
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only
include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show(). Instead, you can cache or
save the parsed results and then send the same query. For example,
val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
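In PySpark, that suggested workaround amounts to roughly the following (path being the same JSON location as above):

df = spark.read.json(path).cache()
df.display()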

I tried the suggested cache workaround, but it fails with another error saying caching is not supported on serverless compute:

[NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. SQLSTATE: 0A000

 I have three questions:

  1. Why does this only happen on serverless compute and not on all-purpose compute (I tried DBR 15.4)?
  2. How can I display the dataframe on serverless compute in this case? I tried collect() and select('*', lit('a')), but nothing works.
  3. Is there any way to avoid this error? On all-purpose compute the read just produces an empty dataframe with no columns and rows. (One idea, sketched after this list, is to pass an explicit schema, but I don't know if that is the right approach.)
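For reference, this is the explicit-schema idea from question 3, as a minimal sketch. The field names are only placeholders for the real delta-load columns; the point is that with a user-supplied schema Spark never has to infer one from the empty file, so the read should come back as an empty dataframe with the expected columns rather than only the internal _corrupt_record column.

from pyspark.sql import types as T

# Placeholder schema; the real daily delta load has different columns
expected_schema = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("updated_at", T.TimestampType()),
])

# With an explicit schema Spark skips inference, so an empty file should
# simply produce an empty dataframe with these columns
df = spark.read.schema(expected_schema).json(path)
df.display()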
Accepted Solution

K_Anudeep
Databricks Employee