topic Reading data from "dbfs:/mnt/" in Data Engineering

Reading data from "dbfs:/mnt/"

Pat — Fri, 16 Dec 2022 21:51:10 GMT

Hi community,

I don't know what is happening TBH.

I have a use case where data is written to the location "dbfs:/mnt/...". Don't ask me why it's mounted; it's just a side project. I believe the data is stored in ADLS Gen2.

I've been trying to read the data after it's written, but when I try to read from the folder:

df = spark.read.format("parquet").load("dbfs:/mnt/table/")
 
or
 
df = spark.read.format("parquet").load("dbfs:/mnt/table/date=2022-12-16")

I get: AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

When I provide the schema, the count is 0 (zero):

df.count()

But when I provide the full path to the Parquet file, it works:

df = spark.read.format("parquet").load("dbfs:/mnt/table/date=2022-12-16/some-spark-file.snappy.parquet")
 
df.count()

It returns 700 rows.

Any ideas? 🙂
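One common cause of `Unable to infer schema for Parquet` on a folder read is that Spark's file listing finds no data files it is willing to read there, even though a direct path to a single file works. The sketch below is one way to check that from a notebook. It rests on two assumptions that are not from the thread: that the mount is reachable through the local `/dbfs/...` FUSE path on the cluster, and that a rule of "skip names starting with `_` or `.`" is a close enough approximation of Spark's listing filter. The helper names `is_hidden` and `list_data_files` are hypothetical.

```python
import os


def is_hidden(name):
    # Approximation (assumption): Spark's file listing skips names that
    # start with "_" or ".", e.g. "_temporary", "_SUCCESS", ".crc" files.
    return name.startswith(("_", "."))


def list_data_files(root):
    """Walk root and return (visible, skipped) lists of relative paths.

    A file counts as skipped if its own name, or any directory between
    root and the file, looks hidden according to is_hidden().
    """
    visible, skipped = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        parts = [] if rel_dir == "." else rel_dir.split(os.sep)
        dir_hidden = any(is_hidden(p) for p in parts)
        for name in filenames:
            rel = os.path.join(*parts, name)
            if dir_hidden or is_hidden(name):
                skipped.append(rel)
            else:
                visible.append(rel)
    return sorted(visible), sorted(skipped)


# Example call (hypothetical local path for the mounted folder):
# visible, skipped = list_data_files("/dbfs/mnt/table/date=2022-12-16")
# print(len(visible), "visible;", len(skipped), "skipped")
```

If `visible` comes back empty while `skipped` does not, the files may still be sitting under a temporary or hidden folder from an unfinished write, which would be consistent with a folder read finding nothing while a direct file path succeeds.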

Re: Reading data from "dbfs:/mnt/"

Pat — Fri, 16 Dec 2022 22:57:06 GMT

I am still not sure what happened, but I re-ran the job on a smaller dataset and it seems to work. Maybe corrupted data?

Re: Reading data from "dbfs:/mnt/"

Chaitanya_Raju — Sat, 17 Dec 2022 02:22:08 GMT

Yes, maybe the data in a particular partition or file got corrupted. For me, it works fine with sample Parquet data; I am able to read it without any issues.

Re: Reading data from "dbfs:/mnt/"

Aviral-Bhardwaj — Sun, 18 Dec 2022 06:08:51 GMT

This is really interesting; I have never faced this type of situation. @Pat Sienkiewicz, can you please share the whole code so that we can test and debug this on our system?

Thanks

Aviral

Re: Reading data from "dbfs:/mnt/"

Pat — Mon, 19 Dec 2022 07:35:30 GMT

Hi @Aviral Bhardwaj ,

I will try to reproduce this. I think that at least one of the files is corrupted, but I would expect a different error in that case, not a long-running job that fails with `Unable to infer schema for Parquet. It must be specified manually.`
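To test the corrupted-file theory without running a full Spark job, a rough first check is to look at the magic bytes: an unencrypted Parquet file begins and ends with the four bytes `PAR1`. The sketch below flags `.parquet` files that fail that check, for example files truncated by an interrupted write. It is only a cheap sanity check, not a validation of the footer or the data pages, and it assumes the folder is readable through a local path such as `/dbfs/mnt/...`. The function names are hypothetical.

```python
import os

MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def looks_like_parquet(path):
    """Return True if the file has the Parquet magic bytes at both ends."""
    size = os.path.getsize(path)
    # Smallest plausible layout: 4-byte magic + 4-byte footer length + 4-byte magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == MAGIC and tail == MAGIC


def find_suspect_files(root):
    """Return sorted paths of *.parquet files under root that fail the check."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                full = os.path.join(dirpath, name)
                if not looks_like_parquet(full):
                    suspects.append(full)
    return sorted(suspects)


# Example call (hypothetical local path for the mounted folder):
# for path in find_suspect_files("/dbfs/mnt/table"):
#     print("suspect:", path)
```

A file that passes this check can still be unreadable, so an empty result narrows the search rather than ruling corruption out.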

Re: Reading data from "dbfs:/mnt/"

Aviral-Bhardwaj — Tue, 20 Dec 2022 01:45:03 GMT

Thanks for sharing, I hope it will work.