Databricks

rpshgupta · ‎06-19-2022

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 458.0 failed 4 times, most recent failure: Lost task 0.3 in stage 458.0 (TID 2247) (172.18.102.75 executor 1): com.databricks.sql.io.FileReadException: Error while reading file abfss:REDACTED_LOCAL_PART@adls.dfs.core.windows.net/file.csv. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:417)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:369)

at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:509)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:322)

at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:317)

at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)

at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)

at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)

at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)

at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)

at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)

at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)

at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.scheduler.Task.run(Task.scala:95)

at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:825)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1658)

at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:828)

at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

Caused by: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, HEAD, https://adls.dfs.core.windows.net/raw/file.csv?upn=false&action=getStatus&timeout=90

at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1344)

at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:266)

at com.databricks.spark.metrics.FileSystemWithMetrics.open(FileSystemWithMetrics.scala:336)

at org.apache.hadoop.fs.FileSystem.lambda$openFileWithOptions$0(FileSystem.java:4633)

at org.apache.hadoop.util.LambdaUtils.eval(LambdaUtils.java:52)

at org.apache.hadoop.fs.FileSystem.openFileWithOptions(FileSystem.java:4631)

at org.apache.hadoop.fs.FileSystem$FSDataInputStreamBuilder.build(FileSystem.java:4768)

at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:92)

at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)

at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource.readFile(CSVDataSource.scala:108)

at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:169)

at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:156)

at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:143)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:353)

... 31 more

Hubert-Dudek · ‎06-20-2022

It seems that it points to a file that no longer exists. As the error says, please try 'REFRESH TABLE tableName' so it will update links to files in hive metastore. If that doesn't help, please share your code.

rpshgupta · ‎06-20-2022

@Hubert Dudek There is no table at all. I am just writing/reading parquet files.

Hubert-Dudek · ‎06-28-2022

Please share your code. Then we will be able to help.

Kaniz · ‎06-27-2022

Hi @Rupesh gupta, This example uses the read method to use the parquet method of the resulting DataFrameReader to read the Parquet file in the specified location into a DataFrame and then display the DataFrame’s content. You can read your parquet file through this method.

parquetDF = spark.read.format("parquet").load("/path")
parquetDF.show(truncate=False)

jose_gonzalez · ‎07-29-2022

Try to convert your Parquet table to Delta table and this error will be resolved.

nagini_sitarama · ‎11-30-2022

I am also facing the same issue . I am accessing view that is created on top of joining 4 tables that are in parquet format. so when i pull the data from the view using my streaming job , the job fails .

Even though the base table is incremental append on daily basis , does the part file changes its name for every day in case of parquet file format ?

Vidula · ‎08-25-2022

Hi @Rupesh gupta

Hope you are well. Just wanted to see if you were able to find an answer to your question and would you like to mark an answer as best? It would be really helpful for the other members too.

Cheers!

rpshgupta · ‎08-30-2022

I couldn't find any best solution yet. I have seen this issue so many times now and it get fixed after rerun. I don't feel re-running is the best solution.

Databricks

Databricks notebook failed with "Caused by: java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, HEAD, https://adls.dfs.core.windows.net/raw/file.csv?upn=false&action=getStatus&timeout=90".

How to successfully build GenAI applications

Registration now open! Databricks Data + AI Summit 2024

Meet DBRX, the New Standard for High-Quality LLMs

Data Warehousing in the Era of AI