cancel
Showing results for 
Search instead for 
Did you mean: 
Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.
cancel
Showing results for 
Search instead for 
Did you mean: 

InconsistentReadException: The file might have been updated during query - CSV backed table

hukel
Contributor

I have some CSV files that I upload to DBFS storage several times a day.   From these CSVs,  I have created SQL tables:

 

CREATE TABLE IF NOT EXISTS masterdata.lookup_host
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION '/mnt/masterdata/assets-lookup.csv';

 

This works well for short queries but occasionally a long-running query will be disrupted when the underlying CSV is updated.  

 

23/11/27 21:09:57 ERROR Executor: Exception in task 2.0 in stage 339.0 (TID 1167)
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/masterdata/assets-lookup.csv.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:695)
.....
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: com.databricks.common.filesystem.InconsistentReadException: The file might have been updated during query execution. Ensure that no pipeline updates existing files during query execution and try again.

 

This may be the result of this change:

https://docs.databricks.com/en/release-notes/runtime/13.3lts.html#databricks-runtime-returns-an-erro...

Databricks Runtime returns an error if a file is modified between query planning and invocation

Databricks Runtime queries now return an error if a file is updated between query planning and invocation. Before this change, Databricks Runtime would read a file between these stages, which occasionally lead to unpredictable results.

Is there a way to accept/ignore this file change and continue query execution?

Is there a better way to keep simple master/lookup data available for SQL joins and subqueries?

1 REPLY 1

hukel
Contributor

One approach I'm testing (positive results so far, but still early).

%sql
# Prep and cleanup
REFRESH TABLE masterdata.lookup_host;
DROP TABLE IF EXISTS t_hosts;

# Forcibly cache the needed columns before using the data in another query.  
CACHE TABLE t_hosts
OPTIONS ('storageLevel' 'DISK_ONLY')
SELECT identifier,category,domain
FROM masterdata.lookup_host lh 
WHERE lh.category = 'workstation';

SELECT * FROM t_hosts