Databricks Community

hukel · ‎11-27-2023

I have some CSV files that I upload to DBFS storage several times a day. From these CSVs, I have created SQL tables:

CREATE TABLE IF NOT EXISTS masterdata.lookup_host
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION '/mnt/masterdata/assets-lookup.csv';

This works well for short queries but occasionally a long-running query will be disrupted when the underlying CSV is updated.

23/11/27 21:09:57 ERROR Executor: Exception in task 2.0 in stage 339.0 (TID 1167)
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/masterdata/assets-lookup.csv.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:695)
.....
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: com.databricks.common.filesystem.InconsistentReadException: The file might have been updated during query execution. Ensure that no pipeline updates existing files during query execution and try again.

This may be the result of this change:

https://docs.databricks.com/en/release-notes/runtime/13.3lts.html#databricks-runtime-returns-an-erro...

Databricks Runtime returns an error if a file is modified between query planning and invocation

Databricks Runtime queries now return an error if a file is updated between query planning and invocation. Before this change, Databricks Runtime would read a file between these stages, which occasionally lead to unpredictable results.

Is there a way to accept/ignore this file change and continue query execution?

Is there a better way to keep simple master/lookup data available for SQL joins and subqueries?

hukel · ‎11-28-2023

One approach I'm testing (positive results so far, but still early).

%sql
# Prep and cleanup
REFRESH TABLE masterdata.lookup_host;
DROP TABLE IF EXISTS t_hosts;

# Forcibly cache the needed columns before using the data in another query.  
CACHE TABLE t_hosts
OPTIONS ('storageLevel' 'DISK_ONLY')
SELECT identifier,category,domain
FROM masterdata.lookup_host lh 
WHERE lh.category = 'workstation';

SELECT * FROM t_hosts

Databricks Community

InconsistentReadException: The file might have been updated during query - CSV backed table

Databricks Runtime returns an error if a file is modified between query planning and invocation

Get Certified at Data & AI Summit and Earn this Exclusive Databricks Jacket

Supercharge Your Code Generation

Registration now open! Databricks Data + AI Summit 2024

Announcing General Availability of Liquid Clustering

Introducing the Databricks AI Fund