Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks external table lagging behind source files

ChrisHunt
Visitor

I have a Databricks external table pointed at an S3 bucket containing an ever-growing number of Parquet files (currently around 2000 of them). Each row in each file is timestamped to indicate when it was written. A new Parquet file is added every hour or so with that hour's updates.

When I view the files in S3, I can see that the most recent file is dated 12:44:25 today, and (viewing its contents) I see that the latest row is timestamped 12:44:14.

When I select from the Databricks table which is based on these files, the most recent row is timestamped 10:43:46 - the two most recent Parquet files are not appearing in the table data. I have tried running "REFRESH TABLE my_table_name" and also "REFRESH FOREIGN TABLE my_table_name", but it makes no difference. I also tried MSCK REPAIR TABLE following this thread, but that just gave me the error message "The command(s): Repair Table are not supported in Unity Catalog. SQLSTATE: 0AKUC".

Is there some caching somewhere that I need to disable? How can I get Databricks to show the latest changes in the underlying data?

2 REPLIES

iyashk-DB
Databricks Employee

Hi Chris,
You can use Auto Loader; it is the most reliable way to pick up each new Parquet file as it lands in S3 and make those records immediately queryable in Databricks. It incrementally discovers new files and writes them into a Delta table (or streaming table), avoiding manual refreshes or partition repairs on your external Parquet table. It also keeps track of files it has already processed, so each file is ingested exactly once and only new files are picked up.

Ref Doc - https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader

Auto Loader (“cloudFiles”) continuously detects new objects in S3 and ingests them with checkpointed state, so newly arrived records are available without you running REPAIR or REFRESH commands.
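For reference, a minimal sketch of such a stream in a notebook could look like this (the bucket paths, schema/checkpoint locations, and the target table name are placeholders I've assumed; substitute your own):

# Incrementally ingest new Parquet files from S3 into a Delta table with Auto Loader
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/my_table")   # assumed path; used for schema tracking
    .load("s3://my-bucket/data/")                                              # assumed prefix holding the hourly Parquet files
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/my_table")      # assumed path; records which files were already processed
    .trigger(availableNow=True)    # ingest whatever is new, then stop; remove this for a continuous stream
    .toTable("main.default.my_table_bronze"))                                  # assumed Unity Catalog target table

You would then query main.default.my_table_bronze instead of the external Parquet table and always see the latest rows.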

If you enable “file events” on your Unity Catalog external location and set cloudFiles.useManagedFileEvents = true, Auto Loader uses a Databricks-managed event cache (SNS/SQS under the hood on AWS) for near real‑time, low‑cost discovery instead of repeated directory listings.
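If you go that route, the option just gets added to the same reader, along the lines of the sketch below (again, paths are assumptions, and this requires file events to be enabled on the external location covering the bucket):

# Same reader as above, but discovering new files via the managed event cache instead of directory listings
stream_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useManagedFileEvents", "true")   # use Databricks-managed file events for discovery
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/my_table")
    .load("s3://my-bucket/data/"))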

But a point to note here is that Auto Loader does not “refresh” your existing UC external Parquet table. MSCK or REFRESH TABLE doesn't help when UC hasn't discovered the new partitions/files yet. It's better to choose Auto Loader if you want continuous ingestion and immediate queryability without manual maintenance; it's the best path for growing file counts with near real-time updates.

Coffee77
Contributor III

Not sure what the root cause is, but my recommendation, if possible, is to consider migrating to managed Delta tables in order to leave all of this weird behavior behind.
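As a rough sketch, a one-off copy into a managed Delta table could look like this (the target table name is a placeholder; my_table_name is the existing external Parquet table from the question):

# One-off copy of the external Parquet table's current contents into a managed Delta table
(spark.read.table("my_table_name")
    .write
    .format("delta")
    .saveAsTable("main.default.my_table_delta"))   # assumed Unity Catalog target name

New data would then need to land in the Delta table going forward (for example via the Auto Loader stream sketched above) rather than as raw Parquet files under the external table.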


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData
