Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks external table lagging behind source files

ChrisHunt
New Contributor

I have a Databricks external table pointed at an S3 bucket containing an ever-growing number of Parquet files (currently around 2,000 of them). Each row in the files is timestamped to indicate when it was written, and a new Parquet file is added every hour or so with that hour's updates.

When I view the files in S3, I can see that the most recent file is dated 12:44:25 today, and (viewing its contents) I see that the latest row is dated 12:44:14.

When I select from the Databricks table which is based on these files, the most recent row is timestamped 10:43:46 - the two most recent Parquet files are not appearing in the table data. I have tried running "REFRESH TABLE my_table_name" and also "REFRESH FOREIGN TABLE my_table_name", but it makes no difference. I also tried MSCK REPAIR TABLE following this thread, but that just gave me the error message "The command(s): Repair Table are not supported in Unity Catalog. SQLSTATE: 0AKUC"

Is there some caching somewhere that I need to disable? How can I get Databricks to show the latest changes in the underlying data?

1 ACCEPTED SOLUTION

Accepted Solutions

ChrisHunt
New Contributor

Thanks for your answers.

I got a solution in the end, but it was more weirdness. A colleague fired the same query at the same database from his machine and got the latest data! So I rebooted my PC, opened a new Databricks session, and got the latest data too. There must have been some kind of caching going on just for my session.


9 REPLIES

iyashk-DB
Databricks Employee

Hi Chris,
You can use Auto Loader; it is the most reliable way to pick up each new Parquet file as it lands in S3 and make those records immediately queryable in Databricks. It works by incrementally discovering files and writing them into a Delta table (or streaming table), avoiding manual refreshes or partition repairs on your external Parquet table. It also tracks which files have already been processed, so each file is ingested exactly once and only new files are picked up.

Ref Doc - https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader

Auto Loader (โ€œcloudFilesโ€) continuously detects new objects in S3 and ingests them with checkpointed state, so newly arrived records are available without you running REPAIR or REFRESH commands.

If you enable "file events" on your Unity Catalog external location and set cloudFiles.useManagedFileEvents = true, Auto Loader uses a Databricks-managed event cache (SNS/SQS under the hood on AWS) for near real-time, low-cost discovery instead of repeated directory listings.

But a point to note here is that Auto Loader does not "refresh" your existing UC external Parquet table, and MSCK or REFRESH TABLE won't help when UC hasn't discovered the new partitions/files yet. It's better to choose Auto Loader if you want continuous ingestion and immediate queryability without manual maintenance; it's the best path for growing file counts with near real-time updates.
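Auto Loader itself only runs on Databricks, but the checkpointed-discovery idea it relies on can be sketched in plain Python. This is a toy model, not the real implementation; the function name and checkpoint format here are made up for illustration:

```python
import json
from pathlib import Path

def discover_new_files(data_dir: Path, checkpoint: Path) -> list:
    """Return Parquet files not yet recorded in the checkpoint, then record them.

    A toy model of Auto Loader's incremental listing: each file is
    processed exactly once, so re-running never re-ingests old data,
    and newly landed files are picked up on the next run.
    """
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    new_files = sorted(p for p in data_dir.glob("*.parquet") if p.name not in seen)
    checkpoint.write_text(json.dumps(sorted(seen | {p.name for p in new_files})))
    return new_files
```

Each call returns only the files that landed since the previous call, which is the behavior that makes the external-table REFRESH dance unnecessary.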

Coffee77
Contributor III

Not sure what the root cause is, but my recommendation, if possible, is to consider migrating to managed Delta tables in order to leave all of these weird behaviors behind.


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

Prajapathy_NKR
New Contributor III

@ChrisHunt 

Check whether the disk cache is enabled with spark.conf.get("spark.databricks.io.cache.enabled"). If it is, try clearing the cache and re-running your query.

Command to clear the cache: spark.catalog.clearCache()

When the disk cache is enabled, data is stored on local disk for faster reads, which may be what caused the issue. Clear the cache and re-run your query; if that works, you can set the disk cache to false so that your queries always fetch the latest results.
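The failure mode being described can be illustrated with a toy cache in plain Python (this is not how the Databricks disk cache is implemented, just a sketch of why a stale per-session cache hides new files until it is cleared):

```python
class CachedReader:
    """Toy model of a per-session read cache going stale.

    Once a path is read, the cached value is returned on every later
    read until the cache is cleared, so updates to the underlying
    store (here a dict standing in for S3) stay invisible.
    """
    def __init__(self, store: dict):
        self.store = store
        self._cache = {}

    def read(self, path: str):
        if path not in self._cache:
            self._cache[path] = self.store[path]
        return self._cache[path]

    def clear_cache(self):
        # Analogous in spirit to spark.catalog.clearCache()
        self._cache.clear()
```

After the underlying store changes, read() keeps returning the old rows until clear_cache() is called, matching the symptom in the original question.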

Hope this helps.

Coffee77
Contributor III

Note: with managed tables, cache invalidation is automatic; no manual refreshes are needed.

Take a look at this screenshot from Hubert Dudek on how to migrate:

[Screenshot: Coffee77_0-1764062194282.png]

And here you have more details about how disk caching works together with Spark caching, along with cache invalidation: https://youtu.be/_vWnH4kmF60?si=m6FAwKnuWP8pkhvW
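The "automatic invalidation" point can be sketched in a few lines of plain Python. This is a toy model of the idea, not Delta's actual mechanism: the writer bumps a version on every commit (as a transaction log does), and the reader compares its cached version against the current one before serving cached rows, so no manual clearCache() is ever needed:

```python
class VersionedStore:
    """Toy writer that bumps a version on every commit."""
    def __init__(self):
        self.version = 0
        self.rows = []

    def write(self, row):
        self.rows = self.rows + [row]
        self.version += 1


class AutoInvalidatingReader:
    """Toy reader whose cache self-invalidates when the version moves."""
    def __init__(self, store: VersionedStore):
        self.store = store
        self._cached_rows = None
        self._cached_version = -1

    def read(self):
        if self._cached_version != self.store.version:
            # Cache is stale: refresh from the store automatically.
            self._cached_rows = list(self.store.rows)
            self._cached_version = self.store.version
        return self._cached_rows
```

With this scheme a reader can never serve stale rows across a commit, which is the behavior the external-table setup in the original question lacked.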

 


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

Raman_Unifeye
Contributor III

@Coffee77 - Managed tables are, of course, the go-to approach to avoid the lag, cache invalidation issues, etc.

However, if we assume @ChrisHunt has to keep it as an external table, is there a cache that can be manually refreshed in any way? I have not seen this weird behaviour with the external tables I've used over the past several years.


RG #Driving Business Outcomes with Data Intelligence

Me neither, @Raman_Unifeye. Really strange.


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

Prajapathy_NKR
New Contributor III

@Raman_Unifeye and @Coffee77, there was a situation where a Parquet file was deleted as obsolete when a VACUUM was executed. After that, the job started to fail, saying it was unable to find the Parquet file, even though it was reading from an external location. The cluster had the old metadata of the Delta table, not the latest. Once I cleared the cache, it started to work as expected.

Before I cleared the cache, I tried reading the same Delta table from a different cluster, and it read without any issues.

More about the cluster: it was a long-running interactive cluster on which the job ran at frequent intervals.


Caching is very cool, but there are so many layers of possible caches that it sometimes leads to this type of weird behavior indeed. Hence, it is very relevant to know not only how and where to cache, but also how invalidation works!


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData