Monday
I have a Databricks external table which is pointed at an S3 bucket containing an ever-growing number of parquet files (currently around 2000 of them). Each row is timestamped to indicate when it was written. A new parquet file is added every hour or so with that hour's updates.
When I view the files in S3, I can see that the most recent file is dated 12:44:25 today, and (viewing its contents) I see that the latest row is dated 12:44:14
When I select from the Databricks table that is based on these files, the most recent row is timestamped 10:43:46 - the two most recent parquet files are not appearing in the table data. I have tried running "REFRESH TABLE my_table_name" and also "REFRESH FOREIGN TABLE my_table_name", but it makes no difference. I also tried MSCK REPAIR TABLE following this thread, but that just gave me the error message "The command(s): Repair Table are not supported in Unity Catalog. SQLSTATE: 0AKUC"
Is there some caching somewhere that I need to disable? How can I get Databricks to show the latest changes in the underlying data?
Thursday
Thanks for your answers.
I got a solution in the end, but it was more weirdness. A colleague fired the same query at the same database on his machine, and got the latest data! So I rebooted my PC and opened a new Databricks session, and I got the latest data too! So there must have been some kind of caching going on just for my session.
Monday
Hi Chris,
You can use Auto Loader, as it is the most reliable way to pick up each new Parquet file as it lands in S3 and make those records immediately queryable in Databricks. It incrementally discovers files and writes them into a Delta table (or streaming table), avoiding manual refreshes or partition repairs on your external Parquet table. It also tracks which files have already been processed, so each file is ingested exactly once and only new files are picked up.
Ref Doc - https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader
Auto Loader ("cloudFiles") continuously detects new objects in S3 and ingests them with checkpointed state, so newly arrived records are available without you running REPAIR or REFRESH commands.
If you enable "file events" on your Unity Catalog external location and set cloudFiles.useManagedFileEvents = true, Auto Loader uses a Databricks-managed event cache (SNS/SQS under the hood on AWS) for near real-time, low-cost discovery instead of repeated directory listings.
A point to note here is that Auto Loader does not "refresh" your existing UC external Parquet table. MSCK or REFRESH TABLE doesn't work when UC hasn't discovered the new partitions/files yet. It's better to choose Auto Loader if you want continuous ingestion and immediate queryability without manual maintenance; it's the best path for growing file counts with near real-time updates.
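A minimal Auto Loader sketch along those lines (the bucket paths, checkpoint location, and target table name below are hypothetical placeholders; adjust them to your environment):

```python
# Hypothetical locations - replace with your own bucket, checkpoint
# path, and Unity Catalog table name.
source_path = "s3://my-bucket/hourly-parquet/"
checkpoint_path = "s3://my-bucket/_checkpoints/hourly_parquet_ingest"
target_table = "main.bronze.hourly_updates"

# Options telling Auto Loader ("cloudFiles") to read Parquet and where
# to track schema and ingestion state.
cloudfiles_options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.schemaLocation": checkpoint_path,
}

def start_ingest(spark):
    """Incrementally pick up new Parquet files and append them to a Delta table."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**cloudfiles_options)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # ingest all files not yet processed, then stop
        .toTable(target_table)
    )
```

You would run `start_ingest` on a schedule (or with a continuous trigger for lower latency) and query the resulting Delta table instead of the external Parquet table.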
Monday
Not sure what the root cause is, but my recommendation, if possible, is to consider migrating to managed Delta tables in order to leave all of this weird behavior behind.
Monday - last edited Monday
Check whether the disk cache is enabled: spark.conf.get("spark.databricks.io.cache.enabled"). If so, try clearing the cache and re-running your query.
Command to clear the cache: spark.catalog.clearCache()
When the disk cache is enabled, data read from remote storage is kept on the cluster's local disks for faster repeated reads, and this might have caused the issue. Clear the cache and re-run your query; if that works, you can set the disk cache to false so that your queries fetch the latest results.
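That check-and-clear sequence could be sketched as a small helper (a sketch only, assuming a live `spark` session on the cluster):

```python
def ensure_fresh_reads(spark):
    """Clear the disk (IO) cache if it is enabled, so the next query re-reads S3.

    Returns the original value of the cache setting.
    """
    # Default to "false" if the conf key is unset on this cluster.
    io_cache = spark.conf.get("spark.databricks.io.cache.enabled", "false")
    if io_cache.lower() == "true":
        spark.catalog.clearCache()  # drop cached data for this session
        # Optionally keep the disk cache off so later queries stay fresh.
        spark.conf.set("spark.databricks.io.cache.enabled", "false")
    return io_cache
```

Note that disabling the disk cache trades read performance for freshness, so it may be better to clear it only when staleness is actually observed.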
Hope this helps.
Tuesday
Note: with managed tables, cache invalidation is automatic - no need for manual refreshes.
Take a look at this screenshot from Hubert Dudek on how to migrate:
And here you have more details about how disk caching works together with the Spark cache, along with cache invalidation: https://youtu.be/_vWnH4kmF60?si=m6FAwKnuWP8pkhvW
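One possible migration sketch (not necessarily the approach in the screenshot above; the table names here are hypothetical, and a plain CTAS is assumed, which copies the external Parquet data into a managed table that is Delta by default):

```python
# Hypothetical source and target names - adjust to your catalog/schema.
SOURCE_TABLE = "my_table_name"                  # existing UC external Parquet table
TARGET_TABLE = "main.default.my_table_managed"  # new managed Delta table

# CREATE TABLE AS SELECT copies the data; as a managed Delta table, the
# target gets automatic metadata handling and cache invalidation.
MIGRATE_SQL = f"""
CREATE OR REPLACE TABLE {TARGET_TABLE}
AS SELECT * FROM {SOURCE_TABLE}
"""

def migrate(spark):
    # Run the one-off copy on a cluster with access to both tables.
    spark.sql(MIGRATE_SQL)
```

After the copy you would repoint consumers at the managed table (or keep it refreshed with a scheduled job or Auto Loader).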
Tuesday
@Coffee77 - Managed tables are, of course, the go-to approach to avoid the lag, cache invalidation issues, etc.
However, if we assume @ChrisHunt has to keep it as an external table, is there a cache that can be manually refreshed in some way? I have not seen this weird behaviour with the external tables I have used over the past several years.
Tuesday
Me neither, @Raman_Unifeye. Really strange.
Tuesday - last edited Tuesday
@Raman_Unifeye and @Coffee77 there was a situation where a parquet file was deleted as obsolete when a vacuum was executed. After that, the job started to fail, saying it was unable to find the parquet file, even though it was reading from an external location. The cluster had the old metadata for the Delta table, not the latest. Once I cleared the cache, it started to work as expected.
Before I cleared the cache, I tried reading the same Delta table from a different cluster, and it was able to read without any issues.
More about the cluster: it was a long-running interactive cluster on which the job was running at a frequent interval.
Thursday
Caching is very cool, but there are so many stages of possible "caches" that it sometimes leads to this type of weird behavior indeed. Hence, it is very relevant to know not only how and where to cache, but also how invalidation works!