11-21-2023 06:13 AM
Hi Team,
While writing my data to a Data Lake table I am getting 'filtering files for query', and the job gets stuck at the write stage.
How can I resolve this issue?
11-21-2023 06:51 AM
Can you give some more details? Are you doing merge statements? How big are the tables?
For merge statements, the process needs to read the target table to analyze which Parquet files need to be rewritten. If you don't have proper partitioning or Z-ordering, it can end up scanning all files even if you only try to update a few rows.
Did you try to optimize the tables already?
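In case it helps, here is a minimal sketch of what I mean, run from a Databricks notebook. The table and column names (my_schema.events, merge key id) are placeholders, not your actual objects:

# Placeholder names for illustration only.
# Compact small files and co-locate rows by the merge key so MERGE can skip unrelated files.
spark.sql("OPTIMIZE my_schema.events ZORDER BY (id)")

# Shows partition columns, file count and size, which helps judge whether pruning can work.
spark.sql("DESCRIBE DETAIL my_schema.events").show(truncate=False)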
11-21-2023 07:47 AM
Thanks for the quick reply.
I am using SSMS as my redacted table, and it is using upsert as the write mode. That table is huge in size; when I checked the SQL part in Databricks, it is reading every record into memory.
11-21-2023 08:01 AM
11-21-2023 10:13 AM
@pgruetter
Could you please check the above?
12-12-2023 06:23 PM
@Muhammed DESCRIBE <table_name> will give you an idea of how your table is partitioned. Consider adding a partition column condition in the WHERE clause for better performance.
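A hedged sketch of what that could look like, assuming a target table named sales partitioned by sale_date and a staging source named updates (all names made up); the partition predicate on the target side of the MERGE is what lets Delta prune files instead of scanning the whole table:

# Inspect the table layout, including the partition columns.
spark.sql("DESCRIBE EXTENDED sales").show(truncate=False)

# MERGE with a partition filter on the target; names and date are placeholders.
spark.sql("""
  MERGE INTO sales AS t
  USING updates AS s
  ON t.id = s.id
     AND t.sale_date >= '2023-11-01'  -- restricts the scan to recent partitions
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")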
11-21-2023 10:33 AM
Still hard to say, but it sounds like my assumption is correct. Because your upsert doesn't know which records to update, it needs to scan everything. Make sure the table is properly partitioned, that you have a Z-order index, and run OPTIMIZE on the table.
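Since a framework is doing the upsert for you, here is a rough sketch of the same idea through the Delta Lake Python API instead of SQL. DeltaTable.forName and the merge builder are standard, but the table name target_db.big_table, the source DataFrame updates_df and the date cutoff are assumptions:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_db.big_table")  # placeholder table name

(target.alias("t")
       .merge(
           updates_df.alias("s"),  # assumed: a DataFrame holding the incoming records
           "t.id = s.id AND t.event_date >= '2023-11-01'"  # partition filter narrows the file scan
       )
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())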
11-22-2023 11:37 PM
Hi Kaniz
I have one more issue. I am writing fewer than 1.2k records to the Data Lake table (append mode). While writing it shows "determining dbio file fragments, this would take some time", and when I checked the log I see a GC allocation failure.
My overall execution time is 20 minutes, which is hard for me. How can I resolve this? Is it mandatory to use VACUUM and ANALYZE queries along with OPTIMIZE?
Shall I run OPTIMIZE datalake.table?
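For reference, these are the three maintenance commands in question, sketched against a placeholder table datalake.my_table (not your actual name). Whether you need all three depends on your workload; this only shows the syntax:

# Compact many small files into fewer larger ones.
spark.sql("OPTIMIZE datalake.my_table")

# Remove data files no longer referenced by the table; default retention is 7 days.
spark.sql("VACUUM datalake.my_table")

# Refresh table statistics so the optimizer has up-to-date size and row estimates.
spark.sql("ANALYZE TABLE datalake.my_table COMPUTE STATISTICS")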
11-22-2023 11:39 PM
12-09-2023 08:09 AM
We are using a framework for data ingestion. I hope this will not cause any issues with the metadata of the Data Lake table? As per the framework, the table metadata is crucial, and any change to it will affect the system.
Sometimes the particular pipeline takes 2 hours just to write 1k records.
12-11-2023 09:38 AM
Hi @Retired_mod
Any info on this?
12-12-2023 06:33 PM
I understand you are getting 'filtering files for query' while writing.
From the screenshot it looks like you have 157 million files in the source location. Can you please try dividing the files by prefix so that small micro-batches can be processed in parallel?
Try the maxFilesPerTrigger option to restrict the number of files per batch.
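A minimal sketch of that option with a streaming file source: the schema, paths, batch size and target table below are all made-up examples, only maxFilesPerTrigger is the point here:

from pyspark.sql.types import StructType, StructField, StringType

# File-based streaming sources need an explicit schema; this one is a placeholder.
source_schema = StructType([
    StructField("id", StringType()),
    StructField("value", StringType()),
])

stream_df = (
    spark.readStream
         .format("parquet")                   # assumed source format
         .schema(source_schema)
         .option("maxFilesPerTrigger", 1000)  # cap on new files picked up per micro-batch
         .load("/mnt/source/path")            # placeholder source path
)

(stream_df.writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/checkpoints/my_table")  # placeholder
          .toTable("datalake.my_table"))      # placeholder target table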
12-14-2023 04:26 AM
Where did you get the info about 157 million files? If possible, could you please explain it?
12-14-2023 10:31 AM
My bad, I saw that somewhere in the screenshot but am not able to find it now.
Which source are you using to load the data: a Delta table, AWS S3, or Azure Storage?