Filtering files for query

Muhammed
New Contributor III

Hi Team,

While writing my data to a datalake table, I am getting 'Filtering files for query' and the job gets stuck at the write stage.

How can I resolve this issue?


pgruetter
Contributor

Can you give some more details? Are you doing merge statements? How big are the tables?

For merge statements, the process needs to read the target table to determine which Parquet files need to be rewritten. If you don't have proper partitioning or Z-ordering, it can end up scanning all files even if you are only trying to update a few rows.

Did you try to optimize the tables already?
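For reference, a minimal sketch of what that optimization could look like (table and column names here are placeholders, not from this thread):

```sql
-- Compact small files and co-locate rows on the columns used in the merge
-- condition, so a MERGE can skip unrelated files.
-- Table/column names are hypothetical.
OPTIMIZE datalake.target_table
ZORDER BY (customer_id, event_date);
```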

Muhammed
New Contributor III

Thanks for the quick reply.

I am using SSMS as my (redacted) table, and it is using upsert as the write mode. That table is huge in size; when I checked the SQL part in Databricks, it is reading every record into memory.

Muhammed
New Contributor III

[screenshot attachment not captured in this transcript]

Muhammed
New Contributor III

@pgruetter 
Could you please check the above?

@Muhammed  describe <table_name> will give you an idea of how your table is partitioned. Consider adding a partition-column condition in the WHERE clause for better performance.
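As a sketch (the table name is a placeholder), the commands for inspecting partitioning on a Delta table would look like:

```sql
-- Schema plus partition information:
DESCRIBE EXTENDED datalake.target_table;

-- Delta-specific summary: file count, size in bytes, partition columns:
DESCRIBE DETAIL datalake.target_table;
```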

pgruetter
Contributor

Still hard to say, but it sounds like my assumption is correct. Because your upsert doesn't know which records to update, it needs to scan everything. Make sure the table is properly partitioned, that you have a Z-order index, and run OPTIMIZE on the table.
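To illustrate (all names hypothetical, not from this thread), an upsert that includes the partition column in the match condition lets Delta prune files instead of scanning the full target table:

```sql
-- Hypothetical sketch: the predicate on the partition column (event_date)
-- allows file pruning during the target-table scan.
MERGE INTO datalake.target_table AS t
USING staging_updates AS s
  ON  t.event_date  = s.event_date   -- partition column: enables pruning
  AND t.customer_id = s.customer_id  -- business key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```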

Kaniz
Community Manager

Hi @Muhammed , when encountering the issue of being stuck at “Filtering files for query” while writing data to a Databricks Delta table, there are several steps you can take to address this situation:

 

Optimize Your Query:

Check Data Clustering and Partitioning:

  • Partitioning: Verify that your Delta table is properly partitioned. Partitioning helps with data organization and query performance.
  • Clustering: If applicable, use data clustering to group similar data together. This can enhance query efficiency by reducing the amount of data scanned.

Memory and Resource Considerations:

  • Cluster Configuration: Ensure that your Databricks cluster has sufficient resources (memory and cores) to handle the query. You might need to adjust the cluster size or type.
  • Memory/Caching: If the data cannot be loaded into cache quickly enough, it might impact the writing process. Consider increasing memory resources if needed.

Review Your Write Command:

  • Check the syntax of your write command. Make sure it’s correctly specifying the format, mode, overwrite options, and partitioning.

Monitor and Debug:

  • Job Monitoring: Monitor the job execution using Databricks UI. Check the query execution plan and identify any bottlenecks.
  • Logs and Metrics: Inspect logs and metrics to identify any issues related to memory, spills, or resource contention.

Cluster Version and Type:

  • Ensure that your Databricks runtime version and cluster type are compatible with your workload. Consider upgrading if necessary.

Remember that the specific solution may vary based on your dataset, query complexity, and cluster configuration. Experiment with different approaches and monitor the impact on performance. Good luck! 🚀
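For the partitioning and write-command points above, a minimal sketch of a partitioned Delta table definition (all names hypothetical):

```sql
-- Hypothetical partitioned Delta table; pick a partition column with
-- moderate cardinality that matches your query and upsert patterns.
CREATE TABLE IF NOT EXISTS datalake.target_table (
  customer_id BIGINT,
  payload     STRING,
  event_date  DATE
)
USING DELTA
PARTITIONED BY (event_date);
```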

Muhammed
New Contributor III

Hi Kaniz,
I have one more issue. I am writing fewer than 1.2k records to the datalake table (append mode). While writing, it shows "determining dbio file fragments, this would take some time", and when I checked the log I see a GC allocation failure.
My overall execution time is 20 minutes, which is hard for me. How can I resolve this? Is it mandatory to use VACUUM and ANALYZE queries along with OPTIMIZE?
Shall I run OPTIMIZE datalake.table?

 



Kaniz
Community Manager

Hi @Muhammed, The GC (Allocation Failure) error you’re encountering is related to memory allocation within the Java Virtual Machine (JVM). When the JVM can’t allocate memory for an object due to insufficient space in the Young Generation heap, it triggers a Minor GC to free up memory. If that doesn’t suffice, a Full GC is initiated.

 

Here are some steps to address this issue:

 

Memory Configuration:

  • Ensure that your JVM memory settings are appropriately configured. You might need to allocate more memory to avoid frequent garbage collection.
  • Consider adjusting the heap size parameters (-Xms and -Xmx) based on your system’s available memory.

Optimize Your Code:

  • Review your code for any memory leaks or inefficient memory usage patterns.
  • Avoid creating unnecessary objects or holding references longer than needed.

Data Lake Table Optimization:

  • Running OPTIMIZE datalake.table can help optimize the table layout and improve query performance. It reorganizes data files and statistics.
  • Additionally, consider running VACUUM and ANALYZE commands periodically to maintain data quality and optimize storage.

Partitioning and Clustering:

  • If your table has partitions, ensure they are well-defined and aligned with your query patterns.
  • Clustering data based on frequently queried columns can enhance performance.

Compression:

  • Evaluate using compression for your data files. Compressed files consume less storage and can improve read performance.

File Fragmentation:

  • The message “determining dbio file fragments” indicates that the system is analyzing file fragmentation. This process can take time.
  • If possible, pre-allocate space for your data files to minimize fragmentation during writes.

Remember that optimizing performance involves a combination of factors, including memory management, query design, and table maintenance. Experiment with the suggestions above and monitor the impact on execution time.

 

Feel free to run OPTIMIZE datalake.table and observe the results. If you encounter any issues, let us know.
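As a sketch of the maintenance commands mentioned above (the table name comes from the thread; the ordering and retention note are assumptions):

```sql
-- Compact small files:
OPTIMIZE datalake.table;

-- Refresh table-level statistics for the optimizer:
ANALYZE TABLE datalake.table COMPUTE STATISTICS;

-- Remove data files no longer referenced by the table
-- (subject to the default 7-day retention window):
VACUUM datalake.table;
```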

Muhammed
New Contributor III

[screenshot attachment not captured in this transcript]

Muhammed
New Contributor III

@Kaniz 

We are using a framework for data ingestion; I hope this will not cause any issues with the metadata of the datalake table? As per the framework, the table's metadata is crucial, and any change to it will affect the system.

Sometimes the particular pipeline takes 2 hours just to write 1k records.

Muhammed
New Contributor III

Hi @Kaniz 

Any info on this ?

kulkpd
Contributor

I understand you are getting 'filtering files for query' while writing.

From the screenshot, it looks like you have 157 million files in the source location. Can you please try dividing the files by prefix so that small micro-batches can be processed in parallel?

Also try the maxFilesPerTrigger option to restrict the number of files per batch.
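For reference, a minimal PySpark sketch of that option (the path is a placeholder, and running the stream needs a live Spark session, so it is shown as a commented notebook snippet):

```python
# Sketch: cap the number of new files picked up per micro-batch.
# In a Databricks notebook with an active SparkSession this would be:
#
#   df = (spark.readStream
#           .format("delta")
#           .option("maxFilesPerTrigger", max_files_per_trigger)
#           .load("/mnt/source/path"))   # placeholder path
#
# The option takes a positive integer: the maximum number of new files
# each micro-batch will read.
max_files_per_trigger = 1000
assert isinstance(max_files_per_trigger, int) and max_files_per_trigger > 0
```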

Muhammed
New Contributor III

@kulkpd 

Where did you get the info related to the 157 million files? If possible, could you please explain?
