topic Re: Filtering files for query in Data Engineering

Filtering files for query

Muhammed — Tue, 21 Nov 2023 14:13:24 GMT

Hi Team,

While writing my data to a datalake table I am getting 'filtering files for query', and the job gets stuck at the write step.

How can I resolve this issue?

Re: Filtering files for query

pgruetter — Tue, 21 Nov 2023 14:51:40 GMT

Can you give some more details? Are you doing merge statements? How big are the tables?

For merge statements, for example, the process needs to read the target table to work out which parquet files need to be rewritten. If you don't have proper partitioning or a z-index, it could end up scanning all files even if you only try to update a few rows.

Did you try to optimize the tables already?
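
To illustrate the point about merges, here is a minimal sketch. It assumes a Delta target partitioned by a hypothetical `event_date` column and matched on a hypothetical `id` column; the table names are also made up. The idea is that a literal partition predicate in the ON clause lets the merge skip files outside those partitions instead of scanning the whole target.

```python
def build_merge_sql(target, source, key_cols, partition_col, partition_values):
    """Build a MERGE statement whose ON clause restricts the target partitions."""
    if not partition_values:
        raise ValueError("need at least one partition value to prune on")
    # Join condition on the business keys (hypothetical column names).
    keys = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    # Literal values in the predicate are what allow files to be pruned.
    values = ", ".join(
        "'" + str(v).replace("'", "''") + "'"
        for v in sorted(set(partition_values))
    )
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {keys} AND t.{partition_col} IN ({values})\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_merge_sql(
    "datalake.events", "updates", ["id"], "event_date",
    ["2023-11-21", "2023-11-20"],
)
print(sql)
# On Databricks the statement would be executed with: spark.sql(sql)
```

The partition values would normally be collected from the incoming batch before the merge, so the predicate only names partitions that can actually contain matches.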

Re: Filtering files for query

Muhammed — Tue, 21 Nov 2023 15:47:36 GMT

Thanks for the quick reply,

I am using SSMS as my redacted table, and it uses upsert as the write mode. That table is huge in size. When I checked the SQL part in Databricks, it is reading every record into memory.

Re: Filtering files for query

Muhammed — Tue, 21 Nov 2023 16:01:43 GMT

Re: Filtering files for query

Muhammed — Tue, 21 Nov 2023 18:13:20 GMT

@pgruetter
could you please check above?

Re: Filtering files for query

pgruetter — Tue, 21 Nov 2023 18:33:26 GMT

Still hard to say, but it sounds like my assumption is correct. Because your upsert doesn't know which records to update, it needs to scan everything. Make sure that the table is properly partitioned and has a z-index, and run an optimize on the table.
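
As a rough sketch of the optimize step mentioned above: the helper below only assembles a Delta `OPTIMIZE ... ZORDER BY` statement. The table name `datalake.events` and the column `id` are placeholders, and the right z-order column is an assumption; it would usually be the column the upsert matches on.

```python
def build_optimize_sql(table, zorder_cols=(), where=None):
    """Build an OPTIMIZE statement, optionally limited by a partition predicate."""
    sql = f"OPTIMIZE {table}"
    if where:
        # Restricts compaction to matching partitions only.
        sql += f" WHERE {where}"
    if zorder_cols:
        # Co-locates rows with similar values so file skipping works better.
        sql += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return sql


stmt = build_optimize_sql("datalake.events", zorder_cols=["id"])
print(stmt)
# On Databricks the statement would be executed with: spark.sql(stmt)
```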

Re: Filtering files for query

Muhammed — Thu, 23 Nov 2023 07:37:34 GMT

Hi Kaniz,
I have one more issue. I am writing fewer than 1.2k records to the datalake table (append mode). While writing, it shows 'determining dbio file fragments this would take some time', and when I checked the log I see GC allocation failure.
My overall execution time is 20 mins, which is hard for me. How can I resolve this? Is it mandatory to use Vacuum and Analyze queries along with Optimize?
Shall I run optimize datalake.table?

Re: Filtering files for query

Muhammed — Thu, 23 Nov 2023 07:39:43 GMT

Re: Filtering files for query

Muhammed — Sat, 09 Dec 2023 16:09:32 GMT

@Retired_mod

We are using a framework for data ingestion. I hope this will not cause any issues with the metadata of the datalake table? According to the framework, the table's metadata is crucial, and any change to it will affect the system.

Sometimes this particular pipeline takes 2 hrs just to write 1k records.

Re: Filtering files for query

Muhammed — Mon, 11 Dec 2023 17:38:13 GMT

Hi @Retired_mod

Any info on this ?

Re: Filtering files for query

kulkpd — Wed, 13 Dec 2023 02:23:13 GMT

@Muhammed describe <table_name> will give you an idea of how your table is partitioned. Consider adding a condition on the partition column in the where clause for better performance.
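
A small sketch of that check, assuming a Delta table and an existing `spark` session passed in by the caller. `DESCRIBE DETAIL` on a Delta table returns a row that includes a `partitionColumns` field; the table name and filter column in the usage comment are hypothetical.

```python
def partition_columns(spark, table):
    """Return the partition columns Delta reports for `table`."""
    row = spark.sql(f"DESCRIBE DETAIL {table}").collect()[0]
    return list(row["partitionColumns"])


def filtered_select(table, partition_col, value):
    """Build a query that filters on the partition column so files can be skipped."""
    literal = "'" + str(value).replace("'", "''") + "'"
    return f"SELECT * FROM {table} WHERE {partition_col} = {literal}"


# Usage on Databricks (names are placeholders):
#   cols = partition_columns(spark, "datalake.events")
#   df = spark.sql(filtered_select("datalake.events", cols[0], "2023-11-21"))
```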

Re: Filtering files for query

kulkpd — Wed, 13 Dec 2023 02:33:10 GMT

I understand you are getting 'filtering files for query' while writing.

From the screenshot it looks like you have 157 million files in the source location. Can you please try dividing the files by prefix so that small microbatches can be processed in parallel?

Try the maxFilesPerTrigger option to restrict the number of files per batch.
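
A hedged sketch of how that option could be passed, assuming the data is read as a Structured Streaming source. The thread does not say which source is in use, so the helper covers two cases: a Delta source, which takes `maxFilesPerTrigger`, and Auto Loader, which takes `cloudFiles.maxFilesPerTrigger`. The limit of 1000 and the path in the usage comment are arbitrary placeholders.

```python
def rate_limit_options(source_kind, max_files):
    """Return reader options that cap how many files each micro-batch picks up."""
    if max_files <= 0:
        raise ValueError("max_files must be positive")
    if source_kind == "delta":
        return {"maxFilesPerTrigger": str(max_files)}
    if source_kind == "cloudFiles":
        # Auto Loader options carry the "cloudFiles." prefix.
        return {"cloudFiles.maxFilesPerTrigger": str(max_files)}
    raise ValueError(f"unsupported source kind: {source_kind}")


opts = rate_limit_options("delta", 1000)
print(opts)
# Usage on Databricks (path is a placeholder):
#   stream = spark.readStream.format("delta").options(**opts).load("/path/to/source")
# For Auto Loader, "cloudFiles.format" must also be set on the reader.
```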

Re: Filtering files for query

Muhammed — Thu, 14 Dec 2023 12:26:28 GMT

@kulkpd

Where did you get the info about 157 million files? If possible, could you please explain it?

Re: Filtering files for query

kulkpd — Thu, 14 Dec 2023 18:31:14 GMT

My bad, I saw that somewhere in the screenshot but am not able to find it now.
Which source are you using to load the data: delta table, aws-s3, or azure-storage?