Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Filtering files for query

Muhammed
New Contributor III

Hi Team,

While writing my data to a Data Lake table, the job gets stuck at the write stage showing 'Filtering files for query'.

How can I resolve this issue?

13 REPLIES

pgruetter
Contributor

Can you give some more details? Are you doing merge statements? How big are the tables?

For a merge statement, the process needs to read the target table to determine which Parquet files have to be rewritten. If you don't have proper partitioning or Z-ordering, it can end up scanning all files even if you only try to update a few rows.

Did you try to optimize the tables already?
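As a rough sketch of the partition-pruning idea (all names here are placeholders, not from this thread: a target table events partitioned by event_date, with incoming changes registered as a temp view updates), including the partition column in the match condition lets Delta skip files outside the touched partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks the notebook already provides 'spark'

# Placeholder names: target table 'events' partitioned by event_date,
# incoming changes registered as a temp view 'updates'.
spark.sql("""
    MERGE INTO events AS t
    USING updates AS s
    ON  t.id = s.id
    AND t.event_date = s.event_date   -- partition column in the match condition enables file pruning
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")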

Muhammed
New Contributor III

Thanks for the quick reply.

I am using SSMS as my redacted table, and it uses upsert as the write mode. That table is huge in size; when I checked the SQL part in Databricks, it is reading every record into memory.

Muhammed
New Contributor III

[screenshot attachment]

Muhammed
New Contributor III

@pgruetter 
Could you please check the above?

@Muhammed  DESCRIBE <table_name> will give you an idea of how your table is partitioned. Consider adding a partition-column condition in the WHERE clause for better performance.
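For example, a quick sketch assuming a placeholder table name datalake.events and a hypothetical partition column event_date:

# 'spark' is the SparkSession already available in a Databricks notebook.
spark.sql("DESCRIBE EXTENDED datalake.events").show(truncate=False)  # columns plus any partition columns
spark.sql("DESCRIBE DETAIL datalake.events").show(truncate=False)    # Delta details: partitionColumns, numFiles, sizeInBytes

# Filtering on the partition column lets Delta skip files that cannot match.
spark.sql("SELECT * FROM datalake.events WHERE event_date = '2024-01-01'")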

pgruetter
Contributor

Still hard to say, but it sounds like my assumption is correct. Because your upsert doesn't know which records to update, it needs to scan everything. Make sure the table is properly partitioned, that you have Z-ordering on the right columns, and run OPTIMIZE on the table.
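A minimal sketch of what that could look like, assuming a placeholder table datalake.events and id as the column used in the upsert condition:

# 'spark' is the SparkSession already available in a Databricks notebook.
# Compacts small files and co-locates rows by the Z-order column.
spark.sql("OPTIMIZE datalake.events ZORDER BY (id)")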

Muhammed
New Contributor III

Hi Kaniz,
I have one more issue. I am writing fewer than 1.2k records to the Data Lake table (append mode). While writing, it shows 'determining dbio file fragments this would take some time', and when I checked the log I see a GC allocation failure.
My overall execution time is 20 minutes, which is hard for me. How can I resolve this? Is it mandatory to run VACUUM and ANALYZE queries along with OPTIMIZE?
Shall I run OPTIMIZE datalake.table?
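For reference, a sketch of the three maintenance commands with the placeholder table name datalake.events. VACUUM and ANALYZE are not required for OPTIMIZE to work; they address storage cleanup and statistics respectively:

# 'spark' is the SparkSession already available in a Databricks notebook.

# OPTIMIZE compacts many small files into fewer, larger ones.
spark.sql("OPTIMIZE datalake.events")

# VACUUM removes data files no longer referenced by the table (default 7-day retention).
# It is not required for write performance; it reclaims storage.
spark.sql("VACUUM datalake.events")

# ANALYZE collects table statistics that the query optimizer can use.
spark.sql("ANALYZE TABLE datalake.events COMPUTE STATISTICS")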

 



Muhammed
New Contributor III

[screenshot attachment]

Muhammed
New Contributor III

@Retired_mod 

We are using a framework for data ingestion; I hope this will not cause any issues with the metadata of the Data Lake table? According to the framework, the table's metadata is crucial, and any change to it will affect the system.

Sometimes the pipeline takes 2 hours just to write 1k records.

Muhammed
New Contributor III

Hi @Retired_mod 

Any info on this?

kulkpd
Contributor

I understand you are getting 'filtering files for query' while writing.

From the screenshot it looks like you have 157 million files in the source location. Can you please try dividing the files by prefix so that small micro-batches can be processed in parallel?

Try using the maxFilesPerTrigger option to restrict the number of files per batch.
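A minimal sketch of the maxFilesPerTrigger option on a streaming read from a Delta source; the paths, checkpoint location, and target table name are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks the notebook already provides 'spark'

# Cap the number of files picked up per micro-batch so each batch stays small.
stream_df = (
    spark.readStream
         .format("delta")
         .option("maxFilesPerTrigger", 1000)      # tune the cap to your cluster size
         .load("/mnt/source/path")                # placeholder source path
)

(stream_df.writeStream
          .format("delta")
          .option("checkpointLocation", "/mnt/checkpoints/example")  # placeholder
          .toTable("datalake.events"))            # placeholder target table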

Muhammed
New Contributor III

@kulkpd 

Where did you get the info about 157 million files? If possible, could you please explain it?

kulkpd
Contributor

My bad, I saw that somewhere in the screenshot but I'm not able to find it now.
Which source are you using to load the data: a Delta table, AWS S3, or Azure Storage?
