Filtering files for query

Muhammed
New Contributor III

Hi Team,

While writing my data to a datalake table, I am getting 'Filtering files for query' and the job gets stuck at the write stage.

How can I resolve this issue?


pgruetter
Contributor

Can you give some more details? Are you doing merge statements? How big are the tables?

For merge statements, the process needs to read the target table to determine which Parquet files need to be rewritten. If you don't have proper partitioning or Z-ordering, it can end up scanning all files even if you are only trying to update a few rows.

Did you try to optimize the tables already?
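For reference, a minimal sketch of what that optimization could look like (table and column names here are placeholders, not from this thread):

```sql
-- Compact small files and co-locate rows on the columns used in the merge
-- condition, so a MERGE can skip unrelated files.
-- Table/column names are hypothetical.
OPTIMIZE datalake.target_table
ZORDER BY (customer_id, event_date);
```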

Muhammed
New Contributor III

Thanks for the quick reply.

I am using SSMS as my (redacted) table, and it is using upsert as the write mode. That table is huge in size; when I checked the SQL part in Databricks, it is reading every record into memory.

Muhammed
New Contributor III

[screenshot attachment not captured in this transcript]

Muhammed
New Contributor III

@pgruetter 
Could you please check the above?

@Muhammed  describe <table_name> will give you an idea of how your table is partitioned. Consider adding a partition-column condition in the WHERE clause for better performance.
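As a sketch (the table name is a placeholder), the commands for inspecting partitioning on a Delta table would look like:

```sql
-- Schema plus partition information:
DESCRIBE EXTENDED datalake.target_table;

-- Delta-specific summary: file count, size in bytes, partition columns:
DESCRIBE DETAIL datalake.target_table;
```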

pgruetter
Contributor

Still hard to say, but it sounds like my assumption is correct. Because your upsert doesn't know which records to update, it needs to scan everything. Make sure the table is properly partitioned, that you have a Z-order index, and run OPTIMIZE on the table.
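To illustrate (all names hypothetical, not from this thread), an upsert that includes the partition column in the match condition lets Delta prune files instead of scanning the full target table:

```sql
-- Hypothetical sketch: the predicate on the partition column (event_date)
-- allows file pruning during the target-table scan.
MERGE INTO datalake.target_table AS t
USING staging_updates AS s
  ON  t.event_date  = s.event_date   -- partition column: enables pruning
  AND t.customer_id = s.customer_id  -- business key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```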

Kaniz
Community Manager

Hi @Muhammed , when encountering the issue of being stuck at “Filtering files for query” while writing data to a Databricks Delta table, there are several steps you can take to address this situation:

 

Optimize Your Query:

Check Data Clustering and Partitioning:

  • Partitioning: Verify that your Delta table is properly partitioned. Partitioning helps with data organization and query performance.
  • Clustering: If applicable, use data clustering to group similar data together. This can enhance query efficiency by reducing the amount of data scanned.

Memory and Resource Considerations:

  • Cluster Configuration: Ensure that your Databricks cluster has sufficient resources (memory and cores) to handle the query. You might need to adjust the cluster size or type.
  • Memory/Caching: If the data cannot be loaded into cache quickly enough, it might impact the writing process. Consider increasing memory resources if needed.

Review Your Write Command:

  • Check the syntax of your write command. Make sure it’s correctly specifying the format, mode, overwrite options, and partitioning.

Monitor and Debug:

  • Job Monitoring: Monitor the job execution using Databricks UI. Check the query execution plan and identify any bottlenecks.
  • Logs and Metrics: Inspect logs and metrics to identify any issues related to memory, spills, or resource contention.

Cluster Version and Type:

  • Ensure that your Databricks runtime version and cluster type are compatible with your workload. Consider upgrading if necessary.

Remember that the specific solution may vary based on your dataset, query complexity, and cluster configuration. Experiment with different approaches and monitor the impact on performance. Good luck! 🚀
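For the partitioning and write-command points above, a minimal sketch of a partitioned Delta table definition (all names hypothetical):

```sql
-- Hypothetical partitioned Delta table; pick a partition column with
-- moderate cardinality that matches your query and upsert patterns.
CREATE TABLE IF NOT EXISTS datalake.target_table (
  customer_id BIGINT,
  payload     STRING,
  event_date  DATE
)
USING DELTA
PARTITIONED BY (event_date);
```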

Muhammed
New Contributor III

Hi Kaniz,
I have one more issue. I am writing fewer than 1.2k records to the datalake table (append mode). While writing, it shows "determining dbio file fragments, this would take some time", and when I checked the log I see a GC allocation failure.
My overall execution time is 20 minutes, which is hard for me. How can I resolve this? Is it mandatory to use VACUUM and ANALYZE queries along with OPTIMIZE?
Shall I run OPTIMIZE datalake.table?

 



Kaniz
Community Manager

Hi @Muhammed, The GC (Allocation Failure) error you’re encountering is related to memory allocation within the Java Virtual Machine (JVM). When the JVM can’t allocate memory for an object due to insufficient space in the Young Generation heap, it triggers a Minor GC to free up memory. If that doesn’t suffice, a Full GC is initiated.

 

Here are some steps to address this issue:

 

Memory Configuration:

  • Ensure that your JVM memory settings are appropriately configured. You might need to allocate more memory to avoid frequent garbage collection.
  • Consider adjusting the heap size parameters (-Xms and -Xmx) based on your system’s available memory.

Optimize Your Code:

  • Review your code for any memory leaks or inefficient memory usage patterns.
  • Avoid creating unnecessary objects or holding references longer than needed.

Data Lake Table Optimization:

  • Running OPTIMIZE datalake.table can help optimize the table layout and improve query performance. It reorganizes data files and statistics.
  • Additionally, consider running VACUUM and ANALYZE commands periodically to maintain data quality and optimize storage.

Partitioning and Clustering:

  • If your table has partitions, ensure they are well-defined and aligned with your query patterns.
  • Clustering data based on frequently queried columns can enhance performance.

Compression:

  • Evaluate using compression for your data files. Compressed files consume less storage and can improve read performance.

File Fragmentation:

  • The message “determining dbio file fragments” indicates that the system is analyzing file fragmentation. This process can take time.
  • If possible, pre-allocate space for your data files to minimize fragmentation during writes.

Remember that optimizing performance involves a combination of factors, including memory management, query design, and table maintenance. Experiment with the suggestions above and monitor the impact on execution time.

 

Feel free to run OPTIMIZE datalake.table and observe the results. If you encounter any issues, let us know.
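As a sketch of the maintenance commands mentioned above (the table name comes from the thread; the ordering and retention note are assumptions):

```sql
-- Compact small files:
OPTIMIZE datalake.table;

-- Refresh table-level statistics for the optimizer:
ANALYZE TABLE datalake.table COMPUTE STATISTICS;

-- Remove data files no longer referenced by the table
-- (subject to the default 7-day retention window):
VACUUM datalake.table;
```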

Muhammed
New Contributor III

[screenshot attachment not captured in this transcript]

Muhammed
New Contributor III

@Kaniz 

We are using a framework for data ingestion; I hope this will not cause any issues with the metadata of the datalake table? As per the framework, the table's metadata is crucial, and any change to it will affect the system.

Sometimes the particular pipeline takes 2 hours just to write 1k records.

Muhammed
New Contributor III

Hi @Kaniz 

Any info on this ?

kulkpd
Contributor

I understand you are getting 'filtering files for query' while writing.

From the screenshot, it looks like you have 157 million files in the source location. Can you please try dividing the files by prefix so that small micro-batches can be processed in parallel?

Also try the maxFilesPerTrigger option to restrict the number of files per batch.
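For reference, a minimal PySpark sketch of that option (the path is a placeholder, and running the stream needs a live Spark session, so it is shown as a commented notebook snippet):

```python
# Sketch: cap the number of new files picked up per micro-batch.
# In a Databricks notebook with an active SparkSession this would be:
#
#   df = (spark.readStream
#           .format("delta")
#           .option("maxFilesPerTrigger", max_files_per_trigger)
#           .load("/mnt/source/path"))   # placeholder path
#
# The option takes a positive integer: the maximum number of new files
# each micro-batch will read.
max_files_per_trigger = 1000
assert isinstance(max_files_per_trigger, int) and max_files_per_trigger > 0
```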

Muhammed
New Contributor III

@kulkpd 

Where did you get the info related to the 157 million files? If possible, could you please explain?
