Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Exposing RAW files using read_files-based views, partition discovery and skipping, performance issue

KrzysztofPrzyso
New Contributor III

Hi,
As a formal requirement of my project I need to keep the original, raw files (mainly CSVs and XMLs) on the lake. Later on they are ingested into Delta-backed medallion stages: bronze, silver, gold, etc.

From the audit, operations and discovery point of view it would make sense to expose these files for consumption (view only) using SQL.

As my system is batch only, the raw files are stored in Hive-style partition folders: `/data_feed/batch_id=<batch_id_value>/`

Partition schema discovery is supported as per this document: read_files table-valued function - Azure Databricks - Databricks SQL | Microsoft Learn

I have created the view as follows:

CSV

create view catalog.schema.view_csv_name
as
select *, _metadata.*
from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/',
  format => 'csv',
  multiLine => true,
  pathGlobFilter => '*.csv'
);

XML

create view catalog.schema.view_xml_name
as
select *, _metadata.*
from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/',
  rowTag => 'tag:tag_name'
);

Both of the above work, but there are performance issues. Everything runs on a serverless SQL warehouse.
The batch_id partition is correctly discovered and shown as a column in the query result.

1. When creating a view, ALL folders (potentially terabytes of data) are scanned, so each change on deployment takes a long time.

2. When querying the data in a typical scenario, retrieving data from a single folder, the performance is not satisfactory. It seems that the typical Spark PARTITION SKIPPING is NOT WORKING here and that all the folders are being scanned.

select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;

It works quite well when pointing directly to the folder:

select * from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/batch_id=iso_country.2024-04-03T111458.255Z/',
  format => 'csv',
  header => true
);

but of course it is impractical to expose each folder as a separate view.
On the other hand, it can sometimes be too much to ask each analyst to always supply all the settings.

Please advise if I can make any changes to improve performance.

I have tried external tables, but XML is currently not supported there, and I really want to keep this read-only anyway.

@Hubert-Dudek Any thoughts?


2 REPLIES

Kaniz_Fatma
Community Manager

Hi @KrzysztofPrzyso, It sounds like you’re dealing with an interesting challenge related to performance and data organization in your Azure Databricks environment.

Let’s break down the issues you’ve mentioned and explore potential solutions:

  1. Scanning All Folders During View Creation: When creating a view using the read_files table-valued function, it scans all folders within the specified path. This can be time-consuming, especially if you have terabytes of data. To address this issue, consider the following approaches:

    • Incremental View Creation: Instead of recreating the entire view each time, consider creating an incremental process. For example, you could create a view for new data only (based on timestamps or other criteria) and then periodically update it with the latest data.

    • Metadata Caching: Cache the metadata (such as folder structure and file locations) to avoid rescanning all folders during view creation. You can use Databricks’ caching mechanisms to achieve this.

  2. Performance Issues with Querying Data: You’ve noticed that querying data from a single folder doesn’t perform well, and partition skipping doesn’t seem to work as expected. Here are some suggestions:

    • Partition Pruning Optimization:

      • Ensure that your data is properly partitioned. If the batch_id is a partition column, Databricks should automatically skip irrelevant partitions during queries.
      • Verify that the batch_id column is correctly indexed and used for partition pruning. You can check the query execution plan to see if partition pruning is happening as expected (see the EXPLAIN sketch after this list).
    • Data Skew Handling:

      • If certain partitions have significantly more data than others (data skew), it can impact performance. Consider optimizing the data distribution to avoid skew.
      • Use the OPTIMIZE command periodically on Delta tables to compact small files and improve query performance (a sketch follows after this list).
    • File Format Considerations:

      • The choice of file format (CSV or XML) can impact performance. Parquet is often more efficient for columnar storage and query performance.
      • If possible, convert your data to Parquet format and create views based on Parquet files.
    • View Materialization:

      • Instead of creating views directly on raw data, consider materializing intermediate views (e.g., bronze layer) with optimized data structures (like Parquet) and then create additional views (silver, gold) on top of these materialized views (see the sketch after this list).
    • Query Optimization:

      • Review your SQL queries for any potential bottlenecks. Ensure that you’re using appropriate filters, aggregations, and joins.
      • Use Databricks’ query profiling tools to identify performance bottlenecks.
  3. Balancing Convenience and Performance:

    • Exposing each folder as a separate view may not be practical, as you’ve mentioned. However, consider creating a higher-level view that combines relevant folders based on common criteria (e.g., date ranges, data sources).
    • For analysts, provide documentation or scripts to simplify their interactions with the views. You could create parameterized views that allow them to input specific settings (e.g., batch ID) dynamically (a table-function sketch follows after this list).
  4. XML Data Support:

    • Since external tables don’t support XML, you’re right to keep the XML data in read-only views. Consider using a combination of views (CSV and XML) to meet your requirements.
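
A few hedged sketches of the points above. First, on checking the query execution plan: a minimal way to do this from the SQL editor, reusing the view and batch_id value from the original post, is an EXPLAIN over the same query.

-- Show the plan instead of running the query; look for the batch_id predicate
-- being applied as a partition filter rather than as a plain filter after the file scan.
explain formatted
select * from catalog.schema.view_csv_name
where batch_id = 'eda7b3eb7011.2024-04-03T111458.255Z';

If the predicate only appears as an ordinary Filter after the scan, partition pruning is not being applied to the read_files view.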
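
For the materialization and OPTIMIZE suggestions, a rough sketch, assuming a scheduled copy of the raw CSVs into a Delta bronze table is acceptable alongside the untouched raw files; the table name bronze_feed is made up, the path is the one from the original post, and Delta stores its data as Parquet under the hood.

-- Hypothetical bronze table materialized from the raw CSVs, partitioned by batch_id
-- (batch_id comes from the discovered folder partition).
create or replace table catalog.schema.bronze_feed
partitioned by (batch_id)
as
select *, _metadata.file_path as source_file
from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/',
  format => 'csv',
  header => true
);

-- Periodic compaction of small files, per the OPTIMIZE suggestion above.
optimize catalog.schema.bronze_feed;

Queries that filter on batch_id against this Delta table get ordinary partition pruning, and the raw files themselves remain read-only.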
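
On "parameterized views": views in Databricks SQL do not take parameters, but a SQL table function can play that role for analysts. A minimal sketch with a made-up function name; note it simply wraps the existing view, so it improves ergonomics but does not by itself fix the pruning issue.

-- Hypothetical wrapper so analysts only supply the batch id.
create or replace function catalog.schema.feed_for_batch(p_batch_id string)
returns table
return select * from catalog.schema.view_csv_name where batch_id = p_batch_id;

-- Usage:
select * from catalog.schema.feed_for_batch('eda7b3eb7011.2024-04-03T111458.255Z');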

In summary, optimize your view creation process, ensure proper partitioning, choose an efficient file format, and strike a balance between convenience and performance. Regularly monitor and fine-tune your solution based on query patterns and data growth.

Remember that performance tuning can be iterative, so experiment with different approaches and measure the impact on query execution times. Good luck with your project! 😊

Hi @Kaniz_Fatma ,
Many thanks for your answer.
I would like to focus on the main problems at hand here. Please note that we are not discussing external tables based on CSV/XML files, nor Delta tables, just the read_files functionality.


The main issue is the partition skipping when using read_files.

When reading the view, the partitions are discovered correctly, but NOT skipped.
This naturally leads to suboptimal performance.
All the folders in the location the read_files query points to are in the Spark partition format `/data_feed/batch_id=<batch_id_value>/`, i.e.: `batch_id=eda7b3eb7011.2024-04-03T111458.255Z`

There is not much optimisation I can do on the query side; the following code is as simple as it gets:

select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;

When looking at the query profile, it seems that the process gets stuck at "Optimizing query & pruning files".

[Screenshot: query profile stuck at "Optimizing query & pruning files" (KrzysztofPrzyso_0-1715101700962.png)]

I always cancel it after ~20 minutes.
Maybe it is a bug? Could you check with the product team whether partition skipping should actually be working here?
