cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
cancel
Showing results for 
Search instead for 
Did you mean: 

exposing RAW files using read_files based views, partition discovery and skipping, performance issue

KrzysztofPrzyso
New Contributor II

Hi,
As a formal requirement in my project I need to keep original, raw (mainly CSVs and XMLs) files on the lake. Later on they are being ingested into Delta format backed medallion stages, bronze, silver, gold etc.

From the audit, operations and discovery point of view it would make sense to expose these files to be consumed (view only) using SQL.

As my system is batch only, the raw files are in stored in hive like folders: `/data_feed/batch_id=<batch_id_value>/`

Partition schema discovery is supported as per this document: read_files table-valued function - Azure Databricks - Databricks SQL | Microsoft Learn

I have created the view as follows:

CSV

create view catalog.schema.view_csv_name
as
select *, _metadata.* from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/', format => 'csv', multiLine => true, pathGlobfilter => "*.csv");

XML

create view catalog.schema.view_xml_name
as
select *,  _metadata.* from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/', rowTag => "tag:tag_name" );

Both of the above are working, but there is a performance issues. All using the serveless SQL warehouse.
The batch_id partition is correctly discovered and shown as column in the query result.

1. When creating a view, ALL folders (potentially terabytes of data) is being scanned. So, each change on deployment is taking a long time.

2. When querying the data using a typical scenario, retrieval data from a single folder the performance is not satisfactory. It seems that the typical spark PARTITION SKIPPING is NOT WORKING here. It seems that all the folders are being scanned.

select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;

 It works quite well when pointing directly to the folder

select * from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/batch_id=iso_country.2024-04-03T111458.255Z/', format => 'csv', header => true );

but of course it is unpractical to expose each folder as a separate view.
On the other hand sometime it would be to much to ask each Analyst to always input all the settings.

Please advise if I can make any changes to improve performance.

I have tried external tables, but currently XML is not supported there, and I really want to keep it read only.

@Hubert-Dudek Any thoughts?


0 REPLIES 0
Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.