Hi,
As a formal requirement in my project I need to keep original, raw (mainly CSVs and XMLs) files on the lake. Later on they are being ingested into Delta format backed medallion stages, bronze, silver, gold etc.
From the audit, operations and discovery point of view it would make sense to expose these files to be consumed (view only) using SQL.
As my system is batch only, the raw files are in stored in hive like folders: `/data_feed/batch_id=<batch_id_value>/`
Partition schema discovery is supported as per this document: read_files table-valued function - Azure Databricks - Databricks SQL | Microsoft Learn
I have created the view as follows:
CSV
create view catalog.schema.view_csv_name
as
select *, _metadata.* from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/', format => 'csv', multiLine => true, pathGlobfilter => "*.csv");
XML
create view catalog.schema.view_xml_name
as
select *, _metadata.* from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/', rowTag => "tag:tag_name" );
Both of the above are working, but there is a performance issues. All using the serveless SQL warehouse.
The batch_id partition is correctly discovered and shown as column in the query result.
1. When creating a view, ALL folders (potentially terabytes of data) is being scanned. So, each change on deployment is taking a long time.
2. When querying the data using a typical scenario, retrieval data from a single folder the performance is not satisfactory. It seems that the typical spark PARTITION SKIPPING is NOT WORKING here. It seems that all the folders are being scanned.
select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;
It works quite well when pointing directly to the folder
select * from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/batch_id=iso_country.2024-04-03T111458.255Z/', format => 'csv', header => true );
but of course it is unpractical to expose each folder as a separate view.
On the other hand sometime it would be to much to ask each Analyst to always input all the settings.
Please advise if I can make any changes to improve performance.
I have tried external tables, but currently XML is not supported there, and I really want to keep it read only.
@Hubert-Dudek Any thoughts?