Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Exposing RAW files using read_files-based views, partition discovery and skipping, performance issue

KrzysztofPrzyso
New Contributor III

Hi,
As a formal requirement of my project I need to keep the original, raw files (mainly CSVs and XMLs) on the lake. Later on they are ingested into Delta-backed medallion stages: bronze, silver, gold, etc.

From the audit, operations and discovery point of view it would make sense to expose these files for consumption (view only) using SQL.

As my system is batch only, the raw files are stored in Hive-style partition folders: `/data_feed/batch_id=<batch_id_value>/`

Partition schema discovery is supported as per this document: read_files table-valued function - Azure Databricks - Databricks SQL | Microsoft Learn

I have created the view as follows:

CSV

create view catalog.schema.view_csv_name
as
select *, _metadata.*
from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/',
  format => 'csv',
  multiLine => true,
  pathGlobFilter => '*.csv'
);

XML

create view catalog.schema.view_xml_name
as
select *, _metadata.*
from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/',
  rowTag => 'tag:tag_name'
);

Both of the above work, but there are performance issues. Everything runs on a serverless SQL warehouse.
The batch_id partition is correctly discovered and shown as a column in the query result.

1. When creating a view, ALL folders (potentially terabytes of data) are scanned, so each change on deployment takes a long time.

2. When querying the data in a typical scenario, retrieving data from a single folder, the performance is not satisfactory. It seems that the typical Spark PARTITION SKIPPING is NOT WORKING here and that all the folders are being scanned.

select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;

It works quite well when pointing directly to the folder:

select * from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/batch_id=iso_country.2024-04-03T111458.255Z/',
  format => 'csv',
  header => true
);

but of course it is impractical to expose each folder as a separate view.
On the other hand, it can sometimes be too much to ask each analyst to always supply all the settings.

Please advise if I can make any changes to improve performance.

I have tried external tables, but XML is currently not supported there, and I really want to keep this read-only anyway.

@Hubert-Dudek Any thoughts?


2 REPLIES

Kaniz_Fatma
Community Manager

Hi @KrzysztofPrzyso, It sounds like you’re dealing with an interesting challenge related to performance and data organization in your Azure Databricks environment.

Let’s break down the issues you’ve mentioned and explore potential solutions:

  1. Scanning All Folders During View Creation: When creating a view using the read_files table-valued function, it scans all folders within the specified path. This can be time-consuming, especially if you have terabytes of data. To address this issue, consider the following approaches:

    • Incremental View Creation: Instead of recreating the entire view each time, consider creating an incremental process. For example, you could create a view for new data only (based on timestamps or other criteria) and then periodically update it with the latest data.

    • Metadata Caching: Cache the metadata (such as folder structure and file locations) to avoid rescanning all folders during view creation. You can use Databricks’ caching mechanisms to achieve this.

  2. Performance Issues with Querying Data: You’ve noticed that querying data from a single folder doesn’t perform well, and partition skipping doesn’t seem to work as expected. Here are some suggestions:

    • Partition Pruning Optimization:

      • Ensure that your data is properly partitioned. If the batch_id is a partition column, Databricks should automatically skip irrelevant partitions during queries.
      • Verify that the batch_id column is correctly indexed and used for partition pruning. You can check the query execution plan to see if partition pruning is happening as expected (see the EXPLAIN sketch after this list).
    • Data Skew Handling:

      • If certain partitions have significantly more data than others (data skew), it can impact performance. Consider optimizing the data distribution to avoid skew.
      • Use the OPTIMIZE command periodically on Delta tables to compact small files and improve query performance (a sketch follows after this list).
    • File Format Considerations:

      • The choice of file format (CSV or XML) can impact performance. Parquet is often more efficient for columnar storage and query performance.
      • If possible, convert your data to Parquet format and create views based on Parquet files.
    • View Materialization:

      • Instead of creating views directly on raw data, consider materializing intermediate views (e.g., bronze layer) with optimized data structures (like Parquet) and then create additional views (silver, gold) on top of these materialized views (see the sketch after this list).
    • Query Optimization:

      • Review your SQL queries for any potential bottlenecks. Ensure that you’re using appropriate filters, aggregations, and joins.
      • Use Databricks’ query profiling tools to identify performance bottlenecks.
  3. Balancing Convenience and Performance:

    • Exposing each folder as a separate view may not be practical, as you’ve mentioned. However, consider creating a higher-level view that combines relevant folders based on common criteria (e.g., date ranges, data sources).
    • For analysts, provide documentation or scripts to simplify their interactions with the views. You could create parameterized views that allow them to input specific settings (e.g., batch ID) dynamically (a table-function sketch follows after this list).
  4. XML Data Support:

    • Since external tables don’t support XML, you’re right to keep the XML data in read-only views. Consider using a combination of views (CSV and XML) to meet your requirements.
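
A few hedged sketches of the points above. First, on checking the query execution plan: a minimal way to do this from the SQL editor, reusing the view and batch_id value from the original post, is an EXPLAIN over the same query.

-- Show the plan instead of running the query; look for the batch_id predicate
-- being applied as a partition filter rather than as a plain filter after the file scan.
explain formatted
select * from catalog.schema.view_csv_name
where batch_id = 'eda7b3eb7011.2024-04-03T111458.255Z';

If the predicate only appears as an ordinary Filter after the scan, partition pruning is not being applied to the read_files view.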
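
For the materialization and OPTIMIZE suggestions, a rough sketch, assuming a scheduled copy of the raw CSVs into a Delta bronze table is acceptable alongside the untouched raw files; the table name bronze_feed is made up, the path is the one from the original post, and Delta stores its data as Parquet under the hood.

-- Hypothetical bronze table materialized from the raw CSVs, partitioned by batch_id
-- (batch_id comes from the discovered folder partition).
create or replace table catalog.schema.bronze_feed
partitioned by (batch_id)
as
select *, _metadata.file_path as source_file
from read_files(
  'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/',
  format => 'csv',
  header => true
);

-- Periodic compaction of small files, per the OPTIMIZE suggestion above.
optimize catalog.schema.bronze_feed;

Queries that filter on batch_id against this Delta table get ordinary partition pruning, and the raw files themselves remain read-only.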
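
On "parameterized views": views in Databricks SQL do not take parameters, but a SQL table function can play that role for analysts. A minimal sketch with a made-up function name; note it simply wraps the existing view, so it improves ergonomics but does not by itself fix the pruning issue.

-- Hypothetical wrapper so analysts only supply the batch id.
create or replace function catalog.schema.feed_for_batch(p_batch_id string)
returns table
return select * from catalog.schema.view_csv_name where batch_id = p_batch_id;

-- Usage:
select * from catalog.schema.feed_for_batch('eda7b3eb7011.2024-04-03T111458.255Z');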

In summary, optimize your view creation process, ensure proper partitioning, choose an efficient file format, and strike a balance between convenience and performance. Regularly monitor and fine-tune your solution based on query patterns and data growth.

Remember that performance tuning can be iterative, so experiment with different approaches and measure the impact on query execution times. Good luck with your project! 😊

Hi @Kaniz_Fatma ,
Many thanks for your answer.
I would like to focus on the main problems at hand here. Please note that we are not discussing external tables based on CSV/XML files, nor Delta tables, just the read_files functionality.


The main issue is the partition skipping when using read_files.

When reading the view, the partitions are discovered correctly, but NOT skipped.
This naturally leads to suboptimal performance.
All the folders in the location the read_files query points to are in the Spark partition format `/data_feed/batch_id=<batch_id_value>/`, i.e.: `batch_id=eda7b3eb7011.2024-04-03T111458.255Z`

There is not much optimisation I can do on the query side; the following code is as simple as it gets:

select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;

When looking at the query profile, it seems that the process gets stuck at "Optimizing query & pruning files".

[Screenshot: query profile stuck at "Optimizing query & pruning files" (KrzysztofPrzyso_0-1715101700962.png)]

I always cancel it after ~20 minutes.
Maybe it is a bug? Could you check with the product team whether partition skipping should actually be working here?
