Parallel read of many delta tables
12-20-2023 04:48 AM
I need to read many Delta tables in Azure object storage (block blobs). There is no single root Delta table, but rather many fragmented Delta tables that share a common schema but not a common path.
Iterating over the paths with a for loop performs very poorly because the list of paths is long and the operation isn't parallelized.
The final result should be a union of all Delta tables read from a list of paths. The issue is that I cannot pass a list of paths directly to spark.read.load, because this raises the exception:
Databricks Delta does not support multiple input paths in the load() API. To build a single DataFrame by loading multiple paths from the same Delta table, please load the root path of the Delta table with the corresponding partition filters. If the multiple paths are from different Delta tables, please use Dataset's union()/unionByName() APIs to combine the DataFrames generated by separate load() API calls.
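For context, this is roughly the sequential approach I'm trying to avoid (a minimal sketch; the paths are placeholders and `spark` is the notebook session):

```python
from functools import reduce

# Placeholder list of Delta table paths that share a common schema
paths = [
    "abfss://container@account.dfs.core.windows.net/tables/t1",
    "abfss://container@account.dfs.core.windows.net/tables/t2",
]

# One load() per path, executed one after another, then combined with unionByName()
dfs = [spark.read.format("delta").load(p) for p in paths]
result = reduce(lambda a, b: a.unionByName(b), dfs)
```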
- Labels:
  - Delta Lake
  - Spark
05-27-2024 01:00 PM - edited 05-27-2024 01:01 PM
Hello @leobocci ,
To read multiple Delta tables, multiple read operations are required. You can trigger those reads concurrently through Job Workflows, DLT, the Databricks CLI, DBSQL, interactive clusters, and other resources.
If the problem is the performance of listing the table paths, I'm afraid there is nothing we can do to improve filesystem read/list performance, as those operations are not fully managed by Databricks.
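Within a single notebook or job, one common pattern (a sketch of my own, not one of the options above) is to submit the separate load() calls from driver threads and combine the results with unionByName(), as the error message suggests. Assuming `paths` is your list of table paths and `spark` is the session:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def read_delta(path):
    # Each load() reads that table's Delta transaction log and returns a DataFrame
    return spark.read.format("delta").load(path)

# Submit the reads from several driver threads so the per-table metadata work overlaps;
# max_workers is an arbitrary illustrative value
with ThreadPoolExecutor(max_workers=8) as pool:
    dfs = list(pool.map(read_delta, paths))

# Combine the per-path DataFrames into a single result
result = reduce(lambda a, b: a.unionByName(b), dfs)
```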
Raphael Balogo
Sr. Technical Solutions Engineer
Databricks