Parallel read of many delta tables
12-20-2023 04:48 AM
I need to read many Delta tables in Azure object storage (block blobs). There is no single root Delta table, but rather many fragmented Delta tables that share a common schema but not a common path.
Iterating over the paths with a for loop performs very poorly because the list of paths is long and the operation isn't parallelized.
The final result should be a union of all Delta tables read from a list of paths. The issue is that I cannot pass a list of paths directly to spark.read.load, because this raises the exception:
Databricks Delta does not support multiple input paths in the load() API. To build a single DataFrame by loading multiple paths from the same Delta table, please load the root path of the Delta table with the corresponding partition filters. If the multiple paths are from different Delta tables, please use Dataset's union()/unionByName() APIs to combine the DataFrames generated by separate load() API calls.
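For context, this is roughly the sequential approach I'm trying to avoid (a minimal sketch; the paths are placeholders and `spark` is the notebook session):

```python
from functools import reduce

# Placeholder list of Delta table paths that share a common schema
paths = [
    "abfss://container@account.dfs.core.windows.net/tables/t1",
    "abfss://container@account.dfs.core.windows.net/tables/t2",
]

# One load() per path, executed one after another, then combined with unionByName()
dfs = [spark.read.format("delta").load(p) for p in paths]
result = reduce(lambda a, b: a.unionByName(b), dfs)
```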
- Labels:
  - Delta Lake
  - Spark
05-27-2024 01:00 PM - edited 05-27-2024 01:01 PM
Hello @leobocci ,
To read multiple Delta tables, multiple read operations are required. You can trigger those reads concurrently through Job Workflows, DLT, the Databricks CLI, DBSQL, interactive clusters, and other resources.
If the problem is the performance of listing the table paths, I'm afraid there is nothing we can do to improve filesystem read/list performance, as those operations are not fully managed by Databricks.
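Within a single notebook or job, one common pattern (a sketch of my own, not one of the options above) is to submit the separate load() calls from driver threads and combine the results with unionByName(), as the error message suggests. Assuming `paths` is your list of table paths and `spark` is the session:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def read_delta(path):
    # Each load() reads that table's Delta transaction log and returns a DataFrame
    return spark.read.format("delta").load(path)

# Submit the reads from several driver threads so the per-table metadata work overlaps;
# max_workers is an arbitrary illustrative value
with ThreadPoolExecutor(max_workers=8) as pool:
    dfs = list(pool.map(read_delta, paths))

# Combine the per-path DataFrames into a single result
result = reduce(lambda a, b: a.unionByName(b), dfs)
```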
Raphael Balogo
Sr. Technical Solutions Engineer
Databricks