Databricks Community

sensanjoy · ‎01-20-2025

Hi All,

We are facing some performance issue and I need your help to know what could be the best approach to follow here.

Existing:

For each region, we have view(Reg1_View,Reg2_View..) to pull data from table( we don't have direct access to table).
And then we are combining data from all view into another view named All_Reg_View(Reg1_View union all Reg2_View..).
Using the combining view All_Reg_View in join condition with other view in notebook and create target delta table.

All views are constructed same way and we are dealing with billions of data.

Proposed approach:

Create one staging table(truncate & load) replacing the combined view(All_Reg_View) by performing union of all views.
Partition the stg table based on region.
Use the stg tables in joining to create target delta table.

Please let me know your thought.

Alberto_Umana · ‎01-20-2025

Hi @sensanjoy,

Could you please advise what kind of performance issue are you facing? Is it slowness while querying the views?

sensanjoy · ‎01-20-2025

@Alberto_Umana Yes, during query execution and joining back with other views resulting more time. We are building target table by performing joining between views created like this way(combining multiple region views).

fmadeiro · ‎01-20-2025

To address the performance challenges associated with All_Reg_View, I propose replacing it with a partitioned staging table. This change will significantly enhance query efficiency, simplify execution plans, and prepare the pipeline for future scalability.

Proposed Solution

Create a Partitioned Staging Table
- Truncate and Load: Materialize data from regional views into a dedicated staging table. Use a TRUNCATE operation followed by a fresh load to ensure efficient data refresh.
- Partitioning: Partition the staging table by region to enable faster and more efficient filtering during queries.
Replace All_Reg_View in Joins
- Utilize the staging table for downstream joins. This eliminates runtime union operations by pre-aggregating data during the staging process, simplifying execution plans and reducing query overhead.

This solution ensures more robust and efficient data processing while supporting the pipeline's long-term goals.

sensanjoy · ‎01-21-2025

Thanks @fmadeiro . We are doing union all between all region like Reg1_View union all Reg2_View..Regn_View to get the combined dataset into stg table. Is there any other efficient way to do it!!

SharathAbh93 · ‎01-21-2025

Does any table hold data of all region
1. if yes. Get a Materialized view created (replacing all_reg_view)

2. i see you already tried creating a staging table replacing the all_reg_view.
Try creating cluster key along with partition.

Cluster key on the region (joining columns) and partition on some date column if available.

3. Analyze the other views for performance which you are joining with All_reg_view.

SA