topic Re: Performance issue when reading data from a view. in Data Engineering

Performance issue when reading data from a view.

sensanjoy — Mon, 20 Jan 2025 15:49:36 GMT

Hi All,

We are facing some performance issue and I need your help to know what could be the best approach to follow here.

Existing:

For each region, we have view(Reg1_View,Reg2_View..) to pull data from table( we don't have direct access to table).
And then we are combining data from all view into another view named All_Reg_View(Reg1_View union all Reg2_View..).
Using the combining view All_Reg_View in join condition with other view in notebook and create target delta table.

All views are constructed same way and we are dealing with billions of data.

Proposed approach:

Create one staging table(truncate & load) replacing the combined view(All_Reg_View) by performing union of all views.
Partition the stg table based on region.
Use the stg tables in joining to create target delta table.

Please let me know your thought.

Re: Performance issue when reading data from a view.

Alberto_Umana — Mon, 20 Jan 2025 16:03:07 GMT

Hi @sensanjoy,

Could you please advise what kind of performance issue are you facing? Is it slowness while querying the views?

Re: Performance issue when reading data from a view.

sensanjoy — Mon, 20 Jan 2025 16:30:36 GMT

@Alberto_Umana Yes, during query execution and joining back with other views resulting more time. We are building target table by performing joining between views created like this way(combining multiple region views).

Re: Performance issue when reading data from a view.

fmadeiro — Mon, 20 Jan 2025 18:52:48 GMT

To address the performance challenges associated with All_Reg_View, I propose replacing it with a partitioned staging table. This change will significantly enhance query efficiency, simplify execution plans, and prepare the pipeline for future scalability.

Proposed Solution

Create a Partitioned Staging Table
- Truncate and Load: Materialize data from regional views into a dedicated staging table. Use a TRUNCATE operation followed by a fresh load to ensure efficient data refresh.
- Partitioning: Partition the staging table by region to enable faster and more efficient filtering during queries.
Replace All_Reg_View in Joins
- Utilize the staging table for downstream joins. This eliminates runtime union operations by pre-aggregating data during the staging process, simplifying execution plans and reducing query overhead.

This solution ensures more robust and efficient data processing while supporting the pipeline's long-term goals.

Re: Performance issue when reading data from a view.

sensanjoy — Tue, 21 Jan 2025 17:44:17 GMT

Thanks @fmadeiro . We are doing union all between all region like Reg1_View union all Reg2_View..Regn_View to get the combined dataset into stg table. Is there any other efficient way to do it!!

Re: Performance issue when reading data from a view.

SharathAbh93 — Tue, 21 Jan 2025 19:29:02 GMT

Does any table hold data of all region
1. if yes. Get a Materialized view created (replacing all_reg_view)

2. i see you already tried creating a staging table replacing the all_reg_view.
Try creating cluster key along with partition.

Cluster key on the region (joining columns) and partition on some date column if available.

3. Analyze the other views for performance which you are joining with All_reg_view.

Re: Performance issue when reading data from a view.

sensanjoy — Wed, 22 Jan 2025 17:46:27 GMT

We dont have access to the table level directly . Its only region level views are accessible and we union them to get combine data.