cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Performance issue when reading data from a view.

sensanjoy
Contributor

Hi All,

We are facing some performance issue and I need your help to know what could be the best approach to follow here.

Existing:

  1.  For each region, we have view(Reg1_View,Reg2_View..) to pull data from table( we don't have direct access to table).
  2. And then we are combining data from all view into another view named All_Reg_View(Reg1_View union all Reg2_View..).
  3. Using the combining view All_Reg_View in join condition with other view in notebook and create target delta table.

View_Vs_Staging_Table.png

All views are constructed same way and we are dealing with billions of data.

Proposed approach:

  1. Create one staging table(truncate & load) replacing the combined view(All_Reg_View) by performing union of all views.
  2. Partition the stg table based on region.
  3. Use the stg tables in joining to create target delta table.

Please let me know your thought.

6 REPLIES 6

Alberto_Umana
Databricks Employee
Databricks Employee

Hi @sensanjoy,

Could you please advise what kind of performance issue are you facing? Is it slowness while querying the views?

@Alberto_Umana  Yes, during query execution and joining back with other views resulting more time. We are building target table by performing joining between views created like this way(combining multiple region views).

fmadeiro
New Contributor III

To address the performance challenges associated with All_Reg_View, I propose replacing it with a partitioned staging table. This change will significantly enhance query efficiency, simplify execution plans, and prepare the pipeline for future scalability.

Proposed Solution

  1. Create a Partitioned Staging Table

    • Truncate and Load: Materialize data from regional views into a dedicated staging table. Use a TRUNCATE operation followed by a fresh load to ensure efficient data refresh.
    • Partitioning: Partition the staging table by region to enable faster and more efficient filtering during queries.
  2. Replace All_Reg_View in Joins

    • Utilize the staging table for downstream joins. This eliminates runtime union operations by pre-aggregating data during the staging process, simplifying execution plans and reducing query overhead.

This solution ensures more robust and efficient data processing while supporting the pipeline's long-term goals.

Thanks @fmadeiro . We are doing union all between all region like Reg1_View union all Reg2_View..Regn_View to get the combined dataset into stg table. Is there any other efficient way to do it!! 

SharathAbh93
New Contributor II

Does any table hold data of all region 
1. if yes. Get a Materialized view created (replacing all_reg_view)

2. i see you already tried creating a staging table replacing the all_reg_view. 
Try creating cluster key along with partition.

Cluster key on the region (joining columns) and partition on some date column if available.

3. Analyze the other views for performance which you are joining with All_reg_view.

 

SA

We dont have access to the table level directly . Its only region level views are accessible and we union them to get combine data.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group