cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to avoid iteration/loop in databricks in the given scenario

shri0509
New Contributor II

Hi all, I need your input.

I am new to Databricks and working with a dataset that consists of around 10,000 systems, each containing approximately 100 to 150 parts. These parts have attributes such as name, version, and serial number. The dataset size is around 10 million records.

The task is to retrieve the parts associated with each system for a particular day and compare them with the parts from the previous day. If there are any changes, such as modifications, deletions, or additions, I need to create a union of all these changes and store the result in Databricks. This updated dataset will then be used for comparison with the parts available on the next day, and the process repeats.

Below, I've provided an example of the input and expected output for a particular system.

Currently, I'm iterating over days for each system and calculating the deltas using left anti and inner joins in each iteration. However, this process is taking a considerable amount of time in Databricks.

Is there a more efficient way to handle this without the iterative approach? I would appreciate any suggestions.

Input:

systemlogDatenameversionserial
123456786/24/24name1version1number1
123456786/24/24name2version2number2
123456786/25/24name1version1number1
123456786/25/24name2version2number2
123456786/26/24name1version1number1
123456786/26/24name2version2number2
123456786/26/24name3version3number3
123456786/27/24name1version1number1
123456786/27/24name2version2.1number2
123456786/27/24name3version3number3
123456786/30/24name2version2.1number2
123456786/30/24name3version3number3.1

Output;

systemlogDatenameversionserial
123456786/24/24name1version1number1
123456786/24/24name2version2number2
123456786/26/24name1version1number1
123456786/26/24name2version2number2
123456786/26/24name3version3number3
123456786/27/24name1version1number1
123456786/27/24name2version2.1number2
123456786/27/24name3version3number3
123456786/30/24name1version1number1
123456786/30/24name2version2.1number2
123456786/30/24name3version3number3.1
5 REPLIES 5

shri0509
New Contributor II

HI @ogdendc ,

I have read your your beware of for loop, but in the scenario described above how to tackle it.

Please guide.

shri0509
New Contributor II

Have tried Window function with lag, however the data i have in not sequencial, what i mean is if i have on Jan 1, then next data be on Jan 5, then Jan 20 and so on. With window function i am able to get modified, common however data deleted is getting missed.

shri0509
New Contributor II

@Kaniz_Fatmaany suggestion..

holly
Valued Contributor III
Valued Contributor III

I'm not sure I fully understand the question, you are right to avoid loops as they are very inefficient in spark. Some techniques that might help you:

Self join using date_add

Join the table to itself on system identifiers, but the date-1, ie 

select T1.*, T2.log_date as log_date_yest, T2.serial as serial_yest
from yourTable T1
inner join yourTable T2
on T1.system = T2.system
and T1.log_date = date_add(T2.log_date, -1) 

date add docs: https://docs.databricks.com/en/sql/language-manual/functions/date_add.html

Merge into

If you wanted a table with only the latest values in it, you could use MERGE INTO. Say you wanted a table that only reflected the latest values, the code might look something like:

MERGE INTO yourTable USING newData
  ON yourTable.system = newData.system 
AND yourTable.logDate = newData.logDate
AND yourTable.name = newData.name
WHEN NOT MATCHED THEN INSERT * WHEN MATCHED AND yourTable.version > newData.version THEN UPDATE SET * --you'd have to create logic to define what data is new

Much better examples here:

https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html

Language

These are in SQL, you could rewrite them in python it won't have much impact on the performance. 

Tuning

Hopefully, this will run much faster. As for right sizing your cluster you have two approaches:

  1. learn spark internals. learn detailed VM infrastructure. follow this giant guide https://www.databricks.com/discover/pages/optimize-data-workloads-guide. Time estimate: 100+ hours
  2. use serverless. Time estimate: 5 seconds. 

Hope this is useful! Let us know how you get on. Holly

AnnieWhite
New Contributor II

Thank you so much for the link.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group