07-31-2024 05:21 AM
Hi all, I need your input.
I am new to Databricks and working with a dataset that consists of around 10,000 systems, each containing approximately 100 to 150 parts. These parts have attributes such as name, version, and serial number. The dataset size is around 10 million records.
The task is to retrieve the parts associated with each system for a particular day and compare them with the parts from the previous day. If there are any changes, such as modifications, deletions, or additions, I need to create a union of all these changes and store the result in Databricks. This updated dataset will then be used for comparison with the parts available on the next day, and the process repeats.
Below, I've provided an example of the input and expected output for a particular system.
Currently, I'm iterating over days for each system and calculating the deltas using left anti and inner joins in each iteration. However, this process is taking a considerable amount of time in Databricks.
Is there a more efficient way to handle this without the iterative approach? I would appreciate any suggestions.
Input:
system | logDate | name | version | serial |
12345678 | 6/24/24 | name1 | version1 | number1 |
12345678 | 6/24/24 | name2 | version2 | number2 |
12345678 | 6/25/24 | name1 | version1 | number1 |
12345678 | 6/25/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name1 | version1 | number1 |
12345678 | 6/26/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name3 | version3 | number3 |
12345678 | 6/27/24 | name1 | version1 | number1 |
12345678 | 6/27/24 | name2 | version2.1 | number2 |
12345678 | 6/27/24 | name3 | version3 | number3 |
12345678 | 6/30/24 | name2 | version2.1 | number2 |
12345678 | 6/30/24 | name3 | version3 | number3.1 |
Output;
system | logDate | name | version | serial |
12345678 | 6/24/24 | name1 | version1 | number1 |
12345678 | 6/24/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name1 | version1 | number1 |
12345678 | 6/26/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name3 | version3 | number3 |
12345678 | 6/27/24 | name1 | version1 | number1 |
12345678 | 6/27/24 | name2 | version2.1 | number2 |
12345678 | 6/27/24 | name3 | version3 | number3 |
12345678 | 6/30/24 | name1 | version1 | number1 |
12345678 | 6/30/24 | name2 | version2.1 | number2 |
12345678 | 6/30/24 | name3 | version3 | number3.1 |
07-31-2024 10:14 PM
HI @ogdendc ,
I have read your your beware of for loop, but in the scenario described above how to tackle it.
Please guide.
07-31-2024 10:20 PM
Have tried Window function with lag, however the data i have in not sequencial, what i mean is if i have on Jan 1, then next data be on Jan 5, then Jan 20 and so on. With window function i am able to get modified, common however data deleted is getting missed.
08-01-2024 04:00 AM
@Kaniz_Fatmaany suggestion..
08-01-2024 05:23 AM
I'm not sure I fully understand the question, you are right to avoid loops as they are very inefficient in spark. Some techniques that might help you:
Join the table to itself on system identifiers, but the date-1, ie
select T1.*, T2.log_date as log_date_yest, T2.serial as serial_yest
from yourTable T1
inner join yourTable T2
on T1.system = T2.system
and T1.log_date = date_add(T2.log_date, -1)
date add docs: https://docs.databricks.com/en/sql/language-manual/functions/date_add.html
If you wanted a table with only the latest values in it, you could use MERGE INTO. Say you wanted a table that only reflected the latest values, the code might look something like:
MERGE INTO yourTable USING newData ON yourTable.system = newData.system
AND yourTable.logDate = newData.logDate
AND yourTable.name = newData.name
WHEN NOT MATCHED THEN INSERT * WHEN MATCHED AND yourTable.version > newData.version THEN UPDATE SET * --you'd have to create logic to define what data is new
Much better examples here:
https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html
These are in SQL, you could rewrite them in python it won't have much impact on the performance.
Hopefully, this will run much faster. As for right sizing your cluster you have two approaches:
Hope this is useful! Let us know how you get on. Holly
Monday
Thank you so much for the link.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group