07-31-2024 05:21 AM
Hi all, I need your input.
I am new to Databricks and working with a dataset that consists of around 10,000 systems, each containing approximately 100 to 150 parts. These parts have attributes such as name, version, and serial number. The dataset size is around 10 million records.
The task is to retrieve the parts associated with each system for a particular day and compare them with the parts from the previous day. If there are any changes, such as modifications, deletions, or additions, I need to create a union of all these changes and store the result in Databricks. This updated dataset will then be used for comparison with the parts available on the next day, and the process repeats.
Below, I've provided an example of the input and expected output for a particular system.
Currently, I'm iterating over days for each system and calculating the deltas using left anti and inner joins in each iteration (a sketch of one such iteration follows the example below). However, this process is taking a considerable amount of time in Databricks.
Is there a more efficient way to handle this without the iterative approach? I would appreciate any suggestions.
Input:
system | logDate | name | version | serial |
12345678 | 6/24/24 | name1 | version1 | number1 |
12345678 | 6/24/24 | name2 | version2 | number2 |
12345678 | 6/25/24 | name1 | version1 | number1 |
12345678 | 6/25/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name1 | version1 | number1 |
12345678 | 6/26/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name3 | version3 | number3 |
12345678 | 6/27/24 | name1 | version1 | number1 |
12345678 | 6/27/24 | name2 | version2.1 | number2 |
12345678 | 6/27/24 | name3 | version3 | number3 |
12345678 | 6/30/24 | name2 | version2.1 | number2 |
12345678 | 6/30/24 | name3 | version3 | number3.1 |
Output:
system | logDate | name | version | serial |
12345678 | 6/24/24 | name1 | version1 | number1 |
12345678 | 6/24/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name1 | version1 | number1 |
12345678 | 6/26/24 | name2 | version2 | number2 |
12345678 | 6/26/24 | name3 | version3 | number3 |
12345678 | 6/27/24 | name1 | version1 | number1 |
12345678 | 6/27/24 | name2 | version2.1 | number2 |
12345678 | 6/27/24 | name3 | version3 | number3 |
12345678 | 6/30/24 | name1 | version1 | number1 |
12345678 | 6/30/24 | name2 | version2.1 | number2 |
12345678 | 6/30/24 | name3 | version3 | number3.1 |
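For reference, here is a minimal sketch of one iteration of the current approach, assuming a source table named parts_log with the columns shown above; the two dates are just one step of the loop:
-- One iteration of the per-day comparison (parts_log and the dates are
-- assumptions for illustration). Rows present today with no exact match
-- yesterday are additions or modifications; swapping cur and prev in the
-- anti join finds deletions.
SELECT cur.*
FROM parts_log cur
LEFT ANTI JOIN parts_log prev
  ON  prev.system  = cur.system
  AND prev.name    = cur.name
  AND prev.version = cur.version
  AND prev.serial  = cur.serial
  AND prev.logDate = DATE'2024-06-26'
WHERE cur.logDate = DATE'2024-06-27';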
07-31-2024 10:14 PM
Hi @ogdendc,
I have read your "beware of the for loop" post, but how should I tackle the scenario described above?
Please guide.
07-31-2024 10:20 PM
I have tried a window function with lag; however, my data is not sequential. What I mean is: I may have data on Jan 1, then the next data on Jan 5, then Jan 20, and so on. With the window function I am able to get modified and common (unchanged) records, but deleted records are getting missed.
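For what it's worth, a sketch of that lag-based attempt (the table name parts_log and the exact query are assumptions) shows why deletions slip through:
-- lag() only produces a comparison for parts that still have a row on the
-- current date; a part that simply stops appearing has no current row, so
-- its deletion is never flagged.
SELECT system, logDate, name, version, serial,
       lag(version) OVER (PARTITION BY system, name ORDER BY logDate) AS prev_version,
       lag(serial)  OVER (PARTITION BY system, name ORDER BY logDate) AS prev_serial
FROM parts_log;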
08-01-2024 04:00 AM
@Retired_mod, any suggestions?
08-01-2024 05:23 AM
I'm not sure I fully understand the question, but you are right to avoid loops, as they are very inefficient in Spark. Some techniques that might help you:
Join the table to itself on the system identifier, but with the date offset by one day, i.e.
select T1.*, T2.log_date as log_date_yest, T2.serial as serial_yest
from yourTable T1
inner join yourTable T2
  on T1.system = T2.system
  -- T2 is the previous day's snapshot, so T1's date is T2's date plus one
  and T1.log_date = date_add(T2.log_date, 1)
date add docs: https://docs.databricks.com/en/sql/language-manual/functions/date_add.html
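One caveat: since the snapshot dates are not consecutive (Jan 1, then Jan 5, then Jan 20, and so on, as noted above), a fixed one-day offset will miss gaps. A hedged variant, assuming a table named parts_log, derives each system's previous available logDate with lag() and joins on that instead:
-- Derive each system's previous *available* snapshot date, then join the
-- current rows to that snapshot instead of to a fixed date minus one.
WITH snapshot_dates AS (
  SELECT DISTINCT system, logDate FROM parts_log
),
with_prev AS (
  SELECT system, logDate,
         lag(logDate) OVER (PARTITION BY system ORDER BY logDate) AS prev_logDate
  FROM snapshot_dates
)
SELECT cur.*, prev.version AS version_prev, prev.serial AS serial_prev
FROM parts_log cur
JOIN with_prev d
  ON  d.system  = cur.system
  AND d.logDate = cur.logDate
LEFT JOIN parts_log prev
  ON  prev.system  = cur.system
  AND prev.logDate = d.prev_logDate
  AND prev.name    = cur.name;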
If you wanted a table that only reflected the latest values, you could use MERGE INTO; the code might look something like:
MERGE INTO yourTable
USING newData
  ON yourTable.system = newData.system
  AND yourTable.name = newData.name
WHEN MATCHED AND newData.version > yourTable.version THEN
  UPDATE SET *  -- you'd have to create logic to define what data is new
WHEN NOT MATCHED THEN
  INSERT *
Much better examples here:
https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html
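Putting the pieces together for the original question, a one-pass sketch (all table and column names are the same assumptions as above) can classify additions, modifications, and deletions between each snapshot and the previous available one with no per-day loop; the FULL OUTER JOIN is what keeps the previous-day-only rows that a lag-based approach misses:
WITH snapshot_dates AS (
  SELECT DISTINCT system, logDate FROM parts_log
),
with_prev AS (
  SELECT system, logDate,
         lag(logDate) OVER (PARTITION BY system ORDER BY logDate) AS prev_logDate
  FROM snapshot_dates
),
-- The previous snapshot's rows, re-keyed by the *current* date so that
-- deletions surface on the day they happen.
prev_at AS (
  SELECT d.logDate AS cur_logDate, p.system, p.name, p.version, p.serial
  FROM with_prev d
  JOIN parts_log p
    ON p.system = d.system AND p.logDate = d.prev_logDate
)
SELECT coalesce(cur.system, prev.system)       AS system,
       coalesce(cur.logDate, prev.cur_logDate) AS logDate,
       coalesce(cur.name, prev.name)           AS name,
       CASE WHEN prev.name IS NULL THEN 'added'
            WHEN cur.name  IS NULL THEN 'deleted'
            WHEN cur.version <> prev.version
              OR cur.serial  <> prev.serial THEN 'modified'
            ELSE 'unchanged' END               AS change_type
FROM parts_log cur
FULL OUTER JOIN prev_at prev
  ON  prev.system      = cur.system
  AND prev.cur_logDate = cur.logDate
  AND prev.name        = cur.name;
Filtering out the 'unchanged' rows, or keeping only the dates where any row changed, would then give the delta set to union and store.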
These examples are in SQL; you could rewrite them in Python, and it won't have much impact on performance.
Hopefully, this will run much faster. As for right-sizing your cluster, you have two approaches:
Hope this is useful! Let us know how you get on. Holly
09-09-2024 12:51 AM
Thank you so much for the link.