Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Question about Data Management for Supply-Demand Allocation

milind2000
New Contributor

I have a scenario where I am trying to parallelize supply-demand allotment between sellers and buyers with many-to-many links. I am unsure whether I can parallelize the calculation using PySpark operations. I have two columns to keep track of initial supply and initial demand, and every row represents an allotment transaction. I also need to keep track of the remaining available supply and remaining demand for each row. The conditions to be met are:

1) After allocating supply in an earlier row, the supply available to a later row should reflect the updated amount, i.e. depleted by the allocated quantity.
2) If a buyer receives a partial or full allocation in an earlier row, then the buyer's demand in a later row should be depleted by the allocated quantity.

Doing this in pandas with row operations is straightforward. I am not well-versed in PySpark, so I wanted to see whether the same process can be parallelized, either with column operations or with other PySpark row operations. Thanks, and any help would be appreciated!
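
For reference, the row-wise pandas logic I have in mind is roughly the greedy loop below (the column names seller_id, buyer_id, requested_qty and the output columns are just placeholders for my actual schema):

```python
import pandas as pd

def allocate_rows(df: pd.DataFrame, supply: dict, demand: dict) -> pd.DataFrame:
    # supply: seller_id -> initial available supply
    # demand: buyer_id  -> initial required demand
    out = df.copy()
    allocated, rem_supply, rem_demand = [], [], []
    for row in out.itertuples(index=False):
        # Allocate as much as the request, the seller's remaining supply,
        # and the buyer's remaining demand all allow.
        qty = max(0.0, min(row.requested_qty,
                           supply[row.seller_id],
                           demand[row.buyer_id]))
        supply[row.seller_id] -= qty   # condition 1: deplete the seller's supply
        demand[row.buyer_id] -= qty    # condition 2: deplete the buyer's demand
        allocated.append(qty)
        rem_supply.append(supply[row.seller_id])
        rem_demand.append(demand[row.buyer_id])
    out["allocated_qty"] = allocated
    out["remaining_supply"] = rem_supply
    out["remaining_demand"] = rem_demand
    return out
```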

1 REPLY

Walter_C
Databricks Employee

Parallelizing supply-demand allotment in PySpark can be challenging due to the need for sequential updates to supply and demand values across rows. However, it is possible to achieve this using PySpark operations, though it may require a different approach compared to pandas.

Here are some steps and considerations to help you parallelize the process:

  1. Initial Setup: Load your data into a PySpark DataFrame. Ensure that your DataFrame has columns for initial supply, initial demand, and any other relevant transaction details.

  2. Window Functions: Use PySpark's window functions to create a running total or cumulative sum that helps track the updated supply and demand values. Window functions let you perform operations across a specified range of rows, which is useful for preserving the sequential nature of your updates (see the window-function sketch after this list).

  3. Custom Functions: If the logic is too complex for built-in functions, consider using mapInPandas, groupBy(...).applyInPandas, or pandas_udf to apply custom row-wise operations. These let you run pandas code inside PySpark, enabling more complex transformations while still benefiting from parallel execution (see the applyInPandas sketch after this list).

  4. Iterative Updates: If the updates are highly dependent on the previous rows, you might need to implement an iterative approach. This can be done by repeatedly applying transformations and updating the DataFrame until the desired state is achieved. Note that this approach may be less efficient due to the iterative nature.
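
To make step 2 concrete, here is a minimal window-function sketch, assuming illustrative column names (seller_id, a row_id that defines allocation order, initial_supply, requested_qty). A running sum of the earlier requests for the same seller yields that row's remaining supply, which covers condition (1) on its own; condition (2) couples rows across sellers through the buyer's demand, which a single cumulative sum cannot express, and that is where steps 3 and 4 come in.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# All rows before the current one, for the same seller, in allocation order.
prior_rows = (Window.partitionBy("seller_id")
                    .orderBy("row_id")
                    .rowsBetween(Window.unboundedPreceding, -1))

df_alloc = (
    df
    .withColumn("prior_requested",
                F.coalesce(F.sum("requested_qty").over(prior_rows), F.lit(0.0)))
    .withColumn("remaining_supply",
                F.greatest(F.col("initial_supply") - F.col("prior_requested"), F.lit(0.0)))
    .withColumn("allocated_qty",
                F.least(F.col("requested_qty"), F.col("remaining_supply")))
)
```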
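For step 3, if the transactions can be split into independent groups (for example a region or product line where no seller or buyer appears in more than one group; this is an assumption about your data), applyInPandas can run the same row-wise pandas logic on each group in parallel. The group_key and column names below are placeholders:

```python
import pandas as pd

# Output schema must match the columns returned by the allocation function.
result_schema = (
    "group_key string, row_id long, seller_id string, buyer_id string, "
    "requested_qty double, allocated_qty double, "
    "remaining_supply double, remaining_demand double"
)

def allocate(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives every row of one group as a single pandas DataFrame.
    pdf = pdf.sort_values("row_id").copy()
    supply = pdf.groupby("seller_id")["initial_supply"].first().to_dict()
    demand = pdf.groupby("buyer_id")["initial_demand"].first().to_dict()
    allocated, rem_supply, rem_demand = [], [], []
    for row in pdf.itertuples(index=False):
        qty = max(0.0, min(row.requested_qty,
                           supply[row.seller_id],
                           demand[row.buyer_id]))
        supply[row.seller_id] -= qty   # condition 1
        demand[row.buyer_id] -= qty    # condition 2
        allocated.append(qty)
        rem_supply.append(supply[row.seller_id])
        rem_demand.append(demand[row.buyer_id])
    pdf["allocated_qty"] = allocated
    pdf["remaining_supply"] = rem_supply
    pdf["remaining_demand"] = rem_demand
    return pdf[["group_key", "row_id", "seller_id", "buyer_id",
                "requested_qty", "allocated_qty",
                "remaining_supply", "remaining_demand"]]

result = df.groupBy("group_key").applyInPandas(allocate, schema=result_schema)
```

If no such independent grouping key exists, the allocation is inherently sequential across all rows, and the iterative approach in step 4 (or processing the ordered data on a single node) is the fallback.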
