Iteration - PySpark vs Pandas

elgeo
Valued Contributor II

Hello. Could someone please explain why iterating over a PySpark DataFrame is so much slower than iterating over a Pandas DataFrame?

PySpark

# Collect the entire DataFrame to the driver as a list of Row objects
df_list = df.collect()

for index in range(len(df_list)):
    row = df_list[index]
    ...  # process the row here

Pandas

# Convert the Spark DataFrame to a Pandas DataFrame on the driver
df_pnd = df.toPandas()

for index, row in df_pnd.iterrows():
    ...  # process the row here

Thank you in advance.

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @ELENI GEORGOUSI,

Iterating over a PySpark DataFrame can be slower than iterating over a Pandas DataFrame for several reasons:

  1. PySpark is a distributed computing engine designed for data sets too large to fit into memory on a single machine. Its operations are optimized for parallel processing and data shuffling across a cluster, and that coordination overhead is wasted on small data sets. Pandas, in contrast, is optimized for in-memory processing on a single machine, so it is typically faster for data that fits in memory.
  2. PySpark DataFrame operations are lazily evaluated: nothing executes until an action (such as collect()) is called. The entire cost of the query plan is then paid at once, which can feel slow when you start iterating, especially if the DataFrame has many partitions or the plan involves a lot of shuffling.
  3. Spark itself is implemented in Scala and runs on the JVM, while your driver code runs in the CPython interpreter. PySpark therefore incurs inter-process communication overhead between the JVM and Python.
  4. Moving rows out of Spark (for example via collect() or toPandas()) requires serializing data on the JVM side and deserializing it in Python, which is slower than working with data that is already in memory, as Pandas does. See the sketch after this list for one way to reduce this cost.
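
As an illustration of point 4, Arrow-based columnar transfer can cut the serialization cost of toPandas() considerably. Here is a minimal sketch, assuming Spark 3.x (where the spark.sql.execution.arrow.pyspark.enabled setting exists):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow so toPandas() moves data in columnar batches
# instead of serializing row by row (Spark 3.x configuration key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# df is the Spark DataFrame from the question; this conversion
# now uses Arrow batches and is typically much faster
df_pnd = df.toPandas()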

In summary, iterating over a PySpark DataFrame can be slower than iterating over a Pandas DataFrame because of these differences in design and implementation.

If your dataset fits into memory, Pandas may be the better choice for performance.

However, if your dataset cannot fit into memory, PySpark may be necessary for scalability and distributed processing; in that case, prefer its built-in column operations over explicit row iteration (see the sketch below).
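
In either engine, most per-row logic can be expressed as column operations, which avoids driver-side iteration entirely. A small illustrative sketch (the column names amount and rate are hypothetical, not from the original question):

from pyspark.sql import functions as F

# Instead of collecting and looping row by row in Python,
# express the computation as a column expression that Spark
# evaluates in parallel across all partitions
result = df.withColumn("total", F.col("amount") * F.col("rate"))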


2 REPLIES


Anonymous
Not applicable

Hi @ELENI GEORGOUSI,

Hope everything is going great.

Just wanted to check in on whether you were able to resolve your issue. If so, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.

Cheers!
