Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Iteration - PySpark vs Pandas

elgeo
Valued Contributor II

Hello. Could someone please explain why iterating over a PySpark DataFrame is so much slower than iterating over a Pandas DataFrame?

PySpark

df_list = df.collect()
for index in range(0, len(df_list)):
    ...  # process df_list[index], a pyspark.sql.Row

Pandas

df_pnd = df.toPandas()
for index, row in df_pnd.iterrows():
    ...  # process row, a pandas Series

Thank you in advance

1 ACCEPTED SOLUTION

Kaniz_Fatma
Community Manager

Hi @ELENI GEORGOUSI,

Iterating over a PySpark DataFrame can be slower than iterating over a Pandas DataFrame for several reasons:

  1. PySpark is a distributed computing engine designed for data sets that cannot fit into a single machine's memory, so its operations are optimized for parallel processing and data shuffling across a cluster. That overhead is wasted on small data sets. In contrast, Pandas is optimized for in-memory processing on a single machine and is usually faster when the data fits comfortably in memory.
  2. PySpark DataFrame operations are lazily evaluated: nothing runs until an action such as collect() is called. That first action can be expensive, especially if the DataFrame has many partitions or a lot of shuffling is involved.
  3. The PySpark DataFrame engine runs on the JVM, while your loop runs in the CPython interpreter. Every row that crosses that boundary incurs communication overhead between the JVM and the Python process.
  4. Moving rows out of the JVM involves serializing and deserializing them, which is slower than working with data that already sits in memory as native Pandas structures.

In summary, iterating over a PySpark DataFrame can be slower than iterating over a Pandas DataFrame due to the differences in design and implementation.

If you need to work with smaller datasets that can fit into memory, Pandas may be a better choice for performance reasons.

However, if you need to work with larger datasets that cannot fit into memory, PySpark may be necessary for scalability and distributed processing.
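
To make the comparison concrete, here is a minimal sketch (not from the original post) of the two patterns in the question, with comments pointing at where the overhead from points 2-4 shows up. It assumes a Databricks / Spark 3.x notebook where a SparkSession is already available as spark; the DataFrame, column name, and row count are made up for illustration.

# A rough sketch, assuming `spark` already exists (e.g. in a Databricks notebook).
import time

df = spark.range(100_000).withColumnRenamed("id", "value")

# Pattern 1: collect() is an action, so it triggers the lazy plan (point 2);
# every row is then serialized on the executors and deserialized into
# Python Row objects on the driver (points 3 and 4).
start = time.time()
rows = df.collect()
total = 0
for row in rows:
    total += row["value"]  # field access on a pyspark.sql.Row
print(f"collect() loop took {time.time() - start:.2f}s, total={total}")

# Pattern 2: toPandas() also pulls everything to the driver, but with Arrow
# enabled the JVM-to-Python transfer happens in columnar batches instead of
# row by row, and the result is an ordinary in-memory Pandas DataFrame.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
start = time.time()
pdf = df.toPandas()
total = 0
for _, r in pdf.iterrows():
    total += r["value"]
print(f"toPandas() + iterrows() took {time.time() - start:.2f}s, total={total}")

Either way, both loops pull the entire data set onto the driver, so for anything large the usual recommendation is to express the logic as DataFrame transformations (or the pandas API on Spark) rather than iterating row by row.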


2 REPLIES


Anonymous
Not applicable

Hi @ELENI GEORGOUSI,

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!
