Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Several unavoidable for loops are slowing this PySpark code. Is it possible to improve it?

397973
New Contributor III

Hi. I have a PySpark notebook that takes 25 minutes to run, compared to about one minute for the same work with Pandas on an on-prem Linux machine. How can I speed it up?

It's not a volume issue. The input is around 30k rows, and the output has the same row count because there's no filtering or aggregation, just new fields being created. There are no collect, count, or display statements (which would slow it down).

The main work is a set of mappings I need to apply. Which mapping applies depends on the existing field and on which model is being run, so the mappings differ by variable and by model. That's where the for loops come in.

I'm not iterating over the DataFrame itself, just over 15 fields (different variables) and 4 different mappings, and then repeating that 10 times (once per model).

The workers are m5d.2xlarge and the driver is r4.2xlarge, with min/max workers set to 4/20. That should be plenty.

I attached a picture to illustrate the code flow. Does anything stand out that I could change, or that Spark is known to be slow at, such as json.loads or create_map?
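Roughly, the flow looks like this (a simplified sketch, not the exact code: the real field, mapping, and model names differ, and load_mapping_json is just a stand-in for however the JSON is actually read):

import json
from pyspark.sql import functions as F

for model in models:                                      # ~10 models
    mapping_dict = json.loads(load_mapping_json(model))   # hypothetical loader
    for var in variables:                                 # ~15 fields
        for name, mapping in mapping_dict.items():        # ~4 mappings each
            # build a literal map and look up the current value of `var` in it
            map_col = F.create_map([F.lit(x) for kv in mapping.items() for x in kv])
            df = df.withColumn(f"{var}_{name}_{model}", map_col[F.col(var)])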

 

Accepted Solution

lingareddy_Alva
Honored Contributor II

@397973 

Spark is optimized for hundreds of GB or millions of rows, not for small in-memory lookups with heavy control flow (unless engineered carefully). That's why Pandas is much faster for your specific case right now.

Pre-load and Broadcast All Mappings
Instead of calling json.loads inside the loops on every iteration, parse each mapping once up front and reuse (or broadcast) it, as in the sketch below.
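A minimal sketch, assuming one JSON file of mappings per model; the models list and the /dbfs path are made-up placeholders:

import json
from pyspark.sql import functions as F

# Parse every mapping once, outside the loops.
models = ["model_a", "model_b"]                        # placeholder model names
mappings = {}
for model in models:
    with open(f"/dbfs/mappings/{model}.json") as f:
        mappings[model] = json.load(f)                 # {variable: {old_value: new_value}}

# Turn each small dict into a literal map expression once and reuse it everywhere.
map_exprs = {
    (model, var): F.create_map([F.lit(x) for kv in var_map.items() for x in kv])
    for model, per_model in mappings.items()
    for var, var_map in per_model.items()
}

# If the dicts are used inside a UDF instead, ship them to the executors once:
# bc_mappings = spark.sparkContext.broadcast(mappings)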

Use a Single Bulk Transformation Instead of Nested withColumn
Rather than calling .withColumn inside two nested loops, build all the new columns in one transformation: collect the column expressions in a list first, then apply .selectExpr or .select(*cols) once (see the sketch below). Each .withColumn adds another projection to the query plan, so hundreds of chained calls make the plan expensive to analyze.
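A sketch of that pattern, reusing the hypothetical map_exprs, models, and variables names from the previous snippet (variables being the list of the 15 field names):

from pyspark.sql import functions as F

# Build every new column expression first...
new_cols = [
    map_exprs[(model, var)][F.col(var)].alias(f"{var}_{model}")
    for model in models
    for var in variables
    if (model, var) in map_exprs
]

# ...then add them all in a single select: one projection in the plan instead of
# hundreds of chained withColumn calls.
df_out = df.select("*", *new_cols)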

Map via UDF or SQL CASE
If the mappings are small and fixed, a dictionary-lookup UDF can be fast enough, or you can generate a CASE expression dynamically so the lookup runs natively in Spark SQL (see the sketch below).
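A sketch of both options for a single column; the colour column and the mapping values are made-up examples:

from pyspark.sql import functions as F

mapping = {"R": "red", "G": "green", "B": "blue"}      # small, fixed example mapping

# Option A: dictionary-lookup UDF (simple, but crosses the Python/JVM boundary).
lookup = F.udf(lambda v: mapping.get(v))               # returns NULL for unknown keys
df = df.withColumn("colour_name", lookup(F.col("colour")))

# Option B: generate a CASE expression so the lookup stays in the JVM.
case_expr = (
    "CASE "
    + " ".join(f"WHEN colour = '{k}' THEN '{v}'" for k, v in mapping.items())
    + " ELSE NULL END"
)
df = df.withColumn("colour_name_sql", F.expr(case_expr))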

Consider the pandas API on Spark (formerly Koalas), or plain pandas
Since your data is tiny (30k rows), you may not need classic PySpark at all. Pulling the data to the driver with toPandas, or using pyspark.pandas for pandas-style syntax, avoids most of the per-transformation planning overhead that dominates at this size (see the sketch below).
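A sketch of both routes, reusing the hypothetical mappings, models, and variables names from the earlier snippets; pandas_api() needs PySpark 3.2+ and the var1/model_a names are illustrative:

# Route 1: pull the 30k rows to the driver and do the mapping in plain pandas.
pdf = df.toPandas()
for model in models:
    for var in variables:
        pdf[f"{var}_{model}"] = pdf[var].map(mappings[model][var])
result = spark.createDataFrame(pdf)    # convert back only if later steps need Spark

# Route 2: keep pandas-style syntax while staying on Spark.
psdf = df.pandas_api()
psdf["var1_model_a"] = psdf["var1"].map(mappings["model_a"]["var1"])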

 

LR

