Several unavoidable for loops are slowing this PyS...

397973 · ‎04-28-2025

Hi. I have a PySpark notebook that takes 25 minutes to run as opposed to one minute in on-prem Linux + Pandas. How can I speed it up?

It's not a volume issue. The input is around 30k rows. Output is the same because there's no filtering or aggregation; just creating new fields. No collect, count, or display statements (which would slow it down).

The main thing is a bunch of mappings I need to apply, but it depends on existing fields and there are various models I need to run. So the mappings are different depending on variable and model. That's where the for loops come in.

Now I'm not iterating over the dataframe itself; just over 15 fields (different variables) and 4 different mappings. Then do that 10 times (once per model).

The worker is m5d 2x large and drivers are r4 2x large, min/max workers are 4/20. This should be fine.

I attached a pic to illustrate the code flow. Does anything stand out that you think I could change or that you think Spark is slow at, such as json.load or create_map?

lingareddy_Alva · ‎04-28-2025

@397973

Spark is optimized for 100s of GB or millions of rows, NOT small in-memory lookups with heavy control flow (unless engineered carefully).
That's why Pandas is much faster for your specific case now.

Pre-load and Broadcast All Mappings
Instead of loading json.loads inside loop every time

Use Single Bulk Transformation Instead of Nested withColumn
Instead of .withColumn inside two nested loops — build all new columns in one transformation.
Build a list of new columns first, then apply .selectExpr or .select(*cols) once.

Map via UDF or SQL CASE
If mappings are small and fixed, UDF can be very fast:
Or, generate a CASE statement dynamically if mappings are simple.

Consider pandas_on_spark (koalas)
Since your data is tiny (30k rows), maybe don't even use PySpark classic
Way faster because it bypasses Spark DAG overhead for small data.

LR

View solution in original post

Several unavoidable for loops are slowing this PySpark code. Is it possible to improve it?