Re: Several unavoidable for loops are slowing this...

lingareddy_Alva · ‎04-28-2025

Spark is optimized for hundreds of gigabytes or millions of rows, not for small in-memory lookups with heavy control flow (unless engineered carefully).
That's why pandas is much faster for your specific case right now.

Pre-load and Broadcast All Mappings
Parse every mapping once, up front, instead of calling json.loads inside the loop on every iteration.
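A minimal sketch of the idea, assuming the mappings arrive as one JSON string per column. The names RAW_MAPPINGS, load_mappings and broadcast_mappings are hypothetical and the sample values are made up for illustration; only json.loads and sparkContext.broadcast are real calls.

```python
import json

# Hypothetical input: one JSON string per column, e.g. read from a config table.
RAW_MAPPINGS = {
    "status": '{"A": "Active", "I": "Inactive"}',
    "region": '{"N": "North", "S": "South"}',
}


def load_mappings(raw_by_column):
    """Parse every JSON mapping exactly once, before any loop over the data."""
    return {col: json.loads(raw) for col, raw in raw_by_column.items()}


def broadcast_mappings(spark, mappings):
    """Ship the parsed dicts to the executors once (only needed if UDFs read them)."""
    return spark.sparkContext.broadcast(mappings)


# Parsed once at module level; loops further down only do dict lookups.
MAPPINGS = load_mappings(RAW_MAPPINGS)
```

Inside the loops you would then read MAPPINGS[col] (or the .value of the broadcast handle inside a UDF) rather than re-parsing the JSON.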

Use Single Bulk Transformation Instead of Nested withColumn
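Instead of calling .withColumn inside two nested loops, build all new columns in one transformation. Each .withColumn call adds another projection to the query plan, so hundreds of them make the plan itself slow to analyze.
Build a list of new columns first, then apply .selectExpr or .select(*cols) once.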
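Roughly, the pattern looks like the sketch below. The helper names (build_select_exprs, add_columns_once) and the derived columns are hypothetical; the loops only collect SQL expression strings, and the single Spark call is df.selectExpr at the end.

```python
def build_select_exprs(existing_cols, derived):
    """Return one list of SQL expressions: all current columns plus every new one.

    `derived` maps a new column name to the SQL expression that computes it.
    """
    exprs = [f"`{c}`" for c in existing_cols]
    exprs += [f"{expr} AS `{name}`" for name, expr in derived.items()]
    return exprs


def add_columns_once(df, derived):
    """Apply all derived columns in a single projection instead of many withColumn calls."""
    return df.selectExpr(*build_select_exprs(df.columns, derived))


# Hypothetical derived columns, collected by the loops that used to call withColumn.
DERIVED = {}
for col in ["status", "region"]:
    for suffix, fn in [("upper", "upper"), ("len", "length")]:
        DERIVED[f"{col}_{suffix}"] = f"{fn}(`{col}`)"

EXPRS = build_select_exprs(["status", "region"], DERIVED)
```

The nested loops stay, but they now run over plain Python strings on the driver, so Spark only sees one projection.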

Map via UDF or SQL CASE
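If mappings are small and fixed, a UDF that looks values up in a pre-loaded dict can be fast enough at this scale.
Or, if the mappings are simple, generate a CASE expression dynamically so the lookup stays in native Spark SQL.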
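For the CASE route, something like this sketch could generate the expression from a dict. The function names and the sample mapping are hypothetical, and the quoting assumes Spark SQL's default backslash escaping inside string literals.

```python
def sql_literal(value):
    """Quote a Python string as a Spark SQL string literal (backslash escaping assumed)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def case_expr(column, mapping, new_name):
    """Build `CASE column WHEN k THEN v ... ELSE column END AS new_name`."""
    whens = " ".join(
        f"WHEN {sql_literal(k)} THEN {sql_literal(v)}" for k, v in mapping.items()
    )
    # Unmapped values fall through unchanged; use ELSE NULL if that fits better.
    return f"CASE `{column}` {whens} ELSE `{column}` END AS `{new_name}`"


# Hypothetical mapping for a `status` column.
STATUS_CASE = case_expr("status", {"A": "Active", "I": "Inactive"}, "status_label")
```

The resulting string can go straight into a single call such as df.selectExpr("*", STATUS_CASE), together with the expressions for every other mapped column.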

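Consider pandas, or the pandas API on Spark (pyspark.pandas, formerly Koalas)
Since your data is tiny (30k rows), maybe don't even use classic PySpark.
Plain pandas on the driver is way faster at this size because it skips Spark's query planning and task scheduling; pyspark.pandas keeps the pandas syntax but still runs on Spark.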
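A rough sketch of the plain-pandas variant, assuming the 30k rows fit comfortably in driver memory. The function name remap_in_pandas and the "_mapped" suffix are hypothetical; it expects a pandas DataFrame, for example the result of spark_df.toPandas().

```python
def remap_in_pandas(pdf, mappings):
    """Add one `<col>_mapped` column per mapping to a copy of a pandas DataFrame.

    `mappings` is a dict of {column name: {old value: new value}}.
    """
    out = pdf.copy()
    for col, mapping in mappings.items():
        # Series.map with a dict leaves NaN where a value has no entry.
        out[f"{col}_mapped"] = out[col].map(mapping)
    return out
```

If the result has to land back in Spark, spark.createDataFrame(result) converts it once at the end.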

LR
