04-28-2025 09:10 AM
Spark is optimized for hundreds of GB or millions of rows, NOT for small in-memory lookups with heavy control flow (unless engineered carefully).
That's why pandas is much faster for your specific case right now.
Pre-load and Broadcast All Mappings
Instead of calling json.loads inside the loop on every iteration, parse each mapping once up front and broadcast the parsed dicts.
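A minimal sketch of that idea, not your actual job: `load_mappings`, `broadcast_mappings`, and the sample `raw` dict are hypothetical names and data, and I'm assuming your mappings arrive as one JSON string per column. The Spark call is left commented out because it needs a live session.

```python
import json


def load_mappings(raw_json_by_column):
    """Parse every JSON mapping exactly once, outside any loop.

    raw_json_by_column is a hypothetical dict: column name -> JSON string.
    """
    return {col: json.loads(raw) for col, raw in raw_json_by_column.items()}


def broadcast_mappings(spark, mappings):
    """Ship the parsed dicts to the executors once as a read-only broadcast variable."""
    return spark.sparkContext.broadcast(mappings)


# Hypothetical input; replace with wherever your mapping strings come from.
raw = {
    "status": '{"A": "Active", "I": "Inactive"}',
    "region": '{"N": "North", "S": "South"}',
}
mappings = load_mappings(raw)

# With a live SparkSession named `spark`:
# mappings_bc = broadcast_mappings(spark, mappings)
# mappings_bc.value["status"] is then readable inside a UDF.
```

The parsing cost is now paid once per mapping rather than once per loop iteration.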
Use Single Bulk Transformation Instead of Nested withColumn
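Each .withColumn call adds another projection to the query plan, so calling it inside two nested loops makes the plan grow with every iteration. Build all new columns in one transformation instead.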
Build a list of new columns first, then apply .selectExpr or .select(*cols) once.
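Roughly like this, as a sketch under assumptions: `plan_select` and `apply_once` are hypothetical helpers, the column names and SQL expressions are made up, and backtick-quoting assumes your column names contain no backticks themselves.

```python
def plan_select(existing_columns, new_exprs):
    """Return one list of SQL expressions for a single selectExpr call.

    Untouched columns pass through; each entry in new_exprs (name -> SQL
    expression) adds a column or replaces one of the same name. Replaced
    columns move to the end of the projection.
    """
    plan = [f"`{c}`" for c in existing_columns if c not in new_exprs]
    plan += [f"{expr} AS `{name}`" for name, expr in new_exprs.items()]
    return plan


def apply_once(df, new_exprs):
    """One projection instead of one withColumn call per loop iteration."""
    return df.selectExpr(*plan_select(df.columns, new_exprs))


# Hypothetical columns and expressions, just to show the shape of the plan.
plan = plan_select(
    ["id", "status", "region"],
    {"status_clean": "upper(trim(status))", "region": "lower(region)"},
)
```

Your nested loops would then only fill the `new_exprs` dict, and `apply_once(df, new_exprs)` touches the DataFrame a single time at the end.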
Map via UDF or SQL CASE
If the mappings are small and fixed, a UDF that looks values up in a plain dict is simple and can be fast enough at this data size.
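A sketch of the UDF route, with some assumptions: `make_lookup` and `make_lookup_udf` are hypothetical names, the column is assumed to be a string column, and passing unknown values through unchanged is my choice rather than something from your code. pyspark is imported inside the function so the lookup itself can be checked without Spark.

```python
def make_lookup(mapping):
    """Return a plain Python function that maps a value via the dict.

    Values missing from the mapping (including None) are returned unchanged;
    that pass-through behavior is an assumption of this sketch.
    """
    def lookup(value):
        return mapping.get(value, value)
    return lookup


def make_lookup_udf(mapping):
    """Wrap the lookup as a Spark UDF that returns a string."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(make_lookup(mapping), StringType())


status_lookup = make_lookup({"A": "Active", "I": "Inactive"})

# With a DataFrame `df` that has a string column "status" (hypothetical):
# status_udf = make_lookup_udf({"A": "Active", "I": "Inactive"})
# df = df.withColumn("status", status_udf(df["status"]))
```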
Or generate a CASE expression dynamically if the mappings are simple; it runs as a native Spark expression, with no Python serialization overhead.
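Generating the CASE expression could look something like this. `sql_literal` and `case_expr` are hypothetical helpers; the escaping assumes Spark's default string-literal parsing (backslash as the escape character), and keys and values are assumed to be strings.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal.

    Assumes Spark's default parser settings, where backslash is the
    escape character inside string literals.
    """
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def case_expr(column, mapping):
    """Build a CASE expression that maps one column and keeps unmapped values."""
    if not mapping:
        # Nothing to map: pass the column through untouched.
        return f"`{column}`"
    whens = " ".join(
        f"WHEN `{column}` = {sql_literal(k)} THEN {sql_literal(v)}"
        for k, v in mapping.items()
    )
    return f"CASE {whens} ELSE `{column}` END AS `{column}`"


expr = case_expr("status", {"A": "Active", "I": "Inactive"})

# With a DataFrame `df` (hypothetical): df.selectExpr("id", expr)
```

One expression string per mapped column, all passed to a single selectExpr, also avoids the repeated .withColumn calls.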
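Consider pandas API on Spark (formerly Koalas)
Since your data is tiny (30k rows), you may not need classic PySpark DataFrames at all.
Plain pandas is much faster at this size because it skips Spark's DAG planning and scheduling overhead entirely; pandas API on Spark (pyspark.pandas) keeps the pandas syntax but still runs as Spark jobs.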