- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-28-2025 06:57 AM
Hi. I have a PySpark notebook that takes 25 minutes to run as opposed to one minute in on-prem Linux + Pandas. How can I speed it up?
It's not a volume issue. The input is around 30k rows. Output is the same because there's no filtering or aggregation; just creating new fields. No collect, count, or display statements (which would slow it down).
The main thing is a bunch of mappings I need to apply, but it depends on existing fields and there are various models I need to run. So the mappings are different depending on variable and model. That's where the for loops come in.
Now I'm not iterating over the dataframe itself; just over 15 fields (different variables) and 4 different mappings. Then do that 10 times (once per model).
The worker is m5d 2x large and drivers are r4 2x large, min/max workers are 4/20. This should be fine.
I attached a pic to illustrate the code flow. Does anything stand out that you think I could change or that you think Spark is slow at, such as json.load or create_map?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-28-2025 09:10 AM
Spark is optimized for 100s of GB or millions of rows, NOT small in-memory lookups with heavy control flow (unless engineered carefully).
That's why Pandas is much faster for your specific case now.
Pre-load and Broadcast All Mappings
Instead of loading json.loads inside loop every time
Use Single Bulk Transformation Instead of Nested withColumn
Instead of .withColumn inside two nested loops — build all new columns in one transformation.
Build a list of new columns first, then apply .selectExpr or .select(*cols) once.
Map via UDF or SQL CASE
If mappings are small and fixed, UDF can be very fast:
Or, generate a CASE statement dynamically if mappings are simple.
Consider pandas_on_spark (koalas)
Since your data is tiny (30k rows), maybe don't even use PySpark classic
Way faster because it bypasses Spark DAG overhead for small data.