Data Engineering

Best Cluster Setup for intensive transformation workload

AChang
New Contributor III

I have a PySpark dataframe with 61k rows and 3 columns, one of which is a string column with a max length of 4k characters. I'm applying about 100 different regexp_replace functions to this dataframe, so it's very resource intensive. I'm trying to write the result to a Delta table, but no matter what compute I use, I can't get it to finish in an hour. I know the code works because I limited it to 500 rows as a test and it ran in about 30 seconds, so the problem has to do with the magnitude of the data. Has anyone done something at this scale before, and do you know how I can get this to run in an hour without breaking the bank?
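For context, here is a minimal sketch of what a pipeline like this might look like; the column name, table name, and patterns are placeholders, not the actual ones used here. Chaining the regexp_replace calls into a single column expression lets Spark evaluate them in one pass per row rather than materializing intermediate columns:

```python
from functools import reduce
from pyspark.sql import functions as F

# Each entry is (pattern, replacement); in practice there are ~100 of these.
replacements = [
    (r"\bfoo\b", "bar"),
    (r"\s+", " "),
    # ... ~100 patterns total
]

# Fold all regexp_replace calls into one chained column expression.
cleaned = reduce(
    lambda col, pr: F.regexp_replace(col, pr[0], pr[1]),
    replacements,
    F.col("text_col"),
)

df_out = df.withColumn("text_col", cleaned)

# Write the transformed dataframe out as a Delta table.
df_out.write.format("delta").mode("overwrite").saveAsTable("my_schema.cleaned_table")
```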

1 REPLY

Leonardo
New Contributor III

It seems you're applying a lot of transformations, but they're basic operations, so I'd start with the best-practices documentation and set up a compute-optimized cluster for this workload.

Ref.: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#basic-batch-etl
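One more thing worth checking, as a hedged suggestion: with only 61k rows, the dataframe may be sitting in just a few partitions, so most cores of a compute-optimized cluster would be idle during the CPU-bound regex work. Something along these lines (a sketch, assuming the default SparkSession named `spark` and a dataframe `df`) can spread the work across the cluster before the transformation:

```python
# Match the number of partitions to the total worker cores so the
# regex-heavy transformation parallelizes across the whole cluster.
num_cores = spark.sparkContext.defaultParallelism
df = df.repartition(num_cores)
```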
