Best Cluster Setup for an Intensive Transformation Workload
08-08-2023 11:35 AM
I have a PySpark DataFrame with 61k rows and 3 columns, one of which is a string column with a max length of about 4k characters. I'm applying roughly 100 different regexp_replace functions to this DataFrame, so it's very resource intensive. I'm trying to write the result to a Delta table, but no matter what compute I use I can't get it to finish within an hour. I know the code works, because when I limited it to 500 rows as a test it ran in about 30 seconds, so the problem is just the size of the data. Has anyone done something at this scale before, and do you know how I can get this to run within an hour without breaking the bank?
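Roughly what the job looks like, as a minimal sketch only — the column name (`raw_text`), the pattern list, and the table names are placeholders rather than my actual code, and `spark` is the notebook's session:

```python
from functools import reduce
from pyspark.sql import functions as F

# (pattern, replacement) pairs -- stand-ins for the ~100 real regexes
replacements = [
    (r"\s+", " "),
    (r"[^\x20-\x7E]", ""),
    # ...about 100 of these in the real job
]

df = spark.table("source_table")  # placeholder source

# Fold all the regexp_replace calls into one nested column expression,
# so the plan stays a single projection over the string column instead
# of ~100 intermediate DataFrames.
col_expr = reduce(
    lambda col, pat_repl: F.regexp_replace(col, pat_repl[0], pat_repl[1]),
    replacements,
    F.col("raw_text"),
)

cleaned = df.withColumn("raw_text", col_expr)
cleaned.write.format("delta").mode("overwrite").saveAsTable("target_table")
```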
08-09-2023 07:14 AM
It looks like you're applying a lot of transformations, but they're fairly basic operations, so I'd start from the best-practices documentation and set up a compute-optimized cluster.
Ref.: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#basic-batch-etl
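As a rough illustration only — the node type, runtime version, and worker counts below are placeholders to adapt to your cloud and budget — a compute-optimized cluster spec in the shape the Clusters UI/API expects might look like this:

```python
# Placeholder compute-optimized cluster spec (AWS instance type shown);
# adjust node_type_id, spark_version, and worker counts for your workspace.
cluster_spec = {
    "cluster_name": "regex-etl",
    "spark_version": "13.3.x-scala2.12",   # example LTS runtime
    "node_type_id": "c5d.4xlarge",         # compute-optimized; pick your cloud's equivalent
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
}
```

Regex rewriting is CPU-bound, so compute-optimized workers usually give better price/performance here than memory-optimized ones. It's also worth checking `df.rdd.getNumPartitions()` before the transformations: a 61k-row table often arrives in only one or two partitions, and a `df.repartition(...)` up front is what actually spreads the regex work across all of the cluster's cores.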

