Best Cluster Setup for intensive transformation workload

AChang
New Contributor III

I have a PySpark DataFrame with 61k rows and 3 columns, one of which is a string column with a max length of about 4k characters. I'm applying roughly 100 different regexp_replace transformations to that column, so the job is very resource intensive. I'm trying to write the result to a Delta table, but no matter what compute I use I can't get it to finish within an hour. I know the code works, because when I limited it to 500 rows for testing it ran in about 30 seconds, so the problem seems to be purely the volume of data. Has anyone done something at this scale before, and do you know how I can get this to run in under an hour without breaking the bank?
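For context, a minimal sketch of the kind of job described above (not the original poster's code): the DataFrame name `df`, the column name `text`, the pattern list, and the target table name are all placeholders. One common pattern is to fold the ~100 regexp_replace calls into a single column expression rather than ~100 separate withColumn calls, which keeps the logical plan to one projection.

```python
# Sketch only -- df, column name, patterns, and table name are placeholders.
from functools import reduce
from pyspark.sql import functions as F

# Hypothetical (pattern, replacement) pairs; the real job has ~100 of these.
replacements = [
    (r"\bfoo\b", "bar"),
    (r"\s{2,}", " "),
    # ... more pairs ...
]

# Fold all replacements into one column expression instead of chaining
# many withColumn calls, so the plan stays a single projection.
cleaned = reduce(
    lambda col, pr: F.regexp_replace(col, pr[0], pr[1]),
    replacements,
    F.col("text"),  # the ~4k-character string column
)

(df.withColumn("text", cleaned)
   .repartition(64)              # spread the regex work across cores
   .write.format("delta")
   .mode("overwrite")
   .saveAsTable("my_catalog.my_schema.cleaned_table"))  # placeholder name
```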

1 REPLY

Leonardo
New Contributor III

It seems you're applying a lot of transformations, but they're fairly basic, so I'd follow the cluster-configuration best-practices documentation and set up a compute-optimized cluster.

Ref.: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#basic-batch-etl
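As a rough illustration of what the linked best-practices page is pointing at, a compute-optimized cluster definition might look something like the sketch below, written as the Clusters API payload in a Python dict. The node type, runtime version, and worker counts are illustrative assumptions for AWS, not a recommendation from the documentation; check the page for the equivalent node families on your cloud.

```python
# Illustrative cluster spec (Databricks Clusters API payload as a Python dict).
# Node type, runtime version, and worker counts are assumptions, not advice.
compute_optimized_cluster = {
    "cluster_name": "regex-etl-compute-optimized",   # placeholder name
    "spark_version": "14.3.x-scala2.12",             # example LTS runtime
    "node_type_id": "c5d.2xlarge",                   # compute-optimized family (AWS)
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,                   # avoid paying for idle time
}
```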
