Data Engineering

Best Cluster Setup for intensive transformation workload

AChang
New Contributor III

I have a PySpark dataframe with 61k rows and 3 columns, one of which is a string column with a max length of about 4k characters. I'm applying about 100 different regexp_replace calls to that column, so it's very resource-intensive. I'm trying to write the result to a Delta table, but no matter what compute I use I can't get it to finish within an hour. I know the code works because I limited it to 500 rows as a test and it ran in about 30 seconds, so the problem is just the scale of the data. Has anyone done something on this scale before, and do you know how I can get this to run in under an hour without breaking the bank?
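For reference, a minimal sketch of what a pipeline like this typically looks like. The column name, table names, and patterns below are placeholders (the ~100 real patterns would go in the replacements list), and chaining via functools.reduce is just one way to apply them, not the poster's actual code:

```python
from functools import reduce
from pyspark.sql import functions as F

# Placeholder (pattern, replacement) pairs -- substitute the ~100 real ones.
replacements = [
    (r"\bfoo\b", "bar"),
    (r"\s{2,}", " "),
    # ... roughly 100 of these in total
]

# `spark` is the SparkSession provided by the Databricks notebook/job.
df = spark.table("source_table")  # hypothetical source table

# Chain every regexp_replace onto the string column in one pass.
cleaned = reduce(
    lambda acc, pr: acc.withColumn(
        "text_col", F.regexp_replace("text_col", pr[0], pr[1])
    ),
    replacements,
    df,
)

# 61k rows can land in very few partitions; repartitioning lets every core
# in the cluster share the regex work instead of one or two tasks doing it all.
(
    cleaned.repartition(64)
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("target_table")  # hypothetical target table
)
```

One design note on this shape: ~100 chained withColumn calls produce a very deep logical plan, so part of the slowdown may be plan analysis rather than the regexes themselves; folding the replacements into fewer column expressions can reduce that overhead.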

1 REPLY

Leonardo
New Contributor III

It sounds like you're applying a lot of transformations, but they're basic string operations, so I'd start with the cluster configuration best-practices documentation and set up a compute-optimized cluster.

Ref.: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#basic-batch-etl
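To make that concrete, here's a hedged sketch of what a compute-optimized job cluster spec could look like through the Clusters API; the node type, runtime version, and worker counts are assumptions and will differ by cloud, region, and workload:

```python
# Hypothetical job-cluster spec for the Databricks Clusters API.
# The node type shown is an AWS compute-optimized instance -- pick the
# equivalent family on Azure/GCP and size the autoscale range to your budget.
cluster_spec = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "c5.2xlarge",           # compute-optimized workers
    "driver_node_type_id": "c5.2xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
```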
