Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

AutoML dataset too large

Mirko
Contributor

Hello community,

I have the following problem: I am using AutoML for a regression task, but during preprocessing my dataset is sampled down to ~30% of its original size.

I am using Databricks Runtime 14.2 ML

Driver: Standard_DS4_v2 28GB Memory 8 cores

Worker: Standard_DS4_v2 28GB Memory 8 cores (min 1, max 2)

I already set spark.task.cpus = 8, but my dataset is still downsampled 😞
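For reference, spark.task.cpus typically has to go into the cluster's Spark configuration (Compute → cluster → Advanced options → Spark) before the cluster starts; it generally cannot be changed from a running notebook:

```
spark.task.cpus 8
```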

 
The Catalog says my table has the following size:
Size: 264.5 MiB, 8 files
 
I don't know why it still doesn't fit.
 
Any help would be appreciated
 
Mirko

 

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager
Community Manager

Hi @Mirko, I understand the challenge you’re facing with your AutoML regression model and the downsampling of your dataset.

Let’s break down the situation and explore potential solutions:

  1. Dataset Size and Downsampling:

    • Your dataset is currently downsampled to approximately 30% of its original size during preprocessing.
    • The catalog indicates that your table size is 264.5 MiB and consists of 8 files.
  2. AutoML Behavior:

    • AutoML trains each trial on a single node, so the full dataset (after preprocessing and encoding) must fit in the memory of one worker.
    • AutoML estimates the memory required to load and train on your dataset, and samples it down if the estimate exceeds what is available. Setting spark.task.cpus to the number of cores per worker (as you did) reserves a whole node per trial, but sampling can still occur if the estimated in-memory size is too large.

  3. Potential Solutions:

    • Let’s explore a few steps to address this issue:
      • Increase Worker Resources:
        • You’re currently using Standard_DS4_v2 workers with 28GB memory and 8 cores. Consider increasing the worker resources (both memory and cores) to accommodate a larger dataset.
      • Check Memory Usage:
        • Verify that your dataset isn’t consuming excessive memory during preprocessing. Sometimes, other operations or transformations might be using memory beyond the dataset itself.
      • Partitioning and Parallelism:
        • Ensure that your data is properly partitioned. If your dataset is not partitioned, consider repartitioning it to optimize parallelism during processing.
      • Sample Size and Quality:
        • Assess whether a smaller sample still provides representative data for your regression task. If so, downsampling might be acceptable.
      • Time-Based Split:
        • If your data has a time component (e.g., timestamps), consider splitting it into training, validation, and test sets based on time. This can help maintain temporal consistency.
      • Class Weights (for Imbalanced Data):
        • This applies to classification rather than regression, so it is likely not relevant here; for imbalanced classification tasks, class weights compensate for skewed label distributions.
      • Review Preprocessing Steps:
        • Double-check any additional preprocessing steps (e.g., feature engineering, missing value imputation) to ensure they’re not inadvertently affecting the dataset size.
  4. Next Steps:

    • I recommend:
      • Inspecting Memory Usage: Check if memory is being used efficiently during preprocessing.
      • Reviewing Partitioning: Ensure proper data partitioning.
      • Assessing Sample Size: Evaluate whether downsampling impacts model performance significantly.
      • Exploring Class Weights: If applicable, consider class weights for imbalanced data.

Remember that AutoML aims to simplify the model-building process, but understanding its behavior and tuning parameters can lead to better results. If you encounter any specific issues during these steps, feel free to ask for further assistance! 😊
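As a rough sanity check on point 2: a compressed Parquet/Delta table often expands several-fold when decompressed into memory, and encoding steps such as one-hot encoding multiply the width further. A back-of-the-envelope sketch (the 7x decompression factor and the 20x encoding blowup below are illustrative assumptions, not measured values):

```python
def estimate_in_memory_mib(disk_mib: float,
                           decompression_factor: float = 7.0,
                           encoding_blowup: float = 1.0) -> float:
    """Rough in-memory size estimate for a compressed columnar table.

    disk_mib: on-disk size of the Parquet/Delta files.
    decompression_factor: assumed expansion when decompressed into memory
        (often quoted as ~5-10x for Parquet; an assumption, not a rule).
    encoding_blowup: extra multiplier for preprocessing such as one-hot
        encoding of categorical columns.
    """
    return disk_mib * decompression_factor * encoding_blowup

# The 264.5 MiB table from this thread, with no encoding blowup:
print(f"{estimate_in_memory_mib(264.5):.0f} MiB")  # ~1852 MiB at 7x

# The same table if encoding multiplies the width by, say, 20x:
print(f"{estimate_in_memory_mib(264.5, encoding_blowup=20) / 1024:.1f} GiB")
```

Numbers like these show how a table that looks tiny on disk can exceed a 28 GB worker once decompressed and encoded.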

 


3 REPLIES 3


Mirko
Contributor

Thank you for your detailed answer. I followed your suggestions with the following results:

- repartitioning the data didn't change anything

- I checked the worker metrics and the memory is indeed nearly fully used (10 GB used, nearly 17 GB cached)

- I do not fully understand why my relatively small dataset creates such a large memory demand; maybe it comes from the number of categorical features. One-hot encoding could create many "extra columns"
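The suspicion about categorical width can be checked with simple arithmetic: one-hot encoding turns each categorical column into one column per distinct value, so the encoded width is the sum of the cardinalities. A small illustration (the cardinalities are hypothetical):

```python
def onehot_width(cardinalities):
    """Total number of columns produced by one-hot encoding categorical
    columns with the given numbers of distinct values."""
    return sum(cardinalities)

# A few ordinary categoricals stay manageable:
print(onehot_width([5, 12, 30]))      # 47 columns

# A single second-precision timestamp spanning just one day explodes it:
print(onehot_width([24 * 60 * 60]))   # 86400 columns
```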

 

Mirko
Contributor

I am pretty sure I know what the problem was: I had a timestamp column (with second precision) as a feature. If it gets one-hot encoded, the dataset can become very large.
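One way to avoid this, assuming the full second precision is not needed as a feature, is to derive a few low-cardinality features from the timestamp and drop the raw column before handing the data to AutoML. A sketch in plain Python (the feature names are hypothetical; the same idea works with the pandas `.dt` accessors or `pyspark.sql.functions` such as `hour` and `dayofweek`):

```python
from datetime import datetime

def timestamp_features(ts: datetime) -> dict:
    """Derive low-cardinality features from a second-precision timestamp,
    so one-hot encoding produces 24 + 7 + 12 columns instead of one
    column per distinct second."""
    return {"hour": ts.hour, "dayofweek": ts.weekday(), "month": ts.month}

print(timestamp_features(datetime(2024, 1, 15, 13, 45, 30)))
# {'hour': 13, 'dayofweek': 0, 'month': 1}
```

Alternatively, newer AutoML versions accept an argument to exclude columns from training entirely (check the AutoML docs for your runtime), which would keep the raw timestamp out of the encoder altogether.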