AutoML dataset too large

Mirko
Contributor

Hello community,

I have the following problem: I am using AutoML to solve a regression problem, but during preprocessing my dataset is sampled down to ~30% of its original size.

I am using Databricks Runtime 14.2 ML.

Driver: Standard_DS4_v2, 28 GB memory, 8 cores

Worker: Standard_DS4_v2, 28 GB memory, 8 cores (min 1, max 2)

I already set spark.task.cpus = 8, but my dataset is still downsampled 😞
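For reference, I set it in the cluster configuration (Compute → Advanced options → Spark config), which as far as I understand should give each AutoML trial the memory of a whole node:

```
spark.task.cpus 8
```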

 
The Catalog shows the following size for my table:
Size: 264.5 MiB, 8 files

I don't know why it still doesn't fit.

Any help would be appreciated.
 
Mirko

 

1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @Mirko, I understand the challenge you’re facing with your AutoML regression model and the downsampling of your dataset.

Let’s break down the situation and explore potential solutions:

  1. Dataset Size and Downsampling:

    • Your dataset is currently downsampled to approximately 30% of its original size during preprocessing.
    • The catalog indicates that your table size is 264.5 MiB and consists of 8 files.
  2. AutoML Behavior:

    • Databricks AutoML trains each trial on a single node, so the dataset must fit in that node’s memory after it is loaded and encoded. If AutoML estimates that it won’t fit, it samples the dataset down.
    • Note that 264.5 MiB is the compressed on-disk size; after decompression and one-hot encoding, the in-memory footprint can be many times larger.
    • Setting spark.task.cpus to the number of cores per worker (as you did) is the documented way to give each trial the full memory of a node.
  3. Potential Solutions:

    • Let’s explore a few steps to address this issue (the repartitioning and time-column pieces are illustrated in the sketch at the end of this post):
      • Increase Worker Resources:
        • You’re currently using Standard_DS4_v2 workers with 28 GB memory and 8 cores. Consider a worker type with more memory per node to accommodate a larger dataset.
      • Check Memory Usage:
        • Verify that your dataset isn’t consuming excessive memory during preprocessing. Sometimes other operations or transformations use memory beyond the dataset itself.
      • Partitioning and Parallelism:
        • Ensure that your data is properly partitioned. If it is not, consider repartitioning it to optimize parallelism during processing.
      • Sample Size and Quality:
        • Assess whether a smaller sample still provides representative data for your regression task. If so, downsampling might be acceptable.
      • Time-Based Split:
        • If your data has a time component (e.g., timestamps), consider splitting it into training, validation, and test sets based on time. This helps maintain temporal consistency.
      • Class Weights (for Imbalanced Data):
        • These apply to classification tasks with imbalanced labels; since yours is a regression problem, this is most likely not relevant here.
      • Review Preprocessing Steps:
        • Double-check any additional preprocessing steps (e.g., feature engineering, missing-value imputation) to ensure they’re not inadvertently inflating the dataset.
  4. Next Steps:

    • I recommend:
      • Inspecting Memory Usage: Check whether memory is being used efficiently during preprocessing.
      • Reviewing Partitioning: Ensure proper data partitioning.
      • Assessing Sample Size: Evaluate whether downsampling significantly impacts model performance.
      • Exploring Class Weights: If applicable (classification only), consider class weights for imbalanced data.

Remember that AutoML aims to simplify the model-building process, but understanding its behavior and tuning parameters can lead to better results. If you encounter any specific issues during these steps, feel free to ask for further assistance! 😊
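As a concrete starting point, here is a minimal sketch of the repartitioning and time-based split, assuming the databricks-automl Python API that ships with ML runtimes; the table and column names are placeholders you would replace with your own:

```python
from databricks import automl

# Placeholder table name - substitute your own catalog/schema/table.
df = spark.table("my_catalog.my_schema.my_table")

# Repartition so preprocessing parallelism matches the cluster.
df = df.repartition(64)

# Sanity check: 264.5 MiB on disk is compressed; the decompressed,
# one-hot-encoded in-memory frame can be many times larger.
print(df.count(), "rows x", len(df.columns), "columns")

# Regression run; time_col requests a chronological train/val/test split.
summary = automl.regress(
    dataset=df,
    target_col="label",      # placeholder target column
    time_col="event_ts",     # placeholder timestamp column, if present
    timeout_minutes=60,
)
```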

 


3 REPLIES


Mirko
Contributor

Thank you for your detailed answer. I followed your suggestions, with the following result:

- Repartitioning the data didn't change anything.

- I checked the metrics of the workers, and the memory is indeed nearly fully used (10 GB used, nearly 17 GB cached).

- I do not fully understand why my relatively small dataset creates such a big memory demand. Maybe it results from the number of categorical features: one-hot encoding could produce many "extra columns". A quick way to check this hypothesis is sketched below.
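A minimal sketch for estimating the one-hot width by counting distinct values per string column (the table name is a placeholder):

```python
from pyspark.sql import functions as F

df = spark.table("my_catalog.my_schema.my_table")  # placeholder table name

# Each distinct value of a string column becomes one extra column
# after one-hot encoding, so high cardinality blows up memory.
string_cols = [f.name for f in df.schema.fields
               if f.dataType.simpleString() == "string"]
counts = df.agg(*[F.countDistinct(c).alias(c) for c in string_cols]) \
           .first().asDict()
print(sorted(counts.items(), key=lambda kv: -kv[1]))
```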

 

Mirko
Contributor

I am pretty sure that I know what the problem was: I had a timestamp column (with second precision) as a feature. If a column like that gets one-hot encoded, almost every row produces its own category, and the dataset can get very large.
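For anyone hitting the same issue, a minimal sketch of the workaround, assuming the exclude_cols parameter of the databricks-automl API on recent ML runtimes; table and column names are placeholders:

```python
from databricks import automl

df = spark.table("my_catalog.my_schema.my_table")  # placeholder table name

# Keep the per-second timestamp out of the one-hot-encoded feature set.
summary = automl.regress(
    dataset=df,
    target_col="label",         # placeholder target column
    exclude_cols=["event_ts"],  # placeholder name of the timestamp column
    timeout_minutes=60,
)
```

Alternatively, deriving coarser features from the timestamp (hour of day, day of week) keeps the temporal signal without the per-row cardinality.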
