Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

AutoML Dataset too large

Mirko
Contributor

Hello community,

I have the following problem: I am using AutoML for a regression task (roughly as sketched below), but during preprocessing my dataset is sampled down to about 30% of its original size.
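Roughly what I am running (table and column names are placeholders, not my real ones):

    from databricks import automl

    # load the training table (placeholder name)
    df = spark.table("my_catalog.my_schema.my_table")

    # start the AutoML regression experiment; during preprocessing,
    # AutoML samples the dataset if it estimates it won't fit in memory
    summary = automl.regress(
        dataset=df,
        target_col="my_target",  # placeholder target column
        timeout_minutes=60,
    )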

I am using Databricks Runtime 14.2 ML.

Driver: Standard_DS4_v2, 28 GB memory, 8 cores

Workers: Standard_DS4_v2, 28 GB memory, 8 cores (min 1, max 2)

I already set spark.task.cpus = 8, but my dataset is still downsampled 😞
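For reference, I set it in the cluster's Spark config (under Advanced options > Spark on the compute page); as far as I know, spark.task.cpus has to be set before the cluster starts, not from a notebook:

    spark.task.cpus 8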

 
Catalog says that my table has the following size:
Size: 264.5 MiB, 8 files
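(That is the on-disk Delta size; I checked it like this, with the table name as a placeholder:)

    # on-disk size and file count of the Delta table (placeholder name)
    spark.sql("DESCRIBE DETAIL my_catalog.my_schema.my_table") \
        .select("sizeInBytes", "numFiles") \
        .show()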
 
I don't understand why it still doesn't fit.
 
Any help would be appreciated.
 
Mirko

 

2 REPLIES

Mirko
Contributor

Thank you for your detailed answer. I followed your suggestions, with the following results:

- Repartitioning the data didn't change anything.

- I checked the worker metrics, and memory is indeed nearly fully used (about 10 GB in use, nearly 17 GB cached).

- I do not fully understand why my relatively small dataset creates such a large memory demand. Maybe it comes from the number of categorical features: one-hot encoding could result in many "extra columns" (see the sketch below).
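A minimal sketch of how I would check that (table and column names are placeholders):

    from pyspark.sql import functions as F

    df = spark.table("my_catalog.my_schema.my_table")  # placeholder name

    # one-hot encoding produces one output column per distinct value,
    # so the distinct counts estimate how wide the encoded data gets
    cat_cols = ["cat_a", "cat_b", "cat_c"]  # placeholder column names
    counts = df.select(
        [F.countDistinct(F.col(c)).alias(c) for c in cat_cols]
    ).first()
    print({c: counts[c] for c in cat_cols})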

 

Mirko
Contributor

I am pretty sure I know what the problem was: I had a timestamp column (with second precision) as a feature. If it gets one-hot encoded, the dataset can get very large.
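In case someone else hits this, a sketch of how I would avoid it (names are placeholders): derive a few coarse time features and drop the raw timestamp before handing the data to AutoML.

    from pyspark.sql import functions as F

    df = spark.table("my_catalog.my_schema.my_table")  # placeholder name

    # replace the second-precision timestamp with a few coarse features,
    # so nothing ends up one-hot encoding millions of distinct values
    df = (
        df.withColumn("ts_hour", F.hour("ts"))            # "ts" is the
          .withColumn("ts_dayofweek", F.dayofweek("ts"))  # placeholder
          .withColumn("ts_month", F.month("ts"))          # timestamp column
          .drop("ts")
    )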
