03-13-2024 06:58 AM
Hello community,
I have the following problem: I am using AutoML for a regression problem, but during preprocessing my dataset is sampled down to ~30% of its original size.
I am using runtime 14.2 ML
Driver: Standard_DS4_v2 28GB Memory 8 cores
Worker: Standard_DS4_v2 28GB Memory 8 cores (min 1, max 2)
I already set spark.task.cpus = 8, but my dataset is still downsampled.
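For reference, spark.task.cpus is a task-level setting that generally has to be in place before executors start, so on Databricks it is typically applied via the cluster's Spark config (cluster edit page → Advanced options → Spark) rather than from a notebook. A sketch of that config entry (assuming the standard space-separated key/value format of the cluster Spark config box):

```
spark.task.cpus 8
```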
03-14-2024 03:06 AM
Hi @Mirko, I understand the challenge you're facing with your AutoML regression model and the downsampling of your dataset.
Let's break down the situation and explore potential solutions:
Dataset Size and Downsampling:
AutoML Behavior:
Potential Solutions:
Next Steps:
Remember that AutoML aims to simplify the model-building process, but understanding its behavior and tuning parameters can lead to better results. If you encounter any specific issues during these steps, feel free to ask for further assistance!
03-15-2024 01:56 AM
Thank you for your detailed answer. I followed your suggestions, with the following results:
- Repartitioning the data didn't change anything.
- I checked the worker metrics, and memory is indeed nearly fully used (~10 GB used, nearly 17 GB cached).
- I don't fully understand why my relatively small dataset creates such a big memory demand; maybe it results from the number of categorical features, since one-hot encoding can produce many "extra columns".
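The one-hot blow-up is easy to reproduce. A minimal sketch (with a made-up DataFrame and column names, not the actual dataset) showing how a single categorical column with 1,000 distinct values turns into 1,000 dummy columns:

```python
import pandas as pd

# Hypothetical data: one high-cardinality categorical column ("city")
# and one numeric column ("price").
df = pd.DataFrame({
    "city": [f"city_{i % 1000}" for i in range(5000)],
    "price": range(5000),
})

# One-hot encoding replaces the categorical column with one dummy
# column per distinct value: 1,000 categories -> 1,000 new columns.
encoded = pd.get_dummies(df, columns=["city"])
print(df.shape)       # (5000, 2)
print(encoded.shape)  # (5000, 1001)
```

With several such columns, the encoded width (and thus memory) grows with the sum of the cardinalities, which can easily dwarf the raw dataset size.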
03-19-2024 04:50 AM
I am pretty sure I know what the problem was: I had a timestamp column (with second precision) as a feature. If it gets one-hot encoded, the dataset can get very large.
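A quick sketch (synthetic timestamps, not the original column) confirming that intuition: at second precision nearly every value is distinct, so one-hot encoding such a column adds roughly one dummy column per row:

```python
import pandas as pd

# Synthetic example: 10,000 timestamps, one second apart.
ts = pd.Series(
    pd.date_range("2024-01-01", periods=10_000, freq=pd.Timedelta(seconds=1))
)

# Every value is distinct, so one-hot encoding this single column
# would add ~10,000 dummy columns -- about 100 million cells for a
# 10,000-row dataset.
print(ts.nunique())  # 10000
```

Treating the timestamp as a numeric feature, or deriving coarser features from it (hour, day of week, month), avoids the explosion entirely.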