03-13-2024 06:58 AM
Hello community,
I have the following problem: I am using AutoML for a regression task, but during preprocessing my dataset is sampled down to ~30% of its original size.
I am using Databricks Runtime 14.2 ML.
Driver: Standard_DS4_v2, 28 GB memory, 8 cores
Worker: Standard_DS4_v2, 28 GB memory, 8 cores (min 1, max 2)
I already set spark.task.cpus = 8, but my dataset is still downsampled.
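As a side note for anyone reproducing this: spark.task.cpus is read when the executors start, so it has to be set in the cluster's Spark config rather than from a running notebook. A minimal check that the setting actually took effect, run from a notebook attached to the cluster (spark is the SparkSession predefined in Databricks notebooks):

# Read the cluster-level setting; the default is 1. If this does not
# print 8, the config was not applied when the cluster started.
print(spark.sparkContext.getConf().get("spark.task.cpus", "1"))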
03-14-2024 03:06 AM
Hi @Mirko, I understand the challenge you're facing with your AutoML regression model and the downsampling of your dataset.
Let's break down the situation and explore potential solutions:
Dataset Size and Downsampling: AutoML estimates how much memory is needed to load and train on your dataset, and samples the dataset down when that estimate exceeds what a single worker node can hold.
AutoML Behavior: Setting spark.task.cpus to the number of cores per worker (as you already did) dedicates a whole worker's memory to each trial, which reduces sampling but does not always eliminate it.
Potential Solutions: Try repartitioning the dataset, and check the memory metrics of your workers to confirm whether they really are under memory pressure. A sketch of the repartitioning step follows below.
Next Steps: If memory is indeed the bottleneck, consider a worker type with more RAM, or look at which features inflate the in-memory size of the dataset.
Remember that AutoML aims to simplify the model-building process, but understanding its behavior and tuning parameters can lead to better results. If you encounter any specific issues during these steps, feel free to ask for further assistance!
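A minimal sketch of the repartitioning suggestion, assuming the training data lives in a DataFrame (the table name, target column, and partition count are placeholders, not from the thread):

from databricks import automl

# Spread the data over more, smaller partitions before handing it to
# AutoML, so no single task has to hold an oversized partition in memory.
df = spark.table("training_data")   # hypothetical source table
df = df.repartition(64)             # partition count is an assumption; tune for your cluster

summary = automl.regress(dataset=df, target_col="label", timeout_minutes=60)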
03-15-2024 01:56 AM
Thank you for your detailed answer. I followed your suggestions, with the following result:
- Repartitioning the data didn't change anything.
- I checked the metrics of the workers, and the memory is indeed nearly fully used (10 GB used, nearly 17 GB cached).
- I do not fully understand why my relatively small dataset creates such a big memory demand. Maybe it comes from the number of categorical features, since one-hot encoding can produce many "extra columns".
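One quick way to test that theory is to count the distinct values of each string column, since one-hot encoding produces roughly one output column per distinct value. A sketch, assuming the training DataFrame is called df (hypothetical name) and has at least one string column:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Each distinct value of a categorical column becomes one extra column
# after one-hot encoding, so the sum approximates the encoded width.
cat_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
widths = df.agg(*[F.countDistinct(c).alias(c) for c in cat_cols]).first().asDict()
print(widths)
print("approx. one-hot encoded columns:", sum(widths.values()))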
03-19-2024 04:50 AM
I am pretty sure I know what the problem was: I had a timestamp column (with second precision) as a feature. If its values get one-hot encoded, almost every row produces its own column, so the dataset can get very large.
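For anyone hitting the same issue, a minimal sketch of the workaround, with hypothetical column names: replace the second-precision timestamp with a few coarse derived features (or drop it entirely) before passing the DataFrame to AutoML.

from pyspark.sql import functions as F

# "event_ts" is a placeholder for the second-precision timestamp feature.
# Derive low-cardinality features from it, then drop the raw column so the
# one-hot encoder never sees millions of near-unique values.
df_fixed = (df
    .withColumn("event_hour", F.hour("event_ts"))
    .withColumn("event_dayofweek", F.dayofweek("event_ts"))
    .withColumn("event_month", F.month("event_ts"))
    .drop("event_ts"))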