Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

AutoML dataset too large

Mirko
Contributor

Hello community,

I have the following problem: I am using AutoML for a regression task, but during preprocessing my dataset is sampled down to ~30% of its original size.

I am using Databricks Runtime 14.2 ML

Driver: Standard_DS4_v2 28GB Memory 8 cores

Worker: Standard_DS4_v2 28GB Memory 8 cores (min 1, max 2)

I already set spark.task.cpus = 8, but my dataset is still downsampled 😞
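For reference, spark.task.cpus typically has to go into the cluster's Spark configuration (Compute → cluster → Advanced options → Spark) before the cluster starts; it generally cannot be changed from a running notebook:

```
spark.task.cpus 8
```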

 
The Catalog says my table has the following size:
Size: 264.5 MiB, 8 files
 
I don't know why it still doesn't fit.
 
Any help would be appreciated
 
Mirko

 

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager
Community Manager

Hi @Mirko, I understand the challenge you’re facing with your AutoML regression model and the downsampling of your dataset.

Let’s break down the situation and explore potential solutions:

  1. Dataset Size and Downsampling:

    • Your dataset is currently downsampled to approximately 30% of its original size during preprocessing.
    • The catalog indicates that your table size is 264.5 MiB and consists of 8 files.
  2. AutoML Behavior:

    • AutoML trains each trial on a single node, so the full dataset (after preprocessing and encoding) must fit in the memory of one worker.
    • AutoML estimates the memory required to load and train on your dataset, and samples it down if the estimate exceeds what is available. Setting spark.task.cpus to the number of cores per worker (as you did) reserves a whole node per trial, but sampling can still occur if the estimated in-memory size is too large.

  3. Potential Solutions:

    • Let’s explore a few steps to address this issue:
      • Increase Worker Resources:
        • You’re currently using Standard_DS4_v2 workers with 28GB memory and 8 cores. Consider increasing the worker resources (both memory and cores) to accommodate a larger dataset.
      • Check Memory Usage:
        • Verify that your dataset isn’t consuming excessive memory during preprocessing. Sometimes, other operations or transformations might be using memory beyond the dataset itself.
      • Partitioning and Parallelism:
        • Ensure that your data is properly partitioned. If your dataset is not partitioned, consider repartitioning it to optimize parallelism during processing.
      • Sample Size and Quality:
        • Assess whether a smaller sample still provides representative data for your regression task. If so, downsampling might be acceptable.
      • Time-Based Split:
        • If your data has a time component (e.g., timestamps), consider splitting it into training, validation, and test sets based on time. This can help maintain temporal consistency.
      • Class Weights (for Imbalanced Data):
        • This applies to classification rather than regression, so it is likely not relevant here; for imbalanced classification tasks, class weights compensate for skewed label distributions.
      • Review Preprocessing Steps:
        • Double-check any additional preprocessing steps (e.g., feature engineering, missing value imputation) to ensure they’re not inadvertently affecting the dataset size.
  4. Next Steps:

    • I recommend:
      • Inspecting Memory Usage: Check if memory is being used efficiently during preprocessing.
      • Reviewing Partitioning: Ensure proper data partitioning.
      • Assessing Sample Size: Evaluate whether downsampling impacts model performance significantly.
      • Exploring Class Weights: If applicable, consider class weights for imbalanced data.

Remember that AutoML aims to simplify the model-building process, but understanding its behavior and tuning parameters can lead to better results. If you encounter any specific issues during these steps, feel free to ask for further assistance! 😊
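As a rough sanity check on point 2: a compressed Parquet/Delta table often expands several-fold when decompressed into memory, and encoding steps such as one-hot encoding multiply the width further. A back-of-the-envelope sketch (the 7x decompression factor and the 20x encoding blowup below are illustrative assumptions, not measured values):

```python
def estimate_in_memory_mib(disk_mib: float,
                           decompression_factor: float = 7.0,
                           encoding_blowup: float = 1.0) -> float:
    """Rough in-memory size estimate for a compressed columnar table.

    disk_mib: on-disk size of the Parquet/Delta files.
    decompression_factor: assumed expansion when decompressed into memory
        (often quoted as ~5-10x for Parquet; an assumption, not a rule).
    encoding_blowup: extra multiplier for preprocessing such as one-hot
        encoding of categorical columns.
    """
    return disk_mib * decompression_factor * encoding_blowup

# The 264.5 MiB table from this thread, with no encoding blowup:
print(f"{estimate_in_memory_mib(264.5):.0f} MiB")  # ~1852 MiB at 7x

# The same table if encoding multiplies the width by, say, 20x:
print(f"{estimate_in_memory_mib(264.5, encoding_blowup=20) / 1024:.1f} GiB")
```

Numbers like these show how a table that looks tiny on disk can exceed a 28 GB worker once decompressed and encoded.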

 


3 REPLIES 3


Mirko
Contributor

Thank you for your detailed answer. I followed your suggestions with the following results:

- repartitioning the data didn't change anything

- I checked the worker metrics and the memory is indeed nearly fully used (10 GB used, nearly 17 GB cached)

- I do not fully understand why my relatively small dataset creates such a large memory demand; maybe it comes from the number of categorical features. One-hot encoding could create many "extra columns"
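The suspicion about categorical width can be checked with simple arithmetic: one-hot encoding turns each categorical column into one column per distinct value, so the encoded width is the sum of the cardinalities. A small illustration (the cardinalities are hypothetical):

```python
def onehot_width(cardinalities):
    """Total number of columns produced by one-hot encoding categorical
    columns with the given numbers of distinct values."""
    return sum(cardinalities)

# A few ordinary categoricals stay manageable:
print(onehot_width([5, 12, 30]))      # 47 columns

# A single second-precision timestamp spanning just one day explodes it:
print(onehot_width([24 * 60 * 60]))   # 86400 columns
```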

 

Mirko
Contributor

I am pretty sure I know what the problem was: I had a timestamp column (with second precision) as a feature. If it gets one-hot encoded, the dataset can become very large.
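One way to avoid this, assuming the full second precision is not needed as a feature, is to derive a few low-cardinality features from the timestamp and drop the raw column before handing the data to AutoML. A sketch in plain Python (the feature names are hypothetical; the same idea works with the pandas `.dt` accessors or `pyspark.sql.functions` such as `hour` and `dayofweek`):

```python
from datetime import datetime

def timestamp_features(ts: datetime) -> dict:
    """Derive low-cardinality features from a second-precision timestamp,
    so one-hot encoding produces 24 + 7 + 12 columns instead of one
    column per distinct second."""
    return {"hour": ts.hour, "dayofweek": ts.weekday(), "month": ts.month}

print(timestamp_features(datetime(2024, 1, 15, 13, 45, 30)))
# {'hour': 13, 'dayofweek': 0, 'month': 1}
```

Alternatively, newer AutoML versions accept an argument to exclude columns from training entirely (check the AutoML docs for your runtime), which would keep the raw timestamp out of the encoder altogether.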