AutoML dataset too large

Mirko
Contributor

Hello community,

I have the following problem: I am using AutoML to solve a regression problem, but during preprocessing my dataset is sampled down to ~30% of its original size.

I am using Databricks Runtime 14.2 ML.

Driver: Standard_DS4_v2, 28 GB memory, 8 cores

Worker: Standard_DS4_v2, 28 GB memory, 8 cores (min 1, max 2)

I already set spark.task.cpus = 8, but my dataset is still downsampled 😞
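For reference, I set it in the cluster configuration (Compute → Advanced options → Spark config), which as far as I understand should give each AutoML trial the memory of a whole node:

```
spark.task.cpus 8
```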

 
The Catalog shows the following size for my table:
Size: 264.5 MiB, 8 files

I don't know why it still doesn't fit.

Any help would be appreciated.
 
Mirko

 

1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @Mirko, I understand the challenge you’re facing with your AutoML regression model and the downsampling of your dataset.

Let’s break down the situation and explore potential solutions:

  1. Dataset Size and Downsampling:

    • Your dataset is currently downsampled to approximately 30% of its original size during preprocessing.
    • The catalog indicates that your table size is 264.5 MiB and consists of 8 files.
  2. AutoML Behavior:

    • Databricks AutoML trains each trial on a single node, so the dataset must fit in that node’s memory after it is loaded and encoded. If AutoML estimates that it won’t fit, it samples the dataset down.
    • Note that 264.5 MiB is the compressed on-disk size; after decompression and one-hot encoding, the in-memory footprint can be many times larger.
    • Setting spark.task.cpus to the number of cores per worker (as you did) is the documented way to give each trial the full memory of a node.
  3. Potential Solutions:

    • Let’s explore a few steps to address this issue (the repartitioning and time-column pieces are illustrated in the sketch at the end of this post):
      • Increase Worker Resources:
        • You’re currently using Standard_DS4_v2 workers with 28 GB memory and 8 cores. Consider a worker type with more memory per node to accommodate a larger dataset.
      • Check Memory Usage:
        • Verify that your dataset isn’t consuming excessive memory during preprocessing. Sometimes other operations or transformations use memory beyond the dataset itself.
      • Partitioning and Parallelism:
        • Ensure that your data is properly partitioned. If it is not, consider repartitioning it to optimize parallelism during processing.
      • Sample Size and Quality:
        • Assess whether a smaller sample still provides representative data for your regression task. If so, downsampling might be acceptable.
      • Time-Based Split:
        • If your data has a time component (e.g., timestamps), consider splitting it into training, validation, and test sets based on time. This helps maintain temporal consistency.
      • Class Weights (for Imbalanced Data):
        • These apply to classification tasks with imbalanced labels; since yours is a regression problem, this is most likely not relevant here.
      • Review Preprocessing Steps:
        • Double-check any additional preprocessing steps (e.g., feature engineering, missing-value imputation) to ensure they’re not inadvertently inflating the dataset.
  4. Next Steps:

    • I recommend:
      • Inspecting Memory Usage: Check whether memory is being used efficiently during preprocessing.
      • Reviewing Partitioning: Ensure proper data partitioning.
      • Assessing Sample Size: Evaluate whether downsampling significantly impacts model performance.
      • Exploring Class Weights: If applicable (classification only), consider class weights for imbalanced data.

Remember that AutoML aims to simplify the model-building process, but understanding its behavior and tuning parameters can lead to better results. If you encounter any specific issues during these steps, feel free to ask for further assistance! 😊
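As a concrete starting point, here is a minimal sketch of the repartitioning and time-based split, assuming the databricks-automl Python API that ships with ML runtimes; the table and column names are placeholders you would replace with your own:

```python
from databricks import automl

# Placeholder table name - substitute your own catalog/schema/table.
df = spark.table("my_catalog.my_schema.my_table")

# Repartition so preprocessing parallelism matches the cluster.
df = df.repartition(64)

# Sanity check: 264.5 MiB on disk is compressed; the decompressed,
# one-hot-encoded in-memory frame can be many times larger.
print(df.count(), "rows x", len(df.columns), "columns")

# Regression run; time_col requests a chronological train/val/test split.
summary = automl.regress(
    dataset=df,
    target_col="label",      # placeholder target column
    time_col="event_ts",     # placeholder timestamp column, if present
    timeout_minutes=60,
)
```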

 


3 REPLIES


Mirko
Contributor

Thank you for your detailed answer. I followed your suggestions, with the following result:

- Repartitioning the data didn't change anything.

- I checked the metrics of the workers, and the memory is indeed nearly fully used (10 GB used, nearly 17 GB cached).

- I do not fully understand why my relatively small dataset creates such a big memory demand. Maybe it results from the number of categorical features: one-hot encoding could produce many "extra columns". A quick way to check this hypothesis is sketched below.
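A minimal sketch for estimating the one-hot width by counting distinct values per string column (the table name is a placeholder):

```python
from pyspark.sql import functions as F

df = spark.table("my_catalog.my_schema.my_table")  # placeholder table name

# Each distinct value of a string column becomes one extra column
# after one-hot encoding, so high cardinality blows up memory.
string_cols = [f.name for f in df.schema.fields
               if f.dataType.simpleString() == "string"]
counts = df.agg(*[F.countDistinct(c).alias(c) for c in string_cols]) \
           .first().asDict()
print(sorted(counts.items(), key=lambda kv: -kv[1]))
```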

 

Mirko
Contributor

I am pretty sure that I know what the problem was: I had a timestamp column (with second precision) as a feature. If a column like that gets one-hot encoded, almost every row produces its own category, and the dataset can get very large.
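For anyone hitting the same issue, a minimal sketch of the workaround, assuming the exclude_cols parameter of the databricks-automl API on recent ML runtimes; table and column names are placeholders:

```python
from databricks import automl

df = spark.table("my_catalog.my_schema.my_table")  # placeholder table name

# Keep the per-second timestamp out of the one-hot-encoded feature set.
summary = automl.regress(
    dataset=df,
    target_col="label",         # placeholder target column
    exclude_cols=["event_ts"],  # placeholder name of the timestamp column
    timeout_minutes=60,
)
```

Alternatively, deriving coarser features from the timestamp (hour of day, day of week) keeps the temporal signal without the per-row cardinality.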
