
notebook stuck at "filtering data" or waiting to run

harry_dfe
New Contributor

Hi, my data is in a sparse vector representation, and it was working fine (display and training ML models). I added a few features that converted the data from a sparse to a dense representation, and after that anything I try to perform on the data gets stuck (display or ML model). The RAM maxed out at 28; I don't know if it is a compute issue or a data representation (dense vs. sparse) issue.

 

It's not even writing a new table to the database; when I run any block it says "filtering data" and gets stuck there. I ran an ML model overnight, but it was still stuck in the morning (8 hours).


Louis_Frolio
Databricks Employee

Greetings @harry_dfe,

 

Thanks for the details. This almost certainly stems from your data flipping from a sparse vector representation to a dense one, which explodes per-row memory and stalls actions like display, writes, and ML training.
 

Why this is happening

  • A dense vector of size n stores all n values (8 bytes each for doubles), so its storage grows linearly with n. A sparse vector stores only non-zeros plus indices, and is much smaller when most entries are zero. In Spark MLlib, dense costs ≈ 8·n bytes, while sparse costs ≈ 12·nnz + 4 bytes; sparse is better whenever more than ~1/3 of entries are zero (see the back-of-the-envelope sketch after this list).
  • Some transformers force dense output:
    • StandardScaler withMean=true creates a dense output by design; it's only safe with sparse input if withMean=false.
    • MinMaxScaler turns zeros into non-zeros during rescaling, so it outputs DenseVector even for sparse input.
  • Many MLlib algorithms leverage sparsity for speed and memory (e.g., logistic regression, SVM, Lasso, Naive Bayes, KMeans), so losing sparsity can dramatically slow training and increase memory pressure.
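To put those formulas in concrete terms, here is a quick back-of-the-envelope calculation; the vector length and non-zero count are made-up numbers purely for illustration:

```python
# Hypothetical example: a 100,000-dimensional feature vector with only 500 non-zeros.
n, nnz = 100_000, 500

dense_bytes = 8 * n            # ~800 KB per row as a DenseVector
sparse_bytes = 12 * nnz + 4    # ~6 KB per row as a SparseVector

print(f"dense ≈ {dense_bytes / 1024:.0f} KB, sparse ≈ {sparse_bytes / 1024:.1f} KB, "
      f"ratio ≈ {dense_bytes / sparse_bytes:.0f}x")
```

Losing sparsity on vectors like this multiplies per-row memory by roughly two orders of magnitude, which is exactly the kind of jump that makes display, writes, and training stall.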
 

Quick triage you can run now

1. Measure vector size and density on a small sample. If density jumped, that confirms the cause.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.sql.functions import udf

schema = StructType([
    StructField("size", IntegerType(), False),
    StructField("nnz", IntegerType(), False),
    StructField("density", DoubleType(), False)
])

def vstats(v):
    nnz = v.numNonzeros()
    return (v.size, nnz, float(nnz) / float(v.size) if v.size else 0.0)

vstats_udf = udf(vstats, schema)

sample = df.sample(0.001, seed=42)  # adjust fraction as needed
density_df = sample.select(vstats_udf("features").alias("s")) \
    .select("s.size", "s.nnz", "s.density")
density_df.summary("count", "mean", "min", "max").show()
```
2. Avoid displaying huge vectors. Use display/select on non-vector columns, or drop the features column for inspection:

```python
display(df.select("id", "label").limit(1000))

# Or, if you must preview feature values, convert the vector to an array
# and only show a small slice:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col, slice as slice_

display(df.select(slice_(vector_to_array(col("features")), 1, 20).alias("features_slice")).limit(100))
```
3. Remove pipeline stages that force dense:
    • If you use StandardScaler, set withMean=False to preserve sparsity:

```python
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(withMean=False, withStd=True,
                        inputCol="features", outputCol="scaledFeatures")
model = scaler.fit(df)
df_scaled = model.transform(df)
```
    • Avoid MinMaxScaler on high-dimensional sparse features; it will output dense vectors.
4. If a recent step converted to dense, revert or rework the step:
    • Prefer sparse-friendly feature engineering (e.g., HashingTF, CountVectorizer, OneHotEncoder) and avoid transformations that create many small non-zeros (a short pipeline sketch follows this list).
5. For writes that show "filtering data" and stall, write without the heavy vector column:

```python
(df.drop("features")
   .repartition(200)   # adjust based on data size
   .write.mode("overwrite")
   .saveAsTable("schema.table_name"))
```
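As a rough illustration of the sparse-friendly route mentioned in item 4, here is a minimal sketch; the column names (text, category_idx) and the numFeatures value are assumptions for the example, not taken from your notebook:

```python
# Hypothetical pipeline: HashingTF and OneHotEncoder both emit SparseVector output,
# and VectorAssembler keeps the cheaper representation.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, OneHotEncoder, VectorAssembler

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18)
ohe = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])
assembler = VectorAssembler(inputCols=["tf", "category_vec"], outputCol="features")

pipeline = Pipeline(stages=[tokenizer, tf, ohe, assembler])
# model = pipeline.fit(df)  # features generally stay sparse end to end
```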

ML pipeline guidance

  • Keep features in SparseVector wherever possible; many MLlib algorithms compute faster and use less memory with sparse input.
  • If you truly need dense for a downstream stage, lower dimensionality first (e.g., PCA or feature selection) so the dense vector is small. Note that some reducers still output dense, so apply them after you've minimized feature count (a minimal PCA sketch follows this list).
  • Be careful with scalers:
    • StandardScaler: use withMean=false to keep sparse; withMean=true centers data and forces dense output.
    • MinMaxScaler: always outputs DenseVector; avoid on wide sparse vectors (a quick check is sketched after this list).
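If a downstream stage genuinely needs dense input, here is a minimal sketch of reducing dimensionality first, assuming the vector column is named features and picking k=50 arbitrarily:

```python
from pyspark.ml.feature import PCA

# k=50 is an arbitrary example value; tune it for your data.
pca = PCA(k=50, inputCol="features", outputCol="features_pca")
df_reduced = pca.fit(df).transform(df)
# The PCA output is dense, but only k-dimensional, so the per-row cost stays small.
```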
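And a quick way to see the MinMaxScaler behavior for yourself on a toy DataFrame (the data below is made up; spark is the notebook's built-in session):

```python
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

# Two rows of 5-dimensional sparse vectors, just to demonstrate the densification.
toy = spark.createDataFrame(
    [(Vectors.sparse(5, [(0, 2.0), (3, 1.0)]),),
     (Vectors.sparse(5, [(1, 4.0)]),)],
    ["features"],
)
scaled = MinMaxScaler(inputCol="features", outputCol="scaled").fit(toy).transform(toy)
scaled.select("scaled").show(truncate=False)  # rows come back as DenseVector
```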

If you still need to convert back to sparse

Only as a last resort (and only on reasonably sized vectors), you can convert DenseVector to SparseVector, but it requires scanning the array; do this on a small subset or after dimensionality reduction:

```python
from pyspark.ml.linalg import Vectors, VectorUDT, DenseVector
from pyspark.sql.functions import udf

def to_sparse(v):
    if isinstance(v, DenseVector):
        arr = v.toArray()
        idx_vals = [(i, float(val)) for i, val in enumerate(arr) if val != 0.0]
        return Vectors.sparse(v.size, idx_vals)
    return v  # already sparse

to_sparse_udf = udf(to_sparse, VectorUDT())
df_sparse = df.withColumn("features", to_sparse_udf("features"))
```
 

What likely caused the "RAM maxed to 28" symptom

Converting high-dimensional sparse features to dense increases each row's in-memory footprint by orders of magnitude. Actions like display (which collects a sample to the driver), wide shuffles during writes, and ML training that materializes feature vectors will then hit driver/executor memory ceilings and appear "stuck." Switching back to sparse or reducing dimensionality typically resolves this.
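A rough way to sanity-check this on your own table, using the 8-bytes-per-double figure from above (ballpark only, ignores JVM object overhead, and assumes the vector column is named features):

```python
# Ballpark dense footprint: rows * vector length * 8 bytes per double.
rows = df.count()
n = df.select("features").head()[0].size   # vector length from the first row
approx_gb = rows * n * 8 / 1e9
print(f"~{approx_gb:.1f} GB if every row were materialized as a dense vector")
```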
 
Hope this helps, Louis.