Greetings @harry_dfe,
Thanks for the details. This almost certainly stems from your data flipping from a sparse vector representation to a dense one, which explodes per-row memory and stalls actions like display, writes, and ML training.
Why this is happening
- A dense vector of size n stores all n values (8 bytes each for doubles), so its storage grows linearly with n. A sparse vector stores only non-zeros plus indices, and is much smaller when most entries are zero. In Spark MLlib, dense costs ≈ 8·n bytes, while sparse costs ≈ 12·nnz + 4 bytes; sparse is better whenever more than ~1/3 of entries are zero.
- Some transformers force dense output:
- StandardScaler with withMean=true creates a dense output by design; it's only safe with sparse input if withMean=false.
- MinMaxScaler turns zeros into non-zeros during rescaling, so it outputs DenseVector even for sparse input.
- Many MLlib algorithms leverage sparsity for speed and memory (e.g., logistic regression, SVM, Lasso, Naive Bayes, KMeans), so losing sparsity can dramatically slow training and increase memory pressure.
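To make the storage math above concrete, here is a minimal pure-Python sketch of the approximate per-row cost (the 8·n and 12·nnz + 4 figures are the MLlib estimates quoted above; the helper names are mine for illustration):

```python
def dense_vector_bytes(n):
    # Dense: ~8 bytes (one double) for every one of the n entries
    return 8 * n

def sparse_vector_bytes(nnz):
    # Sparse: ~12 bytes per stored entry (8-byte value + 4-byte index), plus ~4 bytes overhead
    return 12 * nnz + 4

# A 1M-dimensional feature vector with only 1,000 non-zeros:
n, nnz = 1_000_000, 1_000
print(dense_vector_bytes(n))     # 8000000 bytes (~8 MB) per row
print(sparse_vector_bytes(nnz))  # 12004 bytes (~12 KB) per row
```

At this density, going dense multiplies each row's footprint by roughly 650x, which is exactly the kind of jump that makes display, shuffles, and training appear to hang.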
Quick triage you can run now
1. Measure vector size and density on a small sample. If density jumped, that confirms the cause.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.sql.functions import udf

schema = StructType([
    StructField("size", IntegerType(), False),
    StructField("nnz", IntegerType(), False),
    StructField("density", DoubleType(), False),
])

def vstats(v):
    nnz = v.numNonzeros()
    return (v.size, nnz, float(nnz) / float(v.size) if v.size else 0.0)

vstats_udf = udf(vstats, schema)

sample = df.sample(0.001, seed=42)  # adjust fraction as needed
density_df = (sample.select(vstats_udf("features").alias("s"))
              .select("s.size", "s.nnz", "s.density"))
density_df.summary("count", "mean", "min", "max").show()
```
- Avoid displaying huge vectors. Use display/select on non-vector columns, or drop the features column for inspection. Note that SQL `slice` works on arrays, not vectors, so convert with `vector_to_array` first:

```python
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col, slice as array_slice

display(df.select("id", "label").limit(1000))

# Or, if you must preview feature values, only show a small slice:
display(df.select(
    array_slice(vector_to_array(col("features")), 1, 20).alias("features_slice")
).limit(100))
```
- Remove pipeline stages that force dense output:
- If you use StandardScaler, set withMean=False to preserve sparsity:

```python
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(withMean=False, withStd=True,
                        inputCol="features", outputCol="scaledFeatures")
model = scaler.fit(df)
df_scaled = model.transform(df)
```
- Avoid MinMaxScaler on high-dimensional sparse features; it will output dense vectors.
- If a recent step converted to dense, revert or rework the step:
- Prefer sparse-friendly feature engineering (e.g., HashingTF, CountVectorizer, OneHotEncoder) and avoid transformations that create many small non-zeros.
- For writes that show "filtering data" and stall, write without the heavy vector column:

```python
(df.drop("features")
   .repartition(200)  # adjust based on data size
   .write.mode("overwrite")
   .saveAsTable("schema.table_name"))
```
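To see why the withMean setting matters so much, here is a pure-Python sketch of the per-entry standardization math (illustrative only; `standardize` is a hypothetical helper, not MLlib code):

```python
def standardize(x, mean, std, with_mean):
    # Centering subtracts the mean: a zero entry becomes -mean/std, i.e. non-zero.
    if with_mean:
        return (x - mean) / std
    # Scaling only: a zero entry stays exactly zero, so sparsity survives.
    return x / std

mean, std = 2.5, 1.5
print(standardize(0.0, mean, std, with_mean=True))   # non-zero: sparsity destroyed
print(standardize(0.0, mean, std, with_mean=False))  # 0.0: sparsity preserved
```

Since most entries in a sparse vector are zero, withMean=True turns nearly every entry non-zero at once, which is why the output must be dense.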
ML pipeline guidance
- Keep features in SparseVector wherever possible; many MLlib algorithms compute faster and use less memory with sparse input.
- If you truly need dense for a downstream stage, lower dimensionality first (e.g., PCA or feature selection) so the dense vector is small. Note that some reducers still output dense, so apply them after you've minimized feature count.
- Be careful with scalers:
- StandardScaler: use withMean=false to keep sparse; withMean=true centers data and forces dense output.
- MinMaxScaler: always outputs DenseVector; avoid on wide sparse vectors.
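The MinMaxScaler point can be seen directly from its rescaling formula. A minimal sketch, assuming Spark's per-feature formula (x - Emin) / (Emax - Emin) * (max - min) + min with the default [0, 1] output range (`min_max_scale` is a hypothetical helper for illustration):

```python
def min_max_scale(x, col_min, col_max, lo=0.0, hi=1.0):
    # Per-feature min-max rescaling into [lo, hi]
    return (x - col_min) / (col_max - col_min) * (hi - lo) + lo

# A zero entry in a feature whose observed range is [-3, 5]:
print(min_max_scale(0.0, col_min=-3.0, col_max=5.0))  # 0.375: the zero became non-zero
```

Whenever a feature's minimum is non-zero, every zero entry maps to a non-zero value, so the output cannot stay sparse.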
If you still need to convert back to sparse
Only as a last resort (and only on reasonably sized vectors), you can convert DenseVector to SparseVector, but it requires scanning the array, so do this on a small subset or after dimensionality reduction:

```python
from pyspark.ml.linalg import Vectors, VectorUDT, DenseVector
from pyspark.sql.functions import udf

def to_sparse(v):
    if isinstance(v, DenseVector):
        arr = v.toArray()
        idx_vals = [(i, float(val)) for i, val in enumerate(arr) if val != 0.0]
        return Vectors.sparse(v.size, idx_vals)
    return v  # already sparse

to_sparse_udf = udf(to_sparse, VectorUDT())
df_sparse = df.withColumn("features", to_sparse_udf("features"))
```
What likely caused the "RAM maxed to 28" symptom
Converting high-dimensional sparse features to dense increases each row's in-memory footprint by orders of magnitude. Actions like display (which collect a sample to the driver), wide shuffles during writes, and ML training that materializes feature vectors will then hit driver/executor memory ceilings and appear "stuck." Switching back to sparse or reducing dimensionality typically resolves this.
Hope this helps, Louis.