cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Why do Spark MLlib models only accept a vector column as input?

User16826992666
Valued Contributor

In other libraries I can just use the feature columns themselves as inputs, why do I need to make a vector out of my features when I use MLlib?

2 REPLIES 2

User16826992666
Valued Contributor

The modeling algorithms in Spark MLlib will only accept a vectorized column as input. This is done for reasons of efficiency and scaling.

The vector assembler will express the features efficiently using techniques like spark vector, which allow a larger amount of data to be handled with less memory. This helps the modeling algorithms run efficiently even on large data columns.

sean_owen
Honored Contributor II
Honored Contributor II

Yeah, it's more a design choice. Rather than have every implementation take column(s) params, this is handled once in VectorAssembler for all of them. One way or the other, most implementations need a vector of inputs anyway. VectorAssembler can do some optimizations to use sparse vectors too where applicable.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.