cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Why do Spark MLlib models only accept a vector column as input?

User16826992666
Valued Contributor

In other libraries I can just use the feature columns themselves as inputs, why do I need to make a vector out of my features when I use MLlib?

2 REPLIES 2

User16826992666
Valued Contributor

The modeling algorithms in Spark MLlib will only accept a vectorized column as input. This is done for reasons of efficiency and scaling.

The vector assembler will express the features efficiently using techniques like spark vector, which allow a larger amount of data to be handled with less memory. This helps the modeling algorithms run efficiently even on large data columns.

sean_owen
Honored Contributor II
Honored Contributor II

Yeah, it's more a design choice. Rather than have every implementation take column(s) params, this is handled once in VectorAssembler for all of them. One way or the other, most implementations need a vector of inputs anyway. VectorAssembler can do some optimizations to use sparse vectors too where applicable.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!