Feature Store: Can create_training_set() be made to perform an inner join?

thib
New Contributor III

For time series feature tables, an inner join is performed when the training set is created. For the other types of feature tables, a left join is performed, so NaN values can show up in the training set. Could the create_training_set() method take a parameter to perform an inner join instead?
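For reference, a minimal sketch of the create_training_set() call being discussed, using the Databricks Feature Store client (databricks.feature_store); the lookup key and label column are illustrative, and training_df is assumed to hold the raw training examples:

from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Illustrative lookup against a non-time-series feature table
feature_lookups = [
    FeatureLookup(
        table_name="recommender_system.customer_features",
        lookup_key="customer_id",
    ),
]

# Features are attached to training_df with a left join, so rows whose
# customer_id has no match in the feature table end up with null feature values
training_set = fs.create_training_set(
    df=training_df,
    feature_lookups=feature_lookups,
    label="rating",  # illustrative label column
)
training_df_with_features = training_set.load_df()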

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

create_training_set performs a left join. It is just a simple function that selects data from the Spark SQL database used by the feature store. You can write your own code with an inner join instead:

# Read the feature tables directly from the tables backing the feature store
customer_features_df = spark.sql("SELECT * FROM recommender_system.customer_features")
product_features_df = spark.sql("SELECT * FROM recommender_system.product_features")

# training_df holds the raw training examples; replace the default left join
# with inner joins so unmatched rows are dropped instead of producing nulls
training_df_inner = training_df.join(
  customer_features_df,
  on=[training_df.cid == customer_features_df.customer_id,
      training_df.transaction_dt == customer_features_df.dt],
  how="inner"
).join(
  product_features_df,
  on="product_id",
  how="inner"
)
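If you would rather stay on the Feature Store API than hand-write the joins, one workaround (a sketch, not a built-in option) is to keep create_training_set() and drop the rows whose looked-up features came back null; the column names in subset are illustrative:

# training_set is the object returned by fs.create_training_set(...)
training_df_with_features = training_set.load_df()

# Dropping rows with null feature values approximates inner-join behaviour
inner_like_df = training_df_with_features.dropna(
    subset=["customer_feature_1", "product_feature_1"]  # illustrative feature columns
)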


3 REPLIES

Anonymous
Not applicable

Hello, @Thibault Daoulas! My name is Piper, and I'm a moderator here in the community. It's nice to meet you and welcome to the community. Thank you for your question!

We'll give the community some time to respond, and then we will come back if we need to. 🙂


thib
New Contributor III

Thank you Hubert, that's a good alternative. I had thought I'd stick to the API as much as possible, but this solves it.
