Databricks Community

thib · ‎02-01-2022

For time series feature tables, an inner join is made when the feature table is created. For other types of feature tables, a left join is made, so NaN values can show up in the training set. Could the create_training_set() method take a parameter to perform an inner join instead?

Hubert-Dudek · ‎02-01-2022

create_training_set performs a left join. It is just a simple function that selects data from the Spark SQL database used by the Feature Store. You can write your own code with an inner join instead:

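# Load the feature tables from the database used by the Feature Store.
customer_features_df = spark.sql("SELECT * FROM recommender_system.customer_features")
product_features_df = spark.sql("SELECT * FROM recommender_system.product_features")

# training_df is assumed to already exist and to hold the lookup keys
# (cid, transaction_dt, product_id) along with the label.
training_set_df = training_df.join(
  customer_features_df,
  # The key columns are named differently on each side, so join on explicit conditions.
  on=[training_df.cid == customer_features_df.customer_id,
      training_df.transaction_dt == customer_features_df.dt],
  how="inner"  # drop rows without matching customer features
).join(
  product_features_df,
  on="product_id",  # same column name on both sides
  how="inner"  # drop rows without matching product features
)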

My blog: https://databrickster.medium.com/

Anonymous · ‎02-01-2022

Hello, @Thibault Daoulas! My name is Piper, and I'm a moderator here in the community. It's nice to meet you and welcome to the community. Thank you for your question!

We'll give the community some time to respond, and then we will come back if we need to. 🙂

thib · ‎02-02-2022

Thank you, Hubert, that's a good alternative. I just thought I'd stick to the API as much as possible, but this solves it.
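
For readers who, like the original poster, would rather stay with the Feature Store API, one possible workaround is to let create_training_set() do its left join and then drop the rows whose feature columns came back null. The sketch below is only an approximation of an inner join, not something the API provides: the helper name drop_unmatched_rows, the feature names price and category, and the label column name are all hypothetical, and the commented usage assumes the databricks.feature_store client is available in the workspace.

```python
from typing import TYPE_CHECKING, Iterable

if TYPE_CHECKING:
    # Imported for type hints only, so the helper has no hard runtime dependency.
    from pyspark.sql import DataFrame


def drop_unmatched_rows(training_df: "DataFrame", feature_cols: Iterable[str]) -> "DataFrame":
    """Keep only rows in which every listed feature column is non-null.

    Hypothetical helper: applied to the output of a left join, it
    approximates an inner join. Caveat: a row whose key did match but whose
    stored feature value is itself null is dropped as well.
    """
    cols = list(feature_cols)
    if not cols:
        raise ValueError("feature_cols must name at least one column")
    # DataFrame.dropna removes rows containing a null in any of the subset columns.
    return training_df.dropna(how="any", subset=cols)


# Usage on Databricks (not executed here; feature and label names are hypothetical):
#
#   from databricks.feature_store import FeatureLookup, FeatureStoreClient
#
#   fs = FeatureStoreClient()
#   lookups = [
#       FeatureLookup(
#           table_name="recommender_system.product_features",
#           feature_names=["price", "category"],
#           lookup_key="product_id",
#       ),
#   ]
#   training_set = fs.create_training_set(
#       df=training_df, feature_lookups=lookups, label="label",
#   )
#   inner_like_df = drop_unmatched_rows(training_set.load_df(), ["price", "category"])
```

Because the filter cannot tell "no matching key" apart from "matched, but the feature value is null", the explicit inner join shown in the accepted answer remains the safer choice when feature tables can legitimately contain nulls.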

Feature store : Can create_training_set() be implemented to execute an inner join?
