
Feature table: merge very slow

Ashley1
Contributor

Hi All,

We've just started to look at the feature store capabilities of Databricks. Our first attempt to create a feature table has resulted in very slow writes. To avoid the time incurred by the feature functions, I generated a dataframe with the same keys but with feature values generated from rand(). It is <700K rows and ~280 feature columns. It doesn't seem to matter whether we use merge mode or overwrite; it is very slow (8.5 mins on a standalone 4-core cluster). I've attached the SQL detail from the Spark UI. It takes only 3.5s to scan the files for matches but nearly 8.5 mins to rewrite the matched files. While it appears to be about 1.5GB of output, that seems an incredibly long time to rewrite the changes. I've tried with and without partitioning and repartitioning. Can anyone offer some insights or suggestions to improve the performance of the feature/delta table writes?
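For reference, a test dataframe like that can be put together along these lines (just a sketch; the feature_{i} column names are illustrative, and base_df is the primary-key-only dataframe mentioned below):

from pyspark.sql import functions as F

# ~280 feature columns filled from rand(), keyed by the same customer_id/obs_date rows
n_features = 280
feature_cols = [F.rand().alias(f'feature_{i}') for i in range(n_features)]
to_write_df = base_df.select('customer_id', 'obs_date', *feature_cols)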

Regards,

Ashley

Historical Spark UI for cluster 0622-013318-zoqth84b, driver 332737051535251367 - Details for Query 352

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Do you merge on 280 columns? Can't it be just by id, to simplify?

I would merge on id only, and OPTIMIZE and ZORDER by id before the merge if possible.
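For example, something along these lines before running the merge (just a sketch; substitute your own table and key columns):

# Compact the Delta table and cluster the data files by the merge keys,
# so the matched-file scan and rewrite touch fewer, better-organised files
spark.sql("OPTIMIZE dsci_features.ab_test ZORDER BY (customer_id, obs_date)")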

Ashley1
Contributor

I'm fairly sure the merge is keyed on the primary keys declared when the feature table is created. For this experiment I wrote the feature table first with only the primary keys, then later added the features (~280 columns), joined them to the original primary-key-only dataframe, and wrote that using:

fs.write_table(
  name='dsci_features.ab_test',
  df=to_write_df,
  mode='merge'
)

The primary keys are in the order they were originally written; they're also sorted (customer id then observation date).

Here's how the table was created:

fs.create_table(
  name='dsci_features.ab_test',
  primary_keys=['customer_id', 'obs_date'],
  # If df is provided, this data will be saved in a Delta table
  #df=feature_df,
  # If schema is provided, no data will be saved yet
  schema=base_df.schema,
  #partition_columns=['cal_quarter'],
  description='AB Test'
)

primary keys written:

fs.write_table(
  name='dsci_features.ab_test',
  df=base_df,
  mode='merge'
)

features written:

fs.write_table(
  name='dsci_features.ab_test',
  df=to_write_df,
  mode='merge'
)

If I just insert the features into the newly created feature table, without priming the table with primary keys (effectively an append), it takes around 5 mins. Inserting the primary keys first in the experiment was only there to force a merge. 5 minutes for the insert-only path seems a bit excessive as well.
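In case it helps anyone hitting the same thing, the rewrite cost of each write can be inspected from the Delta history and table detail (just a sketch, run after a slow write; the merge row's operationMetrics shows files added/removed):

# Recent operations on the feature table, including per-MERGE operationMetrics
display(spark.sql('DESCRIBE HISTORY dsci_features.ab_test LIMIT 5'))
# Current file count and total size of the table
display(spark.sql('DESCRIBE DETAIL dsci_features.ab_test'))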

Vidula
Honored Contributor

Hi @Ashley Betts

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
