
Feature table: merge very slow

Ashley1
Contributor

Hi All,

We've just started to look at the feature store capabilities of Databricks. Our first attempt to create a feature table has resulted in a very slow write. To avoid the time incurred by the feature functions, I generated a dataframe with the same keys but with feature values generated from rand(). It is <700K rows and ~280 feature columns. It doesn't seem to matter whether we use merge mode or overwrite, it is very slow (8.5 minutes on a standalone 4-core cluster). I've attached the SQL detail from the Spark UI. It takes about 3.5s to scan the files for matches and nearly 8.5 minutes to rewrite the matched files. While it appears to write about 1.5GB of output, that seems an incredibly long time to rewrite the changes. I've tried with and without partitioning, and with repartitioning. Can anyone offer insights or suggestions to improve the performance of the feature/delta table writes?
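
For reference, the synthetic test dataframe was built along these lines (a minimal sketch; base_df holding only the keys, and the feature names/count, are illustrative):

from pyspark.sql import functions as F

# Sketch: start from a dataframe holding only the primary keys
# (customer_id, obs_date) and attach ~280 random feature columns via rand(),
# so the write cost can be measured without running the real feature functions.
n_features = 280
to_write_df = base_df.select(
    "customer_id",
    "obs_date",
    *[F.rand().alias(f"feature_{i}") for i in range(n_features)],
)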

Regards,

Ashley

Attachment: Historical Spark UI for cluster 0622-013318-zoqth84b, driver 332737051535251367 - Details for Query 352

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Are you merging on all 280 columns? Can it not be just by id, to simplify?

I would merge by id, and OPTIMIZE and ZORDER by id before the merge if possible.
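
A sketch of what I mean, assuming the feature table is named dsci_features.ab_test and keyed on customer_id and obs_date:

# Compact the Delta files behind the feature table and co-locate rows by the
# merge keys, so the MERGE has to rewrite fewer files.
spark.sql("""
    OPTIMIZE dsci_features.ab_test
    ZORDER BY (customer_id, obs_date)
""")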

Ashley1
Contributor

I'm fairly sure the merge is keyed on the primary keys declared when the feature table is created. For this experiment I wrote the feature table first with only the primary keys, then later added the features (~280), joined them to the original primary-key-only dataframe, and wrote that using:

fs.write_table(
  name='dsci_features.ab_test',
  df=to_write_df,
  mode='merge'
)

The rows are in the order they were originally written, and they're also sorted by the primary keys (customer id and observation date).

Here's how the table was created:

fs.create_table(
  name='dsci_features.ab_test',
  primary_keys=['customer_id', 'obs_date'],
  # If df is provided, this data will be saved in a Delta table
  #df=feature_df,
  # If schema is provided, no data will be saved yet
  schema=base_df.schema,
  #partition_columns=['cal_quarter'],
  description='AB Test'
)

primary keys written:

fs.write_table(
  name='dsci_features.ab_test',
  df=base_df,
  mode='merge'
)

features written:

fs.write_table(
  name='dsci_features.ab_test',
  df=to_write_df,
  mode='merge'
)

If I just insert the features into the newly created feature table, without priming the table with the primary keys, it's effectively an append and takes around 5 minutes. The insert of the primary keys in the experiment was only there to force a merge. Even 5 minutes for the insert-only path seems a bit excessive.
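
One way to see how much the rewrite phase has to touch is to inspect the physical layout of the underlying Delta table; a diagnostic sketch, assuming the table is dsci_features.ab_test:

# DESCRIBE DETAIL reports the file count and total size of the Delta table;
# many small files (or a handful of very large ones) make the rewrite phase
# of a MERGE expensive.
detail = spark.sql("DESCRIBE DETAIL dsci_features.ab_test").collect()[0]
print("numFiles:   ", detail["numFiles"])
print("sizeInBytes:", detail["sizeInBytes"])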

Vidula
Honored Contributor

Hi @Ashley Betts

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
