Right, that's exactly what I'm trying to do, but I have no idea how to do it!
I can chunk the spark df with the following:
def df_in_chunks(df, row_count):
    """in: df  out: [df1, df2, ..., df100]"""
    count = df.count()

    if count > row_count:
        num_chunks = count // row_count
        chunk_percent = 1 / num_chunks  # 1% would become 0.01
        return df.randomSplit([chunk_percent] * num_chunks, seed=1234)
    return [df]
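For reference, here's a minimal sketch of how I call it (the toy DataFrame, the 1,000-row chunk size, and the variable names are just placeholders, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy DataFrame standing in for my real data
spark_df = spark.range(0, 5000).withColumnRenamed("id", "feature")

# split into chunks of roughly 1,000 rows each
dfs = df_in_chunks(spark_df, 1000)
print(len(dfs))  # about 5 chunks (randomSplit sizes are approximate)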
So now I have a list of Spark DataFrames, but if I do "for df in dfs: df_pd = df.toPandas(); model.predict(df_pd)" it runs serially, not in parallel. Do you have any suggestion on how to make it parallel?
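To be concrete, here is that loop spelled out (model is just a placeholder for my already-fitted scikit-learn-style model, and dfs is the list returned by df_in_chunks above):

# current approach: each chunk is pulled to the driver and scored one at a time
predictions = []
for chunk in dfs:
    chunk_pd = chunk.toPandas()                   # collects this chunk onto the driver
    predictions.append(model.predict(chunk_pd))   # blocks until this chunk is scored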
Thank you so much!