topic Re: how to make distributed predictions with sklearn model? in Get Started Discussions

how to make distributed predictions with sklearn model?

ChaseM — Wed, 18 Oct 2023 19:07:41 GMT

So I have an sklearn-style model which predicts on a pandas df. The data to predict on is a Spark df. Simply converting the whole thing to pandas at once and predicting is not an option due to time and memory constraints.

Is there a way to chunk a Spark df, use the worker nodes to convert the chunks to pandas and predict on them, then get all the predictions back on the driver node?

I'm a bit new to the Databricks ecosystem, so sorry if the question isn't phrased in the best way, but hopefully I got the goal across.

Thank you so much in advance!

Re: how to make distributed predictions with sklearn model?

ChaseM — Thu, 19 Oct 2023 15:35:19 GMT

Right, that's exactly what I'm trying to do, but I have no idea how to do it!

I can chunk the spark df with the following:

def df_in_chunks(df, row_count):
    """
    in: df
    out: [df1, df2, ..., df100]
    """
    count = df.count()
    if count > row_count:
        num_chunks = count // row_count
        chunk_percent = 1 / num_chunks  # 1% would become 0.01
        return df.randomSplit([chunk_percent] * num_chunks, seed=1234)
    return [df]

So I have a list of Spark dfs, but if I do "for df in dfs: df_pd = df.toPandas(); model.predict(df_pd)", it runs serially, not in parallel. Do you have any suggestions on how to make it parallel?

Thank you so much!
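
One possible way to get the parallelism asked about above is to skip the manual chunking and let Spark hand each partition to the model as pandas batches via DataFrame.mapInPandas (available in Spark 3.0+). The sketch below is only an illustration under some assumptions: the helper names (make_predict_fn, predict_distributed), the feature_cols argument, and the "prediction" column name are all hypothetical, the model is assumed to be picklable, and its predictions are assumed to be numeric.

```python
from typing import Iterator

import pandas as pd


def make_predict_fn(model, feature_cols, output_col="prediction"):
    """Build a function with the iterator-in, iterator-out shape mapInPandas expects.

    `model` is any object with an sklearn-style .predict(); it is captured in
    the closure and shipped to the workers, so it must be picklable.
    """

    def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        # Spark runs this on the workers; each batch is an ordinary pandas
        # DataFrame holding a slice of one partition.
        for pdf in batches:
            out = pdf.copy()
            preds = model.predict(pdf[feature_cols])
            # Assumption: predictions are numeric, so they are stored as floats.
            out[output_col] = pd.Series(preds, index=pdf.index, dtype="float64")
            yield out

    return predict_batches


def predict_distributed(sdf, model, feature_cols, output_col="prediction"):
    """Return a Spark DataFrame: the input columns plus one prediction column."""
    # Imported here so the helper above can be tried out without Spark installed.
    from pyspark.sql.types import DoubleType, StructField, StructType

    # DoubleType matches the float cast above; change both for other outputs.
    schema = StructType(sdf.schema.fields + [StructField(output_col, DoubleType())])
    return sdf.mapInPandas(make_predict_fn(model, feature_cols, output_col), schema=schema)
```

Usage would look something like preds = predict_distributed(spark_df, model, ["col_a", "col_b"]), where the column names are placeholders. The result is still a Spark df, so it can be written out directly, or brought back with .toPandas() if the predictions alone fit in driver memory. If the model is large, broadcasting it with spark.sparkContext.broadcast(model) and reading .value inside the function may be preferable to capturing it in the closure.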