Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

How to make distributed predictions with a sklearn model?

ChaseM
New Contributor II

I have a sklearn-style model that predicts on a pandas df. The data to predict on is in a Spark df. Simply converting the whole thing to pandas at once and predicting is not an option due to time and memory constraints.

Is there a way to chunk a Spark df, use the worker nodes to convert each chunk to pandas and predict on it, and then gather all the predictions back on the driver node?

I'm a bit new to the Databricks ecosystem, so sorry if the question isn't phrased in the best way, but hopefully I got the goal across.

Thank you so much in advance!

1 REPLY

ChaseM
New Contributor II

right, that's exactly what I'm trying to do, but have no idea how to do it!

I can chunk the spark df with the following:

def df_in_chunks(df, row_count):
    """
    in: df
    out: [df1, df2, ..., df100]
    """
    count = df.count()
    if count > row_count:
        num_chunks = count // row_count
        chunk_percent = 1 / num_chunks  # 1% would become 0.01
        return df.randomSplit([chunk_percent] * num_chunks, seed=1234)
    return [df]

So I have a list of Spark dfs, but if I do "for df in dfs: df_pd = df.toPandas(); model.predict(df_pd)" it runs serially, not in parallel. Do you have any suggestions on how to make it parallel?

Thank you so much!
