Databricks Community

ChaseM · 10-18-2023

So I have a sklearn-style model that predicts on a pandas df. The data to predict on is a Spark df. Converting the whole thing to pandas at once and predicting is not an option due to time and memory constraints.

Is there a way to chunk a Spark df, use the worker nodes to convert the chunks to pandas and predict on them, and then get all the predictions back on the driver node?

I'm a bit new to the Databricks ecosystem, so sorry if the question isn't phrased in the best way, but hopefully I got the goal across.

Thank you so much in advance!

ChaseM · 10-19-2023

Right, that's exactly what I'm trying to do, but I have no idea how to do it!

I can chunk the spark df with the following:

def df_in_chunks(df, row_count):
    """
    in: df
    out: [df1, df2, ..., df100]
    """
    count = df.count()
    if count > row_count:
        num_chunks = count // row_count
        chunk_percent = 1 / num_chunks  # e.g. 100 chunks -> 0.01 each
        return df.randomSplit([chunk_percent] * num_chunks, seed=1234)
    return [df]

So I have a list of Spark dfs, but if I do "for df in dfs: df_pd = df.toPandas(); model.predict(df_pd)", it runs serially on the driver, not in parallel. Do you have any suggestions on how to make it parallel?

Thank you so much!
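
One possible way to parallelise the loop above, offered as a rough sketch rather than a definitive answer: instead of splitting the Spark df by hand, let Spark hand each partition to the workers as pandas chunks via DataFrame.mapInPandas (available in Spark 3.0+). The helper names make_predict_fn and predict_distributed, the output column name "prediction", and the assumption that the model returns numeric predictions are all illustrative and not from the thread.

```python
import numpy as np
import pandas as pd


def make_predict_fn(model, feature_cols):
    """Wrap an sklearn-style model for use with DataFrame.mapInPandas."""

    def predict_batches(batches):
        # Spark runs this on the workers. `batches` is an iterator of pandas
        # DataFrames, each holding a slice of one Spark partition.
        for pdf in batches:
            preds = model.predict(pdf[feature_cols])
            # Assumption: predictions are numeric, so they fit a double column.
            yield pdf.assign(prediction=np.asarray(preds, dtype="float64"))

    return predict_batches


def predict_distributed(sdf, model, feature_cols):
    """Return a Spark DataFrame with the input columns plus `prediction`."""
    # Imported lazily so the helper above can be tried with plain pandas.
    from pyspark.sql.types import DoubleType, StructField, StructType

    out_schema = StructType(
        sdf.schema.fields + [StructField("prediction", DoubleType(), True)]
    )
    # The model is pickled along with the function and shipped to the executors.
    return sdf.mapInPandas(make_predict_fn(model, feature_cols), schema=out_schema)
```

Usage would look something like scored = predict_distributed(spark_df, model, ["f1", "f2"]), where the column names are placeholders. The result is still a Spark df, so only the columns you actually need (for example a key column plus prediction) have to be brought back to the driver with toPandas(). The size of each pandas chunk is governed by the number of partitions and by spark.sql.execution.arrow.maxRecordsPerBatch, so a manual randomSplit should not be needed.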

how to make distributed predictions with sklearn model?
