<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: how to make distributed predictions with sklearn model? in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/how-to-make-distributed-predictions-with-sklearn-model/m-p/49541#M1610</link>
    <description>&lt;P&gt;Right, that's exactly what I'm trying to do, but I have no idea how to do it!&lt;/P&gt;&lt;P&gt;I can chunk the Spark df with the following:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN&gt;def df_in_chunks(df, row_count):&lt;/SPAN&gt;
&lt;SPAN&gt;    """&lt;/SPAN&gt;
&lt;SPAN&gt;    Split a Spark df into chunks of roughly row_count rows each.&lt;/SPAN&gt;
&lt;SPAN&gt;    in: df&lt;/SPAN&gt;
&lt;SPAN&gt;    out: [df1, df2, ..., df100]&lt;/SPAN&gt;
&lt;SPAN&gt;    """&lt;/SPAN&gt;
&lt;SPAN&gt;    count = df.count()&lt;/SPAN&gt;
&lt;SPAN&gt;    if count &amp;gt; row_count:&lt;/SPAN&gt;
&lt;SPAN&gt;        num_chunks = count // row_count&lt;/SPAN&gt;
&lt;SPAN&gt;        chunk_percent = 1 / num_chunks  # equal weight per chunk, e.g. 100 chunks gives 0.01 each&lt;/SPAN&gt;
&lt;SPAN&gt;        # randomSplit sizes are approximate, not exactly row_count rows&lt;/SPAN&gt;
&lt;SPAN&gt;        return df.randomSplit([chunk_percent] * num_chunks, seed=1234)&lt;/SPAN&gt;
&lt;SPAN&gt;    return [df]&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;So I have a list of Spark dfs, but if I do "for df in dfs: df_pd = df.toPandas(); model.predict(df_pd)", it runs serially, not in parallel. Do you have any suggestions on how to make it parallel?&lt;/P&gt;&lt;P&gt;Thank you so much!&lt;/P&gt;</description>
    <pubDate>Thu, 19 Oct 2023 15:35:19 GMT</pubDate>
    <dc:creator>ChaseM</dc:creator>
    <dc:date>2023-10-19T15:35:19Z</dc:date>
    <item>
      <title>how to make distributed predictions with sklearn model?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-make-distributed-predictions-with-sklearn-model/m-p/49495#M1605</link>
      <description>&lt;P&gt;I have an sklearn-style model that predicts on a pandas df. The data to predict on is a Spark df. Converting the whole thing to pandas at once and predicting is not an option due to time and memory constraints.&lt;/P&gt;&lt;P&gt;Is there a way to chunk a Spark df, use the worker nodes to convert the chunks to pandas and predict on them, and then get all the predictions back on the driver node?&lt;/P&gt;&lt;P&gt;I'm a bit new to the Databricks ecosystem, so sorry if the question isn't phrased in the best way, but hopefully I got the goal across.&lt;/P&gt;&lt;P&gt;Thank you so much in advance!&lt;/P&gt;</description>
      <pubDate>Wed, 18 Oct 2023 19:07:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-make-distributed-predictions-with-sklearn-model/m-p/49495#M1605</guid>
      <dc:creator>ChaseM</dc:creator>
      <dc:date>2023-10-18T19:07:41Z</dc:date>
    </item>
    <item>
      <title>Re: how to make distributed predictions with sklearn model?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-to-make-distributed-predictions-with-sklearn-model/m-p/49541#M1610</link>
      <description>&lt;P&gt;Right, that's exactly what I'm trying to do, but I have no idea how to do it!&lt;/P&gt;&lt;P&gt;I can chunk the Spark df with the following:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN&gt;def df_in_chunks(df, row_count):&lt;/SPAN&gt;
&lt;SPAN&gt;    """&lt;/SPAN&gt;
&lt;SPAN&gt;    Split a Spark df into chunks of roughly row_count rows each.&lt;/SPAN&gt;
&lt;SPAN&gt;    in: df&lt;/SPAN&gt;
&lt;SPAN&gt;    out: [df1, df2, ..., df100]&lt;/SPAN&gt;
&lt;SPAN&gt;    """&lt;/SPAN&gt;
&lt;SPAN&gt;    count = df.count()&lt;/SPAN&gt;
&lt;SPAN&gt;    if count &amp;gt; row_count:&lt;/SPAN&gt;
&lt;SPAN&gt;        num_chunks = count // row_count&lt;/SPAN&gt;
&lt;SPAN&gt;        chunk_percent = 1 / num_chunks  # equal weight per chunk, e.g. 100 chunks gives 0.01 each&lt;/SPAN&gt;
&lt;SPAN&gt;        # randomSplit sizes are approximate, not exactly row_count rows&lt;/SPAN&gt;
&lt;SPAN&gt;        return df.randomSplit([chunk_percent] * num_chunks, seed=1234)&lt;/SPAN&gt;
&lt;SPAN&gt;    return [df]&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;So I have a list of Spark dfs, but if I do "for df in dfs: df_pd = df.toPandas(); model.predict(df_pd)", it runs serially, not in parallel. Do you have any suggestions on how to make it parallel?&lt;/P&gt;&lt;P&gt;Thank you so much!&lt;/P&gt;</description>
      <pubDate>Thu, 19 Oct 2023 15:35:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-to-make-distributed-predictions-with-sklearn-model/m-p/49541#M1610</guid>
      <dc:creator>ChaseM</dc:creator>
      <dc:date>2023-10-19T15:35:19Z</dc:date>
    </item>
  </channel>
</rss>

