<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Scaling issue for inference with a spark.mllib model in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25367#M17635</link>
    <description>&lt;P&gt;Thanks for the help.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;skew: I did not focus on this point, I'll have a look&lt;/LI&gt;&lt;LI&gt;&lt;I&gt;spark.sql.shuffle.partitions&lt;/I&gt;: already done&lt;/LI&gt;&lt;LI&gt;bigger driver: already done&lt;/LI&gt;&lt;LI&gt;there is indeed a UDF used to retrieve the prediction score; I will have a look at it too&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Update:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;skew: there was a large one, but the problem remained after the fix&lt;/LI&gt;&lt;LI&gt;UDF: only the model UDF exists, and it should already be optimized&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Thu, 17 Mar 2022 12:28:08 GMT</pubDate>
    <dc:creator>admo</dc:creator>
    <dc:date>2022-03-17T12:28:08Z</dc:date>
    <item>
      <title>Scaling issue for inference with a spark.mllib model</title>
      <link>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25365#M17633</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm writing this because I have tried a lot of different approaches to get a simple model inference working, with no success.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here is the outline of the job:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# 1 - Load the base data (~1 billion rows of ~6 columns)
interaction = build_initial_df()
&amp;nbsp;
# 2 - Preprocessing of some of the data (~200 to 2000 columns)
feature_df = add_metadata(interaction, feature_transformer_model)
&amp;nbsp;
# 3 - Cross-feature building and logistic regression inference
model = PipelineModel.load(model_save_path)
prediction_per_user = model.transform(feature_df)
final_df = prediction_per_user.select(...)
&amp;nbsp;
# 4 - Write the final result
write_df_to_snowflake(final_df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This works reasonably well when the input data is about 100 times smaller, but it fails at full scale.&lt;/P&gt;&lt;P&gt;The size of the cluster is reasonable: up to 40 r5d.2xlarge instances, giving a bit more than 2 TB of RAM.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Execution problem:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;Very intense resource usage and a lot of disk spill.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Question:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;My understanding is that model inference is similar to a map operation, and thus highly divisible and able to run in parallel given a reasonable amount of compute.&lt;/P&gt;&lt;P&gt;How could I make it work within my resource budget? What am I missing?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Already tried:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;1) I have already tried using an MLflow model UDF, but it doesn't work on the list features output by a previous feature pipeline model&lt;/P&gt;&lt;P&gt;2) I disabled some Spark optimizations, because they produced a stage that would fill a single executor's disk or time out&lt;/P&gt;&lt;P&gt;3) Forcing a repartition of the DataFrame to process the dataset in smaller chunks (this may be improvable)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks in advance&lt;/P&gt;</description>
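The "already tried" point 3 above (repartitioning to process smaller chunks) can be sized explicitly. A minimal sketch of the arithmetic, assuming the goal is a bounded number of rows per task; the target of one million rows per partition is purely illustrative, not a recommendation from the thread:

```python
def partitions_for(n_rows, target_rows_per_partition=1_000_000):
    """Ceiling division: how many partitions so each task sees a bounded chunk."""
    return max(1, -(-n_rows // target_rows_per_partition))

# Hypothetical use with the poster's DataFrame, before the model transform:
# feature_df = feature_df.repartition(partitions_for(feature_df.count()))
# prediction_per_user = model.transform(feature_df)
```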
      <pubDate>Thu, 17 Mar 2022 09:11:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25365#M17633</guid>
      <dc:creator>admo</dc:creator>
      <dc:date>2022-03-17T09:11:05Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling issue for inference with a spark.mllib model</title>
      <link>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25366#M17634</link>
      <description>&lt;P&gt;It is hard to analyze without the Spark UI and more detailed information, but here are a few tips anyway:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;look for data skew: because of incorrect partitioning, some partitions can be very big and some very small. You can use the Spark UI for that, but also debug your code a bit (call getNumPartitions())&lt;/LI&gt;&lt;LI&gt;increase the shuffle partition count: &lt;I&gt;spark.sql.shuffle.partitions&lt;/I&gt; defaults to 200; try something bigger, I would go to at least 1000&lt;/LI&gt;&lt;LI&gt;increase the size of the driver to be 2 times bigger than an executor (but to find the optimal size, please analyze the load: in Databricks, look at the Metrics tab of the cluster for Ganglia, or even better, integrate Datadog with the cluster)&lt;/LI&gt;&lt;LI&gt;make sure that everything runs in a distributed way, especially UDFs; you need to use vectorized pandas UDFs so they run on the executors&lt;/LI&gt;&lt;/UL&gt;</description>
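The last tip above (vectorized pandas UDFs) can be sketched in pure pandas. `extract_score` below is a hypothetical example, not from the thread: it pulls the class-1 probability out of a batch of probability vectors, one whole pandas Series at a time rather than one row at a time:

```python
import pandas as pd

def extract_score(prob: pd.Series) -> pd.Series:
    """Vectorized UDF body: map a batch of probability vectors to class-1 scores."""
    return prob.map(lambda v: float(v[1]))

# On Databricks this body would be wrapped with the standard pandas UDF API:
# from pyspark.sql.functions import pandas_udf
# score_udf = pandas_udf(extract_score, "double")
# scored = prediction_per_user.withColumn("score", score_udf("probability"))
```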
      <pubDate>Thu, 17 Mar 2022 10:42:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25366#M17634</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-17T10:42:49Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling issue for inference with a spark.mllib model</title>
      <link>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25367#M17635</link>
      <description>&lt;P&gt;Thanks for the help.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;skew: I did not focus on this point, I'll have a look&lt;/LI&gt;&lt;LI&gt;&lt;I&gt;spark.sql.shuffle.partitions&lt;/I&gt;: already done&lt;/LI&gt;&lt;LI&gt;bigger driver: already done&lt;/LI&gt;&lt;LI&gt;there is indeed a UDF used to retrieve the prediction score; I will have a look at it too&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Update:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;skew: there was a large one, but the problem remained after the fix&lt;/LI&gt;&lt;LI&gt;UDF: only the model UDF exists, and it should already be optimized&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 17 Mar 2022 12:28:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25367#M17635</guid>
      <dc:creator>admo</dc:creator>
      <dc:date>2022-03-17T12:28:08Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling issue for inference with a spark.mllib model</title>
      <link>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25369#M17637</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Thanks for checking; unfortunately, no.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I believe my core issue is not being able to properly set the size of the processed chunks of data given my cluster's memory.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My job boils down to a few map operations:&lt;/P&gt;&lt;P&gt;prepare data &amp;lt;200 columns&amp;gt; =&amp;gt; inference &amp;lt;40143 columns + model size&amp;gt; =&amp;gt; final result &amp;lt;3 columns&amp;gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I do understand that the inference part can be costly. But if it is processed chunk by chunk, then in the worst case it should be slow, not fail with memory issues.&lt;/P&gt;&lt;P&gt;Any guidance on forcing this behavior would be welcome.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;PS: Things have also become more difficult with the loss of logging info in the Spark UI and Ganglia. Would you know what the root cause may be?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Capture d’écran 2022-03-22 à 16.34"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2013i66C1DB701828E368/image-size/large?v=v2&amp;amp;px=999" role="button" title="Capture d’écran 2022-03-22 à 16.34" alt="Capture d’écran 2022-03-22 à 16.34" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 22 Mar 2022 15:53:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25369#M17637</guid>
      <dc:creator>admo</dc:creator>
      <dc:date>2022-03-22T15:53:23Z</dc:date>
    </item>
    <item>
      <title>Re: Scaling issue for inference with a spark.mllib model</title>
      <link>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25370#M17638</link>
      <description>&lt;P&gt;The philosophy for the job would be something like this in Scala:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;feature_dataset.foreachPartition { block =&amp;gt;
  block.grouped(10000).foreach { chunk =&amp;gt;
    run_inference_and_write_to_db(chunk)
  }
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Would you know how to do this with PySpark and RDDs?&lt;/P&gt;</description>
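A rough PySpark equivalent of the Scala sketch above could batch each partition's iterator with a small pure-Python helper. `run_inference_and_write_to_db` is the poster's function and is assumed to exist; only `chunked` below is concrete:

```python
from itertools import islice

def chunked(rows, size):
    """Yield lists of at most `size` rows from an iterator, never holding more."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def process_partition(rows):
    # Score and write each bounded chunk instead of the whole partition at once.
    for chunk in chunked(rows, 10000):
        run_inference_and_write_to_db(chunk)  # poster's function (assumed)

# Hypothetical use on the DataFrame's underlying RDD:
# feature_df.rdd.foreachPartition(process_partition)
```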
      <pubDate>Tue, 22 Mar 2022 16:01:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/scaling-issue-for-inference-with-a-spark-mllib-model/m-p/25370#M17638</guid>
      <dc:creator>admo</dc:creator>
      <dc:date>2022-03-22T16:01:42Z</dc:date>
    </item>
  </channel>
</rss>

