<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to save a model produced by distributed training? in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/63590#M3106</link>
    <description>&lt;P&gt;Is there any update on the answer? I am curious too.&lt;/P&gt;&lt;P&gt;Is there a merge operation after all the distributed training has finished?&lt;/P&gt;</description>
    <pubDate>Wed, 13 Mar 2024 17:40:32 GMT</pubDate>
    <dc:creator>Xiaowei</dc:creator>
    <dc:date>2024-03-13T17:40:32Z</dc:date>
    <item>
      <title>How to save a model produced by distributed training?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/20214#M1108</link>
      <description>&lt;P&gt;I am trying to save a model after distributed training via the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import tensorflow as tf
import mlflow.keras
from spark_tensorflow_distributor import MirroredStrategyRunner
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer  # name was truncated by the forum's word filter

mlflow.keras.autolog()
mlflow.log_param("learning_rate", 0.001)

def train():
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    # tf.distribute.experimental.CollectiveCommunication.NCCL

    with strategy.scope():
        data = load_breast_cancer()
        X_train, X_test, y_train, y_test = train_test_split(
            data.data, data.target, test_size=0.3)
        N, D = X_train.shape  # number of observations and features

        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        model = tf.keras.models.Sequential([
            tf.keras.layers.Input(shape=(D,)),
            tf.keras.layers.Dense(1, activation='sigmoid')  # sigmoid output for binary classification
        ])

        model.compile(optimizer='adam',  # adaptive momentum
                      loss='binary_crossentropy',
                      metrics=['accuracy'])

        # Train the model
        r = model.fit(X_train, y_train, validation_data=(X_test, y_test))
        print("Train score:", model.evaluate(X_train, y_train))  # evaluate returns loss and accuracy

        mlflow.keras.log_model(model, "mymodel")

MirroredStrategyRunner(num_slots=4, use_custom_strategy=True).run(train)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;@https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-distributor/spark_tensorflow_distributor/mirrored_strategy_runner.py&lt;/P&gt;&lt;P&gt;I have a couple of questions:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Setting num_slots=4 causes MLflow to log four models, none of which predicts the dataset well. I expected the chief node to log a single model with at least 80% accuracy. Is there a way to save only one model, or to merge the models?&lt;/LI&gt;&lt;LI&gt;How can I save the model without mlflow.keras.log_model? If I save via dbutils I get a race condition, and it is not clear from the Spark distributor which node is the chief.&lt;/LI&gt;&lt;LI&gt;Does every node receive all the data, or only a partition of it?&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Mon, 28 Nov 2022 19:52:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/20214#M1108</guid>
      <dc:creator>kng88</dc:creator>
      <dc:date>2022-11-28T19:52:11Z</dc:date>
    </item>
    <item>
      <title>Re: How to save a model produced by distributed training?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/20216#M1110</link>
      <description>&lt;P&gt;The ModelCheckpoint callback is used in conjunction with model.fit() to save the model or its weights (in a checkpoint file) at some interval, so the model or weights can be loaded later to resume training from the saved state.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Nov 2022 11:25:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/20216#M1110</guid>
      <dc:creator>Frost69</dc:creator>
      <dc:date>2022-11-30T11:25:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to save a model produced by distributed training?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/20217#M1111</link>
      <description>&lt;P&gt;How does ModelCheckpoint know which node is the chief?&lt;/P&gt;&lt;P&gt;Shouldn't there be an API that yields a single resulting model from distributed training?&lt;/P&gt;</description>
      <pubDate>Wed, 30 Nov 2022 20:31:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/20217#M1111</guid>
      <dc:creator>kng88</dc:creator>
      <dc:date>2022-11-30T20:31:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to save a model produced by distributed training?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/63590#M3106</link>
      <description>&lt;P&gt;Is there any update on the answer? I am curious too.&lt;/P&gt;&lt;P&gt;Is there a merge operation after all the distributed training has finished?&lt;/P&gt;</description>
      <pubDate>Wed, 13 Mar 2024 17:40:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/63590#M3106</guid>
      <dc:creator>Xiaowei</dc:creator>
      <dc:date>2024-03-13T17:40:32Z</dc:date>
    </item>
    <item>
      <title>Re: How to save a model produced by distributed training?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/63704#M3112</link>
      <description>&lt;P&gt;I guess spark_tensorflow_distributor is probably obsolete, since it has not been updated since 2020.&lt;/P&gt;&lt;P&gt;Horovod (&lt;SPAN&gt;&lt;A href="https://github.com/horovod" target="_blank"&gt;https://github.com/horovod&lt;/A&gt;&lt;/SPAN&gt;) seems a better choice for using TensorFlow with Spark on Databricks.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Mar 2024 13:44:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/63704#M3112</guid>
      <dc:creator>Xiaowei</dc:creator>
      <dc:date>2024-03-14T13:44:29Z</dc:date>
    </item>
    <item>
      <title>Re: How to save a model produced by distributed training?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/64297#M3141</link>
      <description>&lt;P&gt;I think I finally worked this out.&lt;/P&gt;&lt;P&gt;Here is the extra code to log the model only once, from the first worker (partition 0):&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;import pyspark

context = pyspark.BarrierTaskContext.get()
if context.partitionId() == 0:
    mlflow.keras.log_model(model, "mymodel")&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 21 Mar 2024 13:50:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-save-model-produce-by-distributed-training/m-p/64297#M3141</guid>
      <dc:creator>Xiaowei</dc:creator>
      <dc:date>2024-03-21T13:50:55Z</dc:date>
    </item>
  </channel>
</rss>

