<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column? in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/133968#M4343</link>
    <description>&lt;P&gt;Sorry for the lack of reply. I had to switch tasks, but I hope to be able to test this. Are you suggesting that Pandas on Spark is more efficiently implemented than a Pandas UDF?&lt;/P&gt;</description>
    <pubDate>Mon, 06 Oct 2025 14:30:50 GMT</pubDate>
    <dc:creator>excavator-matt</dc:creator>
    <dc:date>2025-10-06T14:30:50Z</dc:date>
    <item>
      <title>What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/130622#M4281</link>
      <description>&lt;P&gt;We're trying to run the bundled sentence-transformers&amp;nbsp;library&amp;nbsp;from SBERT&amp;nbsp;in a notebook running Databricks ML 16.4 on an &lt;A href="https://aws.amazon.com/ec2/instance-types/g4/" target="_self"&gt;AWS g4dn.2xlarge [T4] instance&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;However, we're experiencing out-of-memory crashes and are wondering what the optimal way to run sentence vector encoding in Databricks is.&lt;/P&gt;&lt;P&gt;We have tried three different approaches, but none of them really works.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1. Skip Spark entirely&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;In this naive approach, we skip Spark entirely and run everything in standard Python after calling toPandas() on the Spark DataFrame.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;projects_pdf = df_projects.toPandas()
max_seq_length = 256

sentence_model_name = "paraphrase-multilingual-mpnet-base-v2"
sentence_model = SentenceTransformer(sentence_model_name)
sentence_model.max_seq_length = max_seq_length

text_to_encode = projects_pdf["project_text"].tolist()
np_text_embeddings = sentence_model.encode(text_to_encode, batch_size=128, show_progress_bar=True, convert_to_numpy=True)&lt;/LI-CODE&gt;&lt;P&gt;This runs and renders the progress bar nicely, but the problem is converting back into a Delta table.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;projects_pdf["text_embeddings"] = np_text_embeddings.tolist()
projects_pdf.to_delta("europe_prod_catalog.ad_hoc.project_recommendation_stage", mode="overwrite")&lt;/LI-CODE&gt;&lt;P&gt;This part crashes with a memory error ("The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.").&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2. Use a Pandas UDF&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The second approach is borrowed from &lt;A href="https://stackoverflow.com/questions/72398129/creating-a-sentence-transformer-model-in-spark-mllib" target="_self"&gt;StackOverflow&lt;/A&gt;&amp;nbsp;and is based on Spark's &lt;A href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html" target="_self"&gt;pandas_udf&lt;/A&gt;, but it does not work for our volume of data.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from sentence_transformers import SentenceTransformer
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType, StringType

sentence_model_name = "paraphrase-multilingual-mpnet-base-v2"
max_seq_length = 256
mpnet_sentence_model = SentenceTransformer(sentence_model_name)
mpnet_sentence_model.max_seq_length = max_seq_length


@F.pandas_udf(returnType=ArrayType(DoubleType()))
def mpnet_encode(x: pd.Series) -&amp;gt; pd.Series:
    return pd.Series(mpnet_sentence_model.encode(x.tolist(), batch_size=128).tolist())


# apply udf and save
project_df_2 = projects_df.withColumn("project_text_embedding", mpnet_encode("project_text"))
project_df_2.write.mode("overwrite").saveAsTable("my_table")&lt;/LI-CODE&gt;&lt;P&gt;This delays execution, but once we try to save it with&amp;nbsp;saveAsTable, we get the same memory error ("The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."). I also couldn't get the progress bar to work here.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3. Use an MLflow Spark UDF&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I am not entirely sure what MLflow does here, or whether it differs from the previous approach, but I also tried using &lt;A href="https://mlflow.org/docs/latest/api_reference/python_api/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf" target="_self"&gt;spark_udf&lt;/A&gt;.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from sentence_transformers import SentenceTransformer
import mlflow

sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
sentence_model.max_seq_length = 256
data = "MLflow is awesome!"
signature = mlflow.models.infer_signature(
    model_input=data,
    model_output=sentence_model.encode(data),
)

with mlflow.start_run() as run:
    mlflow.sentence_transformers.log_model(
        artifact_path="paraphrase-multilingual-mpnet-base-v2-256",
        model=sentence_model,
        signature=signature,
        input_example=data,
    )
model_uri = f"runs:/{run.info.run_id}/paraphrase-multilingual-mpnet-base-v2-256"
print(model_uri)

udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri=model_uri,
)

# Apply the Spark UDF to the DataFrame. This performs batch predictions
# across all rows in a distributed manner.
df_project_embedding = df_projects.withColumn("prediction", udf(df_projects["project_text"]))&lt;/LI-CODE&gt;&lt;P&gt;This ticks along, but you can't see whether it makes any progress.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Conclusion&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The current workaround is to go with the first approach and skip the Spark part by storing the result as a file in a Databricks Volume instead. However, this is fundamentally tabular data (although it involves a vector column), and keeping it in a Volume loses all the benefits of Databricks.&lt;/P&gt;&lt;P&gt;Another option we considered was building our own batching solution, but the point of Spark is that it should abstract big-data handling, so that also seems wrong.&lt;/P&gt;&lt;P&gt;What is the ideal approach here?&lt;/P&gt;</description>
      <pubDate>Wed, 03 Sep 2025 09:32:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/130622#M4281</guid>
      <dc:creator>excavator-matt</dc:creator>
      <dc:date>2025-09-03T09:32:33Z</dc:date>
    </item>
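The driver crash in approach 1 typically happens when the full embedding matrix is converted into one giant Python list in a single step. A minimal sketch of a chunked hand-off instead, using stand-in data and an illustrative chunk size (`iter_chunks` is a hypothetical helper, not from the thread):

```python
import numpy as np
import pandas as pd


def iter_chunks(pdf: pd.DataFrame, embeddings: np.ndarray, chunk_size: int):
    """Yield DataFrame slices so only one chunk of embeddings is ever
    converted to Python lists (the expensive step) at a time."""
    for start in range(0, len(pdf), chunk_size):
        chunk = pdf.iloc[start:start + chunk_size].copy()
        chunk["text_embeddings"] = embeddings[start:start + chunk_size].tolist()
        yield chunk


# Stand-in data: 10 rows with fake 4-dimensional embeddings
pdf = pd.DataFrame({"project_text": [f"text {i}" for i in range(10)]})
emb = np.arange(40, dtype=np.float64).reshape(10, 4)

chunks = list(iter_chunks(pdf, emb, chunk_size=4))
```

In a notebook, each yielded chunk could then be appended to the target table with something like `spark.createDataFrame(chunk).write.mode("append").saveAsTable(...)`, so peak driver memory is bounded by one chunk rather than the whole result.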
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/131257#M4295</link>
      <description>&lt;P&gt;Spark is designed to handle very large datasets by distributing processing across a cluster, which is why working with Spark DataFrames unlocks these scalability benefits. In contrast, Python and Pandas are not inherently distributed; Pandas dataframes are eagerly evaluated and executed locally, so you can encounter memory issues when working with large datasets. For instance, exceeding around 95 GB of data in Pandas often leads to out-of-memory errors because only the driver node handles all computation, regardless of cluster size.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;To bridge this gap, consider using the Pandas API on Spark, which is part of the Spark ecosystem. This API provides Pandas-equivalent syntax and functionality, while leveraging Spark’s distributed processing to handle larger data volumes efficiently. You can learn more here: &lt;A href="https://docs.databricks.com/aws/en/pandas/pandas-on-spark" target="_blank"&gt;https://docs.databricks.com/aws/en/pandas/pandas-on-spark&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;In short, the Pandas API on Spark lets you write familiar Pandas-style code but benefit from distributed computation. It greatly reduces memory bottlenecks and scales to bigger datasets than native Pandas workflows allow.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2025 17:32:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/131257#M4295</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-09-08T17:32:20Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/133968#M4343</link>
      <description>&lt;P&gt;Sorry for the lack of reply. I had to switch tasks, but I hope to be able to test this. Are you suggesting that Pandas on Spark is more efficiently implemented than a Pandas UDF?&lt;/P&gt;</description>
      <pubDate>Mon, 06 Oct 2025 14:30:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/133968#M4343</guid>
      <dc:creator>excavator-matt</dc:creator>
      <dc:date>2025-10-06T14:30:50Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/134092#M4344</link>
      <description>&lt;P&gt;Yes, because you will be using Spark DataFrames, which are distributed. Pandas DataFrames are not distributed.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Oct 2025 15:26:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/134092#M4344</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-10-07T15:26:54Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/134123#M4345</link>
      <description>&lt;P&gt;If you didn't get this to work with Pandas API on Spark, you might also try&amp;nbsp;importing and instantiating the SentenceTransformer model inside the pandas UDF for proper distributed execution.&lt;/P&gt;
&lt;P&gt;Each executor runs code independently, and when Spark executes a pandas UDF the function is serialized and sent to worker nodes. If you instantiate the model globally (outside the UDF), only the driver knows about it, and Spark would then try to serialize the entire model object and send it to the workers. This could fail or lead to memory issues with complex objects like ML models.&lt;/P&gt;
&lt;P&gt;By creating the model inside the UDF function, you ensure that each executor loads the model locally and has everything it needs to process the batch or partition of data it receives. Perhaps something like this...&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;@F.pandas_udf(returnType=ArrayType(DoubleType()))
def mpnet_encode(x: pd.Series) -&amp;gt; pd.Series:
    # Import and instantiate model inside the UDF
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
    model.max_seq_length = 256
    return pd.Series(model.encode(x.tolist(), batch_size=128).tolist())&lt;/LI-CODE&gt;
&lt;P&gt;I hope that helps. Let us know if it works out for you.&lt;/P&gt;
&lt;P&gt;-James&lt;/P&gt;</description>
      <pubDate>Tue, 07 Oct 2025 21:34:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/134123#M4345</guid>
      <dc:creator>jamesl</dc:creator>
      <dc:date>2025-10-07T21:34:12Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/136773#M4390</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34815"&gt;@Louis_Frolio&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried the Pandas on Spark approach.&lt;BR /&gt;&lt;BR /&gt;How do I get from a Delta table into a Pandas on Spark DataFrame? Is this the best way?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;projects_df = spark.read.table("my_catalog.my_schema.my_project_table")
projects_spdf = ps.from_pandas(projects_df.toPandas())&lt;/LI-CODE&gt;&lt;P&gt;It runs past the sentence-transformers bit, but when I try to&lt;/P&gt;&lt;LI-CODE lang="python"&gt;projects_spdf["text_embeddings"] = np_text_embeddings.tolist()&lt;/LI-CODE&gt;&lt;P&gt;I get this strange error:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;/databricks/spark/python/pyspark/pandas/frame.py in ?(psdf, this_column_labels, that_column_labels)
  13467 def assign_columns(
  13468     psdf: DataFrame, this_column_labels: List[Label], that_column_labels: List[Label]
  13469 ) -&amp;gt; Iterator[Tuple["Series", Label]]:
&amp;gt; 13470     assert len(key) == len(that_column_labels)
  13471     # Note that here intentionally uses `zip_longest` that combine
  13472     # that_columns.
  13473     for k, this_label, that_label in zip_longest(&lt;/LI-CODE&gt;&lt;P&gt;At least it isn't a memory issue, and the earlier attempt with standard pandas did get past this point.&lt;/P&gt;&lt;P&gt;Perhaps my issue is not so much sentence-transformers as how to get a massive list of arrays back into a Delta table. That's why I'm a bit hesitant about&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/127181"&gt;@jamesl&lt;/a&gt;'s suggestion, but I could give it a try. Maybe there is a lazy-loading issue somewhere.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Oct 2025 15:59:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/136773#M4390</guid>
      <dc:creator>excavator-matt</dc:creator>
      <dc:date>2025-10-30T15:59:03Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/136808#M4393</link>
      <description>&lt;P&gt;Also, upgrading to 17.3 ML still gives the same error.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Oct 2025 18:13:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/136808#M4393</guid>
      <dc:creator>excavator-matt</dc:creator>
      <dc:date>2025-10-30T18:13:58Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/137031#M4394</link>
      <description>&lt;P class="p1"&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179384"&gt;@excavator-matt&lt;/a&gt;&amp;nbsp;I’d recommend a quick refresher on the Pandas API on Spark to understand the implementation details. This video breaks it down clearly: &lt;A href="https://youtu.be/tdZDotqKtps?si=pcIzCUYs2s_TeQKx" target="_blank"&gt;https://youtu.be/tdZDotqKtps?si=pcIzCUYs2s_TeQKx&lt;/A&gt;&lt;/P&gt;
&lt;P class="p1"&gt;Hope this helps. — Louis&lt;/P&gt;</description>
      <pubDate>Fri, 31 Oct 2025 15:14:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/137031#M4394</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-10-31T15:14:29Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/149289#M4556</link>
      <description>&lt;P&gt;If anyone else is still interested in this topic, I tried running it with Databricks Runtime 18.1 ML. We still have the same memory issue, but in approach 3 we now also get a clearer error message:&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 5.62 MiB is free&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;One thing I didn't discuss in this thread is trying to use the hosted models somehow. If your end goal is a vector index, you can now also bypass this step entirely and simply &lt;A href="https://docs.databricks.com/aws/en/vector-search/create-vector-search#create-a-vector-search-index" target="_blank"&gt;select an embedding model&lt;/A&gt; instead of precomputing the embeddings. This raises the question of how to reason about hosting: the hosted models are a limited selection compared to what is available on the open web, and I think they are billed per token instead of per runtime hour.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 25 Feb 2026 12:51:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/149289#M4556</guid>
      <dc:creator>excavator-matt</dc:creator>
      <dc:date>2026-02-25T12:51:23Z</dc:date>
    </item>
    <item>
      <title>Re: What is the most efficient way of running sentence-transformers on a Spark DataFrame column?</title>
      <link>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/149297#M4557</link>
      <description>&lt;P&gt;Also, I forgot to mention the workaround solution for the first approach. If you write to Parquet in a Volume, you can then convert it back to a Delta table in a later cell.&lt;BR /&gt;&lt;BR /&gt;Instead of this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;projects_pdf.to_delta("europe_prod_catalog.ad_hoc.project_recommendation_stage", mode="overwrite")&lt;/LI-CODE&gt;&lt;P&gt;You do this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Avoid datetime64 timestamps error
def convert_datetime_columns_to_str(df):
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            df[col] = df[col].astype(str)
    return df


projects_pdf_fixed = convert_datetime_columns_to_str(projects_pdf)
projects_pdf_fixed.to_parquet("/Volumes/europe_prod_catalog/ad_hoc/temp/project_recommendation_embedding.parquet")&lt;/LI-CODE&gt;&lt;P&gt;Then, in the next cell, you do:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;project_embedded_df = spark.read.parquet("/Volumes/europe_prod_catalog/ad_hoc/temp/project_recommendation_embedding.parquet")
project_embedded_df.write.mode("overwrite").saveAsTable("europe_prod_catalog.ad_hoc.project_recommendation_embedding")&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 25 Feb 2026 13:35:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/what-is-the-most-efficient-way-of-running-sentence-transformers/m-p/149297#M4557</guid>
      <dc:creator>excavator-matt</dc:creator>
      <dc:date>2026-02-25T13:35:14Z</dc:date>
    </item>
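The datetime cast in the workaround above can be exercised locally before running it on the real table; a minimal check with pandas and stand-in column names:

```python
import pandas as pd


def convert_datetime_columns_to_str(df):
    # Cast datetime columns to strings so Spark has no trouble with
    # datetime64[ns] when reading the Parquet file back
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            df[col] = df[col].astype(str)
    return df


pdf = pd.DataFrame({
    "project_text": ["a", "b"],
    "created_at": pd.to_datetime(["2025-01-01", "2025-01-02"]),
})
fixed = convert_datetime_columns_to_str(pdf)
```

Note that the cast is lossy in type (the column comes back as strings, not timestamps), so the Delta table ends up with a string column unless you cast it back on the Spark side.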
  </channel>
</rss>

