09-03-2025 02:21 AM - edited 09-03-2025 02:32 AM
We're trying to run the bundled sentence-transformers library from SBERT in a notebook running Databricks ML 16.4 on an AWS g4dn.2xlarge [T4] instance.
However, we're experiencing out-of-memory crashes and are wondering what the optimal way to run sentence-vector encoding in Databricks is.
We have tried three different approaches, but none of them really works.
1. Skip Spark entirely
In this naive approach, we skip Spark entirely and run the encoding in plain Python after converting the Spark DataFrame with toPandas():
from sentence_transformers import SentenceTransformer

projects_pdf = df_projects.toPandas()
max_seq_length = 256
sentence_model_name = "paraphrase-multilingual-mpnet-base-v2"
sentence_model = SentenceTransformer(sentence_model_name)
sentence_model.max_seq_length = max_seq_length
text_to_encode = projects_pdf["project_text"].tolist()
np_text_embeddings = sentence_model.encode(text_to_encode, batch_size=128, show_progress_bar=True, convert_to_numpy=True)
This runs and renders the progress bar nicely, but the problem is converting the result back into a Delta table.
projects_pdf["text_embeddings"] = np_text_embeddings.tolist()
projects_pdf.to_delta("europe_prod_catalog.ad_hoc.project_recommendation_stage", mode="overwrite")
This part crashes with a memory issue ("The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.").
2. Use Pandas UDF
The second approach is adapted from StackOverflow and is based on Spark's pandas_udf, but it does not work for our volume of data either.
from sentence_transformers import SentenceTransformer
import mlflow
sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
sentence_model.max_seq_length = 256
data = "MLflow is awesome!"
signature = mlflow.models.infer_signature(
    model_input=data,
    model_output=sentence_model.encode(data),
)
with mlflow.start_run() as run:
    mlflow.sentence_transformers.log_model(
        artifact_path="paraphrase-multilingual-mpnet-base-v2-256",
        model=sentence_model,
        signature=signature,
        input_example=data,
    )
model_uri = f"runs:/{run.info.run_id}/paraphrase-multilingual-mpnet-base-v2-256"
print(model_uri)
udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri=model_uri,
)
# Apply the Spark UDF to the DataFrame. This performs batch predictions across all rows in a distributed manner.
df_project_embedding = df_projects.withColumn("prediction", udf(df_projects["project_text"]))
09-08-2025 10:32 AM
Spark is designed to handle very large datasets by distributing processing across a cluster, which is why working with Spark DataFrames unlocks these scalability benefits. In contrast, Python and Pandas are not inherently distributed; Pandas dataframes are eagerly evaluated and executed locally, so you can encounter memory issues when working with large datasets. For instance, exceeding around 95 GB of data in Pandas often leads to out-of-memory errors because only the driver node handles all computation, regardless of cluster size.
To bridge this gap, consider using the Pandas API on Spark, which is part of the Spark ecosystem. This API provides Pandas-equivalent syntax and functionality, while leveraging Spark's distributed processing to handle larger data volumes efficiently. You can learn more here: https://docs.databricks.com/aws/en/pandas/pandas-on-spark.
In short, the Pandas API on Spark lets you write familiar Pandas-style code but benefit from distributed computation. It greatly reduces memory bottlenecks and scales to bigger datasets than native Pandas workflows allow.
Hope this helps, Louis.
a week ago
Sorry for the lack of reply. I had to switch tasks, but I hope to be able to test this. Are you suggesting that the Pandas API on Spark is more efficiently implemented than a pandas UDF?
a week ago
Yes because you will be using Spark dataframes which are distributed. Pandas dataframes are not distributed.
Tuesday
If you didn't get this to work with Pandas API on Spark, you might also try importing and instantiating the SentenceTransformer model inside the pandas UDF for proper distributed execution.
Each executor runs code independently, and when Spark executes a pandas UDF the function is serialized and sent to worker nodes. If you instantiate the model globally (outside the UDF), only the driver knows about it, and Spark would then try to serialize the entire model object and send it to the workers. This could fail or lead to memory issues with complex objects like ML models.
By creating the model inside the UDF function, you ensure that each executor loads the model locally and has everything it needs to process the batch or partition of data it receives. Perhaps something like this...
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

@F.pandas_udf(returnType=ArrayType(DoubleType()))
def mpnet_encode(x: pd.Series) -> pd.Series:
    # Import and instantiate the model inside the UDF so each executor
    # loads its own copy rather than deserializing one shipped from the driver
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
    model.max_seq_length = 256
    return pd.Series(model.encode(x.tolist(), batch_size=128).tolist())
I hope that helps. Let us know if it works out for you.
-James