10-07-2025 02:34 PM
If you didn't get this to work with Pandas API on Spark, you might also try importing and instantiating the SentenceTransformer model inside the pandas UDF for proper distributed execution.
Each executor runs code independently, and when Spark executes a pandas UDF the function is serialized and shipped to the worker nodes. If you instantiate the model globally (outside the UDF), only the driver has it, and Spark has to serialize the entire model object into the UDF closure and send it to the workers. That can fail outright or cause memory problems for complex objects like ML models.
By creating the model inside the UDF function, you ensure that each executor loads the model locally and has everything it needs to process the batch or partition of data it receives. Perhaps something like this...
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

@F.pandas_udf(returnType=ArrayType(DoubleType()))
def mpnet_encode(x: pd.Series) -> pd.Series:
    # Import and instantiate the model inside the UDF so each executor
    # loads its own copy instead of deserializing one from the driver
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
    model.max_seq_length = 256
    return pd.Series(model.encode(x.tolist(), batch_size=128).tolist())
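One caveat with this approach: as written, the UDF reloads the model for every batch it handles. A common refinement is to cache the loaded model per executor process so it's built only on first use. Here's a minimal sketch of that caching pattern using a stand-in loader (`load_model` and the stub return value are placeholders; in the real UDF the body would do the SentenceTransformer import and instantiation shown above):

```python
import functools

@functools.lru_cache(maxsize=1)
def load_model():
    # Stand-in for the heavyweight load; in the real UDF this would be:
    #   from sentence_transformers import SentenceTransformer
    #   model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
    #   model.max_seq_length = 256
    #   return model
    return {"name": "stub-model"}

# First call builds the model; later calls in the same worker process
# reuse the cached instance instead of reloading it.
m1 = load_model()
m2 = load_model()
assert m1 is m2
```

The UDF would then call `load_model()` at the top of its body instead of constructing the model directly, which keeps the pickled function small while avoiding repeated loads on the same worker.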
I hope that helps. Let us know if it works out for you.
-James