06-18-2024 08:50 AM - edited 06-18-2024 08:53 AM
The "dense_vector" column does not output on show(). Instead I get the error below. Any idea why it doesn't work on the shared access mode? Any alternatives?
from fastembed import TextEmbedding, SparseTextEmbedding
from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, IntegerType
import pandas as pd
from pyspark.sql.functions import col
@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_dir="/tmp/local_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))

df = df.limit(50)
df.show(10)

embeddings = df.withColumn("dense_vector", generate_dense_embeddings(col("content")))
embeddings.show(10)
Py4JJavaError: An error occurred while calling o474.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 28) (172.16.2.140 executor 0): org.apache.spark.SparkRuntimeException: [UDF_ERROR.ENV_LOST] Execution of function generate_dense_embeddings(content#73) failed - the execution environment was lost during execution. This may be caused by the code crashing or the process exiting prematurely.
06-18-2024 10:58 AM
Not 100% sure, but I'm guessing it's because of the cache_dir. Shared access mode clusters are meant for Unity Catalog and should point to UC volumes instead of local paths. Can you try changing it to a UC volume? Something like this:
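(The volume path below is just a placeholder; substitute your own catalog, schema, and volume.)

from fastembed import TextEmbedding

small_embedding_model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    # Point the cache at a UC Volume instead of a local path.
    # "<catalog>/<schema>/<volume>" is illustrative, not a real volume.
    cache_dir="/Volumes/<catalog>/<schema>/<volume>/model_cache/",
)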
06-18-2024 11:47 PM
Got this error along with the one above, even though the model is cached in the UC Volume.
from fastembed import TextEmbedding, SparseTextEmbedding
from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, IntegerType
import pandas as pd

@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-base-en", cache_dir="/Volumes/qdrant_cache/default/model_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))

from pyspark.sql.functions import col

df = df.limit(50)
df.show(10)

embeddings = df.withColumn("dense_vector", generate_dense_embeddings(col("content")))
embeddings.show(10)
Write not supported
Files in Repos are currently read-only. Please try writing to /tmp/<filename>. Alternatively, contact your Databricks representative to enable programmatically writing to files in a repository.
06-18-2024 10:58 PM
@jacovangelder It throws the same error without cache_dir, but I will try UC volumes.
06-19-2024 12:11 AM
Hmm, interesting; then it's something else.
The code below works for me on a Shared access mode cluster (I don't know what your input dataset looks like):
df = spark.sql("SELECT '1' as content")

from fastembed import TextEmbedding, SparseTextEmbedding
from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, IntegerType
import pandas as pd
from pyspark.sql.functions import col

@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_dir="/tmp/local_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))

df = df.limit(50)
df.show(10)

embeddings = df.withColumn("dense_vector", generate_dense_embeddings(col("content")))
embeddings.show(10)
Are you sure your cluster setup is sufficient for what you're trying to achieve?
06-19-2024 02:16 AM - edited 06-19-2024 09:52 PM
@jacovangelder I think the resources are sufficient, since it works on my personal cluster, which has fewer resources. I ran the code you sent on my shared access mode cluster and it still didn't work. Maybe I need to change the cluster config? This is my current shared cluster config.
EDIT: Can you also share the versions of your Python packages? I had to downgrade my numpy version for DBR to work, so that may also be the cause of this issue. fastembed v0.3.1 doesn't require a numpy downgrade, but it still doesn't work on DBR 13.3 LTS, and I am getting warnings from pip about version incompatibilities.
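For comparison, one way to print the installed versions from a notebook cell (standard library only; this assumes the pip distribution names are "fastembed" and "numpy"):

import importlib.metadata as md

# Print the installed versions of the two packages in question
print("fastembed:", md.version("fastembed"))
print("numpy:", md.version("numpy"))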
06-19-2024 11:04 PM
@jacovangelder thanks, but the error was solved by adding this line in my UDF:
user = os.environ.get("USER")
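For reference, a minimal sketch of the fixed UDF as described above (the only changes to the original code are the os import and the USER lookup; why reading USER inside the UDF avoids the ENV_LOST crash isn't explained in this thread):

import os
import pandas as pd
from fastembed import TextEmbedding
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    # Workaround reported above: read the USER environment variable inside
    # the UDF, so the lookup happens on the executor's Python worker
    user = os.environ.get("USER")
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_dir="/tmp/local_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))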
06-19-2024 11:07 PM
Glad it's resolved!