06-18-2024 08:50 AM - edited 06-18-2024 08:53 AM
The "dense_vector" column does not output on show(). Instead I get the error below. Any idea why it doesn't work on the shared access mode? Any alternatives?
from fastembed import TextEmbedding, SparseTextEmbedding
from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, IntegerType
import pandas as pd
from pyspark.sql.functions import col
@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_dir="/tmp/local_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))

df = df.limit(50)
df.show(10)

embeddings = df.withColumn("dense_vector", generate_dense_embeddings(col("content")))
embeddings.show(10)
Py4JJavaError: An error occurred while calling o474.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 28) (172.16.2.140 executor 0): org.apache.spark.SparkRuntimeException: [UDF_ERROR.ENV_LOST] Execution of function generate_dense_embeddings(content#73) failed - the execution environment was lost during execution. This may be caused by the code crashing or the process exiting prematurely.
06-18-2024 10:58 AM
Not 100% sure, but I'm guessing it's because of the cache_dir. Shared access mode clusters are meant for Unity Catalog and should point to UC volumes instead of local paths. Can you try changing it to a UC volume? Something like this:
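(The volume path below is just a placeholder; substitute your own catalog, schema, and volume.)

from fastembed import TextEmbedding

small_embedding_model = TextEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    # Point the cache at a UC Volume instead of a local path.
    # "<catalog>/<schema>/<volume>" is illustrative, not a real volume.
    cache_dir="/Volumes/<catalog>/<schema>/<volume>/model_cache/",
)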
06-18-2024 11:47 PM
Got this error along with the one above, even though the model is cached in the UC Volume.
from fastembed import TextEmbedding, SparseTextEmbedding
from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, IntegerType
import pandas as pd

@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-base-en", cache_dir="/Volumes/qdrant_cache/default/model_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))

from pyspark.sql.functions import col

df = df.limit(50)
df.show(10)

embeddings = df.withColumn("dense_vector", generate_dense_embeddings(col("content")))
embeddings.show(10)
Write not supported
Files in Repos are currently read-only. Please try writing to /tmp/<filename>. Alternatively, contact your Databricks representative to enable programmatically writing to files in a repository.
06-18-2024 10:58 PM
@jacovangelder It throws the same error without cache_dir, but I will try UC volumes.
06-19-2024 12:11 AM
Hmm, interesting; then it's something else.
The code below works for me on a Shared access mode cluster (I don't know what your input dataset looks like):
df = spark.sql("SELECT '1' as content")

from fastembed import TextEmbedding, SparseTextEmbedding
from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, IntegerType
import pandas as pd
from pyspark.sql.functions import col

@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_dir="/tmp/local_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))

df = df.limit(50)
df.show(10)

embeddings = df.withColumn("dense_vector", generate_dense_embeddings(col("content")))
embeddings.show(10)
Are you sure your cluster setup is sufficient for what you're trying to achieve?
06-19-2024 02:16 AM - edited 06-19-2024 09:52 PM
@jacovangelder I think the resources are sufficient, since it works on my personal cluster, which has fewer resources. I ran the code you sent on my shared access mode cluster and it still didn't work. Maybe I need to change the cluster config? This is my current shared cluster config.
EDIT: Can you also share the versions of your Python packages? I had to downgrade my numpy version for DBR to work, so that may also be the cause of this issue. fastembed v0.3.1 doesn't require a numpy downgrade, but it still doesn't work on DBR 13.3 LTS, and I am getting warnings from pip about version incompatibilities.
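For comparison, one way to print the installed versions from a notebook cell (standard library only; this assumes the pip distribution names are "fastembed" and "numpy"):

import importlib.metadata as md

# Print the installed versions of the two packages in question
print("fastembed:", md.version("fastembed"))
print("numpy:", md.version("numpy"))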
06-19-2024 11:04 PM
@jacovangelder thanks, but the error was solved by adding this line in my UDF:
user = os.environ.get("USER")
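For reference, a minimal sketch of the fixed UDF as described above (the only changes to the original code are the os import and the USER lookup; why reading USER inside the UDF avoids the ENV_LOST crash isn't explained in this thread):

import os
import pandas as pd
from fastembed import TextEmbedding
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

@pandas_udf(ArrayType(FloatType()))
def generate_dense_embeddings(contents: pd.Series) -> pd.Series:
    # Workaround reported above: read the USER environment variable inside
    # the UDF, so the lookup happens on the executor's Python worker
    user = os.environ.get("USER")
    small_embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5", cache_dir="/tmp/local_cache/")
    dense_embeddings_list = small_embedding_model.embed(contents)
    return pd.Series(list(dense_embeddings_list))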
06-19-2024 11:07 PM
Glad it's resolved!