UDF LLM Databricks pickle error

llmnerd
New Contributor

Hi there,

I am trying to parallelize a text extraction with a Databricks foundation model by calling it from a Spark UDF.

Any pointers, suggestions, or examples are welcome.

The code and the error are below.

model = "databricks-meta-llama-3-1-70b-instruct"
temperature=0.0
max_tokens=1024

schema_llm = StructType([
    StructField("contains_vulnerability", BooleanType(), True),
])

chat_model = ChatDatabricks(
            endpoint=model,
            temperature=temperature,
            max_tokens=max_tokens
        )

chain_llm: LLMChain = (chat_prompt | chat_model.with_structured_output(VulnerabilityReport))

@udf(returnType=schema_llm) 
def CheckContent(text:str): 
    out = chain_llm.invoke({"content":text})
    return (out["contains_vulnerability"])
    
expand_df = sample_df.withColumn("content_check", CheckContent("file_content"))
display(expand_df)

And I am getting a pickle error:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 559, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/core/context.py", line 525, in __getnewargs__
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [CONTEXT_ONLY_VALID_ON_DRIVER] It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
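
My reading of the traceback is that something captured by the UDF closure (presumably chain_llm, built on the driver) cannot be pickled and shipped to the workers. A pattern that often avoids this is to build the client lazily inside the function that runs on the worker, so only a plain function is serialized. Below is a minimal, Spark-free sketch of that idea. DriverOnlyClient, get_chain, and can_pickle are hypothetical names I made up for illustration; DriverOnlyClient only simulates an unpicklable object and is not the real ChatDatabricks.

```python
import pickle


class DriverOnlyClient:
    """Hypothetical stand-in for an object that refuses to be pickled,
    similar to anything that holds a reference to SparkContext."""

    def __reduce__(self):
        # Simulates the failure seen in the traceback above.
        raise RuntimeError("CONTEXT_ONLY_VALID_ON_DRIVER (simulated)")

    def invoke(self, payload):
        # Toy logic in place of a real LLM call.
        return {"contains_vulnerability": "eval(" in payload["content"]}


_chain = None  # one cached client per Python worker process


def get_chain():
    """Build the client on first use, in the process that needs it."""
    global _chain
    if _chain is None:
        _chain = DriverOnlyClient()
    return _chain


def check_content(text):
    """Body of the would-be UDF: references no driver-side objects."""
    out = get_chain().invoke({"content": text})
    return bool(out["contains_vulnerability"])


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print(can_pickle(DriverOnlyClient()))
    print(check_content("x = eval(user_input)"))
```

In the Spark version, I assume check_content would be decorated with @udf(returnType=BooleanType()) and get_chain would construct ChatDatabricks and the prompt chain instead of the stand-in; I have not verified that on a cluster. Note also that my original UDF declares a StructType return type but returns a single boolean, which probably needs fixing either way.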