Hi there,
I am trying to parallelize a text-extraction job that calls a Databricks foundation model from a Spark UDF.
Any pointers, suggestions, or examples are welcome.
The code and error are below.
```python
model = "databricks-meta-llama-3-1-70b-instruct"
temperature = 0.0
max_tokens = 1024

schema_llm = StructType([
    StructField("contains_vulnerability", BooleanType(), True),
])

chat_model = ChatDatabricks(
    endpoint=model,
    temperature=temperature,
    max_tokens=max_tokens,
)

# The pipe operator returns a Runnable, not an LLMChain
chain_llm = chat_prompt | chat_model.with_structured_output(VulnerabilityReport)

@udf(returnType=schema_llm)
def CheckContent(text: str):
    out = chain_llm.invoke({"content": text})
    # return a tuple so it matches the single-field struct schema
    return (out["contains_vulnerability"],)

expand_df = sample_df.withColumn("content_check", CheckContent("file_content"))
display(expand_df)
```

And I am getting a pickle error:

```
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 559, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/core/context.py", line 525, in __getnewargs__
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [CONTEXT_ONLY_VALID_ON_DRIVER] It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
```