I'm trying to use Databricks ARC (Automated Record Connector) and running into an AttributeError. I assume I'm missing something rather trivial that isn't related to ARC itself.
#Databricks Python notebook
#CMD1 import AutoLinker
from arc.autolinker import AutoLinker
import arc
arc.enable_arc()
#CMD2 create dataframe from table data
data_1 = spark.read.table("temp.list_all")
#CMD3 run autolinker
autolinker = AutoLinker()
attribute_columns = ["first_name", "last_name", "dob", "address_line_1", "zip_code"]
#runs fine up to this point
autolinker.auto_link(
    data=data_1,
    attribute_columns=attribute_columns,
    unique_id="pid",
    comparison_size_limit=100000,
    max_evals=100
)
Then I receive this error when running autolinker.auto_link(), and I'm not sure how to troubleshoot it:
AttributeError: 'DataFrame' object has no attribute 'sparkSession'
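To check whether the failure is ARC-specific, I can reproduce the attribute lookup without ARC at all (my assumption being that auto_link accesses data.sparkSession somewhere internally):

#CMD4 probe the DataFrame directly, no ARC involved
import pyspark
print(pyspark.__version__)               # 3.2.1 on this runtime
print(hasattr(data_1, "sparkSession"))   # expect False here, matching the error above

If that prints False, the problem is in the runtime rather than anything in my notebook.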
My cluster runtime version is 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12). I don't have any Spark configurations set on the cluster; I'm not sure whether that needs to change, and if so, which properties to set. Currently researching....
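One thing I've found so far: if I'm reading the PySpark docs correctly, DataFrame.sparkSession only became a public attribute in PySpark 3.3.0, so on 3.2.1 any library code that evaluates data.sparkSession would fail exactly like this. As a stopgap (a sketch only, assuming the missing attribute is the sole incompatibility, which it may well not be), the pre-3.3 spelling could be patched in before calling auto_link:

#sketch: backfill DataFrame.sparkSession on PySpark 3.2 via the SQLContext
from pyspark.sql import DataFrame
if not hasattr(DataFrame, "sparkSession"):
    DataFrame.sparkSession = property(lambda self: self.sql_ctx.sparkSession)

The cleaner fix is presumably a runtime with Spark 3.3 or later, but I'd like to confirm whether that's actually the cause.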