I'm trying to use Databricks ARC (Automated Record Connector) and running into an AttributeError. I assume I'm missing something rather trivial that isn't related to ARC itself.
#Databricks Python notebook
#CMD1 import AutoLinker
from arc.autolinker import AutoLinker
import arc
arc.enable_arc()
#CMD2 create dataframe from table data
data_1 = spark.read.table("temp.list_all")
#CMD3 run autolinker
autolinker = AutoLinker()
attribute_columns = ["first_name", "last_name", "dob", "address_line_1", "zip_code"]
#runs fine up to this point
autolinker.auto_link(
    data=data_1,
    attribute_columns=attribute_columns,
    unique_id="pid",
    comparison_size_limit=100000,
    max_evals=100
)
Then I receive this error when running autolinker.auto_link(), and I'm not sure how to troubleshoot it:
AttributeError: 'DataFrame' object has no attribute 'sparkSession'
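To check whether the failure is ARC-specific, I can reproduce the attribute lookup without ARC at all (my assumption being that auto_link accesses data.sparkSession somewhere internally):

#CMD4 probe the DataFrame directly, no ARC involved
import pyspark
print(pyspark.__version__)               # 3.2.1 on this runtime
print(hasattr(data_1, "sparkSession"))   # expect False here, matching the error above

If that prints False, the problem is in the runtime rather than anything in my notebook.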
My cluster runtime version is 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12). I don't have any Spark configurations set on the cluster; I'm not sure whether that needs to change, and if so, which properties to set. Currently researching....
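One thing I've found so far: if I'm reading the PySpark docs correctly, DataFrame.sparkSession only became a public attribute in PySpark 3.3.0, so on 3.2.1 any library code that evaluates data.sparkSession would fail exactly like this. As a stopgap (a sketch only, assuming the missing attribute is the sole incompatibility, which it may well not be), the pre-3.3 spelling could be patched in before calling auto_link:

#sketch: backfill DataFrame.sparkSession on PySpark 3.2 via the SQLContext
from pyspark.sql import DataFrame
if not hasattr(DataFrame, "sparkSession"):
    DataFrame.sparkSession = property(lambda self: self.sql_ctx.sparkSession)

The cleaner fix is presumably a runtime with Spark 3.3 or later, but I'd like to confirm whether that's actually the cause.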