I followed the Databricks ARC repository at https://github.com/databricks-industry-solutions/auto-data-linkage
and ran it on the DBR 12.2 LTS ML runtime on Databricks Community Edition,
but I got the error below:
2024/07/08 04:25:33 INFO mlflow.tracking.fluent: Experiment with name '/Users/ha@infinitelambda.com/Databricks Autolinker 2024-07-08 04:25:33.405046' does not exist. Creating a new experiment.
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<command-3969223619542297> in <module>
17 # )
18
---> 19 autolinker.auto_link(
20 data=data_df,
21 attribute_columns=attribute_columns,
/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in auto_link(self, data, attribute_columns, unique_id, comparison_size_limit, max_evals, cleaning, threshold, true_label, random_seed, metric, sample_for_blocking_rules)
803 self.spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 'False')
804 if self.linker_mode == "dedupe_only":
--> 805 space = self._create_hyperopt_space(self._autolink_data, self.attribute_columns, comparison_size_limit, sample_for_blocking_rules)
806 else:
807 # use the larger dataframe as baseline
/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _create_hyperopt_space(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
329
330 # Generate candidate blocking rules
--> 331 self.blocking_rules = self._generate_candidate_blocking_rules(
332 data=data,
333 attribute_columns=attribute_columns,
/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _generate_candidate_blocking_rules(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
296
297 # set deterministic rules to be 500th largest (or largest) blocking rule
--> 298 self.deterministic_columns = df_rules.orderBy(F.col("rule_squared_count")).limit(500).orderBy(F.col("rule_squared_count").desc()).limit(1).collect()[0]["splink_rule"]
299
300 df_rules.unpersist()
IndexError: list index out of range
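For context, the failing line in `_generate_candidate_blocking_rules` calls `.collect()[0]` on a DataFrame of candidate rules, which raises `IndexError` whenever that DataFrame is empty. A minimal sketch of the failure mode, using a plain Python list as a stand-in for the result of Spark's `collect()` (the `rows`/`rule` names and the fallback are my own illustration, not ARC code):

```python
# Stand-in for df_rules...collect() when no candidate blocking rules qualify.
rows = []

# ARC effectively does rows[0]["splink_rule"], which raises IndexError
# on an empty result. A defensive version would check first:
if rows:
    rule = rows[0]["splink_rule"]
else:
    rule = None  # no candidate blocking rules were generated

print(rule)  # None when the rule list is empty
```

So the underlying question seems to be why ARC generates zero candidate blocking rules for this dataset on this runtime.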
It looks like the `collect()` at line 298 of `autolinker.py` returned an empty list, so no candidate blocking rules were generated for my data. What could cause this, and how can I work around it? Thanks in advance.