Cannot use Databricks ARC as demo code

hadoan — Mon, 08 Jul 2024 04:37:06 GMT

I read about Databricks ARC at https://github.com/databricks-industry-solutions/auto-data-linkage

and ran the demo code on the DBR 12.2 LTS ML runtime on Databricks Community Edition.

But I got the error below:

2024/07/08 04:25:33 INFO mlflow.tracking.fluent: Experiment with name '/Users/ha@infinitelambda.com/Databricks Autolinker 2024-07-08 04:25:33.405046' does not exist. Creating a new experiment.
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<command-3969223619542297> in <module>
     17 # )
     18
---> 19 autolinker.auto_link(
     20 data=data_df,
     21 attribute_columns=attribute_columns,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in auto_link(self, data, attribute_columns, unique_id, comparison_size_limit, max_evals, cleaning, threshold, true_label, random_seed, metric, sample_for_blocking_rules)
    803 self.spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 'False')
    804 if self.linker_mode == "dedupe_only":
--> 805 space = self._create_hyperopt_space(self._autolink_data, self.attribute_columns, comparison_size_limit, sample_for_blocking_rules)
    806 else:
    807 # use the larger dataframe as baseline

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _create_hyperopt_space(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
    329
    330 # Generate candidate blocking rules
--> 331 self.blocking_rules = self._generate_candidate_blocking_rules(
    332 data=data,
    333 attribute_columns=attribute_columns,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _generate_candidate_blocking_rules(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
    296
    297 # set deterministic rules to be 500th largest (or largest) blocking rule
--> 298 self.deterministic_columns = df_rules.orderBy(F.col("rule_squared_count")).limit(500).orderBy(F.col("rule_squared_count").desc()).limit(1).collect()[0]["splink_rule"]
    299
    300 df_rules.unpersist()

IndexError: list index out of range
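
From the last frame, the failing expression is `collect()[0]`, so it looks like `df_rules` produced no rows at that point. I don't know why it is empty for my data. To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark) that mimics the same selection on a list. The function name `pick_deterministic_rule` and the list-of-dicts input are my own hypothetical stand-ins, not part of ARC, and the `None` fallback is only an assumption about how a guard could look.

```python
def pick_deterministic_rule(rules):
    """Mimic the selection from the traceback on a plain list.

    `rules` is a hypothetical stand-in for the collected rows of df_rules:
    a list of dicts with keys "splink_rule" and "rule_squared_count".
    """
    # Equivalent of orderBy("rule_squared_count").limit(500)
    smallest_500 = sorted(rules, key=lambda r: r["rule_squared_count"])[:500]
    # Equivalent of orderBy(desc).limit(1): the largest of those 500
    top = sorted(
        smallest_500, key=lambda r: r["rule_squared_count"], reverse=True
    )[:1]
    # Indexing top[0] on an empty list is what raises
    # "IndexError: list index out of range"; guard against it here.
    if not top:
        return None
    return top[0]["splink_rule"]


rules = [
    {"splink_rule": "l.a = r.a", "rule_squared_count": 10},
    {"splink_rule": "l.b = r.b", "rule_squared_count": 40},
]
selected = pick_deterministic_rule(rules)   # rule with the largest count
empty_case = pick_deterministic_rule([])    # no rules at all, no exception
```

With an empty input the unguarded version would fail exactly like the traceback above, which is why I suspect no candidate blocking rules survived for my dataset.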

Thanks in advance
