I followed the Databricks ARC repository at https://github.com/databricks-industry-solutions/auto-data-linkage
and ran it on the DBR 12.2 LTS ML runtime on Databricks Community Edition,
but I got the error below:
2024/07/08 04:25:33 INFO mlflow.tracking.fluent: Experiment with name '/Users/ha@infinitelambda.com/Databricks Autolinker 2024-07-08 04:25:33.405046' does not exist. Creating a new experiment.
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<command-3969223619542297> in <module>
17 # )
18
---> 19 autolinker.auto_link(
20 data=data_df,
21 attribute_columns=attribute_columns,
/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in auto_link(self, data, attribute_columns, unique_id, comparison_size_limit, max_evals, cleaning, threshold, true_label, random_seed, metric, sample_for_blocking_rules)
803 self.spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 'False')
804 if self.linker_mode == "dedupe_only":
--> 805 space = self._create_hyperopt_space(self._autolink_data, self.attribute_columns, comparison_size_limit, sample_for_blocking_rules)
806 else:
807 # use the larger dataframe as baseline
/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _create_hyperopt_space(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
329
330 # Generate candidate blocking rules
--> 331 self.blocking_rules = self._generate_candidate_blocking_rules(
332 data=data,
333 attribute_columns=attribute_columns,
/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _generate_candidate_blocking_rules(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
296
297 # set deterministic rules to be 500th largest (or largest) blocking rule
--> 298 self.deterministic_columns = df_rules.orderBy(F.col("rule_squared_count")).limit(500).orderBy(F.col("rule_squared_count").desc()).limit(1).collect()[0]["splink_rule"]
299
300 df_rules.unpersist()
IndexError: list index out of range
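For context, the failing line in `_generate_candidate_blocking_rules` calls `.collect()[0]` on a DataFrame of candidate rules, which raises `IndexError` whenever that DataFrame is empty. A minimal sketch of the failure mode, using a plain Python list as a stand-in for the result of Spark's `collect()` (the `rows`/`rule` names and the fallback are my own illustration, not ARC code):

```python
# Stand-in for df_rules...collect() when no candidate blocking rules qualify.
rows = []

# ARC effectively does rows[0]["splink_rule"], which raises IndexError
# on an empty result. A defensive version would check first:
if rows:
    rule = rows[0]["splink_rule"]
else:
    rule = None  # no candidate blocking rules were generated

print(rule)  # None when the rule list is empty
```

So the underlying question seems to be why ARC generates zero candidate blocking rules for this dataset on this runtime.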
It looks like the `collect()` at line 298 of `autolinker.py` returned an empty list, so no candidate blocking rules were generated for my data. What could cause this, and how can I work around it? Thanks in advance.