Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Cannot run the Databricks ARC demo code

hadoan
New Contributor II

I read about Databricks ARC at https://github.com/databricks-industry-solutions/auto-data-linkage

and ran the demo code on the DBR 12.2 LTS ML runtime on Databricks Community Edition.

But I got the error below:

 

2024/07/08 04:25:33 INFO mlflow.tracking.fluent: Experiment with name '/Users/ha@infinitelambda.com/Databricks Autolinker 2024-07-08 04:25:33.405046' does not exist. Creating a new experiment.
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<command-3969223619542297> in <module>
     17 # )
     18 
---> 19 autolinker.auto_link(
     20   data=data_df,
     21   attribute_columns=attribute_columns,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in auto_link(self, data, attribute_columns, unique_id, comparison_size_limit, max_evals, cleaning, threshold, true_label, random_seed, metric, sample_for_blocking_rules)
    803     self.spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 'False')
    804     if self.linker_mode == "dedupe_only":
--> 805       space = self._create_hyperopt_space(self._autolink_data, self.attribute_columns, comparison_size_limit, sample_for_blocking_rules)
    806     else:
    807       # use the larger dataframe as baseline

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _create_hyperopt_space(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
    329 
    330     # Generate candidate blocking rules
--> 331     self.blocking_rules = self._generate_candidate_blocking_rules(
    332       data=data,
    333       attribute_columns=attribute_columns,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _generate_candidate_blocking_rules(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
    296 
    297     # set deterministic rules to be 500th largest (or largest) blocking rule
--> 298     self.deterministic_columns = df_rules.orderBy(F.col("rule_squared_count")).limit(500).orderBy(F.col("rule_squared_count").desc()).limit(1).collect()[0]["splink_rule"]
    299 
    300     df_rules.unpersist()

IndexError: list index out of range
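
Reading the traceback, the failing expression is `.collect()[0]` at line 298 of `autolinker.py`. Below is a minimal sketch, in plain Python with no Spark involved, of what I assume is happening: `collect()` returns a Python list, and if `df_rules` has no rows for my data, indexing `[0]` on that empty list raises exactly this error. The `first_rule` helper is hypothetical and not part of ARC, and I have not confirmed why `df_rules` would be empty.

```python
# Stand-in for the result of DataFrame.collect(), which is a Python list of rows.
# Assumption: df_rules produced no rows, so the list is empty.
empty_rows = []

try:
    empty_rows[0]["splink_rule"]
    raised = False
except IndexError:
    # Same exception type as in the traceback above.
    raised = True


def first_rule(rows, key="splink_rule"):
    """Hypothetical guard: return the first row's value, or None if there are no rows."""
    if not rows:
        return None
    return rows[0][key]
```

If this assumption is right, the crash is a symptom of no candidate blocking rules being produced for the input, not of the `auto_link` call itself.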

 Thanks in advance

