Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Cannot use Databricks ARC as demo code

hadoan
New Contributor II

I followed the Databricks ARC repository at https://github.com/databricks-industry-solutions/auto-data-linkage

and ran it on a DBR 12.2 LTS ML runtime on Databricks Community Edition,

but I got the error below:

 

2024/07/08 04:25:33 INFO mlflow.tracking.fluent: Experiment with name '/Users/ha@infinitelambda.com/Databricks Autolinker 2024-07-08 04:25:33.405046' does not exist. Creating a new experiment.
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<command-3969223619542297> in <module>
     17 # )
     18 
---> 19 autolinker.auto_link(
     20   data=data_df,
     21   attribute_columns=attribute_columns,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in auto_link(self, data, attribute_columns, unique_id, comparison_size_limit, max_evals, cleaning, threshold, true_label, random_seed, metric, sample_for_blocking_rules)
    803     self.spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 'False')
    804     if self.linker_mode == "dedupe_only":
--> 805       space = self._create_hyperopt_space(self._autolink_data, self.attribute_columns, comparison_size_limit, sample_for_blocking_rules)
    806     else:
    807       # use the larger dataframe as baseline

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _create_hyperopt_space(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
    329 
    330     # Generate candidate blocking rules
--> 331     self.blocking_rules = self._generate_candidate_blocking_rules(
    332       data=data,
    333       attribute_columns=attribute_columns,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a67c339d-1de3-4b7b-8a4e-766134e34824/lib/python3.9/site-packages/arc/autolinker/autolinker.py in _generate_candidate_blocking_rules(self, data, attribute_columns, comparison_size_limit, sample_for_blocking_rules, max_columns_per_and_rule, max_rules_per_or_rule)
    296 
    297     # set deterministic rules to be 500th largest (or largest) blocking rule
--> 298     self.deterministic_columns = df_rules.orderBy(F.col("rule_squared_count")).limit(500).orderBy(F.col("rule_squared_count").desc()).limit(1).collect()[0]["splink_rule"]
    299 
    300     df_rules.unpersist()

IndexError: list index out of range

Thanks in advance.

1 REPLY

Kaniz_Fatma
Community Manager

Hi @hadoan,

  1. Validate the input data: make sure the data you pass to auto_link is a valid Spark DataFrame and that the attribute columns do not contain issues such as missing values or inconsistent data types, since ARC derives its candidate blocking rules from those columns.
  2. Increase comparison_size_limit: this parameter caps the number of record pairs ARC will consider for comparison. If it is set too low, the package may fail to find any valid blocking rules.
  3. Increase sample_for_blocking_rules: this parameter controls the size of the sample used to generate candidate blocking rules. If the sample is too small, no valid rules may be found.
  4. Increase max_columns_per_and_rule and max_rules_per_or_rule: these control the maximum number of columns per "AND" rule and the maximum number of "OR" rules, respectively. Values that are too low can likewise leave ARC without any valid blocking rules.
  5. Confirm you are running ARC on the Databricks Runtime version the documentation calls for, which is 12.2 LTS ML.
  6. Carefully review the ARC documentation, especially the sections on data preparation, parameter tuning, and troubleshooting, to make sure the package is being used as intended.

If the issue persists after trying these steps, you may need to provide more information about your data and the specific error message you are encountering to get more targeted assistance.
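For context on the error itself: the last traceback frame takes the first element of df_rules...collect() with [0]. When no candidate blocking rule survives the size limit, collect() returns an empty list, and indexing it raises exactly this IndexError. A minimal plain-Python reproduction of that pattern, with a defensive check you could apply to your own wrapper code:

```python
# collect() on an empty rules DataFrame yields an empty list of rows.
rows = []

try:
    rule = rows[0]["splink_rule"]  # mirrors .collect()[0]["splink_rule"] in ARC
except IndexError as exc:
    print(exc)  # -> list index out of range

# Defensive version: fail with an actionable message instead of an IndexError.
if not rows:
    print("no candidate blocking rules found; try raising comparison_size_limit "
          "or sample_for_blocking_rules and rerun")
```

So the IndexError is a symptom: the real problem is that rule generation produced zero rules, which is what the tuning suggestions above aim to fix.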
