<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Isolation Forest prediction failing DLT pipeline, the same model works fine when prediction is done outside DLT pipeline. in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/isolation-forest-prediction-failing-dlt-pipeline-the-same-model/m-p/4799#M214</link>
    <description>&lt;P&gt;Hey community members&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am new to Databricks and was building a simple DLT pipeline that loads data from S3 and runs an Isolation Forest prediction to detect anomalies. The model has been stored in the Model Registry. Here's the code for the pipeline:&lt;/P&gt;&lt;P&gt;@dlt.table&lt;/P&gt;&lt;P&gt;def trucklocation():&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; return (&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; spark.readStream&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .format("cloudFiles")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .option("cloudFiles.format", "json")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .option("cloudFiles.inferColumnTypes", True)&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .load(f"{source}/trucklocation")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .select(&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; F.current_timestamp().alias("processing_time"),&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "*"&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; )&lt;/P&gt;&lt;P&gt;loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)&lt;/P&gt;&lt;P&gt;@dlt.table&lt;/P&gt;&lt;P&gt;def velocity_predictions():&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; return (&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; dlt.read("trucklocation")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withColumn('predictions', loaded_model_udf(struct(*map(col, ['velocity']))))&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; 
)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The pipeline errors out with the following error:&lt;/P&gt;&lt;P&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 804.0 failed 4 times, most recent failure: Lost task 0.3 in stage 804.0 (TID 1285) (10.55.136.232 executor 0): org.apache.spark.api.python.PythonException: 'AttributeError: 'IsolationForest' object has no attribute 'n_features_''. Full traceback below:&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 1293, in udf&lt;/P&gt;&lt;P&gt;    os.kill(scoring_server_proc.pid, signal.SIGTERM)&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 1080, in _predict_row_batch&lt;/P&gt;&lt;P&gt;    result = predict_fn(pdf)&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 1274, in batch_predict_fn&lt;/P&gt;&lt;P&gt;    return loaded_model.predict(pdf)&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 427, in predict&lt;/P&gt;&lt;P&gt;    return self._predict_fn(data)&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.9/site-packages/sklearn/ensemble/_iforest.py", line 314, in predict&lt;/P&gt;&lt;P&gt;    is_inlier[self.decision_function(X) &amp;lt; 0] = -1&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.9/site-packages/sklearn/ensemble/_iforest.py", line 347, in decision_function&lt;/P&gt;&lt;P&gt;    return self.score_samples(X) - self.offset_&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.9/site-packages/sklearn/ensemble/_iforest.py", line 379, in 
score_samples&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;    if self.n_features_ != X.shape[1]:&lt;/I&gt;&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;AttributeError: 'IsolationForest' object has no attribute 'n_features_'&lt;/I&gt;&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I also tried running the prediction directly, outside the pipeline, and it worked fine:&lt;/P&gt;&lt;P&gt;loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)&lt;/P&gt;&lt;P&gt;df_location = spark.read.format('json').option("inferSchema", "true").load(s3path)&lt;/P&gt;&lt;P&gt;df = df_location.withColumn('predictions', loaded_model_udf(struct(*map(col, ['velocity']))))&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any help understanding why the pipeline fails would be appreciated.&lt;/P&gt;</description>
    <pubDate>Thu, 04 May 2023 12:48:45 GMT</pubDate>
    <dc:creator>MukulDegweker</dc:creator>
    <dc:date>2023-05-04T12:48:45Z</dc:date>
    <item>
      <title>Isolation Forest prediction failing DLT pipeline, the same model works fine when prediction is done outside DLT pipeline.</title>
      <link>https://community.databricks.com/t5/machine-learning/isolation-forest-prediction-failing-dlt-pipeline-the-same-model/m-p/4799#M214</link>
      <description>&lt;P&gt;Hey community members&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am new to Databricks and was building a simple DLT pipeline that loads data from S3 and runs an Isolation Forest prediction to detect anomalies. The model has been stored in the Model Registry. Here's the code for the pipeline:&lt;/P&gt;&lt;P&gt;@dlt.table&lt;/P&gt;&lt;P&gt;def trucklocation():&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; return (&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; spark.readStream&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .format("cloudFiles")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .option("cloudFiles.format", "json")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .option("cloudFiles.inferColumnTypes", True)&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .load(f"{source}/trucklocation")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .select(&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; F.current_timestamp().alias("processing_time"),&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "*"&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; )&lt;/P&gt;&lt;P&gt;loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)&lt;/P&gt;&lt;P&gt;@dlt.table&lt;/P&gt;&lt;P&gt;def velocity_predictions():&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; return (&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; dlt.read("trucklocation")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withColumn('predictions', loaded_model_udf(struct(*map(col, ['velocity']))))&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; 
)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The pipeline errors out with the following error:&lt;/P&gt;&lt;P&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 804.0 failed 4 times, most recent failure: Lost task 0.3 in stage 804.0 (TID 1285) (10.55.136.232 executor 0): org.apache.spark.api.python.PythonException: 'AttributeError: 'IsolationForest' object has no attribute 'n_features_''. Full traceback below:&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 1293, in udf&lt;/P&gt;&lt;P&gt;    os.kill(scoring_server_proc.pid, signal.SIGTERM)&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 1080, in _predict_row_batch&lt;/P&gt;&lt;P&gt;    result = predict_fn(pdf)&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 1274, in batch_predict_fn&lt;/P&gt;&lt;P&gt;    return loaded_model.predict(pdf)&lt;/P&gt;&lt;P&gt;  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-a35783aa-d900-4f51-9233-f8eb37babc87/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py", line 427, in predict&lt;/P&gt;&lt;P&gt;    return self._predict_fn(data)&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.9/site-packages/sklearn/ensemble/_iforest.py", line 314, in predict&lt;/P&gt;&lt;P&gt;    is_inlier[self.decision_function(X) &amp;lt; 0] = -1&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.9/site-packages/sklearn/ensemble/_iforest.py", line 347, in decision_function&lt;/P&gt;&lt;P&gt;    return self.score_samples(X) - self.offset_&lt;/P&gt;&lt;P&gt;  File "/databricks/python/lib/python3.9/site-packages/sklearn/ensemble/_iforest.py", line 379, in 
score_samples&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;    if self.n_features_ != X.shape[1]:&lt;/I&gt;&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;AttributeError: 'IsolationForest' object has no attribute 'n_features_'&lt;/I&gt;&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I also tried running the prediction directly, outside the pipeline, and it worked fine:&lt;/P&gt;&lt;P&gt;loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)&lt;/P&gt;&lt;P&gt;df_location = spark.read.format('json').option("inferSchema", "true").load(s3path)&lt;/P&gt;&lt;P&gt;df = df_location.withColumn('predictions', loaded_model_udf(struct(*map(col, ['velocity']))))&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any help understanding why the pipeline fails would be appreciated.&lt;/P&gt;</description>
      <pubDate>Thu, 04 May 2023 12:48:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/isolation-forest-prediction-failing-dlt-pipeline-the-same-model/m-p/4799#M214</guid>
      <dc:creator>MukulDegweker</dc:creator>
      <dc:date>2023-05-04T12:48:45Z</dc:date>
    </item>
    <item>
      <title>Re: Isolation Forest prediction failing DLT pipeline, the same model works fine when prediction is done outside DLT pipeline.</title>
      <link>https://community.databricks.com/t5/machine-learning/isolation-forest-prediction-failing-dlt-pipeline-the-same-model/m-p/4801#M216</link>
      <description>&lt;P&gt;Hi @Mukul Degweker&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2023 08:09:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/isolation-forest-prediction-failing-dlt-pipeline-the-same-model/m-p/4801#M216</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-05-19T08:09:17Z</dc:date>
    </item>
    <item>
      <title>Re: Isolation Forest prediction failing DLT pipeline, the same model works fine when prediction is done outside DLT pipeline.</title>
      <link>https://community.databricks.com/t5/machine-learning/isolation-forest-prediction-failing-dlt-pipeline-the-same-model/m-p/4800#M215</link>
      <description>&lt;P&gt;Hi, I found a similar thread here: &lt;A href="https://stackoverflow.com/questions/11685936/why-am-i-getting-attributeerror-object-has-no-attribute" alt="https://stackoverflow.com/questions/11685936/why-am-i-getting-attributeerror-object-has-no-attribute" target="_blank"&gt;https://stackoverflow.com/questions/11685936/why-am-i-getting-attributeerror-object-has-no-attribute&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Please let us know if this helps. Also, please tag &lt;A href="https://community.databricks.com/s/profile/0053f000000WWwvAAG" alt="https://community.databricks.com/s/profile/0053f000000WWwvAAG" target="_blank"&gt;@Debayan&lt;/A&gt; in your next comment so that I get notified. Thank you!&lt;/P&gt;</description>
      <pubDate>Sat, 06 May 2023 05:51:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/isolation-forest-prediction-failing-dlt-pipeline-the-same-model/m-p/4800#M215</guid>
      <dc:creator>Debayan</dc:creator>
      <dc:date>2023-05-06T05:51:17Z</dc:date>
    </item>
  </channel>
</rss>

