<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: One-hot encoding of strong cardinality features failing, causes downstream issues in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/96940#M3754</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105962"&gt;@rtreves&lt;/a&gt;, sorry, I was not able to investigate the above. If you are able to, please create a support ticket with Databricks, as reviewing the code may be an involved effort.&lt;/P&gt;
&lt;P&gt;I do have a suggestion: instead of relying on AutoML's automatic one-hot encoding, you can manually one-hot encode these columns. That way, you can ensure that the encoding is applied correctly and that the resulting columns have the appropriate type.&lt;/P&gt;</description>
    <pubDate>Thu, 31 Oct 2024 09:18:39 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2024-10-31T09:18:39Z</dc:date>
    <item>
      <title>One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71254#M3314</link>
      <description>&lt;P&gt;Hi Databricks support,&lt;/P&gt;&lt;P&gt;I'm training an ML model with MLflow on DBR 13.3 LTS ML (Spark 3.4.1), using databricks.automl_runtime&amp;nbsp;0.2.17, databricks.automl&amp;nbsp;1.20.3, and shap&amp;nbsp;0.45.1. My training data has two float-type columns with three or fewer unique values, which AutoML flags for one-hot encoding. The training experiment finishes without error. In the notebook of the best-performing model, I set `shap_enabled` to `True` to see the SHAP values. However, the cell that computes the SHAP values raises the following error: "TypeError: no supported conversion for types: (dtype('O'),)" (full traceback attached).&lt;/P&gt;&lt;P&gt;From my debugging, I believe the error occurs because the one-hot encoding of the two aforementioned columns fails, leading to object columns being passed to `scipy.sparse.csr_matrix` within the shap package. Indeed, when I go into the training notebook and try to fit the one-hot encoder to the two columns, I get the message "Warning: No categorical columns found. Calling 'transform' will only return input data."&lt;/P&gt;&lt;P&gt;Let me know if a full reprex is needed, and the best way to supply it.&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2024 13:53:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71254#M3314</guid>
      <dc:creator>rtreves</dc:creator>
      <dc:date>2024-05-31T13:53:35Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71303#M3318</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105962"&gt;@rtreves&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;TypeError: no supported conversion for types: (dtype('O'),)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;This error means that the function received a data type it does not support. dtype('O') is NumPy's object dtype, which usually holds non-numeric values such as categorical values stored as strings.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The function expects numeric values, and non-numeric input leads to this error.&lt;/SPAN&gt;&lt;/P&gt;
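&lt;P&gt;A quick way to check for this, as a minimal sketch: list the columns that pandas stores with object dtype and coerce them to floats. The DataFrame `X` and its column names are hypothetical stand-ins for your training features and are not taken from your notebook.&lt;/P&gt;

```python
import pandas as pd

# Hypothetical feature frame: "flag_a" holds float values but is stored as object dtype.
X = pd.DataFrame({
    "flag_a": pd.Series([0.0, 1.0, 2.0, 1.0], dtype=object),
    "flag_b": [0.5, 0.5, 1.5, 0.5],
})

# Columns reported as dtype('O') are the ones a numeric-only function would reject.
object_cols = X.select_dtypes(include="object").columns.tolist()
print(object_cols)

# Coerce each object column to a numeric dtype; unparseable values become NaN.
for col in object_cols:
    X[col] = pd.to_numeric(X[col], errors="coerce")

print(X.dtypes)
```

&lt;P&gt;If `object_cols` comes back empty, the inputs are already numeric, and the object dtype is probably being introduced at a later step in the pipeline.&lt;/P&gt;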
</description>
      <pubDate>Sat, 01 Jun 2024 08:22:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71303#M3318</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-06-01T08:22:57Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71344#M3319</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/23233"&gt;@NandiniN&lt;/a&gt;&amp;nbsp;I've confirmed that the features in question are passed in as pandas Series with dtype float. AutoML flags these features as potentially categorical precisely because they are passed in as numeric types but have few unique values.&lt;/P&gt;</description>
      <pubDate>Sun, 02 Jun 2024 02:50:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71344#M3319</guid>
      <dc:creator>rtreves</dc:creator>
      <dc:date>2024-06-02T02:50:09Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71350#M3320</link>
      <description>&lt;P&gt;You earlier mentioned you could share a repro. Can you please do that so that I can check further?&lt;/P&gt;</description>
      <pubDate>Sun, 02 Jun 2024 04:57:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71350#M3320</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-06-02T04:57:13Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71384#M3321</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/23233"&gt;@NandiniN&lt;/a&gt;&amp;nbsp;, thanks for taking a look! I'm linking below a reprex notebook in two formats: .py for running as a databricks notebook on DBR LTS, and a .ipynb notebook for running as a jupyter notebook natively (though I haven't tested this format). Let me know if you have issues getting it running.&lt;/P&gt;&lt;P&gt;&lt;A href="https://drive.google.com/drive/folders/1V5hMzGlP3-nxXQUc-g8Y2qhZ3hs40ENs?usp=sharing" target="_blank"&gt;https://drive.google.com/drive/folders/1V5hMzGlP3-nxXQUc-g8Y2qhZ3hs40ENs?usp=sharing&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2024 02:38:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/71384#M3321</guid>
      <dc:creator>rtreves</dc:creator>
      <dc:date>2024-06-03T02:38:02Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/93754#M3725</link>
      <description>&lt;P&gt;Hi there,&lt;/P&gt;&lt;P&gt;I have the same issue after running AutoML without error. Is there any update on this? Cheers&lt;/P&gt;</description>
      <pubDate>Mon, 14 Oct 2024 04:19:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/93754#M3725</guid>
      <dc:creator>lilir5</dc:creator>
      <dc:date>2024-10-14T04:19:29Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/94123#M3728</link>
      <description>&lt;P&gt;No, unfortunately I haven't found any resolution to this issue yet.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Oct 2024 14:25:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/94123#M3728</guid>
      <dc:creator>rtreves</dc:creator>
      <dc:date>2024-10-15T14:25:03Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/94125#M3729</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/23233"&gt;@NandiniN&lt;/a&gt;&amp;nbsp;Were you able to use my reprex above to investigate this issue at all? Thank you.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Oct 2024 14:25:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/94125#M3729</guid>
      <dc:creator>rtreves</dc:creator>
      <dc:date>2024-10-15T14:25:47Z</dc:date>
    </item>
    <item>
      <title>Re: One-hot encoding of strong cardinality features failing, causes downstream issues</title>
      <link>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/96940#M3754</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105962"&gt;@rtreves&lt;/a&gt;, sorry, I was not able to investigate the above. If you are able to, please create a support ticket with Databricks, as reviewing the code may be an involved effort.&lt;/P&gt;
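&lt;P&gt;As a rough sketch of the manual encoding described below, assuming the features live in a pandas DataFrame: `pd.get_dummies` only encodes object, string, or category columns by default, so float columns have to be named explicitly through `columns`. The frame `df` and the column names `level`, `mode`, and `x` are hypothetical.&lt;/P&gt;

```python
import pandas as pd

# Hypothetical training frame; "level" and "mode" stand in for the two
# float columns with three or fewer unique values.
df = pd.DataFrame({
    "level": [0.0, 1.0, 2.0, 1.0],
    "mode": [0.0, 1.0, 0.0, 1.0],
    "x": [3.2, 1.7, 4.4, 0.9],
})
low_card_cols = ["level", "mode"]

# Name the float columns explicitly and request a float output dtype,
# so that no object or boolean columns reach later numeric-only steps.
encoded = pd.get_dummies(df, columns=low_card_cols, dtype=float)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

&lt;P&gt;The encoded frame can then be used in place of the original columns when training, so that AutoML sees ordinary numeric features.&lt;/P&gt;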
&lt;P&gt;I do have a suggestion: instead of relying on AutoML's automatic one-hot encoding, you can manually one-hot encode these columns. That way, you can ensure that the encoding is applied correctly and that the resulting columns have the appropriate type.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 09:18:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/one-hot-encoding-of-strong-cardinality-features-failing-causes/m-p/96940#M3754</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-10-31T09:18:39Z</dc:date>
    </item>
  </channel>
</rss>

