<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: pyspark.pandas PandasNotImplementedError in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-pandasnotimplementederror/m-p/62138#M31907</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/2427"&gt;@pjp94&lt;/a&gt;&amp;nbsp;- The error indicates that the pandas API on Spark (pyspark.pandas) does not implement the method below.&lt;/P&gt;
&lt;PRE class="lia-code-sample  language-markup"&gt;&lt;CODE&gt;pd.Series.duplicated()&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The next step is to use Spark DataFrame methods such as distinct, groupBy, or dropDuplicates instead.&lt;/P&gt;</description>
    <pubDate>Tue, 27 Feb 2024 17:47:22 GMT</pubDate>
    <dc:creator>shan_chandra</dc:creator>
    <dc:date>2024-02-27T17:47:22Z</dc:date>
    <item>
      <title>pyspark.pandas PandasNotImplementedError</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-pandasnotimplementederror/m-p/61621#M31815</link>
      <description>&lt;P&gt;Can someone explain why the code below throws an error? My intuition tells me it's my Spark version (3.2.1), but I would like confirmation:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;d = {'key':['a','a','c','d','e','f','g','h'],
     'data':[1,2,3,4,5,6,7,8]}

x = ps.DataFrame(d)
x[x['key'].duplicated()]

-----------------------------------------------------------------------------
PandasNotImplementedError                 Traceback (most recent call last)
&amp;lt;command-6518367&amp;gt; in &amp;lt;module&amp;gt;
      3 
      4 x = ps.DataFrame(d)
----&amp;gt; 5 x[x['key'].duplicated()]

/databricks/spark/python/pyspark/pandas/usage_logging/__init__.py in wrapper(*args, **kwargs)
    192             start = time.perf_counter()
    193             try:
--&amp;gt; 194                 res = func(*args, **kwargs)
    195                 logger.log_success(
    196                     class_name, function_name, time.perf_counter() - start, signature

/databricks/spark/python/pyspark/pandas/series.py in __getattr__(self, item)
   6276             property_or_func = getattr(MissingPandasLikeSeries, item)
   6277             if isinstance(property_or_func, property):
-&amp;gt; 6278                 return property_or_func.fget(self)  # type: ignore
   6279             else:
   6280                 return partial(property_or_func, self)

/databricks/spark/python/pyspark/pandas/usage_logging/__init__.py in wrapper(self)
    261     def wrapper(self):
    262         try:
--&amp;gt; 263             return prop.fget(self)
    264         finally:
    265             logger.log_missing(class_name, property_name, is_deprecated)

/databricks/spark/python/pyspark/pandas/missing/__init__.py in unsupported_property(self)
     36     @property
     37     def unsupported_property(self):
---&amp;gt; 38         raise PandasNotImplementedError(
     39             class_name=class_name, property_name=property_name, reason=reason
     40         )

PandasNotImplementedError: The property `pd.Series.duplicated()` is not implemented. 'duplicated' API returns np.ndarray and the data size is too large.You can just use DataFrame.deduplicated instead&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 22 Feb 2024 22:37:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-pandasnotimplementederror/m-p/61621#M31815</guid>
      <dc:creator>pjp94</dc:creator>
      <dc:date>2024-02-22T22:37:11Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.pandas PandasNotImplementedError</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-pandasnotimplementederror/m-p/62138#M31907</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/2427"&gt;@pjp94&lt;/a&gt;&amp;nbsp;- The error indicates that the pandas API on Spark (pyspark.pandas) does not implement the method below.&lt;/P&gt;
&lt;PRE class="lia-code-sample  language-markup"&gt;&lt;CODE&gt;pd.Series.duplicated()&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The next step is to use Spark DataFrame methods such as distinct, groupBy, or dropDuplicates instead.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Feb 2024 17:47:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-pandasnotimplementederror/m-p/62138#M31907</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-02-27T17:47:22Z</dc:date>
    </item>
  </channel>
</rss>

