<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pyspark Pandas column or index name appears to persist after being dropped or removed. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19275#M12910</link>
    <description>&lt;P&gt;Yeah, I found it works perfectly fine in normal pandas but not in pyspark.pandas; ultimately, I want to use pyspark.pandas. Apologies, I should have included that in the original post. It appears to be a pyspark problem.&lt;/P&gt;</description>
    <pubDate>Thu, 01 Dec 2022 18:55:03 GMT</pubDate>
    <dc:creator>Callum</dc:creator>
    <dc:date>2022-12-01T18:55:03Z</dc:date>
    <item>
      <title>Pyspark Pandas column or index name appears to persist after being dropped or removed.</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19273#M12908</link>
      <description>&lt;P&gt;So, I have this code for merging dataframes with pyspark pandas, and I want the index of the left dataframe to persist throughout the joins. Following suggestions from others who wanted to keep the index after merging, I copy the index into a column before the merge, then set that column back as the index after the merge and remove the index name.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark.pandas as ps
&amp;nbsp;
&amp;nbsp;
def merge_dataframes(left=None, right=None, how='inner', on=None, left_on=None,
                     right_on=None, left_index=False, right_index=False):
    merging_left = left.copy()
    merging_left['weird_index_name'] = merging_left.index
&amp;nbsp;
    new_df = merging_left.merge(right, on=on, how=how, left_on=left_on, right_on=right_on, suffixes=('', '_dupe_right'),
                                left_index=left_index, right_index=right_index)
&amp;nbsp;
    returning_df = new_df.set_index('weird_index_name')
&amp;nbsp;
    returning_df.index.name = None
    return returning_df
&amp;nbsp;
&amp;nbsp;
df_1 = ps.DataFrame({
    'join_column': [1, 2, 3, 4],
    'value1': ['A', 'B', 'C', 'D']
}, index=['Index1', 'Index2', 'Index3', 'Index4'])
df_2 = ps.DataFrame({
    'join_column': [1, 2, 3, 4, 5],
    'value2': ['a', 'b', 'c', 'd', 'e']
})
df_3 = ps.DataFrame({
    'join_column': [1, 2, 3, 4, 6, 7],
    'value3': [1.1, 2.2, 3.3, 4.4, 6.6, 7.7]
})
&amp;nbsp;
input_list = [df_1, df_2, df_3]
&amp;nbsp;
expected_result = ps.DataFrame({
    'join_column': [1, 2, 3, 4],
    'value1': ['A', 'B', 'C', 'D'],
    'value2': ['a', 'b', 'c', 'd'],
    'value3': [1.1, 2.2, 3.3, 4.4]
}, index=['Index1', 'Index2', 'Index3', 'Index4'])
&amp;nbsp;
final_df = input_list[0]
for next_df in input_list[1:]:
    final_df = merge_dataframes(left=final_df, right=next_df, how='left', on='join_column')
&amp;nbsp;
print(final_df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This works perfectly fine for merging two dataframes, but as soon as I merge a list of dataframes together using a for loop, I get this error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;AnalysisException                         Traceback (most recent call last)
&amp;lt;command-1950358519842329&amp;gt; in &amp;lt;module&amp;gt;
     42 final_df = input_list[0]
     43 for next_df in input_list[1:]:
---&amp;gt; 44     final_df = merge_dataframes(left=final_df, right=next_df, how='left', on='join_column')
     45 
     46 print(final_df)
&amp;nbsp;
&amp;lt;command-1950358519842329&amp;gt; in merge_dataframes(left, right, how, on, left_on, right_on, left_index, right_index)
      7     merging_left['weird_index_name'] = merging_left.index
      8 
----&amp;gt; 9     new_df = merging_left.merge(right, on=on, how=how, left_on=left_on, right_on=right_on, suffixes=('', '_dupe_right'),
     10                                 left_index=left_index, right_index=right_index)
     11 
&amp;nbsp;
/databricks/spark/python/pyspark/pandas/usage_logging/__init__.py in wrapper(*args, **kwargs)
    192             start = time.perf_counter()
    193             try:
--&amp;gt; 194                 res = func(*args, **kwargs)
    195                 logger.log_success(
    196                     class_name, function_name, time.perf_counter() - start, signature
&amp;nbsp;
/databricks/spark/python/pyspark/pandas/frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, suffixes)
   7655             )
   7656 
-&amp;gt; 7657         left_internal = self._internal.resolved_copy
   7658         right_internal = resolve(right._internal, "right")
   7659 
&amp;nbsp;
/databricks/spark/python/pyspark/pandas/utils.py in wrapped_lazy_property(self)
    578     def wrapped_lazy_property(self):
    579         if not hasattr(self, attr_name):
--&amp;gt; 580             setattr(self, attr_name, fn(self))
    581         return getattr(self, attr_name)
    582 
&amp;nbsp;
/databricks/spark/python/pyspark/pandas/internal.py in resolved_copy(self)
   1169         return self.copy(
   1170             spark_frame=sdf,
-&amp;gt; 1171             index_spark_columns=[scol_for(sdf, col) for col in self.index_spark_column_names],
   1172             data_spark_columns=[scol_for(sdf, col) for col in self.data_spark_column_names],
   1173         )
&amp;nbsp;
/databricks/spark/python/pyspark/pandas/internal.py in &amp;lt;listcomp&amp;gt;(.0)
   1169         return self.copy(
   1170             spark_frame=sdf,
-&amp;gt; 1171             index_spark_columns=[scol_for(sdf, col) for col in self.index_spark_column_names],
   1172             data_spark_columns=[scol_for(sdf, col) for col in self.data_spark_column_names],
   1173         )
&amp;nbsp;
/databricks/spark/python/pyspark/pandas/utils.py in scol_for(sdf, column_name)
    590 def scol_for(sdf: SparkDataFrame, column_name: str) -&amp;gt; Column:
    591     """Return Spark Column for the given column name."""
--&amp;gt; 592     return sdf["`{}`".format(column_name)]
    593 
    594 
&amp;nbsp;
/databricks/spark/python/pyspark/sql/dataframe.py in __getitem__(self, item)
   1775         """
   1776         if isinstance(item, str):
-&amp;gt; 1777             jc = self._jdf.apply(item)
   1778             return Column(jc)
   1779         elif isinstance(item, Column):
&amp;nbsp;
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-&amp;gt; 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
&amp;nbsp;
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    121                 # Hide where the exception came from that shows a non-Pythonic
    122                 # JVM exception message.
--&amp;gt; 123                 raise converted from None
    124             else:
    125                 raise
&amp;nbsp;
AnalysisException: Reference 'weird_index_name' is ambiguous, could be: weird_index_name, weird_index_name.&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;From my understanding, this suggests there is another column called "weird_index_name". However, when I display the dataframe before it goes into the merge, there is only one column called "weird_index_name" on both calls of the function.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This led me to think that "weird_index_name" persists as the index name after:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;returning_df = new_df.set_index('weird_index_name')
returning_df.index.name = None&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, printing returning_df.index.name returns None after the merge of the first two dataframes, before the second call of the function. The same is true when printing merging_left.index.name before the second call of the function.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I introduce this line before the merge:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;print(merging_left.index.to_list())&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The merge of the first two dataframes works, and I then get this error at that line when joining the third dataframe (on the second call of the function):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Length mismatch: Expected axis has 5 elements, new values have 4 elements&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This leads me to believe there is an issue with the index and that "weird_index_name" somehow persists.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any help would be appreciated!!&lt;/P&gt;</description>
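One way to sidestep the ambiguous-reference error is to give the helper column a unique name on every call, so a lazily-resolved Spark plan never sees the same column name twice across chained merges. Below is a minimal sketch of that pattern, shown with plain pandas so it is easy to test (pyspark.pandas exposes the same merge/set_index API, but behavior there should be verified); the `merge_keep_left_index` name and the counter are illustrative, not from the thread:

```python
import pandas as pd
from itertools import count

_counter = count()  # hypothetical helper: yields a fresh suffix per call


def merge_keep_left_index(left, right, how='inner', on=None):
    # Use a unique helper-column name on each call so chained merges
    # never introduce two columns with the same name.
    helper = f'_left_index_{next(_counter)}'
    merging_left = left.copy()
    merging_left[helper] = merging_left.index
    merged = merging_left.merge(right, on=on, how=how,
                                suffixes=('', '_dupe_right'))
    merged = merged.set_index(helper)
    merged.index.name = None
    return merged


df_1 = pd.DataFrame({'join_column': [1, 2, 3, 4],
                     'value1': ['A', 'B', 'C', 'D']},
                    index=['Index1', 'Index2', 'Index3', 'Index4'])
df_2 = pd.DataFrame({'join_column': [1, 2, 3, 4, 5],
                     'value2': ['a', 'b', 'c', 'd', 'e']})
df_3 = pd.DataFrame({'join_column': [1, 2, 3, 4, 6, 7],
                     'value3': [1.1, 2.2, 3.3, 4.4, 6.6, 7.7]})

final_df = df_1
for next_df in (df_2, df_3):
    final_df = merge_keep_left_index(final_df, next_df,
                                     how='left', on='join_column')
```

Because each iteration drops its own uniquely named helper via `set_index`, no column name can collide even if the underlying plan keeps references to intermediate frames.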
      <pubDate>Thu, 01 Dec 2022 15:05:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19273#M12908</guid>
      <dc:creator>Callum</dc:creator>
      <dc:date>2022-12-01T15:05:53Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Pandas column or index name appears to persist after being dropped or removed.</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19274#M12909</link>
      <description>&lt;P&gt;This worked for me:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pandas as ps  # note: plain pandas here, not pyspark.pandas
&amp;nbsp;
&amp;nbsp;
 
def merge_dataframes(left=None, right=None, how='inner', on=None, left_on=None,
                     right_on=None, left_index=False, right_index=False):
    merging_left = left.copy()
    merging_left['weird_index_name'] = merging_left.index
 
    new_df = merging_left.merge(right, on=on, how=how, left_on=left_on, right_on=right_on, suffixes=('', '_dupe_right'),
                                left_index=left_index, right_index=right_index)
 
    returning_df = new_df.set_index('weird_index_name')
 
    returning_df.index.name = None
    return returning_df
&amp;nbsp;
&amp;nbsp;
 
df_1 = ps.DataFrame({
    'join_column': [1, 2, 3, 4],
    'value1': ['A', 'B', 'C', 'D']
}, index=['Index1', 'Index2', 'Index3', 'Index4'])
df_2 = ps.DataFrame({
    'join_column': [1, 2, 3, 4, 5],
    'value2': ['a', 'b', 'c', 'd', 'e']
})
df_3 = ps.DataFrame({
    'join_column': [1, 2, 3, 4, 5, 6],
    'value3': [1.1, 2.2, 3.3, 4.4, 6.6, 7.7]
})
 
input_list = [df_1, df_2, df_3]
 
final_df = input_list[0]
for next_df in input_list[1:]:
    final_df = merge_dataframes(left=final_df, right=next_df, how='left', on='join_column')
 
print(final_df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
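The loop in the snippet above can also be written as a fold over the list of dataframes. A minimal sketch with plain pandas (as in the reply above), using the same helper-column approach; `functools.reduce` is standard library:

```python
import pandas as pd
from functools import reduce


def merge_dataframes(left, right, how='inner', on=None):
    # Same helper-column trick as in the thread: stash the left index
    # in a column, merge, then restore it as the index.
    merging_left = left.copy()
    merging_left['weird_index_name'] = merging_left.index
    new_df = merging_left.merge(right, on=on, how=how,
                                suffixes=('', '_dupe_right'))
    returning_df = new_df.set_index('weird_index_name')
    returning_df.index.name = None
    return returning_df


df_1 = pd.DataFrame({'join_column': [1, 2, 3, 4],
                     'value1': ['A', 'B', 'C', 'D']},
                    index=['Index1', 'Index2', 'Index3', 'Index4'])
df_2 = pd.DataFrame({'join_column': [1, 2, 3, 4, 5],
                     'value2': ['a', 'b', 'c', 'd', 'e']})
df_3 = pd.DataFrame({'join_column': [1, 2, 3, 4, 6, 7],
                     'value3': [1.1, 2.2, 3.3, 4.4, 6.6, 7.7]})

# Fold the whole list into one frame with left joins.
final_df = reduce(
    lambda acc, nxt: merge_dataframes(acc, nxt, how='left', on='join_column'),
    [df_2, df_3], df_1)
```

This is equivalent to the explicit for loop; in eager pandas each call fully materializes its result, which is why the name reuse is harmless here.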
      <pubDate>Thu, 01 Dec 2022 18:41:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19274#M12909</guid>
      <dc:creator>irfanaziz</dc:creator>
      <dc:date>2022-12-01T18:41:23Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Pandas column or index name appears to persist after being dropped or removed.</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19275#M12910</link>
      <description>&lt;P&gt;Yeah, I found it works perfectly fine in normal pandas but not in pyspark.pandas; ultimately, I want to use pyspark.pandas. Apologies, I should have included that in the original post. It appears to be a pyspark problem.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 18:55:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19275#M12910</guid>
      <dc:creator>Callum</dc:creator>
      <dc:date>2022-12-01T18:55:03Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Pandas column or index name appears to persist after being dropped or removed.</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19276#M12911</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried debugging your code, and I think the error you get is simply because the column exists in two instances of your dataframe within your loop.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried adding some extra debug lines in your merge_dataframes function:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-01-31 at 11.46.10"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1069i310976461829AC7D/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-01-31 at 11.46.10" alt="Screenshot 2023-01-31 at 11.46.10" /&gt;&lt;/span&gt;After executing that, I also ran the rest of the code, but stopped before the loop.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Instead of running the loop, I broke the code down and ran it piece by piece.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;First, let's load the first df:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-01-31 at 11.49.07"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1064i11C8D2F8B3702427/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-01-31 at 11.49.07" alt="Screenshot 2023-01-31 at 11.49.07" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then let's run the merge using the first element (1).&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-01-31 at 11.54.55"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1071i2595AA37E083479F/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-01-31 at 11.54.55" alt="Screenshot 2023-01-31 at 11.54.55" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We can see that the "weird_index_name" column gets created on this df.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To show why it would fail, I will execute final_df = input_list[0] again to reinitialize the df and then run the merge on the second element instead:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Screenshot 2023-01-31 at 11.57.40"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1062i33912D99D49E105C/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2023-01-31 at 11.57.40" alt="Screenshot 2023-01-31 at 11.57.40" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can see that in both cases the intermediate dataframe is created with the same column "weird_index_name".&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So if you run your original code, by the third iteration there is already a dataframe with this column, which explains why you get the error: Reference 'weird_index_name' is ambiguous&lt;/P&gt;</description>
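Serlal's diagnosis also suggests a helper-free variant: if the right frame is joined via its index, pandas carries the caller's index through the operation, so no "weird_index_name" column is ever created and nothing can collide. A hedged sketch with plain pandas (pyspark.pandas also provides `DataFrame.join`, though the index-preservation behavior there should be verified); `merge_keep_left_index` is an illustrative name:

```python
import pandas as pd


def merge_keep_left_index(left, right, on, how='left'):
    # DataFrame.join keeps the caller's index; joining the caller's
    # 'on' column against the right frame's index avoids any helper
    # column entirely.
    return left.join(right.set_index(on), on=on, how=how)


df_1 = pd.DataFrame({'join_column': [1, 2, 3, 4],
                     'value1': ['A', 'B', 'C', 'D']},
                    index=['Index1', 'Index2', 'Index3', 'Index4'])
df_2 = pd.DataFrame({'join_column': [1, 2, 3, 4, 5],
                     'value2': ['a', 'b', 'c', 'd', 'e']})
df_3 = pd.DataFrame({'join_column': [1, 2, 3, 4, 6, 7],
                     'value3': [1.1, 2.2, 3.3, 4.4, 6.6, 7.7]})

final_df = df_1
for next_df in (df_2, df_3):
    final_df = merge_keep_left_index(final_df, next_df, on='join_column')
```

Since no transient column is added in any iteration, there is no name for a lazy plan to see twice, regardless of how many dataframes are chained.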
      <pubDate>Tue, 31 Jan 2023 11:01:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-column-or-index-name-appears-to-persist-after/m-p/19276#M12911</guid>
      <dc:creator>Serlal</dc:creator>
      <dc:date>2023-01-31T11:01:12Z</dc:date>
    </item>
  </channel>
</rss>

