<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38912#M26815</link>
    <description>&lt;P&gt;Hi Kaniz,&lt;/P&gt;&lt;P&gt;Thank you for your reply.&lt;/P&gt;&lt;P&gt;I have looked further into the error: the file apparently cannot be loaded because of a warning that appears when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the Excel file, it reports the error in cells with formulas such as: =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol from these cells, I am able to read the file.&lt;/P&gt;&lt;P&gt;I tried the following code:&lt;/P&gt;&lt;PRE&gt;import pandas as pd
df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce')&lt;/PRE&gt;&lt;P&gt;but this resulted in the error "read_excel() got an unexpected keyword argument 'errors'".&lt;/P&gt;</description>
    <pubDate>Wed, 02 Aug 2023 11:02:46 GMT</pubDate>
    <dc:creator>Christine</dc:creator>
    <dc:date>2023-08-02T11:02:46Z</dc:date>
    <item>
      <title>pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38115#M26564</link>
      <description>&lt;P&gt;Not sure if this is the right place to ask this question, so let me know if it is not.&lt;/P&gt;&lt;P&gt;I am trying to read an xls file which contains #REF values in Databricks with pyspark.pandas. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula".&lt;/P&gt;&lt;P&gt;I assume the error occurs because of the #REF values in the file, so my question is whether there is any way to ignore the error and convert #REF to None? I need to read the file into pyspark.&lt;/P&gt;&lt;P&gt;I tried pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() and expected it to read the file, but got the error "read_excel() got an unexpected keyword argument 'errors'".&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jul 2023 13:41:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38115#M26564</guid>
      <dc:creator>Christine</dc:creator>
      <dc:date>2023-07-21T13:41:36Z</dc:date>
    </item>
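Since the question above asks how to turn #REF cells into None rather than fail the whole read, one workaround (a minimal sketch, not from the thread; the dataframe below only simulates what `pd.read_excel(..., engine="xlrd", dtype="str")` would return, since the actual file and column names are not available) is to read everything as strings and map Excel's error marker to a missing value before converting to Spark:

```python
import numpy as np
import pandas as pd

# Simulated result of reading the sheet with dtype="str";
# "#REF!" marks a broken reference cell (column names are hypothetical).
pdf = pd.DataFrame({"a": ["1", "#REF!", "3"], "b": ["x", "y", "#REF!"]})

# Replace the Excel error marker with NaN, pandas' missing value,
# which becomes null once the dataframe is handed to Spark.
cleaned = pdf.replace("#REF!", np.nan)
print(int(cleaned.isna().sum().sum()))  # 2 error cells became missing
```

Note this only cleans up #REF strings that made it into cell values; it does not help if xlrd aborts while parsing the workbook's defined names, as in the "Token 0x2d (AreaN)" error above.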
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38866#M26795</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;Thanks for your message.&lt;BR /&gt;You might want to load the data directly into a Spark dataframe. A couple of examples can be found in this Stack Overflow link: &lt;A href="https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe" target="_blank"&gt;https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;If this doesn't help, please let me know and I will dig further into it.&lt;BR /&gt;&lt;BR /&gt;Kind regards,&lt;/P&gt;</description>
      <pubDate>Tue, 01 Aug 2023 19:30:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38866#M26795</guid>
      <dc:creator>Siebert_Looije</dc:creator>
      <dc:date>2023-08-01T19:30:31Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38899#M26808</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Thank you for your reply &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I have already tried spark.read.format("com.crealytics.spark.excel"), but the Excel file version I am using is too old. I get the error "The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)".&lt;/P&gt;&lt;P&gt;I also tried spark.read.format("excel"), but this command never finishes.&lt;/P&gt;&lt;P&gt;I have looked further into the error: the file apparently cannot be loaded because of a warning that appears when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the Excel file, it reports the error in cells with formulas such as: =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol from these cells, I am able to read the file.&lt;/P&gt;&lt;P&gt;Would this help you in the investigation?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Aug 2023 08:05:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38899#M26808</guid>
      <dc:creator>Christine</dc:creator>
      <dc:date>2023-08-02T08:05:42Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38912#M26815</link>
      <description>&lt;P&gt;Hi Kaniz,&lt;/P&gt;&lt;P&gt;Thank you for your reply.&lt;/P&gt;&lt;P&gt;I have looked further into the error: the file apparently cannot be loaded because of a warning that appears when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the Excel file, it reports the error in cells with formulas such as: =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol from these cells, I am able to read the file.&lt;/P&gt;&lt;P&gt;I tried the following code:&lt;/P&gt;&lt;PRE&gt;import pandas as pd
df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce')&lt;/PRE&gt;&lt;P&gt;but this resulted in the error "read_excel() got an unexpected keyword argument 'errors'".&lt;/P&gt;</description>
      <pubDate>Wed, 02 Aug 2023 11:02:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38912#M26815</guid>
      <dc:creator>Christine</dc:creator>
      <dc:date>2023-08-02T11:02:46Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/39386#M26952</link>
      <description>&lt;P&gt;It sounds like you're trying to open an Excel file that has some invalid references, which causes an error when you try to read it with pyspark.pandas.read_excel().&lt;/P&gt;&lt;P&gt;One way to handle invalid references is to use the openpyxl engine instead of xlrd. openpyxl can handle invalid references and replace them with a None value.&lt;/P&gt;&lt;P&gt;Here's an example of how you can read your Excel file using pyspark.pandas and the openpyxl engine:&lt;/P&gt;&lt;PRE&gt;import pandas as pd
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
import pyspark.pandas as ps

# Set up the file path and sheet name
file_path = "/path/to/your/file.xlsx"
sheet_name = "sheet1"

# Set up the options and read the file
options = dict(header=1, keep_default_na=False, engine="openpyxl")
df_pandas = pd.read_excel(file_path, sheet_name=sheet_name, **options)

# Convert the pandas dataframe to a PySpark DataFrame
df_spark = ps.DataFrame(df_pandas).to_spark()

# Replace #REF values with None
df_spark = df_spark.withColumn(
    "_tmp",
    col("invalid_column_name").cast(StringType()).cast("double")
).drop("invalid_column_name")

# Show the resulting dataframe
df_spark.show()&lt;/PRE&gt;&lt;P&gt;In this example, read_excel() is configured to use the openpyxl engine instead of xlrd via the engine="openpyxl" option. This allows you to read the Excel file and handle invalid references.&lt;/P&gt;&lt;P&gt;After reading the file, the resulting pandas dataframe is converted to a PySpark dataframe using pyspark.pandas.DataFrame(df_pandas).to_spark(). A temporary column ("_tmp") is then created by casting the problematic column to string and then to double, which turns non-numeric values such as #REF! into null; the original column is dropped.&lt;/P&gt;&lt;P&gt;This approach should allow you to read your Excel file into PySpark and handle invalid references.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Aug 2023 16:03:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/39386#M26952</guid>
      <dc:creator>youssefmrini</dc:creator>
      <dc:date>2023-08-08T16:03:19Z</dc:date>
    </item>
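The double-cast in the reply above works because casting an unparseable string to double yields null in Spark. The same coercion can be seen without a Spark cluster: in plain pandas the analogue (a sketch, not part of the original reply; the sample values are made up) is pd.to_numeric with errors="coerce", which is essentially the errors='coerce' behavior the original poster was looking for:

```python
import pandas as pd

# A string column as read with dtype="str"; "#REF!" marks a broken reference.
s = pd.Series(["1.5", "#REF!", "2"])

# Coerce to numbers: anything unparseable becomes NaN, mirroring
# Spark's cast-to-double turning bad strings into null.
nums = pd.to_numeric(s, errors="coerce")
print(nums.isna().tolist())  # [False, True, False]
```

The caveat from earlier in the thread still applies: this only helps once the file has actually been read, so the openpyxl/crealytics route still fails for old BIFF5 .xls files.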
  </channel>
</rss>

