07-21-2023 06:41 AM
Not sure if this is the right place to ask this question, so let me know if it is not.
I am trying to read an .xls file which contains #REF values in Databricks with pyspark.pandas. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula".
I assume the error is occurring due to the #REF values in the file, so my question is: is there any way to ignore the error and convert #REF to None? I need to read the data into PySpark.
I tried pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() and expected it to read the file, but got the error "read_excel() got an unexpected keyword argument 'errors'".
Thank you!
08-01-2023 12:30 PM
Hi,
Thanks for your message.
You might want to load the data directly into a Spark DataFrame. A couple of examples can be found in this Stack Overflow link: https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe.
If this doesn't help, please ask and I will dig further into it.
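For reference, the approach in those linked answers can be sketched as below. This assumes the com.crealytics:spark-excel library is installed on the cluster; the path and sheet name are placeholders, not values from this thread.

```python
# Hedged sketch of the crealytics Excel reader from the linked answers.
def read_excel_with_crealytics(spark, path, sheet="Sheet1"):
    return (spark.read.format("com.crealytics.spark.excel")
            .option("dataAddress", f"'{sheet}'!A1")  # sheet name + top-left cell
            .option("header", "true")
            .option("inferSchema", "false")
            .load(path))
```

Called with an active SparkSession, this returns a Spark DataFrame with all columns as strings unless inferSchema is enabled.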
Kind regards,
08-02-2023 01:05 AM
Hi,
Thank you for your reply 🙂
I have already tried spark.read.format("com.crealytics.spark.excel"), but the Excel file version I am using is too old. I get the error "The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)".
I also tried spark.read.format("excel"), but this command never finishes.
I have been looking more into the error, and apparently the file cannot be loaded because of a warning that pops up when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I save the Excel file, it says the error is found in cells with formulas like "=IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO"))". When I remove the @ symbol from these cells, I am able to read the file.
Would this help you in the investigation?
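Since removing the "@" markers lets the file load, one pragmatic route is to re-save the workbook in the newer .xlsx format before reading it, which also sidesteps the BIFF5 limitation above. If LibreOffice is available where the file lives, the conversion can be scripted; the soffice binary name and the paths here are assumptions about the environment, not details from this thread.

```python
import subprocess

def libreoffice_convert_cmd(xls_path, out_dir):
    """Build the LibreOffice headless command that re-saves an .xls as .xlsx."""
    return ["soffice", "--headless", "--convert-to", "xlsx",
            "--outdir", out_dir, xls_path]

# Run the conversion (uncomment where LibreOffice is installed):
# subprocess.run(libreoffice_convert_cmd("/dbfs/tmp/file.xls", "/dbfs/tmp"),
#                check=True)
```

The converted .xlsx can then be read with readers that only support the newer format.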
08-02-2023 03:51 AM
Hi @Christine, based on the information provided, you are trying to read an .xls file containing #REF values in Databricks with pyspark.pandas.
When you try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()", you get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula". It is assumed that the error is occurring due to the #REF values in the file, and you want to know if there is any way to ignore the error and convert #REF to None.
You also tried pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() but got the error "read_excel() got an unexpected keyword argument 'errors'". Unfortunately, the errors parameter is not supported in pyspark.pandas.read_excel().
However, you can try the following workaround to convert #REF values to None:
1. Read the Excel file with pandas.read_excel(). Note that plain pandas does not accept an errors parameter here either, but with dtype='str' any error values such as "#REF!" arrive as literal strings.
import pandas as pd
df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str')
2. Replace "#REF!" strings and NaN values with None. (Note that fillna(value=None) raises a ValueError, so use replace() instead.)
import numpy as np
df = df.replace({'#REF!': None, np.nan: None})
3. Convert the resulting pandas DataFrame to a PySpark DataFrame using spark.createDataFrame().
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
spark_df = spark.createDataFrame(df)
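Outside Databricks, the #REF-to-None conversion can be checked on its own with a small in-memory frame (the sample values are made up):

```python
import numpy as np
import pandas as pd

# Stand-in for the frame produced by pd.read_excel(..., dtype='str'):
df = pd.DataFrame({"flag": ["YES", "#REF!", np.nan]})

# Map both the literal error string and NaN to None; note that
# df.fillna(value=None) would raise a ValueError instead.
df = df.replace({"#REF!": None, np.nan: None})

# spark.createDataFrame(df) would now see proper SQL NULLs for those cells.
```
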
08-02-2023 04:02 AM
Hi Kaniz,
Thank you for your reply.
I tried the following code:
08-08-2023 09:03 AM
It sounds like you're trying to open an Excel file that has some invalid references, which is causing an error when you try to read it with pyspark.pandas.read_excel().
One way to sidestep the xlrd parser error is to use the openpyxl engine instead of xlrd. Note that openpyxl only reads the newer .xlsx format, so the workbook would first need to be re-saved as .xlsx; once it loads, the invalid references can be replaced with None.
Here's an example of how you can read your Excel file using pyspark.pandas and the openpyxl engine:
In this example, read_excel() is configured to use the openpyxl engine instead of xlrd via the engine="openpyxl" option. This allows you to read the Excel file and handle invalid references.
After reading the file, the resulting pandas DataFrame is converted to a PySpark DataFrame using pyspark.pandas.DataFrame(df_pandas).to_spark(). A temporary column ("_tmp") is then created by casting the problematic column to a double and back to string. Finally, #REF values are replaced with None.
This approach should allow you to read your Excel file into PySpark and handle invalid references.
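The code block referenced above appears to have been lost from the thread; based on the description, a hedged reconstruction might look like the following. The file path, sheet name, and the "answer" column name are placeholders, and the openpyxl route assumes the workbook was first re-saved as .xlsx.

```python
import pandas as pd

# 1. Read with the openpyxl engine (uncomment with a real .xlsx path):
# df_pandas = pd.read_excel("/dbfs/path/file.xlsx", sheet_name="sheet_name",
#                           engine="openpyxl", dtype="str")

# Small stand-in frame so the cleanup step can be shown here:
df_pandas = pd.DataFrame({"answer": ["YES", "#REF!", "NO"]})

# 2. Replace Excel error strings with None before handing off to Spark:
df_pandas = df_pandas.replace({"#REF!": None})

# 3. On Databricks, convert to Spark and optionally force the problematic
#    column through a temporary "_tmp" cast, as described above:
# import pyspark.pandas as ps
# from pyspark.sql import functions as F
# spark_df = ps.DataFrame(df_pandas).to_spark()
# spark_df = (spark_df
#             .withColumn("_tmp", F.col("answer").cast("double").cast("string"))
#             .drop("answer")
#             .withColumnRenamed("_tmp", "answer"))
```
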