Data Engineering

cannot convert Parquet type INT64 to Photon type string

JLSy
New Contributor III

I am receiving an error similar to the one described in this post: https://community.databricks.com/s/question/0D58Y00009d8h4tSAA/cannot-convert-parquet-type-int64-to-...

However, instead of type double, my error message states that the type cannot be converted to string.

In short, I am trying to load data from our S3 bucket into a Databricks workspace via the spark.read and spark.write methods, but I encounter the error message "Error while reading file: Schema conversion error: cannot convert Parquet type INT64 to Photon type string".

I have tried the Spark cluster configuration stated in the post mentioned above, but it does not solve my issue. I was wondering whether a similar configuration is needed (perhaps with a small edit to the previous solution), or whether some other solution is available.
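For context, the read and write calls I am running look roughly like this (the bucket path and output location are placeholders for our actual paths):

python

# Placeholder paths; the read call is where the Photon schema conversion error is raised
df = spark.read.parquet("s3://our-bucket/path/to/data/")
df.write.mode("overwrite").parquet("s3://our-bucket/path/to/output/")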

5 REPLIES

pvignesh92
Honored Contributor

Hi @John Laurence Sy, could you clarify whether the Parquet files you are reading have different datatypes for the same column? I'm wondering why Spark is trying to convert the schema from INT to string.
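If it helps, one quick way to inspect the physical types that each Parquet file actually declares is to read the file footers directly. Here is a rough sketch using pyarrow; the mount path is a placeholder for your actual location:

python

import pyarrow.parquet as pq

# Placeholder mount path; list the Parquet files and print the schema each one declares
files = [f.path for f in dbutils.fs.ls("/mnt/our-bucket/path/to/data/") if f.path.endswith(".parquet")]
for path in files:
    print(path)
    print(pq.read_schema(path.replace("dbfs:", "/dbfs")))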

JLSy
New Contributor III

Hello @Vigneshraja Palaniraj, I have verified that all the columns are assigned only one of the following types: string, date, double, bigint, decimal(20,2), int.

Anonymous
Not applicable

@John Laurence Sy:

It sounds like you are encountering a schema conversion error when trying to read in a Parquet file that contains an INT64 column that cannot be converted to a string type. This error can occur when the Parquet file has a schema that is incompatible with the expected schema of the Spark DataFrame.

One possible solution is to explicitly specify the schema of the Parquet file when reading it into a Spark DataFrame by passing a schema to the reader with spark.read.schema(...) before calling parquet(). This ensures the file is read with the expected column types and can help avoid type conversion errors. For example:

python

from pyspark.sql.types import StructType, StructField, LongType, StringType
 
# Define the schema of the Parquet file
schema = StructType([
    StructField("int_column", LongType(), True),
    StructField("string_column", StringType(), True)
])
 
# Read in the Parquet file with the specified schema
df = spark.read.schema(schema).parquet("s3://path/to/parquet/file")

In this example, the schema of the Parquet file contains an INT64 column and a string column, which are explicitly defined using the StructType and StructField classes. The LongType() and StringType() functions define the data types of the columns, and the schema is applied via the reader's schema() method.

Alternatively, you can try converting the INT64 column to a string column in the Parquet file itself before reading it into Spark. This can be done using tools like Apache Arrow or Pandas. Once the column is converted, the Parquet file can be read in normally without encountering any schema conversion errors.
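For instance, a rough sketch of the pandas approach (the paths and column name are placeholders, and this assumes the file fits in driver memory) could look like this:

python

import pandas as pd

# Placeholder paths and column name; cast the INT64 column to string
# and write out a new Parquet file that matches the expected schema
pdf = pd.read_parquet("/dbfs/mnt/our-bucket/path/to/file.parquet")
pdf["int_column"] = pdf["int_column"].astype("string")
pdf.to_parquet("/dbfs/mnt/our-bucket/path/to/file_fixed.parquet", index=False)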

I hope this helps! Let me know if you have any further questions.

JLSy
New Contributor III

Hello @Suteja Kanuri,

I'll go ahead and implement this method, thanks! I'll update this thread if there are any issues.

JLSy
New Contributor III

I have tried specifying the schema and assigning the following mapping to each column type:

  • string - StringType()
  • date - DateType()
  • double - DoubleType()
  • bigint - LongType()
  • int - LongType()
  • decimal(20,2) - LongType()

I have also tried using other Spark types for the decimal(20,2), int, bigint, and double columns; however, the error still persists.
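For reference, this is roughly how I am building the schema with that mapping (column names are placeholders for our actual columns):

python

from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType, LongType

# Placeholder column names; each field follows the mapping listed above
schema = StructType([
    StructField("string_col", StringType(), True),
    StructField("date_col", DateType(), True),
    StructField("double_col", DoubleType(), True),
    StructField("bigint_col", LongType(), True),
    StructField("int_col", LongType(), True),
    StructField("decimal_col", LongType(), True)  # decimal(20,2) mapped to LongType as listed above
])

df = spark.read.schema(schema).parquet("s3://our-bucket/path/to/data/")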
