Data Engineering

cannot convert Parquet type INT64 to Photon type string

JLSy
New Contributor III

I am receiving an error similar to the one described in this post: https://community.databricks.com/s/question/0D58Y00009d8h4tSAA/cannot-convert-parquet-type-int64-to-...

However, instead of type double, my error message states that the type cannot be converted to string.

In short, I am trying to load data from our S3 bucket into a Databricks workspace via the spark.read and spark.write methods, but I encounter the error message "Error while reading file: Schema conversion error: cannot convert Parquet type INT64 to Photon type string".

I have tried the Spark cluster configuration stated in the post mentioned above, but it does not solve my issue. I was wondering whether a similar configuration is needed (perhaps with a small edit to the previous solution), or whether some other solution is available.
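For context, the read and write calls I am running look roughly like this (the bucket path and output location are placeholders for our actual paths):

python

# Placeholder paths; the read call is where the Photon schema conversion error is raised
df = spark.read.parquet("s3://our-bucket/path/to/data/")
df.write.mode("overwrite").parquet("s3://our-bucket/path/to/output/")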

5 REPLIES

pvignesh92
Honored Contributor

Hi @John Laurence Sy, could you clarify whether the Parquet files you are reading have different datatypes for the same column? I'm wondering why Spark is trying to convert the schema from INT to string.
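If it helps, one quick way to inspect the physical types that each Parquet file actually declares is to read the file footers directly. Here is a rough sketch using pyarrow; the mount path is a placeholder for your actual location:

python

import pyarrow.parquet as pq

# Placeholder mount path; list the Parquet files and print the schema each one declares
files = [f.path for f in dbutils.fs.ls("/mnt/our-bucket/path/to/data/") if f.path.endswith(".parquet")]
for path in files:
    print(path)
    print(pq.read_schema(path.replace("dbfs:", "/dbfs")))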

JLSy
New Contributor III

Hello @Vigneshraja Palaniraj, I have verified that all the columns are assigned only one of the following types: string, date, double, bigint, decimal(20,2), int.

Anonymous
Not applicable

@John Laurence Sy:

It sounds like you are encountering a schema conversion error when trying to read in a Parquet file that contains an INT64 column that cannot be converted to a string type. This error can occur when the Parquet file has a schema that is incompatible with the expected schema of the Spark DataFrame.

One possible solution is to explicitly specify the schema of the Parquet file when reading it into a Spark DataFrame by passing a schema to the reader with spark.read.schema(...) before calling parquet(). This ensures the file is read with the expected column types and can help avoid type conversion errors. For example:

python

from pyspark.sql.types import StructType, StructField, LongType, StringType
 
# Define the schema of the Parquet file
schema = StructType([
    StructField("int_column", LongType(), True),
    StructField("string_column", StringType(), True)
])
 
# Read in the Parquet file with the specified schema
df = spark.read.schema(schema).parquet("s3://path/to/parquet/file")

In this example, the schema of the Parquet file contains an INT64 column and a string column, which are explicitly defined using the StructType and StructField classes. The LongType() and StringType() functions define the data types of the columns, and the schema is applied via the reader's schema() method.

Alternatively, you can try converting the INT64 column to a string column in the Parquet file itself before reading it into Spark. This can be done using tools like Apache Arrow or Pandas. Once the column is converted, the Parquet file can be read in normally without encountering any schema conversion errors.
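For instance, a rough sketch of the pandas approach (the paths and column name are placeholders, and this assumes the file fits in driver memory) could look like this:

python

import pandas as pd

# Placeholder paths and column name; cast the INT64 column to string
# and write out a new Parquet file that matches the expected schema
pdf = pd.read_parquet("/dbfs/mnt/our-bucket/path/to/file.parquet")
pdf["int_column"] = pdf["int_column"].astype("string")
pdf.to_parquet("/dbfs/mnt/our-bucket/path/to/file_fixed.parquet", index=False)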

I hope this helps! Let me know if you have any further questions.

JLSy
New Contributor III

Hello @Suteja Kanuri,

I'll go ahead and implement this method, thanks! I'll update this thread if there are any issues.

JLSy
New Contributor III

I have tried specifying the schema and assigning the following mapping to each column type:

  • string - StringType()
  • date - DateType()
  • double - DoubleType()
  • bigint - LongType()
  • int - LongType()
  • decimal(20,2) - LongType()

I have also tried using other Spark types for the decimal(20,2), int, bigint, and double columns; however, the error still persists.
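For reference, this is roughly how I am building the schema with that mapping (column names are placeholders for our actual columns):

python

from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType, LongType

# Placeholder column names; each field follows the mapping listed above
schema = StructType([
    StructField("string_col", StringType(), True),
    StructField("date_col", DateType(), True),
    StructField("double_col", DoubleType(), True),
    StructField("bigint_col", LongType(), True),
    StructField("int_col", LongType(), True),
    StructField("decimal_col", LongType(), True)  # decimal(20,2) mapped to LongType as listed above
])

df = spark.read.schema(schema).parquet("s3://our-bucket/path/to/data/")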
