cannot convert Parquet type INT64 to Photon type string
04-16-2023 07:29 PM
I am receiving an error similar to the post in this link: https://community.databricks.com/s/question/0D58Y00009d8h4tSAA/cannot-convert-parquet-type-int64-to-...
However, instead of type double, the error message states that the type cannot be converted to string.
In short, I am trying to mount data from our S3 bucket into a Databricks workspace instance using spark.read and spark.write, but I encounter the error message "Error while reading file: Schema conversion error: cannot convert Parquet type INT64 to Photon type string".
I have tried the Spark cluster configuration from the post mentioned above, but it does not solve my current issue. Is a similar configuration needed (one with only a small edit to the previous solution), or is some other solution available?
Labels: Copy into, Error, Error Message, Parquet Type, Photon, Post
04-17-2023 04:46 AM
Hi @John Laurence Sy, could you clarify whether the Parquet files you are reading have different data types for the same column? I'm wondering why Spark is trying to convert the schema from INT to String.
04-17-2023 06:12 PM
Hello @Vigneshraja Palaniraj, I have verified that all the columns are assigned only one of the following types: string, date, double, bigint, decimal(20,2), int.
04-18-2023 01:53 AM
@John Laurence Sy:
It sounds like you are encountering a schema conversion error when reading a Parquet file in which a column is stored as INT64 while the reader expects that column to be a string. This error can occur when the schema stored in the Parquet file is incompatible with the schema expected for the Spark DataFrame.
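To check whether the files really disagree on a column's type, one option is to compare the schemas stored in each file's footer. The following is a minimal sketch, not a confirmed fix: it assumes pyarrow is installed and that the files are reachable as local paths (for example, downloaded copies), and the helper names `read_parquet_schema` and `find_type_conflicts` as well as the file names in the example are hypothetical.

```python
from typing import Dict, List, Tuple


def read_parquet_schema(path: str) -> Dict[str, str]:
    """Return {column name: Arrow type string} read from one file's Parquet footer."""
    import pyarrow.parquet as pq  # imported lazily; assumes pyarrow is installed

    schema = pq.read_schema(path)
    return {field.name: str(field.type) for field in schema}


def find_type_conflicts(
    schemas_by_file: Dict[str, Dict[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Return the columns whose type differs between files, as (file, type) pairs."""
    seen: Dict[str, List[Tuple[str, str]]] = {}
    for path, schema in schemas_by_file.items():
        for column, type_name in schema.items():
            seen.setdefault(column, []).append((path, type_name))
    return {
        column: entries
        for column, entries in seen.items()
        if len({type_name for _, type_name in entries}) > 1
    }


# Hypothetical schemas, written by hand to illustrate the comparison.
example = {
    "part-0.parquet": {"id": "int64", "code": "string"},
    "part-1.parquet": {"id": "int64", "code": "int64"},
}
print(find_type_conflicts(example))
```

With real files, the input dictionary could be built as `{p: read_parquet_schema(p) for p in paths}`. Any column reported here is a candidate for the INT64 versus string mismatch in the error message.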
One possible solution is to explicitly specify the schema when reading the Parquet file into a Spark DataFrame, by calling spark.read.schema(...) before .parquet(...). The declared types need to match the types actually stored in the file; otherwise the same conversion error can still occur. For example:
python
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Define the schema of the Parquet file
schema = StructType([
    StructField("int_column", LongType(), True),
    StructField("string_column", StringType(), True),
])

# Read in the Parquet file with the specified schema.
# The schema is set on the reader; parquet() takes paths and reader options, not a schema.
df = spark.read.schema(schema).parquet("s3://path/to/parquet/file")
In this example, the schema of the Parquet file contains an INT64 column and a string column, which are explicitly defined using the StructType and StructField classes. LongType() and StringType() specify the data types of the columns.
Alternatively, you can try converting the INT64 column to a string column in the Parquet file itself before reading it into Spark. This can be done using tools like Apache Arrow or pandas. Once the column is converted, the file's type should match a reader that expects a string for that column.
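As a rough illustration of that conversion with Apache Arrow, here is a minimal sketch. It assumes pyarrow is installed, that the file fits in memory, and that string is really the type wanted for the column; the function name `cast_column_to_string` and its arguments are hypothetical.

```python
def cast_column_to_string(src_path: str, dst_path: str, column: str) -> None:
    """Rewrite one Parquet file with `column` cast to the Arrow string type."""
    import pyarrow as pa  # imported lazily; assumes pyarrow is installed
    import pyarrow.parquet as pq

    table = pq.read_table(src_path)
    index = table.schema.get_field_index(column)
    if index == -1:
        raise KeyError(f"column {column!r} not found in {src_path}")

    # Cast the values, then swap the column in place so the column order is kept.
    casted = table.column(index).cast(pa.string())
    table = table.set_column(index, pa.field(column, pa.string()), casted)
    pq.write_table(table, dst_path)
```

Writing to a separate destination path keeps the original file untouched, so the result can be inspected before anything is replaced.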
I hope this helps! Let me know if you have any further questions.
04-18-2023 07:19 PM
Hello @Suteja Kanuri,
I'll go ahead and implement this method, thanks! I'll update this thread if there are any issues.
04-19-2023 01:53 AM
I have tried specifying the schema and assigning the following mapping to each column type:
- string - StringType()
- date - DateType()
- double - DoubleType()
- bigint - LongType()
- int - LongType()
- decimal(20,2) - LongType()
I have also tried using other Spark types for the decimal(20,2), int, bigint, and double columns; however, the error still persists.
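For comparison, the mapping above widens int and decimal(20,2) to LongType(), which differs from the types listed earlier in the thread. The following is a minimal sketch of a schema that keeps the declared types by passing a DDL string to spark.read.schema(). The column names and the helper names `to_ddl` and `read_with_declared_types` are hypothetical, and this is not confirmed to resolve the error if the files themselves store a different type.

```python
from typing import Dict


def to_ddl(columns: Dict[str, str]) -> str:
    """Build a Spark DDL schema string from {column name: Spark SQL type}."""
    return ", ".join(f"`{name}` {type_name}" for name, type_name in columns.items())


def read_with_declared_types(spark, path: str, columns: Dict[str, str]):
    """Read Parquet using the declared types instead of widening them to LongType."""
    return spark.read.schema(to_ddl(columns)).parquet(path)


# Hypothetical column names; the types are the ones listed earlier in the thread.
declared = {
    "customer_name": "string",
    "order_date": "date",
    "unit_price": "double",
    "row_id": "bigint",
    "quantity": "int",
    "total": "decimal(20,2)",
}
print(to_ddl(declared))
```

In a notebook this would be called as `read_with_declared_types(spark, "s3://path/to/parquet/file", declared)`, reusing the placeholder path from the earlier reply.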