@John Laurence Sy:
It sounds like you are encountering a schema conversion error when reading a Parquet file that contains an INT64 column where Spark expects a string type. This error occurs when the schema stored in the Parquet file is incompatible with the schema Spark expects for the DataFrame.
One possible solution is to explicitly specify the expected schema when reading the Parquet file, by calling .schema() on the DataFrameReader before .parquet(). This ensures the file is read with the correct column types and avoids the type conversion error. For example:
```python
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Define the expected schema of the Parquet file
schema = StructType([
    StructField("int_column", LongType(), True),
    StructField("string_column", StringType(), True)
])

# Read in the Parquet file with the specified schema
df = spark.read.schema(schema).parquet("s3://path/to/parquet/file")
```
In this example, the expected schema contains an INT64 (long) column and a string column, defined explicitly with the StructType and StructField classes. LongType() and StringType() specify the data types of the columns, and the final True argument marks each column as nullable.
Alternatively, you can convert the INT64 column to a string column in the Parquet file itself before reading it into Spark, for example with Apache Arrow or pandas. Once the file has been rewritten with the column cast to string, it can be read normally without any schema conversion errors.
I hope this helps! Let me know if you have any further questions.