03-16-2023 05:02 AM
Hi,
I have a PySpark DataFrame with about 11 million records. I created the DataFrame on a cluster; it is not saved on DBFS or in a storage account.
import pyspark.sql.functions as F
from pyspark.sql.functions import col, when, floor, expr, hour, minute, to_timestamp, explode, sequence
# Define start and end datetime for intervals
start_datetime = "1990-01-01 00:00:00"
end_datetime = "2099-12-31 23:55:00"
interval_minutes = 5 # minutes
# Create a DataFrame with sequence of intervals
df = spark.range(0, 1).select(
    expr(f"sequence(to_timestamp('{start_datetime}'), to_timestamp('{end_datetime}'), interval {interval_minutes} minutes)").alias("time")
)
# Explode the sequence column to create individual intervals
df = df.select(explode(col("time")).alias("time"))
# add a new column with interval_period
# (`interval_period` is a Column expression defined earlier in my notebook, not shown here)
df = df.withColumn('Interval_Period', interval_period)
df = df.withColumn('Interval_Date', F.to_date(F.col("time")))
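As a sanity check on the row count (plain Python, no Spark needed), the size of this inclusive 5-minute sequence can be computed directly; it comes out to roughly 11.57 million, consistent with the "11 million records" above:

```python
from datetime import datetime, timedelta

start = datetime(1990, 1, 1, 0, 0, 0)
end = datetime(2099, 12, 31, 23, 55, 0)
step = timedelta(minutes=5)

# Number of timestamps in the inclusive 5-minute sequence,
# i.e. what sequence() + explode() produces in Spark.
n_rows = int((end - start) / step) + 1
print(n_rows)  # 11570976
```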
When I try to write it to the storage account as a Delta table with:
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("path", "table_path") \
    .saveAsTable("my_table")  # external table
I get the following error:
AnalysisException: Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
Please upgrade your Delta table to reader version 2 and writer version 5
and change the column mapping mode to 'name' mapping. You can use the following command:
ALTER TABLE SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')
Refer to table versioning at https://docs.microsoft.com/azure/databricks/delta/versioning
My question is how I can set those properties. Note that I don't have a Delta table yet; what I have is a DataFrame created on a cluster.
I get the error when I try to write the files to the table.
03-16-2023 05:57 AM
Hi @Mohammad Saber , Are you getting the error while writing the file to the table? Or before that?
03-16-2023 02:38 PM
Hi @Lakshay Goel
I get the error when I want to write files to the table.
03-17-2023 07:31 AM
Hi @Mohammad Saber , my understanding is that the error is complaining that some column names in the Delta table you are writing to contain invalid characters such as ' ,;{}()\n\t='. You need to upgrade the table's reader and writer versions to 2 and 5 respectively, and set the column mapping mode to 'name', using the command below:
ALTER TABLE <table_name> SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5')
03-17-2023 04:19 PM
Thanks,
I found that I should first create the table with the above properties, and then write the DataFrame to it.
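A minimal sketch of that create-first approach, reusing the `my_table` name and `table_path` location from the snippets above (the column types are assumptions, since the post does not show how `Interval_Period` is built; the commented lines need a live `spark` session on a cluster):

```python
# Sketch: create the Delta table first with the required protocol versions
# and column mapping mode, then write the DataFrame into it.
create_sql = """
CREATE TABLE IF NOT EXISTS my_table (
  `time`            TIMESTAMP,
  `Interval_Period` STRING,  -- assumed type, not shown in the post
  `Interval_Date`   DATE
)
USING DELTA
LOCATION 'table_path'
TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
)
"""

# On the cluster:
# spark.sql(create_sql)
# df.write.format("delta").mode("append").saveAsTable("my_table")
```

Using mode("append") here avoids re-creating the table (which could drop the TBLPROPERTIES set at creation).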