03-16-2023 05:02 AM
Hi,
I have a PySpark DataFrame with about 11 million records. I created the DataFrame on a cluster; it is not saved on DBFS or in a storage account.
import pyspark.sql.functions as F
from pyspark.sql.functions import col, when, floor, expr, hour, minute, to_timestamp, explode, sequence
# Define start and end datetime for intervals
start_datetime = "1990-01-01 00:00:00"
end_datetime = "2099-12-31 23:55:00"
interval_minutes = 5 # minutes
# Create a DataFrame with sequence of intervals
df = spark.range(0, 1).select(
expr(f"sequence(to_timestamp('{start_datetime}'), to_timestamp('{end_datetime}'), interval {interval_minutes} minutes)").alias("time")
)
# Explode the sequence column to create individual intervals
df = df.select(explode(col("time")).alias("time"))
# Add a new column with the interval period (the post does not show how
# interval_period was defined; an HH:mm label of each 5-minute slot is assumed here)
interval_period = F.date_format(F.col("time"), "HH:mm")
df = df.withColumn('Interval_Period', interval_period)
df = df.withColumn('Interval_Date', F.to_date(F.col("time")))
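As a sanity check on the row count, the number of 5-minute slots that `sequence()` produces between those two timestamps can be computed with plain Python, independent of Spark:

```python
from datetime import datetime, timedelta

start = datetime(1990, 1, 1, 0, 0, 0)
end = datetime(2099, 12, 31, 23, 55, 0)
step = timedelta(minutes=5)

# Spark's sequence() is inclusive of both endpoints, hence the +1
n_intervals = (end - start) // step + 1
print(n_intervals)  # 11570976
```

That is roughly the 11 million records mentioned above.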
When I try to write it to the storage account as a Delta table with:
df.write \
.format("delta") \
.mode("overwrite") \
.option("path", "table_path") \
.saveAsTable("my_table") # External table
And I get the error:
AnalysisException: Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
Please upgrade your Delta table to reader version 2 and writer version 5
and change the column mapping mode to 'name' mapping. You can use the following command:
ALTER TABLE SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')
Refer to table versioning at https://docs.microsoft.com/azure/databricks/delta/versioning
My question is: how can I set those properties? Note that I don't have a Delta table; what I have is a DataFrame created on a cluster.
I get the error when I try to write files to the table.
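For what it's worth, the column names built above can be checked against the character set quoted in the AnalysisException with a quick pure-Python test (a sketch; the list of names is taken from the DataFrame code in this post):

```python
import re

# The invalid character set from the AnalysisException message
invalid = re.compile(r"[ ,;{}()\n\t=]")

# Column names used in the DataFrame above
cols = ["time", "Interval_Period", "Interval_Date"]
bad = [c for c in cols if invalid.search(c)]
print(bad)  # [] -> none of these names contains an invalid character
```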
03-17-2023 07:31 AM
Hi @Mohammad Saber, as I understand it, the error is complaining that some of the column names in the Delta table (the one you are writing to) contain characters from ' ,;{}()\n\t='. You need to upgrade the table's reader and writer versions to 2 and 5 respectively, using the command below:
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')
03-16-2023 05:57 AM
Hi @Mohammad Saber, are you getting the error while writing the file to the table, or before that?
03-16-2023 02:38 PM
Hi @Lakshay Goel,
I get the error when I want to write files to the table.
03-17-2023 04:19 PM
Thanks,
I found that I should first create a table with the above properties, and then write the DataFrame to it.
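That approach can be sketched as follows (unverified; `my_table` and `table_path` are placeholders from the original post, and the column list is assumed from the DataFrame built earlier):

```sql
-- Create the external Delta table first, with the required protocol properties
CREATE TABLE my_table (
  time TIMESTAMP,
  Interval_Period STRING,
  Interval_Date DATE
)
USING DELTA
LOCATION 'table_path'
TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
);
```

The DataFrame can then be written into the existing table, e.g. with `df.write.format("delta").mode("append").saveAsTable("my_table")`.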