Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-29-2025 10:56 PM
Hi @lingareddy_Alva , Thank you very much for all the optimization techniques.
I have tried this technique but still same issue.
# Optimized write for 980M records
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456")
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.conf.set("spark.sql.parquet.block.size", "268435456")
final_df.coalesce(24) \ # Reduce from 100 partitions
.sortWithinPartitions("date_col") \ # Improve compression ratio
.write \
.partitionBy("year", "month") \
.option("maxRecordsPerFile", "20000000") \ # Larger files
.mode('overwrite') \
.format('parquet') \
.option('compression', 'zstd') \
.save(destination_path)While applying upper and lower bounds on date_col, I encountered the following error. I have tried to fix it but I am getting same issue. My date_col is a timestamp type.
ORA-01861: literal does not match format stringdf = spark.read \
.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", oracle_query) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", driver_class) \
.option("fetchsize", "50000") \
.option("partitionColumn", "date_col") \
.option("lowerBound", "2025-05-01") \
.option("upperBound", "2025-07-31") \
.option("numPartitions", "5")
.load()
df.limit(5).display() # Getting that error while displaying result.And I did not try the oracle query as I am not able to understand few columns where to use them.
Modify your Oracle query to be more partition-friendly:
oracle_query = """(
SELECT /*+ parallel(50) */ DISTINCT
t1.column1, ..., t2.column28,
t1.date_col,
EXTRACT(YEAR FROM t1.date_col) as year_partition, #1 where I can use this for partition
EXTRACT(MONTH FROM t1.date_col) as month_partition, #2 same
-- Add a numeric column for partitioning if available
ROW_NUMBER() OVER (ORDER BY t1.date_col) as row_id #3 Where I can use this col.
FROM schema.table1 t1
LEFT JOIN schema.table2 t2
ON t1.key_col = t2.key_col
AND t1.date_col = t2.date_col
AND t2.date_col >= DATE '2024-01-01'
WHERE t1.date_col >= DATE '2024-01-01'
AND t1.date_col < DATE '2027-01-01'
)"""Could you help me if you have any solution?
Thanks