Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-02-2024 01:07 PM
Hello,
To address the issue
Step 1: Load Data Without Partitioning
table = 'Transaction'
lower_bound = "2024-01-01"
upper_bound = "2024-03-31"
query = f"""
(
SELECT * FROM {table}
WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""
df = spark.read.format("jdbc") \
.option("url", jdbc_url) \
.option("driver", jdbc_driver) \
.option("dbtable", query) \
.option("fetchsize", 1000) \
.option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
.load()Step 2: Repartition the Data Within Spark Once you have the DataFrame df loaded, you can repartition it based on the lastmodifieddate field:
# Convert the 'lastmodifieddate' column to date type if it's not already
from pyspark.sql.functions import col, to_date
df = df.withColumn("lastmodifieddate", to_date(col("lastmodifieddate"), "yyyy-MM-dd"))
# Repartition the DataFrame
num_partitions = 10
df_repartitioned = df.repartition(num_partitions, "lastmodifieddate")
# Now you can proceed with further processing on df_repartitioned
df_repartitioned.show()
Best regards,
Mehdi Tajmouati
mehdi.tajmouati@wytasoft.com
06 68 23 18 42
www.wytasoft.com
Mehdi Tajmouati
mehdi.tajmouati@wytasoft.com
06 68 23 18 42
www.wytasoft.com