mtajmouati
Contributor

Hello,

To address the issue

Step 1: Load Data Without Partitioning

table = 'Transaction'
lower_bound = "2024-01-01"
upper_bound = "2024-03-31"

query = f"""
(
SELECT * FROM {table}
WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""

df = spark.read.format("jdbc") \
 .option("url", jdbc_url) \
 .option("driver", jdbc_driver) \
 .option("dbtable", query) \
 .option("fetchsize", 1000) \
 .option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
 .load()

Step 2: Repartition the Data Within Spark Once you have the DataFrame df loaded, you can repartition it based on the lastmodifieddate field:

# Convert the 'lastmodifieddate' column to date type if it's not already
from pyspark.sql.functions import col, to_date

df = df.withColumn("lastmodifieddate", to_date(col("lastmodifieddate"), "yyyy-MM-dd"))

# Repartition the DataFrame
num_partitions = 10
df_repartitioned = df.repartition(num_partitions, "lastmodifieddate")

# Now you can proceed with further processing on df_repartitioned
df_repartitioned.show()

 

 

 

 

 

Best regards,
Mehdi Tajmouati
 mehdi.tajmouati@wytasoft.com
 06 68 23 18 42
 www.wytasoft.com