How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?
06-28-2024 11:48 AM
I'm attempting to fetch an Oracle NetSuite table in parallel via JDBC using the SuiteAnalytics Connect JAR, which is already installed and set up correctly on the cluster. A single-threaded fetch with the `dbtable` option works fine:
table = 'Transaction'
df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", jdbc_driver) \
    .option("dbtable", table) \
    .load()
I'd like to partition the fetch on a date field, but the following code fails with an error - java.sql.SQLSyntaxErrorException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Syntax Error in the SQL statement.[10104]:
lower_bound = "2024-01-01"
upper_bound = "2024-03-31"
query = f"""
(
    SELECT * FROM {table}
    WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
      AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""
df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", jdbc_driver) \
    .option("dbtable", query) \
    .option("partitionColumn", "lastmodifieddate") \
    .option("lowerBound", lower_bound) \
    .option("upperBound", upper_bound) \
    .option("numPartitions", 10) \
    .option("fetchsize", 1000) \
    .option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
    .load()
I believe this happens because Spark generates the partition WHERE clauses itself and does not wrap the bound values in TO_DATE, while Oracle/NetSuite requires the strings to be cast to dates in the SQL statement. Sure enough, the Spark debug logs show that no TO_DATE is applied in the generated WHERE clauses:
INFO JDBCRelation: Number of partitions: 10, WHERE clauses of these partitions: "lastmodifieddate" < '2024-01-09 23:54:00' or "lastmodifieddate" is null...
How do I address this?
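For reference, Spark's `DataFrameReader.jdbc` also accepts a `predicates` list of hand-written per-partition WHERE clauses, which Spark passes through verbatim, so explicit `TO_DATE` casts would survive. A minimal sketch under that idea, assuming month-sized slices and the column/format used above (the `month_predicates` helper is hypothetical, not part of any library):

```python
from datetime import date, timedelta

def month_predicates(start, end, col="lastmodifieddate"):
    """Build one hand-written WHERE clause per calendar month, TO_DATE casts intact."""
    preds = []
    cur = start.replace(day=1)
    while cur <= end:
        # First day of the next month: jump past day 28, then snap back to day 1.
        nxt = (cur.replace(day=28) + timedelta(days=4)).replace(day=1)
        preds.append(
            f"TO_DATE({col}, 'YYYY-MM-DD') >= TO_DATE('{cur.isoformat()}', 'YYYY-MM-DD')"
            f" AND TO_DATE({col}, 'YYYY-MM-DD') < TO_DATE('{nxt.isoformat()}', 'YYYY-MM-DD')"
        )
        cur = nxt
    return preds

preds = month_predicates(date(2024, 1, 1), date(2024, 3, 31))

# Spark creates one partition per predicate and never rewrites these clauses:
# df = spark.read.jdbc(url=jdbc_url, table="Transaction", predicates=preds,
#                      properties={"driver": jdbc_driver, "fetchsize": "1000"})
```

Whether the NetSuite driver accepts these clauses is untested here; this only shows how to keep full control over the per-partition SQL.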
07-02-2024 10:27 AM - edited 07-02-2024 10:28 AM
Thank you for your reply. The answers and the Stack Overflow post you link appear to be heading in the wrong direction: there is no named partition in the source NetSuite table, and the table isn't partitioned by a date field. Moreover, the SQL you and the SO answer provide covers only a single hardcoded day of data:
PARTITION FOR (DATE '2024-07-24')
My time range spans three months, so PARTITION FOR would need to be parameterized. The Spark logs don't help, and the NetSuite logs give only this information: "Invalid Month."
I'd appreciate it if you could propose a code refactor that fits my use case more closely. Thank you!
07-02-2024 01:07 PM
Hello,
To address the issue, load the data without JDBC-level partitioning and then repartition it inside Spark.
Step 1: Load Data Without Partitioning
table = 'Transaction'
lower_bound = "2024-01-01"
upper_bound = "2024-03-31"
query = f"""
(
    SELECT * FROM {table}
    WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
      AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""
df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", jdbc_driver) \
    .option("dbtable", query) \
    .option("fetchsize", 1000) \
    .option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
    .load()
Step 2: Repartition the Data Within Spark
Once the DataFrame df is loaded, repartition it on the lastmodifieddate field:
# Convert the 'lastmodifieddate' column to date type if it's not already
from pyspark.sql.functions import col, to_date
df = df.withColumn("lastmodifieddate", to_date(col("lastmodifieddate"), "yyyy-MM-dd"))
# Repartition the DataFrame
num_partitions = 10
df_repartitioned = df.repartition(num_partitions, "lastmodifieddate")
# Now you can proceed with further processing on df_repartitioned
df_repartitioned.show()
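For intuition on what Step 2 does: repartition(n, col) routes each row to a partition by hashing the column value modulo the partition count, so all rows sharing a lastmodifieddate land in the same partition. A rough pure-Python sketch of that routing, with a hypothetical sample of date values standing in for rows (Spark actually uses Murmur3 hashing, not Python's hash):

```python
from datetime import date

num_partitions = 10
# Hypothetical lastmodifieddate values standing in for DataFrame rows.
dates = [date(2024, 1, d) for d in range(1, 8)]

# Each key hashes to one fixed partition; equal keys always co-locate.
assignment = {d: hash(d) % num_partitions for d in dates}
```

In Spark itself, df_repartitioned.rdd.getNumPartitions() confirms the resulting partition count.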
07-15-2024 09:27 AM
@mtajmouati I appreciate your response. This approach resulted in the same generic "bad SQL" error from NetSuite: "java.sql.SQLSyntaxErrorException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Syntax Error in the SQL statement.[10104]"
My response was delayed because I have since decided to work around this issue and partition the fetch by the long integer ID field, since that doesn't require casting to a DATE data type. Thank you.
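For anyone landing here later, a minimal sketch of that ID-based workaround. The column name `id`, the id range, and the connection placeholders are all assumptions to adjust for your table; the point is that numeric bounds need no casting, so Spark's generated clauses ("id" >= x AND "id" < y) are valid as-is:

```python
jdbc_url = "jdbc:..."    # placeholder: your SuiteAnalytics Connect URL
jdbc_driver = "..."      # placeholder: the driver class configured on the cluster

lower, upper, num_parts = 1, 50_000_000, 10   # assumed id range for the table
# Spark splits [lower, upper) into equal strides; each partition reads one slice.
stride = (upper - lower) // num_parts

options = {
    "url": jdbc_url,
    "driver": jdbc_driver,
    "dbtable": "Transaction",
    "partitionColumn": "id",        # assumed numeric key column
    "lowerBound": str(lower),
    "upperBound": str(upper),
    "numPartitions": str(num_parts),
    "fetchsize": "1000",
}
# df = spark.read.format("jdbc").options(**options).load()
```

Note that lowerBound/upperBound only shape the partition strides; they do not filter rows, so a WHERE clause is still needed to restrict the date range.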

