
How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?

joshuat — Fri, 28 Jun 2024 18:48:23 GMT

I'm attempting to fetch an Oracle Netsuite table in parallel via JDBC using the Netsuite Connect JAR, which is already installed on the cluster and set up correctly. I can do this successfully with a single-threaded approach using the `dbtable` option:

table = 'Transaction'
df = spark.read.format("jdbc") \
 .option("url", jdbc_url) \
 .option("driver", jdbc_driver) \
 .option("dbtable", table) \
 .load()

I'd like to partition the fetch on a date field, but the following code fails with `java.sql.SQLSyntaxErrorException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Syntax Error in the SQL statement.[10104]`:

lower_bound = "2024-01-01"
upper_bound = "2024-03-31"
query = f"""
(
SELECT * FROM {table}
WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""
df = spark.read.format("jdbc") \
 .option("url", jdbc_url) \
 .option("driver", jdbc_driver) \
 .option("dbtable", query) \
 .option("partitionColumn", "lastmodifieddate") \
 .option("lowerBound", lower_bound) \
 .option("upperBound", upper_bound) \
 .option("numPartitions", 10) \
 .option("fetchsize", 1000) \
 .option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
 .load()

I believe this happens because Spark's partitioning logic generates its own WHERE clauses, and the bound strings in those clauses are not cast to dates with TO_DATE. Oracle/Netsuite wants the strings cast to dates in the SQL statement. Sure enough, the Spark debug logs show that no TO_DATE is applied in the generated WHERE clauses:

INFO JDBCRelation: Number of partitions: 10, WHERE clauses of these partitions: "lastmodifieddate" < '2024-01-09 23:54:00' or "lastmodifieddate" is null...
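To make the log line above easier to read, here is a minimal sketch in plain Python of how range-partition WHERE clauses of that shape can be built. It is a simplified model, not Spark's actual implementation: the function name `sketch_partition_clauses` is hypothetical, and the real boundary values (such as the '2024-01-09 23:54:00' in the log) come from Spark's own stride arithmetic and are not reproduced here. The point is only that the bounds end up as bare quoted literals with no TO_DATE around them.

```python
from datetime import datetime


def sketch_partition_clauses(column, lower, upper, num_partitions):
    """Simplified model of range-partition WHERE clauses (not Spark's real code)."""
    if num_partitions < 2:
        raise ValueError("this sketch assumes at least two partitions")
    fmt = "%Y-%m-%d %H:%M:%S"
    stride = (upper - lower) / num_partitions  # a timedelta per partition
    quoted = f'"{column}"'
    clauses = []
    current = lower
    for i in range(num_partitions):
        # The first partition has no lower bound; the last has no upper bound.
        lower_part = None if i == 0 else f"{quoted} >= '{current.strftime(fmt)}'"
        current = current + stride
        upper_part = (
            None
            if i == num_partitions - 1
            else f"{quoted} < '{current.strftime(fmt)}'"
        )
        if lower_part is None:
            # The first partition also picks up rows where the column is NULL.
            clauses.append(f"{upper_part} or {quoted} is null")
        elif upper_part is None:
            clauses.append(lower_part)
        else:
            clauses.append(f"{lower_part} AND {upper_part}")
    return clauses


clauses = sketch_partition_clauses(
    "lastmodifieddate", datetime(2024, 1, 1), datetime(2024, 3, 31), 10
)
for clause in clauses:
    print(clause)
```

Every generated clause compares the column against a plain string literal, which is consistent with the syntax error if the SQL engine requires an explicit date cast.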

How do I address this?

Re: How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?

joshuat — Tue, 02 Jul 2024 17:28:46 GMT

Thank you for your reply. The answers and the Stack Overflow post you link appear to be heading off in the wrong direction. I don't have a named partition in the source Netsuite table. Additionally, the source table isn't partitioned by a date field. Lastly, the answer SQL you and SO provide only covers one hardcoded day of data:

PARTITION FOR (DATE '2024-07-24')

My time range spans 3 months, so I would need PARTITION FOR parameterized. The Spark logs don't help, and the Netsuite logs merely give this information: "Invalid Month."

I'd appreciate it if you could propose a code refactor that fits my use case more closely. Thank you!

Re: How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?

mtajmouati — Tue, 02 Jul 2024 20:07:42 GMT

Hello,

To address the issue, you can load the data without JDBC partitioning and then repartition it within Spark:

Step 1: Load Data Without Partitioning

table = 'Transaction'
lower_bound = "2024-01-01"
upper_bound = "2024-03-31"
query = f"""
(
SELECT * FROM {table}
WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""
df = spark.read.format("jdbc") \
 .option("url", jdbc_url) \
 .option("driver", jdbc_driver) \
 .option("dbtable", query) \
 .option("fetchsize", 1000) \
 .option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
 .load()

Step 2: Repartition the Data Within Spark

Once you have the DataFrame df loaded, you can repartition it based on the lastmodifieddate field:

from pyspark.sql.functions import col, to_date

# Convert the 'lastmodifieddate' column to date type if it's not already
df = df.withColumn("lastmodifieddate", to_date(col("lastmodifieddate"), "yyyy-MM-dd"))

# Repartition the DataFrame
num_partitions = 10
df_repartitioned = df.repartition(num_partitions, "lastmodifieddate")

# Now you can proceed with further processing on df_repartitioned
df_repartitioned.show()

Re: How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?

joshuat — Mon, 15 Jul 2024 16:27:45 GMT

@mtajmouati I appreciate your response. This approach resulted in a generic "bad SQL" error in Netsuite: "java.sql.SQLSyntaxErrorException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Syntax Error in the SQL statement.[10104]"

My response was delayed because I have since decided to work around this issue and partition the fetch by the long integer ID field, since that doesn't require casting to a DATE data type. Thank you.
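A minimal sketch of what that ID-based workaround might look like. The column name `id`, the bound values, the placeholder URL and driver strings, and the helper `id_partition_options` are all assumptions for illustration; only the option names (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are standard Spark JDBC options. Note that in Spark the bounds only determine the partition stride and do not filter rows, so any date restriction would still have to live in the query itself.

```python
def id_partition_options(url, driver, table, id_column, min_id, max_id,
                         num_partitions, fetchsize=1000):
    """Build Spark JDBC options for a read partitioned on a numeric column."""
    if min_id >= max_id:
        raise ValueError("min_id must be smaller than max_id")
    return {
        "url": url,
        "driver": driver,
        "dbtable": table,
        "partitionColumn": id_column,   # numeric column, so no date cast is needed
        "lowerBound": str(min_id),      # bounds set the stride; they do not filter rows
        "upperBound": str(max_id),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetchsize),
    }


opts = id_partition_options(
    url="jdbc:placeholder-url",         # stand-in for the real jdbc_url
    driver="placeholder.Driver",        # stand-in for the real jdbc_driver
    table="Transaction",
    id_column="id",                     # assumed name of the long integer ID field
    min_id=1,                           # assumed bounds; in practice query MIN/MAX first
    max_id=10_000_000,
    num_partitions=10,
)

# With a live SparkSession the options would be applied like this:
# df = spark.read.format("jdbc").options(**opts).load()
```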

Re: How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?

pavlosskev — Wed, 16 Jul 2025 14:51:16 GMT

@joshuat Hello. Did you manage to solve this issue? We are having a similar problem, where we can't do partitioned reads of Oracle data unless the lower and upper bounds are integers.

Re: How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?

joshuat — Wed, 16 Jul 2025 23:21:58 GMT

@pavlosskev I did not, and I still have to do partitioned reads via the ID.
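For readers who hit the same limitation, one direction that was not tried in this thread is the `predicates` argument of PySpark's `spark.read.jdbc`, where each string becomes the WHERE clause of one partition, so the date cast can be written out explicitly. The sketch below only builds the predicate strings. The helper name `date_range_predicates` and the `'YYYY-MM-DD'` format mask are assumptions, and whether the NetSuite SQL engine accepts this TO_DATE form is unverified.

```python
from datetime import date, timedelta


def date_range_predicates(column, start, end, num_partitions,
                          sql_format="YYYY-MM-DD"):
    """Split [start, end] (inclusive) into contiguous half-open date ranges."""
    total_days = (end - start).days + 1
    if num_partitions < 1 or total_days < num_partitions:
        raise ValueError("need at least one day per partition")
    base, extra = divmod(total_days, num_partitions)
    predicates = []
    range_start = start
    for i in range(num_partitions):
        # Spread the remainder days over the first `extra` partitions.
        size = base + (1 if i < extra else 0)
        range_end = range_start + timedelta(days=size)
        predicates.append(
            f"{column} >= TO_DATE('{range_start.isoformat()}', '{sql_format}') "
            f"AND {column} < TO_DATE('{range_end.isoformat()}', '{sql_format}')"
        )
        range_start = range_end
    return predicates


predicates = date_range_predicates(
    "lastmodifieddate", date(2024, 1, 1), date(2024, 3, 31), 10
)

# With a live SparkSession, each predicate would drive one partition:
# df = spark.read.jdbc(
#     url=jdbc_url,
#     table=table,
#     predicates=predicates,
#     properties={"driver": jdbc_driver, "fetchsize": "1000"},
# )
```

Rows where the column is NULL match none of these predicates, so they would need an extra predicate of their own if they matter.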