
How to partition JDBC Oracle read query and cast with TO_DATE on partition date field?

joshuat
New Contributor II

I'm attempting to fetch an Oracle NetSuite table in parallel via JDBC using the NetSuite Connect JAR, which is already installed on the cluster and set up correctly. I can do this successfully with a single-threaded approach using the `dbtable` option:

table = 'Transaction' 
df = spark.read.format("jdbc") \
.option("url", jdbc_url) \
.option("driver", jdbc_driver) \
.option("dbtable", table) \
.load()

I'd like to partition the fetch on a date field, but this code fails with java.sql.SQLSyntaxErrorException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Syntax Error in the SQL statement.[10104]:

lower_bound = "2024-01-01"
upper_bound = "2024-03-31"
query = f"""
(
SELECT * FROM {table}
WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""
df = spark.read.format("jdbc") \
.option("url", jdbc_url) \
.option("driver", jdbc_driver) \
.option("dbtable", query) \
.option("partitionColumn", "lastmodifieddate") \
.option("lowerBound", lower_bound) \
.option("upperBound", upper_bound) \
.option("numPartitions", 10) \
.option("fetchsize", 1000) \
.option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
.load()

I believe this is because of how Spark's JDBC partitioning works: the partition predicates Spark generates never apply TO_DATE to cast the string to a date, while Oracle/NetSuite requires the strings to be cast to dates in the SQL statement. Sure enough, the Spark debug logs show that no TO_DATE is applied in the generated WHERE clauses:

INFO JDBCRelation: Number of partitions: 10, WHERE clauses of these partitions: "lastmodifieddate" < '2024-01-09 23:54:00' or "lastmodifieddate" is null...

How do I address this?

3 REPLIES

Kaniz_Fatma
Community Manager

Hi @joshuat

  1. To query a specific partition of an Oracle table, you can use the PARTITION FOR clause. Here’s an example of how you can do it:

    SELECT *
    FROM your_table PARTITION FOR (DATE '2024-07-24')
    WHERE customer = 'FooBar'
      AND insert_date BETWEEN TO_DATE('2024-07-24 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
                         AND TO_DATE('2024-07-24 23:59:59', 'YYYY-MM-DD HH24:MI:SS');
    

    Replace your_table, customer, and insert_date with your actual table name, customer condition, and date column.

  2. To achieve date-based partitioning in Spark using JDBC, you can follow these steps:

    • Ensure that your Oracle/Netsuite date column is in a format compatible with Spark (e.g., yyyy-MM-dd).
    • Use the partitionColumn, lowerBound, and upperBound options to specify the date range.
    • Set the sessionInitStatement to ensure the correct date format.

    Your code snippet looks almost correct; see the sketch at the end of this reply for one workaround.

  3. If you’re still encountering issues, check the following:

    • Ensure that the Oracle/Netsuite date column is indeed a valid date field.
    • Verify that the JDBC URL, driver, and other connection settings are correct.
    • Review the Spark logs for any additional error messages.

    By addressing these points, you should be able to successfully partition your query based on the date field.
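
On the workaround mentioned in point 2: Spark requires partitionColumn to be a numeric, date, or timestamp column, and the SuiteAnalytics driver is rejecting the bare date-string literals Spark generates against your string column. One option is to expose a numeric surrogate of the date in the pushed-down subquery and partition on that, so Spark's generated predicates become plain numeric comparisons. A minimal sketch, assuming the SuiteAnalytics SQL engine accepts TO_DATE/TO_CHAR/TO_NUMBER in a derived column (the lmd_num alias is just an illustration, and I haven't tested this against NetSuite):

# Sketch: partition on a numeric YYYYMMDD surrogate of the date column so
# that Spark's generated range predicates are plain numeric comparisons,
# e.g. "lmd_num < 20240124 or lmd_num is null".
query = f"""
(
SELECT txn.*,
       TO_NUMBER(TO_CHAR(TO_DATE(txn.lastmodifieddate, 'yyyy-MM-dd'), 'YYYYMMDD')) AS lmd_num
FROM {table} txn
WHERE TO_DATE(txn.lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
AND TO_DATE(txn.lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""

df = spark.read.format("jdbc") \
.option("url", jdbc_url) \
.option("driver", jdbc_driver) \
.option("dbtable", query) \
.option("partitionColumn", "lmd_num") \
.option("lowerBound", "20240101") \
.option("upperBound", "20240331") \
.option("numPartitions", 10) \
.option("fetchsize", 1000) \
.load()

Because YYYYMMDD values increase monotonically with the date, every row still lands in exactly one partition, and the OpenAccess SQL engine never has to parse a date literal. Let me know if you need further assistance! 😊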

 

joshuat
New Contributor II

Thank you for your reply. The answers and the Stack Overflow post you link appear to be heading in the wrong direction: I don't have a named partition in the source NetSuite table, and the source table isn't partitioned by a date field. Lastly, the SQL you and SO provide only covers one hardcoded day of data:

PARTITION FOR (DATE '2024-07-24')

My time range spans 3 months, so PARTITION FOR would need to be parameterized. The Spark logs don't help, and the NetSuite logs merely report "Invalid Month."

I'd appreciate it if you could propose a code refactor that fits my use case more closely. Thank you!

mtajmouati
New Contributor

Hello,

To address the issue, you can load the data without JDBC partitioning and then repartition it inside Spark.

Step 1: Load Data Without Partitioning

table = 'Transaction'
lower_bound = "2024-01-01"
upper_bound = "2024-03-31"

query = f"""
(
SELECT * FROM {table}
WHERE TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lower_bound}', 'yyyy-MM-dd')
AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{upper_bound}', 'yyyy-MM-dd')
) AS t
"""

df = spark.read.format("jdbc") \
 .option("url", jdbc_url) \
 .option("driver", jdbc_driver) \
 .option("dbtable", query) \
 .option("fetchsize", 1000) \
 .option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
 .load()

Step 2: Repartition the Data Within Spark

Once you have the DataFrame df loaded, you can repartition it based on the lastmodifieddate field:

# Convert the 'lastmodifieddate' column to date type if it's not already
from pyspark.sql.functions import col, to_date

df = df.withColumn("lastmodifieddate", to_date(col("lastmodifieddate"), "yyyy-MM-dd"))

# Repartition the DataFrame
num_partitions = 10
df_repartitioned = df.repartition(num_partitions, "lastmodifieddate")

# Now you can proceed with further processing on df_repartitioned
df_repartitioned.show()
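
Note that repartitioning only parallelizes processing after the data has arrived over a single JDBC connection. If you need the fetch itself to run in parallel, one more option worth trying is Spark's predicate-based JDBC read: DataFrameReader.jdbc accepts a predicates list with one WHERE clause per partition and passes each clause to the driver verbatim, so your TO_DATE casts survive untouched. A minimal sketch; the date_range_predicates helper is hypothetical, and I haven't verified this against SuiteAnalytics Connect:

from datetime import date, timedelta

def date_range_predicates(start, end, num_partitions):
    # Split [start, end] into contiguous, non-overlapping date windows and
    # render each window as a WHERE clause with explicit TO_DATE casts.
    step = (end - start) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = start + step * i
        hi = end if i == num_partitions - 1 else start + step * (i + 1) - timedelta(days=1)
        predicates.append(
            f"TO_DATE(lastmodifieddate, 'yyyy-MM-dd') >= TO_DATE('{lo}', 'yyyy-MM-dd') "
            f"AND TO_DATE(lastmodifieddate, 'yyyy-MM-dd') <= TO_DATE('{hi}', 'yyyy-MM-dd')"
        )
    return predicates

predicates = date_range_predicates(date(2024, 1, 1), date(2024, 3, 31), 10)

# One Spark partition per predicate; Spark does not rewrite these clauses.
df = spark.read.jdbc(
    url=jdbc_url,
    table="Transaction",
    predicates=predicates,
    properties={"driver": jdbc_driver, "fetchsize": "1000"},
)

Because you write each partition's WHERE clause yourself, nothing Spark generates ever reaches the OpenAccess SQL engine, which sidesteps the syntax error entirely.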


Mehdi TAJMOUATI
https://www.wytasoft.com/wytasoft-group/