Performance Issue – Writing Large Dataset to ADLS from Oracle via JDBC
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-29-2025 05:39 AM
Hi there,
I am currently working on a notebook where I pull data from an Oracle database using an Oracle SQL script with a JDBC connection. Due to the large dataset size and joins in my query, I’ve implemented the /*+ parallel(n) */ hint, which works efficiently—returning results in seconds for one month of data and a few minutes for two years of data, which is fine.
However, when writing the entire result set (~980M records) to my ADLS (Azure Data Lake Storage) path, the process is taking excessively long (over 4 hours without completion). I’ve tried multiple optimization methods but haven’t seen any improvement. I would deeply appreciate any insights or solutions.
I have tried for one month and it is taking around 10-15 min.
My cluster configuration:
Databricks runtime: 16.4 LTS
Worker Type: Standard_D32ds_v5 ( 128 Gb memory, 32 core)
Enable autoscaling: min(1) and max(8)
Driver Type: Standard_d16ds_v5 ( 64 Gb memory and 16 cores)
Example Code:
oracle_query = """( SELECT /*+ parallel(100) */ DISTINCT
t1.column1......t2.colum28
FROM schema.table1 t1
LEFT JOIN( SELECT *+ parallel(50) */ key_col, date_col, column
FROM schema.table2
WHERE date_col >= to_date( '2024-01-01', 'yyyy-MM-dd')
) t2
ON t1.key_col = t2.key_col AND t1.date_col = t2.date_col
WHERE t1.date_col >= to_date( '2024-01-01', 'yyyy-MM-dd')
)"""JDBC connection:
df = spark.read \
.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", oracle_query) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", driver_class) \
.option( fetchsize = "10000000")
.load()Some transformation:
final_df = df.withcolumn("Year", year("date_col")\
.withcolumn("month", month("date_col")\
.withcolumn("Action_timestp", from_utc_timestamp(current_timestamp(),'America'))Saving final dataframe to ADLS:
final_df.repartition(100,'year','month').write.option("maxrecordsPerFile",1000000).mode('Overwrite').format('parquet').option('compression','snappy').save(destination_path)
Thanks