Performance Issue – Writing Large Dataset to ADLS from Oracle via JDBC

Raj_DB
Contributor

Hi there, 

I am currently working on a notebook where I pull data from an Oracle database using an Oracle SQL script with a JDBC connection. Due to the large dataset size and joins in my query, I’ve implemented the /*+ parallel(n) */ hint, which works efficiently—returning results in seconds for one month of data and a few minutes for two years of data, which is fine.

However, when writing the entire result set (~980M records) to my ADLS (Azure Data Lake Storage) path, the process is taking excessively long (over 4 hours without completion). I’ve tried multiple optimization methods but haven’t seen any improvement. I would deeply appreciate any insights or solutions.

I have tried for one month and it is taking around 10-15 min.

My cluster configuration:

Databricks runtime: 16.4 LTS

Worker Type: Standard_D32ds_v5 ( 128 Gb memory, 32 core)

Enable autoscaling: min(1) and max(8) 

Driver Type: Standard_d16ds_v5 ( 64 Gb memory and 16 cores)

Example Code:

oracle_query = """( SELECT /*+ parallel(100) */ DISTINCT

                  t1.column1......t2.colum28

FROM schema.table1 t1

LEFT JOIN( SELECT  *+ parallel(50) */  key_col, date_col, column

                    FROM schema.table2 

                    WHERE date_col >= to_date( '2024-01-01', 'yyyy-MM-dd')

                   ) t2

ON t1.key_col = t2.key_col AND t1.date_col = t2.date_col

WHERE  t1.date_col >= to_date( '2024-01-01', 'yyyy-MM-dd')

)"""

JDBC connection:

df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", oracle_query) \
    .option("user", jdbc_username) \
    .option("password", jdbc_password) \
    .option("driver", driver_class) \
    .option( fetchsize = "10000000")
    .load()

Some transformation:

final_df = df.withcolumn("Year", year("date_col")\
              .withcolumn("month", month("date_col")\
              .withcolumn("Action_timestp", from_utc_timestamp(current_timestamp(),'America'))

Saving final dataframe to ADLS:

final_df.repartition(100,'year','month').write.option("maxrecordsPerFile",1000000).mode('Overwrite').format('parquet').option('compression','snappy').save(destination_path)

 

Thanks