a month ago
Hi everyone.
I have a very wide table (600 columns) on an Oracle database that I need to extract to my data lake.
The table has approximately 700K rows.
I am currently trying to extract and load this information into my data lake, but I am struggling to get the process to complete with good performance.
Currently I am doing around 100K rows per HOUR (with parallel 10 and fetch 5000)!
Doing a simple count(*) takes 1 minute.
All of this is insanely slow. I have configured JDBC with several different fetch sizes and different parallel settings, but always with similar results.
Does anyone have an idea how this can be fixed so the export runs fast?
Thank you all.
2 weeks ago
Thanks for posting this -- extracting wide tables over JDBC is one of the trickiest performance scenarios, and 600 columns is definitely pushing the limits of what a single JDBC fetch can handle efficiently. The good news is there are several concrete things you can do to dramatically improve throughput. I will walk through them from highest to lowest impact.
UNDERSTANDING THE BOTTLENECK
With 600 columns and 700K rows, the main bottleneck is almost certainly the sheer volume of data per row being serialized over JDBC. Each row fetch has to serialize all 600 column values, and Oracle's JDBC driver is not optimized for extremely wide result sets. A count(*) taking 1 minute also suggests the Oracle side itself may have some overhead (possibly a full table scan on a large physical table, or the table lacks proper statistics).
Your fetchSize of 5000 with 600 columns means each fetch round-trip is pulling 5000 x 600 = 3 million cell values. Depending on the data types (especially if there are LOBs, CLOBs, or wide VARCHAR2 columns), this could be consuming significant memory per fetch and causing GC pressure on the Spark executors.
RECOMMENDATION 1: COLUMN PRUNING (HIGHEST IMPACT)
If you do not actually need all 600 columns downstream, the single most impactful thing you can do is select only the columns you need. Use the "dbtable" option with a subquery instead of pointing at the full table:
pushdown_query = "(SELECT col1, col2, col3, ... FROM your_table) t"
df = (spark.read
.format("jdbc")
.option("url", jdbc_url)
.option("dbtable", pushdown_query)
.option("user", username)
.option("password", password)
.option("fetchSize", "1000")
.option("numPartitions", 10)
.option("partitionColumn", "your_id_column")
.option("lowerBound", "1")
.option("upperBound", "700000")
.load())
Even reducing from 600 to 200 columns can cut your transfer time by 60% or more because you are reducing the bytes-per-row that Oracle's JDBC driver has to serialize.
If you DO need all 600 columns, consider splitting the extraction into two or three reads (e.g., columns 1-200, 201-400, 401-600) joined on the primary key afterward. This can actually be faster than one monolithic read because each fetch is much lighter.
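As a rough sketch of that split-and-join approach (the column names, the table name, and the `id` primary key below are hypothetical placeholders, not from your schema):

```python
# Sketch only: split a hypothetical 600-column list into groups of ~200 columns,
# each carrying the primary key so the separate reads can be joined back together.
def split_columns(columns, pk, group_size=200):
    groups = []
    for i in range(0, len(columns), group_size):
        group = columns[i:i + group_size]
        if pk not in group:
            group = [pk] + group  # every read needs the join key
        groups.append(group)
    return groups

cols = [f"col{i}" for i in range(1, 601)]  # 600 placeholder column names
groups = split_columns(cols, pk="id")

# Each entry becomes its own pushdown subquery for the "dbtable" option:
queries = [f"(SELECT {', '.join(g)} FROM your_table) t" for g in groups]
```

Each query in `queries` would then be loaded with its own `spark.read.format("jdbc") ... .load()` call, and the resulting DataFrames joined on `id` before writing to the lake.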
RECOMMENDATION 2: TUNE FETCHSIZE FOR WIDE TABLES
With 600 columns, a fetchSize of 5000 may actually be too high. Each fetch batch has to fit in executor memory, and wide rows consume significantly more memory per batch. Try reducing to 500 or even 100 and measure the throughput:
.option("fetchSize", "500")
Oracle's default fetchSize is only 10, so you are already well above that. But for extremely wide tables, the sweet spot is often lower than for narrow tables because each row carries so much data. Monitor your executor memory (SparkUI -> Executors tab) to see if you are hitting GC overhead.
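One way to find the sweet spot empirically is a small timing sweep. The `read_at` callable below is a hypothetical stand-in for your actual `spark.read ... .option("fetchSize", str(fs)) ... .load().count()` call:

```python
import time

def sweep_fetch_sizes(read_at, fetch_sizes=(100, 500, 1000, 5000)):
    """Time one full read per candidate fetchSize and return {fetchSize: seconds}."""
    timings = {}
    for fs in fetch_sizes:
        start = time.perf_counter()
        read_at(fs)                      # runs the JDBC read at this fetch size
        timings[fs] = time.perf_counter() - start
    return timings

# Usage (hypothetical): pick the fastest setting, then use it for the real extraction.
# timings = sweep_fetch_sizes(
#     lambda fs: spark.read.format("jdbc")
#                     .option("fetchSize", str(fs))  # ...other options...
#                     .load().count())
# best = min(timings, key=timings.get)
```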
RECOMMENDATION 3: VERIFY PARALLEL PARTITIONING IS ACTUALLY WORKING
Setting "parallel 10" is only effective if you are using the partitionColumn, lowerBound, upperBound, and numPartitions options together. All four must be set for Spark to actually parallelize the JDBC read. Without partitionColumn configured, Spark reads everything on a single thread regardless of numPartitions.
Here is a proper parallel configuration:
df = (spark.read
.format("jdbc")
.option("url", jdbc_url)
.option("dbtable", "your_table")
.option("partitionColumn", "your_numeric_id_column")
.option("lowerBound", "1")
.option("upperBound", "700000")
.option("numPartitions", "10")
.option("fetchSize", "500")
.option("user", username)
.option("password", password)
.load())
Key requirements for partitionColumn:
- It must be a numeric, date, or timestamp column
- It should be indexed in Oracle
- It should have a reasonably uniform distribution of values
- lowerBound and upperBound do NOT filter rows -- they only define how Spark divides the range into partitions
Also verify in SparkUI (Stages tab) that you actually see 10 tasks running. If your cluster has fewer than 10 cores, some partitions will queue and wait.
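To see why lowerBound/upperBound only shape the partitions rather than filter rows, here is an approximate reconstruction of how Spark turns the bounds into per-partition WHERE clauses (a sketch of the logic in Spark's JDBC relation, not the exact source):

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Approximate Spark's split of [lower, upper) into JDBC partition predicates.
    The first partition is open below (and catches NULLs) and the last is open
    above, so rows outside the bounds are still read -- nothing is filtered out."""
    stride = (upper - lower) // num_partitions  # simplified integer stride
    preds = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            preds.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {current}")
        else:
            preds.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return preds
```

With your numbers, `partition_predicates("id", 1, 700000, 10)` yields ten range predicates; note the last one is just `id >= ...` with no upper cap, which is why a bad upperBound skews partition sizes but never drops rows.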
RECOMMENDATION 4: USE sessionInitStatement FOR ORACLE-SPECIFIC TUNING
You can pass Oracle session-level hints to optimize the read:
.option("sessionInitStatement",
"ALTER SESSION SET NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS'")
Or enable parallel query on the Oracle side:
.option("sessionInitStatement",
"ALTER SESSION ENABLE PARALLEL QUERY")
This tells Oracle's optimizer to use its own internal parallelism when serving your query, which can help especially if Oracle's execution plan is doing a full table scan.
RECOMMENDATION 5: CHECK YOUR CLUSTER SIZING
For a 600-column, 700K-row extraction:
- Make sure you have at least 10 worker cores to match your numPartitions of 10
- Each executor should have enough memory to hold the fetch buffer. With wide rows, consider 8GB+ per executor
- Use a Databricks Runtime with Photon enabled for faster deserialization on the Spark side
RECOMMENDATION 6: INVESTIGATE THE count(*) PERFORMANCE
A count(*) taking 1 minute for 700K rows on Oracle is slower than expected. This could indicate:
- The table lacks optimizer statistics (run DBMS_STATS.GATHER_TABLE_STATS on the Oracle side)
- The table has many LOB columns that slow down full table scans
- Network latency between your Databricks workspace and the Oracle database is high
You can test network latency by running a simple "SELECT 1 FROM DUAL" through JDBC and timing it. If that alone takes multiple seconds, you have a network issue that no amount of Spark tuning will fix.
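A minimal sketch of that latency test, assuming a DB-API-style connection object (for example from the `oracledb` package; the connection details are placeholders):

```python
import time

def median_roundtrip(conn, runs=5):
    """Time SELECT 1 FROM DUAL a few times and return the median, in seconds."""
    timings = []
    for _ in range(runs):
        cur = conn.cursor()
        start = time.perf_counter()
        cur.execute("SELECT 1 FROM DUAL")
        cur.fetchall()
        timings.append(time.perf_counter() - start)
        cur.close()
    return sorted(timings)[runs // 2]

# Usage (hypothetical connection details):
# import oracledb
# conn = oracledb.connect(user="...", password="...", dsn="your-host:1521/your_service")
# print(f"median round-trip: {median_roundtrip(conn):.3f}s")
```

If the median is in the hundreds of milliseconds or worse, the network path (not Spark) is your bottleneck.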
RECOMMENDATION 7: CONSIDER LAKEHOUSE FEDERATION AS AN ALTERNATIVE
If this is a recurring extraction, Databricks supports Oracle through Lakehouse Federation (query federation). This lets you create a foreign catalog in Unity Catalog pointing to your Oracle database, and queries are pushed down to Oracle automatically. It handles projection and predicate pushdown natively:
CREATE CONNECTION oracle_conn TYPE oracle
OPTIONS (host 'your-host', port '1521', user secret('scope','user'), password secret('scope','pass'));
CREATE FOREIGN CATALOG oracle_catalog USING CONNECTION oracle_conn
OPTIONS (database 'your_database');
-- Then query directly:
SELECT col1, col2 FROM oracle_catalog.schema.your_table;
This is especially useful if you want ad-hoc access without a full ETL pipeline. For bulk extraction, the JDBC approach with the tuning above will likely still be faster.
Requirements: Databricks Runtime 16.1+, Unity Catalog enabled, Pro or Serverless SQL warehouse (version 2024.50+).
QUICK CHECKLIST
1. Select only needed columns (or split into multiple reads)
2. Set fetchSize between 100-500 for wide tables
3. Verify partitionColumn + lowerBound + upperBound + numPartitions are all set
4. Confirm cluster has enough cores (>= numPartitions) and memory (8GB+/executor)
5. Gather Oracle table statistics with DBMS_STATS
6. Check network latency between Databricks and Oracle
7. Consider Lakehouse Federation for recurring access
DOCUMENTATION REFERENCES
- Databricks JDBC configuration guide: https://docs.databricks.com/en/connect/external-systems/jdbc.html
- Apache Spark JDBC Data Sources (all options): https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
- Lakehouse Federation with Oracle: https://docs.databricks.com/en/query-federation/oracle.html
- Lakehouse Federation overview: https://docs.databricks.com/en/query-federation/index.html
Hope this helps get your extraction running faster. If you try these and still see issues, sharing your full Spark read code and the SparkUI stage details would help the community diagnose further.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
a month ago - last edited a month ago
Hello @LSIMS ,
Based on my understanding you are seeing slow reads in general, including the count(*) on Oracle itself. Based on that, I'd suggest the following:
- If the fetch (read) is slow, the upstream data is likely not optimized. A good optimization in this case is to index your Oracle data and use the indexed column in your Spark JDBC read query.
- You mentioned 10 parallel. Parallel 10 means 10 worker cores can run at a time; if your cluster has fewer cores available, the 10 parallel reads will not all happen simultaneously. Please check SparkUI to confirm how many executors are spun up and how many cores are used.
2 weeks ago
Hi Steve,
I really appreciate your amazingly detailed post! These are the contributions that really shine through.
I can confirm that in the meantime I managed to resolve the situation, and it is totally aligned with what you mentioned: I reduced the number of columns and also made sure partitioning was actually being used.
One thing I can also confirm is that JDBC seems to be way faster than the native Oracle connection that creates a foreign catalog. It would be great to have some influence over the parameters of that connection, I guess.
Thank you!!