Greetings @MaximeGendre, thanks for the detailed context; a few things here are likely at play.
Is a Databricks "staging area" a common behavior?
Yes. Many third-party tools and ISV integrations use Unity Catalog (UC) Volumes or cloud object storage as a temporary staging location to move larger batches of data efficiently (for example, bulk loaders stage files and then perform fast server-side loads). This is an integration best practice Databricks recommends to partners: stage files into UC Volumes and then use operations like COPY INTO or bulk ingestion; the pattern exists because large, batched operations are more reliable and performant when done via files rather than row-by-row JDBC inserts. UC Volumes are managed or external storage locations governed by Unity Catalog and are explicitly designed to support staging and file-oriented workflows (and are commonly used by ISVs). As a concrete example of this pattern in the ecosystem, Alteryx's Databricks bulk loader supports staging to S3 or ADLS before loading: it writes files to a staging bucket/container and then loads them, which is similar to what you're suspecting in your case. Databricks has an official integration with Dataiku (Partner Connect supports SQL warehouses and clusters), and tools can choose either direct JDBC/ODBC or staging-based flows depending on recipe and connector settings.
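For reference, a minimal sketch of what that staging pattern looks like on the Databricks side, assuming a hypothetical UC Volume at /Volumes/main/staging/ingest and a hypothetical target table main.staging.events (the integration drops files into the Volume, then a server-side load moves them into the table):

```python
# Minimal sketch of the file-staging pattern (all names/paths below are hypothetical).
# Step 1: the tool uploads data files into a UC Volume, e.g. /Volumes/main/staging/ingest/.
# Step 2: a single server-side COPY INTO loads those files into a Delta table.
spark.sql("""
    COPY INTO main.staging.events
    FROM '/Volumes/main/staging/ingest/'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```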
Why might small datasets work and larger ones fail?
When the connector switches to a bulk/staging path or simply moves more data, you can hit:
- Network/egress constraints between on-prem Dataiku and the cloud object storage backing UC managed tables or UC Volumes (e.g., presigned URL downloads/uploads blocked, firewall/proxy, idle timeouts). This only surfaces on larger transfers because the job lasts longer and moves more bytes. It's consistent with a "network" error showing up only at scale.
- Connector timeouts and batch settings (fetch size, batch size, commit frequency). With more rows, long-running connections can hit idle or TLS keepalive limits, or the job can exceed default timeouts.
- Oracle-side limits during bulk insert (buffer size, max rows per batch, transaction duration). Some Oracle servers drop long sessions or large packets if not tuned for bulk loads.
Note: Databricks itself doesn't force staging for "reading from UC": you can read via JDBC/ODBC directly. However, many partner connectors choose file staging for performance or reliability on large writes/reads. Whether Dataiku's specific recipe uses staging depends on its configuration.
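If you want to sanity-check the direct-read path outside of Dataiku, here is a hedged sketch using the open-source Databricks SQL Connector for Python; the hostname, HTTP path, token, and table name are placeholders you would take from your SQL warehouse's connection details:

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholder connection values and table name; substitute your own.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token="dapi-REDACTED",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM main.some_schema.some_table")
        batch = cur.fetchmany(10_000)   # paginate instead of pulling everything at once
        while batch:
            # ... hand each batch to the downstream (e.g., Oracle) writer ...
            batch = cur.fetchmany(10_000)
```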
Quick checks in Dataiku
To narrow it down without major changes:
- Confirm the Dataiku connection type to Databricks (SQL warehouse via JDBC/ODBC vs. Spark/cluster). Partner Connect supports SQL warehouses; some setups also connect clusters manually.
- Review the recipe settings used to write to Oracle:
  - Try disabling any "bulk loader" behavior and use smaller JDBC batches (e.g., batch size 500-2000) with commit-every-batch.
  - Increase connector timeouts (read/write) and enable keepalive if available.
- If Dataiku is staging content to UC Volumes or managed storage, ensure your on-prem node can reach the underlying cloud endpoints (S3/ADLS/GCS) used by UC managed tables/volumes. Some ISV patterns rely on presigned URLs; those must not be blocked by proxies or firewalls (see the connectivity sketch after this list).
- Check job logs for clues: look for signs of "upload to volume/path," "presigned URL," or "temporary file" handling before the Oracle insert. That confirms whether a staging path is being used.
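To test the reachability point from the Dataiku node itself, a small sketch that only confirms TLS connectivity to the storage endpoints; the hostnames are placeholders you would replace with the endpoints seen in the presigned URLs or storage paths in the job logs:

```python
import socket
import ssl

# Placeholder hostnames: substitute the storage endpoints from the Dataiku job logs
# (e.g. an S3 bucket endpoint or an ADLS account host taken from a presigned/SAS URL).
ENDPOINTS = ["my-bucket.s3.amazonaws.com", "mystorageaccount.dfs.core.windows.net"]

for host in ENDPOINTS:
    try:
        with socket.create_connection((host, 443), timeout=5) as raw:
            with ssl.create_default_context().wrap_socket(raw, server_hostname=host):
                print(f"OK: {host} reachable on 443")
    except OSError as exc:
        print(f"BLOCKED: {host} -> {exc}")
```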
Workarounds and alternatives
If the network path to cloud storage is the blocker, these approaches typically resolve it:
- Use an external UC Volume or cloud storage location deliberately configured for Dataiku access (e.g., ADLS/S3 with proper credentials), and have Dataiku read/write files there as its staging area. This keeps the staging pattern but removes the opaque "managed" path under UC. UC Volumes are explicitly intended to support this and are governed by UC (see the volume-creation sketch after this list).
- Keep everything JDBC-only end to end:
  - Read from Databricks via JDBC/ODBC (paginate, increase fetch size modestly) and write to Oracle via JDBC with tuned batch size and commits. This avoids file staging entirely and often sidesteps firewall rules on object storage.
- As a Databricks-side alternative to Dataiku for the transfer, use Spark JDBC to Oracle from Databricks with partitioning and batch sizing (this is simple and robust for large transfers):

```python
# Assumes df already holds the data to transfer and the Oracle JDBC driver jar is on the cluster.
df.write \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//host:port/service_name") \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .option("dbtable", "SCHEMA.TARGET_TABLE") \
    .option("user", "oracle_user") \
    .option("password", "oracle_pwd") \
    .option("batchsize", 10000) \
    .option("numPartitions", 8) \
    .mode("append") \
    .save()
```
Then let Dataiku consume downstream once data is in Oracle.
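For the first workaround above, a hedged sketch of creating an explicitly governed external Volume that both Databricks and the on-prem Dataiku node can reach; the catalog, schema, volume name, and storage path are placeholders, and the path must already be covered by a UC external location with appropriate credentials:

```python
# Placeholder names and path: the abfss:// (or s3://) location must already be registered
# as a Unity Catalog external location before the Volume can be created on it.
spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS main.staging.dataiku_exchange
    LOCATION 'abfss://staging@mystorageaccount.dfs.core.windows.net/dataiku-exchange'
""")
```

Dataiku can then read and write files under that path with its own cloud credentials, while Databricks accesses the same files through the governed Volume.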
Summary
- Yes, it's common for ISV tools to use UC Volumes / cloud storage as a staging area; Databricks explicitly recommends it for partner integrations because staging files substantially improves reliability and throughput in bulk operations.
- Your error pattern (small OK, large fails) strongly suggests a staging + network/timeout issue between on-prem Dataiku and the storage behind UC managed tables/volumes, or an Oracle bulk insert timeout. Confirm the recipe path and adjust batch/timeout settings, or make the staging location explicitly reachable.
Hope this helps, Louis.