Dataiku connector limitation

MaximeGendre — Wed, 19 Feb 2025 19:46:04 GMT

Hello,
I'm trying to read data from Unity Catalog and insert it into an Oracle Database using an "On Premise" Dataiku.

It works well for a small dataset (~600 KB, ~150,000 rows).

[14:51:20] [INFO] [dku.datasets.sql] - Read 2000 records from DB
[14:51:20] [INFO] [dku.datasets.sql] - Read 4000 records from DB
....

Unfortunately, I get an error message with a larger one.

It seems that it is using Databricks-managed storage as a staging area and that something is not working on the Dataiku side.

Have you ever encountered this problem?
Is it common behaviour for Databricks to use this staging area? (I guess it would have been totally transparent without this "network" error.)

Thank you for your help.

Re: Dataiku connector limitation

Louis_Frolio — Sun, 02 Nov 2025 22:48:23 GMT

Greetings @MaximeGendre, thanks for the detailed context. A few things are likely at play here.

Is a Databricks “staging area” a common behavior?

Yes. Many third-party tools and ISV integrations use Unity Catalog (UC) Volumes or cloud object storage as a temporary staging location to move larger batches of data efficiently (for example, bulk loaders stage files and then perform fast server-side loads). This is an integration best practice Databricks recommends to partners: stage files into UC Volumes and then use operations like COPY INTO or bulk ingestion. The pattern exists because large, batched operations are more reliable and performant when done via files rather than row-by-row JDBC inserts. UC Volumes are managed or external storage locations governed by Unity Catalog and are explicitly designed to support staging and file-oriented workflows (and are commonly used by ISVs).

As a concrete example of this pattern in the ecosystem, Alteryx's Databricks bulk loader supports staging to S3 or ADLS before loading: it writes files to a staging bucket/container and then loads them, which is similar to what you suspect in your case.

Databricks has an official integration with Dataiku (Partner Connect supports SQL warehouses and clusters), and tools can choose either direct JDBC/ODBC or staging-based flows depending on recipe and connector settings.

Why might small datasets work and larger ones fail?

When the connector switches to a bulk/staging path or simply moves more data, you can hit:

- Network/egress constraints between on-prem Dataiku and cloud object storage backing UC managed tables or UC Volumes (e.g., presigned URL downloads/uploads blocked, firewall/proxy, idle timeouts). This only surfaces on larger transfers because the job lasts longer and moves more bytes. It's consistent with a "network" error showing up only at scale.
- Connector timeouts and batch settings (fetch size, batch size, commit frequency). With more rows, long-running connections can hit idle or TLS keepalive limits, or the job can exceed default timeouts.
- Oracle-side limits during bulk insert (buffer size, max rows per batch, transaction duration). Some Oracle servers drop long sessions or large packets if not tuned for bulk loads.
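To test the network hypothesis in the first cause above, a minimal sketch like the following can be run from the on-prem Dataiku host. It only checks whether a plain TCP connection to port 443 succeeds. The function names and the example hostnames in the usage note are hypothetical, and a direct TCP check can be misleading if your traffic normally goes through an HTTP proxy.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_endpoints(hosts, port: int = 443, timeout: float = 5.0) -> dict:
    """Map each hostname to True (reachable) or False (blocked or unreachable)."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

For example, `check_endpoints(["mystorageaccount.dfs.core.windows.net", "my-bucket.s3.amazonaws.com"])` would report on two placeholder storage hosts. Replace them with the storage endpoints actually used by your workspace. A `False` result points toward a firewall or routing problem rather than a Dataiku or Oracle setting.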

Note: Databricks itself doesn't force staging for "reading from UC" (you can read via JDBC/ODBC directly), but many partner connectors choose file staging for performance or reliability on large writes/reads. Whether Dataiku's specific recipe uses staging depends on its configuration.

Quick checks in Dataiku

To narrow it down without major changes:

- Confirm the Dataiku connection type to Databricks (SQL warehouse via JDBC/ODBC vs. Spark/cluster). Partner Connect supports SQL warehouses; some setups also connect clusters manually.
- Review the recipe settings used to write to Oracle:
  - Try disabling any "bulk loader" behavior and use smaller JDBC batches (e.g., batch size 500–2000) with commit-every-batch.
  - Increase connector timeouts (read/write) and enable keepalive if available.
- If Dataiku is staging content to UC Volumes or managed storage, ensure your on-prem node can reach the underlying cloud endpoints (S3/ADLS/GCS) used by UC managed tables/volumes. Some ISV patterns rely on presigned URLs; those must not be blocked by proxies or firewalls.
- Check job logs for clues: look for signs of "upload to volume/path," "presigned URL," or "temporary file" handling before the Oracle insert. That confirms whether a staging path is being used.
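The log check above can be scripted if the job log is long. This is a rough sketch: the search phrases are the ones suggested above, and the exact wording in Dataiku's logs may differ, so treat the patterns (and the function name) as assumptions to adjust.

```python
import re

# Phrases hinting at a staging path. These are guesses at the wording,
# not strings confirmed to appear in Dataiku logs.
STAGING_HINTS = [r"upload to (?:volume|path)", r"presigned url", r"temporary file"]
PATTERN = re.compile("|".join(f"(?:{p})" for p in STAGING_HINTS), re.IGNORECASE)


def find_staging_hints(log_lines):
    """Return (line_number, text) pairs for lines that match any staging hint."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if PATTERN.search(line)
    ]
```

To use it, export the job log to a file and run `find_staging_hints(open("job.log", encoding="utf-8"))`, where `job.log` is a placeholder path. An empty result does not prove that no staging happens. It only means none of these phrases appear.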

Workarounds and alternatives

If the network path to cloud storage is the blocker, these approaches typically resolve it:

- Use an external UC Volume or cloud storage location deliberately configured for Dataiku access (e.g., ADLS/S3 with proper credentials), and have Dataiku read/write files there as its staging area. This keeps the staging pattern but removes the opaque "managed" path under UC. UC Volumes are explicitly intended to support this and are governed by UC.
- Keep everything JDBC-only end to end:
  - Read from Databricks via JDBC/ODBC (paginate, increase fetch size modestly) and write to Oracle via JDBC with tuned batch size and commits. This avoids file staging entirely and often sidesteps firewall rules on object storage.
- As a Databricks-side alternative to Dataiku for the transfer, use Spark JDBC to Oracle from Databricks with partitioning and batch sizing (this is simple and robust for large transfers):

```python
# Assumes `df` is an existing Spark DataFrame and that the Oracle JDBC
# driver is available on the cluster. Host, table, and credentials are placeholders.
(
    df.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//host:port/service_name")
    .option("dbtable", "SCHEMA.TARGET_TABLE")
    .option("user", "oracle_user")
    .option("password", "oracle_pwd")
    .option("batchsize", 10000)   # rows sent per JDBC batch
    .option("numPartitions", 8)   # caps the number of parallel JDBC connections
    .mode("append")
    .save()
)
```

  Then let Dataiku consume downstream once data is in Oracle.
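The JDBC-only option above depends on small batches with a commit after each one, so no single transaction or connection stays busy for long. Dataiku controls this through its own recipe and connection settings. The sketch below only illustrates the idea with Python's DB-API, using an in-memory `sqlite3` database as a stand-in for Oracle so that it runs anywhere. The helper names and the `target_table` schema are hypothetical.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def copy_in_batches(rows, conn, insert_sql, batch_size=1000):
    """Insert rows with executemany and commit after every batch."""
    cursor = conn.cursor()
    total = 0
    for chunk in batched(rows, batch_size):
        cursor.executemany(insert_sql, chunk)
        conn.commit()  # short transactions: a failure loses at most one batch
        total += len(chunk)
    return total


# Stand-in target: with Oracle you would pass a connection from an Oracle
# driver and use that driver's placeholder style instead of "?".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (id INTEGER, name TEXT)")
source_rows = ((i, f"row-{i}") for i in range(2500))
copied = copy_in_batches(
    source_rows, conn, "INSERT INTO target_table VALUES (?, ?)", batch_size=1000
)
```

Because the source is consumed lazily, memory use stays bounded by the batch size. If a batch fails, the rows committed so far remain in the target, so a retry needs to resume from the last committed batch or the target needs to tolerate duplicates.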

Summary

- Yes, it's common for ISV tools to use UC Volumes / cloud storage as a staging area; Databricks explicitly recommends it for partner integrations because staging files substantially improves reliability and throughput in bulk operations.
- Your error pattern (small OK, large fails) strongly suggests a staging + network/timeout issue between on-prem Dataiku and the storage behind UC managed tables/volumes or an Oracle bulk insert timeout. Confirm the recipe path and adjust batch/timeout settings or make the staging location explicitly reachable.

Hope this helps, Louis.