<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Dataiku connector limitation in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/dataiku-connector-limitation/m-p/110667#M3025</link>
    <description>&lt;P&gt;Hello,&lt;BR /&gt;I'm trying to &lt;STRONG&gt;read&lt;/STRONG&gt; data from Unity Catalog and insert it into an Oracle database using an on-premises Dataiku instance.&lt;BR /&gt;&lt;BR /&gt;It works well for a small dataset (~600 KB, ~150,000 rows).&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[14:51:20] [INFO] [dku.datasets.sql] - Read 2000 records from DB
[14:51:20] [INFO] [dku.datasets.sql] - Read 4000 records from DB
....&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;Unfortunately, I get an error message with a larger one.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="MaximeGendre_0-1739993758870.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14970i456E961BAAA21761/image-size/large?v=v2&amp;amp;px=999" role="button" title="MaximeGendre_0-1739993758870.png" alt="MaximeGendre_0-1739993758870.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;It seems that it is using Databricks-managed storage as a staging area and that something is not working on the Dataiku side.&lt;/P&gt;&lt;P&gt;Have you ever encountered this problem?&lt;BR /&gt;Is it common behaviour for Databricks to use such a staging area? (I guess it would have been totally transparent without this "network" error.)&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Thank you for your help.&lt;/P&gt;</description>
    <pubDate>Wed, 19 Feb 2025 19:46:04 GMT</pubDate>
    <dc:creator>MaximeGendre</dc:creator>
    <dc:date>2025-02-19T19:46:04Z</dc:date>
    <item>
      <title>Dataiku connector limitation</title>
      <link>https://community.databricks.com/t5/administration-architecture/dataiku-connector-limitation/m-p/110667#M3025</link>
      <description>&lt;P&gt;Hello,&lt;BR /&gt;I'm trying to &lt;STRONG&gt;read&lt;/STRONG&gt; data from Unity Catalog and insert it into an Oracle database using an on-premises Dataiku instance.&lt;BR /&gt;&lt;BR /&gt;It works well for a small dataset (~600 KB, ~150,000 rows).&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[14:51:20] [INFO] [dku.datasets.sql] - Read 2000 records from DB
[14:51:20] [INFO] [dku.datasets.sql] - Read 4000 records from DB
....&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;Unfortunately, I get an error message with a larger one.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="MaximeGendre_0-1739993758870.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14970i456E961BAAA21761/image-size/large?v=v2&amp;amp;px=999" role="button" title="MaximeGendre_0-1739993758870.png" alt="MaximeGendre_0-1739993758870.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;It seems that it is using Databricks-managed storage as a staging area and that something is not working on the Dataiku side.&lt;/P&gt;&lt;P&gt;Have you ever encountered this problem?&lt;BR /&gt;Is it common behaviour for Databricks to use such a staging area? (I guess it would have been totally transparent without this "network" error.)&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Thank you for your help.&lt;/P&gt;</description>
      <pubDate>Wed, 19 Feb 2025 19:46:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/dataiku-connector-limitation/m-p/110667#M3025</guid>
      <dc:creator>MaximeGendre</dc:creator>
      <dc:date>2025-02-19T19:46:04Z</dc:date>
    </item>
    <item>
      <title>Re: Dataiku connector limitation</title>
      <link>https://community.databricks.com/t5/administration-architecture/dataiku-connector-limitation/m-p/137317#M4334</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106088"&gt;@MaximeGendre&lt;/a&gt;, thanks for the detailed context. A few things are likely at play here.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3 class="paragraph"&gt;Is a Databricks “staging area” a common behavior?&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;Yes. Many third‑party tools and ISV integrations use &lt;STRONG&gt;Unity Catalog (UC) Volumes&lt;/STRONG&gt; or cloud object storage as a temporary staging location to move larger batches of data efficiently (for example, bulk loaders stage files and then perform fast server‑side loads). This is an integration best practice Databricks recommends to partners: stage files into UC Volumes and then use operations like COPY INTO or bulk ingestion; the pattern exists because large, batched operations are more reliable and performant when done via files rather than row‑by‑row JDBC inserts. UC &lt;STRONG&gt;Volumes&lt;/STRONG&gt; are managed or external storage locations governed by Unity Catalog and are explicitly designed to support staging and file‑oriented workflows (and commonly used by ISVs). As a concrete example of this pattern in the ecosystem, Alteryx’s Databricks bulk loader supports staging to S3 or ADLS before loading — i.e., it writes files to a staging bucket/container and then loads them, which is similar to what you’re suspecting in your case. Databricks has an official integration with &lt;STRONG&gt;Dataiku&lt;/STRONG&gt; (Partner Connect supports SQL warehouses and clusters), and tools can choose either direct JDBC/ODBC or staging‑based flows depending on recipe and connector settings.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Why might small datasets work and larger ones fail?&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;When the connector switches to a bulk/staging path or simply moves more data, you can hit:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;&lt;STRONG&gt;Network/egress constraints&lt;/STRONG&gt; between on‑prem Dataiku and cloud object storage backing UC managed tables or UC Volumes (e.g., presigned URL downloads/uploads blocked, firewall/proxy, idle timeouts). This only surfaces on larger transfers because the job lasts longer and moves more bytes. It’s consistent with a “network” error showing up only at scale.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Connector timeouts and batch settings&lt;/STRONG&gt; (fetch size, batch size, commit frequency). With more rows, long‑running connections can hit idle or TLS keepalive limits, or the job can exceed default timeouts.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Oracle-side limits&lt;/STRONG&gt; during bulk insert (buffer size, max rows per batch, transaction duration). Some Oracle servers drop long sessions or large packets if not tuned for bulk loads.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;Note: Databricks itself doesn’t force staging for “reading from UC” — you can read via JDBC/ODBC directly — but many partner connectors choose file staging for performance or reliability on large writes/reads. Whether Dataiku’s specific recipe uses staging depends on its configuration.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Quick checks in Dataiku&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;To narrow it down without major changes:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Confirm the &lt;STRONG&gt;Dataiku connection type&lt;/STRONG&gt; to Databricks (SQL warehouse via JDBC/ODBC vs. Spark/cluster). Partner Connect supports SQL warehouses; some setups also connect clusters manually.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Review the &lt;STRONG&gt;recipe settings&lt;/STRONG&gt; used to write to Oracle:
&lt;UL&gt;
&lt;LI&gt;Try disabling any “bulk loader” behavior and use smaller JDBC batches (e.g., batch size 500–2000) with commit‑every‑batch.&lt;/LI&gt;
&lt;LI&gt;Increase connector &lt;STRONG&gt;timeouts&lt;/STRONG&gt; (read/write) and enable keepalive if available.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If Dataiku is staging content to &lt;STRONG&gt;UC Volumes or managed storage&lt;/STRONG&gt;, ensure your on‑prem node can reach the underlying cloud endpoints (S3/ADLS/GCS) used by UC managed tables/volumes. Some ISV patterns rely on presigned URLs; those must not be blocked by proxies or firewalls.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Check &lt;STRONG&gt;job logs&lt;/STRONG&gt; for clues: look for signs of “upload to volume/path,” “presigned URL,” or “temporary file” handling before the Oracle insert. That confirms whether a staging path is being used.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
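&lt;DIV class="paragraph"&gt;The log check can be partly automated. As a minimal sketch, assuming you have exported the job log to text, the Python below flags lines that contain the staging hints listed above. The function name and the sample log lines are hypothetical (only the first sample line mirrors the log format in the question); real Dataiku or driver messages may be worded differently, so adjust the hint list to what you actually see.&lt;/DIV&gt;

```python
# Hints taken from the checklist above; extend them to match your own logs.
STAGING_HINTS = ("upload to volume", "presigned url", "temporary file")


def find_staging_hints(log_lines):
    """Return (line_number, hint, line) for each log line mentioning a hint."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        lowered = line.lower()
        for hint in STAGING_HINTS:
            if hint in lowered:
                hits.append((number, hint, line.strip()))
    return hits


# Hypothetical log excerpt, for illustration only (not real Dataiku output).
sample_log = [
    "[14:51:20] [INFO] [dku.datasets.sql] - Read 2000 records from DB",
    "[14:51:21] [INFO] [example.component] - Fetching chunk via presigned URL",
    "[14:51:22] [ERROR] [example.component] - Reset while reading temporary file",
]

for number, hint, line in find_staging_hints(sample_log):
    print(f"line {number}: matched '{hint}': {line}")
```

&lt;DIV class="paragraph"&gt;Any hit that appears shortly before the failure is a good sign that a file-based path, rather than row-by-row JDBC, is in use.&lt;/DIV&gt;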
&lt;H3 class="paragraph"&gt;Workarounds and alternatives&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If the network path to cloud storage is the blocker, these approaches typically resolve it:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Use an &lt;STRONG&gt;external UC Volume&lt;/STRONG&gt; or cloud storage location deliberately configured for Dataiku access (e.g., ADLS/S3 with proper credentials), and have Dataiku read/write files there as its staging area. This keeps the staging pattern but removes the opaque “managed” path under UC. UC Volumes are explicitly intended to support this and are governed by UC.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Keep everything &lt;STRONG&gt;JDBC-only&lt;/STRONG&gt; end to end:
&lt;UL&gt;
&lt;LI&gt;Read from Databricks via JDBC/ODBC (paginate, increase fetch size modestly) and write to Oracle via JDBC with tuned batch size and commits. This avoids file staging entirely and often sidesteps firewall rules on object storage.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;As a Databricks-side alternative to Dataiku for the transfer, use &lt;STRONG&gt;Spark JDBC to Oracle&lt;/STRONG&gt; from Databricks with partitioning and batch sizing (this is simple and robust for large transfers): &lt;CODE&gt;python
df.write \
  .format("jdbc") \
  .option("url", "jdbc:oracle:thin:@//host:port/service_name") \
  .option("dbtable", "SCHEMA.TARGET_TABLE") \
  .option("user", "oracle_user") \
  .option("password", "oracle_pwd") \
  .option("batchsize", 10000) \
  .option("numPartitions", 8) \
  .mode("append") \
  .save()
&lt;/CODE&gt; Then let Dataiku consume downstream once data is in Oracle.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
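&lt;DIV class="paragraph"&gt;The JDBC-only option above comes down to moving rows in bounded batches and committing after each one, so that no single transaction or network call grows with the dataset. Here is a minimal Python sketch of that loop. It is not Dataiku's implementation: an in-memory SQLite database stands in for Oracle, a generator stands in for the Databricks cursor, and the helper names are made up for illustration.&lt;/DIV&gt;

```python
import sqlite3
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def copy_in_batches(source_rows, target_conn, insert_sql, batch_size=1000):
    """Insert rows batch by batch, committing after each batch."""
    total = 0
    cursor = target_conn.cursor()
    for batch in batched(source_rows, batch_size):
        cursor.executemany(insert_sql, batch)
        target_conn.commit()  # short transactions: a failure loses one batch at most
        total += len(batch)
    return total


# Stand-in target: in-memory SQLite instead of Oracle (assumption for the demo).
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE target_table (id INTEGER, label TEXT)")

# Stand-in source: a generator instead of a Databricks result cursor.
source = ((i, f"row-{i}") for i in range(2500))

copied = copy_in_batches(
    source, target, "INSERT INTO target_table VALUES (?, ?)", batch_size=1000
)
print(copied)
```

&lt;DIV class="paragraph"&gt;With a real Oracle target you would pass a DB-API connection from your Oracle driver and use that driver's own placeholder style in the insert statement. Because the source is consumed lazily, memory use stays bounded by the batch size.&lt;/DIV&gt;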
&lt;H3 class="paragraph"&gt;Summary&lt;/H3&gt;
&lt;P&gt;Yes, it’s common for ISV tools to use &lt;STRONG&gt;UC Volumes / cloud storage as a staging area&lt;/STRONG&gt;; Databricks explicitly recommends it for partner integrations because staging files substantially improves reliability and throughput in bulk operations.&lt;/P&gt;
&lt;P&gt;Your error pattern (small OK, large fails) suggests either a &lt;STRONG&gt;staging + network/timeout issue&lt;/STRONG&gt; between on-prem Dataiku and the storage behind UC managed tables/volumes, or an Oracle bulk insert timeout. Confirm which path the recipe uses, then adjust batch/timeout settings or make the staging location explicitly reachable.&lt;/P&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Louis.&lt;/DIV&gt;</description>
      <pubDate>Sun, 02 Nov 2025 22:48:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/dataiku-connector-limitation/m-p/137317#M4334</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-02T22:48:23Z</dc:date>
    </item>
  </channel>
</rss>

