<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Missing Rows When Reading Data from Impala Kudu to Databricks Using JDBC in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/missing-rows-when-reading-data-from-impala-kudu-to-databricks/m-p/105655#M42227</link>
    <pubDate>Wed, 15 Jan 2025 02:52:21 GMT</pubDate>
    <dc:creator>Kuke</dc:creator>
    <dc:date>2025-01-15T02:52:21Z</dc:date>
    <item>
      <title>Missing Rows When Reading Data from Impala Kudu to Databricks Using JDBC</title>
      <link>https://community.databricks.com/t5/data-engineering/missing-rows-when-reading-data-from-impala-kudu-to-databricks/m-p/105655#M42227</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m working on a data ingestion process that reads data from an Impala Kudu table into Databricks using the JDBC connector. However, some rows are missing from the data read. For instance, if the source table has 100,000 rows, only 99,000 rows get loaded into the DataFrame (df). Below is the code I’m using for this process:&lt;/P&gt;&lt;PRE&gt;df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("dbtable", query)
    .option("partitionColumn", "col1")
    .option("lowerBound", min_value)
    .option("upperBound", max_value)
    .option("numPartitions", "4")
    .option("fetchsize", "10000")
    .load()
)&lt;/PRE&gt;&lt;UL&gt;&lt;LI&gt;Has anyone encountered a similar issue with missing rows when reading data from Impala Kudu using JDBC?&lt;/LI&gt;&lt;LI&gt;Are there additional troubleshooting steps or configuration options I should try?&lt;/LI&gt;&lt;LI&gt;Would it help to use an alternative approach, such as exporting data from Impala to Parquet/CSV and then loading it into Databricks?&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 15 Jan 2025 02:52:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/missing-rows-when-reading-data-from-impala-kudu-to-databricks/m-p/105655#M42227</guid>
      <dc:creator>Kuke</dc:creator>
      <dc:date>2025-01-15T02:52:21Z</dc:date>
    </item>
    <item>
      <title>Re: Missing Rows When Reading Data from Impala Kudu to Databricks Using JDBC</title>
      <link>https://community.databricks.com/t5/data-engineering/missing-rows-when-reading-data-from-impala-kudu-to-databricks/m-p/105760#M42258</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/122436"&gt;@Kuke&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Have you checked whether the partitioning is configured correctly?&lt;BR /&gt;If a read with partitioning disabled (a single partition) returns all 100,000 rows, but the partitioned read returns only 99,000 rows, the partition settings are the likely cause.&lt;/P&gt;&lt;P&gt;In particular, you should verify the values passed to the "lowerBound" and "upperBound" options ("min_value" and "max_value" in your code).&lt;BR /&gt;Note that Spark uses these bounds only to compute the partition stride, not to filter rows: the first partition also reads values below "lowerBound" (and NULLs), and the last partition also reads values above "upperBound". A range narrower than the actual minimum and maximum of "partitionColumn" therefore skews the partition sizes rather than dropping rows by itself.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jan 2025 16:19:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/missing-rows-when-reading-data-from-impala-kudu-to-databricks/m-p/105760#M42258</guid>
      <dc:creator>Takuya-Omi</dc:creator>
      <dc:date>2025-01-15T16:19:53Z</dc:date>
    </item>
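As a reference point when checking the bounds discussed in this thread: Spark's JDBC documentation describes lowerBound and upperBound as deciding the partition stride only, not as a row filter. The following is a minimal pure-Python sketch of that behavior, not Spark's actual implementation. The function names partition_cuts and partition_of are hypothetical, the column is assumed to be an integer, and the stride arithmetic is simplified.

```python
from bisect import bisect_right


def partition_cuts(lower_bound, upper_bound, num_partitions):
    """Interior cut points of a simplified stride-based split (integer column)."""
    stride = (upper_bound - lower_bound) // num_partitions
    return [lower_bound + stride * i for i in range(1, num_partitions)]


def partition_of(value, cuts):
    """Index of the partition that reads `value`.

    The first and last partitions are open-ended, so values outside the
    configured bounds still land somewhere. NULL (None) is assumed to be
    read by the first partition.
    """
    if value is None:
        return 0
    return bisect_right(cuts, value)


cuts = partition_cuts(0, 100, 4)  # [25, 50, 75]
for value in (-500, 0, 25, 99, 100, 10_000, None):
    print(value, "-> partition", partition_of(value, cuts))
```

In this model every value, including -500, 10_000, and NULL, is assigned to some partition, so bounds that are too narrow show up as oversized first and last partitions. Comparing per-partition counts against the source is one way to see where rows actually go missing.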
  </channel>
</rss>

