
Missing Rows When Reading Data from Impala Kudu to Databricks Using JDBC

Kuke
New Contributor

Hi everyone,

I'm working on a data ingestion process where I need to read data from an Impala Kudu table into Databricks using the JDBC connector. However, some rows are missing from the data that gets read. For instance, if there are 100,000 rows in the source table, only 99,000 rows get loaded into the df. Below is the code I'm using for this process:

df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("dbtable", query)
    .option("partitionColumn", "col1")
    .option("lowerBound", min_value)
    .option("upperBound", max_value)
    .option("numPartitions", "4")
    .option("fetchsize", "10000")
    .load()
)

  • Has anyone encountered a similar issue with missing rows when reading data from Impala Kudu using JDBC?
  • Are there additional troubleshooting steps or configuration options I should try?
  • Would it help to use an alternative approach, such as exporting data from Impala to Parquet/CSV and then loading it into Databricks?
1 REPLY

TakuyaOmi
Valued Contributor II

@Kuke 

Have you checked whether the partitioning is configured correctly?
If disabling partitioning (creating a single partition) allows you to retrieve 100,000 rows, but enabling partitioning results in only 99,000 rows, it is likely that the partition settings are the cause.
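The comparison described above could be sketched roughly as follows. The helper name `compare_jdbc_counts` and the split into `base_options` and `partition_options` are assumptions made for illustration; `spark` is expected to be an active SparkSession, and the option names are the standard Spark JDBC options already used in the question.

```python
def compare_jdbc_counts(spark, base_options, partition_options):
    """Count rows from a single-partition read and a partitioned read of the same source.

    Hypothetical helper: `base_options` holds connection options (url, driver,
    dbtable, ...) and `partition_options` holds partitionColumn, lowerBound,
    upperBound and numPartitions.
    """
    # Single-partition read: no partitioning options, so Spark issues one query.
    single = spark.read.format("jdbc").options(**base_options).load().count()

    # Partitioned read: same connection options plus the partitioning options.
    partitioned = (
        spark.read.format("jdbc")
        .options(**base_options)
        .options(**partition_options)
        .load()
        .count()
    )

    return {
        "single": single,
        "partitioned": partitioned,
        "missing": single - partitioned,
    }


# Example usage (url, query, min_value and max_value as in the original post):
# base = {"url": url, "driver": "com.cloudera.impala.jdbc.Driver",
#         "dbtable": query, "fetchsize": "10000"}
# part = {"partitionColumn": "col1", "lowerBound": str(min_value),
#         "upperBound": str(max_value), "numPartitions": "4"}
# print(compare_jdbc_counts(spark, base, part))
```

If "missing" is zero for the single-partition read against the source count but non-zero once the partitioning options are added, the partition settings are the place to look.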

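In particular, verify the values passed to "lowerBound" and "upperBound" ("min_value" and "max_value" in your code).
Note that Spark uses these bounds only to compute the partition stride, not to filter rows: the first partition also takes everything below the first boundary (plus NULLs in "partitionColumn"), and the last partition takes everything above the last boundary. A range narrower than the actual minimum and maximum therefore should not drop rows by itself, but it does skew the partitions, so comparing per-partition row counts against the source is a useful next check.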
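As a rough illustration of how "lowerBound", "upperBound" and "numPartitions" become per-partition WHERE clauses, here is a simplified pure-Python approximation of Spark's JDBC partitioning logic. The function name `partition_predicates` is hypothetical, the sketch assumes non-negative integer bounds, and it skips details of the real implementation (such as reducing the partition count when the range is too small), so treat it as a reading aid rather than Spark's exact behaviour.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clause Spark generates for each JDBC partition.

    Simplified sketch: assumes non-negative integer bounds. Returns [None]
    when no partitioning applies (a single unfiltered query).
    """
    if num_partitions <= 1 or lower_bound == upper_bound:
        return [None]

    # The bounds only determine the stride between partition boundaries.
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit; the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None

        if lower is None:
            # First partition also picks up NULLs in the partition column.
            where = f"{upper} OR {column} IS NULL"
        elif upper is None:
            where = lower
        else:
            where = f"{lower} AND {upper}"
        predicates.append(where)
    return predicates


for clause in partition_predicates("col1", 0, 100, 4):
    print(clause)
```

Because the first and last clauses are open-ended, values below "lowerBound" or above "upperBound" still fall into a partition; they just make those two partitions larger than the others.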
