Missing Rows When Reading Data from Impala Kudu to Databricks Using JDBC

Kuke — Wed, 15 Jan 2025 02:52:21 GMT

Hi everyone,

I’m working on a data ingestion process that reads data from an Impala Kudu table into Databricks using the JDBC connector. However, some rows are missing from the result. For instance, if the source table has 100,000 rows, only 99,000 rows end up in the DataFrame. Below is the code I’m using:

df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("dbtable", query)
    .option("partitionColumn", "col1")
    .option("lowerBound", min_value)
    .option("upperBound", max_value)
    .option("numPartitions", "4")
    .option("fetchsize", "10000")
    .load()
)

Has anyone encountered a similar issue with missing rows when reading data from Impala Kudu using JDBC?
Are there additional troubleshooting steps or configuration options I should try?
Would it help to use an alternative approach, such as exporting data from Impala to Parquet/CSV and then loading it into Databricks?

Re: Missing Rows When Reading Data from Impala Kudu to Databricks Using JDBC

Takuya-Omi — Wed, 15 Jan 2025 16:19:53 GMT

@Kuke

Have you checked whether the partitioning is configured correctly?
If disabling partitioning (creating a single partition) allows you to retrieve 100,000 rows, but enabling partitioning results in only 99,000 rows, it is likely that the partition settings are the cause.
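
One way to run that comparison is sketched below. It reuses the options from the original post; the helper name count_rows and the idea of passing the partitioning options as a dict are my own assumptions for illustration, and spark, url, query, min_value and max_value are expected to come from your notebook.

```python
def count_rows(spark, url, query, partitioning=None):
    """Count rows read over JDBC, with or without partitioning options.

    `partitioning` is a hypothetical convenience argument: a dict of
    Spark JDBC options such as partitionColumn, lowerBound, upperBound
    and numPartitions. Leave it as None for a single-partition read.
    """
    reader = (
        spark.read.format("jdbc")
        .option("url", url)
        .option("driver", "com.cloudera.impala.jdbc.Driver")
        .option("dbtable", query)
    )
    # Apply the partitioning options only when they were supplied.
    for key, value in (partitioning or {}).items():
        reader = reader.option(key, value)
    return reader.load().count()


# Example usage in a notebook (not executed here):
# single = count_rows(spark, url, query)
# split = count_rows(spark, url, query, {
#     "partitionColumn": "col1",
#     "lowerBound": min_value,
#     "upperBound": max_value,
#     "numPartitions": "4",
# })
# print(single, split)
```

If the two counts differ, the partitioned read is where the rows go missing; if both are short, the problem is probably elsewhere (for example in the query itself).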

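In particular, you should verify the values passed to "lowerBound" and "upperBound" (min_value and max_value in your code), as well as the type and contents of "partitionColumn".
Note that Spark uses these two bounds only to compute the partition stride, not to filter rows: rows below the lower bound fall into the first partition (together with NULLs), and rows above the upper bound fall into the last one. A range that is narrower than the actual minimum and maximum values therefore normally causes skewed partitions rather than missing rows, so it is worth checking which partition's predicate returns fewer rows than the same predicate run directly in Impala.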
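
To see how the bounds turn into per-partition predicates, here is a simplified sketch modeled on the behavior described in the Spark JDBC documentation. It is not Spark's actual code: the function name partition_predicates is made up, it assumes non-negative integer bounds and at least two partitions, and real Spark versions compute the stride with more care.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Approximate the WHERE clauses Spark generates for a partitioned JDBC read.

    Simplified illustration only: assumes non-negative integer bounds and
    num_partitions >= 2. The first and last partitions are open-ended, so
    the bounds set the stride but do not exclude any rows.
    """
    if num_partitions < 2:
        raise ValueError("this sketch only covers num_partitions >= 2")
    stride = upper // num_partitions - lower // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        # No lower limit on the first partition.
        low = f"{column} >= {current}" if i != 0 else None
        current += stride
        # No upper limit on the last partition.
        high = f"{column} < {current}" if i != num_partitions - 1 else None
        if high is None:
            predicates.append(low)
        elif low is None:
            # NULL values of the partition column go to the first partition.
            predicates.append(f"{high} OR {column} IS NULL")
        else:
            predicates.append(f"{low} AND {high}")
    return predicates


for predicate in partition_predicates("col1", 0, 100, 4):
    print(predicate)
```

Running each of these predicates as a COUNT(*) in Impala and comparing the results with the per-partition counts in Spark should show which range is losing rows.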
