Missing Rows When Reading Data from Impala Kudu to Databricks Using JDBC
01-14-2025 06:52 PM
Hi everyone,
I'm working on a data ingestion process where I need to read data from an Impala Kudu table into Databricks using the JDBC connector. However, some rows are missing from the data that gets read: for example, when the source table holds 100,000 rows, only about 99,000 rows end up in the DataFrame. Below is the code I'm using for this process:
```python
df = (
    spark.read.format("jdbc")
    .option("url", url)                                   # Impala JDBC connection URL
    .option("driver", "com.cloudera.impala.jdbc.Driver")  # Cloudera Impala JDBC driver
    .option("dbtable", query)                              # source table / subquery
    .option("partitionColumn", "col1")                     # column used to split the read
    .option("lowerBound", min_value)                       # lower bound for col1
    .option("upperBound", max_value)                       # upper bound for col1
    .option("numPartitions", "4")                          # number of parallel queries
    .option("fetchsize", "10000")                          # rows fetched per round trip
    .load()
)
```
- Has anyone encountered a similar issue with missing rows when reading data from Impala Kudu using JDBC?
- Are there additional troubleshooting steps or configuration options I should try?
- Would it help to use an alternative approach, such as exporting data from Impala to Parquet/CSV and then loading it into Databricks?
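To make the last point concrete, the kind of export/import workaround I have in mind would look roughly like this; the table names and storage path below are placeholders, not my actual setup:

```python
# Rough sketch of the export/import alternative (names and paths are hypothetical).
# 1) On the Impala side, export the Kudu table to Parquet, e.g. via impala-shell:
#      CREATE TABLE staging_parquet STORED AS PARQUET AS SELECT * FROM kudu_source_table;
# 2) Copy the resulting Parquet files to storage Databricks can reach (e.g. S3/ADLS),
#    then read them directly:
df_parquet = spark.read.parquet("s3://my-bucket/staging_parquet/")
print(df_parquet.count())
```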
01-15-2025 08:19 AM
Have you checked whether the partitioning is configured correctly?
If disabling partitioning (creating a single partition) allows you to retrieve 100,000 rows, but enabling partitioning results in only 99,000 rows, it is likely that the partition settings are the cause.
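As a quick sanity check, you could compare the row count from a non-partitioned read against your partitioned read. This is only a sketch and reuses the url and query variables from your snippet:

```python
# Baseline read with no partitioning options, so Spark issues a single query to Impala.
df_single = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("dbtable", query)
    .option("fetchsize", "10000")
    .load()
)

print("single-partition count:", df_single.count())
print("partitioned count:", df.count())
```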
In particular, you should verify that the values you pass as lowerBound and upperBound (min_value and max_value in your snippet) are correct.
For example, if the specified range for "partitionColumn" is narrower than the actual minimum and maximum values, any rows outside that range will not be retrieved.
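If those bounds might be stale, one option is to look up the current minimum and maximum of the partition column just before the partitioned read. This is a sketch under the assumption that "source_table" stands in for your actual table name:

```python
# Fetch the live MIN/MAX of the partition column so lowerBound/upperBound
# reflect the data at read time ("source_table" is a placeholder).
bounds = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("dbtable", "(SELECT MIN(col1) AS lo, MAX(col1) AS hi FROM source_table) b")
    .load()
    .first()
)
min_value, max_value = bounds["lo"], bounds["hi"]
```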
Takuya Omi (尾美拓哉)

