Hi everyone,
I'm working on a data ingestion process that reads data from an Impala Kudu table into Databricks through the JDBC connector. However, I'm seeing missing rows in the result: for example, if the source table holds 100,000 rows, only about 99,000 of them end up in the DataFrame. Below is the code I'm using for this read:
```python
df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("dbtable", query)
    # split the read into 4 parallel JDBC partitions on col1,
    # with the stride derived from lowerBound/upperBound
    .option("partitionColumn", "col1")
    .option("lowerBound", min_value)
    .option("upperBound", max_value)
    .option("numPartitions", "4")
    .option("fetchsize", "10000")
    .load()
)
```
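For reference, this is roughly how I'm checking the discrepancy; it's only a sketch, and the comparison against the source assumes a `SELECT COUNT(*)` run on the Impala side at about the same time:

```python
from pyspark.sql.functions import spark_partition_id, count as f_count

# Total rows that actually arrived via JDBC.
df_count = df.count()

# Per-partition counts can show whether one JDBC partition came up short,
# which would point at the partitioning bounds rather than the driver.
per_partition = (
    df.withColumn("pid", spark_partition_id())
      .groupBy("pid")
      .agg(f_count("*").alias("rows"))
      .orderBy("pid")
)
per_partition.show()

print(f"DataFrame rows: {df_count}")  # e.g. ~99,000 vs. 100,000 at the source
```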
- Has anyone encountered a similar issue with missing rows when reading data from Impala Kudu using JDBC?
- Are there additional troubleshooting steps or configuration options I should try?
- Would it help to use an alternative approach, such as exporting the data from Impala to Parquet/CSV and then loading the files into Databricks? (See the sketch after this list for what I have in mind.)
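For that last option, this is the kind of flow I'm considering: Impala writes the table out as Parquet to cloud storage the workspace can reach, and Databricks reads the files directly. The path below is a placeholder, not a real location:

```python
# Hypothetical alternative: read a static Parquet export instead of going
# through JDBC. "s3://my-bucket/exports/kudu_table/" is a placeholder path.
parquet_path = "s3://my-bucket/exports/kudu_table/"

df_from_files = spark.read.parquet(parquet_path)

# A static file export gives a deterministic row count, which would at least
# rule the JDBC read in or out as the source of the missing rows.
print(df_from_files.count())
```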