Same path producing different counts on Databricks and EMR

Vishwanath_Rao — Fri, 19 Jan 2024 18:59:35 GMT

We're in the middle of migrating to Databricks and found that the same path on s3 is producing different counts between EMR (Spark 2.4.4) and Databricks (Spark 3.4.1) it is a simple spark.read.parquet().count(), tried multiple solutions like making the spark configs across both the same, but still consistently giving the same counts.

Re: Same path producing different counts on Databricks and EMR

Walter_C — Sat, 20 Jan 2024 17:24:43 GMT

The discrepancy in counts between EMR (Spark 2.4.4) and Databricks (Spark 3.4.1) could be due to several reasons:
1. Different versions of Spark: The two environments are running different versions of Spark which might have different optimizations or behaviors that could affect the count.
2.Differences in reading data from S3: There might be differences in how the two environments read data from S3. For example, they might handle partition discovery differently, or there might be differences in how they handle corrupt or invalid data.
3. Use of Delta format: Databricks recommends using Delta format instead of Parquet for efficiency and ACID transaction guarantees. If you are using Parquet in Databricks, it might be worth trying to convert the data to Delta format and see if that resolves the discrepancy.
Based on the given information few validations or suggestions will be:
- Check for corrupt or invalid data: If there are corrupt or invalid Parquet files in S3, they might be handled differently by the two environments.
- Try using Delta format: As recommended by Databricks, try converting the data to Delta format and see if that resolves the discrepancy.

Additional resources: https://docs.databricks.com/en/migration/spark.html#

topic Same path producing different counts on Databricks and EMR in Data Engineering

Same path producing different counts on Databricks and EMR

Re: Same path producing different counts on Databricks and EMR