A count discrepancy between EMR (Spark 2.4.4) and Databricks (Spark 3.4.1) reading the same data can have several causes:
1. Different Spark versions: The two environments run different major versions of Spark, and major releases change reader defaults and behaviors (for example, how corrupt files, missing files, and legacy date/timestamp values in Parquet are handled), any of which can affect the result of a count.
2. Differences in reading data from S3: The two environments may discover partitions differently, or handle corrupt or invalid data differently. Comparing the exact set of files each environment actually reads usually narrows this down; see the diagnostic sketch after this list.
3. Use of Delta format: Databricks recommends Delta over plain Parquet for efficiency and ACID transaction guarantees. If you are using Parquet in Databricks, converting the data to Delta is worth trying to see whether it resolves the discrepancy.
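As a minimal diagnostic sketch (the S3 path `s3://my-bucket/events/` is a hypothetical stand-in for your data), run the same snippet in both environments and diff the output. Pinning the corrupt/missing-file settings makes both clusters apply the same policy, and listing the files each one actually reads tends to explain a count mismatch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("count-diagnostics").getOrCreate()

# Apply the same file-handling policy in both environments so neither
# one silently skips unreadable or deleted Parquet parts.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "false")

df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path

# Total row count, plus the exact files and per-file row counts this
# environment read; diff these lists between EMR and Databricks to see
# where the counts diverge.
print("row count:", df.count())
(df.withColumn("file", input_file_name())
   .groupBy("file").count()
   .orderBy("file")
   .show(100, truncate=False))
```

If the file lists differ, the problem is in file/partition discovery; if the lists match but per-file counts differ, it is in how the files themselves are read.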
Based on the given information, a few validations and suggestions:
- Check for corrupt or invalid data: If there are corrupt or invalid Parquet files in S3, the two environments may handle them differently depending on settings such as spark.sql.files.ignoreCorruptFiles; a quick check follows after this list.
- Try using Delta format: As recommended by Databricks, convert the data to Delta format and see whether the discrepancy goes away; a conversion sketch also follows.
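To test the corrupt-file hypothesis directly, a rough sketch (same hypothetical path as above): count once with corrupt files ignored and once strictly. If the strict read fails, or the two counts differ, corrupt files account for at least part of the gap:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "s3://my-bucket/events/"  # hypothetical path

# Lenient pass: silently skip any Parquet files Spark cannot read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
lenient_count = spark.read.parquet(path).count()

# Strict pass: fail on the first unreadable file.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")
try:
    strict_count = spark.read.parquet(path).count()
    print("no corrupt files detected; counts:", lenient_count, strict_count)
except Exception as err:
    print("corrupt files present; lenient count:", lenient_count, "error:", err)
```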
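And a minimal sketch of the Delta conversion, assuming the Delta Lake library is available (it is preinstalled on Databricks) and the same hypothetical paths; `s3://my-bucket/events_delta/` is an illustrative target location:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Option A: copy the Parquet data into a new Delta table at a separate path.
(spark.read.parquet("s3://my-bucket/events/")
      .write.format("delta")
      .mode("overwrite")
      .save("s3://my-bucket/events_delta/"))

# Option B: convert the existing Parquet directory to Delta in place.
# For partitioned data, pass the partition schema as a third argument,
# e.g. DeltaTable.convertToDelta(spark, ..., "dt DATE").
DeltaTable.convertToDelta(spark, "parquet.`s3://my-bucket/events/`")

# Re-run the count against the Delta copy and compare with the Parquet read.
print(spark.read.format("delta").load("s3://my-bucket/events_delta/").count())
```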
Additional resources: https://docs.databricks.com/en/migration/spark.html#