A count discrepancy between EMR (Spark 2.4.4) and Databricks (Spark 3.4.1) reading the same data can have several causes:
1. Different Spark versions: The two environments run different major versions of Spark, and major releases change reader defaults and behaviors (for example, how corrupt files, missing files, and legacy date/timestamp values in Parquet are handled), any of which can affect the result of a count.
2. Differences in reading data from S3: The two environments may discover partitions differently, or handle corrupt or invalid data differently. Comparing the exact set of files each environment actually reads usually narrows this down; see the diagnostic sketch after this list.
3. Use of Delta format: Databricks recommends Delta over plain Parquet for efficiency and ACID transaction guarantees. If you are using Parquet in Databricks, converting the data to Delta is worth trying to see whether it resolves the discrepancy.
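As a minimal diagnostic sketch (the S3 path `s3://my-bucket/events/` is a hypothetical stand-in for your data), run the same snippet in both environments and diff the output. Pinning the corrupt/missing-file settings makes both clusters apply the same policy, and listing the files each one actually reads tends to explain a count mismatch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("count-diagnostics").getOrCreate()

# Apply the same file-handling policy in both environments so neither
# one silently skips unreadable or deleted Parquet parts.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "false")

df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path

# Total row count, plus the exact files and per-file row counts this
# environment read; diff these lists between EMR and Databricks to see
# where the counts diverge.
print("row count:", df.count())
(df.withColumn("file", input_file_name())
   .groupBy("file").count()
   .orderBy("file")
   .show(100, truncate=False))
```

If the file lists differ, the problem is in file/partition discovery; if the lists match but per-file counts differ, it is in how the files themselves are read.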
Based on the given information, a few validations and suggestions:
- Check for corrupt or invalid data: If there are corrupt or invalid Parquet files in S3, the two environments may handle them differently depending on settings such as spark.sql.files.ignoreCorruptFiles; a quick check follows after this list.
- Try using Delta format: As recommended by Databricks, convert the data to Delta format and see whether the discrepancy goes away; a conversion sketch also follows.
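To test the corrupt-file hypothesis directly, a rough sketch (same hypothetical path as above): count once with corrupt files ignored and once strictly. If the strict read fails, or the two counts differ, corrupt files account for at least part of the gap:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "s3://my-bucket/events/"  # hypothetical path

# Lenient pass: silently skip any Parquet files Spark cannot read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
lenient_count = spark.read.parquet(path).count()

# Strict pass: fail on the first unreadable file.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")
try:
    strict_count = spark.read.parquet(path).count()
    print("no corrupt files detected; counts:", lenient_count, strict_count)
except Exception as err:
    print("corrupt files present; lenient count:", lenient_count, "error:", err)
```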
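And a minimal sketch of the Delta conversion, assuming the Delta Lake library is available (it is preinstalled on Databricks) and the same hypothetical paths; `s3://my-bucket/events_delta/` is an illustrative target location:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Option A: copy the Parquet data into a new Delta table at a separate path.
(spark.read.parquet("s3://my-bucket/events/")
      .write.format("delta")
      .mode("overwrite")
      .save("s3://my-bucket/events_delta/"))

# Option B: convert the existing Parquet directory to Delta in place.
# For partitioned data, pass the partition schema as a third argument,
# e.g. DeltaTable.convertToDelta(spark, ..., "dt DATE").
DeltaTable.convertToDelta(spark, "parquet.`s3://my-bucket/events/`")

# Re-run the count against the Delta copy and compare with the Parquet read.
print(spark.read.format("delta").load("s3://my-bucket/events_delta/").count())
```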
Additional resources: https://docs.databricks.com/en/migration/spark.html#