Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Same path producing different counts on Databricks and EMR

Vishwanath_Rao
New Contributor II

We're in the middle of migrating to Databricks and found that the same S3 path produces different counts on EMR (Spark 2.4.4) and Databricks (Spark 3.4.1). It's a simple spark.read.parquet().count(). We've tried several things, such as making the Spark configs identical across both environments, but each environment still consistently returns its own count.
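
Not from the original post, but a minimal diagnostic sketch of the kind of read that can help narrow down the discrepancy: the path is a placeholder, and the per-file breakdown assumes plain Parquet files under one prefix. Comparing the per-file listing between EMR and Databricks shows whether the gap comes from files one side never reads, or from files read with different row counts.

```python
from pyspark.sql.functions import input_file_name

# Placeholder path -- substitute the actual S3 prefix being migrated.
path = "s3://my-bucket/my-table/"

df = spark.read.parquet(path)
print("total rows:", df.count())

# Row count per underlying Parquet file; run the same cell on EMR and on
# Databricks and diff the output to see where the counts diverge.
(df.withColumn("source_file", input_file_name())
   .groupBy("source_file")
   .count()
   .orderBy("source_file")
   .show(100, truncate=False))
```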

1 REPLY

Walter_C
Databricks Employee

The discrepancy in counts between EMR (Spark 2.4.4) and Databricks (Spark 3.4.1) could be due to several reasons:
1. Different Spark versions: the two environments run different versions of Spark, which may have different optimizations or behaviors that affect the count.
2. Differences in reading data from S3: the two environments may read data from S3 differently. For example, they might handle partition discovery differently, or differ in how they treat corrupt or invalid data.
3. Use of Delta format: Databricks recommends the Delta format instead of Parquet for efficiency and ACID transaction guarantees. If you are using Parquet on Databricks, it may be worth converting the data to Delta and seeing whether that resolves the discrepancy.
Based on the information given, a few validations and suggestions (see the sketch after this list):
- Check for corrupt or invalid data: corrupt or invalid Parquet files in S3 may be handled differently by the two environments.
- Try the Delta format: as recommended by Databricks, convert the data to Delta and see whether that resolves the discrepancy.
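
Not part of the original reply, but a rough sketch of how those two checks might look in a notebook. The path is a placeholder, the corrupt-file check relies on the standard spark.sql.files.ignoreCorruptFiles setting, and the conversion assumes an unpartitioned table (a partitioned one needs a PARTITIONED BY clause on CONVERT TO DELTA).

```python
path = "s3://my-bucket/my-table/"  # placeholder path

# Compare a strict read with one that skips unreadable files; if the two
# counts differ, corrupt Parquet files are likely part of the discrepancy.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")
strict_count = spark.read.parquet(path).count()

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
lenient_count = spark.read.parquet(path).count()
print("strict:", strict_count, "ignoring corrupt files:", lenient_count)

# Convert the Parquet directory to Delta in place, then re-count.
spark.sql(f"CONVERT TO DELTA parquet.`{path}`")
print("delta:", spark.read.format("delta").load(path).count())
```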

Additional resources: https://docs.databricks.com/en/migration/spark.html#
