Data Engineering

Same path producing different counts on Databricks and EMR

Vishwanath_Rao
New Contributor II

We're in the middle of migrating to Databricks and found that the same S3 path produces different counts on EMR (Spark 2.4.4) and Databricks (Spark 3.4.1). It is a simple spark.read.parquet().count(). We have tried multiple fixes, such as making the Spark configs identical across both environments, but the two still consistently give different counts.

2 REPLIES

Walter_C
Valued Contributor II

The discrepancy in counts between EMR (Spark 2.4.4) and Databricks (Spark 3.4.1) could have several causes:
1. Different versions of Spark: The two environments run different major versions of Spark, whose optimizations and default behaviors may differ in ways that affect the count.
2. Differences in reading data from S3: The two environments might read data from S3 differently. For example, they might handle partition discovery differently, or treat corrupt or invalid data differently.
3. Use of Delta format: Databricks recommends Delta format over Parquet for efficiency and ACID transaction guarantees. If you are reading plain Parquet in Databricks, it might be worth converting the data to Delta format to see whether that resolves the discrepancy.
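Since the configs were aligned by hand, one way to double-check the second point is to snapshot a few file-reading settings on each cluster and diff the results. This is a minimal sketch rather than a definitive checklist: the helper names `snapshot_settings` and `diff_settings` are made up for illustration, and the list of keys is only an assumption about which settings are most likely to matter for a Parquet count.

```python
# Settings that influence which files and rows a Parquet read returns.
# Assumption: this short list is illustrative, not exhaustive.
READ_SETTINGS = [
    "spark.sql.files.ignoreCorruptFiles",
    "spark.sql.files.ignoreMissingFiles",
    "spark.sql.parquet.mergeSchema",
    "spark.sql.sources.partitionColumnTypeInference.enabled",
]


def snapshot_settings(spark, keys=READ_SETTINGS):
    """Return {key: value} for the given settings on this cluster.

    `spark` is an active SparkSession. Keys that cannot be read
    on this Spark version are recorded as None.
    """
    snapshot = {}
    for key in keys:
        try:
            snapshot[key] = spark.conf.get(key)
        except Exception:
            snapshot[key] = None
    return snapshot


def diff_settings(emr, databricks):
    """Return {key: (emr_value, databricks_value)} for settings that differ."""
    keys = sorted(set(emr) | set(databricks))
    return {
        key: (emr.get(key), databricks.get(key))
        for key in keys
        if emr.get(key) != databricks.get(key)
    }
```

Run `snapshot_settings(spark)` in a notebook on each cluster, copy the two dictionaries to one place, and pass them to `diff_settings`. An empty result only rules out these particular settings, not every behavioral difference between Spark 2.4 and 3.4.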
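Based on the information given, here are a few validations to try:
- Check for corrupt or invalid data: If there are corrupt or invalid Parquet files in S3, the two environments might handle them differently, for example by skipping them on one side and reading or failing on them on the other.
- Try using Delta format: As recommended by Databricks, try converting the data to Delta format and see whether that resolves the discrepancy. A Delta table lists its data files in a transaction log, so every reader works from the same set of files.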
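To narrow down where the counts diverge, it can help to break the total into a row count per input file on each cluster and compare the two breakdowns. The sketch below assumes PySpark and uses the built-in `input_file_name()` function. The helper names (`per_file_counts`, `normalize`, `diff_file_counts`) are hypothetical, and the normalization step assumes both clusters address the same bucket and key and differ only in URI scheme (for example `s3://` versus `s3a://`).

```python
from urllib.parse import urlparse


def per_file_counts(spark, path):
    """Return {file_uri: row_count} for a Parquet path on this cluster."""
    # Imported here so the comparison helpers below work without PySpark.
    from pyspark.sql.functions import input_file_name

    df = spark.read.parquet(path)
    rows = df.groupBy(input_file_name().alias("file")).count().collect()
    return {row["file"]: row["count"] for row in rows}


def normalize(uri):
    """Drop the URI scheme so s3:// and s3a:// names compare equal."""
    parsed = urlparse(uri)
    return parsed.netloc + parsed.path


def diff_file_counts(emr_counts, databricks_counts):
    """Return {file: (emr_count, databricks_count)} where the two differ.

    A count of None means that environment returned no rows from the file.
    """
    emr = {normalize(k): v for k, v in emr_counts.items()}
    dbx = {normalize(k): v for k, v in databricks_counts.items()}
    return {
        name: (emr.get(name), dbx.get(name))
        for name in sorted(set(emr) | set(dbx))
        if emr.get(name) != dbx.get(name)
    }
```

Run `per_file_counts(spark, path)` on each cluster, save the two dictionaries, and pass them to `diff_file_counts`. Files that show up on only one side point toward listing or partition-discovery differences, while files present on both sides with different counts point toward how the rows inside them are read.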

Additional resources: https://docs.databricks.com/en/migration/spark.html#

Kaniz
Community Manager

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
 
