Re: Why do we need CRC files in Delta logs. How do...

VasuBajaj · 03-16-2025

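A .CRC file (Cyclic Redundancy Check) is an internal checksum file used by Spark (and Hadoop) to check data integrity when reading and writing files. The version-numbered .crc files that sit next to the commit JSON files in a table's _delta_log directory are a related but separate mechanism: Delta writes them itself as small JSON summaries of the table state at that version (for example, file count and total size) and uses them to validate the log.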

Data Integrity Check – .CRC files store checksums of actual data files. When reading a file, Spark/Hadoop verifies the checksum to detect corruption.
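Auto-Generated by Hadoop – When a file is written through Hadoop's FileSystem API, a checksum-aware implementation (such as the local file system) may write a hidden .CRC sidecar file next to it. HDFS itself keeps block checksums in its own metadata, and whether .CRC files appear on DBFS, S3, or ADLS depends on the connector in use.
Detects Partial/Corrupt Reads – If a .CRC file exists, the data read is checked against it and the read fails on a mismatch. A CRC catches accidental corruption such as a truncated or bit-flipped file; it is not a cryptographic hash, so it does not protect against deliberate tampering.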
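
The check described in the points above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hadoop implements it: Hadoop checksums each file in fixed-size chunks and stores the results in a binary .CRC sidecar, while this sketch computes a single CRC-32 over the whole payload with zlib.crc32. The function names make_checksum and verify are made up for the example.

```python
import zlib


def make_checksum(data: bytes) -> int:
    """Return the CRC-32 of the payload (what a writer would store next to the data)."""
    return zlib.crc32(data)


def verify(data: bytes, expected: int) -> bool:
    """Recompute the CRC-32 on read and compare it with the stored value."""
    return zlib.crc32(data) == expected


original = b"id,value\n1,alpha\n2,beta\n"
stored_crc = make_checksum(original)

# A clean read matches the stored checksum.
clean_ok = verify(original, stored_crc)

# Flip one bit to simulate corruption on disk or in transit.
# CRC-32 always detects a single-bit error.
corrupted = bytearray(original)
corrupted[3] ^= 0x01
corrupt_ok = verify(bytes(corrupted), stored_crc)

# A truncated (partial) file changes the CRC as well,
# barring a roughly 1-in-2^32 collision.
truncated_ok = verify(original[:-5], stored_crc)

print(clean_ok, corrupt_ok, truncated_ok)
```

The reader never needs the original data to run the check, only the stored checksum, which is why a small sidecar file is enough.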
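If you're using Databricks with S3 or ADLS and don't want .CRC files, note that the settings below do not disable checksum verification. They only tell the S3A connector to buffer uploads on local disk:
spark.conf.set("spark.hadoop.fs.s3a.fast.upload", "true")
spark.conf.set("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
For the Hadoop sidecar files, checksum behavior is controlled on the Hadoop FileSystem object itself, through setWriteChecksum(false) and setVerifyChecksum(false); these only have an effect on checksum-aware file system implementations. Turning checksums off also removes the corruption check described above.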