03-16-2025 10:27 PM
A .CRC file (Cyclic Redundancy Check) is an internal checksum file used by Spark (and Hadoop) to ensure data integrity when reading and writing files.
- Data Integrity Check – .CRC files store checksums of actual data files. When reading a file, Spark/Hadoop verifies the checksum to detect corruption.
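- Auto-Generated by Hadoop – When Spark writes through Hadoop’s FileSystem API (for example to local disk, DBFS in Databricks, or S3 with Hadoop connectors), the checksummed filesystem layer may generate .CRC files next to the data files, depending on the connector in use. They are hidden sidecar files with names like .part-00000.crc.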
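- Detects Partial/Corrupt Reads – If a .CRC file exists, the data file is checked against it on read, so accidental corruption or truncation surfaces as a checksum error instead of silently bad data. A CRC is not a cryptographic hash, so it does not guard against deliberate tampering.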
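To make the idea concrete, here is a minimal Python sketch of a checksum sidecar: one CRC32 per fixed-size chunk is written to a hidden `.name.crc` file and checked again on read. It is a simplified illustration only, not Hadoop's actual on-disk format; the helper names (`sidecar_path`, `write_with_crc`, `read_verified`) and the 512-byte chunk size are assumptions made for this example.

```python
import os
import struct
import zlib

CHUNK_SIZE = 512  # assumed bytes per checksum for this sketch


def sidecar_path(data_path):
    """Return the hidden sidecar name, e.g. out/part-0 -> out/.part-0.crc."""
    directory, name = os.path.split(data_path)
    return os.path.join(directory, "." + name + ".crc")


def write_with_crc(data_path, data):
    """Write data plus a sidecar holding one big-endian CRC32 per chunk."""
    with open(data_path, "wb") as f:
        f.write(data)
    with open(sidecar_path(data_path), "wb") as f:
        for start in range(0, len(data), CHUNK_SIZE):
            crc = zlib.crc32(data[start:start + CHUNK_SIZE])
            f.write(struct.pack(">I", crc))


def read_verified(data_path):
    """Read data; if a sidecar exists, verify every chunk against it."""
    with open(data_path, "rb") as f:
        data = f.read()
    crc_path = sidecar_path(data_path)
    if not os.path.exists(crc_path):
        return data  # no sidecar, so there is nothing to verify against
    with open(crc_path, "rb") as f:
        stored = f.read()
    expected_chunks = (len(data) + CHUNK_SIZE - 1) // CHUNK_SIZE
    if len(stored) != 4 * expected_chunks:
        raise IOError("checksum file does not match data length")
    for index in range(expected_chunks):
        chunk = data[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]
        (expected,) = struct.unpack_from(">I", stored, 4 * index)
        if zlib.crc32(chunk) != expected:
            raise IOError(f"checksum mismatch in chunk {index}")
    return data
```

Flipping a single byte in the data file after writing makes `read_verified` raise, which is the kind of accidental corruption a checksum sidecar is meant to catch.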
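If you're using Databricks with S3 or ADLS and don't want .CRC files, note that the settings below are sometimes suggested for this, but they only control how the S3A connector buffers uploads. They do not disable checksum verification or stop .CRC files from being written:
spark.conf.set("spark.hadoop.fs.s3a.fast.upload", "true")
spark.conf.set("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
Hadoop's FileSystem API does expose setWriteChecksum(false) and setVerifyChecksum(false) for checksummed filesystems, but whether calling them has any effect depends on the storage connector in use.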
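Before changing any configuration, it can help to check which checksum sidecars a job actually produced. The sketch below walks a locally accessible output directory and lists hidden files ending in `.crc`. The function name `find_crc_sidecars` and the example path are hypothetical, and it assumes the output can be read through the ordinary filesystem rather than only through an object-store API.

```python
import os


def find_crc_sidecars(root):
    """Return sorted paths of files under root named like '.<name>.crc'."""
    matches = []
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            # Sidecars are hidden (leading dot) and end in .crc
            if name.startswith(".") and name.endswith(".crc"):
                matches.append(os.path.join(directory, name))
    return sorted(matches)
```

For example, `find_crc_sidecars("/tmp/output")` (a hypothetical path) would return entries such as `/tmp/output/.part-00000.crc` if the writer produced them, and an empty list otherwise.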